Python wordcloud for WordPress

By 0x7df, Sun 10 May 2015, modified Fri 18 September 2020, in category Programming

api, html, python, twitter, wordcloud, wordpress, xml

Andreas Mueller has published a Python routine on GitHub for creating word clouds.

A blog post and its accompanying GitHub repo (both by Sebastian Raschka) make it easy to use the Twitter API to download your Twitter timeline as a CSV file, and then use the word cloud script to produce a word cloud from it.


To add something to this, I did the same thing with my WordPress blog posts. I didn't want to bother fighting with the WordPress API, so I simply exported the blog contents to an XML file, which WordPress allows you to do through the admin interface (so you can archive your blog locally and/or transfer it into a different blog). Hence, this really just ends up being about XML parsing. Here is the source code:


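from html.parser import HTMLParser  # Python 3 location of HTMLParser
import xml.etree.ElementTree as ET
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

class MLStripper(HTMLParser):
    """Collect the text content of an HTML fragment, discarding the tags."""
    def __init__(self):
        super().__init__()
        self.fed = []
    def handle_data(self, d):
        # Called by the parser for each run of text between tags
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)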

tree = ET.parse('0x7df.wordpress.2015-04-25.xml')

root = tree.getroot()

postwords = []

for child in root.iter():
    if child.tag == 'item':
        if child.find('{}status').text == 'publish':
            postbody = child.find('{}encoded').text
            s = MLStripper()
            s.feed(postbody)  # run the post body through the stripper
            postwords += s.get_data().split()

keywords = ' '.join([wd for wd in postwords
    if 'http' not in wd and
    'bg=' not in wd and
    not wd.startswith('$') and
    not wd.startswith('[') and
    not wd.startswith('&')])

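# Assumed settings: library defaults plus the built-in STOPWORDS list.
# A custom font can be selected with the font_path argument.
wordcloud = WordCloud(stopwords=STOPWORDS).generate(keywords)

plt.imshow(wordcloud)
plt.axis('off')
plt.savefig('./my_wordpress_wordcloud_2.png', dpi=300)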

I used the lightweight xml.etree.ElementTree parser from the standard library. I get the root of the XML document and iterate over it with iter(); this is recursive, so it descends the tree to all nodes. Whenever I encounter a node with the tag item (which contains the post information), I search its immediate children with the find() method for the namespaced status tag, which contains the status of the post, i.e. whether it's published, a draft, etc. If it's published (the text of the element == publish), I search again using find() for the namespaced encoded tag, which contains the blog post text. I put this in the postbody variable.
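To make the lookup concrete, here is a minimal sketch of the same idea on a tiny inline document rather than a real export. The sample XML, the namespace URIs (example.com placeholders, not the ones WordPress writes), and the published_bodies helper are all assumptions made for illustration. It passes a prefix map to find() instead of spelling out the braces by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a WordPress export file. The namespace URIs are
# placeholders chosen for this sketch, not the real WordPress ones.
SAMPLE = """<rss xmlns:wp="http://example.com/wp" xmlns:content="http://example.com/content">
  <channel>
    <item>
      <wp:status>publish</wp:status>
      <content:encoded>First post body</content:encoded>
    </item>
    <item>
      <wp:status>draft</wp:status>
      <content:encoded>Unfinished draft</content:encoded>
    </item>
  </channel>
</rss>"""

# Prefix-to-URI map, so find() can be called with 'wp:status' style names.
NS = {'wp': 'http://example.com/wp', 'content': 'http://example.com/content'}

def published_bodies(xml_text):
    """Return the body text of every item whose status is 'publish'."""
    root = ET.fromstring(xml_text)
    bodies = []
    # iter('item') walks the whole tree but yields only <item> elements.
    for item in root.iter('item'):
        status = item.find('wp:status', NS)
        body = item.find('content:encoded', NS)
        if status is not None and status.text == 'publish' and body is not None:
            bodies.append(body.text or '')
    return bodies

print(published_bodies(SAMPLE))
```

Checking for None before reading .text means an item without a status or body is skipped rather than raising an AttributeError.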

The next few lines use the class defined earlier in the script, MLStripper, to strip the HTML tags from the blog post. (This came from Stack Overflow.) The rest of the script is essentially the same as Sebastian Raschka's code for Twitter, tweaked a little where necessary.
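The stripping and filtering steps can be tried on a single string without any export file. The following is a rough sketch for Python 3; the names TagStripper, strip_tags and keep_word, and the sample HTML, are made up for this illustration, and the filter rules simply mirror the ones in the script above.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Accumulate only the text between tags (illustrative name)."""
    def __init__(self):
        super().__init__()
        self.fed = []

    def handle_data(self, data):
        self.fed.append(data)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    """Return the text content of an HTML fragment."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.get_data()

def keep_word(wd):
    """Drop URLs, 'bg=' fragments, and tokens starting with $, [ or &."""
    return ('http' not in wd and 'bg=' not in wd
            and not wd.startswith(('$', '[', '&')))

# Hypothetical post body, used only to exercise the two helpers.
sample = ('<p>See <a href="http://example.com">http://example.com</a> '
          'for [caption] <b>word</b> clouds</p>')
words = [wd for wd in strip_tags(sample).split() if keep_word(wd)]
print(words)
```

Passing a tuple to startswith() checks all three prefixes in one call, which is equivalent to the three separate tests in the script.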

The result is:


The font is called Saucer BB, from here.

