Python wordcloud for WordPress

By 0x7df, Sun 10 May 2015, modified Fri 18 September 2020, in category Programming

api, html, python, twitter, wordcloud, wordpress, xml

Andreas Mueller has written a Python package for creating word clouds, which is available on GitHub: https://github.com/amueller/word_cloud.

A blog post here, and the GitHub repo that goes with it (both by Sebastian Raschka), make it easy to use the Twitter API to download your Twitter timeline (as a CSV file), and then use the word cloud script to produce a word cloud from it.

my_twitter_wordcloud_1

To add something to this, I did the same thing with my WordPress blog posts. I didn't want to bother fighting with the WordPress API, so I simply exported the blog contents to an XML file, which WordPress allows you to do through the admin interface (so you can archive your blog locally and/or transfer it into a different blog). Hence, this really just ends up being about XML parsing. Here is the source code:

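#!/usr/bin/env python3

from html.parser import HTMLParser  # this module was named HTMLParser in Python 2
import xml.etree.ElementTree as ET
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# http://stackoverflow.com/questions/753052/strip-html-from-strings-in-python
class MLStripper(HTMLParser):
    """Collect the text content of an HTML fragment, discarding the tags."""
    def __init__(self):
        # The base initialiser sets up parser state that feed() relies on
        super().__init__()
        self.fed = []
    def handle_data(self, d):
        # Called by the parser for each run of text between tags
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)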

tree = ET.parse('0x7df.wordpress.2015-04-25.xml')

root = tree.getroot()

postwords = []

for child in root.iter():
    if child.tag == 'item':
        if child.find('{http://wordpress.org/export/1.2/}status').text == 'publish':
            # .text is None when a post has an empty body, so fall back to ''
            postbody = child.find('{http://purl.org/rss/1.0/modules/content/}encoded').text or ''
            s = MLStripper()
            s.feed(postbody)
            postwords += s.get_data().split()

keywords = ' '.join([wd for wd in postwords
    if 'http' not in wd and
    'bg=' not in wd and
    not wd.startswith('$') and
    not wd.startswith('[') and
    not wd.startswith('&')
    ])

wordcloud = WordCloud(
    font_path='./SaucerBB.ttf',
    stopwords=STOPWORDS,
    background_color='black',
    width=1800,
    height=1800
).generate(keywords)

plt.imshow(wordcloud)
plt.axis('off')
plt.savefig('./my_wordpress_wordcloud_2.png', dpi=300)
plt.show()

I used the standard library's lightweight xml.etree.ElementTree parser. I get the root of the XML document and call iter() on it, which walks the whole tree recursively and visits every node. Whenever I encounter a node with the tag item (which contains the post information), I search amongst its immediate children, using the find() method, for one with the tag:

{http://wordpress.org/export/1.2/}status

which contains the status of the post, i.e. whether it's published, a draft, etc. If the post is published (i.e. the element's text is publish), I search again using find() for the tag:

{http://purl.org/rss/1.0/modules/content/}encoded

which contains the blog post text. I put this in the postbody variable.
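The namespaced lookups can be tried out on a small inline document. This is only a sketch, not part of the script above: the XML string is a made-up, heavily trimmed stand-in for a WordPress export, and the NS prefix map and the published_bodies helper are names assumed here for illustration. It also shows that find() accepts a prefix-to-URI mapping, which avoids spelling out the {uri}tag form each time.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for a WordPress export file.
SAMPLE_EXPORT = """<rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item>
      <title>First post</title>
      <content:encoded><![CDATA[<p>Hello <b>world</b></p>]]></content:encoded>
      <wp:status>publish</wp:status>
    </item>
    <item>
      <title>Unfinished</title>
      <content:encoded><![CDATA[<p>Not ready</p>]]></content:encoded>
      <wp:status>draft</wp:status>
    </item>
  </channel>
</rss>"""

# Prefix names are chosen here; the URIs are the ones used in the script.
NS = {
    'wp': 'http://wordpress.org/export/1.2/',
    'content': 'http://purl.org/rss/1.0/modules/content/',
}

def published_bodies(root):
    """Return the raw HTML body of every item whose status is 'publish'."""
    bodies = []
    for item in root.iter('item'):
        status = item.find('wp:status', NS)
        if status is None or status.text != 'publish':
            continue
        body = item.find('content:encoded', NS)
        # A missing element or an empty body both become an empty string
        bodies.append((body.text or '') if body is not None else '')
    return bodies

root = ET.fromstring(SAMPLE_EXPORT)
bodies = published_bodies(root)
print(bodies)
```

The prefix form 'wp:status' with the NS mapping and the expanded form '{http://wordpress.org/export/1.2/}status' refer to the same element, so either can be passed to find().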

The next few lines use the MLStripper class, defined earlier in the script, to strip the HTML tags out of the blog post. (This came from StackOverflow.) The rest of the script is essentially the same as Sebastian Raschka's code for Twitter, tweaked a little where necessary.
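To see the stripping and word-filtering steps in isolation, here is a small sketch written for Python 3, where the parser lives in html.parser. The sample HTML string and the TagStripper, strip_tags and keep_word names are made up for illustration; the class follows the same StackOverflow pattern as MLStripper, and keep_word mirrors the conditions in the script's keywords filter.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Same idea as MLStripper: keep text, drop tags (name assumed here)."""
    def __init__(self):
        super().__init__()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    """Return the text content of an HTML fragment."""
    s = TagStripper()
    s.feed(html)
    s.close()  # flush any text the parser is still buffering
    return s.get_data()

def keep_word(wd):
    """Mirror of the script's filter: drop URLs, shortcodes and markup debris."""
    return ('http' not in wd and 'bg=' not in wd
            and not wd.startswith(('$', '[', '&')))

# Made-up post fragment containing a link, a shortcode and a LaTeX marker
sample = ('<p>Word clouds in <b>Python</b>, see '
          '<a href="x">http://example.com</a> [caption] $latex</p>')

words = [wd for wd in strip_tags(sample).split() if keep_word(wd)]
print(words)  # ['Word', 'clouds', 'in', 'Python,', 'see']
```

Punctuation attached to a word (as in 'Python,') survives this filter, because the words are only split on whitespace.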

The result is:

my_wordpress_wordcloud_2

The font is called Saucer BB, from here.
