How to create a word cloud in Python

Word clouds are a nice way to visually summarize what a text is about. This page shows how to create them programmatically in Python.

Prerequisites

  1. Python
  2. Python wordcloud module (the wordcloud package, which can be installed with pip install wordcloud)

Steps

For creating a rectangular word cloud, see the official documentation.

To create a masked word cloud programmatically, a Python script is required. Here is an example script:

#!/usr/bin/env python
"""
Masked wordcloud
================

Using a mask you can generate wordclouds in arbitrary shapes.
"""

import os
from os import path

import numpy as np
from PIL import Image
 
from wordcloud import WordCloud, STOPWORDS
 
# get data directory (using getcwd() is needed to support running example in generated IPython notebook)
d = path.dirname(__file__) if "__file__" in locals() else os.getcwd()
 
stopwords = set(STOPWORDS)
 
# Read the whole text (UTF-8 encoding is assumed here).
with open(path.join(d, 'input_text.txt'), encoding='utf-8') as f:
    text = f.read()
 
# Load the mask image; pure white areas of the mask are left empty.
clean_circle_mask = np.array(Image.open(path.join(d, "map.png")))
 
# Rectangular word cloud; contours are only drawn around a mask, so no contour settings are needed here.
wc_rect = WordCloud(background_color="white", max_words=500, width=3000,
                    height=1500, stopwords=stopwords, min_font_size=2)
wc_rect.generate(text)
wc_rect.to_file(path.join(d, "wc-rectangle.png"))
 
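# Masked word cloud; when a mask is given, width and height are ignored
# and the shape of the mask image is used instead.
wc = WordCloud(background_color="white", max_words=1000, width=2000,
               height=1000, mask=clean_circle_mask, stopwords=stopwords,
               min_font_size=2, contour_width=3, contour_color='black')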
 
wc.generate(text)
wc.to_file(path.join(d, "wc-masked.png"))

Let’s call this script make-masked-wordcloud.py.

You can control which words are ignored using the STOPWORDS set. Note that STOPWORDS contains English words only, so for German text it does not work as expected. If you have a lot of words like “der”, “die”, “das” or “zum” that you want to remove, either add them to the stopwords set in the script or simply make a copy of the text and delete them from the text directly.
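Deleting the unwanted words from a copy of the text can also be scripted. The following is a minimal sketch using only the standard library; the function name remove_words and the list GERMAN_STOPWORDS are hypothetical names chosen for this illustration and are not part of the wordcloud module.

```python
import re

# Hypothetical example list; extend it with whatever words dominate your cloud.
GERMAN_STOPWORDS = {"der", "die", "das", "zum"}


def remove_words(text, words):
    """Return text with every whole-word occurrence of `words` removed (case-insensitive)."""
    lowered = {w.lower() for w in words}

    def drop(match):
        # Replace the word with nothing if it is in the list, otherwise keep it.
        word = match.group(0)
        return "" if word.lower() in lowered else word

    # \w+ matches whole words only, so "die" does not affect "dieser".
    cleaned = re.sub(r"\w+", drop, text)
    # Collapse the runs of spaces left behind by removed words.
    return re.sub(r"[ \t]{2,}", " ", cleaned).strip()
```

The cleaned string can then be passed to wc.generate() in place of the original text.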

A map (mask) is a black-and-white image that determines where the words are placed: words are drawn in the black areas, and pure white areas stay empty. In this example the map is this:

map

Let’s name this file “map.png” so that the above script works.
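If no suitable image is at hand, a simple map can also be generated in code. The sketch below builds a circular mask with numpy; the function name make_circle_mask and its parameters size and margin are assumptions made for this example, and the circle is only an illustration, not the map used in this article.

```python
import numpy as np


def make_circle_mask(size=400, margin=10):
    """Return a size x size uint8 array: 0 (black) inside a centered circle, 255 (white) outside."""
    # Open coordinate grids for rows (yy) and columns (xx).
    yy, xx = np.ogrid[:size, :size]
    center = (size - 1) / 2.0
    radius = size / 2.0 - margin
    inside = (xx - center) ** 2 + (yy - center) ** 2 <= radius ** 2
    # Start with an all-white image and paint the circle black.
    mask = np.full((size, size), 255, dtype=np.uint8)
    mask[inside] = 0
    return mask
```

The returned array can be passed directly as the mask argument of WordCloud, or saved with Pillow via Image.fromarray(mask).save("map.png").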

For the input text I’ll just copy the text from this article into a .txt file.

When making word clouds from articles or books, do not include the reference list in the input; otherwise you will have to remove years, “vol”, etc. from the text.
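Cutting off the reference list can be automated if it starts with a recognizable heading. This is a minimal sketch under that assumption; the function name strip_references and the default heading “References” are hypothetical and need to be adapted to the actual input.

```python
def strip_references(text, heading="References"):
    """Return text up to (not including) the last line that consists only of `heading`."""
    lines = text.splitlines()
    # Search from the end so that an earlier mention of the heading is not cut by mistake.
    for index in range(len(lines) - 1, -1, -1):
        if lines[index].strip().lower() == heading.lower():
            return "\n".join(lines[:index])
    # No such heading found: leave the text unchanged.
    return text
```

For a German article, a call such as strip_references(text, heading="Literatur") may be more appropriate, depending on how the source names its reference section.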

With “input_text.txt” and “map.png” in place, we can call make-masked-wordcloud.py like this:

?> python make-masked-wordcloud.py

This results in two word clouds: a rectangular one (wc-rectangle.png) and a masked one (wc-masked.png).

The word “method” could be removed from input_text.txt, as well as “et al.”, but that’s not important for knowing how to use the module.

Everything used to generate the wordcloud is available in word-cloud.tgz (requires access to the SFB’s Confluence).

See also