Word clouds are a nice way to visually summarize what a text is about. The following shows how to create them programmatically in Python.
To create a rectangular word cloud, see the official documentation.
Creating a masked word cloud programmatically requires a Python script. Here is an example script:
#!/usr/bin/env python
"""
Masked wordcloud
================

Using a mask you can generate wordclouds in arbitrary shapes.
"""

import os
from os import path

import numpy as np
from PIL import Image
from wordcloud import WordCloud, STOPWORDS

# Get the data directory (getcwd() is needed to support running the example
# in a generated IPython notebook, where __file__ is not defined).
d = path.dirname(__file__) if "__file__" in locals() else os.getcwd()

stopwords = set(STOPWORDS)

# Read the whole text (assumes the file is UTF-8 encoded).
with open(path.join(d, "input_text.txt"), encoding="utf-8") as f:
    text = f.read()

# Load the mask image; pure white areas of the mask stay empty.
mask = np.array(Image.open(path.join(d, "map.png")))

# Rectangular word cloud.
wc_rect = WordCloud(background_color="white", max_words=500,
                    width=3000, height=1500, stopwords=stopwords,
                    min_font_size=2, contour_width=3, contour_color="black")
wc_rect.generate(text)
wc_rect.to_file(path.join(d, "wc-rectangle.png"))

# Masked word cloud; with a mask, the output takes the mask's size and
# width/height are ignored.
wc = WordCloud(background_color="white", max_words=1000,
               width=2000, height=1000, mask=mask, stopwords=stopwords,
               min_font_size=2, contour_width=3, contour_color="black")
wc.generate(text)
wc.to_file(path.join(d, "wc-masked.png"))
Let’s call this script make-masked-wordcloud.py.
You can control which words are ignored using the STOPWORDS set, but this does not always work. The built-in STOPWORDS list contains English words only, so for German text it does not work as expected. If you have many words such as “der”, “die”, “das”, or “zum” that you want to remove, simply make a copy of the text and delete them from it directly.
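Deleting words by hand gets tedious for longer texts, so the same cleanup can be scripted. The following is a minimal sketch using only the standard library; the helper name remove_words and the short GERMAN_STOPWORDS list are illustrative assumptions, not part of the wordcloud module.

```python
import re

# Hypothetical, deliberately short list; extend it for your own texts.
GERMAN_STOPWORDS = {"der", "die", "das", "zum", "und"}


def remove_words(text, words):
    """Return text with whole-word, case-insensitive matches of words removed."""
    if not words:
        return text
    # \b keeps "und" from matching inside a longer word such as "Hund".
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in sorted(words)) + r")\b"
    cleaned = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    # Collapse all whitespace runs (including line breaks) into single spaces.
    return " ".join(cleaned.split())


sample = "Der Hund und die Katze gehen zum Haus"
print(remove_words(sample, GERMAN_STOPWORDS))
```

The cleaned string can then be passed to generate() in place of the raw text. Alternatively, since STOPWORDS is an ordinary Python set, passing stopwords=STOPWORDS | GERMAN_STOPWORDS to WordCloud should filter the extra words without touching the input file.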
A map (or mask) is a black-and-white image that determines where the words are placed: words are drawn only where the image is not pure white. In this example the map is this:
Let’s name this file “map.png” so that the above script works.
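If you do not have a suitable image at hand, a simple mask can also be built as an array. The sketch below creates a circular mask with NumPy; the function name circle_mask and the default size and radius are assumptions chosen for illustration.

```python
import numpy as np


def circle_mask(size=300, radius=130):
    """Return a size x size uint8 array: 0 inside the circle, 255 outside."""
    y, x = np.ogrid[:size, :size]
    center = size // 2
    # True for every pixel that lies outside the circle.
    outside = (x - center) ** 2 + (y - center) ** 2 > radius ** 2
    # White (255) marks the area to keep empty; black (0) is drawable.
    return (255 * outside).astype(np.uint8)


mask = circle_mask()
print(mask.shape, mask[150, 150], mask[0, 0])
```

The array can be passed directly as the mask argument of WordCloud, or saved as an image with Image.fromarray(mask).save("map.png") from PIL so that the script above picks it up.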
For the input text I’ll just copy the text from this article into a .txt file.
When making word clouds from articles or books, leave the reference list out of the input; otherwise you will have to remove years, “vol.”, and similar tokens from the text yourself.
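Cutting off the reference list can be automated when it starts with a recognizable heading. This is a rough sketch that assumes the list begins at a line consisting only of the word "References"; the helper name strip_references is hypothetical, and texts with a different heading need a different heading argument.

```python
import re


def strip_references(text, heading="References"):
    """Return text up to the first line that consists only of heading."""
    pattern = r"^[ \t]*" + re.escape(heading) + r"[ \t]*$"
    match = re.search(pattern, text, flags=re.IGNORECASE | re.MULTILINE)
    if match is None:
        # No such heading found: leave the text unchanged.
        return text
    return text[:match.start()].rstrip()


sample = "Intro text.\n\nReferences\n[1] Smith, vol. 3, 2019."
print(strip_references(sample))
```

A sentence that merely contains the word, such as "References are listed below", is left alone because the heading must fill the whole line.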
With “input_text.txt” and “map.png” in place, we can run make-masked-wordcloud.py like this:
?> python make-masked-wordcloud.py
This produces two word clouds: a rectangular one (wc-rectangle.png) and a masked one (wc-masked.png).
The word “method” and “et al.” could also be removed from input_text.txt, but that is not essential for understanding how to use the module.
Everything used to generate the wordcloud is available in this