US Presidential State of the Union
Speech Visualization
Felix Gonda
felix.e.gonda@gmail.com/fgonda@fas.harvard.edu
Homework #3 Solution
Visualization (CS 171), Harvard University, Spring 2011
Harvard
School of Engineering
& Applied Science
Instructor: Dr. Hanspeter Pfister, Spring 2011
Part I: Acquire and manipulate text data
For my visualization, I followed the recommendation given in the homework problem: a comparison of all
the US presidential State of the Union speeches. I chose this dataset for two reasons:
• The data is readily available on the web in a single location, which makes it easier to write a
scraping program that analyzes the text and generates the word frequency tables.
• A presidential speech tends to have a specific theme and tone that reflect the state of the
country. For example, during George W. Bush's presidency his speeches tended to center on security,
while during Obama's presidency the speeches center on the economy. It is therefore interesting to
see the direction of the country through the words the presidents have used over the years.
Data Source:
The source for all the speech texts is the American Presidency Project (http://www.presidency.ucsb.edu/sou.php)
which contains a database of all the state of the union addresses given by all US presidents from 1790 to 2011.
Data Acquisition and Code:
I acquired all my data programmatically using a Python script that I wrote based on the homework 2 utilities.
The script uses BeautifulSoup to open the links to all the speech documents, then scrapes and cleans the text.
It generates one TSV file per speech and stores it in the data directory of the visualization sketch.
While it runs, the script prints progress messages to the console to tell the user which text is currently
being processed.
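The clean, count, and write steps described above could look roughly like the sketch below. It is not the actual homework script: the function names and the two-column TSV layout (word, count) are assumptions made for illustration, and a simple regular expression stands in for BeautifulSoup so that the example runs with the standard library only and without network access.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical helpers; the original script is not reproduced in this write-up.
TAG_RE = re.compile(r"<[^>]+>")              # crude stand-in for BeautifulSoup
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")  # words, allowing one apostrophe


def clean_text(html):
    """Replace markup with spaces and lowercase the remaining text."""
    return TAG_RE.sub(" ", html).lower()


def word_frequencies(text):
    """Count occurrences of each word in already-cleaned text."""
    return Counter(WORD_RE.findall(text))


def write_tsv(freqs, out):
    """Write word<TAB>count rows, most frequent first (assumed layout)."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Sort by descending count, then alphabetically so the output is stable.
    for word, count in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([word, count])


if __name__ == "__main__":
    sample = "<p>The state of the Union is strong. The economy is growing.</p>"
    buffer = io.StringIO()
    write_tsv(word_frequencies(clean_text(sample)), buffer)
    print(buffer.getvalue(), end="")
```

In a real run, `out` would be a file opened in the sketch's data directory, one file per speech, rather than an in-memory buffer.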
Part II: Visualization with Processing
My Processing visualization program can be found in the “WordVisualizer” directory of my homework 3 solution.
It contains the WordVisualizer file, which is the main program to run.
The visualization presents a histogram of the 100 most frequent words and a CDF graph. On the right-hand side
is a set of navigation controls for browsing speeches by year and for setting the bin size of the
histogram (from 1 to 10). The “Word Tree” button opens a tree version of the visualization, which I
implemented for extra credit. Hovering the mouse over a histogram bar shows the words represented by
that bar in a tooltip.
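The data behind the histogram and the CDF graph can be sketched in a few lines. The sketch is written in Python rather than Processing and is not taken from the WordVisualizer source. It assumes that the bin size is the number of consecutively ranked words grouped into one bar, and that the CDF is the running share of the total count of the displayed words. The function names are hypothetical.

```python
from itertools import accumulate


def top_words(freqs, n=100):
    """Return the n most frequent (word, count) pairs, highest count first."""
    return sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def bin_words(ranked, bin_size):
    """Group consecutive ranked words into bars of bin_size words each."""
    if not 1 <= bin_size <= 10:
        raise ValueError("bin_size must be between 1 and 10")
    bars = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        words = [word for word, _ in chunk]       # what a tooltip would list
        total = sum(count for _, count in chunk)  # height of the bar
        bars.append((words, total))
    return bars


def cdf(bars):
    """Cumulative fraction of the total count covered up to each bar."""
    totals = [total for _, total in bars]
    grand_total = sum(totals)
    if grand_total == 0:
        return []
    return [running / grand_total for running in accumulate(totals)]


if __name__ == "__main__":
    counts = {"the": 4, "of": 3, "union": 2, "state": 1}
    bars = bin_words(top_words(counts), bin_size=2)
    for (words, total), share in zip(bars, cdf(bars)):
        print(words, total, share)
```

Under these assumptions a bin size of 1 gives one bar per word, while larger bin sizes merge neighboring ranks into coarser bars.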
Below is a screenshot of my visualization sketch in Processing:
References
Research Materials & References:
(1) Visual Thinking for Design
Morgan Kaufmann; First Edition (April 18, 2008)
(2) Processing: A Programming Handbook for Visual Designers and Artists
The MIT Press (September 30, 2007)
(3) Ben Fry.
http://benfry.com/
(4) Processing Language
http://processing.org/
(5) American Presidency Project
http://www.presidency.ucsb.edu/sou.php