A Humble Experiment in Data Mining


Search results for “sugar tariff” from historical newspapers, by date. The tallest bar is 1920-1929.

A few years back the ProQuest database folks implemented a nice feature in their search results. In the right-hand column there’s a bar graph showing the number of hits by date for your search term. It’s a search facet, so you can zoom in on a particular decade and it shows you a breakdown by year. I’ve found this feature to be instructive, especially in teaching, because it’s an easy, visual way to chart the significance of some phenomenon (say, “sugar tariffs”).

Out of curiosity, I wanted to know how hard it would be to create such a bar graph for an online archive that did not automatically generate one. My in-house digital humanities consultant sat down with me yesterday to experiment with this. We used Cornell University’s HEARTH, a fabulous online archive of home economics, nutrition, dentistry, and random other topics dating from the 1850s through the 1950s. From an initial search for “sugar,” we got 149,847 matches in 4,146 records. After a few hours of playing around, this is what we came up with:


Hits Per Year for “Sugar” in the HEARTH Database. Note that it has the same general shape as the newspaper results for “sugar tariff,” both peaking in the 1910s and 1920s.

I find it quite fascinating (and not at all surprising) to see the peak in hits between 1913 and the mid-1920s, at about the same time and pace as the debates about sugar tariffs. People in those years cared a lot more about tariffs than people do now. They didn’t glaze over at the mere mention of the t word. The HEARTH archive features nutrition, etiquette, and health books and magazines, sources which do not in themselves discuss the tariff or sugar politics writ large. But these findings offer a kind of confirmation of my hunch that people talked a lot more about eating sugar at the same time that they had heated debates about the sugar tariff in the 1910s and 1920s.

So how did we make this graph? It may be that some DH wunderkind has some different tools up their sleeves to accomplish this, but here’s our homemade technique.

1. We tinkered with the URL from the search results so that all of the records for the “sugar” search showed up on one page. The very end of the unwieldy URL goes like this: start=1;size=25. We changed it so that it read start=1;size=5000. It coughed up the results relatively quickly.

2. We copied and pasted the results into Notepad++ and went through a number of steps to clean up the data. With its nice find/replace function, Notepad++ is a great tool for systematically converting and cleaning up text. We essentially removed all text, leaving only the numbers. We did this slowly, one step at a time, until all that was left were pairs of numbers, tab-separated like this:

1902    26
1902    26
1919    12
1921    1
1917    100
1904    1
1919    105

And so on. The first number is the year, the second number is the number of hits. Each pair represents one book or magazine that had hits for “sugar.” Thus, as you can see, there were multiple lines for each year. We condensed these in the next step.

3. We copied and pasted this list into a Google Docs spreadsheet. I can’t remember exactly what I did to make it do this, but I did some magical step that merged together all of the hits where the year matched. (Update: the magical step may have been converting it to a pivot table.) I tried to make a nice bar graph in Google Docs, but I couldn’t figure out how to make it label the axes properly. So I ended up cutting and pasting the data into Excel to make the chart. The hardest part was saving the chart out.
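
For anyone who would rather script the cleanup and the “magical step,” here is a rough Python sketch of the same idea. It is not what we actually did. It assumes the text has already been reduced to year/hits pairs like the sample above, and the names (raw, PAIR, hits_per_year) are made up for illustration.

```python
import re
from collections import defaultdict

# Sample rows copied from the cleaned-up list above: year, tab, hits.
raw = """1902\t26
1902\t26
1919\t12
1921\t1
1917\t100
1904\t1
1919\t105"""

# Assumed layout: a four-digit year, some whitespace, then a hit count.
PAIR = re.compile(r"^\s*(\d{4})\s+(\d+)\s*$")


def hits_per_year(text):
    """Sum the hit counts for every line that shares a year."""
    totals = defaultdict(int)
    for line in text.splitlines():
        match = PAIR.match(line)
        if match:  # skip anything that is not a clean year/hits pair
            totals[int(match.group(1))] += int(match.group(2))
    # Return a plain dict with the years in ascending order.
    return dict(sorted(totals.items()))


totals = hits_per_year(raw)
for year, hits in totals.items():
    print(year, hits)
```

The summing loop plays the role of the spreadsheet merge: every line with the same year is folded into one total, which is the number each bar in the chart represents.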

Not fancy, and somewhat labor intensive. But mildly entertaining for a Saturday afternoon!
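
The URL tweak from step 1 could be scripted in the same spirit. This is only a sketch: the address below is a hypothetical stand-in (only the trailing start=1;size=25 comes from the real search URL), and widen_page_size is a name invented here.

```python
import re


def widen_page_size(url, size=5000):
    """Return the URL with its size=N parameter swapped for a larger page size."""
    return re.sub(r"\bsize=\d+", f"size={size}", url)


# Hypothetical URL; only the "start=1;size=25" ending is taken from the post.
example = "https://example.org/search?q1=sugar;start=1;size=25"
print(widen_page_size(example))
```

A plain text substitution is used here because the parameters are separated by semicolons rather than ampersands, which some URL-parsing helpers do not split on by default.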

If you have other ideas about how to do this, please feel free to drop us a line.


About aprilmerleaux

I am an Assistant Professor of History at Florida International University. My research and teaching focus on the 20th century United States in international context. My book, Sugar and Civilization: American Empire and the Cultural Politics of Sweetness, was published by UNC Press in 2015.
One Response to A Humble Experiment in Data Mining

  1. Dale says:

    Thanks for putting a link to this mini-project on FB. It’s a great example of putting together curiosity and a bit of tinkering. Best of all, you actually describe the steps you took, even those that involved magic.

    I am so far removed from DH-wunderkind that it’s presumptuous of me even to comment on your methodology, but will do so anyway. Have you ever tinkered with OpenRefine, formerly known as Google Refine? It might have done your cleanup in fewer steps, and would certainly be the way to go for a larger data source. I’ve puttered with it somewhat, but it’s been a while.

