Trending Low-Volume Google Searches - Introducing Gootrude
15 June, 2008

The Google Trends project lets you enter search terms such as "Myspace", "2008 election", or "Linux" and see how their popularity changes over time. The resulting graphs can be quite interesting - spikes in search volume can sometimes be correlated with particular news articles and world events, and the Google Trends interface points these out.
This is a handy tool, but there are many search terms for which Google Trends does not display any results. According to the error message that Google Trends generates, such terms (for example "Linux Firewalls" - with the quotes) have insufficient search volume to display graphs. Fair enough. I suppose that Google sets an internal threshold on search volume, and the reasons for this threshold could range anywhere from Google Trends still being experimental to Google not wanting to provide data on how it builds its massive search index for emerging search terms. Either way, I would like a way to see search term trends that Google doesn't currently make available to me.
Although I'm an open source developer and author, search terms related to my projects are not yet popular enough to be displayed by Google Trends. So I had to roll my own trending mechanism, and this blog post announces the release of Gootrude, a new open source tool (see the quick start, source code, and download links) that I wrote to do just this. The basic strategy is to take a collection of search terms defined by the user, automatically query Google for the number of results associated with each of these search terms (Google displays this number when you do a web search), and graph these numbers over time with Gnuplot. Let me state up front that Gootrude only makes use of data that Google freely provides to everyone through normal web searches, and that it is meant to be run once per day (so as not to be a pest in terms of the number of queries it makes). As an example, if you type the word "security" into the Google web interface, it will return a string like "Results 1 - 10 of about 1,010,000,000". The "1,010,000,000" number is collected by Gootrude and stored in a file along with the current time.
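To make the collection step concrete, here is a minimal Python sketch of pulling the count out of a results string like the one quoted above and appending it to a file with a timestamp. This is only an illustration and not Gootrude's actual code: the function names, the regular expression, and the tab-separated file layout are all assumptions of mine, and the sketch deliberately leaves out the web query itself.

```python
import re
import time

# Assumed pattern for a results string such as
# "Results 1 - 10 of about 1,010,000,000" (the example quoted in this post).
RESULT_COUNT_RE = re.compile(r"of about ([\d,]+)")


def parse_result_count(text):
    """Return the result count as an int, or None if no count is found."""
    match = RESULT_COUNT_RE.search(text)
    if match is None:
        return None
    # Strip the thousands separators before converting.
    return int(match.group(1).replace(",", ""))


def append_sample(path, term, count, timestamp=None):
    """Append one 'timestamp<TAB>term<TAB>count' line (hypothetical layout)."""
    if timestamp is None:
        timestamp = int(time.time())
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(f"{timestamp}\t{term}\t{count}\n")


if __name__ == "__main__":
    count = parse_result_count("Results 1 - 10 of about 1,010,000,000")
    append_sample("security.dat", "security", count)
```

A file with one line per day in this shape is easy to hand to Gnuplot, which is part of why I picked a flat text layout for the sketch.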
For the past year, I have sent a set of search terms through Google once per day with Gootrude, and the results are displayed below. Visible within the data returned from Google are strange oscillations that vary quite a bit more than I would have expected, as well as evidence of what happens when a large site (like linux.com) posts an article about a Cipherdyne project.
First, below is the graph of the fairly unique word "cipherdyne" since late June 2007. The filled-in red curve is the absolute number of search results (taken each day around 1am), and the green line is the 10-day moving average.
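For readers who want to reproduce the smoothed line, here is a minimal Python sketch of a 10-day moving average over a list of daily counts. It assumes a trailing window (each point averages the current day and the days before it) and averages over however many days are available at the start of the series. Both choices are assumptions for this sketch, and the function name is hypothetical.

```python
def trailing_moving_average(values, window=10):
    """Average each point with up to window - 1 preceding points.

    Early points, where fewer than `window` samples exist, are averaged
    over the samples available so far (an assumption of this sketch).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    averages = []
    running_total = 0.0
    for i, value in enumerate(values):
        running_total += value
        if i >= window:
            # Drop the sample that just fell out of the window.
            running_total -= values[i - window]
        averages.append(running_total / min(i + 1, window))
    return averages


if __name__ == "__main__":
    daily_counts = [100, 120, 90, 400, 110, 105]
    print(trailing_moving_average(daily_counts, window=3))
```

A trailing window only uses past data, so the most recent point on the graph is always defined, at the cost of the smoothed line lagging behind sudden spikes.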

Now, here is the graph of the search term "gpgdir":

Next, here is the graph of "single packet authorization":

There are lots of unanswered questions this sort of data brings to mind:
- All of the data for the above graphs was collected from a single Linux system. How different would the results be if several systems in different geographic locations collected the data and the average for each data point was used instead?
- Each data point was collected around 1am every morning. If the data collection time were, say, 1pm, would the results have been significantly different?
- What is the "optimal" time scale for the moving average? Given that Google's own Trends interface seems to show search results on the macro level, would a much longer moving average than 10 days - perhaps on the order of several weeks - be a more accurate reflection of search popularity?
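The first question above could be explored with something like the following Python sketch, which averages the counts reported for the same day by several collection systems. Gootrude does not do this today. The nested-dictionary input layout, the host names, and the function name are all hypothetical.

```python
from collections import defaultdict


def average_across_collectors(series_by_host):
    """Average per-day counts over every host that reported that day.

    series_by_host maps a host name to a {day: count} dictionary
    (a hypothetical layout chosen for this sketch).
    """
    counts_by_day = defaultdict(list)
    for samples in series_by_host.values():
        for day, count in samples.items():
            counts_by_day[day].append(count)
    # Days are returned in sorted order; a day missing from one host is
    # averaged over the hosts that did report it.
    return {
        day: sum(counts) / len(counts)
        for day, counts in sorted(counts_by_day.items())
    }


if __name__ == "__main__":
    data = {
        "host-east": {"2008-06-01": 100, "2008-06-02": 200},
        "host-west": {"2008-06-01": 300},
    }
    print(average_across_collectors(data))
```

Comparing the averaged series against each individual host's series would show how much of the oscillation is specific to one collection point.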
In closing, I would like to mention that Gootrude is just getting started, so there are lots of enhancements that need to be made. Some of the most important features to develop are:
- Integration with the Google Charts API.
- Development of an online web portal for Gootrude so that users don't have to have their own infrastructure to run Gootrude.
- Ability to import search data from different Gootrude collection systems.
- Support for data collected from additional search engines.
Finally, here are a few additional graphs of search terms over the past year:




