Today’s guest post was written by Sophy Bishop. Sophy is a student at Simmons College Graduate School of Library and Information Science where she focuses on information organization and user experience (UX). She works part time at the HLS Office of Communications and interns at America’s Test Kitchen with the production team.
I first started working on site search analysis at Harvard Business School in February of 2012. My supervisor there, Ravi Mynampaty, had the idea that “clustering,” or grouping similar search terms together, might help in analyzing Google Analytics search data and, in turn, help us make adjustments that would improve the “findability” of the content for which site visitors were most often searching.
I was given a list of search queries from the Harvard Business School Working Knowledge site and was asked to develop a plan using this concept. The findings would be presented that May at Enterprise Search Europe 2012, so the pressure was on.
Where to begin?
It took several weeks to wrap my head around the project and figure out where to begin. My experience with Google Analytics was minimal, and I hadn’t done much data analysis before. Reading Lou Rosenfeld’s book, Search Analytics for Your Site (2011), gave me some ideas.
Rosenfeld stated that it was a valuable practice to look at your search queries and compare them to your site content; however, he did not outline a specific way of doing so. He did state that looking at the data over a period of time was helpful: one starts to see patterns and trends within the information that the human brain can recognize easily.
So after staring for a while and beginning to see some trends, I came up with a process for clustering, which is described below.
The process
In this method, it is important to remember that search terms are representations of concepts that people are searching for. In a traditional analytics report, similar or synonymous concepts are separated because they are being represented by different search terms. The act of clustering brings these concepts back together to form a more holistic picture of what people are looking for.
The clustering process is different for each site. It all depends on:
- what kind of site you are looking at,
- how far back in time you want to go,
- what you want to use it for, and
- how much time you want to put in.
While still a work in progress, clustering is starting to show some fascinating and useful information about the way in which users search for and access information. I am now using it to analyze searches on the Harvard Law School website as well. Grouping similar queries together, such as “harvard law school resume” and “hls resume,” makes these ideas easier to comprehend and to represent in numbers.
Step 1: Create a Search Query Report
The first step in this process is to create a report from an analytics tool; in both of my cases, that was Google Analytics. It is important to set parameters on this, or you could end up working on it forever. On the HLS site, I chose to look at one year, from May 1st, 2011 to May 1st, 2012, and exported terms that had 300 searches or more. (If you are interested in looking at search queries related to your content, contact the Office of Communications.)
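The threshold step could be sketched roughly as follows. This assumes the report has been exported as a CSV with the hypothetical column names `query` and `searches`; a real Google Analytics export may label its columns differently, and the sample rows here are illustrative only.

```python
import csv
import io

# Hypothetical export: column names and the last row are illustrative only.
SAMPLE_EXPORT = """query,searches
hls registrar,"5,852"
harvard law registrar,"2,673"
registrar hls,425
some rare query,12
"""

MIN_SEARCHES = 300  # the cutoff used for the HLS report


def load_queries(csv_text, min_searches=MIN_SEARCHES):
    """Return (query, searches) pairs at or above the cutoff."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Remove thousands separators such as "5,852" before converting.
        count = int(row["searches"].replace(",", ""))
        if count >= min_searches:
            rows.append((row["query"].strip().lower(), count))
    return rows


filtered = load_queries(SAMPLE_EXPORT)
```

Raising or lowering `MIN_SEARCHES` is one way to control how much of the report you take on.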
Step 2: Cluster!
Once these raw search queries have been exported to a spreadsheet, it is time to start clustering. There are different degrees of clustering, and these different degrees will give you different information.
On the HLS site, I began with what I call “mid-level” clustering. I standardized the different queries into groups so that one could see concepts searched for. One could assign narrower clusters by simply removing any grammatical or spelling mistakes, or wider clusters (or facets), such as “faculty,” “subject,” or “organization” to see what people are searching for more broadly.
Example of mid-level clustering:
| parking at hls | hls parking | 471 |
| parking at hls | hls parking | 333 |
| harvard law school personal statement | hls personal statement | 374 |
| harvard law personal statement | hls personal statement | 358 |
| hls registrar | hls registrar | 5,852 |
| harvard law registrar | hls registrar | 2,673 |
| harvard law school registrar | hls registrar | 2,538 |
| registrar hls | hls registrar | 425 |
| harvard law school requirements | hls requirements | 4,607 |
| harvard law requirements | hls requirements | 816 |
| requirements for harvard law school | hls requirements | 810 |
| harvard law resume | hls resume | 1,534 |
| law school resume | hls resume | 1,263 |
| harvard law school resume | hls resume | 920 |
| law school resume template | hls resume | 416 |
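For anyone who would rather keep the assignments in code than in a spreadsheet column, here is a minimal sketch. It assumes a hand-built lookup table; the function name `assign_cluster` is hypothetical, and the mapping itself remains a manual judgment call rather than anything the code decides.

```python
# Hand-built lookup from raw query to cluster label, copied from a few
# rows of the example above. Deciding what belongs together is still manual.
CLUSTER_OF = {
    "harvard law registrar": "hls registrar",
    "harvard law school registrar": "hls registrar",
    "registrar hls": "hls registrar",
    "harvard law resume": "hls resume",
    "law school resume": "hls resume",
    "harvard law school resume": "hls resume",
    "law school resume template": "hls resume",
}


def assign_cluster(raw_query):
    """Return the cluster label, or the tidied query itself if unmapped."""
    # Normalize case and collapse repeated spaces before the lookup.
    key = " ".join(raw_query.lower().split())
    return CLUSTER_OF.get(key, key)
```

Unmapped queries fall through unchanged, so they stay visible as their own one-term clusters until someone decides where they belong.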
Once I performed this step, I created a simplified list of clusters by removing the duplicates and adding together the frequencies (the number of times each term was searched), so that one line represents each clustered concept and the total frequency with which it was searched.
| harvard law school personal statement | hls personal statement | 374 |
| hls registrar | hls registrar | 5,852 |
| harvard law school requirements | hls requirements | 4,607 |
| harvard law resume | hls resume | 1,534 |
| harvard law school reunion | hls reunion | 1,026 |
| harvard law school schedule | hls schedule | 358 |
| harvard law school scholarships | hls scholarships | 632 |
| harvard law sfs | hls sfs | 437 |
| harvard sjd | hls sjd | 991 |
| harvard spring break | hls spring break | 661 |
| harvard law school statistics | hls statistics | 508 |
There are different ways to do this; I performed the task manually and kept the previous list on a separate page so that I could see which queries went into which cluster. At HBS, a colleague created a macro to perform this task automatically and view or hide the information that went into each cluster.
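As a rough illustration of what automating this step might look like (this is not the HBS macro, only a sketch under my own assumptions), the following sums the frequencies per cluster and keeps a record of which raw queries went into each one. The rows are the registrar and resume lines from the mid-level clustering example; the function name `summarize` is hypothetical.

```python
from collections import defaultdict

# (raw query, cluster label, searches) rows from the mid-level example.
ROWS = [
    ("hls registrar", "hls registrar", 5852),
    ("harvard law registrar", "hls registrar", 2673),
    ("harvard law school registrar", "hls registrar", 2538),
    ("registrar hls", "hls registrar", 425),
    ("harvard law resume", "hls resume", 1534),
    ("law school resume", "hls resume", 1263),
    ("harvard law school resume", "hls resume", 920),
    ("law school resume template", "hls resume", 416),
]


def summarize(rows):
    """Sum searches per cluster and remember each cluster's members."""
    totals = defaultdict(int)
    members = defaultdict(list)
    for raw_query, cluster, searches in rows:
        totals[cluster] += searches
        members[cluster].append(raw_query)
    return dict(totals), dict(members)


totals, members = summarize(ROWS)

# Print one line per cluster, largest first, with its member count.
for cluster, total in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{cluster}: {total:,} searches from {len(members[cluster])} queries")
```

Keeping `members` alongside `totals` plays the same role as the separate page in the manual approach: you can always see what went into a cluster and undo it.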
To some degree, this analysis is quite subjective. Some may not agree with the way I clustered terms. For example, they may have put “harvard resume” in a separate category from “hls resume sample”. I chose to put them together because I felt the concept they were searching for was similar. Fortunately, it is not difficult to go back into the list and separate and change terms if you track changes and versions of your documents.
A note about focusing on the “long tail”:
In many search analysis exercises, it is common to look only at the top searches, those that are deemed “significant” and only one standard deviation from the mean. In clustering, the aim is to bring in the “long tail” of the search.
For example, the exact term “Harvard llm” was searched 18,544 times in the raw query report. After clustering, however, I discovered that people had searched for the concept of “Harvard law school llm” 33,833 times. To get this number, I grouped together the different ways in which people search for the “Harvard law school llm degree,” including terms such as “harvard university llm”, “hls llm”, “harvard llm program” and “llm harvard law school”.
By looking at search terms that would normally be deemed “insignificant” and grouping them together, concepts that might have gone unnoticed, but are indeed important, are brought to the forefront.
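To put a number on how much the long tail contributed in the LLM example, here is a small worked calculation. It assumes the clustered total of 33,833 includes the 18,544 exact-term searches, which the figures suggest but which is my reading rather than something stated outright.

```python
exact_match = 18_544    # searches for the exact term "Harvard llm"
cluster_total = 33_833  # searches for the whole clustered concept

# Assumption: the cluster total already includes the exact-term searches.
long_tail = cluster_total - exact_match
long_tail_share = long_tail / cluster_total

print(f"{long_tail:,} searches ({long_tail_share:.1%}) used a variant phrasing")
```

Under that assumption, a large share of the interest in the concept would be invisible to anyone reading only the top line of the raw report.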
Step 3: Ways to apply the data
Now that we have our clustered list, what do we do with it?
Just as the “clustering” process is flexible in its methods, it is also flexible in its applications. At HBS we developed several ways in which the information could be used, depending on what outcomes you would like to see and how willing you are to adjust your site.
SEARCH ANALYSIS
1. The first way to use the set of clusters you have created is as an organized base for search analysis.
- Does the list represent what you thought your top searches would be?
- How close is it to the data report that came right out of Google Analytics?
- Does the information on your site accurately match these searches?
- Are you fulfilling user needs?
At HLS, I would say this is the primary reason for the search analysis and “clustering”: to gain insight into what people are searching for most and to move forward from there. With the list that I culled, I was able to determine which searches were highly important and which fall to a lower ranking. This is useful in determining front-page content: what needs to be highlighted and what perhaps needs to be edited out.
For example, in the raw query report, “sample cover letters” was searched 4,105 times. After clustering, the concept moved up to 19,111 searches. It seems that people are very eager to see sample cover letters and gain some advice on the topic! It is useful to look at the top “clusters” and compare them to the structure of the website. Are there any major gaps? How simple is it for users to find the information they are looking for, and what comes up when they do?
METADATA and CONTROLLED VOCABULARIES
2. In addition to analyzing search topics, at HBS we determined that clustering could be an informative tool for creating metadata and a controlled vocabulary*. Using the actual terms that people searched is a solid way to begin tagging your content to improve SEO and findability. And while it will not entirely form a new controlled vocabulary or taxonomy, it can certainly inform the creation of one.
INTERNAL SITE SEARCH
3. We also established that if you have the time, using your clustered terms could help improve internal site search engine results (as opposed to improving searches from external search engines).
Once you have established a list of metadata tags for pages and posts, you can re-route people’s searches to fit those tags. For example, if someone searches for any of the following, the search would re-route to the established tag “hls status check”:
| harvard law status check |
| harvard law status |
| harvard law status checker |
| harvard law application status |
| harvard law school status check |
| harvard law school application status |
| harvard status check |
| harvard status checker |
This is work for the programmer and content managers to implement, but in some situations it could be very helpful, particularly around oddly named or obscure terms.
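In its simplest form, the re-routing could be a lookup that runs before the query reaches the search engine. The sketch below is only an assumption about how a programmer might approach it; `route_query` is a hypothetical name, and a real site search product may offer its own synonym or redirect feature instead.

```python
STATUS_CHECK_TAG = "hls status check"

# Every variant from the list above points at the one established tag.
REROUTES = {
    variant: STATUS_CHECK_TAG
    for variant in (
        "harvard law status check",
        "harvard law status",
        "harvard law status checker",
        "harvard law application status",
        "harvard law school status check",
        "harvard law school application status",
        "harvard status check",
        "harvard status checker",
    )
}


def route_query(query):
    """Return the established tag for a known variant, else the tidied query."""
    # Normalize case and spacing so "Harvard  Status Check" still matches.
    key = " ".join(query.lower().split())
    return REROUTES.get(key, key)
```

Queries that are not in the table pass through untouched, so the re-routing only affects the variants someone has deliberately clustered.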
AUTOSUGGEST
4. Lastly, clustering could be used to inform autosuggest (a.k.a. incremental search). Top clusters make up a large percentage of what people search for, so one could use the most popular concepts to implement/support autosuggestions, pointing out the material that people are most likely to want to find.
For example, on the HLS website some good autosuggest options might be:
• Office of Career Services
• Cover Letter Help
• HLS Library
• HLS Admissions
• Status Check
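A very small sketch of how such a list might drive autosuggest is shown below. It assumes the suggestions are stored in order of popularity (the order here simply follows the list above and is not a measured ranking) and that a suggestion matches when the typed text begins the whole suggestion or any word in it. The function name `suggest` is hypothetical.

```python
# Suggestions in assumed popularity order; the order is illustrative only.
SUGGESTIONS = [
    "Office of Career Services",
    "Cover Letter Help",
    "HLS Library",
    "HLS Admissions",
    "Status Check",
]


def suggest(typed, limit=3):
    """Return up to `limit` suggestions matching what the user has typed."""
    prefix = " ".join(typed.lower().split())
    if not prefix:
        return []
    matches = []
    for suggestion in SUGGESTIONS:
        lowered = suggestion.lower()
        # Match the start of the whole suggestion or of any single word.
        if lowered.startswith(prefix) or any(
            word.startswith(prefix) for word in lowered.split()
        ):
            matches.append(suggestion)
    return matches[:limit]
```

Because the list is already ordered, the most popular matching concepts surface first without any extra sorting.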
Well, that is the story on clustering! Hopefully people at HLS will be able to use this process to improve their websites and make them more useful for users.
Questions?
Please post any questions you may have in the comments and we will respond to them there.
Further reading:
Visit SlideShare to see the HBS clustering slides from Enterprise Search Europe.
*Metadata, controlled vocabularies and tagging:
Our current CMS doesn’t support tagging; however, we are working on a project that would allow us to categorize and tag editorial news articles and spotlights, content which would then be available throughout the site on relevant pages.