In this post, I explain how query preprocessing can improve your onsite search in multiple ways and why it should be a separate step in your search optimization. I will cover the following points:
- What is query preprocessing and why should you use it?
- What is the problem with common structures?
- What are the benefits of externalizing the query preprocessing step from your search engine?
What is Query Preprocessing and Why You Should Use It
Your onsite search is essentially an Information Retrieval (IR) system. Its goal is to ensure your customers (the users) can get the relevant information from it. In an ecommerce shop, this is typically the products they searched for or want to buy. Of course, your website has many other goals, such as using marketing campaigns to increase revenue. However, the main goal is to show your customers the products and information they searched for. The problem is that each user approaches a search in your shop in a personal way. Each customer speaks their own vernacular, if you will. Therefore, it simply isn’t feasible to force customers to speak the language of your particular onsite search, or even to imply that they should. This is especially true given the overwhelming likelihood that your search engine requires some kind of technical language to reach peak performance.
In my experience, there are two extreme cases in which queries do not return the desired search results, aside from the shop not stocking the right product or lacking the information the customer is looking for.
- Not enough information in the query -> short queries like “computer”
- Too much noise in the query -> queries like “mobile computer I can take with me”
In the first case, we expand the query from “computer” to something like: “computer OR PC OR laptop OR notebook OR mobile computer”, to get the best results for our users.
In the second case, we first have to shrink the query by removing the noise from “mobile computer I can take with me” to “mobile computer”, before expanding to something like: “laptop OR notebook OR mobile computer” to get the best results for our users.
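The two cases above can be sketched in a few lines of Python. This is a minimal illustration only: the noise-word set, the synonym table, and the function names `shrink`, `expand`, and `preprocess` are hypothetical and hand-written for these two example queries. A real shop would maintain much larger, curated lists.

```python
# Hypothetical noise words and synonym table, hand-written for the two
# example queries in this post. Not a production word list.
NOISE_WORDS = {"i", "can", "take", "with", "me"}
SYNONYMS = {
    "computer": ["computer", "PC", "laptop", "notebook", "mobile computer"],
    "mobile computer": ["laptop", "notebook", "mobile computer"],
}


def shrink(query: str) -> str:
    """Drop noise words and keep the remaining terms in their original order."""
    kept = [word for word in query.split() if word.lower() not in NOISE_WORDS]
    return " ".join(kept)


def expand(query: str) -> str:
    """Replace a known query with an OR-joined list of alternatives."""
    alternatives = SYNONYMS.get(query.lower(), [query])
    return " OR ".join(alternatives)


def preprocess(query: str) -> str:
    """Shrink first, then expand, as described for the second case."""
    return expand(shrink(query))


if __name__ == "__main__":
    print(preprocess("computer"))
    print(preprocess("mobile computer I can take with me"))
```

With these tables, the short query is only expanded, while the noisy query is first reduced to “mobile computer” and then expanded.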
Of course, these aren’t the only query preprocessing tasks. The following is an overview of typical tasks performed to close the gap between the user's language and the search engine to return better results:
- Thesaurus and synonym entries
- Stemming - reducing words to their root form
- Lowercasing
- Stop-word handling - eliminating non-essential words such as “the”, “it”, “and”, and “a”
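To make these tasks concrete, here is a rough sketch that chains them into one pipeline. Everything in it is an assumption for illustration: the stop-word set and thesaurus are tiny hand-written samples, and `stem` is a deliberately naive suffix stripper, not a real stemming algorithm such as Porter's.

```python
from typing import List

# Tiny hand-written samples for illustration only.
STOP_WORDS = {"the", "it", "and", "a"}
THESAURUS = {"laptop": ["laptop", "notebook"]}


def stem(token: str) -> str:
    """Naive suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(query: str) -> List[List[str]]:
    """Return one group of alternatives per remaining query term."""
    tokens = [t.lower() for t in query.split()]          # lowercasing
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word handling
    tokens = [stem(t) for t in tokens]                   # stemming
    return [THESAURUS.get(t, [t]) for t in tokens]       # synonym lookup


if __name__ == "__main__":
    print(preprocess("The Laptops and a Charger"))
    # [['laptop', 'notebook'], ['charger']]
```

The order of the steps matters in this sketch: lowercasing runs before the stop-word check, and stemming runs before the thesaurus lookup, so the thesaurus only needs entries for stemmed, lowercase terms.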
The Problem with Common Information Retrieval Structures
The preprocessing described above is normally carried out and configured within your search engine. The following graphic shows an overly simplified common onsite search structure:
- Users search using their own language and context regarding your products. This means that they will not intuitively understand the language your Information Retrieval (IR) system prefers.
- In a nutshell, your onsite search is a highly configurable IR system that currently performs all preprocessing.
- The raw data used by your IR system for searching.
In addition to optimizations like correctly configuring fields and their meanings, or running high-return marketing campaigns, most optimizations to your onsite search are done by query preprocessing.
So here’s my question: does it really make sense to do all this preprocessing within a search engine?
Have a look at this overview of potential obstacles when preprocessing is handled within the search engine:
- A deep knowledge of your search engine and its configuration is necessary.
- Changing to a new onsite search technology means losing or having to migrate all previous knowledge.
- Onsite search is not inherently able to handle all your preprocessing needs.
- Debugging errors within a search result is unwieldy, because it’s necessary to audit both the preprocessors and the related parts of the onsite search configuration.
The Benefits of Extracting the Query Preprocessing Step from Your Onsite Search Engine
Having illustrated what query preprocessing is, and which potential problems you could face when running this step inside your search engine, I now want to make the case for externalizing this step in the optimization process. Have a look at the graphic below for a high-level illustration of the concept when preprocessing is done outside your search engine.
- The effort it takes to configure your onsite search engine when migrating from one search vendor to another can be dramatically decreased as a result of having externalized the query preprocessing. This also has the following benefits:
  - Less time spent trying to understand complex search engine configurations
  - Lower total cost of onsite search ownership
- Your query preprocessing gains independence from your search engine’s main features.
- Externalizing means you can cache the query preprocessing independently of your search engine, which has a positive impact on related areas like total cost of ownership, the environment, and so on. Take a look at this article for more information.
- Debugging search results is easier. The exact query string used by the search engine is always transparently visible.
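The benefits listed above can be illustrated with a small sketch of an external preprocessing step. It is not tied to any vendor: `PreprocessedQuery`, `preprocess`, and the two render functions are hypothetical names, and the two output syntaxes are only examples of how different engines might expect a query to be written. Caching is done here with Python's standard `functools.lru_cache`, assuming an in-process cache is sufficient.

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import Tuple

# Tiny hand-written sample for illustration only.
STOP_WORDS = {"the", "it", "and", "a"}


@dataclass(frozen=True)
class PreprocessedQuery:
    """Engine-neutral result of preprocessing (hypothetical structure)."""
    raw: str
    terms: Tuple[str, ...]


@lru_cache(maxsize=10_000)
def preprocess(raw: str) -> PreprocessedQuery:
    """Runs outside the search engine; repeated queries hit the cache."""
    terms = tuple(t for t in raw.lower().split() if t not in STOP_WORDS)
    return PreprocessedQuery(raw=raw, terms=terms)


def render_boolean(query: PreprocessedQuery) -> str:
    """Example syntax for one engine: terms joined with AND."""
    return " AND ".join(query.terms)


def render_plus_prefixed(query: PreprocessedQuery) -> str:
    """Example syntax for another engine: required terms prefixed with '+'."""
    return " ".join("+" + term for term in query.terms)


if __name__ == "__main__":
    query = preprocess("The Mobile Computer")
    # The exact string handed to the engine is visible and easy to log.
    print(render_boolean(query))
    print(render_plus_prefixed(query))
    preprocess("The Mobile Computer")  # served from the cache
    print(preprocess.cache_info())
```

In this sketch, switching search vendors only means writing a new render function. The preprocessing rules and the cache stay untouched, and the rendered string can be logged before it is sent to the engine.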
Now you know the benefits of query preprocessing and why it could make sense to externalize this step in your data pipeline optimizations.