Search Quality In Practice

Wilson Wong
Practical AI Coalition
28 min read · Jul 1, 2020


This article was written and originally published in October 2016. I have reviewed and refreshed the content to reflect 2020 changes to the tools used in this article. The source code and data used and mentioned in this article are available from https://github.com/wyswilson/jobsearch-engine

Good search is about ensuring that users find what they need. It can be something as simple as finding the contact details of a restaurant to make a booking, or something more life-defining such as exploring new jobs. Search engineering and search quality are as much an intellectual pursuit as a technological one.


Often, people think about building a search system purely from a technical perspective. However, this only accounts for half of what is actually required to design, build and maintain a user-centric search system. In this article, we look at search quality in practice. We will go through the steps taken to set up a non-production-ready search system for the purpose of demonstrating the iterative, empirical nature of search quality.

Different users want different things from a search system or product. Tracking, measurement, evaluation and monitoring play a big part in setting up and maintaining a search system that better understands the things that matter to the users, and uses that understanding to improve the system to better serve them. Not tracking and not using data to inform search improvement is one of the six pitfalls when it comes to building and maintaining search engines.

These are the search quality matters that we will describe in Steps 7 to 10, which are covered in the second half of this article. Before that, we begin with Steps 1 to 6 by looking at some basic engineering work required to get a search system up and running. For this exercise, we simply rely on Elasticsearch, MySQL and the Apache web server, and use languages such as PHP, Perl and/or Python as the glue to build a basic search system for job ads crawled from a free-to-post website.

Step 1 — Prepare the content

The first step is to get the content ready and available for indexing by Elasticsearch. A web crawler was assembled quickly in Perl using the LWP and JSON modules. The crawler works in two stages: scraping and downloading the HTML content, then extracting the pieces of information of interest and storing them in a database. For the purpose of this exercise, we use the free job posting site postjobfree.com. I read through the terms of use to ensure that the site does not prohibit crawlers from downloading the job ads and that they can be used for non-commercial, demo purposes. The crawler was set to visit the site and it downloaded 7,005 job postings between 4–5 October 2016. Using the extractor, we parsed the HTML and picked up five main attributes: doctitle, doctext, doccompany, doclocation and docdate. These values are stored in a MySQL table. A unique identifier is generated by hashing the full HTML file. The crawling and extraction of text from HTML pages can be easily done in Python as well. At this stage, we have some content for the next step.
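As an aside, if you prefer Python, a minimal sketch of the same two-stage idea could look like the following. This is not the Perl code from the repository; the extraction logic is a placeholder (the actual postjobfree.com markup is not reproduced here) and sqlite stands in for MySQL purely for illustration.

import hashlib
import sqlite3
import urllib.request

DB = sqlite3.connect("jobs.db")
DB.execute("""CREATE TABLE IF NOT EXISTS jobs
              (docid TEXT PRIMARY KEY, doctitle TEXT, doctext TEXT,
               doccompany TEXT, doclocation TEXT, docdate TEXT)""")

def fetch(url):
    # Stage 1: download the raw HTML of a job posting
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="ignore")

def extract(html):
    # Stage 2: pull out the five attributes of interest.
    # The parsing rules for the site are omitted here; plug in your own
    # (e.g. BeautifulSoup selectors or regular expressions).
    return {"doctitle": "...", "doctext": "...", "doccompany": "...",
            "doclocation": "...", "docdate": "2016-10-04"}

def store(url):
    html = fetch(url)
    record = extract(html)
    # Unique identifier derived by hashing the full HTML file
    docid = hashlib.md5(html.encode("utf-8")).hexdigest()
    DB.execute("INSERT OR IGNORE INTO jobs VALUES (?, ?, ?, ?, ?, ?)",
               (docid, record["doctitle"], record["doctext"], record["doccompany"],
                record["doclocation"], record["docdate"]))
    DB.commit()

# call store(job_ad_url) for each URL discovered by the crawler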

Step 2 — Set up an Elasticsearch instance

Next, we download Elasticsearch and unpack the files into a folder. The latest version that is used in this exercise is 7.8.0, released in June 2020. As I am on a Windows machine, the elasticsearch.bat in the bin folder was used to get Elasticsearch up and running. If you need an interface to query and interact with Elasticsearch, you can consider API development and testing tools such as Postman or Insomnia.
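If you just want to confirm from a script that the instance is up, hitting the root endpoint is enough. A minimal sketch in Python, assuming the requests library is installed:

import requests

# GET on the root endpoint returns cluster metadata, including the version number
info = requests.get("http://127.0.0.1:9200").json()
print(info["version"]["number"])  # e.g. 7.8.0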

There are two main parts in the JSON object for creating an index: the settings and the mappings of properties. There was previously a top-level key called type that encapsulated the properties, but it has been removed since Elasticsearch 7.0. Since Elasticsearch is running locally for this exercise, it resides at http://127.0.0.1 and the default port for the endpoints is 9200. Next, an index is created in Elasticsearch using the PUT method as shown below, with the index name jobs. As part of that, we define five properties covering the title, text, date, company and location of the job ads. The settings object is where we outline the specifics of the index such as the number of shards and replicas, which are not discussed here.

PUT /jobs
{
"settings":{
....
},
"mappings": {
"job": {
"properties": {
"docdate": {
"type": "date",
"format": "yyyy-MM-dd"
},
"doctitle": {
"type": "text"
},
"doctext": {
"type": "text"
},
"doccompany": {
"type": "keyword"
},
"doclocation": {
"type": "keyword"
}
}
}
}
}

Previously, there was only the string type; if you did not want the string to be broken down and analysed for full-text search, you would set index to not_analyzed. Now, that distinction is made at the type level with text versus keyword. For fields of type text, we can apply a set of rules called an analyzer to process the text. An analyser performs basic morphological and lexical transformations on the text to improve searchability. Things such as tokenising and stemming are done here, and a wide range of languages are supported. As for properties of type date, you can define the format that suits your data. Hence, we set doctitle and doctext to type text, and for the other fields where the strings need to be searched in their entirety, we set the type to keyword. We do that for doccompany and doclocation.
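To see what an analyser actually does to a piece of text, the _analyze API is handy. A minimal sketch in Python; the example text and the choice of the standard analyser are illustrative:

import requests

# Ask Elasticsearch how the standard analyser would tokenise a job title
resp = requests.post("http://127.0.0.1:9200/_analyze",
                     json={"analyzer": "standard", "text": "Senior Product Manager"})
print([t["token"] for t in resp.json()["tokens"]])  # ['senior', 'product', 'manager']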

Step 3 — Write a feeder to index documents

At this stage, we have an Elasticsearch instance with an empty index and a database of content. Next, a feeder is built in PHP to send content off to the index. The documents are indexed using the POST method to the endpoint http://127.0.0.1:9200/jobs/_doc. Each document requires a unique identifier, which can be provided explicitly or will be generated otherwise. There are meta-fields that are created by default and one of them is _id. This field is populated with the unique identifier if one is provided, so there is no need to create a separate field just to hold unique identifiers for each document in the index. Below is an example call to the endpoint with reference to the index jobs and an explicit identifier fb4b50b8fdf075db3ce859038d88fe7d for the document:

POST /jobs/_doc/fb4b50b8fdf075db3ce859038d88fe7d
{
"docdate": "2016-10-04",
"doctext": "Summit Management is currently seeking a highly...",
"doclocation": "New Britain, Connecticut, United States",
"doccompany": "Summit Management Corp.",
"doctitle": "Firearm Industry Sales Representative"
}

As for the feeder, the PHP script reads from the MySQL table of structured job content, organises the values into JSON as above and sends them to Elasticsearch. I also built into the feeder the ability to recognise documents that have already been indexed, so that the next run of the feeder does not attempt to push the same content to the index again. This allows incremental addition of documents to the index. A minimal sketch of such a feeder is shown below.
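The actual feeder in the repository is PHP; the Python sketch below illustrates the same idea. The sqlite source table and the HEAD-request check for already-indexed documents are assumptions made for illustration, not a copy of the original implementation.

import sqlite3
import requests

ES = "http://127.0.0.1:9200/jobs"
db = sqlite3.connect("jobs.db")

rows = db.execute("SELECT docid, doctitle, doctext, doccompany, doclocation, docdate FROM jobs")
for docid, title, text, company, location, date in rows:
    # Skip documents already in the index (HEAD on /_doc/<id> returns 200 if it exists)
    if requests.head(f"{ES}/_doc/{docid}").status_code == 200:
        continue
    doc = {"doctitle": title, "doctext": text, "doccompany": company,
           "doclocation": location, "docdate": date}
    # Index the document with an explicit identifier, as in the example above
    requests.post(f"{ES}/_doc/{docid}", json=doc).raise_for_status()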

Step 4 — Write a search endpoint to interact with Elasticsearch for keyword search

At this stage, we should have an Elasticsearch index with several thousand job ads that are searchable. Next, a PHP back-end called search.php is created to receive a number of search parameters from the search UI (which at this stage does not yet exist) and construct the JSON query for Elasticsearch. The parameters that we will look at initially are keywords and location. In this step, we explore the different ways of constructing the JSON query for keyword search in Elasticsearch. The query is sent to the endpoint http://127.0.0.1:9200/jobs/_search as shown below:

GET /_search
{
"from": 0,
"size": 5,
"query": {
"match": {
"doctitle": "Product Manager"
}
}
}

Our first iteration involves the most basic keyword search. We think about how we can structure the query with a two-word search term “Product Manager” that comes from the UI. We begin by using the match query as shown above and target the doctitle field. This returns documents that match any of the query words. In other words, the default Boolean operator OR is applied between the words in the query, which can be rewritten as below. In our corpus of 7,005 documents, both the above and the below queries returned 554 results. We set the size to 5 in order to limit the number of results per batch.

GET /_search
{
"from": 0,
"size": 5,
"query": {
"match": {
"doctitle": {
"query": "product manager",
"operator": "or"
}
}
}
}

The query above can also be expressed differently using a bool query and the should clause. There are four types of clauses for a bool query: must, should, filter and must_not. They are pretty self-explanatory. It is worth noting that in cases where a bool query does not contain must or filter, one or more should clauses must match. In other cases, the should clauses are optional and only contribute to the scoring of the individual matches for ranking.

GET /_search
{
"from": 0,
"size": 5,
"query": {
"bool": {
"should": [
{
"term": {
"doctitle": "product"
}
},
{
"term": {
"doctitle": "manager"
}
}
]
}
}
}

We know that this approach to search, especially in a vertical search context, is sub-optimal. It tends to return too many results, many of which are irrelevant. Instead of matching on just any word in the search term, we need to make sure that only documents containing all the words (or both, in the case of “Product Manager”) are returned. The JSON query is revised to use the must clause instead of should, as shown below. Alternatively, we could change the operator from OR to AND if the query were written using the match clause. The results count became 17.

GET /_search
{
"from": 0,
"size": 5,
"query": {
"bool": {
"must": [
{
"term": {
"doctitle": "product"
}
},
{
"term": {
"doctitle": "manager"
}
}
]
}
}
}

In order to find more potentially relevant results, we want to broaden the search to also include other fields. Sometimes, the desired words may not exist in the title of the document for various reasons, and we may need to also look in other fields especially the text content. This time, we combine the bool and the match queries as shown below. What the query says is that we want to look for documents that contain both the words (in any order or distance apart at this stage) in either the doctext or the doctitle field. When we ran the query below in Insomnia, the retrieval size increased to 290 from 17 when we were only searching in the title field.

GET /_search 
{
"from": 0,
"size": 5,
"query": {
"bool": {
"should": [
{
"match": {
"doctext": {
"query": "product manager",
"operator": "and"
}
}
},
{
"match": {
"doctitle": {
"query": "product manager",
"operator": "and"
}
}
}
]
}
}
}

What if we want to add another mandatory field to the lookup? For instance, say that we not only want documents containing both the words “Product” and “Manager” but they also have to be in a certain location. In this case, we cannot just append a must clause after the should clause in the JSON above. The reason is, as mentioned earlier, that with the presence of the must clause, the matches to the doctitle and doctext fields become optional. We need to instead wrap another bool query around the JSON above and group the existing bool query with a new match query that points to the doclocation field. As shown in the query below, using “New York” as an example location and the same two words “Product Manager”, the result size reduced to 12, down from 290 job ads. This indicates that there are only 12 job ads with the exact location “New York” that contain both the words “Product Manager”.

GET /_search 
{
"from": 0,
"size": 5,
"query": {
"bool": {
"must": [
{
"bool": {
"should": [
{
"match": {
"doctext": {
"query": "product manager",
"operator": "and"
}
}
},
{
"match": {
"doctitle": {
"query": "product manager",
"operator": "and"
}
}
}
]
}
},
{
"match": {
"doclocation": {
"query": "New York"
}
}
}
]
}
}
}

The inner bool query above that matches “Product Manager” against the doctitle and doctext fields can also be rewritten using the multi_match query as shown below. The most_fields type is set so that if the two words match multiple fields, the scores across those fields are added up.

GET/_search 
{
"from": 0,
"size": 5,
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "product manager",
"fields": [
"doctitle",
"doctext"
],
"type": "most_fields",
"operator": "and"
}
},
{
"match": {
"doclocation": {
"query": "New York"
}
}
}
]
}
}
}

There are times, especially with vertical search, when we want a view of how the results are distributed across certain dimensions. These dimensions are informative to the users and can be offered as facets for filtering. In the case of our corpus, assume that we need to use the doccompany field as a facet for filtering. To make this happen, we add an aggregations (or aggs) object to the JSON above to return the unique values in the designated field along with the count of documents for each. We give the aggregation the name doccompanies as shown below.

GET /_search 
{
"from": 0,
"size": 5,
"query": {
...
},
"aggs": {
"doccompanies": {
"terms": {
"field": "doccompany"
}
}
}
}

The results returned would now contain an aggregation of the doccompany field as shown in the JSON response below. These key-doc_count pairs can be used by the UI to show filtering options to the users. Note that we previously explained the purpose of the analyzer in Elasticsearch. If we had allowed the doccompany field to be analysed, that is, set its type to text (with fielddata enabled for aggregation), the buckets that come back would be groupings of the individual words in the company names instead of the complete names.

{
...
"aggregations": {
"doccompanies": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Gap Inc.",
"doc_count": 3
},
{
"key": "HSBC",
"doc_count": 2
},
{
"key": "Nigel Frank",
"doc_count": 2
},
{
"key": "Goldman Sachs & Co.",
"doc_count": 1
},
{
"key": "Oracle",
"doc_count": 1
},
{
"key": "RedHat Inc",
"doc_count": 1
},
{
"key": "Scholastic",
"doc_count": 1
},
{
"key": "The Forum Group",
"doc_count": 1
}
]
}
}
}

Step 5 — Create UI for keyword search via the endpoint

We now have a basic endpoint that receives keywords and location, constructs the corresponding query in JSON and accepts the response from Elasticsearch. The JSON response at this stage contains the individual hits and an aggregation around the doccompany field as shown below.

{
"took": 7,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 12,
"relation": "eq"
},
"max_score": 9.481624,
"hits": [
{
"_index": "jobs",
"_type": "_doc",
"_id": "b25d9d2887f77d0c051d4166537673a7",
"_score": 9.481624,
"_source": {
"docdate": "2016-10-05",
"doctext": "...",
"doclocation": "New York",
"doccompany": "Gap Inc.",
"doctitle": "Assistant Manager, Merchandising Gap...",
"type": "job"
}
},
...
]
},
"aggregations": {
"doccompanies": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Gap Inc.",
"doc_count": 3
},
...
]
}
}
}

Next, a basic search UI called index.php is created with two input boxes organised into a collapsible left panel as shown. It uses jQuery AJAX methods to interact with search.php to get search results. You could of course easily create this as a React app nowadays. The coding work for this article was done 4–5 years ago, so excuse the backwardness. What and how we code is beside the point of this article. In the index.php search UI, a query is also issued to the Elasticsearch instance to fetch the document count for display in the header, as shown below in Figure 1.

Figure 1. Search home page with two search boxes on the left panel and the document count in the header.

Upon submission, the values of the keywords and the location fields are sent to search.php, which is our search endpoint. At this stage, we extend the search endpoint to parse the JSON response from Elasticsearch. It loops through the individual hits, storing the key fields that we want to display in the UI. Similarly, the key-doc_count pairs in the aggregation are parsed and returned to the UI to be displayed as the companies facet, as shown in Figure 2 below. Notice that we also have to handle pagination since each page only shows five results. We do this by adding the page number as another parameter that we pass between the UI, the search endpoint and Elasticsearch. In the endpoint, we use the from and size attributes in the JSON query to achieve that; a sketch of this logic follows Figure 2. The screenshot below shows how the interface looks using the same “Product Manager” in “New York” search. As you can see, the results count is also reflected, which came down from 7,005 to 12 jobs.

Figure 2. (Left) The keywords and location search boxes alongside the company filter from the doccompanies aggregation. (Right) The 12 results for a search using the keywords “Product Manager” and “New York” as location when the left search input panel is collapsed.
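The endpoint work described above is straightforward. The sketch below shows, in Python rather than the article's PHP, how a page number can be turned into the from offset, how the query is assembled and how the hits and company buckets are pulled out of the response. Function and variable names are illustrative, not taken from search.php.

import requests

def search(keywords, location, page=1, page_size=5):
    # Page 1 starts at offset 0, page 2 at offset 5, and so on
    offset = (page - 1) * page_size
    query = {
        "from": offset,
        "size": page_size,
        "query": {
            "bool": {
                "must": [
                    {"multi_match": {"query": keywords,
                                     "fields": ["doctitle", "doctext"],
                                     "type": "most_fields", "operator": "and"}},
                    {"match": {"doclocation": {"query": location}}}
                ]
            }
        },
        "aggs": {"doccompanies": {"terms": {"field": "doccompany"}}}
    }
    resp = requests.post("http://127.0.0.1:9200/jobs/_search", json=query).json()
    hits = [(h["_id"], h["_score"], h["_source"]) for h in resp["hits"]["hits"]]
    facets = [(b["key"], b["doc_count"])
              for b in resp["aggregations"]["doccompanies"]["buckets"]]
    return resp["hits"]["total"]["value"], hits, facets

# total, hits, facets = search("Product Manager", "New York", page=1)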

Step 6 — Extend the UI and the search endpoint for filtering and sorting

So far, we are only using keywords and location to determine what is retrieved and how it is scored (using out-of-the-box configuration) based on keyword matching. There will be cases where certain search criteria are only used for slicing and dicing the retrieved results and do not contribute to scoring and hence ranking. The difference between retrieval and ranking is discussed in detail in Precise Retrieval For Tuning Ranking.

In this exercise, we use the companies facet selection to filter the result set. To do this, we further extend the parameters that get passed from the UI to the search endpoint. We revise the JSON object by appending a filter context after the must clause inside the bool query, as shown below. If the “Gap Inc.” option is selected by the user, it is added as the third parameter to the search criteria, in addition to “Product Manager” and “New York”, as shown below.

GET /_search
{
"from": 0,
"size": 5,
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "Product Manager",
"fields": [
"doctitle",
"doctext"
],
"type": "most_fields",
"operator": "and"
}
},
{
"match": {
"doclocation": {
"query": "New York"
}
}
}
],
"filter": [
{
"term": {
"doccompany": "Gap Inc."
}
}
]
}
},
"aggs": {
"doccompanies": {
"terms": {
"field": "doccompany"
}
}
}
}

The search UI index.php is also extended by allowing the end users to click on the companies facet in the left panel as well as the company value against each job ad in the results. Figure 3 shows those changes that are visible. We converted the company attribute into a clickable button that, upon being clicked, will perform another search but this time with the company value as the filter.

Figure 3. The companies facet has been extended to become clickable, and the company attribute against each job ad in the results is also clickable to filter the search results.

Assuming that we click on the “Gap Inc.” option, a new search is performed and the search results are refreshed as in Figure 4 below. The filtering by company has brought down the number of results from 12 to 3.

Figure 4. The results for the search using the keywords “Product Manager” and “New York” as location with the company filter “Gap Inc.” applied.

By default, so far, the results are sorted by a score computed from the out-of-the-box configuration for keyword matching. If we want to make this explicit, we can append a sort parameter to the end of the latest query shown above and sort the results using the _score meta-field, which is keyword relevance (the default). The users may also want to re-sort the results differently. In the case of our content, the users may want to see newer job ads first. If we need the results sorted by our only date field, we use the field docdate instead, as shown below.

POST /_search
{
"from": 0,
"size": 5,
"query": {
...
},
"aggs": {
....
},
"sort": {
"docdate": {
"order": "desc"
}
}
}

In the search UI, we add the option to allow users to sort by either relevance or date, as shown below in Figure 5. By now, we have five search parameters implemented from the UI all the way to the JSON query for Elasticsearch: keyword, location, company, page number and sort mode. In Figure 5 below, note that we have removed the “Gap Inc.” company filter, which brings the results count back up to 12. The results are now sorted by the posting date of the job ads.

Figure 5. The date-sorted results for the search using “Product Manager” keywords and “New York” as location with the company filter “Gap Inc.” removed.

Step 7 — Add tracking of searches and results

Many people would consider search as done by now but this could not be further from the truth. Looking at the screenshots in Figures 4 and 5, there are several quality-related concerns such as:

  • The results do not look overly relevant. Typically, there are two main reasons: either there are no available job ads matching the search criteria, or the keyword matching and sorting can be improved.
  • How do we know if the users find these results relevant or useful?
  • Do we know if users are seeing most of the results that they are meant to see?

In order to come up with answers to the questions above and more, it is important that we introduce proper tracking. Getting the right tracking in place should not be difficult. There are Javascript-based tools out there that can be used for tracking. For this exercise, a custom tracker was quickly put together. As tracking is an integral part of search quality and the things that we need to track can be very fine-grained, we need to be mindful about adopting off-the-shelf solutions. From my experience, many of the free tools out there are designed for SEO and product analytics purposes. Hence, they tend to be more suitable for aggregated metrics regarding visits, users, page views and sessions. Awareness of the different types of data and how they are used for data science or analytics needs to be front of mind when it comes to deciding on the technology or tooling, as discussed in Four Hurdles To Creating Value from Data.

Broadly speaking, there are four types of data that we need to track in this section: first, the search criteria; second, the results that are shown to the users; third, the unique identifiers for the user and the search; and lastly, the interactions with the results. Some of the tracking can be done in the back-end by the search endpoint while the rest needs to happen in the UI. At the moment, the search criteria, the results that are shown and the identifiers are all recorded (and some generated) in search.php when a search is performed. Since the interactions are front-end events, they can only come from the UI. Whenever someone clicks on a result, an event is fired. This triggers a request to a different endpoint to log it. I created another endpoint, logger.php, which is called asynchronously and accepts the document identifier, the position and type of interaction, and a timestamp; a minimal sketch of such a logger is shown below. Currently, these events are stored in another MySQL table. Obviously, like most things set up for this exercise, they are not fit for production.
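Below is a minimal Python sketch of what such a logger could look like, using Flask and sqlite in place of the PHP endpoint and MySQL table described above. It illustrates the shape of the data being captured and is not the actual logger.php.

from flask import Flask, request, jsonify
import sqlite3

app = Flask(__name__)
db = sqlite3.connect("tracking.db", check_same_thread=False)
db.execute("""CREATE TABLE IF NOT EXISTS interactions
              (docid TEXT, position INTEGER, event TEXT, ts TEXT)""")

@app.route("/log", methods=["POST"])
def log_event():
    # Called asynchronously by the UI whenever a result is interacted with
    e = request.get_json(force=True)
    db.execute("INSERT INTO interactions VALUES (?, ?, ?, ?)",
               (e["docid"], e["position"], e["event"], e["timestamp"]))
    db.commit()
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(port=8081)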

Step 8 — Introduce basic metrics to gauge success

After having the right tracking in place, we can now look at how well the search results perform when we present them to the users. As discussed at length in a previous article, Can’t We Just Talk, search is about helping users satisfy information needs. Whether we are doing that or not varies depending on the users.

Until the users tell us explicitly what they are after, it is most of the time a guessing game on our end when it comes to figuring out what success looks like for them. However, with properly tracked data and metrics, we can make more educated guesses and assumptions about success, advance our understanding of user needs and continuously improve search in an empirical way.

For the following examples, we continue to use “Product Manager” as the search term. Assume we have three users and each of them has different information needs which should manifest in their interactions with the results. The needs and circumstances are summarised below:

  • User 1 is looking for product management opportunities in the online tech space in any location.
Figure 6. Pages 1 and 2 of the search results for “Product Manager” without locations or companies criteria with default relevance sort. The user gets 290 results.
  • User 2 wants to find a new technology product management role in New York.
Figure 7. Pages 1 and 2 of the search results for “Product Manager” in “New York” with default relevance sort. The user sees 12 job ads.
  • User 3 has been searching for product management roles every day.
Figure 8. Pages 1 and 2 of the search results for “Product Manager” without locations or companies criteria, re-sorted by date so that they can see the most recently posted ones first.

The screenshots above in Figures 6, 7 and 8 show the results that the three different users would see. We use the most recent JSON query in Step 6 to perform those searches. We can consider that the base algorithm. With the tracking now in place, we have visibility over the job ads that were presented to the users and which of them were clicked on. It is worth noting at this stage that we will naively use clicks on the results as the indicator of success. This can be flawed, as previously discussed in 6 Common Pitfalls In Building And Maintaining Search Engines. In the job search scenario, a more appropriate signal could be job applications; if clicks are to be used, they have to be supplemented with additional signals such as the time users spend reading a job ad and so on.

Going back to the three screenshots above in Figures 6, 7 and 8, let us assume that User 1 visited only the first two pages, hence 10 results were presented to the user (i.e., impressions). User 1 clicked the job in position 4 (by GoDaddy.com) and the one in position 8 by Amazon. As for User 2, nothing was clicked on, which is not surprising considering none of the jobs were relevant despite being in New York. We did not register any clicks from User 3 either, since none of the jobs were product management specific. Both Users 2 and 3 paginated all the way to page 3 in the hope of finding something relevant, but to no avail. Figure 9 below summarises the impression and click data for the three users and the calculation of two metrics that we will naively use to represent success for the purpose of demonstrating the empirical nature of search improvement.

Figure 9. Summary of the result set size (nhits), impressions, clicks and metrics for the base algorithm.

The first metric, impression depth (ID), is not really a success metric. We include it to quantify how deep the user has gone in inspecting the results. If the ID metric is consistently low, it tells us that users do not generally go far down the result set and should make us question the value of returning that many results. This is especially true if many of those in the “long tail” are irrelevant. The second metric is click-through rate (CTR), not in the traditional online advertising sense. CTR captures, for a search, how many results were clicked on out of those that were “seen”. A low CTR can mean that the results we present to the users may not be overly relevant. At the same time, it can also indicate other things, which is why we need to be wary of using clicks directly in measurement; some of those concerns were discussed in the articles mentioned earlier. For Users 2 and 3, the CTR is 0%, which means none of the results that were shown to them attracted interest of any kind. As for User 1, the 20% CTR indicates that, on average, one out of every five jobs shown was clicked on. The average CTR of 6.7% is very poor.

The third metric is precision at lowest click (PLC). This metric uses the position of the lowest click as an indicator of the last result that was inspected by the user. It captures how many of the results between the first position and that lowest clicked position were clicked on (i.e., were “precise”). For the base algorithm, this metric was only slightly better than CTR but still not good.
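To make the three metrics concrete, here is a small sketch of how they can be computed for a single search from the impressions and the clicked positions. Defining impression depth as impressions divided by the size of the result set is an assumption that is consistent with how the metric is described above, not a formula quoted from the original.

def impression_depth(impressions, nhits):
    # Share of the full result set that was actually shown to the user
    return impressions / nhits if nhits else 0.0

def ctr(clicked_positions, impressions):
    # Clicks over the results that were seen
    return len(clicked_positions) / impressions if impressions else 0.0

def plc(clicked_positions):
    # Precision at lowest click: clicks over the deepest clicked position
    return len(clicked_positions) / max(clicked_positions) if clicked_positions else 0.0

# User 1 against the base algorithm: 290 hits, 10 impressions, clicks at positions 4 and 8
print(ctr([4, 8], 10))  # 0.2  -> 20% CTR
print(plc([4, 8]))      # 0.25 -> 25% PLC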

Step 9 — Improve precision of retrieval

With tracking and a very basic form of measurement in place, we are now better placed to switch our attention to the quality of the results. It is from this point onward that understanding search quality and improving search can truly happen, and building search engines graduates from being merely a technical pursuit.

Now informed by metrics, we can look at why the search results are performing poorly for Users 2 and 3. By reproducing the results in Figures 7 and 8 using the tracked data, we can very quickly figure out that these two sets of search results have far too many irrelevant results in them. Why were those documents returned in the first place? To answer this, we have to look back at the bool query that we constructed to retrieve documents based on keyword matches on the doctext and doctitle fields.

Figure 10. A snippet of text from the first job ad seen by User 3 in Figure 8. Note the distance between the closest co-occurrence of the two words “manager” and “product” that were matched on.

Clearly, this document did not come back because the two words exist in the doctitle field. When we look at the doctext field in Figure 10 above, there was one occurrence of the word “Product” and multiple occurrences of “Manager”. The closest pair was too far apart, resulting in an out-of-context match. When the user provides “Product Manager” as keywords, they clearly do not see them as just a bag of words. Our search logic, however, does, and this results in irrelevant results. This was discussed in detail in Faceted Search Needs Precise Retrieval.

The quickest way to fix this is to restrict how far apart the word matches can be. In Elasticsearch, we change the type of the multi_match query to phrase and use slop instead of the AND operator. The original JSON query is updated as shown below, with the slop parameter set to 1.

GET /_search
{
"from": 5,
"size": 5,
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "Product Manager",
"fields": [
"doctitle",
"doctext"
],
"type": "phrase",
"slop": 1
}
}
],
"filter": [
....
]
}
},
"aggs": {
....
},
"sort": {
....
}
}

We applied the change in the search endpoint and repeated the same searches. This time, instead of 12 product management jobs in New York, none was returned for User 2. As for Users 1 and 3, who searched using “Product Manager” without any location, the results dropped from 290 to 19. Overall, many of the irrelevant results that were previously shown to the users are no longer being retrieved, especially for Users 2 and 3, as shown in Figures 12 and 13 below. As for User 1, since the job ads at the top of the original result set prior to applying the slop restriction were relatively relevant anyway, the differences in the top results before and after the restriction, when the retrieval size fell from 290 to 19, are not so apparent (try comparing Figures 6 and 11). This is because irrelevant results, when sorted by relevance, are often pushed down the ranking and hence not visible to us. Depending on the content and the vertical that the search operates in, we will need to think about the right balance between being overly restrictive and potentially missing out on relevant results, and being too relaxed and bringing back irrelevant ones.

  • User 1 still finds the same two job ads relevant and clicked on them, now in positions 7 and 8, similar to the previous experience in Figure 6.
Figure 11. The search results for “Product Manager” search term without other criteria, sorted by default relevance. User 1 finds the same job ads in position 7 and 8 interesting and clicked on them.
  • User 2 now has no results as opposed to the 12 job ads previously in Figure 7 that were all irrelevant.
Figure 12. No results for the search using “Product Manager” keywords with “New York” as location after imposing slop=1 to restrict distance between words permitted during matching.
  • User 3 now sees only 19 results, similar to User 1, with the difference that all the top results are now relevant when sorted by date, unlike the previous experience in Figure 8.
Figure 13. The results for the search using the keywords “Product Manager” re-sorted by date after restricting the distance between keyword matches. User 3 now finds the job ads in Position 2 and 5 relevant and clicked on them.

To quantify the impact of the keyword match restriction, we update the metrics chart as shown in Figure 14 below. The ID metric increased, which indicates that a larger share of the results returned by the search engine was seen. This happened because our change to the algorithm cut the retrieval set by quite a lot. Having more results seen by the users is not enough; we also need the results to be clicked on. The CTR is now 20.0%, up from the base algorithm's 6.7%. Similarly, the PLC improved from 8.3% to 21.7%. Both metrics suggest that we are doing something right for the users with this update to the search engine.

Figure 14. Summary of impressions, clicks and metrics for the revised algorithm with tightening of word proximity

Step 10 — Improve recall via synonym expansion

In this last step, we look at another potential improvement, again informed by the metrics in Figure 14. First and foremost, we know that different people can sometimes refer to and express the same things differently. The search engine by default does not know this. As a result, the search may miss some documents that are relevant to the query just because they are expressed differently. We can use query or synonym expansion in this case to attempt to bridge that gap. These search improvement opportunities, especially in the context of vertical search, have been discussed in depth in Search Is Not Solved Yet.

In this final exercise, we assume that “Product Manager” and “Product Lead” refer to the same job title. These synonyms can be curated by domain experts or learned from logs. The intent is for such a data asset to be used in Elasticsearch to improve the recall of the retrieval set. In other words, we want to reduce the chances of the search engine, and hence the users, missing out on relevant job ads. In our case, when the users search for “Product Manager”, they would also get results that have the word “Lead” in them when “Manager” is absent. An illustrative snippet of such a synonym file is shown below.
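Elasticsearch's synonym filter reads the Solr synonym format, where each line lists terms that are treated as equivalent. The contents below are an example of what synonym.csv might hold, not necessarily the file in the repository:

# synonym.csv (Solr synonym format): comma-separated terms on a line are equivalent
manager, lead
product manager, product lead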

We need to make a change to the analyzer and the filter components of the index. We create a new filter of type synonym that points to the actual file containing the synonyms. The file synonym.csv is placed in the config folder in the root elasticsearch-7.8.0 folder. We call this filter jobsynonyms. As part of this JSON, a new analyser called synonym_analyzer is also created to apply the jobsynonyms filter during analysis to perform synonym expansion. We would first have to close the index before applying this change to settings and re-open after that.

POST /jobs/_close 
PUT /jobs/_settings
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym_analyzer": {
"tokenizer": "whitespace",
"filter": ["lowercase","jobsynonyms"]
}
},
"filter": {
"jobsynonyms": {
"type": "synonym",
"synonyms_path": "synonym.csv",
"updateable": true
}
}
}
}
}
}
POST /jobs/_open

In order to use synonym_analyzer on the specific fields that we want to synonym-expand on the fly, we need to change the mappings of those fields or properties. The JSON below is sent to the _mapping endpoint. We only apply the analyzer to the doctitle field, just to limit the effect of the expansion.

PUT /jobs/_mapping
{
"properties": {
"doctitle": {
"type": "text",
"analyzer": "standard",
"search_analyzer": "synonym_analyzer"
}
}
}

After the changes above to Elasticsearch have been applied, we repeated the same searches for Users 1, 2 and 3. User 2, who previously had no results after restricting the keyword matches, now sees 1 job ad returned, as shown in Figure 16. The search results for Users 1 and 3 remain relatively the same as prior to synonym expansion. This is expected, as there is an adequate number of jobs matching the original query of “Product Manager” for those results to dominate the top.

  • The search results increased to 21 but the experience for User 1 remained relatively unchanged.
Figure 15. The results for User 1 increased from 19 to 21 due to the synonym expansion. The top results however remained relatively the same, hence the user still finds the job ads in positions 7 and 8 relevant.
  • User 2 has some results after synonym expansion.
Figure 16. Instead of no results, User 2 now sees 1 job ad due to the synonym expansion.
  • The experience for User 3 was also not affected by the synonym expansion despite seeing two more results.
Figure 17. The top results sorted by date for User 3 remained relatively unchanged and the results that were clicked on previously prior to synonym expansion are still in the same positions.

Synonym expansion and the maintenance of the associated data asset is an ongoing process. As new synonyms are added, we want them to take effect immediately. Since the key updateable was set to true when we created the synonym filter, we can easily reload the search_analyzers using the query below.

POST /jobs/_reload_search_analyzers

Before we end, we investigate the effect of synonym expansion on user interaction with our results and on the metrics. Across the board, there were one or two more job ads returned, depending on whether the “New York” location criterion was applied. In the case of Users 1 and 3, the result set increased from 19 to 21 as shown in Figures 15 and 17. Synonym-expanded results are often weighted lower than original keyword matches, something that can be configured. Since the results that Users 1 and 3 found relevant and clicked on prior to synonym expansion still appear in their original positions, the search quality for these users was not impacted. The CTR and PLC metrics remained the same for these two users. As for User 2, due to the location filter, only one job was returned by the synonym expansion instead of two. Assuming that User 2 finds the product lead job ad in Figure 16 somewhat relevant, we have potentially improved recall. For more information about the concept of recall, read 8 Out of 10 (Brown) Cats.

Through synonym expansion, we have improved search quality for User 2. The overall effectiveness of the system has improved from 20.0% to 53.3% for CTR, and from 21.7% to 55.0% for PLC, as shown in Figure 18 below.

Figure 18. Summary of the metrics for the revised algorithm with synonym expansion.

Conclusion

Building and maintaining a search system or product that is set up for success goes beyond just having the right technology. In this article, we discussed the steps for setting up a basic search engine for job ads using Elasticsearch for the purpose of illustrating search quality in practice. We looked at tracking essential data on the search system to calculate basic metrics and enhance our understanding of user needs based on their searches and interactions with the results. We started off with a base algorithm and investigated two improvements. We discussed how the two changes, which were informed by metrics, subsequently moved the metrics further in the right direction, as summarised below.

Figure 19. A summary of the metrics from the base algorithm and the two improvements.

We relied a lot on clicks in this article for the sole purpose of demonstrating the importance of setting up the right foundation to improve search. As mentioned time and time again, be mindful when you are using clicks. The actual technology aside, I hope this article has shed some light on what it takes to build a strong search engine and product. Beyond the example improvement opportunities explored and implemented in this article, there are many other ways to further optimise the search experience for the users.


Wilson Wong
Practical AI Coalition

I'm a seasoned data x product leader trained in artificial intelligence. I code, write and travel for fun. https://wilsonwong.ai