Locayta Search - Technical Overview

>> Locayta Search feature-set

It is not an exaggeration to say that site search on many websites can be a poor experience for the user.  There are a number of well-known problems associated with search engines, which include the following:

The user's lexicon differs from the data
The search terms used by a user in their search query need to match the terms used in the product index to get good results. So for example, a search for jumper might not find cardigan, or a search for bouquet might not find flowers.

Locayta is different from most search engines in that it is a statistical model that uses formal mathematical definitions to index data. So whilst Locayta doesn't pretend to understand the actual meaning of the words, it does understand the statistical importance of the words being used in the search query.

This means that the user doesn't need to know how to structure their search query, as Locayta works out which terms in the query are statistically important.

Boolean dilemma (AND vs. OR)
When someone uses 2 or more words in their search query on the site, the in-site search engine connects the words together to run the search using Boolean connectors such as “AND” or “OR.”

The problem is that if you search for “olive oil”, for example, and the Boolean connector is set to “AND”, you may only find products that contain both “olive” and “oil.” You may not find “extra virgin oil”, yet extra virgin oil is olive oil.

To try to overcome this problem, most websites set the Boolean connector to “OR.” But with “OR”, a search for “olive oil” will find anything with “olive” and anything with “oil.”

This is the dilemma: if the operator is set to “AND” you will get too few results and miss things, but if it is set to “OR” you will get too many results, many of which aren't strictly relevant to what you have searched for.
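The dilemma can be seen on a toy catalogue. The sketch below is illustrative only: the product names and the `boolean_search` helper are invented for this example and are not part of Locayta.

```python
# Hypothetical catalogue; the product names are invented for illustration.
PRODUCTS = [
    "olive oil 500ml",
    "extra virgin oil 250ml",
    "green olive tapenade",
    "engine oil 1l",
    "sunflower seeds",
]


def tokens(text):
    """Split text into a set of lower-case words."""
    return set(text.lower().split())


def boolean_search(query, products, operator):
    """Return products containing every query term (AND) or any query term (OR)."""
    terms = tokens(query)
    combine = all if operator == "AND" else any
    return [p for p in products if combine(t in tokens(p) for t in terms)]


# AND is too strict: "extra virgin oil 250ml" is missed.
and_hits = boolean_search("olive oil", PRODUCTS, "AND")

# OR is too loose: "engine oil 1l" and "green olive tapenade" are included.
or_hits = boolean_search("olive oil", PRODUCTS, "OR")
```

With “AND” only one product survives; with “OR” everything except the sunflower seeds is returned, relevant or not.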

Spelling mistakes and typos 
Another problem is misspelling, which is very frequent: as many as 50% of search queries may be misspelled. A single missing character, an extraneous character or a pair of transposed characters may produce zero search results.

Research on eCommerce sites suggests that a lot of misspelling is actually miskeying. In other words, the user knows how to spell the product name but has simply miskeyed the word.

Traditionally there are two approaches to the misspelling problem.
One is to take the same approach as some of the Internet search engines, which is to look at how users who have previously misspelled an item corrected the spelling. This is fine for an Internet search engine because the sample set of users is so great, but this retrospective approach doesn't really work for individual sites with much smaller sample sets.

Another approach is to use a dictionary, which requires creating and maintaining a dictionary specifically for the site. The problem with this approach, apart from the maintenance headache, is that you have to assume that the first few characters of the misspelled word are correct and that any misspelling occurs further along the word (note: if you don’t make this assumption, then any misspelled 5-letter word could actually be any 5-letter word in the entire dictionary).

However, if we know that a lot of misspelling is actually miskeying, then the dictionary approach won’t help as miskeying often occurs with the first strike of the keyboard, so you can’t assume that the first few characters are actually correct.

To solve the misspelling problem, Locayta uses two algorithms (trigram analysis and Levenshtein edit-distance). Trigram analysis breaks the misspelled word into blocks of characters and tries to work out how to correct it in relation to what it knows about the words in the index, using edit-distance as a measure of how misspelled the word is. Combined, the two algorithms provide a dynamic spell-correction capability.
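One way the two techniques can be combined is sketched below. This is not Locayta's implementation: the function names, the padding scheme, the Jaccard similarity measure and both thresholds are assumptions chosen for illustration. Trigram overlap shortlists index words that share character blocks with the query word, and Levenshtein edit-distance then picks the closest candidate. Note that nothing here assumes the first characters are correct.

```python
def trigrams(word):
    """Return the set of 3-character blocks in a word (padded at both ends)."""
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard overlap between the trigram sets of two words (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def suggest(word, vocabulary, min_similarity=0.2, max_distance=2):
    """Shortlist by trigram overlap, then return the candidate with the lowest edit-distance."""
    shortlist = [v for v in vocabulary if trigram_similarity(word, v) >= min_similarity]
    scored = [(levenshtein(word, v), v) for v in shortlist]
    scored = [pair for pair in scored if pair[0] <= max_distance]
    return min(scored)[1] if scored else None


# Hypothetical index vocabulary.
VOCABULARY = ["cardigan", "jumper", "bouquet", "flowers"]

first_key_wrong = suggest("kumper", VOCABULARY)   # miskeyed first character
transposed = suggest("flowres", VOCABULARY)       # two characters swapped
no_match = suggest("xyz", VOCABULARY)             # nothing similar in the index
```

Because the shortlist comes from the words already in the index, no separate dictionary has to be maintained in this sketch.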

Naïve search results 
A major issue for most search engines is how they calculate the relevancy of search results. The traditional approach is to use word-frequency (i.e. how many times the words in the search query appear in the data being searched) and word-proximity (i.e. how close together those words appear in the data being searched).

However, there is no reason why word-frequency or word-repetition should equate to relevancy, because neither takes into account the context in which the data is being searched.

For example, a search for “black cocktail dress” might find “wedding dress” if that item has “dress” repeated many times, or it might find “black cocktail bag” because “black” and “cocktail” appear close together.

Locayta’s statistical model largely solves the relevancy problem. When Locayta indexes data, it uses information about the distribution of terms within the data-set to create a weighting system that automatically applies weights to the ranking of search results. So Locayta will return an item as the most relevant when it contains a high density of terms that are considered uncommon in the data-set as a whole.
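The idea of weighting terms by how uncommon they are can be sketched with a standard inverse-document-frequency weight. Locayta's actual model is not described here, so the formula, the catalogue and the function names below are assumptions for illustration only; in particular, this simplified score counts each matching term once and ignores density.

```python
import math
from collections import Counter

# Hypothetical catalogue; one item deliberately repeats "dress".
CATALOGUE = [
    "white wedding dress dress dress",
    "black cocktail bag",
    "black cocktail dress",
    "red summer dress",
    "black leather bag",
]


def term_weights(documents):
    """Weight each term by log(N / number of documents containing it)."""
    n = len(documents)
    document_frequency = Counter(t for d in documents for t in set(d.split()))
    return {t: math.log(n / count) for t, count in document_frequency.items()}


def rank(query, documents):
    """Order documents by the summed weight of the distinct query terms they contain."""
    weights = term_weights(documents)
    query_terms = set(query.split())

    def score(document):
        present = set(document.split())
        return sum(weights.get(t, 0.0) for t in query_terms if t in present)

    return sorted(documents, key=score, reverse=True)


results = rank("black cocktail dress", CATALOGUE)
```

Here “cocktail” appears in fewer items than “black” or “dress”, so it carries more weight, and repeating “dress” in the wedding dress item earns it nothing extra.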

Field weightings
Another issue is that many search engines apply the same importance to the product description, the product name and the product category, whereas product category is often more important than product name, and product name is often more important than product description. This is why many in-site search engines deliver search results that appear out of context, even though they may be technically correct.

Locayta's many years of experience have shown that the relevancy of search results can be significantly improved by using field weightings. Weights can be applied to the different fields within the product data indexed by Locayta. Sometimes it is also useful to include information about the navigational position of the product in the website, within the product index.
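A minimal sketch of field weighting follows. The field names, the weight values and the `field_score` helper are hypothetical, chosen only to show the effect; they are not Locayta's defaults or API. There is no stemming in this sketch, so the category “dresses” does not match the query term “dress”.

```python
# Hypothetical weights: category counts most, description least.
FIELD_WEIGHTS = {"category": 3.0, "name": 2.0, "description": 1.0}
EQUAL_WEIGHTS = {"category": 1.0, "name": 1.0, "description": 1.0}

BAG = {
    "category": "bags",
    "name": "cocktail bag",
    "description": "goes with a black cocktail dress",
}
DRESS = {
    "category": "dresses",
    "name": "black cocktail dress",
    "description": "knee-length",
}


def field_score(query, product, weights=FIELD_WEIGHTS):
    """Sum, over fields, the field weight times the number of query terms found there."""
    terms = set(query.lower().split())
    total = 0.0
    for field, weight in weights.items():
        field_terms = set(product.get(field, "").lower().split())
        total += weight * len(terms & field_terms)
    return total


query = "black cocktail dress"

# With equal weights the bag wins on its description alone.
unweighted = (field_score(query, BAG, EQUAL_WEIGHTS), field_score(query, DRESS, EQUAL_WEIGHTS))

# With field weights the match in the product name outranks it.
weighted = (field_score(query, BAG), field_score(query, DRESS))
```

With equal weights the bag scores 4.0 against 3.0 for the dress; with the weights above the dress scores 6.0 against 5.0.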

Optimisation and Control 
A frequent complaint about many search engines is that, once implemented, it can be hard to adjust and optimise the way the search engine works, especially if you want to achieve a certain order of search results, for example to promote certain products.

To address this concern, Locayta have exposed most of the configurable elements of the search engine within a control panel. This allows the client to configure items such as:

  • Field weightings & Balance factor items.
  • Use of synonyms, substitutions and stop words.
  • Which matching algorithm to use.
  • What Guided Navigation options to use.
  • The number of search results per page.
  • Sortation options and filtering options.
  • Relevancy cut-off threshold.
  • Field options (incl. sort and spell-correction) at index time.
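As an illustration of two of the items above, the sketch below shows how stop words and synonyms might be applied to a query before matching. The settings, their format and the `expand_query` helper are hypothetical; the control panel's actual options and behaviour are not described here.

```python
# Hypothetical settings, as a site owner might configure them.
STOP_WORDS = {"a", "the", "for", "of"}
SYNONYMS = {
    "jumper": {"cardigan", "sweater"},
    "bouquet": {"flowers"},
}


def expand_query(query, stop_words=STOP_WORDS, synonyms=SYNONYMS):
    """Drop stop words, then expand each remaining term into a set of alternatives."""
    expanded = []
    for term in query.lower().split():
        if term in stop_words:
            continue
        expanded.append({term} | synonyms.get(term, set()))
    return expanded


alternatives = expand_query("a jumper for the winter")
```

Each set holds the terms that may stand in for one query word, so a search for jumper can also reach items indexed as cardigan.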


