A bit about our search technology

Locayta is a full implementation of the probabilistic model as developed by Dr. Porter and his colleagues at Cambridge University, and utilises the 'magic formula' as shown below. It incorporates a number of new developments in information retrieval, notably the new Rezolve™ algorithms and, very importantly, distributed searching.

The probabilistic model is based on a formal mathematical definition that has been proven to provide the best retrieval performance in the general case.

In concept, it uses information about the distribution of terms within the data-set to apply weights to the user's search query, with more important terms carrying higher weights. From this weight, a score may be computed for each item in the data-set and a list, ranked by relevance, presented back to the user.
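The idea in the paragraph above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Locayta's implementation: the function names `term_weights` and `rank`, the toy documents, and the smoothed rarity weight are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def term_weights(query_terms, docs):
    """Weight each query term by how rare it is across the data-set."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    # Assumed weighting: a smoothed rarity measure that is always positive.
    return {
        t: math.log((n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5) + 1.0)
        for t in query_terms
    }


def rank(query_terms, docs):
    """Score every document and return (doc_id, score), best first."""
    weights = term_weights(query_terms, docs)
    scored = []
    for doc_id, doc in enumerate(docs):
        present = set(doc)
        score = sum(w for t, w in weights.items() if t in present)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties are broken by the lower document id.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the quantum cat paradox".split(),
]
weights = term_weights(["quantum", "cat"], docs)
ranking = rank(["quantum", "cat"], docs)
```

In this toy data-set "quantum" occurs in one document and "cat" in all three, so "quantum" carries the higher weight and the third document is ranked first.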

Magic Formula

The weighting scheme has a number of variants, but as shown here it forms the basis of what is called BM25 weighting, which, in terms of retrieval excellence, is essentially "state of the art".
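For readers without access to the formula image, the commonly published form of BM25 is given here. It is the standard textbook statement and may differ in detail from the variant pictured on the original page. The symbols are assumptions introduced for this note: N is the number of records in the data-set, n_t the number of records containing term t, f_{t,D} the frequency of t in record D, |D| the length of D, avgdl the average record length, and k_1 and b tuning constants.

```latex
\[
\operatorname{score}(D, Q) \;=\; \sum_{t \in Q} w_t \cdot
  \frac{(k_1 + 1)\, f_{t,D}}
       {k_1 \left( (1 - b) + b \, \dfrac{|D|}{\mathrm{avgdl}} \right) + f_{t,D}},
\qquad
w_t \;=\; \log \frac{N - n_t + 0.5}{n_t + 0.5}
\]
```

The first factor, w_t, is the term weight described earlier; the second factor dampens the effect of repeated occurrences of a term and normalises for record length.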

The skill in implementing an efficient search with the formula is to keep to a bare minimum the number of times its component parts are evaluated for each term against each potentially relevant record in the data-set (of which there could be billions).
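One widely used way of keeping those evaluations down is an inverted index, sketched below. This is a generic illustration and not a description of how Locayta does it: `build_index`, `search`, and the simplified scoring expression are hypothetical. The point it shows is that each term weight is computed once per term, and only records that actually contain a query term are ever scored.

```python
import math
from collections import defaultdict


def build_index(docs):
    """Map each term to its postings: {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term in text.split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, n_docs, query_terms, top_k=10):
    """Return (ranked results, number of per-record evaluations)."""
    scores = defaultdict(float)
    evaluations = 0
    for term in set(query_terms):
        postings = index.get(term)
        if not postings:
            continue  # term absent from the data-set: nothing to evaluate
        n_t = len(postings)
        # Term weight evaluated once per term, not once per record.
        weight = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
        for doc_id, tf in postings.items():
            # Simplified saturation of term frequency (an assumed form).
            scores[doc_id] += weight * tf / (tf + 1.0)
            evaluations += 1
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k], evaluations


docs = ["apple banana", "banana cherry", "cherry date", "apple apple date"]
index = build_index(docs)
results, evaluations = search(index, len(docs), ["apple"])
```

Here only the two records containing "apple" are touched, however large the rest of the data-set is.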

This has been achieved within Locayta by two new breakthrough developments: the new Rezolve algorithms and distributed searching.

Rezolve algorithms

Retrieval speed is critical for any search engine, and great effort has been put into optimising the processing of queries within Locayta. Locayta draws no distinction between a Boolean and a probabilistic query: it can interpret a query either way by transforming it into a tree structure of optimal shape for processing.
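A minimal sketch of how one tree structure can serve both kinds of query is shown below. It is an assumption-laden illustration, not the Rezolve algorithms: the node classes `Term`, `And` and `Or`, the `evaluate` function, and the toy index are hypothetical, and no attempt is made to optimise the shape of the tree.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Term:
    name: str
    weight: float = 1.0


@dataclass(frozen=True)
class And:
    children: Tuple["Node", ...]


@dataclass(frozen=True)
class Or:
    children: Tuple["Node", ...]


Node = Union[Term, And, Or]


def evaluate(node, index):
    """Return {doc_id: score} for the records matched by this node."""
    if isinstance(node, Term):
        return {d: node.weight for d in index.get(node.name, set())}
    results = [evaluate(child, index) for child in node.children]
    if isinstance(node, And):
        # Boolean reading: a record must match every child.
        ids = set(results[0]).intersection(*results[1:]) if results else set()
    else:
        # Probabilistic reading: any child may match; weights accumulate.
        ids = set().union(*results)
    return {d: sum(r.get(d, 0.0) for r in results) for d in ids}


index = {"search": {1, 2, 3}, "engine": {2, 3}, "fast": {3}}

boolean_query = And((Term("search"), Term("engine")))
weighted_query = Or((Term("search", 0.2), Term("engine", 1.1), Term("fast", 2.5)))

boolean_hits = evaluate(boolean_query, index)
weighted_hits = evaluate(weighted_query, index)
ranked_ids = sorted(weighted_hits, key=lambda d: -weighted_hits[d])
```

The same `evaluate` routine handles both queries; only the node types at the top of the tree differ.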

Locayta has solved the problem of running big multi-term probabilistic queries over huge IR systems. This is a particular advantage that Locayta has over Google. Large-scale query expansion has an excellent practical implementation in Locayta, where large probabilistic queries can be processed with unusual efficiency; there is no theoretical upper limit to the size of a query that can be used in the probabilistic model. This gives Locayta a flexibility and a retrieval precision that mark it out for special consideration.

Distributed searching

Locayta was designed to support distributed searching, and the implications of this capability are immense. Currently, all search technologies require an index to be created that can be searched against at the time of query. However, that index is often minutes out of date at the very best and months out of date at worst. This is because the web-crawler physically needs to re-visit each and every web-site, server or database being indexed in order to create a pointer or reference to each item that resides on that machine.

This means that an item may be posted, amended or deleted on a server minutes before a search query is entered into the search engine, and irrespective of how relevant or important that is to the query, it will not be retrieved because the retrieval engine is not aware of its presence or changed status.

Similarly, many eCommerce, Portal and content sites now use ASP technology to serve content to the site from a content management system or some other database system. Effectively, this data is hidden within the system and can't be indexed by normal spidering technology.

Locayta addresses these problems through its revolutionary distributed search capabilities.

In Locayta's distributed searching, an Information Retrieval (IR) system is divided into a number of sub-IR systems. The system distributes a search query among a collection of processors, each of which interrogates one or a small number of the sub-IR systems. The match results from each processor are then sent to a master processor, which amalgamates them into a single set of matches that is sent back to the user who submitted the query.
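The scatter-and-gather pattern described above can be sketched as follows. This is a simplified illustration using threads on one machine, not Locayta's distributed architecture: `search_shard`, `distributed_search`, and the term-count score are hypothetical. The sketch also assumes that scores from different sub-IR systems are directly comparable, which a real weighting scheme would need to ensure, for example by sharing collection statistics.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, query_terms, top_k):
    """Score one sub-IR system and return its best (score, doc_id) pairs."""
    hits = []
    for doc_id, text in shard.items():
        words = text.split()
        score = sum(words.count(t) for t in query_terms)
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(top_k, hits)


def distributed_search(shards, query_terms, top_k=3):
    """Send the query to every shard in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(
            pool.map(lambda s: search_shard(s, query_terms, top_k), shards)
        )
    # The "master processor" step: amalgamate per-shard results.
    merged = heapq.nlargest(top_k, (hit for part in partials for hit in part))
    return [(doc_id, score) for score, doc_id in merged]


shards = [
    {"a1": "red apple", "a2": "green pear"},
    {"b1": "apple apple pie", "b2": "banana"},
]
results = distributed_search(shards, ["apple"], top_k=2)
```

Each shard only needs to return its own top results, so the amount of data sent to the master grows with the number of shards rather than with the size of the data-set.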

Locayta's distributed search capabilities mean that when a search query is entered into the search engine, all web-sites, servers or databases are simultaneously interrogated in real-time with no degradation in speed of retrieval.

Locayta has modelled this capability to an infinite number of machines and has found no limit. In addition, remarkably, there has been no degradation to retrieval speed.

Locayta was designed from the outset specifically around the needs of information retrieval using natural language concepts. The original development was conducted by Dr. Martin Porter, a world-renowned expert in the field of probabilistic IR, whilst researching information retrieval at the Computer Laboratory at Cambridge with Keith van Rijsbergen and Stephen Robertson.

The project was primarily concerned with how to find the best index terms for documents from the document contents, but it expanded beyond this to look at methods of implementation, particularly of the probabilistic model of IR, which was very new at the time. Dr. Porter finished this research and left Cambridge University in 1994 to concentrate on the development of Muscat (a forerunner of Locayta) as a commercial product.

Dr. Porter (1999) and Professor Stephen Robertson (2001) are both past winners of the Strix Award, which is given in recognition of outstanding practical innovation or achievement in the field of information retrieval.

Bayes and Shannon

Some search technologies claim to utilise Bayesian Inference and Shannon's information theory within their search algorithms. The connection between information theory and the probabilistic model has been recognised for a long time, but it is interesting to note that the probabilistic model does not use any of Shannon's work, and derives entirely from the theory of probability.

Similarly, whilst Bayes' theorem was certainly important in helping to develop the 'Magic Formula' that underpins the probabilistic model (essentially it helps to turn probabilities inside out: given the probability of an outcome under each possible initial event, you can work out the probability of each initial event once the outcome has been observed), Bayes' theorem is not actually used in the retrieval algorithms themselves.

The connection between Bayesian Inference and the probabilistic model can be seen in the derivation of the 'magic formula' (see below) which invokes Bayes' theorem at one significant point.
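As a reminder of what that step looks like, Bayes' theorem is stated here in the notation usually used for the probabilistic model. The symbols are assumptions introduced for this note: R is the event that a record is relevant to the query, R-bar that it is not, and D is the description of the record (the terms it contains).

```latex
\[
P(R \mid D) \;=\; \frac{P(D \mid R)\, P(R)}{P(D)},
\qquad
\frac{P(R \mid D)}{P(\bar{R} \mid D)}
  \;=\; \frac{P(D \mid R)}{P(D \mid \bar{R})} \cdot \frac{P(R)}{P(\bar{R})}
\]
```

The second form, obtained by dividing the theorem for R by the theorem for R-bar, is convenient for ranking because P(D) cancels and the prior odds are the same for every record, leaving only the likelihood ratio to be estimated from term statistics.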

Further Reading

  • C.J. van Rijsbergen, Information Retrieval (Second Edition), Butterworths, London, 1979. ISBN 0-408-70929-4.
  • Peter Willett (Ed.), Document Retrieval Systems, Taylor Graham, London, 1988. ISBN 0-947568-21-2.
  • Karen Sparck Jones and Peter Willett (Eds.), Readings in Information Retrieval, Morgan Kaufmann, San Francisco, 1997. ISBN 1-55860-454-5.