As part of a recent assignment, we were involved in implementing an Endeca-based search solution for an African telecom giant. The design phase of the project was already complete, and we were asked to take over the implementation phase. During the implementation we encountered a few issues that should have been handled during design, before development started.
- Delta Indexing Strategy: Other than keyword search itself, one of the most frequent activities on a search engine is delta indexing. If you have not clearly thought through your delta indexing strategy before you exit the design phase, the implementation is bound to run into problems. In our case the search engine was crawling the live site and had no concept of a delta change. Even a small spelling change on the site was reflected in the search engine only after the web site was re-crawled and re-indexed. Live site crawling is an expensive operation and, in some cases, can take a while to complete. Instead of crawling the live site, a better approach would have been to have the underlying CMS push the changed data to the search engine. Besides providing superior performance, this would also have simplified the design.
- Availability: This is an extension of the delta indexing issue. Most search engines on the market have some downtime while a full index is being built, so this has to be taken into account when planning for availability. Using delta indexing instead of full re-indexing minimizes the search engine's downtime. However, a full index update is sometimes unavoidable. In such cases, the Endeca heartbeat endpoints should be configured in a load balancer so that failover happens automatically while full indexing is in progress.
- Search Relevance Algorithms: What classifies a search implementation as a success or a failure is the quality, or relevance, of the search results. Of course, all search engines return results, but the successful ones return the results that get clicked on. That is the real measure of success.
It is very important that the search rules are validated against actual production data and not against dummy data. In our case, we noticed that while the search engine rules worked properly on the dummy and test data, they failed miserably when live data was fed into the system. The relevancy-ranking configuration had not been thought through, and this caused non-relevant records to appear at the top of the search results page.
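One way to make "validate against production data" concrete is to replay real queries and check how high the record users actually clicked on appears in the results. The sketch below is only an illustration under stated assumptions, not part of the project described here: it uses mean reciprocal rank as the score, and the `search` callable, the query strings, and the record IDs are all hypothetical stand-ins for a real search client and real query logs.

```python
from typing import Callable, Sequence, Tuple


def mean_reciprocal_rank(
    judged: Sequence[Tuple[str, str]],
    search: Callable[[str], Sequence[str]],
    top_k: int = 10,
) -> float:
    """Average of 1/rank of the expected record over all judged queries.

    `judged` holds (query, expected_record_id) pairs, e.g. taken from
    production click logs. A query whose expected record is missing from
    the top_k results contributes 0.
    """
    if not judged:
        return 0.0
    total = 0.0
    for query, expected_id in judged:
        results = list(search(query))[:top_k]
        if expected_id in results:
            total += 1.0 / (results.index(expected_id) + 1)
    return total / len(judged)


# Hypothetical stand-in for the real search engine client.
fake_index = {
    "sim card": ["p1", "p2"],       # expected record at rank 1
    "data bundle": ["p3", "p7"],    # expected record at rank 2
    "roaming": ["p4"],              # expected record not returned
}


def fake_search(query: str) -> Sequence[str]:
    return fake_index.get(query, [])


# Hypothetical (query, clicked record) pairs from production logs.
judged_queries = [("sim card", "p1"), ("data bundle", "p7"), ("roaming", "p9")]

score = mean_reciprocal_rank(judged_queries, fake_search)
print(f"MRR: {score:.2f}")
```

Running a check like this against live data before and after each relevancy-ranking change would show whether relevant records are moving up or down the results page, instead of discovering it after go-live.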
Although these are specific instances from one implementation and may not apply to every project, they all point to one common root cause: the design was not thought through.
I have always used ‘Measure twice, cut once’ as my design principle to avoid surprises during the implementation phase.