Choosing a Search Engine Database

A search engine database indexes data in addition to storing it, so users are able to search collections of data using keywords. There are several search engine database options available, so let’s look at some common features that can influence the decision.

Apache Lucene indexing

Apache Lucene is an open source search library written in Java with cross-platform support. It is one of the most widely used search libraries and underlies both Apache Solr and Elasticsearch, thanks to its high performance and features such as full text search, incremental and batch indexing, low RAM requirements, sorting, multiple indexes with merged results, joins and grouping, pluggable ranking models, and a configurable storage engine. It also supports ranked and fielded searching, as well as different types of queries, such as phrase, wildcard, range, and proximity.
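
The structure behind this kind of keyword search is an inverted index, which maps each term to the documents that contain it. The following is a minimal Python sketch of that idea, not Lucene's actual implementation; the class name TinyIndex and its methods are hypothetical, and the example only illustrates incremental indexing and a simple query where every term must match.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: term -> set of document ids (illustrative only)."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Incremental indexing: each new document just extends the postings.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[term].add(doc_id)

    def search(self, query):
        # Return ids of documents containing every query term (AND semantics).
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = TinyIndex()
index.add(1, "Lucene is a search library written in Java")
index.add(2, "Solr and Elasticsearch are built on Lucene")
index.add(3, "Tika extracts text and metadata")

print(sorted(index.search("lucene")))         # [1, 2]
print(sorted(index.search("lucene search")))  # [1]
```

A real engine adds ranking, term positions, and on-disk segment files on top of this basic mapping.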

Document formats

Most search engines support parsing and indexing documents in commonly used formats, such as JSON, XML, and CSV. For other formats, most search engines integrate with Apache Tika, a software toolkit that extracts text and metadata from over a thousand different file types.

Full text search

Some search engine databases search only a predefined title, metadata, or abstract of a document instead of the complete document text. A full text search can be refined by using keywords, specifying fields or phrases, or doing a proximity search that matches words appearing close to the searched terms.
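
To make the proximity idea concrete, here is a small Python sketch that checks whether two terms appear within a given number of word positions of each other. The function names and sample documents are made up for illustration, and real engines define and score proximity in more elaborate ways.

```python
import re


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def within_proximity(text, term_a, term_b, max_distance):
    """True if term_a and term_b occur at most max_distance word positions apart."""
    tokens = tokenize(text)
    positions_a = [i for i, t in enumerate(tokens) if t == term_a]
    positions_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(a - b) <= max_distance for a in positions_a for b in positions_b)


docs = {
    "d1": "The search engine indexes every document",
    "d2": "Search is slow when the storage engine is overloaded",
}

# Both documents contain "search" and "engine", but only d1 has them adjacent.
hits = [doc_id for doc_id, text in docs.items()
        if within_proximity(text, "search", "engine", 1)]
print(hits)  # ['d1']
```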

Near real-time search

Near real-time (NRT) search makes updates to a document searchable very shortly after the update completes, typically within milliseconds to about a second depending on configuration. NRT is transparent to users, and input/output (I/O) overhead is reduced by caching updates in RAM rather than syncing each update to disk.
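
The trade-off can be sketched with a toy Python model in which updates land in memory, a cheap refresh makes them searchable, and a less frequent commit stands in for the expensive disk sync. The class NrtIndex and its method names are hypothetical and do not correspond to any particular engine's API.

```python
class NrtIndex:
    """Toy model of near real-time search (hypothetical, not a real engine API)."""

    def __init__(self):
        self.buffer = []        # updates accepted but not yet searchable
        self.searchable = []    # in-RAM documents visible to searches
        self.disk_syncs = 0     # counts the expensive durable writes

    def add(self, text):
        # Accepting an update touches only memory.
        self.buffer.append(text)

    def refresh(self):
        # Cheap and frequent: expose buffered updates to searches, no disk sync.
        self.searchable.extend(self.buffer)
        self.buffer.clear()

    def commit(self):
        # Expensive and infrequent: one sync covers many updates.
        self.refresh()
        self.disk_syncs += 1

    def search(self, term):
        return [doc for doc in self.searchable if term in doc.lower().split()]


index = NrtIndex()
index.add("first update")
index.add("second update")
before = index.search("update")   # nothing visible until a refresh
index.refresh()
after = index.search("update")    # both updates visible, still no disk sync
index.commit()
print(before, after, index.disk_syncs)
```

Running refresh often and commit rarely is what keeps search latency low without paying for a disk sync on every update.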

Schema API

Apache Solr, which is built on Lucene, provides the Schema API for reading and modifying a collection's schema over HTTP, using a URL of the form http://<host>:<port>/solr/<collection_name>/schema. Using the Schema API, fields in a collection may be added, updated, retrieved, and deleted. The output formats supported are JSON and XML.
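
As a sketch of what a Schema API call might look like against Apache Solr, the following Python builds (but does not send) a request that adds a field. The host, port, collection name, and field definition are placeholder assumptions, including the text_general field type, which must exist in the target schema. The /schema path suffix and the add-field command come from Solr's Schema API.

```python
import json
from urllib.request import Request


def build_add_field_request(host, port, collection, field):
    """Build (without sending) a Schema API request that adds one field."""
    url = f"http://{host}:{port}/solr/{collection}/schema"
    body = json.dumps({"add-field": field}).encode("utf-8")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/json"})


# Placeholder values: adjust to a real deployment before sending.
request = build_add_field_request(
    "localhost", 8983, "articles",
    {"name": "summary", "type": "text_general", "stored": True},
)
print(request.full_url)
print(request.get_method())
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would submit it to a running server.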

Stemming

Stemming reduces inflected words to their stem so that a search for a term also returns documents containing its inflected forms. For example, consider the following three statements taken from three different documents:

  • After much testing the application has been developed.
  • The application is in development.
  • It requires a lot of testing to develop an application.

If the three documents are indexed, a search for “develop” returns only the third document when stemming is not supported. With stemming, the documents containing the inflections “development” and “developed” are returned as well.
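
The three statements above can be run through a deliberately crude suffix-stripping stemmer to show the difference. This Python sketch uses a toy rule set chosen for this example, not the Porter algorithm or any engine's actual stemmer, and the function names are hypothetical.

```python
import re

SUFFIXES = ("ment", "ing", "ed", "s")  # toy rule set, far cruder than Porter


def stem(word):
    """Strip one known suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems(text):
    return {stem(token) for token in re.findall(r"[a-z]+", text.lower())}


documents = [
    "After much testing the application has been developed.",
    "The application is in development.",
    "It requires a lot of testing to develop an application.",
]

query = "develop"
# Without stemming, only an exact token match counts.
exact = [i for i, d in enumerate(documents, 1)
         if query in re.findall(r"[a-z]+", d.lower())]
# With stemming, the query and the document tokens are compared by stem.
stemmed = [i for i, d in enumerate(documents, 1) if stem(query) in stems(d)]
print(exact)    # [3]
print(stemmed)  # [1, 2, 3]
```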

Distributed, scalable, and highly available

Most search engines are distributed, which means they are deployed on a cluster of machines, making the search engine fault-tolerant. You also want your search engine database to be scalable so that the cluster size may be increased or decreased as needed. Sharding partitions the indexed data across the cluster, and the shards themselves are replicated for high availability.
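
One common way to picture sharding and replication is hash-based routing: a stable hash of the document ID picks the shard, and each shard is copied to more than one node. The Python sketch below is an assumption-laden illustration with made-up node names and a simple round-robin placement, not the routing scheme of any specific engine, and it assumes the replica count does not exceed the number of nodes.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Pick a shard from a stable hash of the document id."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def replica_nodes(shard, nodes, replicas):
    """Place copies of a shard on consecutive nodes (round-robin)."""
    return [nodes[(shard + i) % len(nodes)] for i in range(replicas)]


nodes = ["node-a", "node-b", "node-c"]
num_shards = 3
for doc_id in ["doc-1", "doc-2", "doc-3", "doc-4"]:
    shard = shard_for(doc_id, num_shards)
    print(doc_id, "-> shard", shard, "on", replica_nodes(shard, nodes, 2))
```

Because each shard lives on two nodes in this sketch, losing any single node still leaves one copy of every shard available.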

The features here are common to most search engine databases, so when deciding which is best for your use, determine what factors are most important and choose one that prioritizes your needs for indexing and searching data.
