Choosing a Search Engine Database

By Deepak Vohra - February 5, 2020

A search engine database indexes data in addition to storing it, so users are able to search collections of data using keywords. There are several search engine database options available, so let’s look at some common features that can influence the decision.

Apache Lucene indexing

Apache Lucene is an open source search library written in Java with cross-platform support. Lucene is one of the most widely used pieces of search software, and both Apache Solr and Elasticsearch are built on it, because of its high performance and features such as full text search, incremental and batch indexing, low RAM requirements, sorting, multiple indexes with merged results, joins and grouping, pluggable ranking models, and a configurable search engine. It also supports search results based on rank and field, as well as different types of queries, such as phrase, wildcard, range, and proximity.
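
To make the idea of indexing concrete, here is a minimal sketch of an inverted index, the general data structure that maps each term to the documents containing it. It is a toy illustration in plain Python, not Lucene's actual implementation, and the function names and sample documents are made up for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Lucene supports full text search",
    2: "Solr is built on Lucene",
    3: "Full text search with ranking",
}
index = build_index(docs)
print(sorted(search(index, "lucene")))
print(sorted(search(index, "full text")))
```

Because the index is keyed by term, a lookup touches only the documents that contain the term instead of scanning every document.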

Document formats

Most search engines support parsing and indexing documents in commonly used formats, such as JSON, XML, and CSV. For other document formats, most search engines integrate with Apache Tika, a software toolkit that extracts text and metadata from over a thousand different file types.

Full text search

Some search engine databases search only a predefined title, metadata, or abstract of a document instead of the complete document text, whereas full text search covers the entire text. A full text search can be refined by using keywords, specifying fields or phrases, or running a proximity search, which matches documents where the searched terms appear within a given distance of each other.
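
The difference between a phrase search and a proximity search can be sketched with token positions. The helper names and the sample sentence below are hypothetical, and the distance rule is a simplification of how real engines score proximity.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def positions(tokens, term):
    """Return every position at which the term occurs."""
    return [i for i, token in enumerate(tokens) if token == term]


def contains_phrase(text, phrase):
    """True if the words of the phrase appear consecutively and in order."""
    tokens = tokenize(text)
    words = tokenize(phrase)
    n = len(words)
    return n > 0 and any(
        tokens[i:i + n] == words for i in range(len(tokens) - n + 1)
    )


def within_proximity(text, term_a, term_b, max_distance):
    """True if the two terms occur at most max_distance positions apart."""
    tokens = tokenize(text)
    pos_a = positions(tokens, term_a.lower())
    pos_b = positions(tokens, term_b.lower())
    return any(abs(a - b) <= max_distance for a in pos_a for b in pos_b)


# Hypothetical sample text.
text = "The search engine indexes every document in the collection"
print(contains_phrase(text, "search engine"))
print(within_proximity(text, "search", "document", 4))
```

A phrase query is the strictest case, requiring adjacent terms in order, while a proximity query relaxes that to a maximum distance.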

Near real-time search

Near real-time (NRT) search makes updates to a document searchable shortly after the update completes, typically within about a second or less depending on how the engine is configured. NRT is transparent to users, and input/output (I/O) overhead is reduced by managing RAM efficiently and caching updates in RAM rather than syncing each update to disk.
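
The trade-off can be illustrated with a toy model that separates a cheap, frequent "refresh" (making buffered updates searchable) from an expensive, infrequent "commit" (syncing to disk). The class and method names are assumptions made for this sketch, and the durable store is simulated with a dictionary rather than real disk I/O.

```python
class NearRealTimeIndex:
    """Toy model of NRT search: updates are buffered in RAM first."""

    def __init__(self):
        self._buffer = []      # updates held in RAM, not yet searchable
        self._searchable = {}  # doc_id -> text visible to searches
        self._durable = {}     # stand-in for data synced to disk

    def add(self, doc_id, text):
        """Record an update in RAM without touching the durable store."""
        self._buffer.append((doc_id, text))

    def refresh(self):
        """Cheap and frequent: make buffered updates searchable."""
        for doc_id, text in self._buffer:
            self._searchable[doc_id] = text
        self._buffer.clear()

    def commit(self):
        """Expensive and infrequent: persist everything searchable."""
        self.refresh()
        self._durable = dict(self._searchable)

    def search(self, term):
        """Return ids of searchable documents containing the term."""
        term = term.lower()
        return sorted(
            doc_id
            for doc_id, text in self._searchable.items()
            if term in text.lower().split()
        )


idx = NearRealTimeIndex()
idx.add(1, "near real-time search")
print(idx.search("search"))  # nothing is searchable before a refresh
idx.refresh()
print(idx.search("search"))  # searchable now, although not yet committed
```

Running refreshes often while committing rarely is what keeps the delay short without paying the disk cost on every update.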

Schema API

Apache Solr, which is built on Lucene, provides the Schema API for reading and modifying the schema of a collection over HTTP, using a URL with syntax http://<host>:<port>/solr/<collection_name>/schema. Using the Schema API, fields in a collection may be added, replaced, retrieved, and deleted. The output formats supported are JSON and XML.
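
As an illustration, the sketch below builds, but does not send, an HTTP request that adds a field, assuming Apache Solr's `add-field` command posted as JSON to the collection's `/schema` endpoint. The host, port, collection name, and field definition are placeholders chosen for this example, and the `text_general` field type is assumed to exist in the target schema.

```python
import json
from urllib.request import Request


def build_add_field_request(host, port, collection, field):
    """Build a POST request for a Schema API add-field command."""
    url = f"http://{host}:{port}/solr/{collection}/schema"
    body = json.dumps({"add-field": field}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values: adjust to a real Solr instance before sending.
request = build_add_field_request(
    "localhost",
    8983,
    "articles",
    {"name": "title", "type": "text_general", "stored": True},
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it. That step is left out here so the sketch runs without a server.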

Stemming

Stemming reduces inflected words to their stem so that a search for a term also returns documents containing its inflected forms. For example, consider the following three statements taken from three different documents:

After much testing the application has been developed.
The application is in development.
It requires a lot of testing to develop an application.

If the three documents are indexed, a search for “develop” would return only the third document if stemming is not supported. But with stemming, the documents with the inflections “development” and “developed” would also be returned.
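
The example above can be reproduced with a deliberately naive suffix-stripping stemmer. The suffix list is an assumption made for this sketch and is far cruder than production stemmers such as the Porter algorithm.

```python
import re

# Toy suffix list, checked in this order; real stemmers use many more rules.
SUFFIXES = ("ment", "ing", "ed", "s")


def stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def search(docs, term, use_stemming):
    """Return ids of documents matching the term, with or without stemming."""
    normalize = stem if use_stemming else (lambda word: word)
    target = normalize(term.lower())
    return [
        doc_id
        for doc_id, text in docs.items()
        if target in {normalize(token) for token in tokenize(text)}
    ]


docs = {
    1: "After much testing the application has been developed.",
    2: "The application is in development.",
    3: "It requires a lot of testing to develop an application.",
}
print(search(docs, "develop", use_stemming=False))  # [3]
print(search(docs, "develop", use_stemming=True))   # [1, 2, 3]
```

Both the indexed text and the query term go through the same stemmer, which is what lets “developed” and “development” match a search for “develop”.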

Distributed, scalable, and highly available

Most search engines are distributed, which means they are deployed on a cluster of machines, making the search engine fault-tolerant. You also want your search engine database to be scalable so that the cluster size may be increased or decreased as needed. Sharding partitions the indexed data across the cluster, and the shards themselves are replicated for high availability.
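
How sharding and replication fit together can be sketched as follows. This is a simplified illustration under assumed rules (hash-based routing and round-robin replica placement), not the placement strategy of any particular search engine, and the node names are hypothetical.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Route a document to a shard using a stable hash of its id."""
    digest = hashlib.md5(str(doc_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def place_replicas(num_shards, nodes, copies):
    """Assign each shard's copies to distinct nodes, round-robin."""
    if copies > len(nodes):
        raise ValueError("cannot place more copies than there are nodes")
    return {
        shard: [nodes[(shard + i) % len(nodes)] for i in range(copies)]
        for shard in range(num_shards)
    }


# Hypothetical three-node cluster holding two copies of each shard.
nodes = ["node-a", "node-b", "node-c"]
placement = place_replicas(num_shards=3, nodes=nodes, copies=2)
for shard, hosts in placement.items():
    print(shard, hosts)
print(shard_for("doc-42", 3))
```

With two copies of each shard on different nodes, losing any single node still leaves every shard available on another node.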

The features here are common to most search engine databases, so when deciding which is best for your use, determine what factors are most important and choose one that prioritizes your needs for indexing and searching data.

About the Author

Deepak Vohra

Deepak is a Sun Certified Java Programmer and Web Component Developer, and has worked in the fields of XML, Java programming and Java EE for ten years. Deepak is the co-author of the Apress book Pro XML Development with Java Technology and was the technical reviewer for the O'Reilly book WebLogic: The Definitive Guide. Deepak was also the technical reviewer for the Course Technology PTR book Ruby Programming for the Absolute Beginner. Deepak is also the author of the Packt Publishing books JDBC 4.0 and Oracle JDeveloper for J2EE Development, Processing XML Documents with Oracle JDeveloper 11g, EJB 3.0 Database Persistence with Oracle Fusion Middleware 11g, and Java EE Development in Eclipse IDE. Deepak is a Docker Mentor and has published 5 books on Docker and Kubernetes.