Indexing in MarkLogic

MarkLogic uses a variety of indexes to resolve queries. The Universal Index, as defined in the Overview of MarkLogic Server, is a collection of indexes. This article describes the indexing model used by MarkLogic Server.

The main topics are:

1. The Universal Index

The Universal Index indexes the XML elements and JSON properties of the loaded documents. By default, MarkLogic Server builds a set of indexes optimised for query performance in common cases. To speed up specific kinds of queries, you can configure MarkLogic to index additional data. Extra indexes come at a cost: more disk space and longer document load times. As more indexes are maintained, search performance improves while document load speed falls.

The Universal Index includes the following types of indexing:

Word Indexing
Phrase Indexing
Relationship Indexing
Value Indexing
Word and Phrase Indexing
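
As a rough illustration of how these kinds of indexing can coexist in one index, here is a minimal Python sketch. It is a toy model and not MarkLogic code: the documents, the build_term_lists and lookup helpers, and the tuple-shaped term keys are all assumptions made for this example. Each term maps to the set of documents that contain it, and relationship indexing is left out for brevity.

```python
from collections import defaultdict

# Hypothetical documents: id -> {element name: text}.
docs = {
    1: {"title": "red fox", "body": "the quick red fox"},
    2: {"title": "lazy dog", "body": "the red dog sleeps"},
}

def build_term_lists(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, elements in docs.items():
        for element, text in elements.items():
            words = text.split()
            for word in words:
                index[("word", word)].add(doc_id)             # word indexing
            for first, second in zip(words, words[1:]):
                index[("phrase", first, second)].add(doc_id)  # two-word phrases
            index[("value", element, text)].add(doc_id)       # exact element value
    return index

index = build_term_lists(docs)

def lookup(*term):
    """Return the sorted document ids on one term list."""
    return sorted(index.get(term, set()))

# A query on two words is resolved by intersecting their term lists.
red_and_dog = sorted(index[("word", "red")] & index[("word", "dog")])
```

In this model, lookup("phrase", "red", "fox") finds only document 1, while the word "red" alone appears on the term list of both documents.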

2. Other Types of Indexes

MarkLogic Server also uses other kinds of indexes that are not part of the Universal Index.

Range Indexing

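The indexes discussed so far let you quickly search collections of XML and JSON documents for text, structure, and combinations of the two. Range indexes complement them by supporting comparisons over typed values, such as finding all documents whose date falls within a given range.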
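
To make the idea of a range index concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not MarkLogic's implementation: it pictures a range index as (value, document id) pairs kept sorted by value, so that a range lookup becomes two binary searches. The price data and the docs_in_range helper are hypothetical.

```python
import bisect

# Hypothetical (value, document id) pairs, e.g. one "price" value per document.
prices = [(19.99, "doc3"), (5.00, "doc1"), (12.50, "doc2"), (42.00, "doc4")]

# Picture the range index as the pairs kept sorted by value.
range_index = sorted(prices)
values = [value for value, _ in range_index]

def docs_in_range(low, high):
    """Return ids of documents whose value v satisfies low <= v <= high."""
    start = bisect.bisect_left(values, low)
    end = bisect.bisect_right(values, high)
    return [doc_id for _, doc_id in range_index[start:end]]
```

Here docs_in_range(10, 20) returns the documents priced 12.50 and 19.99. Because the pairs are already ordered, the same structure also yields results sorted by value without extra work.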

Word Lexicons

In MarkLogic Server you can create lexicons: lists of the unique words or values in the database, which make it easy to check whether a word or value occurs and how many times it appears.
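
A word lexicon can be pictured with a few lines of Python. This is only a sketch of the concept, assuming whitespace tokenization and a hypothetical list of document texts; it says nothing about how MarkLogic stores lexicons.

```python
from collections import Counter

# Hypothetical document texts.
documents = ["the red fox", "the red dog", "a lazy dog"]

# The lexicon maps each unique word to the number of times it appears.
lexicon = Counter(word for text in documents for word in text.split())

# The sorted keys give the list of unique words in the database.
unique_words = sorted(lexicon)
```

Checking whether a word exists is then a dictionary lookup (for example "fox" in lexicon), and lexicon["red"] gives its count without scanning any document.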

Reverse Indexing

All of the indexes described so far support what are known as forward queries, in which you start with a query and look for the set of documents that match it. A reverse query goes the other way: you start with a document and look for the stored queries that match it.
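
The difference in direction can be sketched in Python. In this toy model, which assumes each stored query is simply a set of words that must all be present, a reverse lookup takes one incoming document and returns the names of the stored queries it satisfies. The query names and the matching_queries helper are hypothetical, and the sketch uses a plain scan over the stored queries rather than a real index.

```python
# Hypothetical stored queries: name -> words that must all appear.
stored_queries = {
    "fox-alert": {"red", "fox"},
    "dog-alert": {"lazy", "dog"},
    "color-alert": {"red"},
}

def matching_queries(document_text):
    """Return the names of stored queries that the document satisfies."""
    words = set(document_text.split())
    return sorted(
        name for name, required in stored_queries.items() if required <= words
    )
```

A document such as "the quick red fox" satisfies "color-alert" and "fox-alert" but not "dog-alert", which is the kind of lookup an alerting application needs when new content arrives.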

Triple Index

The triple index indexes schema-valid sem:triple elements, which can appear anywhere in a document. Triples are indexed when documents containing them are ingested into MarkLogic or when the database is reindexed.

3. Index Size

Many of the index settings discussed in this article are either set automatically (with no way to turn them off) or enabled by default. Out of the box, with the default set of indexes enabled, the on-disk size is often smaller than the size of the source XML. This is because MarkLogic compresses the loaded XML, and the indexes often take up less space than the compression saves. With more indexes enabled, the index size can grow to two or three times the size of the XML source.

4. Fields

Fields let MarkLogic apply different indexing settings to different parts of a document. For example, the title and abstract of a document may need wildcard indexes while the full content does not.
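
The idea of a field can be sketched in Python as a named group of elements that is indexed separately from the rest of the document. This is a simplified model, not MarkLogic configuration: the "summary" field, the sample document, and the field_words helper are assumptions made for illustration.

```python
# Hypothetical field definition: field name -> elements it covers.
fields = {"summary": ["title", "abstract"]}

# Hypothetical document: element name -> text.
document = {
    "title": "Indexing in MarkLogic",
    "abstract": "How term lists resolve queries",
    "body": "A long body that this field deliberately leaves out",
}

def field_words(document, field_name):
    """Collect the words from only the elements included in the field."""
    words = set()
    for element in fields[field_name]:
        words.update(document.get(element, "").lower().split())
    return words

summary_words = field_words(document, "summary")
```

Only words from the title and abstract end up in summary_words, so a more expensive index (such as a wildcard index) built over this set would stay small compared with one built over the whole document.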

5. Reindexing

After you change index settings, MarkLogic Server must reindex the database content. MarkLogic reindexes in the background while continuing to handle queries and updates. If you change or add an index setting, it cannot be used to resolve queries until reindexing is complete. If you remove an index setting, it stops being used immediately.

6. Relevance

When running a full-text query, it is not enough to find the documents that match the given constraint. The results must also be returned in order of relevance. Relevance is a mathematical construct built on a simple idea: a document with more matches is more relevant, and, for the same number of matches, a shorter document is more relevant than a longer one.
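
The two rules of thumb above can be captured in a toy scoring function. The Python sketch below is not MarkLogic's scoring formula; it assumes the simplest score that has both properties, namely the number of matches divided by the document length in words. The toy_score function and the sample documents are hypothetical.

```python
def toy_score(query_word, text):
    """Score a document as matches divided by length in words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(query_word) / len(words)

# Hypothetical documents, named after what they demonstrate.
documents = {
    "two_matches_short": "fox fox jumps",
    "two_matches_long": "fox fox jumps high",
    "one_match_short": "fox jumps high",
}

# Rank document names from most to least relevant for the word "fox".
ranked = sorted(
    documents,
    key=lambda name: toy_score("fox", documents[name]),
    reverse=True,
)
```

With these inputs the short document with two matches ranks first, the longer document with two matches second, and the document with a single match last.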

7. Indexing Document Metadata

So far this article has centred on how MarkLogic uses term lists to index text and structure. MarkLogic also uses term lists to index other items, including collections, directories, and security rules. The Universal Index is the sum of all of these indexes.

This section covers the following topics:

Collection Indexes
Directory Indexes
Security Indexes
Properties Indexes
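
One way to picture metadata living in the same index as text is to treat each collection name and each directory as just another term. The Python sketch below is a toy model under that assumption; the URIs, collection names, and the directories_of helper are hypothetical, and security and properties terms are omitted.

```python
from collections import defaultdict

# Hypothetical documents: URI -> metadata.
documents = {
    "/articles/2023/a.xml": {"collections": ["news", "featured"]},
    "/articles/2024/b.xml": {"collections": ["news"]},
    "/reports/c.xml": {"collections": ["featured"]},
}

def directories_of(uri):
    """Return every ancestor directory of a URI, shallowest first."""
    parts = uri.split("/")[1:-1]
    return ["/" + "/".join(parts[:i]) + "/" for i in range(1, len(parts) + 1)]

# Collections and directories become terms, each with its own term list.
term_lists = defaultdict(set)
for uri, meta in documents.items():
    for name in meta["collections"]:
        term_lists[("collection", name)].add(uri)
    for directory in directories_of(uri):
        term_lists[("directory", directory)].add(uri)

# "Featured documents under /articles/" is an intersection of two term lists.
featured_articles = sorted(
    term_lists[("collection", "featured")] & term_lists[("directory", "/articles/")]
)
```

Because the metadata terms have the same shape as text terms, a constraint on a collection or directory can be combined with a text constraint using the same set intersection.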

8. Fragmentation of XML Documents

So far this article has treated each unit of content in a MarkLogic database as a document, which is a simplification. MarkLogic actually indexes, retrieves, and stores fragments. By default a fragment is a whole document, and most users leave it that way. However, you can use the Admin Interface to break XML documents into sub-document fragments by configuring the fragment root or fragment parent database settings.
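
To show roughly what a fragment root does, the Python sketch below splits one XML document into a fragment per element with a chosen name. It is a simplified illustration using the standard library, not MarkLogic behaviour: the sample book document and the split_into_fragments helper are assumptions, and the sketch ignores what happens to the rest of the document once the fragments are split off.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document.
source = """<book>
  <title>Indexing</title>
  <chapter><heading>One</heading><p>red fox</p></chapter>
  <chapter><heading>Two</heading><p>lazy dog</p></chapter>
</book>"""

def split_into_fragments(xml_text, fragment_root):
    """Return one serialized fragment per element named fragment_root."""
    root = ET.fromstring(xml_text)
    return [
        ET.tostring(element, encoding="unicode").strip()
        for element in root.iter(fragment_root)
    ]

fragments = split_into_fragments(source, "chapter")
```

With "chapter" as the fragment root, the book yields two fragments, so a query that matches "lazy dog" would only need to touch the second chapter instead of the whole book.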