Indexing in Zimbra
How E-Mail Content is Indexed in Zimbra
For information about Lucene files: https://lucene.apache.org/core/3_0_3/fileformats.html
For Zimbra's Lucene integration, you can inspect the class linked below:
The IndexField enum on line 199 is important: the store and analyzed parameters of the indexed fields are configured there.
https://github.com/Zimbra/zm-mailbox/tree/develop/store/src/java/com/zimbra/cs/index/analysis In this folder you can see the custom analyzer, character filter, tokenizer, and token filter classes. Here the analysis is configured in code, rather than in the “settings.json” file we use to configure ElasticSearch in Spring Boot.
In Zimbra, an IndexDocument object is created first. The fields, together with their store settings and other parameters, are declared here: https://github.com/Zimbra/zm-mailbox/blob/develop/store/src/java/com/zimbra/cs/index/IndexDocument.java
Another important class is this: https://github.com/Zimbra/zm-mailbox/blob/develop/store/src/java/com/zimbra/cs/index/ZimbraAnalyzer.java
The “tokenStream” method on line 142 is especially important. Depending on the field, it dispatches to the classes under the “analysis” path linked above. Once the fields are set in IndexDocument, the rules in “analysis” are applied to them.
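To make the per-field dispatch idea concrete, here is a minimal sketch in plain Java, without Lucene. It is not Zimbra's code: the class name, the “from”/“to” field names, and the two tokenizing rules are assumptions chosen only for illustration. It only shows the pattern of choosing an analysis chain based on the field name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical sketch: picks a tokenizing function by field name,
// similar in spirit to a per-field tokenStream dispatch.
class FieldAnalyzerSketch {

    static Function<String, List<String>> analyzerFor(String field) {
        switch (field) {
            case "from":
            case "to":
                // Assumed rule: address fields are split on whitespace only,
                // so "a.b@c.org" survives as a single token.
                return FieldAnalyzerSketch::whitespaceLowercase;
            default:
                // Assumed rule for free text such as "l.content":
                // split on anything that is not a letter or a digit.
                return FieldAnalyzerSketch::lettersAndDigits;
        }
    }

    static List<String> whitespaceLowercase(String text) {
        return split(text, "\\s+");
    }

    static List<String> lettersAndDigits(String text) {
        return split(text, "[^\\p{L}\\p{N}]+");
    }

    // Lowercases the text, splits it with the given regex, and drops empty pieces.
    private static List<String> split(String text, String regex) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split(regex)) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String value = "Jane.Doe@example.org Bob";
        System.out.println(analyzerFor("to").apply(value));        // [jane.doe@example.org, bob]
        System.out.println(analyzerFor("l.content").apply(value)); // [jane, doe, example, org, bob]
    }
}
```

The same input produces different tokens depending on the field, which is the reason a per-field analyzer exists in the first place.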
Zimbra indexes using Lucene, but it can be switched to ElasticSearch through configuration. For indexing in ElasticSearch, the https://github.com/Zimbra/zm-mailbox/blob/develop/store/src/java/com/zimbra/cs/index/elasticsearch/ElasticSearchIndex.java class is used.
In ElasticSearch, extra analyzers are added in code on top of the ones in the “analysis” folder (zimbrastandard, whitespace tokenizer, emailaddress, contactdata). To give an example, let’s say the from address is “Senorita Developer <firstname.lastname@example.org>”. While creating the IndexDocument, Lucene does the analysis and converts it to “senorita developer email@example.com senoritadev @zimbrathree.nils.local zimbrathree.nils.local” (for ElasticSearch, the IndexDocument is prepared as a JSON-formatted string). On top of that, these ElasticSearch analysis configurations are applied, or simply skipped when not required.
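The address expansion described above can be approximated in a few lines of plain Java. This is only a sketch under my own assumptions, not Zimbra's tokenizer: the class name and the sample address are made up, and the token order (display name words, full address, local part, “@domain”, domain) simply follows the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: expands "Display Name <local@domain>" into lowercase search tokens.
class AddressTokensSketch {

    static List<String> tokenize(String header) {
        List<String> tokens = new ArrayList<>();
        String lower = header.toLowerCase(Locale.ROOT).trim();

        // Separate the display name from the address inside angle brackets, if present.
        int open = lower.indexOf('<');
        int close = lower.indexOf('>', open + 1);
        String name = "";
        String address = lower;
        if (open >= 0 && close > open) {
            name = lower.substring(0, open).trim();
            address = lower.substring(open + 1, close).trim();
        }

        // Each word of the display name becomes its own token.
        for (String word : name.split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.add(word);
            }
        }

        if (!address.isEmpty()) {
            tokens.add(address); // the full address
            int at = address.lastIndexOf('@');
            if (at > 0 && at < address.length() - 1) {
                String domain = address.substring(at + 1);
                tokens.add(address.substring(0, at)); // local part
                tokens.add("@" + domain);             // domain with a leading '@'
                tokens.add(domain);                   // bare domain
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [jane, doe, jane.doe@mail.example.test, jane.doe, @mail.example.test, mail.example.test]
        System.out.println(tokenize("Jane Doe <Jane.Doe@mail.example.test>"));
    }
}
```

Emitting the local part and the domain as separate tokens is what lets a search for just the domain, or just the user name, match the message.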
topLevel.put("mappings", mappings);
mappings.put(indexType, zimbra);
zimbra.put("_source", source);
source.put("enabled", false); // save space
source.put("_all", false); // save space
zimbra.put("properties", properties);
As you can see from the piece of code above, the “enabled” and “_all” settings are defined under “_source”, and both are set to false to save space.
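To see the nested structure those calls produce, here is a self-contained sketch of the same sequence. I am assuming plain java.util maps in place of whatever JSON object type the real class uses, and the index type value “zimbra” in main is a placeholder of mine.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: rebuilds the mapping skeleton with plain maps
// so the nesting can be printed and inspected.
class MappingSketch {

    static Map<String, Object> buildMapping(String indexType) {
        Map<String, Object> topLevel = new LinkedHashMap<>();
        Map<String, Object> mappings = new LinkedHashMap<>();
        Map<String, Object> zimbra = new LinkedHashMap<>();
        Map<String, Object> source = new LinkedHashMap<>();
        Map<String, Object> properties = new LinkedHashMap<>(); // field definitions would go here

        topLevel.put("mappings", mappings);
        mappings.put(indexType, zimbra);
        zimbra.put("_source", source);
        source.put("enabled", false); // do not store the original document, to save space
        source.put("_all", false);    // kept exactly as in the snippet above
        zimbra.put("properties", properties);
        return topLevel;
    }

    public static void main(String[] args) {
        // Prints: {mappings={zimbra={_source={enabled=false, _all=false}, properties={}}}}
        System.out.println(buildMapping("zimbra"));
    }
}
```

With “_source” disabled, the index can be searched but the original text cannot be read back from it, which is consistent with the “save space” comments.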
For the indexed fields in Zimbra, I prepared an Excel sheet like the one below (zimbra_index_fields):
Mail content is stored in “l.content”. During the indexing code flow, the analyzed e-mail addresses and similar values are concatenated to it.
We can also inspect, from Zimbra, the Lucene files I mentioned at the top of the post. For example, on my local Zimbra, my account with mboxId=3 is indexed in “/opt/zimbra/index/0/3/index/0”. A screenshot showing part of its contents:
The “.tis” files contain some information, but it is not easy to map it back and extract the real content (the contents of meeting requests appear to be visible, but not the whole original content).
.tis file sample content from my local Zimbra:
For information about Zimbra queries: https://github.com/Zimbra/zm-mailbox/blob/develop/store/docs/query.md. The “CONTENT field” part under the “Fields” title is worth examining.