An open‑source Hebrew lemmatization analyzer for Elasticsearch

An open-source Elasticsearch 9.x analyzer plugin that improves Hebrew search by lemmatizing tokens in the analysis chain for better recall across Hebrew morphology.


Hebrew is morphologically rich: prefixes, inflections, and clitics make exact-token search brittle. This project provides an open-source Hebrew analyzer plugin for Elasticsearch 9.x that performs neural lemmatization in the analysis chain, using an embedded DictaBERT model executed in-process via ONNX Runtime with an INT8-quantized model.

Quick start

Download the relevant release, or build it yourself (the Linux build script generates an Elasticsearch‑compatible zip):
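As a sketch, a typical build flow looks like the following; the repository URL and script name are placeholders, so substitute the project's actual release artifact or build script:

```shell
# Placeholder repository URL and script name -- substitute the real ones
# from the project's releases page.
git clone https://github.com/your-org/hebrew-analyzer-plugin.git
cd hebrew-analyzer-plugin
./build.sh   # on Linux, emits an Elasticsearch-compatible plugin zip
```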

Install in Elasticsearch:
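For a locally built or downloaded zip, installation follows the standard plugin workflow (the zip path below is a placeholder for your artifact):

```shell
# Install from a local zip; the path is a placeholder.
bin/elasticsearch-plugin install file:///path/to/hebrew-analyzer-plugin.zip
# Restart the node so Elasticsearch picks up the new analyzer.
```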

Test:
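A quick smoke test via the `_analyze` API might look like this; the analyzer name `hebrew` is an assumption here, so use whatever name the plugin actually registers:

```shell
# The analyzer name "hebrew" is assumed -- check the plugin's docs.
curl -X GET "localhost:9200/_analyze?pretty" \
  -H 'Content-Type: application/json' \
  -d '{ "analyzer": "hebrew", "text": "בבית" }'
```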

Why Hebrew search is different

Hebrew is morphologically rich: prefixes, suffixes, inflection, and clitics all collapse into a single surface form. That makes naive tokenization insufficient. Without true lemmatization, search quality suffers: users miss relevant results because of simple variations in form. This project tackles that by embedding a Hebrew lemmatization model inside the analyzer itself, so every token passes through a neural model before indexing and querying.

Example

Users may search for the lemma “בית” (house), but documents might contain:

  • בית (a house)
  • בבית (in the house)
  • לבית (to the house)
  • בבתים (in houses)
  • לבתים (to houses)

Without lemmatization, these are different surface tokens; lemmatization normalizes them toward the same lemma (בית), improving recall.
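As an illustration (again assuming the plugin registers its analyzer as `hebrew`), running all of the surface forms through `_analyze` should emit the same lemma token for each:

```shell
# Analyzer name "hebrew" is an assumption; each form should normalize
# toward the lemma בית.
curl -X GET "localhost:9200/_analyze?pretty" \
  -H 'Content-Type: application/json' \
  -d '{ "analyzer": "hebrew", "text": "בית בבית לבית בבתים לבתים" }'
```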

What this plugin does

Rather than relying on rule-based stemming, the analyzer runs a Hebrew lemmatization model as part of the Elasticsearch analysis chain and emits one normalized lemma per token. Because the model is neural, it can use local context within each analyzed segment to choose a lemma in ambiguous cases—while still producing stable tokens that work well for indexing and querying. The analyzer:

  • Runs a Hebrew lemmatization model inside Elasticsearch.
  • Produces better normalized tokens for Hebrew text.
  • Supports stopwords and standard analyzer pipelines.
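To use the analyzer at index and query time, reference it in the field mapping. A minimal sketch, with the index name, field name, and analyzer name `hebrew` all assumed rather than taken from the project's docs:

```shell
# Index name, field name, and analyzer name are placeholders.
curl -X PUT "localhost:9200/articles" \
  -H 'Content-Type: application/json' \
  -d '{
        "mappings": {
          "properties": {
            "body": { "type": "text", "analyzer": "hebrew" }
          }
        }
      }'
```

Because the same analyzer is applied at both index and query time, prefixed query terms and prefixed document tokens meet at the shared lemma.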

Fast, reliable lemmatization

This analyzer is optimized for real‑world throughput:

  • ONNX Runtime in‑process inference.
  • INT8-quantized model for lower latency and memory footprint.
  • Java Foreign Function Interface (FFI) for high‑performance native inference.

The result: fast, reliable lemmatization with predictable operational behavior.

To evaluate performance, we ran a benchmark in a Docker container (4 cores, 12 GB RAM) on 1 million large documents (5.7 GB of data) from the Hebrew Wikipedia dataset. You’ll find the results below:

| Metric (search) | Task | Value | Unit |
|---|---|---|---|
| Min throughput | hebrew-query-search | 409.75 | ops/s |
| Mean throughput | hebrew-query-search | 490.65 | ops/s |
| Median throughput | hebrew-query-search | 491.85 | ops/s |
| Max throughput | hebrew-query-search | 496.13 | ops/s |
| 50th percentile latency | hebrew-query-search | 7.02242 | ms |
| 90th percentile latency | hebrew-query-search | 10.7338 | ms |
| 99th percentile latency | hebrew-query-search | 19.0406 | ms |
| 99.9th percentile latency | hebrew-query-search | 27.165 | ms |
| 50th percentile service time | hebrew-query-search | 7.02242 | ms |
| 90th percentile service time | hebrew-query-search | 10.7338 | ms |
| 99th percentile service time | hebrew-query-search | 19.0406 | ms |
| 99.9th percentile service time | hebrew-query-search | 27.165 | ms |
| Error rate | hebrew-query-search | 0 | % |

Open source and Elastic‑ready

The plugin is fully open source and works on:

  • Elastic open‑source distributions.
  • Elastic Cloud.

You can build it yourself or download prebuilt releases and install it like any other plugin.

To upload the analyzer plugin to Elastic Cloud, navigate to the Extensions section within your Elastic Cloud console and proceed with the upload.

Credits

This project is a fork of the Korra.ai Hebrew analysis plugin (MIT), which was implemented by Korra.ai with funding and guidance from the National NLP Program led by MAFAT and the Israel Innovation Authority.

This fork focuses on Elasticsearch 9.x compatibility and running lemmatization fully in-process via ONNX Runtime, using an INT8‑quantized model and bundled Hebrew stopwords. Lemmatization is powered by DictaBERT dicta-il/dictabert-lex (CC‑BY‑4.0).

Huge thanks to the Dicta team for making high-quality Hebrew natural language processing (NLP) models available to the community.
