Entity resolution with Elasticsearch & LLMs, Part 2: Matching entities with LLM judgment and semantic search

Using semantic search and transparent LLM judgment for entity resolution in Elasticsearch.

In Part 1, we prepared our watch list and extracted entity mentions. Now we’re ready to answer the hard question: Which entity does a mention actually refer to? Let's return to the example in the first blog of this series, which set up why we need entity resolution: "The Swift update is here!" Imagine that this headline is accompanied by a bit more context:

  1. The new Swift update is here! Developers are eager to try out the new features.
  2. The new Swift update is here! The new album will drop next month.

With this added context, we should be able to resolve the name "Swift" to the correct entity.

In the previous post, we set up our watch list and enriched the entities with additional context. Looking at our examples above, we need at least the following two entities in the list: Taylor Swift and Swift Programming Language. We also covered how we extract entity mentions from text; both of these examples would extract "Swift". With these ingredients in place (the enriched watch list and the extracted entities), we’re finally ready to introduce the star of the show: entity matching.

Remember: This is an educational prototype designed to teach entity-matching concepts. Production systems might use different large language models (LLMs), custom matching rules, specialized judgment pipelines, or ensemble approaches combining multiple matching strategies.

The problem: Why matching is hard

Human language is remarkable, and one of its most interesting properties is its endless creativity. We can generate and understand an infinite number of new sentences. Is it any wonder, then, that exact matches in entity resolution are rare? Authors strive to be creative when they can. It would get quite tedious if we had to write and read full names whenever an entity is mentioned. So, while exact matches are easy, the reality is that we need a more sophisticated approach to entity resolution: one that’s robust enough to handle at least some of the boundless creativity of human authors. That’s why we separate the problem into two steps: Use Elasticsearch to retrieve plausible candidates at scale, and then use an LLM to judge whether those candidates truly refer to the same real-world entity.

The solution: Three-step matching with transparent LLM judgment

We’re in the midst of a paradigm shift in how we use computers. Just as the rise of the internet took us from localized computing to a globally connected network, generative AI (GenAI) is fundamentally changing how content, code, and information are created. In fact, the educational prototype that accompanies this series was almost exclusively "vibe coded" using an LLM with careful prompting by the author. This is not to say that LLMs have reached, or ever will reach, the kind of creativity inherent in human language, but it does mean that we now have a powerful resource to help with entity resolution.

A common pattern we use with GenAI is retrieval augmented generation (RAG). Here, retrieval means retrieving entity candidates (not generating answers), and the LLM is used strictly for match evaluation and explanation. While we could ask an LLM to handle entity resolution end to end, that’s a costly approach, both in terms of time and money. RAG narrows the LLM’s job: retrieval supplies a small, relevant set of candidates as context, so the LLM only has to judge a handful of pairs rather than reason over the whole corpus.

For the retrieval part of RAG, we again turn to Elasticsearch. We first find potential matches using a combination of exact matching, matching against aliases, and hybrid search, which combines keyword and semantic search. Once we find these potential matches, we send them to an LLM for judgment. The LLM acts as the final match evaluator. We also make the LLM explain its reasoning, an important differentiator from other entity resolution systems. Without these explanations, entity resolution is a black box; with them, we can see for ourselves why a match makes sense.

Key concepts: Three-step matching, hybrid search, and transparent LLM judgment

What is three-step matching? At the outset of this project, we hypothesized that semantic search would be a crucial part of the system, but not every match requires such sophisticated search. In order to find matches efficiently, we take a progressive approach to the problem. First, we check for exact matches using keyword search. If we find such a match, our work is done and we can move on. If exact matching fails, we turn to alias matching. In the prototype, alias matching is also done using exact matching with keywords, for simplicity. In production, you might expand this step with normalization, transliteration rules, fuzzy matching, or curated alias tables. If we still haven't found a potential match in the first two steps, then it's time to bring in semantic search via Elasticsearch's hybrid search with reciprocal rank fusion (RRF).
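The cascade above can be sketched in a few lines of Python. The three search helpers here are hypothetical stand-ins for the actual Elasticsearch queries; only the control flow (escalate only when the cheaper step fails) is the point.

```python
def three_step_match(mention, exact_search, alias_search, hybrid_search, top_k=2):
    """Return candidate entities for a mention, escalating only as needed.

    exact_search, alias_search, and hybrid_search are callables that take a
    mention string and return a list of candidate entities (illustrative
    stand-ins for real Elasticsearch queries).
    """
    # Step 1: exact keyword match on the primary name.
    hits = exact_search(mention)
    if hits:
        return hits
    # Step 2: exact keyword match against known aliases.
    hits = alias_search(mention)
    if hits:
        return hits
    # Step 3: hybrid (lexical + semantic) search with RRF; keep only the
    # top candidates, which are then sent on for LLM judgment.
    return hybrid_search(mention)[:top_k]
```

In the prototype, the expensive LLM call happens only for candidates that survive this cascade, so the common case (an exact hit) never reaches step 3.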

What is hybrid search? In Elasticsearch, we can use semantic search to find meaningful matches that take context into account. Elasticsearch is widely used for vector search and hybrid retrieval. Semantic similarity is powerful for meaning, but it’s not a substitute for structured filtering (for example, by time ranges, locations, or identifiers), and it’s often unnecessary when an exact match is available. Elasticsearch made its mark with lexical search, which is great at tasks where semantic search doesn't fit. To take full advantage of both approaches, we use lexical search alongside semantic search in a single hybrid query. We then merge the results to find the most likely matches using RRF. In the prototype, the top two results become potential matches that can be sent for LLM judgment.
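A hybrid query of this shape can be expressed with Elasticsearch's `rrf` retriever, which fuses a lexical leg and a semantic leg. The sketch below builds the request body only; the field names `name` and `context` are illustrative, and the `semantic` query assumes the context field is mapped as `semantic_text` (available in recent Elasticsearch versions).

```python
def build_hybrid_query(mention: str, context: str, size: int = 2) -> dict:
    """Build a hybrid Elasticsearch query fusing lexical and semantic
    retrieval with reciprocal rank fusion (RRF). Field names are illustrative."""
    return {
        "retriever": {
            "rrf": {
                "retrievers": [
                    # Lexical leg: keyword-style match on the entity name field.
                    {"standard": {"query": {"match": {"name": mention}}}},
                    # Semantic leg: semantic query against a semantic_text
                    # field holding the enriched entity context.
                    {"standard": {"query": {"semantic": {
                        "field": "context",
                        "query": f"{mention} {context}",
                    }}}},
                ],
                "rank_window_size": 50,
            }
        },
        "size": size,  # top two fused results become potential matches
    }
```

The body would then be passed to `es.search(index=..., body=...)`; RRF merges the two ranked lists so neither leg has to be a perfect match on its own.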

Why LLM judgment? LLM judgments and explanations allow our system to handle ambiguity and context transparently. This is vital for cases like "the president", which could refer to multiple entities, depending on the context, but it also makes things like nicknames and cultural variations work well in the system. Finally, when we consider mission-critical tasks, like identifying entities from sanctions lists, we need to know why a match was accepted in order to trust the system. Crucially, the LLM does not search the full corpus; it evaluates only the small set of candidates returned by Elasticsearch.
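A minimal sketch of the judgment step might look like the following. The prompt template and JSON schema are hypothetical (the prototype's actual wording may differ), but they show the key idea: the LLM sees only one mention plus its retrieved candidate, and must return a verdict together with its reasoning.

```python
# Hypothetical judgment prompt; the exact wording and schema are illustrative,
# not the prototype's actual template.
JUDGMENT_PROMPT = """\
You are an entity-resolution judge. Decide whether the mention refers to the
candidate entity, using the surrounding text as context.

Mention: {mention}
Context: {context}
Candidate: {candidate_name}
Candidate description: {candidate_description}

Respond with JSON only:
{{"match": true or false, "confidence": 0.0-1.0, "reasoning": "<one sentence>"}}
"""

def build_judgment_prompt(mention: str, context: str, candidate: dict) -> str:
    """Fill the template for a single mention/candidate pair."""
    return JUDGMENT_PROMPT.format(
        mention=mention,
        context=context,
        candidate_name=candidate["name"],
        candidate_description=candidate.get("description", ""),
    )
```

Requiring the `reasoning` field is what makes the system auditable: a human reviewer can check why "Swift" next to "new album" was matched to Taylor Swift rather than the programming language.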

Real-world results: Matching with LLM reasoning

A major challenge for any natural language processing task is the creation of a golden document, an "answer key" that tells us what the expected results are. Without this, it's next to impossible to judge how well a system performs on a task, but creating such a document can be a laborious process. For the entity resolution prototype, we turned again to GenAI to help set up data we could test against.

We first defined several challenge types, such as nicknames and transliteration, and then asked the LLM to create a tiered collection of datasets that would get progressively larger and more challenging for the system. The creation of the datasets was less straightforward than one might hope. The LLM had a strong propensity for "cheating" by making it too easy to get the right answer. For example, one of the challenge types focused on semantic context. This type included things like resolving "Russian author" to "Leo Tolstoy". The LLM incorrectly put "Russian author" as an alias for "Leo Tolstoy", which negated the need for hybrid search to find the match.

After several rounds of revision to fix issues like this, we had five dataset tiers to work with. Tiers 1–4 were progressively larger with more challenge types. Tier 5 was the "ultimate challenge" dataset, made up of the trickiest examples from all challenge types. All of the test data is available in the comprehensive evaluation directory.

To evaluate our prompt-based entity resolution approach, we focused our attention on the tier 4 dataset. An important note is that the evaluation was conducted as a controlled experiment so that we could focus on entity match quality. The watch list data was pre-enriched with context, and entities were extracted from the article ahead of time. This ensured that evaluation was focused on matching rather than on extraction accuracy. This isolates match quality; end-to-end performance would additionally depend on extraction recall and enrichment quality.

Evaluation dataset

The tier 4 evaluation dataset provides a comprehensive test of the system's capabilities:[1]

  • Watch list entities: 66 entities across diverse types (people, organizations, locations).
  • Test articles: 69 articles covering real-world entity resolution scenarios.
  • Expected matches: 206 expected entity matches across all articles.
  • Challenge types: 15 different challenge types testing various aspects of entity resolution.

The challenge types included in the dataset are:

  • Nicknames: "Bob Smith" → "Robert Smith" (seven articles).
  • Titles and honorifics: "Dr. Sarah Williams" → "Sarah Williams" (five articles).
  • Semantic context: "Russian author" → "Leo Tolstoy" (eight articles).
  • Multilingual names: Handling names in different scripts (six articles).
  • Business entities: Corporate name variations (seven articles).
  • Executive references: "Microsoft CEO" → "Satya Nadella" (five articles).
  • Political leaders: Title-based references (five articles).
  • Initials: "J. Smith" → "John Smith" (three articles).
  • Name order variations: Different name ordering conventions (three articles).
  • Truncated names: Partial name matches (three articles).
  • Name splitting: Names split across text (three articles).
  • Missing spaces/hyphens: Formatting variations (two articles).
  • Transliteration: Cross-script name matching (two articles).
  • Combined challenges: Multiple challenges in one article (six articles).
  • Complex business: Hierarchical business relationships (five articles).

Let's see how prompt-based entity resolution performed.

Overall performance

The results show that there's a lot of promise with LLM-powered match evaluation, but they also reveal a significant reliability issue. Because each candidate pair must be evaluated by the LLM, failures in structured output can suppress acceptance and recall even when retrieval is working well.

  Metric                   Value
  Precision                83.8%
  Recall                   62.6%
  F1 score                 71.7%
  Total matches found      344
  LLM acceptance rate      44.8%
  Error rate               30.2%
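As a quick sanity check on these numbers, the F1 score is the harmonic mean of precision and recall, and the reported figures are internally consistent:

```python
# F1 is the harmonic mean of precision and recall.
precision, recall = 0.838, 0.626
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.717, matching the reported 71.7% F1 score
```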

The error rate problem

Recall that the first step we take in the prototype is to create potential match pairs using Elasticsearch. Each of these potential matches needs to be evaluated by the LLM. To efficiently process all of those matches, we batch the LLM calls together. This reduces API costs and latency, but there’s also an increased risk of getting malformed JSON in the output. As batch size increases, the JSON becomes longer and more complex, making it more likely that the LLM will generate invalid JSON. This is where the 30% error rate stems from. In the evaluation, we used a batch size of five matches per request. Even with this conservative batch size, we still see JSON parsing failures, which skews the evaluation results significantly.
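One way to keep malformed batch output from silently suppressing matches is to validate the reply before trusting it and to signal the caller to retry (for example, with a smaller batch) when validation fails. This is a sketch of that guard, not the prototype's actual parsing code:

```python
import json

def parse_batch_judgments(raw: str, batch: list):
    """Parse a batched LLM reply expected to be a JSON array with exactly one
    judgment per candidate pair. Returns None when the output is unusable, so
    the caller can retry (e.g., re-sending pairs individually) instead of
    silently dropping matches."""
    try:
        results = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed JSON: retry with a smaller batch
    if not isinstance(results, list) or len(results) != len(batch):
        return None  # wrong shape or a judgment is missing
    return results
```

Treating "unparseable" as "retry" rather than "no match" is what separates a transient formatting failure from a genuine rejection; without that distinction, JSON errors masquerade as low recall.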

What's next: Optimizing LLM integration

Now that we've matched entities using semantic search and LLM judgment, we have a complete entity resolution pipeline. However, this approach introduces a new failure mode: the model's judgment can be correct while its output isn't usable. We can optimize the LLM integration for better reliability and cost efficiency. In the next post, we'll explore how to use function calling for structured output, which provides guaranteed structure and type safety while reducing errors and costs.

Try it yourself

Want to see entity matching in action? Check out the Entity Matching notebook for a complete walk-through with real implementations, detailed explanations, and hands-on examples. The notebook shows you exactly how to match entities using three-step search, hybrid search with RRF, and LLM-powered judgment with reasoning.

Remember: This is an educational prototype designed to teach the concepts. When building production systems, consider additional factors, like model selection, cost optimization, latency requirements, quality validation, error handling, and monitoring, which aren't covered in this learning-focused prototype.

Notes

  1. These datasets are synthetic and designed for education; they approximate real challenges but are not representative of any single production domain.
