Why search speed matters for AI agents and context engineering
Our benchmarks on a 20M document corpus show that Elasticsearch delivers up to 8x higher throughput than OpenSearch for filtered vector search, while also achieving higher Recall@100 across the configurations we tested. Context engineering depends on more than fast vector retrieval: teams also need strong relevance controls, such as hybrid search and filtering, operational simplicity, and predictable performance as workflows iterate. And because agents often run retrieve → reason → retrieve loops many times per request, retrieval latency becomes a multiplier, so improvements here translate directly into better end-to-end responsiveness and lower cost.

Graph 1: Throughput.
For context engineering, retrieval isn’t a one-time step. Agents and applications repeatedly run loops, such as retrieve → reason → retrieve, to refine queries, verify facts, assemble grounded context, and complete tasks. This pattern is common in agentic workflows and iterative retrieval augmented generation (RAG). Because retrieval may be invoked many times per user request, it adds delay to the response and/or increases infrastructure costs.

Figure 1: Context engineering turns a large context pool (docs, memory, tools, chat history) into a limited large language model (LLM) context window via repeated retrieval and curation.
Best practices for implementing context engineering are still emerging, and iteration counts vary widely by workflow. The concept most fundamental to these benchmark results is directional: iterative retrieval makes latency a multiplier.
Why is vector search performance critical?
Imagine a shopping assistant answering the question, “I need a carry-on backpack under $60 that fits a 15-inch laptop, is water resistant, and can arrive by Friday.”
In production, the assistant rarely issues one vector query and stops. It runs a retrieval loop to build the right context, and each step is typically constrained by filters, like availability, region, shipping promise, brand rules, and policy eligibility.
Step 1: Interpret intent and translate to constraints.
The agent turns the request into structured filters and a semantic query, such as:
- Filters: In stock, deliverable to the user’s postcode, delivery by Friday, price under $60, valid listing
- Vector query: “Carry-on backpack 15-inch laptop water resistant”
Step 2: Retrieve candidates, and then refine.
It often repeats retrieval with variations to avoid missing good matches:
- “travel backpack carry on laptop sleeve”
- “water resistant commuter backpack 15 inch”
- “lightweight cabin backpack”
Each query uses the same eligibility filters, because retrieving irrelevant or unavailable items is wasted context.
Step 3: Expand to confirm details and reduce risk.
The agent then retrieves again to verify key attributes that affect the final answer:
- Material and water resistance wording
- Dimensions and laptop compartment fit
- Return policy or warranty constraints
- Alternate options if inventory is low
This is multistep context engineering: Retrieve, reason, retrieve, assemble.
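The steps above can be sketched in Python. This is an illustrative assumption, not the article's actual agent or benchmark code: the field names, the `embed()` stub, and the query shape are all hypothetical stand-ins.

```python
# Hedged sketch of the retrieve -> reason -> retrieve loop described above.
# Field names ("embedding", "in_stock", etc.) and embed() are illustrative
# assumptions, not the actual benchmark or agent code.

def embed(text: str) -> list[float]:
    """Stand-in for a real embedding model; returns a 128-dim vector."""
    return [0.0] * 128

def build_query(query_vector, filters, size=100, num_candidates=500):
    """Combine a semantic kNN clause with hard eligibility filters."""
    return {
        "size": size,
        "knn": {
            "field": "embedding",          # assumed vector field name
            "query_vector": query_vector,
            "k": size,
            "num_candidates": num_candidates,
            "filter": {"bool": {"filter": filters}},
        },
    }

# Step 1: the same eligibility filters apply to every retrieval in the loop.
filters = [
    {"term": {"in_stock": True}},
    {"range": {"price": {"lt": 60}}},
    {"term": {"ships_by_friday": True}},
]

# Step 2: retry with query variants so good matches aren't missed.
variants = [
    "carry-on backpack 15-inch laptop water resistant",
    "travel backpack carry on laptop sleeve",
    "water resistant commuter backpack 15 inch",
]
queries = [build_query(embed(v), filters) for v in variants]
```

Each iteration of the loop issues another filtered kNN query like these, which is why per-query latency compounds across the session.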
Why latency and recall matter for context engineering
These interactions can involve dozens of filtered retrieval calls per user session. That makes per-call latency a direct multiplier on end-to-end response time, and low recall forces extra retries or causes the agent to miss eligible items, degrading answer quality.
Takeaway: In context-engineered systems, filtered approximate nearest neighbors (ANN) isn’t a single lookup. It’s a repeated operation under constraints, so vector search performance shows up immediately in latency, throughput, and cost, even when the large language model (LLM) is the most visible component.
Benchmarking
Results
In Graph 2, each dot represents one test configuration. The best results appear toward the top left, meaning higher recall with lower latency. Elasticsearch’s results are consistently closer to the top left than OpenSearch’s, indicating better speed and accuracy under the same workload settings.

Graph 2: Recall versus average latency, rescore of 1.
How to read the results
- `s_n_r_value`: Shorthand for `size_numCandidates_rescoreOversample` (k was set equal to numCandidates in these tests); for example, `100_500_1` means size=100, numCandidates=500 and k=500, rescore oversample=1
- Recall: Measured Recall@100 for that configuration
- Avg latency (ms): Average end-to-end latency per query
- Throughput: Queries per second
- Recall %: Relative recall lift of Elasticsearch over OpenSearch, computed as (Elasticsearch − OpenSearch) / OpenSearch
- Latency Xs: OpenSearch average latency divided by Elasticsearch average latency
- Throughput Xs: Elasticsearch throughput divided by OpenSearch throughput
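As a worked example, the derived columns for the first table row (100_250_1) follow directly from these definitions:

```python
# Reproducing the derived columns for the 100_250_1 row,
# using the measured values from the table below.
es = {"recall": 0.7704, "latency_ms": 25.0, "throughput": 534.75}
os_ = {"recall": 0.7023, "latency_ms": 57.08, "throughput": 279.58}

recall_pct = (es["recall"] - os_["recall"]) / os_["recall"] * 100
latency_x = os_["latency_ms"] / es["latency_ms"]
throughput_x = es["throughput"] / os_["throughput"]

print(f"Recall %: {recall_pct:.2f}")         # 9.70
print(f"Latency Xs: {latency_x:.2f}")        # 2.28
print(f"Throughput Xs: {throughput_x:.2f}")  # 1.91
```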
| Engine | `s_n_r_value` | Recall | Avg Latency (ms) | Throughput | Recall % | Latency Xs | Throughput Xs |
|---|---|---|---|---|---|---|---|
| Elasticsearch | 100_250_1 | 0.7704 | 25 | 534.75 | 9.70% | 2.28 | 1.91 |
| OpenSearch | 100_250_1 | 0.7023 | 57.08 | 279.58 | |||
| Elasticsearch | 100_500_1 | 0.8577 | 25.42 | 524.14 | 7.20% | 2.4 | 2 |
| OpenSearch | 100_500_1 | 0.8001 | 60.9 | 262.12 | |||
| Elasticsearch | 100_750_1 | 0.8947 | 29.67 | 528.09 | 5.72% | 2.25 | 2.21 |
| OpenSearch | 100_750_1 | 0.8463 | 66.76 | 239.11 | |||
| Elasticsearch | 100_1000_1 | 0.9156 | 29.65 | 534.5 | 4.66% | 2.46 | 2.44 |
| OpenSearch | 100_1000_1 | 0.8748 | 72.88 | 219.01 | |||
| Elasticsearch | 100_1500_1 | 0.9386 | 31.84 | 497.3 | 3.38% | 2.71 | 2.68 |
| OpenSearch | 100_1500_1 | 0.9079 | 86.16 | 185.4 | |||
| Elasticsearch | 100_2000_1 | 0.9507 | 34.69 | 457.2 | 2.57% | 2.98 | 2.96 |
| OpenSearch | 100_2000_1 | 0.9269 | 103.36 | 154.55 | |||
| Elasticsearch | 100_2500_1 | 0.9582 | 37.9 | 418.43 | 1.99% | 3.28 | 3.26 |
| OpenSearch | 100_2500_1 | 0.9395 | 124.29 | 128.53 | |||
| Elasticsearch | 100_3000_1 | 0.9636 | 41.86 | 379.4 | 1.62% | 3.46 | 3.44 |
| OpenSearch | 100_3000_1 | 0.9482 | 144.67 | 110.34 | |||
| Elasticsearch | 100_4000_1 | 0.9705 | 50.28 | 316.21 | 1.06% | 3.87 | 3.85 |
| OpenSearch | 100_4000_1 | 0.9603 | 194.36 | 82.22 | |||
| Elasticsearch | 100_5000_1 | 0.9749 | 58.77 | 270.91 | 0.73% | 4.43 | 4.41 |
| OpenSearch | 100_5000_1 | 0.9678 | 260.33 | 61.38 | |||
| Elasticsearch | 100_6000_1 | 0.9781 | 66.75 | 238.59 | 0.52% | 4.91 | 4.89 |
| OpenSearch | 100_6000_1 | 0.973 | 327.44 | 48.81 | |||
| Elasticsearch | 100_7000_1 | 0.9804 | 74.64 | 213.49 | 0.38% | 5.28 | 5.27 |
| OpenSearch | 100_7000_1 | 0.9767 | 394.24 | 40.53 | |||
| Elasticsearch | 100_8000_1 | 0.9823 | 82.28 | 193.59 | 0.27% | 6.86 | 6.83 |
| OpenSearch | 100_8000_1 | 0.9797 | 564.14 | 28.33 | |||
| Elasticsearch | 100_9000_1 | 0.9837 | 90.08 | 176.96 | 0.16% | 7.63 | 7.61 |
| OpenSearch | 100_9000_1 | 0.9821 | 687.25 | 23.25 | |||
| Elasticsearch | 100_10000_1 | 0.9848 | 97.64 | 163.31 | 0.08% | 8.38 | 8.36 |
| OpenSearch | 100_10000_1 | 0.984 | 818.64 | 19.53 |
For example, at 100_9000_1, OpenSearch averages 687 milliseconds per retrieval versus 90 milliseconds for Elasticsearch. In a 10-step retrieval loop, that's about 10 × (687 − 90) ≈ 5,970 milliseconds, or roughly six seconds of additional waiting time.
See the full results.
Methodology
We used a Python client to send the queries to each engine and to track response timing and other statistics. Bear in mind that the performance of any vector search engine depends on how you tune its core parameters: how many candidates to consider, how aggressively to rescore, and how much context to return. These settings directly affect both recall (the likelihood of finding the right answer) and latency (how fast you get results).
In our benchmarks, we used the same candidate, rescore, and result-size settings you’d typically tune in an agentic retrieval loop, and we measured how Elasticsearch performs under that workload. We then ran OpenSearch with the same settings as a reference.
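The measurement loop can be sketched as follows. This is a hedged sketch, not the actual benchmark client (which lives in the repository referenced in this post): `run_query` and the ground-truth lists are hypothetical stand-ins, and the throughput estimate assumes a single sequential client.

```python
import statistics
import time

# Hedged sketch of a benchmark client: it times each query end to end
# and computes Recall@100 against precomputed exact nearest neighbors.
# run_query() and ground_truth are stand-ins for the real harness.

def measure(run_query, queries, ground_truth, top_k=100):
    latencies, recalls = [], []
    for i, q in enumerate(queries):
        start = time.perf_counter()
        hit_ids = run_query(q, size=top_k)   # returns ranked doc ids
        latencies.append((time.perf_counter() - start) * 1000)  # ms
        relevant = set(ground_truth[i][:top_k])
        recalls.append(len(relevant & set(hit_ids)) / top_k)
    avg_ms = statistics.mean(latencies)
    return {
        "avg_latency_ms": avg_ms,
        "recall_at_100": statistics.mean(recalls),
        "throughput_qps": 1000 / avg_ms,     # single sequential client
    }
```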
OpenSearch
"size": <RESULT_SIZE>: Number of hits returned to the client. In this benchmark, result size is 100 to compute Recall@100."k": <NUMBER_OF_CANDIDATES>: The number of nearest neighbor candidates."ef_search": <NUMBER_OF_CANDIDATES>: The number of vectors to examine."oversample_factor": <OVERSAMPLE>: How many candidate vectors are retrieved before rescoring.
Elasticsearch
"size": <RESULT_SIZE>: Number of hits returned to the client. In this benchmark, result size is 100 to compute Recall@100."k": <NUMBER_OF_CANDIDATES>: Number of nearest neighbors to return from each shard."num_candidates": <NUMBER_OF_CANDIDATES>: Number of nearest neighbor candidates to consider per shard while doingknnsearch."oversample": <OVERSAMPLE>: How many candidate vectors are retrieved before rescoring.
Example
A kNN query for the 100_500_1 configuration would be as follows:
OpenSearch
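A sketch of the 100_500_1 OpenSearch request body as a Python dict. The vector field name (`embedding`) and the filter clause are illustrative assumptions; the actual request bodies used are in the repository referenced in this post.

```python
# Hedged sketch of the 100_500_1 OpenSearch request body.
# "embedding" and the "valid" filter are assumed names, not the
# benchmark's actual field names.
query_vector = [0.1] * 128  # stand-in for a real 128-dim embedding

opensearch_body = {
    "size": 100,  # RESULT_SIZE, to compute Recall@100
    "query": {
        "knn": {
            "embedding": {
                "vector": query_vector,
                "k": 500,  # NUMBER_OF_CANDIDATES
                "method_parameters": {"ef_search": 500},
                "rescore": {"oversample_factor": 1},
                "filter": {"term": {"valid": True}},  # eligibility filter
            }
        }
    },
}
```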
Elasticsearch
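A sketch of the equivalent 100_500_1 Elasticsearch request body as a Python dict, under the same assumptions about field names:

```python
# Hedged sketch of the 100_500_1 Elasticsearch request body.
# "embedding" and the "valid" filter are assumed names, not the
# benchmark's actual field names.
query_vector = [0.1] * 128  # stand-in for a real 128-dim embedding

elasticsearch_body = {
    "size": 100,  # RESULT_SIZE, to compute Recall@100
    "knn": {
        "field": "embedding",
        "query_vector": query_vector,
        "k": 500,  # NUMBER_OF_CANDIDATES
        "num_candidates": 500,
        "filter": {"term": {"valid": True}},  # eligibility filter
        "rescore_vector": {"oversample": 1},
    },
}
```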
The full configuration, alongside Terraform scripts, Kubernetes manifests, and the benchmarking code, is available in this repository in the es-9.3-vs-os-3.5-vector-search folder.
Cluster setup
We ran our tests on six e2-standard-16 cloud servers, each with 16 vCPUs and 64 GB RAM. On each server, we allocated 15 vCPUs and 56 GB RAM to each Kubernetes pod running the search engine node, with 28 GB reserved for the JVM heap.
The clusters ran Elasticsearch 9.3.0 and OpenSearch 3.5.0 (Lucene 10.3.2). Because both systems use the same Lucene version in this benchmark, the throughput and latency differences we observe cannot be attributed to Lucene alone and instead reflect differences in how each engine integrates and executes filtered k-nearest neighbor (kNN) retrieval and rescoring. We used a single index with three primary shards and one replica (so 6 shards total, 1 per node).
We also used a separate server in the same region to run the benchmark client and collect timing statistics.

Figure 2: Diagram of the setup of the clusters.
The dataset
For this benchmark, we used a large-scale ecommerce-style catalog embedding dataset with 20 million documents, designed to reflect real-world filtered vector retrieval at scale.
Each document represents a catalog item and includes:
- A 128-dimensional dense vector embedding used for approximate kNN retrieval.
- Structured metadata fields used for filtering (for example, item validity and availability plus other catalog constraints) enabling the common production pattern of retrieving the nearest neighbors but only within an eligible subset.
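A single catalog document might look like the following sketch. The exact field names are illustrative assumptions; the dataset only specifies a 128-dimensional dense vector plus filterable metadata.

```python
# Hedged sketch of one catalog document in the benchmark dataset.
# Field names are illustrative assumptions.
doc = {
    "embedding": [0.02] * 128,  # 128-dimensional dense vector for kNN
    "valid": True,              # item validity flag used for filtering
    "in_stock": True,           # availability constraint
    "price": 54.99,             # example catalog attribute
}
```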
We chose this dataset because it captures the core performance challenge we see in agentic and RAG-style systems in production: vector similarity alone is not enough; retrieval is frequently constrained by filters, and the system must maintain high recall while keeping latency low under those constraints. Compared to smaller QA-style datasets, a 20M document corpus also better reflects the scale and candidate pressure that filtered ANN systems face in practice.
Conclusion
In modern AI architectures, especially those built around context engineering, vector search speed isn’t a minor implementation detail. It’s a multiplier. When agents and workflows iterate through retrieve → reason → retrieve, retrieval performance directly shapes end-to-end latency, throughput, and the quality of the context fed into the model.
In our benchmarks, Elasticsearch consistently delivered higher recall at lower latency than OpenSearch in scenarios where correctness depends on retrieving the right document, not just a similar vector. On a controlled dataset, the difference is clear, and in production those gains accumulate across large volumes of retrieval calls, improving responsiveness, increasing capacity headroom, and reducing infrastructure costs.




