Why search speed matters for AI agents and context engineering
Our benchmarks on a 20M document corpus show that Elasticsearch delivers up to 8x higher throughput than OpenSearch for filtered vector search, while also achieving higher Recall@100 across the configurations we tested. Context engineering depends on more than fast vector retrieval: teams also need strong relevance controls, such as hybrid search and filtering, operational simplicity, and predictable performance as workflows iterate. And because agents often run retrieve → reason → retrieve loops many times per request, retrieval latency becomes a multiplier, so improvements here translate directly into better end-to-end responsiveness and lower cost.

Graph 1: Throughput.
For context engineering, retrieval isn’t a one-time step. Agents and applications repeatedly run loops, such as retrieve → reason → retrieve, to refine queries, verify facts, assemble grounded context, and complete tasks. This pattern is common in agentic workflows and iterative retrieval augmented generation (RAG). Because retrieval may be invoked many times per user request, it adds delay to the response and/or increases infrastructure costs.

Figure 1: Context engineering turns a large context pool (docs, memory, tools, chat history) into a limited large language model (LLM) context window via repeated retrieval and curation.
Best practices for implementing context engineering are still emerging, and iteration counts vary widely by workflow. The concept most fundamental to these benchmark results is directional: iterative retrieval makes latency a multiplier.
Why is vector search performance critical?
Imagine a shopping assistant answering the question, “I need a carry-on backpack under $60 that fits a 15-inch laptop, is water resistant, and can arrive by Friday.”
In production, the assistant rarely issues one vector query and stops. It runs a retrieval loop to build the right context, and each step is typically constrained by filters, like availability, region, shipping promise, brand rules, and policy eligibility.
Step 1: Interpret intent and translate to constraints.
The agent turns the request into structured filters and a semantic query, such as:
- Filters: In stock, deliverable to the user’s postcode, delivery by Friday, price under $60, valid listing
- Vector query: “Carry-on backpack 15-inch laptop water resistant”
Step 2: Retrieve candidates, and then refine.
It often repeats retrieval with variations to avoid missing good matches:
- “travel backpack carry on laptop sleeve”
- “water resistant commuter backpack 15 inch”
- “lightweight cabin backpack”
Each query uses the same eligibility filters, because retrieving irrelevant or unavailable items is wasted context.
Step 3: Expand to confirm details and reduce risk.
The agent then retrieves again to verify key attributes that affect the final answer:
- Material and water resistance wording
- Dimensions and laptop compartment fit
- Return policy or warranty constraints
- Alternate options if inventory is low
This is multistep context engineering: Retrieve, reason, retrieve, assemble.
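The steps above can be sketched in Python. This is an illustrative assumption, not the article's actual agent or benchmark code: the field names, the `embed()` stub, and the query shape are all hypothetical stand-ins.

```python
# Hedged sketch of the retrieve -> reason -> retrieve loop described above.
# Field names ("embedding", "in_stock", etc.) and embed() are illustrative
# assumptions, not the actual benchmark or agent code.

def embed(text: str) -> list[float]:
    """Stand-in for a real embedding model; returns a 128-dim vector."""
    return [0.0] * 128

def build_query(query_vector, filters, size=100, num_candidates=500):
    """Combine a semantic kNN clause with hard eligibility filters."""
    return {
        "size": size,
        "knn": {
            "field": "embedding",          # assumed vector field name
            "query_vector": query_vector,
            "k": size,
            "num_candidates": num_candidates,
            "filter": {"bool": {"filter": filters}},
        },
    }

# Step 1: the same eligibility filters apply to every retrieval in the loop.
filters = [
    {"term": {"in_stock": True}},
    {"range": {"price": {"lt": 60}}},
    {"term": {"ships_by_friday": True}},
]

# Step 2: retry with query variants so good matches aren't missed.
variants = [
    "carry-on backpack 15-inch laptop water resistant",
    "travel backpack carry on laptop sleeve",
    "water resistant commuter backpack 15 inch",
]
queries = [build_query(embed(v), filters) for v in variants]
```

Each iteration of the loop issues another filtered kNN query like these, which is why per-query latency compounds across the session.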
Why latency and recall matter for context engineering
These interactions can involve dozens of filtered retrieval calls per user session. That makes per-call latency a direct multiplier on end-to-end response time, and low recall forces extra retries or causes the agent to miss eligible items, degrading answer quality.
Takeaway: In context-engineered systems, filtered approximate nearest neighbors (ANN) isn’t a single lookup. It’s a repeated operation under constraints, so vector search performance shows up immediately in latency, throughput, and cost, even when the large language model (LLM) is the most visible component.
Benchmarking
Results
In Graph 2, each dot represents one test configuration. The best results appear toward the top left, meaning higher recall with lower latency. Elasticsearch’s results are consistently closer to the top left than OpenSearch’s, indicating better speed and accuracy under the same workload settings.

Graph 2: Recall versus average latency, rescore of 1.
How to read the results
- `s_n_r_value`: Shorthand for `size_numCandidates_rescoreOversample` (k was set equal to numCandidates in these tests); for example, `100_500_1` means size=100, numCandidates=500 and k=500, rescore oversample=1
- Recall: Measured Recall@100 for that configuration
- Avg latency (ms): Average end-to-end latency per query
- Throughput: Queries per second
- Recall %: Relative recall lift of Elasticsearch over OpenSearch, computed as (Elasticsearch − OpenSearch) / OpenSearch
- Latency Xs: OpenSearch average latency divided by Elasticsearch average latency
- Throughput Xs: Elasticsearch throughput divided by OpenSearch throughput
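As a worked example, the derived columns for the first table row (100_250_1) follow directly from these definitions:

```python
# Reproducing the derived columns for the 100_250_1 row,
# using the measured values from the table below.
es = {"recall": 0.7704, "latency_ms": 25.0, "throughput": 534.75}
os_ = {"recall": 0.7023, "latency_ms": 57.08, "throughput": 279.58}

recall_pct = (es["recall"] - os_["recall"]) / os_["recall"] * 100
latency_x = os_["latency_ms"] / es["latency_ms"]
throughput_x = es["throughput"] / os_["throughput"]

print(f"Recall %: {recall_pct:.2f}")         # 9.70
print(f"Latency Xs: {latency_x:.2f}")        # 2.28
print(f"Throughput Xs: {throughput_x:.2f}")  # 1.91
```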
| Engine | `s_n_r_value` | Recall | Avg Latency (ms) | Throughput | Recall % | Latency Xs | Throughput Xs |
|---|---|---|---|---|---|---|---|
| Elasticsearch | 100_250_1 | 0.7704 | 25 | 534.75 | 9.70% | 2.28 | 1.91 |
| OpenSearch | 100_250_1 | 0.7023 | 57.08 | 279.58 | |||
| Elasticsearch | 100_500_1 | 0.8577 | 25.42 | 524.14 | 7.20% | 2.4 | 2 |
| OpenSearch | 100_500_1 | 0.8001 | 60.9 | 262.12 | |||
| Elasticsearch | 100_750_1 | 0.8947 | 29.67 | 528.09 | 5.72% | 2.25 | 2.21 |
| OpenSearch | 100_750_1 | 0.8463 | 66.76 | 239.11 | |||
| Elasticsearch | 100_1000_1 | 0.9156 | 29.65 | 534.5 | 4.66% | 2.46 | 2.44 |
| OpenSearch | 100_1000_1 | 0.8748 | 72.88 | 219.01 | |||
| Elasticsearch | 100_1500_1 | 0.9386 | 31.84 | 497.3 | 3.38% | 2.71 | 2.68 |
| OpenSearch | 100_1500_1 | 0.9079 | 86.16 | 185.4 | |||
| Elasticsearch | 100_2000_1 | 0.9507 | 34.69 | 457.2 | 2.57% | 2.98 | 2.96 |
| OpenSearch | 100_2000_1 | 0.9269 | 103.36 | 154.55 | |||
| Elasticsearch | 100_2500_1 | 0.9582 | 37.9 | 418.43 | 1.99% | 3.28 | 3.26 |
| OpenSearch | 100_2500_1 | 0.9395 | 124.29 | 128.53 | |||
| Elasticsearch | 100_3000_1 | 0.9636 | 41.86 | 379.4 | 1.62% | 3.46 | 3.44 |
| OpenSearch | 100_3000_1 | 0.9482 | 144.67 | 110.34 | |||
| Elasticsearch | 100_4000_1 | 0.9705 | 50.28 | 316.21 | 1.06% | 3.87 | 3.85 |
| OpenSearch | 100_4000_1 | 0.9603 | 194.36 | 82.22 | |||
| Elasticsearch | 100_5000_1 | 0.9749 | 58.77 | 270.91 | 0.73% | 4.43 | 4.41 |
| OpenSearch | 100_5000_1 | 0.9678 | 260.33 | 61.38 | |||
| Elasticsearch | 100_6000_1 | 0.9781 | 66.75 | 238.59 | 0.52% | 4.91 | 4.89 |
| OpenSearch | 100_6000_1 | 0.973 | 327.44 | 48.81 | |||
| Elasticsearch | 100_7000_1 | 0.9804 | 74.64 | 213.49 | 0.38% | 5.28 | 5.27 |
| OpenSearch | 100_7000_1 | 0.9767 | 394.24 | 40.53 | |||
| Elasticsearch | 100_8000_1 | 0.9823 | 82.28 | 193.59 | 0.27% | 6.86 | 6.83 |
| OpenSearch | 100_8000_1 | 0.9797 | 564.14 | 28.33 | |||
| Elasticsearch | 100_9000_1 | 0.9837 | 90.08 | 176.96 | 0.16% | 7.63 | 7.61 |
| OpenSearch | 100_9000_1 | 0.9821 | 687.25 | 23.25 | |||
| Elasticsearch | 100_10000_1 | 0.9848 | 97.64 | 163.31 | 0.08% | 8.38 | 8.36 |
| OpenSearch | 100_10000_1 | 0.984 | 818.64 | 19.53 |
For example, at 100_9000_1, OpenSearch averages 687 milliseconds per retrieval versus 90 milliseconds for Elasticsearch. In a 10-step retrieval loop, that's about 10 × (687 − 90) ≈ 5,970 milliseconds, or roughly six seconds of additional waiting time.
See the full results.
Methodology
We used a Python client to send the queries to each engine and to track response timing and other statistics. Bear in mind that the performance of any vector search engine depends on how you tune its core parameters: how many candidates to consider, how aggressively to rescore, and how much context to return. These settings directly affect both recall (the likelihood of finding the right answer) and latency (how fast you get results).
In our benchmarks, we used the same candidate, rescore, and result-size settings you’d typically tune in an agentic retrieval loop, and we measured how Elasticsearch performs under that workload. We then ran OpenSearch with the same settings as a reference.
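The measurement loop can be sketched as follows. This is a hedged sketch, not the actual benchmark client (which lives in the repository referenced in this post): `run_query` and the ground-truth lists are hypothetical stand-ins, and the throughput estimate assumes a single sequential client.

```python
import statistics
import time

# Hedged sketch of a benchmark client: it times each query end to end
# and computes Recall@100 against precomputed exact nearest neighbors.
# run_query() and ground_truth are stand-ins for the real harness.

def measure(run_query, queries, ground_truth, top_k=100):
    latencies, recalls = [], []
    for i, q in enumerate(queries):
        start = time.perf_counter()
        hit_ids = run_query(q, size=top_k)   # returns ranked doc ids
        latencies.append((time.perf_counter() - start) * 1000)  # ms
        relevant = set(ground_truth[i][:top_k])
        recalls.append(len(relevant & set(hit_ids)) / top_k)
    avg_ms = statistics.mean(latencies)
    return {
        "avg_latency_ms": avg_ms,
        "recall_at_100": statistics.mean(recalls),
        "throughput_qps": 1000 / avg_ms,     # single sequential client
    }
```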
OpenSearch
"size": <RESULT_SIZE>: Number of hits returned to the client. In this benchmark, result size is 100 to compute Recall@100."k": <NUMBER_OF_CANDIDATES>: The number of nearest neighbor candidates."ef_search": <NUMBER_OF_CANDIDATES>: The number of vectors to examine."oversample_factor": <OVERSAMPLE>: How many candidate vectors are retrieved before rescoring.
Elasticsearch
"size": <RESULT_SIZE>: Number of hits returned to the client. In this benchmark, result size is 100 to compute Recall@100."k": <NUMBER_OF_CANDIDATES>: Number of nearest neighbors to return from each shard."num_candidates": <NUMBER_OF_CANDIDATES>: Number of nearest neighbor candidates to consider per shard while doingknnsearch."oversample": <OVERSAMPLE>: How many candidate vectors are retrieved before rescoring.
Example
A kNN query for the 100_500_1 configuration would be as follows:
OpenSearch
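A sketch of the 100_500_1 OpenSearch request body as a Python dict. The vector field name (`embedding`) and the filter clause are illustrative assumptions; the actual request bodies used are in the repository referenced in this post.

```python
# Hedged sketch of the 100_500_1 OpenSearch request body.
# "embedding" and the "valid" filter are assumed names, not the
# benchmark's actual field names.
query_vector = [0.1] * 128  # stand-in for a real 128-dim embedding

opensearch_body = {
    "size": 100,  # RESULT_SIZE, to compute Recall@100
    "query": {
        "knn": {
            "embedding": {
                "vector": query_vector,
                "k": 500,  # NUMBER_OF_CANDIDATES
                "method_parameters": {"ef_search": 500},
                "rescore": {"oversample_factor": 1},
                "filter": {"term": {"valid": True}},  # eligibility filter
            }
        }
    },
}
```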
Elasticsearch
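A sketch of the equivalent 100_500_1 Elasticsearch request body as a Python dict, under the same assumptions about field names:

```python
# Hedged sketch of the 100_500_1 Elasticsearch request body.
# "embedding" and the "valid" filter are assumed names, not the
# benchmark's actual field names.
query_vector = [0.1] * 128  # stand-in for a real 128-dim embedding

elasticsearch_body = {
    "size": 100,  # RESULT_SIZE, to compute Recall@100
    "knn": {
        "field": "embedding",
        "query_vector": query_vector,
        "k": 500,  # NUMBER_OF_CANDIDATES
        "num_candidates": 500,
        "filter": {"term": {"valid": True}},  # eligibility filter
        "rescore_vector": {"oversample": 1},
    },
}
```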
The full configuration, alongside Terraform scripts, Kubernetes manifests, and the benchmarking code, is available in this repository in the es-9.3-vs-os-3.5-vector-search folder.
Cluster setup
We ran our tests on six e2-standard-16 cloud servers, each with 16 vCPUs and 64 GB RAM. On each server, we allocated 15 vCPUs and 56 GB RAM to each Kubernetes pod running the search engine node, with 28 GB reserved for the JVM heap.
The clusters ran Elasticsearch 9.3.0 and OpenSearch 3.5.0 (Lucene 10.3.2). Because both systems use the same Lucene version in this benchmark, the throughput and latency differences we observe cannot be attributed to Lucene alone and instead reflect differences in how each engine integrates and executes filtered k-nearest neighbor (kNN) retrieval and rescoring. We used a single index with three primary shards and one replica (so 6 shards total, 1 per node).
We also used a separate server in the same region to run the benchmark client and collect timing statistics.

Figure 2: Diagram of the setup of the clusters.
The dataset
For this benchmark, we used a large-scale ecommerce-style catalog embedding dataset with 20 million documents, designed to reflect real-world filtered vector retrieval at scale.
Each document represents a catalog item and includes:
- A 128-dimensional dense vector embedding used for approximate kNN retrieval.
- Structured metadata fields used for filtering (for example, item validity and availability plus other catalog constraints) enabling the common production pattern of retrieving the nearest neighbors but only within an eligible subset.
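A single catalog document might look like the following sketch. The exact field names are illustrative assumptions; the dataset only specifies a 128-dimensional dense vector plus filterable metadata.

```python
# Hedged sketch of one catalog document in the benchmark dataset.
# Field names are illustrative assumptions.
doc = {
    "embedding": [0.02] * 128,  # 128-dimensional dense vector for kNN
    "valid": True,              # item validity flag used for filtering
    "in_stock": True,           # availability constraint
    "price": 54.99,             # example catalog attribute
}
```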
We chose this dataset because it captures the core performance challenge we see in agentic and RAG-style systems in production: vector similarity alone is not enough; retrieval is frequently constrained by filters, and the system must maintain high recall while keeping latency low under those constraints. Compared to smaller QA-style datasets, a 20M document corpus also better reflects the scale and candidate pressure that filtered ANN systems face in practice.
Conclusion
In modern AI architectures, especially those built around context engineering, vector search speed isn’t a minor implementation detail. It’s a multiplier. When agents and workflows iterate through retrieve → reason → retrieve, retrieval performance directly shapes end-to-end latency, throughput, and the quality of the context fed into the model.
In our benchmarks, Elasticsearch consistently delivered higher recall at lower latency than OpenSearch in scenarios where correctness depends on retrieving the right document, not just a similar vector. On a controlled dataset, the difference is clear, and in production those gains accumulate across large volumes of retrieval calls, improving responsiveness, increasing capacity headroom, and reducing infrastructure costs.




