Agent Builder is now generally available (GA). Get started with an Elastic Cloud Trial, and check out the documentation for Agent Builder here.
Mitr is a voice banking agent that understands Hindi, Marathi, English and 19 other Indian languages, switching mid-sentence, mid-call, without being told to. Behind a single voice turn: Sarvam AI transcribes and translates the caller's question, Elastic Agent Builder verifies their identity and queries their private transaction ledger, and Sarvam speaks the answer back in the caller's own language. Five HTTP calls. No custom RAG framework. No multilingual indices. The Agent runs on hybrid search using semantic_text plus parameterised ES|QL built on Elastic Agent Builder.
The problem: multilingual voice access to private banking data
For many users, accessing personal financial information (such as transaction history, salary deposits, or pending refunds) is complicated by a persistent language barrier. While users may prefer to interact in their native language, most banking and financial systems operate exclusively in English. This creates a significant gap: users need to inquire about specific, private financial data, but they lack a natural way to do so in the languages they are most comfortable speaking. A truly effective solution must bridge this gap, enabling secure, real-time access to private transaction data through natural conversation, regardless of the language the user chooses.
Now layer on the part that makes India distinct: the caller may be most comfortable in Hindi, Marathi, Tamil, or any of two dozen languages, and may switch between them mid-sentence. The data that answers their question, meanwhile, lives in systems that speak English and structured queries. And before you read out a single rupee of balance, you have to verify who's calling and never cross customer boundaries.
So a genuinely useful voice agent has to do three hard things at once:
- Understand and respond in the caller's language and keep up when they switch mid-call.
- Reach into private, sensitive, per-customer transaction data: their real ledger, not just FAQ articles.
- Stay secure; verify identity before disclosing anything, and scope every query to exactly one customer.
I built exactly that as an open demo: "Mitr" (“Friend” in Hindi) for a fictional "Pratham Bank" with the headline scenario being everyday money questions, plus UPI-fraud reporting. It's the perfect stress test because it forces all three requirements together, and it's deeply relatable across the country.
What is Sarvam AI and why use it for Indian language voice agents?
Sarvam AI is an Indian AI company that provides speech-to-text, translation, and text-to-speech APIs purpose-built for 22 Indian languages.
If you build for Indian users, Sarvam AI is worth knowing. Founded in 2023 in Bengaluru by Vivek Raghavan and Pratyush Kumar, both with roots in AI4Bharat, Sarvam has become one of the most prominent names in India's "sovereign AI" push and was selected under the IndiaAI Mission to build a foundational model for the country.
What makes Sarvam the right fit here isn't the headline models, though; it's the speech-and-language stack purpose-built for Indian languages, exposed as clean REST APIs. We use three of them:
- Saaras (saaras:v3): Speech-to-text across 22 Indian languages plus English, with automatic language detection. That detection is the quiet hero of any "switch languages on the fly" experience: the caller never has to declare a language, the model just figures it out per utterance.
- Mayura (mayura:v1): Translation tuned for Indian and code-mixed language (the English-sprinkled way people actually talk).
- Bulbul (bulbul:v3): Text-to-speech with 30+ natural-sounding Indian voices and accents.
All three are simple HTTP calls — POST /speech-to-text, POST /translate, POST /text-to-speech against https://api.sarvam.ai, authenticated with an api-subscription-key header.
Sarvam offers various integration methods, including HTTP, WebSocket, and Batch processing, allowing you to select the approach that best fits your specific technical requirements.
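To give a feel for how thin these calls are, here is a minimal sketch that builds (but does not send) a translate request. The host, path, and api-subscription-key header come from the description above; the JSON field names (input, source_language_code, target_language_code, model) and the language codes are assumptions based on my reading of Sarvam's API reference and should be checked there. build_translate_request is a hypothetical helper name.

```python
import json
import urllib.request

SARVAM_BASE_URL = "https://api.sarvam.ai"  # host named in the article


def build_translate_request(text, source_lang, target_lang, api_key):
    """Build (but do not send) a POST /translate request for Mayura.

    The JSON field names below are assumptions; verify them against
    Sarvam's API reference before relying on them.
    """
    body = {
        "input": text,
        "source_language_code": source_lang,  # e.g. "hi-IN" (assumed code format)
        "target_language_code": target_lang,  # e.g. "en-IN"
        "model": "mayura:v1",
    }
    return urllib.request.Request(
        url=f"{SARVAM_BASE_URL}/translate",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "api-subscription-key": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_translate_request("मेरी सैलरी आई क्या?", "hi-IN", "en-IN", "YOUR_KEY")
# Sending it would be urllib.request.urlopen(req), then parsing the JSON reply.
```

The speech-to-text and text-to-speech calls would follow the same pattern against their own paths, with the audio sent or returned instead of plain text.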
Western STT/TTS stacks tend to treat Indian languages as an afterthought; Sarvam treats them as the main event, and for a banking helpline aimed at the next 500 million users, that difference is the whole game.
How does the multilingual voice agent architecture work?

The whole system rests on a single design decision that keeps everything else simple:
The caller is always heard and answered in their own language, and Elasticsearch is always queried in English, via the agent, to get the right context. Sarvam translates at both ends, so the data and reasoning layer stays language-agnostic; you never have to make your indices, tools, or agent prompt multilingual. That said, the Jina v5 embedding model we use for semantic search is itself multilingual.
Here's one voice turn, end to end, orchestrated by a thin FastAPI backend that is little more than five HTTP calls in sequence:
- Audio Input: The process begins with the microphone capturing the user's spoken audio.
- Speech-to-Text (STT) & Language Detection: The audio is sent to the Sarvam AI Saaras (v3) model. This step transcribes the speech and automatically identifies the language used by the caller.
- Translation to English: The transcript is passed to the Sarvam Mayura (v1) translation model, which converts the user's speech into English. This ensures the reasoning layer stays language-agnostic.
- Reasoning & Tool Execution: The English transcript is sent to the Elastic Agent Builder's POST /api/agent_builder/converse endpoint. The agent verifies the user's identity, retrieves necessary information via tools (such as querying transaction data or knowledge bases), and generates a response.
- Translation to Native Language & Text-to-Speech (TTS):
  - The generated English response is translated back into the caller's original language using the Sarvam Mayura model.
  - Finally, the translated text is sent to the Sarvam Bulbul (v3) model, which converts it into a natural-sounding audio (WAV) file to be played back to the user.
The backend (a Python app) holds no business logic of its own. It doesn't decide which tool to call, how to phrase an answer, or whether the caller is verified; all of that lives inside the Agent Builder agent. That separation is deliberate: the orchestrator stays trivial and stateless-per-turn, while the "brain" is configured declaratively in Elastic.
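The five-call turn described above can be sketched as a single function. This is an illustration under stated assumptions, not the demo's actual code: handle_turn, TurnResult, and the four injected callables are hypothetical names, and each callable stands in for one HTTP call (Saaras, Mayura, Agent Builder, Bulbul). The "en-IN" code for English is likewise assumed.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class TurnResult:
    language: str       # language code detected by speech-to-text
    question_en: str    # what the agent was asked, in English
    answer_native: str  # what is spoken back to the caller
    audio: bytes        # synthesized speech


def handle_turn(
    audio_in: bytes,
    stt: Callable[[bytes], Tuple[str, str]],    # audio -> (transcript, language)
    translate: Callable[[str, str, str], str],  # (text, source, target) -> text
    converse: Callable[[str], str],             # English question -> English answer
    tts: Callable[[str, str], bytes],           # (text, language) -> audio
) -> TurnResult:
    """One voice turn: five calls in sequence, no business logic."""
    transcript, lang = stt(audio_in)                      # 1. Saaras
    question_en = translate(transcript, lang, "en-IN")    # 2. Mayura, to English
    answer_en = converse(question_en)                     # 3. Agent Builder
    answer_native = translate(answer_en, "en-IN", lang)   # 4. Mayura, back to caller
    audio_out = tts(answer_native, lang)                  # 5. Bulbul
    return TurnResult(lang, question_en, answer_native, audio_out)
```

Passing the five steps in as callables keeps the orchestrator free of any decision-making and makes it easy to exercise with stubs before wiring in real HTTP clients.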
How to structure Elasticsearch indices for a voice agent
We use three indices, split by what the data is and how it's queried — and that split drives the entire agent design.
| Index | Data type | Query method | Use case |
|---|---|---|---|
| bank-support-kb | General knowledge | Hybrid BM25 + semantic | "How do I report fraud?" |
| bank-transactions | Private structured | Parameterised ES\|QL | "Did my salary come in?" |
| bank-customers | Private structured | Parameterised ES\|QL | Identity verification (DOB) |
Index 1. Generic knowledge (semantic search) - bank-support-kb. Loan rates, fees, fraud procedures, refund, and pension timelines; the content people ask about fuzzily and in many phrasings. We store it as semantic_text, so hybrid keyword + semantic search works with zero embedding code. The trick is copy_to: a plain text field is mirrored into a semantic_text field bound to an inference endpoint, so you get BM25 and vector recall over the same content.
Embeddings are generated at ingest by the referenced inference endpoint (jina-embeddings-v5-text-small). There's no embedding pipeline to run, no vector store to operate, no chunking glue to maintain. You write documents in; you query natural language out; Elasticsearch handles the BM25 + vector fusion.
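As a rough sketch of the copy_to pattern, the mapping and a hybrid query could look like the following, written as Python dicts. The field names (title, body, body_semantic) are hypothetical, the inference endpoint id is assumed to match the model name given above, and reciprocal rank fusion (RRF) is just one way to combine the two result sets; the article does not say which fusion the demo uses.

```python
KB_INDEX = "bank-support-kb"
# Assumption: the inference endpoint id matches the model name from the article.
EMBEDDING_ENDPOINT = "jina-embeddings-v5-text-small"

# Index mapping: the plain text field is mirrored into a semantic_text field.
KB_MAPPING = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "body": {"type": "text", "copy_to": "body_semantic"},
            "body_semantic": {
                "type": "semantic_text",
                "inference_id": EMBEDDING_ENDPOINT,
            },
        }
    }
}


def hybrid_query(question: str, size: int = 3) -> dict:
    """Search body: BM25 on `body` fused with semantic search on `body_semantic`."""
    return {
        "size": size,
        "retriever": {
            "rrf": {
                "retrievers": [
                    {"standard": {"query": {"match": {"body": question}}}},
                    {
                        "standard": {
                            "query": {
                                "semantic": {
                                    "field": "body_semantic",
                                    "query": question,
                                }
                            }
                        }
                    },
                ]
            }
        },
    }
```

In the demo the agent's index_search tool performs the retrieval, so a hand-written query like this is mainly useful for checking the index from Dev Tools or a script.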
Index 2 & 3. Private data (exact, structured).
bank-transactions is a per-customer ledger: salary, EMI, UPI, ATM, refunds, pension, each row carrying amount, channel, counterparty, status, and a running balance.
bank-customers holds one profile per customer, including dob for verification. These are structured records where precision matters, so they're queried with ES|QL, always scoped to a single customer_id. You don't want fuzzy semantics deciding whether a ₹14,999 debit happened. You want an exact, deterministic query.
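To make "exact and deterministic" concrete, here is a minimal sketch of a request body for the ES|QL query API that is always scoped to one customer through a named parameter. The column names (txn_date, balance, and so on) are assumptions modelled on the fields listed above, and transactions_query is a hypothetical helper.

```python
def transactions_query(customer_id: str, limit: int = 10) -> dict:
    """Build an ES|QL request body scoped to exactly one customer.

    The customer id travels as a named parameter, never by string
    concatenation, so the query text itself is fixed.
    """
    if not customer_id:
        raise ValueError("customer_id is required")
    return {
        "query": (
            "FROM bank-transactions "
            "| WHERE customer_id == ?customer_id "
            "| SORT txn_date DESC "  # txn_date is an assumed field name
            "| KEEP txn_date, amount, channel, counterparty, status, balance "
            f"| LIMIT {int(limit)}"
        ),
        "params": [{"customer_id": customer_id}],
    }


body = transactions_query("C001", limit=5)
```

The same body always yields the same filter, which is the property you want when the answer is whether a specific debit happened.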
How to configure Elastic Agent Builder tools for private data
Elastic Agent Builder lets you interact with an agent grounded in your data, with no separate vector database, RAG framework, or orchestration code. In this scenario, we give the agent specific tools so that it answers accurately: you add the tools and an instruction prompt, and Elastic runs the reasoning loop, calls the tools, and streams back an answer. Our agent gets three tools, and the split mirrors the data model exactly:
- bank_kb_search: An index_search tool that runs hybrid BM25 plus semantic search over the knowledge base (for "how do I report fraud?").
- customer_profile: A parameterised ES|QL tool used to verify DOB and check account status.
- customer_transactions: A parameterised ES|QL tool for the money questions ("did my salary come?", "what was that debit?").
The ES|QL tools are guardrailed by construction: the LLM controls only a typed parameter, never the query shape. It can't widen the WHERE clause, it can't drop the customer_id filter (which acts much like a primary key on every lookup), and it can't ask for another customer's rows.
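A parameterised tool of this kind might be defined roughly as below. This is a sketch under assumptions: the payload shape (type "esql", with the query and its typed params under configuration) reflects my understanding of the Agent Builder tools API and should be verified against the documentation, and the dob and account_status column names are guesses based on the description of the customers index.

```python
# Hypothetical definition for the customer_profile tool.
CUSTOMER_PROFILE_TOOL = {
    "id": "customer_profile",
    "type": "esql",
    "description": "Look up one customer's profile to verify date of birth "
                   "and check account status.",
    "configuration": {
        # The query shape is fixed here; the model cannot edit it.
        "query": (
            "FROM bank-customers "
            "| WHERE customer_id == ?customer_id "
            "| KEEP customer_id, dob, account_status "  # assumed field names
            "| LIMIT 1"
        ),
        # The only surface the model fills in: one typed parameter.
        "params": {
            "customer_id": {
                "type": "keyword",
                "description": "The authenticated caller's customer id.",
            }
        },
    },
}


def llm_controlled_params(tool: dict) -> list:
    """Return the parameter names the model is allowed to supply."""
    return sorted(tool["configuration"]["params"])
```

Listing the parameters this way is a quick review aid: if anything other than customer_id shows up for a private-data tool, the guardrail has been loosened.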
The agent itself is created with POST /api/agent_builder/agents, carrying instructions that encode the behaviour: greet the caller, verify date of birth before disclosing anything, scope every lookup to the authenticated customer_id, and pick the right tool per intent (transactions for "where's my money", KB for "how do I report fraud").
Because that logic lives in Kibana rather than the backend, changing the agent's persona or rules means editing the instructions and re-running the create script. The running server needs no changes; only the agent definition does. You can also edit these instructions directly in Kibana, under Agents.
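For illustration, the body sent to POST /api/agent_builder/agents could be assembled like this. The payload layout (instructions and a tools list under configuration) is an assumption about the API shape, the agent id is made up, and the instruction text is a paraphrase of the behaviours listed above rather than the demo's actual prompt.

```python
# Paraphrased from the behaviours described in the article; not the real prompt.
INSTRUCTIONS = "\n".join([
    "You are Mitr, the voice assistant for Pratham Bank.",
    "Greet the caller.",
    "Verify the caller's date of birth before disclosing anything.",
    "Scope every lookup to the authenticated customer_id.",
    "Use customer_transactions for money questions and bank_kb_search "
    "for general how-to questions.",
])


def build_agent_payload(agent_id: str, instructions: str, tool_ids: list) -> dict:
    """Assemble a create-agent body (shape is an assumption, check the docs)."""
    return {
        "id": agent_id,
        "name": "Mitr",
        "description": "Multilingual voice banking assistant (demo).",
        "configuration": {
            "instructions": instructions,
            "tools": [{"tool_ids": list(tool_ids)}],
        },
    }


payload = build_agent_payload(
    "mitr-voice-agent",  # hypothetical id
    INSTRUCTIONS,
    ["bank_kb_search", "customer_profile", "customer_transactions"],
)
```

Keeping the instructions in one constant makes a persona or rule change a one-file edit followed by a re-run of the create script.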
How to call Elastic Agent Builder's converse API
At runtime, the backend never talks to the LLM or the indices directly. It sends one English question to Agent Builder's streaming converse endpoint and lets Elastic orchestrate the tool calls and the model.
Three details worth calling out for developers:
- Trusted context, not a tool the LLM can spoof. The authenticated customer_id is injected by the backend into the agent input as trusted context. The model uses it to fill the ES|QL parameter, but it never chooses it, so a caller can't talk their way into another account.
- Identity persistence across turns. The backend keeps a per-caller conversation_id. Once the agent verifies the caller's DOB on turn one, that verification holds for the rest of the call without re-asking. For demo purposes, we’ve just added DOB verification, but we can extend these verification parameters.
- Pluggable reasoning LLM. The inference_id field on the converse body (set from the AGENT_INFERENCE_ID environment variable) selects the chat-completion endpoint the agent reasons with, or you leave it blank for the Agent Builder default. Crucially, this is a different thing from the embeddings inference_id on the semantic_text field. One embeds documents; the other reasons over them.
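These three details can be pictured in one small builder for the converse body. Treat it as a sketch: the field names input, agent_id, and conversation_id are my assumption of the request shape, inference_id follows the description above, and the bracketed trusted-context prefix is an invented format, since the article does not show how the demo phrases it.

```python
from typing import Optional


def build_converse_body(
    question_en: str,
    customer_id: str,                      # from the authenticated session, never the transcript
    agent_id: str,
    conversation_id: Optional[str] = None,
    inference_id: Optional[str] = None,
) -> dict:
    """Assemble the JSON body for one converse call."""
    body = {
        "agent_id": agent_id,
        # Hypothetical prefix format for the backend-injected trusted context.
        "input": f"[trusted context: customer_id={customer_id}]\n{question_en}",
    }
    if conversation_id:
        # Reusing the id keeps the earlier DOB verification in scope.
        body["conversation_id"] = conversation_id
    if inference_id:
        # Left out entirely when blank, so the Agent Builder default applies.
        body["inference_id"] = inference_id
    return body


first_turn = build_converse_body("Did my salary come in?", "C001", "mitr-voice-agent")
later_turn = build_converse_body(
    "What was that debit?", "C001", "mitr-voice-agent",
    conversation_id="conv-123", inference_id="my-chat-endpoint",
)
```

In a running backend the inference_id argument would typically be read from the AGENT_INFERENCE_ID environment variable, and the conversation id would come back from the first response.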
The response is Server-Sent Events; the final answer arrives on the message_complete event's message_content. (Worth surfacing error events explicitly too, or a failed tool call silently degrades into an empty-answer fallback.)
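A minimal sketch of reading that stream follows. It parses generic Server-Sent Events lines, returns the message_content of the message_complete event, and raises on an error event instead of falling through to an empty answer. The exact JSON nesting of each event is an assumption, so the code accepts the content either at the top level or under a data key.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_events(lines: Iterable[str]) -> Iterator[Tuple[Optional[str], dict]]:
    """Yield (event_name, payload) pairs from raw SSE lines."""
    event, data_parts = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line terminates one event
            if data_parts:
                yield event, json.loads("\n".join(data_parts))
            event, data_parts = None, []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_parts.append(line[len("data:"):].strip())
    if data_parts:  # stream ended without a trailing blank line
        yield event, json.loads("\n".join(data_parts))


def final_answer(lines: Iterable[str]) -> Optional[str]:
    """Return the final answer text, or None if the stream never completed."""
    for event, payload in iter_events(lines):
        if event == "error":
            # Surface failures explicitly rather than returning an empty answer.
            raise RuntimeError(f"agent error event: {payload}")
        if event == "message_complete":
            inner = payload.get("data", payload)  # nesting is an assumption
            if isinstance(inner, dict):
                return inner.get("message_content")
    return None
```

With an HTTP client that exposes the response line by line, the iterator of decoded lines can be passed straight to final_answer.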
See it in action
In the demo, a caller opens in English, gives his date of birth in Hindi, asks "home loan EMI debited?" in Hindi, then switches to Marathi for the rest of the conversation. Behind that single call, the agent verifies his DOB and confirms his home loan status and Flipkart refund through the ES|QL tool.
In another demo, "Money debited without consent", the agent responds with the steps to take (contacting India's 1930 cyber-crime helpline and blocking the UPI ID), all pulled from the semantic knowledge base. Two tools, three languages, one conversation. Watching the same agent glide across languages mid-call is the moment the Sarvam + Elastic combination clicks.
Taking a multilingual voice agent to production
This is a demo. All data is synthetic and "Pratham Bank" is fictional, but the path to production is mostly about hardening, not redesign. For a real deployment, you'd enforce caller scoping with Elasticsearch document-level security / RBAC per authenticated user rather than trusting a passed customer_id; add real authentication and a telephony or WebRTC layer in front of the browser mic; and keep secrets in a managed store rather than a .env file. What carries forward unchanged is the part that matters: the semantic_text + parameterised-ES|QL split, the verify-first identity flow, and the query contract.
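As one possible shape for the document-level security step, a role could restrict reads to documents whose customer_id matches a value stored in the authenticated user's metadata, using a templated role query. This sketch assumes each caller maps to an Elasticsearch user carrying a customer_id metadata entry and that queries run with that user's credentials; neither is part of the demo, and customer_scoped_role is a hypothetical helper.

```python
def customer_scoped_role(index_names: list) -> dict:
    """Role body for the Elasticsearch create-role API with a templated DLS query.

    Assumes the authenticated user's metadata contains `customer_id`.
    """
    return {
        "indices": [
            {
                "names": list(index_names),
                "privileges": ["read"],
                # Mustache template resolved per user at query time.
                "query": {
                    "template": {
                        "source": {
                            "term": {
                                "customer_id": "{{_user.metadata.customer_id}}"
                            }
                        }
                    }
                },
            }
        ]
    }


role = customer_scoped_role(["bank-transactions", "bank-customers"])
```

With a role like this, the customer filter is enforced by the cluster even if a query omits it, which is the shift from trusting a passed customer_id to enforcing it.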
Run it yourself
The whole stack is one Docker workflow with three stages: teardown, setup (create indices, ingest data, build the agent), and serve. It is driven by a .env file with your Elastic and Sarvam credentials.
Bring an Elastic deployment with Agent Builder enabled and a text-embedding inference endpoint, drop in a Sarvam API key, and you're talking to Mitr in minutes.
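A backend like this usually validates its settings once at startup. The sketch below shows one way to do that; only AGENT_INFERENCE_ID is named in the article, while KIBANA_URL, ELASTIC_API_KEY, and SARVAM_API_KEY are hypothetical variable names, so check the repo's sample .env for the real ones.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Hypothetical variable names; the repo's .env may use different ones.
REQUIRED = ("KIBANA_URL", "ELASTIC_API_KEY", "SARVAM_API_KEY")


@dataclass(frozen=True)
class Settings:
    kibana_url: str
    elastic_api_key: str
    sarvam_api_key: str
    agent_inference_id: Optional[str]  # blank means "use the Agent Builder default"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment and fail fast if any are missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing settings: " + ", ".join(missing))
    return Settings(
        kibana_url=env["KIBANA_URL"].rstrip("/"),
        elastic_api_key=env["ELASTIC_API_KEY"],
        sarvam_api_key=env["SARVAM_API_KEY"],
        agent_inference_id=env.get("AGENT_INFERENCE_ID") or None,
    )
```

Failing at startup with a list of missing names is friendlier than discovering a blank API key on the first voice turn.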
The code, full setup, and demo are in this repo.
Takeaways
- Let each platform play to its strength. Sarvam owns voice and Indian languages; Elastic Agent Builder owns reasoning-over-data. The glue between them is a thin, almost logic-free orchestrator.
- "Answer in the user's language" is a one-line contract that keeps a multilingual product maintainable; your indices, tools, and prompts never go multilingual.
- semantic_text + parameterised ES|QL tools cover the two halves of any real assistant (semantic context and exact private records) with no extra infrastructure to stand up.
- Verify first, scope always. Identity checks and per-customer scoping, enforced in the tool definition rather than trusted to the model, are what separate a demo from something you'd let near real accounts.
All data in the demo is synthetic; "Pratham Bank" is fictional. Built with Elastic Agent Builder + Sarvam AI.




