Mitr: multilingual banking voice agent with Sarvam AI

Agent Builder is now generally available (GA). Get started with an Elastic Cloud Trial, and check out the Agent Builder documentation.

Mitr is a voice banking agent that understands Hindi, Marathi, English and 19 other Indian languages, switching mid-sentence, mid-call, without being told to. Behind a single voice turn: Sarvam AI transcribes and translates the caller's question, Elastic Agent Builder verifies their identity and queries their private transaction ledger, and Sarvam speaks the answer back in the caller's own language. Five HTTP calls. No custom RAG framework. No multilingual indices. The Agent runs on hybrid search using semantic_text plus parameterised ES|QL built on Elastic Agent Builder.

The problem: multilingual voice access to private banking data

For many users, accessing personal financial information (such as transaction history, salary deposits, or pending refunds) is complicated by a persistent language barrier. While users may prefer to interact in their native language, most banking and financial systems operate exclusively in English. This creates a significant gap: users need to inquire about specific, private financial data, but they lack a natural way to do so in the languages they are most comfortable speaking. A truly effective solution must bridge this gap, enabling secure, real-time access to private transaction data through natural conversation, regardless of the language the user chooses.

Now layer on the part that makes India distinct: the caller may be most comfortable in Hindi, Marathi, Tamil, or any of two dozen languages, and may switch between them mid-sentence. The data that answers their question, meanwhile, lives in systems that speak English and structured queries. And before you read out a single rupee of balance, you have to verify who's calling and never cross customer boundaries.

So a genuinely useful voice agent has to do three hard things at once:

Understand and respond in the caller's language, and keep up when they switch mid-call.
Reach into private, sensitive, per-customer transaction data: their real ledger, not just FAQ articles.
Stay secure: verify identity before disclosing anything, and scope every query to exactly one customer.

I built exactly that as an open demo: "Mitr" ("Friend" in Hindi), a voice agent for the fictional "Pratham Bank". The headline scenario is everyday money questions, plus UPI-fraud reporting. It's the perfect stress test because it forces all three requirements together, and it's deeply relatable across the country.

What is Sarvam AI and why use it for Indian language voice agents?

Sarvam AI is an Indian AI company that provides speech-to-text, translation, and text-to-speech APIs purpose-built for 22 Indian languages.

If you build for Indian users, Sarvam AI is worth knowing. Founded in 2023 in Bengaluru by Vivek Raghavan and Pratyush Kumar, both with roots in AI4Bharat, Sarvam has become one of the most prominent names in India's "sovereign AI" push and was selected under the IndiaAI Mission to build a foundational model for the country.

What makes Sarvam the right fit here isn't the headline models, though; it's the speech-and-language stack purpose-built for Indian languages, exposed as clean REST APIs. We use three of them:

Saaras (saaras:v3): Speech-to-text across 22 Indian languages plus English, with automatic language detection. That detection is the quiet hero of any "switch languages on the fly" experience: the caller never has to declare a language, the model just figures it out per utterance.
Mayura (mayura:v1): Translation tuned for Indian and code-mixed language (the English-sprinkled way people actually talk).
Bulbul (bulbul:v3): Text-to-speech with 30+ natural-sounding Indian voices and accents.

All three are simple HTTP calls (POST /speech-to-text, POST /translate, POST /text-to-speech) against https://api.sarvam.ai, authenticated with an api-subscription-key header.
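
As a rough illustration of those three calls, the sketch below builds each request as a plain dictionary. The base URL, paths, header name, and model IDs come from the text above; the body field names (input, source_language_code, target_language_code, text, file, model) and the helper names are assumptions for illustration, so check them against Sarvam's API reference before relying on them.

```python
SARVAM_BASE_URL = "https://api.sarvam.ai"


def sarvam_headers(api_key: str) -> dict:
    """Auth header named in the article."""
    return {"api-subscription-key": api_key}


def stt_request(audio_wav: bytes, api_key: str) -> dict:
    """POST /speech-to-text (Saaras). Multipart upload; field names are assumed."""
    return {
        "method": "POST",
        "url": f"{SARVAM_BASE_URL}/speech-to-text",
        "headers": sarvam_headers(api_key),
        "files": {"file": ("turn.wav", audio_wav, "audio/wav")},
        "data": {"model": "saaras:v3"},  # no language given: rely on auto-detection
    }


def translate_request(text: str, source_lang: str, target_lang: str, api_key: str) -> dict:
    """POST /translate (Mayura). JSON body; field names are assumed."""
    return {
        "method": "POST",
        "url": f"{SARVAM_BASE_URL}/translate",
        "headers": sarvam_headers(api_key),
        "json": {
            "input": text,
            "source_language_code": source_lang,
            "target_language_code": target_lang,
            "model": "mayura:v1",
        },
    }


def tts_request(text: str, target_lang: str, api_key: str) -> dict:
    """POST /text-to-speech (Bulbul). JSON body; field names are assumed."""
    return {
        "method": "POST",
        "url": f"{SARVAM_BASE_URL}/text-to-speech",
        "headers": sarvam_headers(api_key),
        "json": {"text": text, "target_language_code": target_lang, "model": "bulbul:v3"},
    }
```

Each dictionary uses the keyword names accepted by common Python HTTP clients, so it could be sent with something like requests.request(**spec). Keeping request construction separate from sending also makes the orchestrator easy to test without network access.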

Sarvam offers various integration methods, including HTTP, WebSocket, and Batch processing, allowing you to select the approach that best fits your specific technical requirements.

Western STT/TTS stacks tend to treat Indian languages as an afterthought; Sarvam treats them as the main event, and for a banking helpline aimed at the next 500 million users, that difference is the whole game.

How does the multilingual voice agent architecture work?

The whole system rests on a single design decision that keeps everything else simple:

The caller is always heard and answered in their own language, and Elasticsearch is always queried in English, via the agent, to get the right context. Sarvam translates at both ends, so the data and reasoning layer stays language-agnostic; you never have to make your indices, tools, or agent prompt multilingual. As it happens, the Jina v5 embedding model we use for semantic search also supports multilingual search, but every query still reaches it in English.

Here's one voice turn, end to end, orchestrated by a thin FastAPI backend that is little more than five HTTP calls in sequence:

Audio Input: The process begins with the microphone capturing the user's spoken audio.
Speech-to-Text (STT) & Language Detection: The audio is sent to the Sarvam AI Saaras (v3) model. This step transcribes the speech and automatically identifies the language used by the caller.
Translation to English: The transcript is passed to the Sarvam Mayura (v1) translation model, which converts the user's speech into English. This ensures the reasoning layer stays language-agnostic.
Reasoning & Tool Execution: The English transcript is sent to the Elastic Agent Builder's POST /api/agent_builder/converse endpoint. The agent verifies the user's identity, retrieves necessary information via tools (such as querying transaction data or knowledge bases), and generates a response.
Translation to Native Language & Text-to-Speech (TTS):
- The generated English response is translated back into the caller's original language using the Sarvam Mayura model.
- Finally, the translated text is sent to the Sarvam Bulbul (v3) model, which converts it into a natural-sounding audio (WAV) file to be played back to the user.
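
The flow above can be sketched as a single function that chains the five calls. This is a minimal illustration, not the demo's actual backend: the function and type names are hypothetical, and each network call is passed in as a callable so the ordering is visible and testable without any credentials.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class TurnResult:
    language: str       # language code detected by speech-to-text
    transcript: str     # what the caller said, in their own language
    answer_en: str      # the agent's answer, in English
    answer_native: str  # the answer translated back for the caller
    audio: bytes        # synthesized speech to play back


def run_voice_turn(
    audio_in: bytes,
    stt: Callable[[bytes], Tuple[str, str]],   # audio -> (transcript, language)
    translate: Callable[[str, str, str], str],  # (text, source, target) -> text
    converse: Callable[[str], str],             # English question -> English answer
    tts: Callable[[str, str], bytes],           # (text, language) -> audio
    pivot_lang: str = "en-IN",                  # assumed code for the English pivot
) -> TurnResult:
    transcript, lang = stt(audio_in)                        # 1. Saaras
    question_en = translate(transcript, lang, pivot_lang)   # 2. Mayura, to English
    answer_en = converse(question_en)                       # 3. Agent Builder
    answer_native = translate(answer_en, pivot_lang, lang)  # 4. Mayura, back to caller
    audio_out = tts(answer_native, lang)                    # 5. Bulbul
    return TurnResult(lang, transcript, answer_en, answer_native, audio_out)
```

Because the language is re-detected on every turn, a caller who switches from Hindi to Marathi mid-call is handled without any extra state in the orchestrator.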

The backend (a Python app) holds no business logic of its own. It doesn't decide which tool to call, how to phrase an answer, or whether the caller is verified; all of that lives inside the Agent Builder agent. That separation is deliberate: the orchestrator stays trivial and stateless-per-turn, while the "brain" is configured declaratively in Elastic.

How to structure Elasticsearch indices for a voice agent

We use three indices, split by what the data is and how it's queried, and that split drives the entire agent design.

Index	Data type	Query method	Use case
bank-support-kb	General knowledge	Hybrid BM25 + semantic	"How do I report fraud?"
bank-transactions	Private structured	Parameterised ES|QL	"Did my salary come in?"
bank-customers	Private structured	Parameterised ES|QL	Identity verification (DOB)

Index 1. General knowledge (semantic search): bank-support-kb. This holds loan rates, fees, fraud procedures, and refund and pension timelines: the content people ask about fuzzily and in many phrasings. We store it as semantic_text, so hybrid keyword + semantic search works with zero embedding code. The trick is copy_to: a plain text field is mirrored into a semantic_text field bound to an inference endpoint, so you get BM25 and vector recall over the same content.

Embeddings are generated at ingest by the referenced inference endpoint (jina-embeddings-v5-text-small). There's no embedding pipeline to run, no vector store to operate, no chunking glue to maintain. You write documents in; you query natural language out; Elasticsearch handles the BM25 + vector fusion.
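
A minimal sketch of that mapping, plus the kind of hybrid query it enables, might look like the following. The field names (title, content, content_semantic) are hypothetical, the inference endpoint ID is taken from the name in the text and may differ in your deployment, and reciprocal rank fusion (RRF) is assumed as the fusion method purely for illustration.

```python
# Endpoint named in the article; the exact ID depends on your deployment.
INFERENCE_ID = "jina-embeddings-v5-text-small"

# Body for: PUT /bank-support-kb
KB_INDEX_BODY = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            # Plain text field for BM25, mirrored into the semantic field.
            "content": {"type": "text", "copy_to": "content_semantic"},
            # Embeddings are produced at ingest by the bound inference endpoint.
            "content_semantic": {"type": "semantic_text", "inference_id": INFERENCE_ID},
        }
    }
}


def hybrid_kb_query(question: str, size: int = 3) -> dict:
    """Body for: POST /bank-support-kb/_search (lexical + semantic, fused with RRF)."""
    return {
        "size": size,
        "retriever": {
            "rrf": {
                "retrievers": [
                    {"standard": {"query": {"match": {"content": question}}}},
                    {
                        "standard": {
                            "query": {
                                "semantic": {"field": "content_semantic", "query": question}
                            }
                        }
                    },
                ]
            }
        },
    }
```

In the demo the agent's search tool issues the retrieval itself, so a hand-written query like this is mainly useful for checking the index directly while you tune the content.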

Indices 2 and 3. Private data (exact, structured).

bank-transactions is a per-customer ledger: salary, EMI, UPI, ATM, refunds, pension, each row carrying amount, channel, counterparty, status, and a running balance.

bank-customers holds one profile per customer, including dob for verification. These are structured records where precision matters, so they're queried with ES|QL, always scoped to a single customer_id. You don't want fuzzy semantics deciding whether a ₹14,999 debit happened. You want an exact, deterministic query.

How to configure Elastic Agent Builder tools for private data

Elastic Agent Builder lets you interact with an agent grounded in your data, with no separate vector database, RAG framework, or orchestration code. In this scenario we give the agent specific tools so that it answers accurately: you add the tools and an instruction prompt; Elastic runs the reasoning loop, calls the tools, and streams back an answer. Our agent gets three tools, and the split mirrors the data model exactly:

bank_kb_search: An index_search tool that runs hybrid BM25 plus semantic search over the knowledge base (for "how do I report fraud?").
customer_profile: A parameterised ES|QL tool used to verify DOB and check account status.
customer_transactions: A parameterised ES|QL tool for the money questions ("did my salary come?", "what was that debit?").
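
The original tool definitions are not reproduced here, so the following is only a sketch of what the two parameterised ES|QL tools could look like as request bodies for the Agent Builder tools API. The payload shape (id, type, description, configuration.query, configuration.params) reflects my understanding of that API and should be checked against the documentation; the column names in the queries (name, account_status, @timestamp, type) are hypothetical, while amount, channel, counterparty, status, balance, dob, and customer_id are named in the text.

```python
# The single typed parameter the model is allowed to fill.
CUSTOMER_ID_PARAM = {
    "customer_id": {
        "type": "keyword",
        "description": "Authenticated customer ID supplied by the backend",
    }
}

CUSTOMER_PROFILE_TOOL = {
    "id": "customer_profile",
    "type": "esql",
    "description": "Fetch one customer's profile to verify date of birth and account status.",
    "configuration": {
        "query": (
            "FROM bank-customers "
            "| WHERE customer_id == ?customer_id "
            "| KEEP customer_id, name, dob, account_status "
            "| LIMIT 1"
        ),
        "params": CUSTOMER_ID_PARAM,
    },
}

CUSTOMER_TRANSACTIONS_TOOL = {
    "id": "customer_transactions",
    "type": "esql",
    "description": "List one customer's most recent transactions.",
    "configuration": {
        "query": (
            "FROM bank-transactions "
            "| WHERE customer_id == ?customer_id "
            "| SORT @timestamp DESC "
            "| KEEP @timestamp, type, amount, channel, counterparty, status, balance "
            "| LIMIT 20"
        ),
        "params": CUSTOMER_ID_PARAM,
    },
}


def is_customer_scoped(tool: dict) -> bool:
    """Sanity check: the query filters on ?customer_id and exposes no other parameter."""
    config = tool["configuration"]
    return "customer_id == ?customer_id" in config["query"] and set(config["params"]) == {"customer_id"}
```

The query text is fixed at definition time and the only hole in it is ?customer_id, which is bound as a value rather than spliced into the query string. A check like is_customer_scoped can run in the setup script to catch a tool that was accidentally defined without the filter.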

The ES|QL tools are guardrailed by construction: the LLM controls only a typed parameter, never the query shape. It can't widen the WHERE clause, can't drop the customer_id filter (which acts much like a primary key), and can't ask for another customer's rows.

The agent itself is created with POST /api/agent_builder/agents, carrying instructions that encode the behaviour: greet the caller, verify date of birth before disclosing anything, scope every lookup to the authenticated customer_id, and pick the right tool per intent (transactions for "where's my money", KB for "how do I report fraud").
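
As a sketch of what that create call might carry, the helper below assembles a request body. The instruction wording is illustrative rather than the demo's actual prompt, the agent ID and display name are hypothetical, and the payload shape (configuration.instructions, configuration.tools with tool_ids) is my understanding of the agents API and should be verified against the documentation.

```python
from typing import List

# Illustrative wording only; it encodes the four behaviours described above.
AGENT_INSTRUCTIONS = """\
You are Mitr, a helpline assistant for Pratham Bank.
1. Greet the caller.
2. Verify the caller's date of birth with customer_profile before disclosing anything.
3. Scope every lookup to the authenticated customer_id given as trusted context.
4. Use customer_transactions for questions about the caller's money,
   and bank_kb_search for general questions such as how to report fraud.
"""


def build_agent_payload(agent_id: str, tool_ids: List[str]) -> dict:
    """Body for: POST /api/agent_builder/agents (field names assumed)."""
    return {
        "id": agent_id,
        "name": "Mitr",
        "description": "Multilingual banking voice agent for Pratham Bank (demo).",
        "configuration": {
            "instructions": AGENT_INSTRUCTIONS,
            "tools": [{"tool_ids": list(tool_ids)}],
        },
    }


payload = build_agent_payload(
    "mitr", ["bank_kb_search", "customer_profile", "customer_transactions"]
)
```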

Because that logic lives in Kibana rather than the backend, changing the agent's persona or rules means editing the instructions and re-running the create script. The running server needs no change; only the agent definition does. You can also edit these instructions in Kibana under Agents.

How to call Elastic Agent Builder's converse API

At runtime, the backend never talks to the LLM or the indices directly. It sends one English question to Agent Builder's streaming converse endpoint and lets Elastic orchestrate the tool calls and the model.

Three details worth calling out for developers:

Trusted context, not a tool the LLM can spoof. The authenticated customer_id is injected by the backend into the agent input as trusted context. The model uses it to fill the ES|QL parameter, but it never chooses it, so a caller can't talk their way into another account.
Identity persistence across turns. The backend keeps a per-caller conversation_id. Once the agent verifies the caller's DOB on turn one, that verification holds for the rest of the call without re-asking. The demo verifies only DOB, but the same flow can be extended with additional verification factors.
Pluggable reasoning LLM. The inference_id field on the converse body (set from the AGENT_INFERENCE_ID environment variable) selects the chat-completion endpoint the agent reasons with, or you leave it blank for the Agent Builder default. Crucially, this is a different thing from the embeddings inference_id on the semantic_text field. One embeds documents; the other reasons over them.
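
The three details above can be sketched as a request-body builder. The conversation_id and inference_id fields and the AGENT_INFERENCE_ID variable are named in the text; the agent_id and input field names, the wording of the trusted-context prefix, and the in-memory session store are assumptions made for illustration.

```python
import os
from typing import Dict, Optional

# Hypothetical in-memory store: caller session -> Agent Builder conversation_id.
_conversations: Dict[str, str] = {}


def remember_conversation(session_id: str, conversation_id: str) -> None:
    """Record the conversation_id returned on the first turn of a call."""
    _conversations[session_id] = conversation_id


def build_converse_body(
    question_en: str,
    customer_id: str,
    session_id: str,
    agent_id: str = "mitr",
    inference_id: Optional[str] = None,
) -> dict:
    """Assemble the JSON body for one converse request."""
    if inference_id is None:
        inference_id = os.environ.get("AGENT_INFERENCE_ID", "")
    body = {
        "agent_id": agent_id,
        # The backend, not the caller, supplies the customer_id.
        "input": f"[TRUSTED CONTEXT: authenticated customer_id={customer_id}]\n{question_en}",
    }
    conversation_id = _conversations.get(session_id)
    if conversation_id:
        body["conversation_id"] = conversation_id  # keeps DOB verification across turns
    if inference_id:
        body["inference_id"] = inference_id        # omit to use the Agent Builder default
    return body
```

On the first turn there is no stored conversation, so the field is omitted and a new conversation starts; later turns reuse the stored ID. A production backend would keep this mapping somewhere that survives restarts.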

The response is a stream of Server-Sent Events; the final answer arrives in the message_content of the message_complete event. (It's worth surfacing error events explicitly too, or a failed tool call silently degrades into an empty-answer fallback.)
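
A minimal sketch of consuming that stream follows. It parses standard SSE framing (event: and data: lines, with a blank line ending each event), returns the message_content of the message_complete event, and raises on an error event instead of falling through to an empty answer. The JSON nesting (whether the fields sit at the top level or under a data key) is an assumption, so the code accepts both.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from raw SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:                      # a blank line terminates the event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):        # comment or keep-alive
            continue
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip())
    if data:                              # stream ended without a trailing blank line
        yield event, "\n".join(data)


def final_answer(lines: Iterable[str]) -> Optional[str]:
    """Return the final answer text, or raise if the stream reports an error."""
    for event, data in iter_sse_events(lines):
        payload = json.loads(data)
        nested = payload.get("data") if isinstance(payload, dict) else None
        inner = nested if isinstance(nested, dict) else payload
        if event == "error":
            raise RuntimeError(f"Agent Builder error event: {inner}")
        if event == "message_complete":
            return inner.get("message_content")
    return None
```

With an HTTP client that exposes the response as an iterator of decoded lines, final_answer can be fed that iterator directly.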

See it in action

In the demo, a caller opens in English, gives his date of birth in Hindi, asks "home loan EMI debited?" in Hindi, then switches to Marathi for the rest of the conversation. Behind that single call, the agent verifies his DOB and confirms his home loan status and Flipkart refund through the ES|QL tool.

In another demo, "Money debited without consent", the agent responds with guidance pulled from the semantic knowledge base: contact India's 1930 cyber-crime helpline and block the UPI ID. Two tools, three languages, one conversation. Watching the same agent glide across languages mid-call is the moment the Sarvam + Elastic combination clicks.

Taking a multilingual voice agent to production

This is a demo. All data is synthetic and "Pratham Bank" is fictional, but the path to production is mostly about hardening, not redesign. For a real deployment, you'd enforce caller scoping with Elasticsearch document-level security / RBAC per authenticated user rather than trusting a passed customer_id; add real authentication and a telephony or WebRTC layer in front of the browser mic; and keep secrets in a managed store rather than a .env file. What carries forward unchanged is the part that matters: the semantic_text + parameterised-ES|QL split, the verify-first identity flow, and the query contract.
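
One way to approach the document-level security piece is a role whose query is templated on the authenticated user's metadata, so Elasticsearch itself filters out every other customer's documents. The sketch below is only an assumption about how that could be wired: it presumes each end user is an Elasticsearch user (or is mapped to one) whose metadata carries a customer_id, and the function name is hypothetical.

```python
from typing import Sequence


def customer_scoped_role(
    indices: Sequence[str] = ("bank-transactions", "bank-customers"),
) -> dict:
    """Body for: PUT /_security/role/<role-name>.

    The templated query is resolved per request from the calling user's
    metadata, so a user can only read documents with their own customer_id.
    """
    return {
        "indices": [
            {
                "names": list(indices),
                "privileges": ["read"],
                "query": {
                    "template": {
                        "source": {
                            "term": {"customer_id": "{{_user.metadata.customer_id}}"}
                        }
                    }
                },
            }
        ]
    }
```

With a role like this in place, the customer_id filter in the ES|QL tools becomes a second line of defence rather than the only one.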

Run it yourself

The whole stack is one Docker workflow with three stages: teardown, setup (create indices, ingest data, build the agent), and serve. It is driven by a .env file with your Elastic and Sarvam credentials.

Bring an Elastic deployment with Agent Builder enabled and a text-embedding inference endpoint, drop in a Sarvam API key, and you're talking to Mitr in minutes.

The code, full setup, and demo are in this repo.

Takeaways

Let each platform play to its strength. Sarvam owns voice and Indian languages; Elastic Agent Builder owns reasoning-over-data. The glue between them is a thin, almost logic-free orchestrator.
"Answer in the user's language" is a one-line contract that keeps a multilingual product maintainable; your indices, tools, and prompts never go multilingual.
semantic_text + parameterised ES|QL tools cover the two halves of any real assistant (semantic context and exact private records) with no extra infrastructure to stand up.
Verify first, scope always. Identity checks and per-customer scoping, enforced in the tool definition rather than trusted to the model, are what separate a demo from something you'd let near real accounts.

All data in the demo is synthetic; "Pratham Bank" is fictional. Built with Elastic Agent Builder + Sarvam AI.
