Humyn Labs, a physical and voice AI data infrastructure company, today published BRIDGE (Benchmark of Regional & International Data for Global Evaluation), the largest independent benchmark evaluating commercial AI speech-recognition tools on real Indian language data. Covering 15+ Indic languages, 22 Indian states, and 15 global models, including ElevenLabs Scribe v2, Deepgram Nova-3, Gemini 2.5 Flash, OpenAI GPT-4o, and Indian providers Sarvam saaras v3 and Gnani vachana v3, it is the most comprehensive evaluation of its kind ever conducted for the languages of the Global South, home to over 5.5 billion people.
The findings expose a fundamental problem. Several of the most widely deployed tools misheard one in three words on Indian language audio. Most enterprises building on these tools are unaware, because the standard industry measure, Word Error Rate (WER), was never designed to catch the failures that define real Indian speech.
“The models are grading their own work. ASR providers published their own accuracy scores using benchmarks built on English-first, internet-trained datasets, with little independent validation. Meanwhile, enterprises are making million-dollar deployment decisions on numbers that rarely reflect how their users in the Global South actually speak. Before BRIDGE, there was no independent benchmark for real-world conversational audio across non-English markets,” said Manish Agarwal, Co-founder, Humyn Labs.
Where most benchmarks stop at WER and Character Error Rate (CER), BRIDGE applies a seven-metric stack: WER, CER, Semantic Similarity, Code-Switch F1, Loan Word WER, Phoneme-Informed Error Rate, and Word Information Lost. Each captures a different dimension of failure. Semantic Similarity measures whether the meaning of what was said is preserved, even when exact words differ. Loan Word WER tracks accuracy specifically on English words embedded in Indian language speech. Phoneme-Informed Error Rate accounts for how Indic phonology is transcribed. Word Information Lost penalises both under- and over-transcription. Together they expose failure modes that a word count alone will never surface.
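Three of those seven metrics have standard open-source implementations. The sketch below is a minimal illustration, not BRIDGE's own evaluation harness: it uses the Python jiwer library to compute WER, CER, and Word Information Lost on a toy transcript pair, with example strings invented for illustration.

```python
# pip install jiwer
import jiwer

reference  = "please schedule the meeting for monday"
hypothesis = "please schedule meeting for monday today"  # one word dropped, one inserted

# Word Error Rate: (substitutions + deletions + insertions) / reference word count
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")   # 2 edits / 6 words = 0.333

# Character Error Rate: the same edit-distance ratio, computed over characters
print(f"CER: {jiwer.cer(reference, hypothesis):.3f}")

# Word Information Lost: penalises both missed and hallucinated words,
# so a model cannot improve its score by under-transcribing, as it can with WER
print(f"WIL: {jiwer.wil(reference, hypothesis):.3f}")
```

On this pair the WER works out to one word in three, the same failure rate the findings describe, while the other two metrics land on different values: one reason no single number tells the whole story.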
“The models aren’t the only problem; the metrics are. You cannot evaluate non-English speech with a scoring system designed for English phonology and call it rigorous. The performance leaderboard for Hindi is not the leaderboard for Tamil, Bengali, or Marathi. A single aggregate benchmark score cannot support cross-regional deployment decisions,” said Ishank Gupta, Co-founder, Humyn Labs.
The most consequential metric for India is Code-Switch F1, which measures how accurately a model handles the natural mixing of Hindi or any Indic language with English mid-sentence. Most AI tools either drop the English words or convert them into transliterated script, breaking the meaning for anyone reading the transcript. This failure is invisible to word error rate. The scores reveal the gap: Deepgram Nova-3 leads at 0.906. Amazon Transcribe scores 0.199. OpenAI’s models fall below 0.4.
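The announcement does not spell out the exact Code-Switch F1 formulation, but the intuition is straightforward: score precision and recall over the embedded English tokens a model recovers. The Python sketch below is a plausible reconstruction under that assumption; the Latin-script heuristic, function names, and example sentences are ours, not BRIDGE's.

```python
from collections import Counter

def latin_tokens(text: str) -> Counter:
    # Heuristic: treat tokens written entirely in ASCII letters as English;
    # Devanagari (or other Indic-script) tokens are excluded.
    return Counter(t.lower() for t in text.split() if t.isascii() and t.isalpha())

def code_switch_f1(reference: str, hypothesis: str) -> float:
    ref, hyp = latin_tokens(reference), latin_tokens(hypothesis)
    if not ref or not hyp:
        return 0.0
    overlap = sum((ref & hyp).values())  # English words correctly preserved
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall) if overlap else 0.0

# A transcriber that drops or transliterates embedded English words scores low:
ref = "मैंने कल meeting में presentation दिया"
hyp = "मैंने कल मीटिंग में presentation दिया"   # "meeting" transliterated
print(round(code_switch_f1(ref, hyp), 3))      # 0.667: one of two English words lost
```

On this toy pair, a single transliterated word cuts the F1 by a third, while plain WER would register it as just one error among six.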
The results also offer the first direct, independent comparison of global models against Indian providers. Sarvam AI’s saaras v3 ranks third overall on word error rate at 20.2%, ahead of Google Gemini, Microsoft Azure, and AWS Transcribe, a strong result for a model built specifically for Indian languages. On Code-Switch F1, however, it scores 0.588, placing it in the partial-reliability category where performance varies by language and English density. The gap between headline accuracy and code-switch reliability applies to domestic and international providers alike.
On overall word error rate, ElevenLabs Scribe v2 leads at 10.6%, with a margin over second place wider than the entire spread between second and eleventh. The broader finding, however, is that a single leaderboard number is not a reliable basis for deployment decisions. The model that leads on Hindi does not lead on Tamil. The model that leads on code-switching does not lead on word accuracy. Enterprises need to evaluate the language, dialect, and speech pattern that matches their actual users.
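In practice, evaluating "the language, dialect, and speech pattern that matches their actual users" means breaking scores out per language rather than reading one aggregate number. A minimal sketch, again using jiwer and assuming a list of (language, reference, hypothesis) samples; the sample data below is invented:

```python
# pip install jiwer
from collections import defaultdict
import jiwer

def wer_by_language(samples):
    """samples: iterable of (language, reference, hypothesis) triples.
    Returns one WER per language instead of a single aggregate score."""
    refs, hyps = defaultdict(list), defaultdict(list)
    for lang, ref, hyp in samples:
        refs[lang].append(ref)
        hyps[lang].append(hyp)
    # jiwer pools edit counts across each language's utterances
    return {lang: jiwer.wer(refs[lang], hyps[lang]) for lang in refs}

# Hypothetical transcripts: a model can look fine on average
# while failing badly on one language.
samples = [
    ("hindi", "namaste aap kaise hain", "namaste aap kaise hain"),
    ("tamil", "vanakkam eppadi irukkinga", "vanakkam epadi irukira"),
]
print(wer_by_language(samples))  # e.g. {'hindi': 0.0, 'tamil': 0.667}
```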
BRIDGE was built on field-collected audio: real two-person conversations, human-verified, recorded across 22 Indian states, not scripted speech or data scraped from the internet.
The full dataset is available on Hugging Face. The benchmark report is at humynlabs.ai/bridge. Humyn Labs has indicated it will open-source the evaluation methodology, subject to demand from the research community.