CASE STUDY

Full-Stack Video Interview Transcription Tool for Job Boards

Industry: Recruitment / HR Technology

Region: International

Timeline: Full-cycle engagement

Team: Trembit dedicated engineering team

Transcription: Google Speech-to-Text, Amazon Transcribe, OpenAI Whisper

Backend: Python, Node.js

The Problem

A job board platform needed to make video interviews searchable. Recruiters received hundreds of recorded video interviews per week, but the only way to review them was to watch each one in full: scrubbing footage to find relevant answers, taking manual notes, and comparing candidates from memory. There was no way to search across interviews, no way to quickly locate where a candidate discussed a specific skill, and no way for hiring managers who were not in the interview to get up to speed without watching the entire recording. The platform needed an automated transcription pipeline that could process interview videos at scale, handle multiple languages and accents, identify who was speaking when, and produce transcripts recruiters could search, annotate, and share, turning hours of passive video watching into minutes of targeted text review.

Why Building Accurate Multilingual Interview Transcription at Scale Is Hard

Transcribing video interviews sounds like a solved problem until you deal with the reality of recruitment audio — where accents, crosstalk, and domain-specific vocabulary push every speech-to-text API to its limits:

Accent and dialect diversity — candidates come from everywhere; a single job board may process interviews in dozens of accents and language backgrounds, and no single engine handles them all equally well
Speaker diarization in conversational audio — interviews are back-and-forth dialogue, so the system must reliably identify who said what, separating interviewer from candidate even when speakers overlap
Domain-specific vocabulary — technical interviews contain jargon, product names, and acronyms that generic models frequently misrecognize, and accuracy on these terms is what makes a transcript useful
Multiple AI API orchestration — no single provider is best at everything, so consistently high accuracy requires routing audio through multiple engines and merging results, with its own alignment and latency complexity
Volume and turnaround pressure — recruiters need transcripts within minutes of an interview ending, so the pipeline must process concurrent uploads without queuing delays
Editable, searchable output — raw output is never perfect; the platform needs an interface to correct errors, highlight passages, search across interviews, and export clean transcripts for hiring committees

What We Did

Transcription Pipeline Architecture

Designed the multi-provider transcription pipeline integrating Google Speech-to-Text, Amazon Transcribe, and OpenAI Whisper — each engine processes the same audio and a merging algorithm selects the most confident result per segment
Built the audio extraction and preprocessing layer — separating audio from video, normalizing volume, and segmenting long interviews without cutting mid-sentence
Implemented speaker diarization that labels each segment with the correct speaker, plus an API orchestration layer that manages concurrency, rate limits, retries, and cost-optimized routing
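
The per-segment selection step described above can be sketched roughly as follows. This is a minimal illustration, not the production merging algorithm: the Segment class, the merge_by_confidence function, and the sample data are hypothetical, and the sketch assumes that provider outputs have already been aligned to shared time windows and that confidence scores have been normalised to a comparable 0.0-1.0 scale (raw scores from different providers are generally not directly comparable).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    provider: str      # hypothetical label, e.g. "google", "amazon", "whisper"
    start: float       # segment start, in seconds
    end: float         # segment end, in seconds
    text: str
    confidence: float  # assumed normalised to 0.0-1.0 across providers


def merge_by_confidence(candidates_per_window):
    """For each time window, keep the candidate with the highest confidence."""
    merged = []
    for candidates in candidates_per_window:
        if not candidates:
            continue  # no provider returned anything for this window
        merged.append(max(candidates, key=lambda s: s.confidence))
    return merged


# Illustrative data only: two aligned windows with competing hypotheses.
windows = [
    [
        Segment("google", 0.0, 4.2, "I led the Kubernetes migration", 0.91),
        Segment("amazon", 0.0, 4.2, "I led the communities migration", 0.78),
        Segment("whisper", 0.0, 4.2, "I led the Kubernetes migration", 0.88),
    ],
    [
        Segment("google", 4.2, 7.9, "for a team of aid engineers", 0.64),
        Segment("whisper", 4.2, 7.9, "for a team of eight engineers", 0.86),
    ],
]

transcript = merge_by_confidence(windows)
print(" ".join(s.text for s in transcript))
```

In this toy input the first window is taken from one provider and the second from another, which is the behaviour the merging layer is meant to produce.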

Multi-Language Support & Accuracy Optimization

Implemented automatic language detection that routes audio to the provider configuration best suited for the detected language and accent profile
Built the multi-API consensus engine — comparing confidence scores, word-level timestamps, and phonetic alignment to select the most accurate version when providers disagree
Developed domain-specific vocabulary enhancement and fallback logic — custom dictionaries for technical terms, with automatic escalation to alternative providers on low-confidence segments
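
As a rough sketch of how language-based routing and low-confidence escalation can fit together: the routing table, the 0.80 threshold, and the stub providers below are all assumptions made for illustration, not the project's actual configuration. Real provider clients would replace the lambdas.

```python
from typing import Callable, Dict, List, Tuple

# A transcriber takes audio bytes and returns (text, confidence in 0.0-1.0).
Transcriber = Callable[[bytes], Tuple[str, float]]

# Hypothetical routing table: detected language -> providers in preference order.
ROUTES: Dict[str, List[str]] = {
    "en": ["amazon", "whisper", "google"],
    "de": ["google", "whisper"],
}
DEFAULT_ROUTE: List[str] = ["whisper", "google", "amazon"]
MIN_CONFIDENCE = 0.80  # illustrative threshold, not taken from the project


def transcribe_with_fallback(
    audio: bytes, language: str, providers: Dict[str, Transcriber]
) -> Tuple[str, str, float]:
    """Try providers in routed order and escalate while confidence stays low.

    Returns (provider_name, text, confidence) for the best attempt seen.
    """
    best = None
    for name in ROUTES.get(language, DEFAULT_ROUTE):
        text, confidence = providers[name](audio)
        if best is None or confidence > best[2]:
            best = (name, text, confidence)
        if confidence >= MIN_CONFIDENCE:
            break  # good enough, so skip the remaining provider calls
    return best


# Stub providers standing in for real API clients.
providers: Dict[str, Transcriber] = {
    "google": lambda audio: ("wir nutzen Postgres", 0.72),
    "whisper": lambda audio: ("wir nutzen PostgreSQL", 0.90),
    "amazon": lambda audio: ("wir nutzen Post Gress", 0.55),
}

result = transcribe_with_fallback(b"", "de", providers)
print(result)
```

Stopping at the first sufficiently confident result keeps the number of paid API calls down, while still escalating to another provider when a segment comes back with low confidence.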

Transcript Interface & Recruiter Tools

Built the searchable transcript viewer — a timestamped, speaker-labeled transcript synced with the video player, where clicking any line jumps to that moment
Developed inline editing where corrections feed back into accuracy tracking, plus full-text search across all interviews for a role with direct links to the exact video timestamp
Built export functionality — clean, formatted transcripts as PDF or DOCX for hiring managers who prefer to read rather than watch

Scalability & Performance Optimization

Optimized the pipeline for fast turnaround — parallel processing across providers returns a complete diarized transcript for a 30-minute interview within minutes
Implemented job queue management for concurrent upload spikes — prioritizing recent interviews and distributing API calls to stay within rate limits
Built monitoring and accuracy-tracking dashboards, and load-tested the pipeline under peak recruitment-season volumes to keep delivery times steady
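
The queueing idea above (newest interviews first, API calls kept within provider rate limits) can be illustrated with a small sketch. The class names, the sliding-window limiter, and the limits used are hypothetical; a production system would more likely rely on a dedicated job queue, and the clock is passed in explicitly here only to keep the example deterministic.

```python
import heapq
import itertools
from collections import deque


class InterviewQueue:
    """Pops the most recently finished interview first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal timestamps

    def push(self, interview_id, finished_at):
        # heapq is a min-heap, so negate the timestamp to pop the newest first.
        heapq.heappush(self._heap, (-finished_at, next(self._counter), interview_id))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


class SlidingWindowLimiter:
    """Allows at most max_calls per window_seconds for one provider."""

    def __init__(self, max_calls, window_seconds):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._calls = deque()  # timestamps of calls still inside the window

    def try_acquire(self, now):
        # Forget calls that have already left the window.
        while self._calls and now - self._calls[0] >= self.window_seconds:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            self._calls.append(now)
            return True
        return False


queue = InterviewQueue()
queue.push("interview-a", finished_at=1000.0)
queue.push("interview-b", finished_at=1600.0)
queue.push("interview-c", finished_at=1300.0)
order = [queue.pop() for _ in range(len(queue))]
print(order)

limiter = SlidingWindowLimiter(max_calls=2, window_seconds=1.0)
decisions = [limiter.try_acquire(t) for t in (0.0, 0.1, 0.2, 1.05)]
print(decisions)
```

A worker that is refused by try_acquire would typically wait or send the job to a different provider rather than drop it.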

Key Results

Multi-API consensus: Merging Google, Amazon, and OpenAI outputs for highest per-segment confidence

Diarized: Interviewer and candidate clearly labeled throughout

Multilingual: Automatic detection and provider routing across languages and accents

Minutes, not hours: Parallel multi-provider processing for fast transcript delivery

Full-cycle: Transcription pipeline, recruiter interface, search, editing, and export

In Their Words

Before this tool, our recruiters watched every interview end to end. Now they search transcripts, jump to the moments that matter, and review twice as many candidates in half the time. It fundamentally changed our hiring workflow.

Job board platform client

Their proactive team gets things done as if it were their own project.

Trembit client

What We Learned

No single transcription API wins across all accents and languages — the value is in the merging layer

Google handles European accents well, Whisper excels at noisy audio and code-switching, and Amazon is strong on formal business English. Rather than picking one, we built a consensus engine that runs all three and selects the best result per segment by confidence and word-level alignment. The merging algorithm consistently outperforms any individual provider by 8-15% on our interview test corpus, especially on accented speech and technical vocabulary.

Speaker diarization is easy in theory and brittle in practice — especially in interviews

Two-speaker diarization sounds simple, but interviews are not clean turn-taking: interviewers ask follow-ups while candidates finish answers, and audio quality varies between studio setups and laptop microphones in noisy apartments. We tuned the diarization pipeline specifically for interview patterns — shorter turns, more interruptions, unequal speaking time — rather than relying on defaults tuned for meetings or podcasts.
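
To make the overlap problem concrete, here is a minimal sketch of one common way to combine diarization turns with word-level timestamps: each word is assigned to the speaker whose turn overlaps it the most. This is an assumed approach for illustration, not a description of the tuned pipeline, and the function names and sample timings are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(words, turns):
    """Attach to each word the speaker whose turn overlaps it the most.

    words: list of (text, start, end) with times in seconds
    turns: list of (speaker, start, end); turns may overlap each other
    """
    labelled = []
    for text, start, end in words:
        best_speaker, best_overlap = "unknown", 0.0
        for speaker, t_start, t_end in turns:
            shared = overlap(start, end, t_start, t_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labelled.append((best_speaker, text))
    return labelled


# The interviewer starts a follow-up 0.4 s before the candidate stops talking.
turns = [("candidate", 0.0, 3.0), ("interviewer", 2.6, 5.0)]
words = [
    ("deployments", 2.0, 2.7),
    ("and", 2.8, 3.3),
    ("scaling", 3.3, 3.9),
]

labelled = label_words(words, turns)
print(labelled)
```

Words that fall entirely inside the overlap region are where a simple rule like this becomes unreliable, which is why interview-specific tuning matters.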

Search across interviews is the feature that changed the recruiter workflow

The original ask was transcription — turn video into text. But once transcripts existed, searching across all interviews for a role became the feature recruiters used most. Searching "team management" and instantly seeing which candidates discussed it, with clickable links to the exact timestamp, turned passive watching into active research. We built full-text indexing into storage from the start, which made this trivial to add but would have been expensive to retrofit.
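
The search behaviour described above can be sketched with a tiny in-memory inverted index. This is only an illustration of the idea: the TranscriptIndex class is hypothetical, it matches segments that contain every query word rather than exact phrases, and a real deployment would more likely use the full-text features of a database or search engine.

```python
import re
from collections import defaultdict


class TranscriptIndex:
    """Maps each word to the transcript segments that contain it."""

    def __init__(self):
        self._postings = defaultdict(list)  # token -> segment positions
        self._segments = []                 # (candidate, start_seconds, text)

    @staticmethod
    def _tokens(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, candidate, start_seconds, text):
        position = len(self._segments)
        self._segments.append((candidate, start_seconds, text))
        for token in set(self._tokens(text)):
            self._postings[token].append(position)

    def search(self, query):
        """Return segments containing every query token, in insertion order."""
        tokens = self._tokens(query)
        if not tokens:
            return []
        hits = set(self._postings.get(tokens[0], []))
        for token in tokens[1:]:
            hits &= set(self._postings.get(token, []))
        return [self._segments[i] for i in sorted(hits)]


index = TranscriptIndex()
index.add("Candidate A", 312.4, "I handled team management for six engineers")
index.add("Candidate B", 95.0, "Mostly individual contributor work")
index.add("Candidate C", 640.8, "Management of a distributed team was my focus")

results = index.search("team management")
for candidate, start_seconds, text in results:
    print(f"{candidate} at {start_seconds:.1f}s: {text}")
```

Because every hit carries its start time, the interface can turn each result into a link that opens the video at that moment.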

Need Interview Transcription?

Book a 30-minute architecture session — we'll discuss your transcription requirements and the infrastructure decisions that matter most. No pitch deck. Just engineering clarity.