Turning 37,000 Customer Calls into Intelligence with Whisper and LLMs

2026-03-10

We had tens of thousands of recorded customer calls and no idea what was in them. Buried in that audio was a clear signal about what customers actually wanted, what frustrated them, and where the business was leaving money on the table — but no human was ever going to listen to 37,000 calls. So I built something to do it.

The Problem

Like a lot of businesses, we'd been recording sales and support calls for years, mostly for the vague reassurance of "having them on file". The recordings sat in a bucket, accumulating. Every so often someone would suggest we should "do something with all that data", and every time the maths killed it: at an average of five or six minutes a call, the backlog came to somewhere between 3,000 and 3,700 hours of audio. That is well over a year of solid, full-time, headphones-on listening for one person, with nothing structured to show for it at the end.
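For anyone who wants to check that maths, here is a back-of-envelope version. The call count and call length come from the figures above; the 1,800 working hours per year is my own assumption for what "full-time" means, not a number from the project.

```python
# Back-of-envelope estimate of how long the backlog would take to listen to.
CALLS = 37_000            # backlog size from the article
MINUTES_PER_CALL = 5.5    # midpoint of "five or six minutes"
# Assumption (not from the article): a full-time listener manages
# about 1,800 working hours a year.
WORKING_HOURS_PER_YEAR = 1_800


def listening_hours(calls: int, minutes_per_call: float) -> float:
    """Total audio duration in hours."""
    return calls * minutes_per_call / 60


total_hours = listening_hours(CALLS, MINUTES_PER_CALL)
person_years = total_hours / WORKING_HOURS_PER_YEAR
print(f"{total_hours:,.0f} hours of audio, about {person_years:.1f} person-years")
# prints: 3,392 hours of audio, about 1.9 person-years
```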

The data was rich but completely unstructured. A recording isn't a record — you can't query it, aggregate it, or spot a trend across it. What I wanted was the opposite: something queryable. If a customer mentions a competitor by name, I want to count that. If three hundred callers ask for a part we don't stock, I want that to surface as a number, not as a feeling someone on the sales desk half-remembers.

The Pipeline

I built the whole thing in Python, end to end, as a batch pipeline. The shape was simple even if the details weren't:

Audio in — pull recordings from storage, normalise formats.
Transcription — run each recording through Whisper for speech-to-text.
LLM analysis — feed each transcript to an LLM to classify the call, score sentiment, extract topics and intent, and produce a short summary.
Aggregation — collapse thousands of per-call results into a structured report.

The first two stages are about getting clean text out of messy audio. The last two are about getting structured fields out of clean text. Treating them as separate concerns kept the whole thing debuggable — when something looked wrong in the output, I could tell immediately whether it was a transcription problem or an analysis problem.

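# One pass over a batch. Each helper stands in for a stage described above.
for recording in batch:
    audio = load_and_normalise(recording.path)   # pull from storage, normalise format
    transcript = whisper_transcribe(audio)       # speech-to-text
    analysis = analyse_transcript(transcript)    # -> dict of structured fields
    store(recording.id, transcript, analysis)    # keep the transcript next to its analysis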
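The loop above stops at per-call storage, so the aggregation stage is worth sketching separately. This is a minimal illustration, not the production code: the field names (`category`, `sentiment`, `topics`) and the sample rows are hypothetical, chosen only to show how per-call dicts collapse into counts.

```python
from collections import Counter
from typing import Iterable


def aggregate(analyses: Iterable[dict]) -> dict:
    """Collapse per-call analysis dicts into report-level counts."""
    categories: Counter = Counter()
    sentiments: Counter = Counter()
    topics: Counter = Counter()
    total = 0
    for analysis in analyses:
        total += 1
        categories[analysis.get("category", "unknown")] += 1
        sentiments[analysis.get("sentiment", "unknown")] += 1
        # Count each topic once per call, so a single call that repeats
        # a topic cannot inflate the total.
        topics.update(set(analysis.get("topics", [])))
    return {
        "total_calls": total,
        "by_category": dict(categories),
        "by_sentiment": dict(sentiments),
        "top_topics": topics.most_common(10),
    }


# Made-up sample rows, purely for illustration.
sample = [
    {"category": "sales", "sentiment": "positive", "topics": ["brake pads"]},
    {"category": "sales", "sentiment": "negative", "topics": ["brake pads", "delivery"]},
    {"category": "support", "sentiment": "neutral", "topics": ["delivery", "delivery"]},
]
report = aggregate(sample)
print(report["by_category"])
print(report["top_topics"])
```

Counting topics per call rather than per mention is a design choice: it answers "how many callers raised this?", which is the number the business cares about.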

Engineering Decisions

The interesting work was almost never the model call itself. It was everything around it.

Long audio and chunking. Some calls ran well over the comfortable input length for a single pass, so I chunked them, transcribed the chunks, and stitched the text back together with enough overlap that I wasn't slicing words in half at the boundaries.
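Overlap has a side effect: the words inside the overlap get transcribed twice, so the stitching step has to drop the repeat. The sketch below shows one way to do both halves. The chunk length, overlap length and exact-match stitching are illustrative assumptions, not the values or method used in the real pipeline.

```python
def chunk_windows(duration_s: float, chunk_s: float = 600.0, overlap_s: float = 5.0):
    """Return (start, end) times in seconds that cover the audio with overlap."""
    if overlap_s >= chunk_s:
        raise ValueError("overlap must be shorter than the chunk length")
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        # Step back by the overlap so a word cut at the boundary
        # is heard whole in the next chunk.
        start = end - overlap_s
    return windows


def stitch(texts: list[str], max_overlap_words: int = 30) -> str:
    """Join chunk transcripts, dropping words repeated across a boundary."""
    merged: list[str] = []
    for text in texts:
        words = text.split()
        limit = min(max_overlap_words, len(merged), len(words))
        skip = 0
        # Find the longest run where the tail of the text so far matches
        # the head of the next chunk, then skip that run.
        for n in range(limit, 0, -1):
            tail = [w.lower() for w in merged[-n:]]
            head = [w.lower() for w in words[:n]]
            if tail == head:
                skip = n
                break
        merged.extend(words[skip:])
    return " ".join(merged)


windows = chunk_windows(1500.0)
joined = stitch(["I need a set of brake pads", "of brake pads for a Fiesta"])
print(windows)
print(joined)
```

Exact word matching is the naive version. Real transcripts often differ slightly on either side of a boundary (punctuation, a misheard word), so a production stitcher would probably need fuzzy matching or timestamp-based trimming.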

Local inference vs API. With 37,000 calls to get through, cost compounds fast — a few pennies per call is a real number at that volume. I leaned on local inference for the heavy, repetitive transcription work and reserved paid API calls for the parts where quality genuinely mattered. That trade-off — run it yourself where it's cheap and good enough, pay for it where it isn't — drove most of the architecture.
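To put a rough shape on "a few pennies per call", here is the multiplication spelled out. The per-call prices are hypothetical round numbers for illustration, not quotes from any provider or figures from the project.

```python
CALLS = 37_000  # backlog size from the article


def backlog_cost_gbp(calls: int, pence_per_call: int) -> float:
    """Cost in pounds of one paid pass over the whole backlog."""
    return calls * pence_per_call / 100


# Hypothetical per-call prices in pence (assumptions, not real pricing).
costs = {pence: backlog_cost_gbp(CALLS, pence) for pence in (1, 3, 5)}
for pence, total in costs.items():
    print(f"{pence}p per call -> GBP {total:,.0f} per pass")
```

The "per pass" matters: every time a prompt changes and the backlog is re-run through a paid API, that total is paid again, which is a large part of why the repetitive work is worth running locally.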

Defensive parsing. LLMs are probabilistic, and "return JSON" is a request, not a guarantee. I treated every model response as hostile until proven otherwise: schema validation, sensible defaults, retries on malformed output, and a quarantine bucket for anything that still wouldn't parse. The point of the pipeline was to produce trustworthy structured data, and you don't get that by assuming the model behaves.
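A stripped-down sketch of that pattern is below, using only the standard library. The field names, the allowed sentiment values, the retry count and the `call_model` callable are all assumptions made for the example; the real schema and model client are not shown in this post.

```python
import json
from typing import Callable, Optional

ALLOWED_SENTIMENT = {"positive", "neutral", "negative"}  # hypothetical label set


def validate(raw: str) -> dict:
    """Parse a model response and check it against a hand-rolled schema.

    Raises ValueError on anything unusable.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    sentiment = data.get("sentiment")
    if not isinstance(sentiment, str) or sentiment not in ALLOWED_SENTIMENT:
        raise ValueError(f"bad sentiment: {sentiment!r}")
    # Sensible defaults for optional fields; values from the model win.
    result = {"topics": [], "summary": "", **data}
    if not isinstance(result["topics"], list):
        raise ValueError("topics must be a list")
    return result


def analyse_with_retries(
    transcript: str,
    call_model: Callable[[str], str],  # hypothetical: returns the raw model text
    quarantine: list,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Retry on malformed output; quarantine the call if it never parses."""
    last_raw = ""
    for _ in range(max_attempts):
        last_raw = call_model(transcript)
        try:
            return validate(last_raw)
        except ValueError:
            continue
    quarantine.append({"transcript": transcript, "raw": last_raw})
    return None


# A fake model that chats on the first attempt and behaves on the second.
responses = iter([
    "Sure! Here is the JSON you asked for:",
    '{"sentiment": "negative", "topics": ["delivery"]}',
])
quarantine: list = []
result = analyse_with_retries(
    "caller asks about a late delivery", lambda _t: next(responses), quarantine
)
print(result)
print(len(quarantine))
```

Keeping the raw response in the quarantine record is deliberate: when something refuses to parse three times running, the raw text is usually the fastest way to see whether the prompt or the transcript is at fault.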

Messy real-world audio. Crosstalk, hold music, dead air, callers on terrible mobile connections — the real recordings were nothing like the clean demos. A surprising amount of robustness came from simply accepting that some calls would be low-confidence and tagging them as such, rather than pretending every transcript was equally reliable.
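One way to do that tagging, sketched under assumptions: the open-source `openai-whisper` package returns per-segment `avg_logprob` and `no_speech_prob` values, and a call can be flagged when too many of its segments look shaky. This post doesn't pin down which Whisper implementation or which thresholds were used, so the thresholds and the 30% cut-off below are illustrative starting points that would need tuning on real audio.

```python
def transcript_quality(
    segments: list[dict],
    min_avg_logprob: float = -1.0,    # assumption: illustrative threshold
    max_no_speech_prob: float = 0.6,  # assumption: illustrative threshold
    max_bad_fraction: float = 0.3,    # assumption: illustrative cut-off
) -> dict:
    """Tag a transcript as low-confidence from per-segment Whisper scores."""
    if not segments:
        # Nothing transcribed at all: dead air, hold music, or a failed decode.
        return {"low_confidence": True, "bad_fraction": 1.0, "segments": 0}
    bad = [
        s for s in segments
        if s["avg_logprob"] < min_avg_logprob
        or s["no_speech_prob"] > max_no_speech_prob
    ]
    bad_fraction = len(bad) / len(segments)
    return {
        "low_confidence": bad_fraction > max_bad_fraction,
        "bad_fraction": bad_fraction,
        "segments": len(segments),
    }


# Made-up segment scores, shaped like Whisper's segment dicts.
clean_call = [
    {"avg_logprob": -0.2, "no_speech_prob": 0.01},
    {"avg_logprob": -0.3, "no_speech_prob": 0.02},
]
noisy_call = [
    {"avg_logprob": -1.4, "no_speech_prob": 0.10},
    {"avg_logprob": -0.3, "no_speech_prob": 0.85},
    {"avg_logprob": -0.2, "no_speech_prob": 0.05},
]
clean_quality = transcript_quality(clean_call)
noisy_quality = transcript_quality(noisy_call)
print(clean_quality["low_confidence"], noisy_quality["low_confidence"])
```

Storing the flag alongside the analysis means the aggregation step can exclude or down-weight shaky calls without re-running anything.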

What It Surfaced

Once everything was structured, the aggregated view showed things the business genuinely hadn't seen — recurring requests, patterns in why calls went sideways, and a concentration of demand around specific products that hadn't been obvious from the day-to-day noise.

The part that stuck with me was realising the pipeline itself was the product. What I'd built to solve our own problem was just as applicable to any other motor factor sitting on a pile of call recordings they'd never analysed. The same plumbing — audio in, structured intelligence out — was a managed service waiting to happen.

Reflection

The biggest lesson was how little of "applied AI" is actually about the AI. The model is the easy part. You can swap Whisper for the next thing and the LLM for the one after that, and the hard problems don't move: getting clean inputs, processing at volume without the cost spiralling, turning probabilistic output into fields you can trust, and — hardest of all — translating that into something a business will actually act on.

Anyone can call an API. The engineering is in the plumbing, the evaluation, and the discipline of treating model output as data that needs validating rather than answers that can be trusted. That's the work, and it's the part that doesn't fit in a demo.