Processing 170GB of TecDoc Data: A Concurrent Part-Matching Pipeline on AWS

2024-09-15

Cross-referencing a Jaguar Land Rover OEM part number against the right aftermarket equivalents used to mean manual lookups across multiple supplier catalogues. The reference data — TecDoc — runs to over 170GB and millions of records. Here's how I turned an hours-long manual chore into a minutes-long automated one.

The Problem

In the motor trade, the same physical part exists under a dozen different numbers. There's the OEM number the manufacturer stamps on it, and then there are the aftermarket equivalents from every supplier who makes a compatible version. A customer turns up with an OEM number; finding what we could actually sell them meant a human cross-referencing it across catalogues by hand.

That was slow, and slow costs money twice: once in the staff time spent looking things up, and again in the orders lost while a customer waited for an answer. The authoritative reference for all this is TecDoc — and it's enormous. Over 170GB, millions of records, spanning vehicles, parts, manufacturers, and the cross-reference links between them. The data was there. Querying it at speed was the problem.

Architecture

I built the pipeline in Python on top of MySQL, hosted on AWS:

S3 for large-scale storage of the raw CSV exports — TecDoc ships as a mountain of delimited files, and S3 was the natural landing zone.
RDS for the MySQL database the matching logic ran against, so I wasn't responsible for babysitting the data layer.
EC2 worker processes for ingestion and transformation — pulling from S3, parsing, normalising, and loading.
IAM to keep access scoped and the credentials out of the code.

The separation mattered. Storage, compute, and database were independent concerns that could be scaled and reasoned about separately, rather than one heroic script doing everything on one machine.

Concurrency

The naive version — loop over part numbers one at a time, query, move on — worked and was unbearably slow. With hundreds of thousands of part numbers to process, sequential lookups stretched into hours.

The job was overwhelmingly I/O-bound. Most of the wall-clock time was spent waiting on the database and on file reads, not on CPU. That's the ideal shape for concurrency: while one worker waits on a query, others can be doing useful work instead of the whole program sitting idle.

So I batched the part numbers and processed the batches concurrently, with a pool of workers sharing the load against RDS. The trick was tuning the concurrency to the bottleneck — push too hard and you just queue up against database connection limits, getting no faster while risking timeouts; too gentle and you leave throughput on the table. Finding the level where the database was busy but not buried was where the real speed came from.

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    for batch in chunked(part_numbers, BATCH_SIZE):
        pool.submit(match_batch, batch)

The result was the headline: a job that took hours collapsed to minutes.

Data Matching

The matching logic was the other half of the problem. Cross-referencing OEM to aftermarket isn't a clean key lookup — the data is inconsistent. Numbers come with and without spaces, with different separators, with leading zeros that may or may not be significant. Some links are exact; some are partial; some are missing entirely.

So matching meant normalising aggressively before comparing, handling the partial and ambiguous cases explicitly rather than letting them fall through, and being honest in the output about confidence. A confident exact match and a "this is probably the equivalent" match are different things, and collapsing them would have undermined trust in the whole system.

Reflection

Two things stuck with me. The first is that pipelines have to be designed to stay fast as the data grows. A 170GB dataset is not a fixed target — TecDoc updates, the catalogue expands, and an approach that only works at today's size is a problem deferred, not solved. Concurrency that scales with the work, and storage and compute that scale independently, were what kept it from being a one-off.

The second is that reliability and reprocessing deserve to be first-class, not bolted on. Big data jobs fail partway — a bad file, a dropped connection, a record that won't parse. Building from the start so that a failed run could be re-run safely, picking up where it left off rather than starting from zero, was the difference between a pipeline I trusted and a script I had to nurse. At this scale, "just run it again" is a feature you have to design for.