This is the technical companion to our TalentFlow AI case study. The case study covers the business impact — 85% faster screening, 3x better match quality. This post covers the engineering: how we built an NLP pipeline that processes 10,000 resumes monthly and why the hardest problem wasn't NLP — it was trust.
The Technical Challenge
Resume screening sounds simple: parse a document, extract skills, match against job requirements. In practice, it's a mess:
- Format chaos. Resumes come in PDF, DOCX, plain text, and occasionally HTML. PDFs alone range from machine-generated (easy) to scanned images (hard) to creative designs with columns, sidebars, and icons (harder).
- Language ambiguity. "Managed a team of 5" and "Led a cross-functional pod" mean similar things. "Python" could be a programming language or a Monty Python reference (we're only half joking — context matters).
- Implicit signals. A candidate's career trajectory, industry transitions, and project complexity carry signal that isn't captured by keyword matching. A senior engineer who's been at Google for 8 years has different implications than one who's been at 8 startups for 1 year each.
Pipeline Architecture
The system is a three-stage pipeline:
Stage 1: Document Parsing
Goal: Convert any resume format into structured text with section headers.
- PDF/DOCX extraction using Apache Tika, which handles most standard formats reliably
- OCR fallback using Tesseract for scanned documents (about 8% of submissions)
- Layout analysis for multi-column resumes — we use a custom heuristic that identifies reading order based on bounding box positions
- Section detection using a fine-tuned classifier that identifies resume sections (Experience, Education, Skills, etc.) even when headers are missing or non-standard
Parsing accuracy: correct field extraction on 96% of resumes across all formats. The remaining 4% are edge cases (heavily designed resumes, non-standard layouts) that get flagged for manual review.
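The reading-order heuristic for multi-column layouts can be sketched roughly like this — a minimal version that assumes text blocks arrive with bounding-box coordinates from the extractor. The `Block` type and the `column_gap` threshold are illustrative, not our production values:

```python
from dataclasses import dataclass

@dataclass
class Block:
    text: str
    x0: float  # left edge of the bounding box
    y0: float  # top edge (0 = top of page)

def reading_order(blocks: list[Block], column_gap: float = 150.0) -> list[str]:
    """Order text blocks for a multi-column page: bucket blocks into
    columns by their left edge, then read each column top to bottom,
    leftmost column first."""
    columns: dict[int, list[Block]] = {}
    for b in blocks:
        key = round(b.x0 / column_gap)  # coarse column bucket
        columns.setdefault(key, []).append(b)
    ordered: list[Block] = []
    for key in sorted(columns):
        ordered.extend(sorted(columns[key], key=lambda b: b.y0))
    return [b.text for b in ordered]
```

The production heuristic also handles full-width headers that span columns; this sketch shows only the core bucketing idea.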
Stage 2: Entity Extraction and Normalization
Goal: Extract structured data from unstructured resume text.
- Named Entity Recognition (NER) using a fine-tuned SpaCy model that extracts: skills, job titles, companies, dates, education, certifications
- Skill normalization — maps variations to canonical skills ("JS" → "JavaScript", "React.js" → "React", "ML" → "Machine Learning"). We maintain a taxonomy of 2,500+ skills with aliases.
- Experience calculation — computes years of experience per skill based on job dates and descriptions, not just the presence of a keyword
- Career trajectory analysis — identifies patterns like career progression, industry focus, and role consistency
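The normalization and experience-calculation steps look roughly like this — the alias table below is a toy slice of the 2,500-skill taxonomy, and the interval merge assumes job periods arrive as (start_year, end_year) pairs:

```python
# Toy slice of the skill taxonomy: lowercase alias -> canonical skill
SKILL_ALIASES = {
    "js": "JavaScript", "javascript": "JavaScript",
    "react.js": "React", "reactjs": "React",
    "ml": "Machine Learning", "machine learning": "Machine Learning",
}

def normalize_skill(raw: str) -> str:
    """Map a raw skill mention to its canonical form; unknown skills
    pass through unchanged for later taxonomy review."""
    return SKILL_ALIASES.get(raw.strip().lower(), raw.strip())

def years_of_experience(periods: list[tuple[float, float]]) -> float:
    """Sum years across job periods, merging overlaps so two
    concurrent roles using the same skill aren't double-counted."""
    merged: list[list[float]] = []
    for start, end in sorted(periods):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return sum(end - start for start, end in merged)
```

Merging overlapping periods matters more than it looks: candidates with side projects or concurrent contracts would otherwise show inflated per-skill experience.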
Stage 3: Scoring and Ranking
Goal: Score each candidate against job requirements and produce a ranked list with explanations.
- Requirement decomposition — job descriptions are parsed into weighted requirements: must-have skills, nice-to-have skills, experience thresholds, education requirements
- Feature vector construction — each candidate becomes a vector of ~120 features: skill match scores, experience match, trajectory signals, education match
- Gradient-boosted scoring using XGBoost, trained on 18 months of TalentFlow's historical hiring data (15,000 applications with outcome labels: hired, rejected at each stage)
- SHAP explanations — every score comes with a breakdown of contributing factors
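In spirit, Stage 3 does something like the following — though the real system is a trained XGBoost model over ~120 features, not hand-set weights. This toy scorer only illustrates requirement decomposition and the per-factor breakdown; every weight here is invented:

```python
def score_candidate(candidate: dict, job: dict) -> tuple[float, list]:
    """Toy stand-in for the trained scorer: decompose the job into
    weighted requirements and return a score plus the per-factor
    contributions that make the result explainable."""
    score, factors = 0.0, []
    for skill in job["must_have"]:
        if skill in candidate["skills"]:
            score += 15
            factors.append((skill, +15))
        else:
            score -= 10
            factors.append((f"missing {skill}", -10))
    for skill in job["nice_to_have"]:
        if skill in candidate["skills"]:
            score += 5
            factors.append((skill, +5))
    if candidate["years"] >= job["min_years"]:
        score += 10
        factors.append(("experience threshold met", +10))
    return score, factors
```

The key structural point survives the simplification: the score and its contributing factors are produced together, never the score alone.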
The Explainability Decision
The first version of Stage 3 used a neural network that achieved slightly higher accuracy (89% vs. 87% for XGBoost on our test set). We chose XGBoost anyway. Here's why:
XGBoost with SHAP values produces human-readable explanations: "This candidate scored 87/100. Key factors: 6 years of Python experience (+15), React expertise (+12), career progression from IC to lead (+8), missing Kubernetes requirement (-10)."
The neural network produced a number. Recruiters didn't trust it.
When we deployed XGBoost with explanations, recruiter adoption hit 94% within two weeks. With the neural network, recruiters were checking every result manually — effectively not using the system at all.
The lesson: a slightly less accurate model that people trust outperforms a more accurate model that people ignore. This applies far beyond recruitment — we've seen the same pattern in healthcare and financial services.
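Rendering the recruiter-facing breakdown from per-feature contributions (rounded SHAP values, in our case) is straightforward; the exact format and `top_k` cutoff below are illustrative:

```python
def render_explanation(score: int, contributions: dict[str, int],
                       top_k: int = 4) -> str:
    """Turn per-feature contributions into the human-readable
    breakdown shown to recruiters, largest absolute impact first."""
    top = sorted(contributions.items(), key=lambda kv: -abs(kv[1]))[:top_k]
    detail = ", ".join(f"{name} ({value:+d})" for name, value in top)
    return f"This candidate scored {score}/100. Key factors: {detail}."
```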
Handling Bias
Automated screening systems can encode and amplify historical biases. We addressed this at multiple levels:
Training data audit. Before training, we analyzed TalentFlow's historical hiring decisions for demographic patterns. We found and corrected for biases in the training data — for example, a pattern where candidates from certain universities were historically preferred regardless of qualifications.
Feature exclusion. The model never sees: name, age, gender, ethnicity, photo, or university name. We include education level and field of study, but not the institution.
Adverse impact testing. We run regular analyses to check whether the model's scores differ systematically across demographic groups. If disparate impact is detected, we investigate and retrain.
Human-in-the-loop. The system recommends — it never auto-rejects. A human recruiter reviews every decision, especially for candidates near the threshold.
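One common way to operationalize the adverse impact check is the four-fifths rule from US employment guidelines — a minimal sketch (whether a given deployment uses exactly this ratio or a statistical significance test is a per-client decision):

```python
def adverse_impact_ratio(outcomes: dict[str, tuple[int, int]]) -> float:
    """outcomes maps group -> (passed_screen, total_applicants).
    Returns the lowest group's selection rate divided by the highest;
    a value below 0.8 (the four-fifths rule) flags potential
    disparate impact for investigation."""
    rates = [passed / total for passed, total in outcomes.values()]
    return min(rates) / max(rates)
```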
This isn't just ethical — it's practical. Biased systems miss qualified candidates, which is the exact opposite of what TalentFlow hired us to solve.
Performance and Scaling
The pipeline processes a single resume in under 2 seconds end-to-end:
- Stage 1 (parsing): ~800ms average (varies by format)
- Stage 2 (extraction): ~600ms
- Stage 3 (scoring): ~200ms (batch scoring is faster)
At peak load (Monday mornings), the system processes 1,200+ resumes per hour. The pipeline runs on AWS Lambda for automatic scaling — each resume is processed independently, so scaling is linear.
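For a sense of scale, the Lambda concurrency needed even at peak is modest — a back-of-envelope calculation from the figures above:

```python
peak_per_hour = 1200          # peak throughput, Monday mornings
seconds_per_resume = 2.0      # end-to-end latency budget per resume
avg_concurrency = peak_per_hour * seconds_per_resume / 3600
# averages well under one concurrent execution; Lambda's value here is
# absorbing arrival bursts and idling to zero off-peak, not raw capacity
```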
Total infrastructure cost: ~$400/month at current volume (10,000 resumes/month). Compare that to the 40+ hours/week of recruiter time it replaced.
What We'd Build Differently
Feedback loop from day one. Recruiters can now flag disagreements with scores ("I think this candidate should be higher/lower"), and those flags feed back into retraining. We added this in month two. Starting from day one would have accelerated model improvement.
Better OCR pipeline. Tesseract handles most scanned resumes, but we've seen quality issues with low-resolution scans and resumes in non-Latin scripts. We'd evaluate cloud OCR services (Google Document AI, AWS Textract) for these edge cases.
Real-time skill taxonomy updates. Our skill taxonomy is updated monthly. In fast-moving fields (AI/ML especially), new tools and frameworks emerge faster than that. We'd build an automated system that detects new skill mentions and flags them for taxonomy inclusion.
Results After 6 Months
- 10,000+ resumes processed monthly with 96% extraction accuracy
- 85% reduction in recruiter screening time (from 40+ hours/week to ~6)
- 3x improvement in match quality (measured by interview-to-offer ratio)
- 94% recruiter adoption rate — the system is now integral to their workflow