The survey had gone well. 5,000 responses to a single open-ended question — "What was your biggest professional challenge this year?" — collected from delegates at a large industry conference. The client wanted a full thematic breakdown: 12 categories, sentiment scoring, and a one-line summary per response.
For Priya, a senior analyst at a mid-size research consultancy, this was the kind of brief that used to require a team. Two coders working from a shared codebook, a third reviewer to handle disagreements, a week of alignment calls. Even then, inter-rater reliability was never perfect — coders got tired, edge cases caused drift, and the final output always felt a little arbitrary around categories 8 through 12.
She had three working days before the client presentation.
The problem with manual coding at scale
Applying a codebook isn't intellectually difficult. Designing it (the 12 categories, their definitions, what counts as "primary" versus "secondary") takes expertise. But applying it row by row is mostly clerical work. Match the response to the category. Note the sentiment. Write the summary. Repeat 5,000 times.
The challenge is that humans doing clerical work for hours become inconsistent. A coder on response 4,800 isn't making the same judgements they made on response 200. You can build inter-rater reliability protocols to manage this, but that adds time and doesn't eliminate the drift; it only measures it.
What Priya needed was something that applied the codebook consistently across all 5,000 rows, with no fatigue and no drift.
Building the CSV
The survey export gave her one column she needed: the raw response text. She added two more columns during prep:
| response_id | response_text | respondent_sector |
|---|---|---|
| R001 | Managing hybrid teams has been really hard — people want different things and there's no one-size policy that works | Financial Services |
| R002 | Budget cuts mid-year meant we had to shelve a project we'd been building for 18 months | Technology |
The respondent_sector column wasn't required for the primary coding task, but she included it knowing the client would want sector-level breakdowns later. Having it in the same CSV meant she could slice the output any way she needed without a join.
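The prep step can be sketched in a few lines of Python. This is a minimal illustration, not a record of how Priya built her file: the function names `add_prep_columns` and `rows_to_csv` are hypothetical, and it assumes each raw response already arrives paired with a sector label from the survey or registration data.

```python
import csv
import io

FIELDNAMES = ["response_id", "response_text", "respondent_sector"]

def add_prep_columns(responses):
    """Turn (text, sector) pairs into rows with a sequential response_id.

    Assumption: the sector for each response is already known at prep time.
    """
    rows = []
    for i, (text, sector) in enumerate(responses, start=1):
        rows.append({
            "response_id": f"R{i:03d}",       # R001, R002, ... as in the table above
            "response_text": text.strip(),    # trim stray whitespace from the export
            "respondent_sector": sector,
        })
    return rows

def rows_to_csv(rows):
    """Serialise the rows to CSV text; the csv module quotes commas in free text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Using the csv module rather than joining strings by hand matters here, because open-ended responses routinely contain commas and quotation marks.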
Designing the batch instructions
This was where the analytical work happened. She needed to define the 12 categories precisely enough that the model would apply them consistently — the same way she'd written a codebook for a human coding team, but more explicit about output format.
Batch instructions used:
You are a qualitative research analyst. Classify the survey response in response_text according to the codebook below. Output exactly this format — no extra text:
PRIMARY_CATEGORY: [one of the 12 categories]
SECONDARY_CATEGORY: [one of the 12 categories, or NONE]
SENTIMENT: [positive / neutral / negative / mixed]
SUMMARY: [one sentence, max 20 words, third person]
Codebook:
1. TALENT — hiring, retention, skills gaps, team structure
2. HYBRID_WORK — remote work, office policy, collaboration across locations
3. BUDGET — cuts, cost pressure, underfunding, resource constraints
4. TECHNOLOGY — tech adoption, digital transformation, software/tool issues
5. LEADERSHIP — management quality, direction setting, executive decisions
6. WELLBEING — burnout, mental health, workload, work-life balance
7. GROWTH — career progression, learning, promotion, recognition
8. REGULATION — compliance, legal changes, regulatory burden
9. MARKET — competition, demand shifts, economic conditions
10. COMMUNICATION — internal comms, cross-team alignment, information flow
11. CHANGE_MGMT — restructuring, mergers, process change, uncertainty
12. OTHER — does not fit any above category clearly
Rules:
— If the response clearly fits two categories, use both primary and secondary
— If only one category fits, set SECONDARY_CATEGORY to NONE
— SENTIMENT refers to how the respondent feels about their challenge, not the topic itself
— Do not infer meaning not present in the text
— If the response is too short or vague to code reliably, set PRIMARY_CATEGORY to OTHER
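Because the instructions pin the output to four labelled lines, each reply can be checked mechanically before it goes anywhere near a pivot table. The sketch below is one hypothetical way to do that; the function name `parse_coded_reply` and the choice to raise `ValueError` on any deviation are assumptions, not something the workflow above is known to have used.

```python
CATEGORIES = {
    "TALENT", "HYBRID_WORK", "BUDGET", "TECHNOLOGY", "LEADERSHIP", "WELLBEING",
    "GROWTH", "REGULATION", "MARKET", "COMMUNICATION", "CHANGE_MGMT", "OTHER",
}
SENTIMENTS = {"positive", "neutral", "negative", "mixed"}
FIELDS = ("PRIMARY_CATEGORY", "SECONDARY_CATEGORY", "SENTIMENT", "SUMMARY")

def parse_coded_reply(reply):
    """Parse one model reply into a dict and validate it against the codebook."""
    parsed = {}
    for line in reply.strip().splitlines():
        # Split on the first colon only, so a colon inside the summary survives.
        key, sep, value = line.partition(":")
        if sep and key.strip() in FIELDS:
            parsed[key.strip()] = value.strip()

    missing = [field for field in FIELDS if field not in parsed]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if parsed["PRIMARY_CATEGORY"] not in CATEGORIES:
        raise ValueError(f"unknown primary category: {parsed['PRIMARY_CATEGORY']}")
    if parsed["SECONDARY_CATEGORY"] not in CATEGORIES | {"NONE"}:
        raise ValueError(f"unknown secondary category: {parsed['SECONDARY_CATEGORY']}")
    parsed["SENTIMENT"] = parsed["SENTIMENT"].lower()
    if parsed["SENTIMENT"] not in SENTIMENTS:
        raise ValueError(f"unknown sentiment: {parsed['SENTIMENT']}")
    if len(parsed["SUMMARY"].split()) > 20:
        raise ValueError("summary exceeds 20 words")
    return parsed
```

Rows that fail validation are better queued for a rerun or a manual look than silently coerced into OTHER, since a malformed reply is a different problem from an ambiguous response.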
She ran a test batch of 50 responses first, checking the output against her own manual coding of the same rows. Agreement was high — over 90% on primary category, with most disagreements falling on genuinely ambiguous cases where she'd have flagged inter-rater issues anyway.
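The agreement check on a test batch is simple arithmetic, sketched below. Percent agreement is the figure quoted above; Cohen's kappa is added here as a common chance-corrected companion and is not something the account says Priya computed. The function names are illustrative.

```python
from collections import Counter

def percent_agreement(manual, model):
    """Share of rows where the manual code and the model code match."""
    if len(manual) != len(model) or not manual:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(a == b for a, b in zip(manual, model)) / len(manual)

def cohens_kappa(manual, model):
    """Agreement corrected for the matches expected by chance alone."""
    n = len(manual)
    observed = percent_agreement(manual, model)
    manual_counts, model_counts = Counter(manual), Counter(model)
    labels = manual_counts.keys() | model_counts.keys()
    # Chance agreement: probability both coders pick the same label independently.
    expected = sum(manual_counts[l] * model_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

On a 50-row test batch, either number is only a rough guide: a handful of disagreements moves the percentage by several points, so reading the disagreeing rows is as informative as the score.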
Running the full batch
The full 5,000 rows ran on Gemini 2.5 Pro. She chose Pro over Flash for this task because the responses varied significantly in length and complexity: some ran to several paragraphs, some were a single clause. Pro handled the edge cases more reliably, and at 5,000 rows the cost difference was marginal.
The batch completed in just under two hours.
What the output looked like
The output CSV had the original three columns plus four new ones: PRIMARY_CATEGORY, SECONDARY_CATEGORY, SENTIMENT, and SUMMARY. She loaded it into a pivot table and had the client's thematic breakdown in about 20 minutes.
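The same thematic breakdown can be reproduced outside a spreadsheet. The sketch below counts primary categories straight from the output CSV text and adds each category's share of the total; `primary_category_shares` is a hypothetical helper name, and the column name is taken from the output described above.

```python
import csv
import io
from collections import Counter

def primary_category_shares(csv_text):
    """Return (category, count, percent) tuples, most frequent first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row["PRIMARY_CATEGORY"] for row in reader)
    total = sum(counts.values())
    return [
        (category, n, round(100 * n / total, 1))
        for category, n in counts.most_common()
    ]
```

A scripted version is mainly useful when the batch is rerun after a codebook tweak, since the breakdown can then be regenerated without rebuilding the pivot by hand.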
A few things stood out from the review:
- Around 180 responses (about 3.6% of the total) were coded as OTHER. She reviewed those manually and found most were genuinely ambiguous or off-topic, not coding failures.
- The SUMMARY column was consistently useful. Rather than reading the raw response, she could skim the summaries to spot outliers and flag interesting verbatims for the client presentation.
- Sentiment distribution looked right — roughly what she'd expect from an open-ended challenge question. Predominantly negative, with a cluster of neutral responses from people who'd framed the challenge as solved.
What made this work
The codebook definitions mattered enormously. Early test runs with vague category descriptions produced inconsistent output — HYBRID_WORK and COMMUNICATION were frequently confused because both involve team interaction. Adding specific distinguishing criteria to each definition fixed most of the overlap.
Asking for a secondary category improved accuracy on the primary. When the model could flag a second category instead of forcing everything into one, it made cleaner primary assignments. Responses that would have been ambiguous if single-coded ended up better classified overall.
Gemini 2.5 Pro was the right choice here. The responses ranged from one sentence to several paragraphs. Flash is fast and cost-effective for uniform, short inputs — but for varied qualitative data where nuance matters, Pro produced noticeably better output on edge cases.
The sector column paid off. Having it in the original CSV meant the final output was immediately sliceable. The client's first follow-up question after the presentation was "how does this break down by sector?" — Priya had the answer in under a minute.
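A sector breakdown of this kind is a two-way count, which is why keeping the sector in the same file removes the need for a join. A minimal sketch, assuming the rows have been loaded as dictionaries keyed by the column names above (for example with `csv.DictReader`); `sector_breakdown` is an illustrative name:

```python
from collections import Counter, defaultdict

def sector_breakdown(rows):
    """Count primary categories within each respondent sector."""
    table = defaultdict(Counter)
    for row in rows:
        table[row["respondent_sector"]][row["PRIMARY_CATEGORY"]] += 1
    # Convert to plain dicts so the result is easy to print or serialise.
    return {sector: dict(counts) for sector, counts in table.items()}
```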