
How to clean and standardise messy data at scale using AI

5 min read · Published 19 May 2026

Every organisation that has been running for more than a few years has the same problem: data entered by different people, at different times, in different formats. Company names spelled six ways. Job titles ranging from "CEO" to "Chief Executive" to "Chief Exec" to "C.E.O". Addresses missing postcodes, or with the postcode in the wrong field.

It's not a big problem until it is. A CRM you can't segment reliably. A mailing list where duplicates slip through because "Acme Ltd" and "Acme Limited" and "ACME LTD" are treated as three separate companies. A reporting system that breaks because the job_title field has 847 distinct values when it should have 12.

Alex ran operations at a B2B services company. Their CRM had 8,000 contacts accumulated over six years across three legacy systems. Before a major marketing push, she needed the data clean. A specialist data agency quoted £4,000 and six weeks. She cleaned it in a weekend using AI batch processing.

What AI is good at in data cleaning

Before getting into the how, it's worth being clear about where AI genuinely helps and where it doesn't.

AI is well-suited to:

AI is not well-suited to:

The data cleaning batch job is most valuable for standardisation and formatting problems — the kind where the right answer is knowable from what's in the row, but the entries are inconsistently formatted.

Building the spreadsheet

Alex exported the CRM with the four fields she needed to clean. She kept the original raw values and used the batch output as the cleaned version — never overwriting the source until she'd reviewed the results.

| company_name_raw | job_title_raw | address_raw | email_raw |
| --- | --- | --- | --- |
| MERIDIAN RETAIL GRP LTD | Head of Mktg | 14 Kings Road London | [email protected] |
| apex facilities management | ops director | Unit 4, Riverside Industrial Estate, Bristol, BS1 4RB | j.smith@apexfm |
| Vertex Logistics Plc. | C.O.O. | 22 Commerce Street, Manchester M2 1DH | [email protected] |

The multi-column format was important here. PromptMax sends each row to the model as a set of labelled field pairs — company_name_raw: MERIDIAN RETAIL GRP LTD, job_title_raw: Head of Mktg — so the model has full context about what each value is supposed to represent. A single-column approach where you concatenate everything into one cell loses that structure.
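
The labelled-pair idea can be sketched in a few lines of Python. This is an illustration only: the function name `row_to_labelled_pairs` and the exact `column: value` layout are assumptions based on the description above, not PromptMax's actual implementation.

```python
import csv
import io

def row_to_labelled_pairs(row: dict) -> str:
    """Render one spreadsheet row as 'column: value' lines (hypothetical layout)."""
    return "\n".join(f"{column}: {value.strip()}" for column, value in row.items())

# Two columns from the export above, held in a small in-memory CSV.
sample = io.StringIO(
    "company_name_raw,job_title_raw\n"
    "MERIDIAN RETAIL GRP LTD,Head of Mktg\n"
)

for row in csv.DictReader(sample):
    # Each row becomes a block of labelled pairs that would sit under the prompt.
    print(row_to_labelled_pairs(row))
```

Because every value travels with its column name, the model does not have to guess whether "Head of Mktg" is a job title or part of a company name.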

Writing the prompt

The prompt had two jobs: define the target format for each field, and specify exactly what output to produce so the results could be imported cleanly.

Prompt used:

You are a data cleaning specialist. Standardise the CRM record below according to these rules. Output exactly the cleaned values in this format — no explanation, no preamble:

COMPANY_NAME: [standardised company name — Title Case, spell out abbreviations where obvious (GRP→Group, Mgt→Management, Plc.→PLC), remove trailing punctuation]
JOB_TITLE: [standardised title — Title Case, expand common abbreviations (Mktg→Marketing, Dir→Director, C.O.O.→COO, ops→Operations), use standard form]
ADDRESS_CLEAN: [formatted as: Street, City, Postcode — add missing postcode as UNKNOWN if not present]
EMAIL_VALID: [YES if the email looks valid, NO if it is missing @ or domain, UNCERTAIN if partial]

Rules:
— If a field is empty or clearly nonsensical, output MISSING for that field
— Do not invent data — only standardise and format what is present
— For company names: do not add "Ltd" or "PLC" if not present in the original — only standardise what is there
— EMAIL_VALID is a flag only — do not attempt to correct invalid emails

The "do not invent data" rule is the most important constraint. Without it, the model will sometimes fill in plausible-looking values for missing fields — which is worse than flagging them as MISSING, because invented data looks clean and gets through downstream checks.
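
A prompt rule can be backed up with a simple automated guard. The sketch below flags any cleaned company name containing a legal suffix that was absent from the raw value, which would suggest the model added something. The suffix list and function names are illustrative assumptions, not part of Alex's workflow.

```python
import re

# Hypothetical list of legal suffixes to watch for; extend it for your own data.
LEGAL_SUFFIXES = {"ltd", "limited", "plc", "llp", "inc"}

def tokens(name: str) -> set:
    """Lowercase word tokens with punctuation removed."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))

def added_legal_suffix(raw: str, cleaned: str) -> set:
    """Return legal suffixes present in the cleaned name but not in the raw one."""
    return (tokens(cleaned) & LEGAL_SUFFIXES) - tokens(raw)

# No suffix in the raw value and none added: nothing to flag.
print(added_legal_suffix("apex facilities management", "Apex Facilities Management"))

# A suffix that was not in the raw value: flag the row for review.
print(added_legal_suffix("apex facilities management", "Apex Facilities Management Ltd"))
```

A check like this would also flag "Ltd" being expanded to "Limited", which is arguably worth a second look anyway.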

What the output looked like

| company_name_raw | AI responses |
| --- | --- |
| MERIDIAN RETAIL GRP LTD | COMPANY_NAME: Meridian Retail Group Ltd |
| | JOB_TITLE: Head of Marketing |
| | ADDRESS_CLEAN: 14 Kings Road, London, UNKNOWN |
| | EMAIL_VALID: YES |
| apex facilities management | COMPANY_NAME: Apex Facilities Management |
| | JOB_TITLE: Operations Director |
| | ADDRESS_CLEAN: Unit 4, Riverside Industrial Estate, Bristol, BS1 4RB |
| | EMAIL_VALID: NO |
| Vertex Logistics Plc. | COMPANY_NAME: Vertex Logistics PLC |
| | JOB_TITLE: COO |
| | ADDRESS_CLEAN: 22 Commerce Street, Manchester, M2 1DH |
| | EMAIL_VALID: YES |

She then split the AI responses column into four separate cleaned fields using a simple text formula in Excel, giving her a clean version of each field alongside the original. That side-by-side view made the review pass fast: any row where the cleaned value looked meaningfully different from the raw input got a second look.
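
Alex did the split with an Excel formula; the same step can be sketched in Python for anyone working outside a spreadsheet. The function names and the 0.6 similarity threshold are illustrative assumptions, and `difflib.SequenceMatcher` is only one way to approximate "meaningfully different".

```python
from difflib import SequenceMatcher

EXPECTED_KEYS = ["COMPANY_NAME", "JOB_TITLE", "ADDRESS_CLEAN", "EMAIL_VALID"]

def parse_response(response: str) -> dict:
    """Split a 'KEY: value' response into a dict; raise if any key is missing."""
    fields = {}
    for line in response.strip().splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip() in EXPECTED_KEYS:
            fields[key.strip()] = value.strip()
    missing = [k for k in EXPECTED_KEYS if k not in fields]
    if missing:
        raise ValueError(f"response is missing fields: {missing}")
    return fields

def needs_second_look(raw: str, cleaned: str, threshold: float = 0.6) -> bool:
    """Flag a pair whose case-insensitive similarity falls below the threshold."""
    ratio = SequenceMatcher(None, raw.lower(), cleaned.lower()).ratio()
    return ratio < threshold

response = """COMPANY_NAME: Meridian Retail Group Ltd
JOB_TITLE: Head of Marketing
ADDRESS_CLEAN: 14 Kings Road, London, UNKNOWN
EMAIL_VALID: YES"""

cleaned = parse_response(response)
print(cleaned["JOB_TITLE"])
print(needs_second_look("MERIDIAN RETAIL GRP LTD", cleaned["COMPANY_NAME"]))
```

Raising on a missing key means a malformed response surfaces immediately instead of silently producing a blank cell in the cleaned columns.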

Running the batch

Alex ran all 8,000 rows on Gemini 2.5 Flash. The task is highly structured: the input format is predictable, the output format is tightly constrained, and there's no creative interpretation involved. Flash handles this kind of rule-based formatting task well, and at a fraction of the cost of Gemini 2.5 Pro.

The batch completed in about 90 minutes. Total cost: under £2.

Results of the review

Alex spot-checked 200 rows — roughly 2.5% — focusing on the cases where the raw and cleaned values diverged most. Her findings:

What this cost compared to the alternative

The data agency quote was £4,000 and six weeks. The batch job cost under £2 in compute and a weekend of Alex's time, most of which went on the review pass rather than the cleaning itself. The 340 addresses that came back with an UNKNOWN postcode took another two hours to fix manually. Everything else was handled by the batch.

Data quality won't be perfect from a single AI pass — it rarely is from a manual pass either. But the batch got 8,000 rows from unusable to working in a fraction of the time and cost, with a clear audit trail of what changed and why.

Clean your messy data in a single batch job

Upload your spreadsheet, write your standardisation rules once as a prompt,
and PromptMax applies them to every row automatically.
Start with £5 free credit. No card needed.
