How to Clean and Standardise Messy Data at Scale Using AI

Every organisation that has been running for more than a few years has the same problem: data entered by different people, at different times, in different formats. Company names spelled six ways. Job titles ranging from "CEO" to "Chief Executive" to "Chief Exec" to "C.E.O". Addresses missing postcodes, or with the postcode in the wrong field.

It's not a big problem until it is. A CRM you can't segment reliably. A mailing list where duplicates slip through because "Acme Ltd" and "Acme Limited" and "ACME LTD" are treated as three separate companies. A reporting system that breaks because the job_title field has 847 distinct values when it should have 12.

Alex ran operations at a B2B services company. Their CRM had 8,000 contacts accumulated over six years across three legacy systems. Before a major marketing push, she needed the data clean. A specialist data agency quoted £4,000 and six weeks. She cleaned it in a weekend using AI batch processing.

What AI is good at in data cleaning

Before getting into the how, it's worth being clear about where AI genuinely helps and where it doesn't.

AI is well-suited to:

Standardising free-text fields to a defined format — "Managing Dir." → "Managing Director"
Normalising company name variants — "Acme Ltd", "ACME LIMITED", "Acme" → "Acme Ltd"
Inferring missing but derivable data — if the address has a street and city, it can often infer or flag a missing postcode
Classifying messy values into a fixed taxonomy — mapping 847 job title variants to 12 seniority bands
Detecting and flagging obvious anomalies — email addresses with no @, phone numbers that are clearly invalid, dates in the wrong century

AI is not well-suited to:

Verifying facts it can't see — it can't confirm that a company name is correct if there's nothing else in the row to cross-reference
Deduplicating across rows — that requires comparing rows against each other, which batch processing doesn't do (each row is processed independently)
Correcting data that's wrong in a way that requires external knowledge — a misspelled company name where the correct name isn't inferable from context

The data cleaning batch job is most valuable for standardisation and formatting problems — the kind where the right answer is knowable from what's in the row, but the entries are inconsistently formatted.

Building the spreadsheet

Alex exported the CRM with the four fields she needed to clean. She kept the original raw values and used the batch output as the cleaned version — never overwriting the source until she'd reviewed the results.

company_name_raw	job_title_raw	address_raw	email_raw
MERIDIAN RETAIL GRP LTD	Head of Mktg	14 Kings Road London	[email protected]
apex facilities management	ops director	Unit 4, Riverside Industrial Estate, Bristol, BS1 4RB	j.smith@apexfm
Vertex Logistics Plc.	C.O.O.	22 Commerce Street, Manchester M2 1DH	[email protected]

The multi-column format was important here. PromptMax sends each row to the model as a set of labelled field pairs — company_name_raw: MERIDIAN RETAIL GRP LTD, job_title_raw: Head of Mktg — so the model has full context about what each value is supposed to represent. A single-column approach where you concatenate everything into one cell loses that structure.

Writing the prompt

The prompt had two jobs: define the target format for each field, and specify exactly what output to produce so the results could be imported cleanly.

Prompt used:

You are a data cleaning specialist. Standardise the CRM record below according to these rules. Output exactly the cleaned values in this format — no explanation, no preamble:

COMPANY_NAME: [standardised company name — Title Case, spell out abbreviations where obvious (GRP→Group, Mgt→Management, Plc.→PLC), remove trailing punctuation]
JOB_TITLE: [standardised title — Title Case, expand common abbreviations (Mktg→Marketing, Dir→Director, C.O.O.→COO, ops→Operations), use standard form]
ADDRESS_CLEAN: [formatted as: Street, City, Postcode — add missing postcode as UNKNOWN if not present]
EMAIL_VALID: [YES if the email looks valid, NO if it is missing @ or domain, UNCERTAIN if partial]

Rules:
— If a field is empty or clearly nonsensical, output MISSING for that field
— Do not invent data — only standardise and format what is present
— For company names: do not add "Ltd" or "PLC" if not present in the original — only standardise what is there
— EMAIL_VALID is a flag only — do not attempt to correct invalid emails

The "do not invent data" rule is the most important constraint. Without it, the model will sometimes fill in plausible-looking values for missing fields — which is worse than flagging them as MISSING, because invented data looks clean and gets through downstream checks.

What the output looked like

company_name_raw	AI responses
MERIDIAN RETAIL GRP LTD	COMPANY_NAME: Meridian Retail Group Ltd JOB_TITLE: Head of Marketing ADDRESS_CLEAN: 14 Kings Road, London, UNKNOWN EMAIL_VALID: YES
apex facilities management	COMPANY_NAME: Apex Facilities Management JOB_TITLE: Operations Director ADDRESS_CLEAN: Unit 4, Riverside Industrial Estate, Bristol, BS1 4RB EMAIL_VALID: NO
Vertex Logistics Plc.	COMPANY_NAME: Vertex Logistics PLC JOB_TITLE: COO ADDRESS_CLEAN: 22 Commerce Street, Manchester, M2 1DH EMAIL_VALID: YES

She then split the AI responses column into four separate cleaned fields using a simple text formula in Excel, giving her a clean version of each field alongside the original. That side-by-side view made the review pass fast: any row where the cleaned value looked meaningfully different from the raw input got a second look.

Running the batch

8,000 rows on Gemini 2.5 Flash. The task is highly structured — the input format is predictable, the output format is tightly constrained, and there's no creative interpretation involved. Flash handles this kind of deterministic formatting task well, and at a fraction of Pro's cost.

The batch completed in about 90 minutes. Total cost: under £2.

Results of the review

Alex spot-checked 200 rows — roughly 2.5% — focusing on the cases where the raw and cleaned values diverged most. Her findings:

Company names: 96% accurately standardised. The failures were mostly proper nouns where the abbreviation was ambiguous — "MRL Group" expanded to "Meridian Retail Ltd Group" (wrong) instead of being left as-is. Adding a rule to the prompt ("do not expand abbreviations that could be brand names") fixed this on a second pass for flagged rows.
Job titles: 98% accurate. The model handled everything from "C.E.O." to "exec dir" to "snr mktg mgr" correctly. The few errors were genuinely ambiguous abbreviations where a human wouldn't have done better.
Addresses: 340 rows came back with UNKNOWN postcode — she exported just those rows, ran a quick lookup against a Royal Mail postcode file for the ones where city and street were clear, and filled in the missing values manually. A two-hour job instead of eight thousand.
Email flags: 127 invalid emails identified. All genuine issues — missing domains, truncated addresses, entries where someone had typed a phone number into the email field.

What this cost compared to the alternative

The data agency quote was £4,000 and six weeks. The batch job cost £2 in compute and a weekend of Alex's time — most of which was the review pass, not the cleaning itself. The 340 UNKNOWN postcodes took another two hours to fix manually. Everything else was handled by the batch.

Data quality won't be perfect from a single AI pass — it rarely is from a manual pass either. But the batch got 8,000 rows from unusable to working in a fraction of the time and cost, with a clear audit trail of what changed and why.

How to clean and standardise messy data at scale using AI