Applied AI, UX Lead at Google
GenUX/UI · LLM Evals · Autoraters
Image → zod-typed JSON via a vision LLM. Schema-first design, validation gates with retry, locale handling, confidence signalling, and the anti-patterns that cause silent garbage extraction.
$ npx skills add darrenhead/skills --skill multimodal-structured-extraction
A field manual for turning images into validated, typed objects with a vision LLM. Distilled from production work shipping multiple image-to-JSON pipelines across receipts, restaurant menus, and financial documents, and generalised to the cross-cutting pattern.
The gap between "Gemini can read images" and "Gemini reliably extracts structured data" is roughly one engineering month of pain: silent field omissions, locale-shaped numbers parsed as garbage, snake_case vs camelCase drift, hallucinated fields when the model is uncertain, and pipelines that fall over the first time the API rate-limits you. This skill is the shape of the work you'd otherwise have to redo from scratch.
inlineData or image parts and want the output as JSON, not prose.
pdf-parse. Barcodes → a barcode library. MRZ on a passport → an MRZ library. Don't hand structured machine-readable inputs to an LLM and pay tokens for the privilege of getting them wrong sometimes.
Every production extractor has these four parts. Skip any of them and you'll see it in production: silent field drift, garbage on the tails, brittle to rate limits, users trusting wrong values.
Define the zod schema before you write a single character of the prompt. The schema is the contract; the prompt is implementation that serves the contract. Inverting the order — writing a prompt and then "writing a schema to match" — is how you end up with fields the model returns sometimes and not others, with mixed snake_case and camelCase, and with amount arriving as both a number and a string depending on the receipt.
The schema gives you:
amount is non-negative, currency is a known ISO code, date is ISO-8601).
Write the schema before the prompt. Derive the prompt's "fields to extract" section from the schema, not the other way around. Regenerate the prompt when the schema changes.
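A minimal sketch of deriving the prompt's field list from a single source of truth. The `FieldSpec` table, `receiptFields`, and `renderFieldsSection` are hypothetical names introduced for illustration; here a plain table stands in for the schema metadata, and in a zod project you might instead read the same information from `.describe()` annotations on the schema.

```typescript
// Hypothetical field table. In a real project this would be generated from
// (or live next to) the zod schema so the two cannot drift apart.
interface FieldSpec {
  name: string
  type: string // human-readable type and constraint
  hint: string // what to look for, and what to do if absent
  enumValues?: readonly string[]
}

const receiptFields: readonly FieldSpec[] = [
  { name: "amount", type: "number, non-negative", hint: "Grand total as printed." },
  {
    name: "currency",
    type: "string, ISO 4217 code",
    hint: "Currency printed on the receipt.",
    enumValues: ["EUR", "USD", "SGD"],
  },
  { name: "date", type: "string, ISO-8601 (YYYY-MM-DD)", hint: "Transaction date." },
]

// Render one prompt line per field, in the same shape as the template's
// "Fields to extract" section.
function renderFieldsSection(fields: readonly FieldSpec[]): string {
  const lines = fields.map((f) => {
    const enumPart = f.enumValues ? ` Must be one of: ${f.enumValues.join(" | ")}.` : ""
    return `- ${f.name}: ${f.type}. ${f.hint}${enumPart}`
  })
  return ["# Fields to extract", ...lines].join("\n")
}

const fieldsSection = renderFieldsSection(receiptFields)
```

Because the section is rendered rather than hand-written, adding or renaming a field changes the prompt in the same commit as the schema.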
Every vision-LLM response goes through schema.safeParse before anything else touches it. Three outcomes:
Result.failure with the original raw response attached for debugging. Do not throw. Throwing pushes the failure mode upward into code that won't handle it; returning a Result forces every caller to make a decision.
Two non-obvious rules:
.strict() so unknown fields fail validation instead of silently passing through. The model inventing subtotalBeforeTax when your schema only has subtotal is information you want, not noise to absorb.
z.coerce.number() for fields the model might return as a string ("12.50"), but don't bolt regex repair onto bad responses. If the model can't return the right shape after one retry with the error message, the prompt is wrong: fix the prompt, don't patch the output.
The single biggest source of silent extraction errors. The model sees "1.234,56" on a German receipt and the schema accepts a number, so the response is the number 1.234 and the 56 is lost. Or it sees "06/05/2026" and decides that's 6 May, when in the US convention it's June 5 and you've just misfiled a transaction by 30 days.
Locale handling has four dimensions:
. vs ,), thousands separator, currency symbol position, presence of leading zeros. Always extract the raw amount in the receipt's currency and let your server do FX conversion against an authoritative source; never ask the model to convert.
YYYY-MM-DD is the only acceptable storage format. Tell the model to return ISO-8601 explicitly. Then validate against z.string().regex(/^\d{4}-\d{2}-\d{2}$/) and reject anything else.
Encode locale as a runtime parameter, not a global. The same extractor needs to handle a Singaporean receipt and a French one, so branch the prompt's hints and example outputs on the locale you're targeting, not on what you guess from the image.
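A sketch of what "locale as a runtime parameter" can look like on the server side, assuming you keep the printed amount string and parse it yourself. `DECIMAL_SEPARATOR`, `parseLocalizedAmount`, and `isRealIsoDate` are hypothetical helpers, and the locale table is deliberately tiny. The date check goes one step past the regex above, which on its own would accept an impossible date such as 2026-02-30.

```typescript
// Decimal separators for a few locales. Hypothetical, hand-maintained table;
// extend it for the locales you actually support.
const DECIMAL_SEPARATOR: Record<string, "." | ","> = {
  "en-US": ".",
  "en-SG": ".",
  "de-DE": ",",
  "fr-FR": ",",
}

// Parse a printed amount such as "1.234,56" using the caller-supplied locale.
// Returns null instead of guessing when the locale is unknown or the string
// does not reduce to a single well-formed number.
function parseLocalizedAmount(raw: string, locale: string): number | null {
  const decimal: "." | "," | undefined = DECIMAL_SEPARATOR[locale]
  if (decimal === undefined) return null
  // Keep digits, the locale's decimal separator and minus signs; drop
  // thousands separators, currency symbols, letters and whitespace.
  const kept = raw
    .split("")
    .filter((ch) => (ch >= "0" && ch <= "9") || ch === decimal || ch === "-")
    .join("")
  // Only the first separator is rewritten, so a second one fails the check below.
  const normalised = kept.replace(decimal, ".")
  if (!/^-?\d+(\.\d+)?$/.test(normalised)) return null
  return Number(normalised)
}

// True only for YYYY-MM-DD strings that name a real calendar day.
function isRealIsoDate(value: string): boolean {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value)
  if (!match) return false
  const year = Number(match[1])
  const month = Number(match[2])
  const day = Number(match[3])
  const date = new Date(Date.UTC(year, month - 1, day))
  // Date.UTC rolls invalid days over (Feb 30 becomes Mar 2), so round-trip.
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  )
}
```

With the locale passed explicitly, `parseLocalizedAmount("1.234,56", "de-DE")` yields 1234.56 rather than 1.234, and an unsupported locale produces `null` so the caller has to handle it.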
Ask the model to flag fields it's uncertain about. Two patterns work:
confidence: "high" | "medium" | "low" field on the whole extraction. Route low-confidence results to a human review queue. Cheap to implement, immediately useful.
uncertainFields: string[] array listing field names the model wasn't sure about. More expensive (more tokens, more anchor work in the prompt) but lets you build a UI that highlights specific fields for the user to verify.
Whichever you pick, act on it. A confidence field you log and ignore is theatre. Either it gates routing (low → human queue) or it gates display (low → "needs review" badge) or it gates storage (low → quarantine table). If it does none of those, delete it.
Critically: tell the model what "low confidence" means. "Low = blurry, occluded, or the format doesn't match what I'd expect from this document type" works. "Low = you're not sure" doesn't — the model is never sure, and you'll either get every response flagged low or every response flagged high.
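One possible gating policy, sketched as a pure function so it is easy to test. The names `ConfidenceSignal`, `Gate`, and `gateOnConfidence` are hypothetical, and the specific policy (low goes to a human, uncertain fields get a badge, everything else is stored) is an assumption; pick whichever gates fit your product.

```typescript
type Confidence = "high" | "medium" | "low"

interface ConfidenceSignal {
  confidence: Confidence
  uncertainFields: string[]
}

// Each variant corresponds to something the pipeline actually does.
type Gate =
  | { action: "human_queue" }
  | { action: "badge"; fields: string[] }
  | { action: "store" }

function gateOnConfidence(signal: ConfidenceSignal): Gate {
  // Whole-extraction doubt: a person looks at it before anything is stored.
  if (signal.confidence === "low") return { action: "human_queue" }
  // Field-level doubt: store it, but highlight the named fields in the UI.
  if (signal.uncertainFields.length > 0) {
    return { action: "badge", fields: signal.uncertainFields }
  }
  return { action: "store" }
}
```

If you cannot name the branch a given confidence value would take in a function like this, the field is not gating anything.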
A generalised template. Replace {PLACEHOLDERS}. The structure is the contribution; the wording is yours to write per domain.
You are extracting structured data from an image of a {DOCUMENT_TYPE}.
Return JSON only — no prose, no markdown, no code fences.
# Context
This image was captured by a {USER_CONTEXT — e.g. "logistics ops user
photographing a shipping label on a warehouse floor"}. The image may be
{EXPECTED_QUALITY_RANGE — e.g. "well-lit and aligned, or skewed and
partially shadowed"}. Locale: {LOCALE_CODE}. Default currency:
{DEFAULT_CURRENCY}. Default date format on this document type:
{EXPECTED_DATE_FORMAT}.
# Fields to extract
{FOR EACH SCHEMA FIELD, ONE LINE:}
- {fieldName}: {type and constraint}. {What to look for, where on the
document, what to do if absent}. {If enum:} Must be one of:
{ENUM_VALUES_PIPE_SEPARATED}.
# Locale rules
- Amounts: extract in the currency printed on the document. Preserve
exact decimals — {EXAMPLE_OF_LOCAL_NUMBER_FORMAT}. Do not convert,
do not round.
- Dates: return ISO-8601 (YYYY-MM-DD). If the document uses
{LOCAL_FORMAT}, convert. If the year is ambiguous (2-digit), assume
the most recent year ≤ today.
- Text fields: return in {OUTPUT_LANGUAGE}. If the source is in a
different language, translate. Do not mix languages in a single field.
# Confidence
Set confidence to "low" if any of these are true:
- {DOMAIN_SPECIFIC_LOW_CONFIDENCE_SIGNAL_1}
- {DOMAIN_SPECIFIC_LOW_CONFIDENCE_SIGNAL_2}
- The image is blurry, occluded, or doesn't appear to be a
{DOCUMENT_TYPE} at all.
Set confidence to "high" only when every field was directly legible
and unambiguous. Otherwise "medium".
# Output schema
{INLINE THE JSON SHAPE — not as a schema, as a literal example with
realistic values. Models follow examples more reliably than they follow
schema descriptions.}
# Failure mode
If the image is not a {DOCUMENT_TYPE}, or is too degraded to extract
the required fields, return:
{ "skipped": true, "reason": "<one sentence>" }
Do not invent values. Do not return partial extractions with placeholder
strings. Empty is better than wrong.
Return JSON only.
Annotations:
tax: null, one with tax: [{...}]).
skipped escape hatch. Without it, the model will hallucinate fields rather than admit it can't see them. With it, your pipeline gets a clean signal to surface "we couldn't read this, try a better photo".
Return JSON only at the start and end. Repetition isn't redundancy; models attend to the first and last tokens of an instruction block more strongly than the middle.
How to design the zod schema for extraction. Five rules and three example sketches.
Rules:
nullable for "the field exists but the model couldn't find it"; optional for "the field doesn't apply to this document type". A receipt for cash has paymentMethod: "cash" not paymentMethod: undefined. A receipt without a visible invoice number has invoiceNumber: null, not absent.
"other" sentinel rather than an arbitrary string.
{ name, quantity, unitPrice } together: nest them. The document's taxBreakdown is an array of { rate, taxableAmount, taxAmount } records: nest them. Don't nest just because the prompt has section headers.
amount: 12.50, currency: "EUR", also store originalAmountString: "EUR 12,50". When something goes wrong six months later, you'll want the raw input.
meta block. Always: extractionTimestamp, modelId, promptVersion, confidence, processingTimeMs. When extractions disagree across model versions you'll need to know which one produced which row.
Sketch 1 — Boarding pass:
const BoardingPass = z.object({
  passengerName: z.string(),
  // IATA airline designators can contain a digit (e.g. "B6"), so allow A-Z and 0-9
  flightNumber: z.string().regex(/^[A-Z0-9]{2,3}\d{1,4}$/),
  from: z.string().regex(/^[A-Z]{3}$/), // IATA airport code: three uppercase letters
  to: z.string().regex(/^[A-Z]{3}$/),
  departureLocal: z.string().regex(/^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}$/),
  boardingLocal: z.string().regex(/^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}$/).nullable(),
  seat: z.string().nullable(),
  ticketClass: z.enum(["economy", "premium_economy", "business", "first"]),
  pnr: z.string().length(6),
  confidence: z.enum(["high", "medium", "low"]),
}).strict()
Notes: seat is nullable because it may not be assigned yet at print time. from / to are constrained to IATA — three uppercase letters — so the model can't return "Heathrow". pnr is exactly 6 characters; a 5-char or 7-char value is known to be wrong.
Sketch 2 — Business card:
const BusinessCard = z.object({
  fullName: z.string(),
  fullNameOriginalScript: z.string().nullable(), // e.g. CJK characters
  title: z.string().nullable(),
  organisation: z.string(),
  emails: z.array(z.string().email()),
  phones: z.array(z.object({
    label: z.enum(["mobile", "office", "fax", "other"]),
    e164: z.string().regex(/^\+\d{6,15}$/),
  })),
  websiteUrl: z.string().url().nullable(),
  addressLines: z.array(z.string()),
  confidence: z.enum(["high", "medium", "low"]),
  uncertainFields: z.array(z.string()),
}).strict()
Notes: emails and phones are arrays because cards often list multiple. Phone numbers are normalised to E.164 (+CCNNN…); anything that can't be normalised should be omitted, not stored half-formatted. fullNameOriginalScript is the kanji / hangul / hanzi version, distinct from the romanised fullName.
Sketch 3 — Shipping label:
const ShippingLabel = z.object({
  carrier: z.enum(["dhl", "fedex", "ups", "usps", "yamato", "sagawa", "other"]),
  trackingNumber: z.string(),
  serviceLevel: z.string().nullable(),
  sender: z.object({
    name: z.string(),
    addressLines: z.array(z.string()),
    postcode: z.string(),
    countryCode: z.string().length(2),
  }).nullable(),
  recipient: z.object({
    name: z.string(),
    addressLines: z.array(z.string()),
    postcode: z.string(),
    countryCode: z.string().length(2),
  }),
  weightKg: z.number().positive().nullable(),
  confidence: z.enum(["high", "medium", "low"]),
}).strict()
Notes: sender is nullable because some labels print only recipient (drop-off boxes, returns). recipient is required — a label without one isn't a shipping label. carrier falls back to "other" rather than letting the model invent a name.
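None of the three sketches shows the rules about keeping the original string or attaching a meta block, so here is a small illustration in plain TypeScript types. `ReceiptTotal`, `tipAmount`, `withMeta`, and the sample values are hypothetical; in practice you would express the same shape in zod alongside the schemas above.

```typescript
interface ExtractionMeta {
  extractionTimestamp: string // ISO-8601 instant
  modelId: string
  promptVersion: string
  confidence: "high" | "medium" | "low"
  processingTimeMs: number
}

interface ReceiptTotal {
  amount: number
  currency: string
  originalAmountString: string // exactly as printed, e.g. "EUR 12,50"
  invoiceNumber: string | null // nullable: expected on a receipt, but not found
  tipAmount?: number // optional: does not apply to every receipt (hypothetical field)
}

// Attach provenance to any extracted record without touching its own fields.
function withMeta<T extends object>(data: T, meta: ExtractionMeta): T & { meta: ExtractionMeta } {
  return { ...data, meta }
}

const total: ReceiptTotal = {
  amount: 12.5,
  currency: "EUR",
  originalAmountString: "EUR 12,50",
  invoiceNumber: null,
}

const row = withMeta(total, {
  extractionTimestamp: "2026-01-15T09:30:00Z",
  modelId: "example-model", // placeholder, not a real model name
  promptVersion: "receipt-v3", // placeholder version label
  confidence: "high",
  processingTimeMs: 840,
})
```

Note the difference between `invoiceNumber: null` (looked for, not found) and the absent `tipAmount` (not applicable), which is the nullable versus optional distinction from the rules.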
The validation gate's failure path. Concretely:
import { z } from "zod"

// Result, success, failure, ExtractionError and callVisionModel are assumed
// to be defined elsewhere in the codebase.
async function extractWithRetry<T>(
  imageBase64: string,
  mimeType: string,
  schema: z.ZodType<T>,
  prompt: string,
  maxRetries = 2,
): Promise<Result<T, ExtractionError>> {
  let lastRaw: string | null = null
  let lastError: z.ZodError | null = null
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    // On retries, feed the previous response and its validation error back to the model.
    const fullPrompt = attempt === 0
      ? prompt
      : `${prompt}\n\n# Previous attempt failed validation\n` +
        `Your previous response was:\n\n${lastRaw}\n\n` +
        `It failed with this validation error:\n${lastError?.message}\n\n` +
        `Return a corrected JSON response. Do not apologise. JSON only.`
    const raw = await callVisionModel(fullPrompt, imageBase64, mimeType)
    lastRaw = raw
    let parsed: unknown
    try {
      parsed = JSON.parse(raw)
    } catch {
      // Not JSON at all: record it as a validation failure and retry.
      lastError = new z.ZodError([
        { code: "custom", path: [], message: "response was not valid JSON" },
      ])
      continue
    }
    // Escape hatch: only an explicit `skipped: true` counts, not the mere presence of the key.
    if (parsed && typeof parsed === "object" && (parsed as { skipped?: unknown }).skipped === true) {
      return failure({ kind: "skipped", reason: (parsed as { reason?: string }).reason })
    }
    const result = schema.safeParse(parsed)
    if (result.success) return success(result.data)
    lastError = result.error
  }
  return failure({
    kind: "validation_exhausted",
    attempts: maxRetries + 1,
    lastError: lastError?.message,
    lastRaw,
  })
}
Rate-limit retries are a separate concern from validation retries. Layer them: validation retries wrap the call; rate-limit retries with exponential backoff (1s, 2s, 4s, 8s, jittered, capped at 30s) wrap the underlying HTTP call. Rotate API keys at the bottom of the stack if you have them — a single key under rate limit shouldn't take down the pipeline.
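A sketch of the inner rate-limit layer under stated assumptions: `withBackoff`, `backoffDelayMs`, and the option names are hypothetical, the jitter strategy (keep half the delay, randomise the other half) is one choice among several, and key rotation is left out. The `sleep` and `random` parameters are injectable only so the behaviour can be tested without waiting.

```typescript
interface BackoffOptions {
  maxAttempts?: number
  baseMs?: number
  capMs?: number
  isRetryable?: (err: unknown) => boolean
  sleep?: (ms: number) => Promise<void>
  random?: () => number
}

// Delay before retry number `attempt` (0-based): 1s, 2s, 4s, 8s, ... capped,
// then jittered into the range [delay / 2, delay].
function backoffDelayMs(attempt: number, baseMs: number, capMs: number, random: () => number): number {
  const exponential = Math.min(capMs, baseMs * Math.pow(2, attempt))
  return exponential / 2 + random() * (exponential / 2)
}

// Wraps the underlying HTTP call only. Validation retries sit in a layer above this.
async function withBackoff<T>(call: () => Promise<T>, opts: BackoffOptions = {}): Promise<T> {
  const {
    maxAttempts = 5,
    baseMs = 1000,
    capMs = 30000,
    isRetryable = () => true,
    sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
    random = Math.random,
  } = opts
  for (let attempt = 0; ; attempt++) {
    try {
      return await call()
    } catch (err) {
      const lastAttempt = attempt === maxAttempts - 1
      // Give up on the final attempt or on errors that retrying cannot fix.
      if (lastAttempt || !isRetryable(err)) throw err
      await sleep(backoffDelayMs(attempt, baseMs, capMs, random))
    }
  }
}
```

This layer rethrows once it gives up, so the extraction layer above it is where the error would be caught and turned into a `Result.failure`. In practice `isRetryable` would check for your provider's rate-limit response (commonly HTTP 429) rather than retrying everything.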
Return Result, don't throw. Throwing makes the failure invisible to the type system; Result<T, E> makes every caller acknowledge the failure path.
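The code in this manual uses `Result`, `success`, `failure`, and `ExtractionError` without defining them. A minimal sketch of what they could look like follows; the exact field names are inferred from how the surrounding code uses them, and `describeOutcome` is a hypothetical caller added to show the type system forcing both branches to be handled.

```typescript
// A discriminated union: callers must check `ok` before touching value or error.
type Result<T, E> =
  | { ok: true; value: T }
  | { ok: false; error: E }

function success<T>(value: T): Result<T, never> {
  return { ok: true, value }
}

function failure<E>(error: E): Result<never, E> {
  return { ok: false, error }
}

// Error shapes inferred from the retry function's two failure paths.
type ExtractionError =
  | { kind: "skipped"; reason?: string }
  | { kind: "validation_exhausted"; attempts: number; lastError?: string; lastRaw: string | null }

// Hypothetical caller: every outcome maps to an explicit decision.
function describeOutcome(result: Result<{ fullName: string }, ExtractionError>): string {
  if (result.ok === true) {
    return `imported ${result.value.fullName}`
  }
  switch (result.error.kind) {
    case "skipped":
      return `rejected: ${result.error.reason ?? "no reason given"}`
    case "validation_exhausted":
      return `failed after ${result.error.attempts} attempts`
  }
}
```

Accessing `result.value` without first checking `result.ok` is a compile-time error here, which is the property a thrown exception cannot give you.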
A vision LLM call is the most expensive step in your pipeline. Tune it.
confidence === "low" or validation fails after retries, escalate to Pro. You pay Pro prices only on the hard cases.
uncertainFields, confidence) where it's machine-readable, not into prose.
Things you will see in the wild and should refuse to ship.
JSON.parse the response and hope." You will silently store whatever shape the model felt like returning that day, and you'll find out six months later when a downstream consumer crashes.
responseMimeType: "application/json" if your provider supports it.
description: string with no guidance and the model writes paragraphs. Constrain it: "≤ 80 characters, one phrase, no sentences."
confidence field with no anchor descriptions is the model's vibe. Either define what "low" means in concrete terms or remove the field.
Result.failure so callers see them in the type system and have to decide what to do.
tax_amount and taxAmount "to be safe".
End-to-end. You're building a feature where users photograph business cards at a conference and get them imported as CRM contacts.
Step 1 — Define the schema.
import { z } from "zod"
export const BusinessCardSchema = z.object({
  fullName: z.string().min(1),
  fullNameOriginalScript: z.string().nullable(),
  title: z.string().nullable(),
  organisation: z.string().min(1),
  emails: z.array(z.string().email()),
  phones: z.array(z.object({
    label: z.enum(["mobile", "office", "fax", "other"]),
    e164: z.string().regex(/^\+\d{6,15}$/),
  })),
  websiteUrl: z.string().url().nullable(),
  addressLines: z.array(z.string()),
  countryCode: z.string().length(2).nullable(),
  confidence: z.enum(["high", "medium", "low"]),
  uncertainFields: z.array(z.string()),
}).strict()
export type BusinessCard = z.infer<typeof BusinessCardSchema>
Step 2 — Write the prompt from the schema.
const prompt = `You are extracting contact details from an image of a
business card. Return JSON only — no prose, no markdown, no code fences.
# Context
This image was captured by a salesperson at a trade conference, often
in low-to-medium light, sometimes at an angle. Cards may be bilingual
(typically English + one of: Japanese, Korean, simplified Chinese,
traditional Chinese, Thai). Locale: ${locale}.
# Fields to extract
- fullName: Person's name in Latin script. If only original script is
printed, romanise using the language's standard system (Hepburn for
Japanese, Pinyin for Mandarin, Revised for Korean).
- fullNameOriginalScript: The name in its original script if the card
shows one, otherwise null.
- title: Job title, or null if not printed.
- organisation: Company / employer name.
- emails: Array of all email addresses on the card.
- phones: Array of phone numbers, each as { label, e164 }. Normalise to
E.164 using the country code printed on the card or implied by the
address. If you cannot determine the country code, omit the number.
Label must be one of: mobile, office, fax, other.
- websiteUrl: A full URL with protocol, or null.
- addressLines: The mailing address as an array of lines, in the order
printed. Empty array if no address.
- countryCode: ISO 3166-1 alpha-2 code for the country in the address,
or null.
# Confidence
Set confidence to "low" if any of these are true:
- The card is blurry, partially out of frame, or rotated >30°.
- The image is not a business card.
- More than one field required guessing rather than reading.
Set confidence to "high" only when every printed field was clearly
legible. Otherwise "medium".
List in uncertainFields the names of any specific fields you had to
guess about (e.g. ["title", "phones"]).
# Output example
{
"fullName": "Aiko Tanaka",
"fullNameOriginalScript": "田中 愛子",
"title": "Head of Partnerships",
"organisation": "Kawasaki Robotics",
"emails": ["a.tanaka@kawasaki-robotics.example"],
"phones": [
{ "label": "office", "e164": "+81312345678" },
{ "label": "mobile", "e164": "+819012345678" }
],
"websiteUrl": "https://kawasaki-robotics.example",
"addressLines": ["2-1-1 Shibaura", "Minato-ku, Tokyo 105-8001"],
"countryCode": "JP",
"confidence": "high",
"uncertainFields": []
}
# Failure mode
If the image is not a business card, or too degraded to read, return:
{ "skipped": true, "reason": "<one sentence>" }
Return JSON only.`
Step 3 — Call with validation gate and retry.
const result = await extractWithRetry(
  imageBase64,
  "image/jpeg",
  BusinessCardSchema,
  prompt,
  /* maxRetries */ 2,
)
if (!result.ok) {
  if (result.error.kind === "skipped") {
    return { status: "rejected", reason: result.error.reason }
  }
  return {
    status: "failed",
    diagnostics: result.error,
  }
}
const card = result.value
if (card.confidence === "low" || card.uncertainFields.length > 0) {
  await reviewQueue.enqueue({ card, image: imageBase64 })
  return { status: "needs_review", card }
}
return { status: "imported", card }
Step 4 — Watch what happens on a malformed response.
First call returns { "fullName": "Aiko Tanaka", "emails": "a.tanaka@…" } — emails is a string, not an array. safeParse fails with a clear error. Retry sends that error back: your previous response was: {...}. It failed: emails: Expected array, received string. Second call returns the corrected shape. Validation passes. Card is imported. The user never sees the round-trip.
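To get a compact one-line-per-issue error like the one quoted above, you may need to format the issues yourself: in zod 3, as far as I can tell, a `ZodError`'s `message` is the whole issue list serialised as JSON, which is noisier to feed back to the model. The sketch below assumes the zod 3 issue shape (a `path` of strings and numbers plus a `message`); `IssueLike` and `formatIssues` are hypothetical names, and the function takes a structural type so it runs without importing zod.

```typescript
// Structural subset of a zod issue: only the parts used here.
interface IssueLike {
  path: (string | number)[]
  message: string
}

// One line per issue, e.g. "emails: Expected array, received string".
function formatIssues(issues: readonly IssueLike[]): string {
  return issues
    .map((issue) => {
      // An empty path means the problem is with the response as a whole.
      const where = issue.path.length > 0 ? issue.path.join(".") : "(root)"
      return `${where}: ${issue.message}`
    })
    .join("\n")
}

const feedback = formatIssues([
  { path: ["emails"], message: "Expected array, received string" },
  { path: ["phones", 0, "e164"], message: "Invalid" },
])
```

In the retry loop you would pass `formatIssues(result.error.issues)` into the retry prompt in place of the raw error message.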
Step 5 — Watch what happens on a bad image. A blurry photo of someone's lunch instead of a business card: the model returns { "skipped": true, "reason": "Image does not appear to be a business card." }. Pipeline routes to the rejection path with a clean message for the user. No hallucinated fields stored.
Notice what made this work: schema first, prompt derived from schema, validation gate non-negotiable, retry with the actual validation error fed back to the model, skipped escape hatch for non-document inputs, confidence gating routing decisions, Result instead of throws.
responseSchema and responseMimeType: "application/json".
generateObject: schema-first extraction across providers; the cleanest cross-vendor API for this pattern.
safeParse, strict, coerce, and z.infer are the load-bearing primitives.
phones field should normalise to.