Multimodal structured extraction

A field manual for turning images into validated, typed objects with a vision LLM. Distilled from production work shipping multiple image-to-JSON pipelines across receipts, restaurant menus, and financial documents — generalised to the cross-cutting pattern.

The gap between "Gemini can read images" and "Gemini reliably extracts structured data" is roughly one engineering month of pain: silent field omissions, locale-shaped numbers parsed as garbage, snake_case vs camelCase drift, hallucinated fields when the model is uncertain, and pipelines that fall over the first time the API rate-limits you. This skill is the shape of the work you'd otherwise have to redo from scratch.

When to use this skill

You need to turn an image of a structured artifact (receipt, ID card, boarding pass, business card, lab result, form) into a typed object you can store, route, or reason about.
You're calling Gemini / GPT-4o / Claude with inlineData or image parts and want the output as JSON, not prose.
You're shipping the same extraction across multiple locales (currency formats, date formats, scripts) and the per-locale quirks are eating your week.
You need a confidence signal on every extraction so you can route low-confidence cases to human review instead of silently storing wrong data.
You're already in production with a working extractor that breaks on edge cases and you can't tell whether the model is the problem or your prompt is.

When NOT to use this skill

Objective parsing where a real parser exists. PDFs with a text layer → pdf-parse. Barcodes → a barcode library. MRZ on a passport → an MRZ library. Don't hand structured machine-readable inputs to an LLM and pay tokens for the privilege of getting them wrong sometimes.
One-off extractions. Five business cards a year? Read them yourself. The overhead of schema + validation + retry only pays back at volume.
Safety-critical or regulated extraction. Medication labels, identity verification, legal-document binding values. Use specialised classifiers, human review, or a regulated OCR vendor — and document the SLA.
Unstructured-to-unstructured. If the output is "a summary" or "a description", you don't need this skill. You need a chat prompt.
Latency-critical paths. A vision LLM round-trip is 2-10 seconds. If you need <500ms, train a small model on the LLM's outputs and serve that instead.

The four-part pattern

Every production extractor has these four parts. Skip any of them and it shows up in production as silent field drift, garbage on the tails, brittleness under rate limits, or users trusting wrong values.

1. Schema first

Define the zod schema before you write a single character of the prompt. The schema is the contract; the prompt is implementation that serves the contract. Inverting the order — writing a prompt and then "writing a schema to match" — is how you end up with fields the model returns sometimes and not others, with mixed snake_case and camelCase, and with amount arriving as both a number and a string depending on the receipt.

The schema gives you:

A single source of truth for what "extracted" means in your domain.
A validation gate (see §2) that turns "the model returned something" into "the model returned a typed object I can store".
Generated TypeScript types so every downstream consumer is type-safe.
A natural place to encode invariants (e.g. amount is non-negative, currency is a known ISO code, date is ISO-8601).

Write the schema before the prompt. Derive the prompt's "fields to extract" section from the schema, not the other way around. Regenerate the prompt when the schema changes.
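
One way to keep the prompt derived from the schema is to key the per-field hints by the schema's inferred type, so that a schema field without a prompt line fails to compile. A minimal sketch under stated assumptions: Receipt stands in for what z.infer<typeof ReceiptSchema> would give you, and FIELD_HINTS and renderFieldsSection are hypothetical names, not part of zod.

```typescript
// Stand-in for z.infer<typeof ReceiptSchema>. In a real project this type
// would come from the zod schema rather than being written by hand.
interface Receipt {
  merchantName: string;
  totalAmount: number;
  currency: string;
  purchaseDate: string;
  invoiceNumber: string | null;
}

// Record<keyof Receipt, string> requires exactly one hint per schema field:
// adding a field to Receipt without adding a hint here is a compile error.
const FIELD_HINTS: Record<keyof Receipt, string> = {
  merchantName: "string. Printed at the top of the receipt.",
  totalAmount: "number, non-negative. The final amount charged, tax included.",
  currency: "ISO 4217 code, e.g. EUR.",
  purchaseDate: "ISO-8601 date (YYYY-MM-DD).",
  invoiceNumber: "string, or null if none is printed.",
};

// Render the "# Fields to extract" block of the system prompt.
function renderFieldsSection(hints: Record<string, string>): string {
  const lines = Object.entries(hints).map(([name, hint]) => `- ${name}: ${hint}`);
  return ["# Fields to extract", "", ...lines].join("\n");
}
```

Calling renderFieldsSection(FIELD_HINTS) at build time means a schema change forces a prompt change, rather than relying on someone remembering to edit both.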

2. Validation gate

Every vision-LLM response goes through schema.safeParse before anything else touches it. Three outcomes:

Parse succeeds. Return the typed object. This is the only path that reaches your business logic.
Parse fails, retries available. Send the response and the zod error message back to the model with "your previous response failed validation, here's why, try again". Retry up to N times (2 retries, for 3 attempts in total, is a good default).
Parse fails, retries exhausted. Return a typed Result.failure with the original raw response attached for debugging. Do not throw. Throwing pushes the failure mode upward into code that won't handle it; returning a Result forces every caller to make a decision.

Two non-obvious rules:

Strict, not passthrough. Use .strict() so unknown fields fail validation instead of silently passing through. The model inventing subtotalBeforeTax when your schema only has subtotal is information you want, not noise to absorb.
Coerce, don't sanitise. Use z.coerce.number() for fields the model might return as a string ("12.50"), but don't bolt regex repair onto bad responses. If the model can't return the right shape after one retry with the error message, the prompt is wrong — fix the prompt, don't patch the output.

3. Locale handling

The single biggest source of silent extraction errors. The model sees "1.234,56" on a German receipt and the schema accepts a number, so the response is the number 1.234: the real amount, 1234.56, is now off by a factor of a thousand. Or it sees "03/06/2026" and decides that's March 6, when the receipt uses the day-first convention and the date is June 3, and you've just misfiled a transaction by three months.

Locale handling has four dimensions:

Currency formatting. Decimal separator (. vs ,), thousands separator, currency symbol position, presence of leading zeros. Always extract the raw amount in the receipt's currency and let your server do FX conversion against an authoritative source — never ask the model to convert.
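Date formats. YYYY-MM-DD is the only acceptable storage format. Tell the model to return ISO-8601 explicitly. Then validate against z.string().regex(/^\d{4}-\d{2}-\d{2}$/) and reject anything else. The regex checks shape only, so an impossible date such as 2026-13-45 still passes; add a calendar check if that matters to you.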
Language and script. A menu in Thai, a business card in Japanese, a form in Arabic. Decide explicitly: do you store the original-language text, a romanization, a translation, or all three? Each is a separate schema field. Don't conflate them.
Romanization rules. If you ask for romanized output, name the system (Hepburn for Japanese, Pinyin for Mandarin, McCune-Reischauer vs Revised Romanization for Korean) and lock it down. "Romanize this" without a system gets you whichever the model felt like that day.

Encode locale as a runtime parameter, not a global. The same extractor needs to handle a Singaporean receipt and a French one — branch the prompt's hints and example outputs on the locale you're targeting, not on what you guess from the image.
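
The two failure modes above can be cross-checked on the server. The following is a rough sketch, not a complete locale library: NUMBER_FORMATS, parseLocalAmount and isRealIsoDate are hypothetical names, the three locales are only examples, and spaces, apostrophes and currency symbols are stripped unconditionally as a simplifying assumption.

```typescript
// Hypothetical per-locale number conventions; extend as markets are added.
interface NumberFormat {
  decimal: "." | ",";
  group: "." | "," | ""; // "" = no separator left once whitespace is stripped
}

const NUMBER_FORMATS: Record<string, NumberFormat> = {
  "en-US": { decimal: ".", group: "," },
  "de-DE": { decimal: ",", group: "." },
  "fr-FR": { decimal: ",", group: "" },
};

// Parse a printed amount such as "EUR 1.234,56" under an explicit locale.
// Returns null instead of guessing when the locale or the shape is unknown.
function parseLocalAmount(raw: string, locale: string): number | null {
  const fmt = NUMBER_FORMATS[locale];
  if (fmt === undefined) return null;
  // Keep only digits, the two candidate separators, and a minus sign.
  let s = raw.replace(/[^\d.,-]/g, "");
  if (fmt.group !== "") s = s.split(fmt.group).join("");
  s = s.replace(fmt.decimal, ".");
  if (!/^-?\d+(\.\d+)?$/.test(s)) return null;
  return Number(s);
}

// Shape check plus calendar check: "2026-02-30" has the right shape
// but is not a real date, so it is rejected.
function isRealIsoDate(s: string): boolean {
  const m = /^(\d{4})-(\d{2})-(\d{2})$/.exec(s);
  if (m === null) return false;
  const y = Number(m[1]);
  const mo = Number(m[2]);
  const d = Number(m[3]);
  const dt = new Date(Date.UTC(y, mo - 1, d));
  return dt.getUTCFullYear() === y && dt.getUTCMonth() === mo - 1 && dt.getUTCDate() === d;
}
```

The same printed string "1.234" parses to 1234 under de-DE and to 1.234 under en-US, which is the reason the locale has to be passed in rather than inferred.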

4. Confidence signalling

Ask the model to flag fields it's uncertain about. Two patterns work:

Overall confidence. A single confidence: "high" | "medium" | "low" field on the whole extraction. Route low-confidence results to a human review queue. Cheap to implement, immediately useful.
Per-field uncertainty. An uncertainFields: string[] array listing field names the model wasn't sure about. More expensive (more tokens, more anchor work in the prompt) but lets you build a UI that highlights specific fields for the user to verify.

Whichever you pick, act on it. A confidence field you log and ignore is theatre. Either it gates routing (low → human queue) or it gates display (low → "needs review" badge) or it gates storage (low → quarantine table). If it does none of those, delete it.

Critically: tell the model what "low confidence" means. "Low = blurry, occluded, or the format doesn't match what I'd expect from this document type" works. "Low = you're not sure" doesn't — the model is never sure, and you'll either get every response flagged low or every response flagged high.
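
Acting on the signal can be as small as one routing function. A sketch under assumptions: routeByConfidence and the Route shape are hypothetical, and the choice to send an extraction to review when the model names a field that does not exist is a conservative default of this example, not a rule from the text above.

```typescript
type Confidence = "high" | "medium" | "low";

interface Extraction {
  confidence: Confidence;
  uncertainFields: string[];
}

type Route =
  | { action: "store" }
  | { action: "review"; highlight: string[] };

// Decide whether an extraction is stored directly or sent to human review.
// knownFields is the list of real schema field names.
function routeByConfidence(extraction: Extraction, knownFields: readonly string[]): Route {
  const known = new Set(knownFields);
  // The model may list names that are not schema fields; only real ones
  // can be highlighted in a review UI.
  const highlight = extraction.uncertainFields.filter((f) => known.has(f));
  const namedUnknownField = extraction.uncertainFields.length > highlight.length;
  if (extraction.confidence === "low" || highlight.length > 0 || namedUnknownField) {
    return { action: "review", highlight };
  }
  return { action: "store" };
}
```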

The system-prompt template

A generalised template. Replace {PLACEHOLDERS}. The structure is the contribution; the wording is yours to write per domain.

You are extracting structured data from an image of a {DOCUMENT_TYPE}.
Return JSON only — no prose, no markdown, no code fences.

# Context

This image was captured by a {USER_CONTEXT — e.g. "logistics ops user
photographing a shipping label on a warehouse floor"}. The image may be
{EXPECTED_QUALITY_RANGE — e.g. "well-lit and aligned, or skewed and
partially shadowed"}. Locale: {LOCALE_CODE}. Default currency:
{DEFAULT_CURRENCY}. Default date format on this document type:
{EXPECTED_DATE_FORMAT}.

# Fields to extract

{FOR EACH SCHEMA FIELD, ONE LINE:}
- {fieldName}: {type and constraint}. {What to look for, where on the
  document, what to do if absent}. {If enum:} Must be one of:
  {ENUM_VALUES_PIPE_SEPARATED}.

# Locale rules

- Amounts: extract in the currency printed on the document. Preserve
  exact decimals — {EXAMPLE_OF_LOCAL_NUMBER_FORMAT}. Do not convert,
  do not round.
- Dates: return ISO-8601 (YYYY-MM-DD). If the document uses
  {LOCAL_FORMAT}, convert. If the year is ambiguous (2-digit), assume
  the most recent year ≤ today.
- Text fields: return in {OUTPUT_LANGUAGE}. If the source is in a
  different language, translate. Do not mix languages in a single field.

# Confidence

Set confidence to "low" if any of these are true:
  - {DOMAIN_SPECIFIC_LOW_CONFIDENCE_SIGNAL_1}
  - {DOMAIN_SPECIFIC_LOW_CONFIDENCE_SIGNAL_2}
  - The image is blurry, occluded, or doesn't appear to be a
    {DOCUMENT_TYPE} at all.
Set confidence to "high" only when every field was directly legible
and unambiguous. Otherwise "medium".

# Output schema

{INLINE THE JSON SHAPE — not as a schema, as a literal example with
realistic values. Models follow examples more reliably than they follow
schema descriptions.}

# Failure mode

If the image is not a {DOCUMENT_TYPE}, or is too degraded to extract
the required fields, return:
  { "skipped": true, "reason": "<one sentence>" }
Do not invent values. Do not return partial extractions with placeholder
strings. Empty is better than wrong.

Return JSON only.

Annotations:

Inline an example, not a schema. Vision-capable models follow a JSON example more reliably than a JSON Schema description. Provide one example for each significant variant (e.g. one with tax: null, one with tax: [{...}]).
Locale rules as their own section. Don't bury "use ISO dates" in the field descriptions. Locale rules apply across all fields and deserve their own block the model can attend to.
A skipped escape hatch. Without it, the model will hallucinate fields rather than admit it can't see them. With it, your pipeline gets a clean signal to surface "we couldn't read this — try a better photo".
Return JSON only at the start and end. Repetition isn't redundancy; models tend to weight the start and end of an instruction block more heavily than the middle.
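
Filling the template by hand invites shipping a prompt with a stray {LOCALE_CODE} still in it. A small sketch of a filler that refuses to do that, assuming the placeholders have first been reduced to bare {UPPER_SNAKE_CASE} tokens (the descriptive ones such as {USER_CONTEXT ...} in the template above would need trimming); fillTemplate is a hypothetical helper.

```typescript
// Replace {UPPER_SNAKE_CASE} placeholders with values and fail loudly if
// any placeholder has no value. JSON braces such as { "skipped": true }
// do not match the pattern and are left untouched.
function fillTemplate(template: string, values: Record<string, string>): string {
  const missing: string[] = [];
  const filled = template.replace(/\{([A-Z][A-Z0-9_]*)\}/g, (match: string, key: string) => {
    const value = values[key];
    if (value === undefined) {
      missing.push(key);
      return match;
    }
    return value;
  });
  if (missing.length > 0) {
    throw new Error(`Unfilled placeholders: ${missing.join(", ")}`);
  }
  return filled;
}
```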

Schema design

How to design the zod schema for extraction. Five rules and three example sketches.

Rules:

nullable for "the field exists but the model couldn't find it"; optional for "the field doesn't apply to this document type". A receipt for cash has paymentMethod: "cash" not paymentMethod: undefined. A receipt without a visible invoice number has invoiceNumber: null, not absent.
Enums for anything closed-set. Currency codes, document types, payment methods, language codes. Strings drift; enums force the model to pick from your list. If the model can't pick, that's signal: fall back to an "other" sentinel rather than an arbitrary string.
Nest only what shares a lifecycle. A line item has { name, quantity, unitPrice } together — nest them. The document's taxBreakdown is an array of { rate, taxableAmount, taxAmount } records — nest them. Don't nest just because the prompt has section headers.
Store the original alongside the normalised. If you parse "EUR 12,50" into amount: 12.50, currency: "EUR", also store originalAmountString: "EUR 12,50". When something goes wrong six months later, you'll want the raw input.
Add a meta block. Always: extractionTimestamp, modelId, promptVersion, confidence, processingTimeMs. When extractions disagree across model versions you'll need to know which one produced which row.
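
The last two rules can be sketched as a stored-record shape. This is an illustration only: attachMeta, StoredExtraction and the run parameter are hypothetical, and the timestamps are passed in rather than read from the clock so the function stays deterministic.

```typescript
type Confidence = "high" | "medium" | "low";

interface ExtractionMeta {
  extractionTimestamp: string; // ISO-8601 instant
  modelId: string;
  promptVersion: string;
  confidence: Confidence;
  processingTimeMs: number;
}

interface StoredExtraction<T> {
  data: T;
  meta: ExtractionMeta;
}

// Wrap a validated extraction with the provenance needed to explain it later.
function attachMeta<T extends { confidence: Confidence }>(
  data: T,
  run: { modelId: string; promptVersion: string; startedAtMs: number; finishedAtMs: number },
): StoredExtraction<T> {
  return {
    data,
    meta: {
      extractionTimestamp: new Date(run.finishedAtMs).toISOString(),
      modelId: run.modelId,
      promptVersion: run.promptVersion,
      confidence: data.confidence,
      processingTimeMs: run.finishedAtMs - run.startedAtMs,
    },
  };
}
```

A receipt row would then carry both the normalised amount and originalAmountString inside data, with meta recording which model and prompt version produced them.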

Sketch 1 — Boarding pass:

const BoardingPass = z.object({
  passengerName: z.string(),
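  flightNumber: z.string().regex(/^[A-Z0-9]{2,3}\d{1,4}$/), // airline designators may contain a digit, e.g. U2
  from: z.string().regex(/^[A-Z]{3}$/), // IATA code: exactly three uppercase letters
  to: z.string().regex(/^[A-Z]{3}$/),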
  departureLocal: z.string().regex(/^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}$/),
  boardingLocal: z.string().regex(/^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}$/).nullable(),
  seat: z.string().nullable(),
  ticketClass: z.enum(["economy", "premium_economy", "business", "first"]),
  pnr: z.string().length(6),
  confidence: z.enum(["high", "medium", "low"]),
}).strict()

Notes: seat is nullable because it may not be assigned yet at print time. from / to are constrained to IATA — three uppercase letters — so the model can't return "Heathrow". pnr is exactly 6 characters; a 5-char or 7-char value is known to be wrong.

Sketch 2 — Business card:

const BusinessCard = z.object({
  fullName: z.string(),
  fullNameOriginalScript: z.string().nullable(), // e.g. CJK characters
  title: z.string().nullable(),
  organisation: z.string(),
  emails: z.array(z.string().email()),
  phones: z.array(z.object({
    label: z.enum(["mobile", "office", "fax", "other"]),
    e164: z.string().regex(/^\+\d{6,15}$/),
  })),
  websiteUrl: z.string().url().nullable(),
  addressLines: z.array(z.string()),
  confidence: z.enum(["high", "medium", "low"]),
  uncertainFields: z.array(z.string()),
}).strict()

Notes: emails and phones are arrays because cards often list multiple. Phone numbers are normalised to E.164 (+CCNNN…); anything that can't be normalised should be omitted, not stored half-formatted. fullNameOriginalScript is the kanji / hangul / hanzi version, distinct from the romanised fullName.

Sketch 3 — Shipping label:

const ShippingLabel = z.object({
  carrier: z.enum(["dhl", "fedex", "ups", "usps", "yamato", "sagawa", "other"]),
  trackingNumber: z.string(),
  serviceLevel: z.string().nullable(),
  sender: z.object({
    name: z.string(),
    addressLines: z.array(z.string()),
    postcode: z.string(),
    countryCode: z.string().length(2),
  }).nullable(),
  recipient: z.object({
    name: z.string(),
    addressLines: z.array(z.string()),
    postcode: z.string(),
    countryCode: z.string().length(2),
  }),
  weightKg: z.number().positive().nullable(),
  confidence: z.enum(["high", "medium", "low"]),
}).strict()

Notes: sender is nullable because some labels print only recipient (drop-off boxes, returns). recipient is required — a label without one isn't a shipping label. carrier falls back to "other" rather than letting the model invent a name.

Retry strategy

The validation gate's failure path. Concretely:

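import { z } from "zod"

// Result, ExtractionError, success, failure and callVisionModel are not
// shown in this function; they are assumed to be defined in your codebase.
async function extractWithRetry<T>(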
  imageBase64: string,
  mimeType: string,
  schema: z.ZodType<T>,
  prompt: string,
  maxRetries = 2,
): Promise<Result<T, ExtractionError>> {
  let lastRaw: string | null = null
  let lastError: z.ZodError | null = null

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const fullPrompt = attempt === 0
      ? prompt
      : `${prompt}\n\n# Previous attempt failed validation\n` +
        `Your previous response was:\n\n${lastRaw}\n\n` +
        `It failed with this validation error:\n${lastError?.message}\n\n` +
        `Return a corrected JSON response. Do not apologise. JSON only.`

    const raw = await callVisionModel(fullPrompt, imageBase64, mimeType)
    lastRaw = raw

    let parsed: unknown
    try {
      parsed = JSON.parse(raw)
    } catch {
      lastError = new z.ZodError([
        { code: "custom", path: [], message: "response was not valid JSON" },
      ])
      continue
    }

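    // Treat the response as skipped only when the flag is literally true.
    if (parsed && typeof parsed === "object" && (parsed as { skipped?: unknown }).skipped === true) {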
      return failure({ kind: "skipped", reason: (parsed as { reason?: string }).reason })
    }

    const result = schema.safeParse(parsed)
    if (result.success) return success(result.data)
    lastError = result.error
  }

  return failure({
    kind: "validation_exhausted",
    attempts: maxRetries + 1,
    lastError: lastError?.message,
    lastRaw,
  })
}

Rate-limit retries are a separate concern from validation retries. Layer them: validation retries wrap the call; rate-limit retries with exponential backoff (1s, 2s, 4s, 8s, jittered, capped at 30s) wrap the underlying HTTP call. Rotate API keys at the bottom of the stack if you have them — a single key under rate limit shouldn't take down the pipeline.
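
A sketch of the inner layer, under assumptions: withBackoff, backoffDelayMs and isRateLimited are hypothetical names, the 429 status check stands in for however your HTTP client reports rate limiting, and the jitter scheme (a delay between 50% and 100% of the base) is one common choice rather than the only one.

```typescript
// Base delays of 1s, 2s, 4s, 8s, ... capped at 30s, then jittered so the
// actual delay falls between 50% and 100% of the base.
function backoffDelayMs(attempt: number, random: () => number = Math.random): number {
  const base = Math.min(1000 * Math.pow(2, attempt), 30000);
  return base * (0.5 + 0.5 * random());
}

// Assumption: the client surfaces rate limiting as an error with status 429.
function isRateLimited(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { status?: number }).status === 429;
}

// Retry only on rate-limit errors; every other error propagates immediately.
async function withBackoff<T>(
  call: () => Promise<T>,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const lastAttempt = attempt === maxAttempts - 1;
      if (!isRateLimited(err) || lastAttempt) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

The validation loop then calls withBackoff(() => callVisionModel(...)) and never sees a rate-limit error unless all attempts are used up.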

Return Result, don't throw. Throwing makes the failure invisible to the type system; Result<T, E> makes every caller acknowledge the failure path.
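
The Result type itself is not defined in this document. A minimal sketch that is consistent with how the worked example reads it (result.ok, result.value, result.error.kind); the exact field names are an assumption, and any Result library with the same shape would do.

```typescript
// A discriminated union: callers must check `ok` before touching `value`.
type Result<T, E> =
  | { ok: true; value: T }
  | { ok: false; error: E };

function success<T>(value: T): Result<T, never> {
  return { ok: true, value };
}

function failure<E>(error: E): Result<never, E> {
  return { ok: false, error };
}

// The two failure kinds produced by the retry loop.
type ExtractionError =
  | { kind: "skipped"; reason?: string }
  | { kind: "validation_exhausted"; attempts: number; lastError?: string; lastRaw: string | null };

// The compiler checks that every failure kind is handled here.
function describeOutcome<T>(result: Result<T, ExtractionError>): string {
  if (result.ok) return "extracted";
  const err = result.error;
  switch (err.kind) {
    case "skipped":
      return `skipped: ${err.reason ?? "no reason given"}`;
    case "validation_exhausted":
      return `gave up after ${err.attempts} attempts`;
  }
}
```

Adding a third error kind later turns every switch that does not handle it into a compile error, which is the practical benefit over a thrown exception.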

Cost / latency trade-offs

A vision LLM call is the most expensive step in your pipeline. Tune it.

Model tier. Flash-class models (Gemini Flash, GPT-4o-mini, Claude Haiku) are 5-10x cheaper and 2-3x faster than Pro-class. For well-anchored extraction (clear document type, well-defined schema), Flash is usually enough. Pro earns its keep on noisy images and dense documents (a full-page menu, a multi-section statement).
Routing strategy. Run Flash first; if confidence === "low" or validation fails after retries, escalate to Pro. You pay Pro prices only on the hard cases.
Image preprocessing. Resize to the model's recommended input dimensions (for example, ≤1568px on the long edge for Claude, ≤2048px for GPT-4o; check your provider's current documentation). Larger images cost more tokens for little or no accuracy gain. Crop to the document if you can detect edges: it cuts background noise and reduces tokens.
Batch vs single. "One item at a time" wins when items are heterogeneous (different receipts) and you want clean error isolation — one bad image doesn't poison the batch. "Batch" wins when items are homogeneous (line items on a single receipt, transactions on a single statement) and the model benefits from seeing the surrounding context. Default to single; batch only when measurement shows it helps accuracy or cost.
Prompt caching. If your provider supports it (Anthropic, Gemini), put the system prompt + schema description in the cached prefix. Per-image content goes in the uncached suffix. This can cut the cost of the cached prefix by up to ~90%, depending on the provider's pricing, at the price of a small first-call penalty.
Don't ask for explanations. Every extra token of "here's why I extracted this" is a token you pay for and don't use. Get evidence into the schema (uncertainFields, confidence) where it's machine-readable, not into prose.
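
The routing strategy above can be sketched as a small wrapper. Assumptions: extractTiered is a hypothetical name, and each tier is passed in as a function that returns the validated object, or null when validation failed after retries.

```typescript
type Confidence = "high" | "medium" | "low";

interface TieredOutcome<T> {
  value: T | null; // null when the strong tier also failed validation
  tier: "cheap" | "strong"; // which tier produced the final answer
}

// Try the cheap tier first; escalate only when it fails validation or
// reports low confidence, so strong-tier prices apply only to hard cases.
async function extractTiered<T extends { confidence: Confidence }>(
  cheap: () => Promise<T | null>,
  strong: () => Promise<T | null>,
): Promise<TieredOutcome<T>> {
  const first = await cheap();
  if (first !== null && first.confidence !== "low") {
    return { value: first, tier: "cheap" };
  }
  const second = await strong();
  return { value: second, tier: "strong" };
}
```

Logging the tier alongside each stored row makes it possible to measure how often escalation happens and whether it pays for itself.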

Anti-patterns

Things you will see in the wild and should refuse to ship.

No schema validation. "We JSON.parse the response and hope." You will silently store whatever shape the model felt like returning that day, and you'll find out six months later when a downstream consumer crashes.
Asking for markdown when you want JSON. "Return a markdown table of fields." You will spend a week writing a markdown parser and it will still get the edge cases wrong. Ask for JSON. Set responseMimeType: "application/json" if your provider supports it.
Free-form prose fields. description: string with no guidance and the model writes paragraphs. Constrain it: "≤ 80 characters, one phrase, no sentences."
Letting the model do FX conversion. Asking the model to convert €12,50 to USD gets you a stale rate from training data. Extract in the source currency; convert with an authoritative source server-side.
Trusting confidence without grounding it. A confidence field with no anchor descriptions is the model's vibe. Either define what "low" means in concrete terms or remove the field.
No retry with feedback. First attempt fails validation → return error. You lose the easy 70% recovery rate you'd get by handing the error back to the model and asking it to fix the shape.
Throwing on failure. Errors are part of the contract. Return Result.failure so callers see them in the type system and have to decide what to do.
Mixing snake_case and camelCase in the same schema. Pick one. Normalise the model's output to your convention in a thin adapter layer — don't pollute the schema with both tax_amount and taxAmount "to be safe".
No locale parameter. Hardcoding "USD" or "MM/DD/YYYY" because that's what the first version handled. You will ship a second market and rewrite the whole extractor. Make locale a parameter from day one.
No raw response stored. Storing only the parsed object makes it impossible to debug "why did this field come out wrong" three months later. Store the raw model response alongside the parsed object until you trust the pipeline, then store a sample.
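
The "thin adapter layer" mentioned for key casing could look like the following. A sketch only: camelizeKeys and snakeToCamel are hypothetical helpers, applied to the parsed JSON before it reaches schema validation. If a response contains both tax_amount and taxAmount, the later key silently wins, so you may prefer to reject such responses instead.

```typescript
// Convert one snake_case key to camelCase: "tax_amount" -> "taxAmount".
function snakeToCamel(key: string): string {
  return key.replace(/_+([a-z0-9])/g, (_match: string, ch: string) => ch.toUpperCase());
}

// Recursively rename the keys of parsed JSON; values are left untouched.
function camelizeKeys(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => camelizeKeys(item));
  }
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[snakeToCamel(key)] = camelizeKeys(inner);
    }
    return out;
  }
  return value;
}
```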

Worked example: extracting structured data from a business card image

End-to-end. You're building a feature where users photograph business cards at a conference and get them imported as CRM contacts.

Step 1 — Define the schema.

import { z } from "zod"

export const BusinessCardSchema = z.object({
  fullName: z.string().min(1),
  fullNameOriginalScript: z.string().nullable(),
  title: z.string().nullable(),
  organisation: z.string().min(1),
  emails: z.array(z.string().email()),
  phones: z.array(z.object({
    label: z.enum(["mobile", "office", "fax", "other"]),
    e164: z.string().regex(/^\+\d{6,15}$/),
  })),
  websiteUrl: z.string().url().nullable(),
  addressLines: z.array(z.string()),
  countryCode: z.string().length(2).nullable(),
  confidence: z.enum(["high", "medium", "low"]),
  uncertainFields: z.array(z.string()),
}).strict()

export type BusinessCard = z.infer<typeof BusinessCardSchema>

Step 2 — Write the prompt from the schema.

const prompt = `You are extracting contact details from an image of a
business card. Return JSON only — no prose, no markdown, no code fences.

# Context

This image was captured by a salesperson at a trade conference, often
in low-to-medium light, sometimes at an angle. Cards may be bilingual
(typically English + one of: Japanese, Korean, simplified Chinese,
traditional Chinese, Thai). Locale: ${locale}.

# Fields to extract

- fullName: Person's name in Latin script. If only original script is
  printed, romanise using the language's standard system (Hepburn for
  Japanese, Pinyin for Mandarin, Revised Romanization for Korean).
- fullNameOriginalScript: The name in its original script if the card
  shows one, otherwise null.
- title: Job title, or null if not printed.
- organisation: Company / employer name.
- emails: Array of all email addresses on the card.
- phones: Array of phone numbers, each as { label, e164 }. Normalise to
  E.164 using the country code printed on the card or implied by the
  address. If you cannot determine the country code, omit the number.
  Label must be one of: mobile, office, fax, other.
- websiteUrl: A full URL with protocol, or null.
- addressLines: The mailing address as an array of lines, in the order
  printed. Empty array if no address.
- countryCode: ISO 3166-1 alpha-2 code for the country in the address,
  or null.

# Confidence

Set confidence to "low" if any of these are true:
  - The card is blurry, partially out of frame, or rotated >30°.
  - The image is not a business card.
  - More than one field required guessing rather than reading.
Set confidence to "high" only when every printed field was clearly
legible. Otherwise "medium".

List in uncertainFields the names of any specific fields you had to
guess about (e.g. ["title", "phones"]).

# Output example

{
  "fullName": "Aiko Tanaka",
  "fullNameOriginalScript": "田中 愛子",
  "title": "Head of Partnerships",
  "organisation": "Kawasaki Robotics",
  "emails": ["a.tanaka@kawasaki-robotics.example"],
  "phones": [
    { "label": "office", "e164": "+81312345678" },
    { "label": "mobile", "e164": "+819012345678" }
  ],
  "websiteUrl": "https://kawasaki-robotics.example",
  "addressLines": ["2-1-1 Shibaura", "Minato-ku, Tokyo 105-8001"],
  "countryCode": "JP",
  "confidence": "high",
  "uncertainFields": []
}

# Failure mode

If the image is not a business card, or too degraded to read, return:
  { "skipped": true, "reason": "<one sentence>" }

Return JSON only.`

Step 3 — Call with validation gate and retry.

const result = await extractWithRetry(
  imageBase64,
  "image/jpeg",
  BusinessCardSchema,
  prompt,
  /* maxRetries */ 2,
)

if (!result.ok) {
  if (result.error.kind === "skipped") {
    return { status: "rejected", reason: result.error.reason }
  }
  return {
    status: "failed",
    diagnostics: result.error,
  }
}

const card = result.value
if (card.confidence === "low" || card.uncertainFields.length > 0) {
  await reviewQueue.enqueue({ card, image: imageBase64 })
  return { status: "needs_review", card }
}
return { status: "imported", card }

Step 4 — Watch what happens on a malformed response.

First call returns { "fullName": "Aiko Tanaka", "emails": "a.tanaka@…" } — emails is a string, not an array. safeParse fails with a clear error. Retry sends that error back: your previous response was: {...}. It failed: emails: Expected array, received string. Second call returns the corrected shape. Validation passes. Card is imported. The user never sees the round-trip.

Step 5 — Watch what happens on a bad image. A blurry photo of someone's lunch instead of a business card: the model returns { "skipped": true, "reason": "Image does not appear to be a business card." }. Pipeline routes to the rejection path with a clean message for the user. No hallucinated fields stored.

Notice what made this work: schema first, prompt derived from schema, validation gate non-negotiable, retry with the actual validation error fed back to the model, skipped escape hatch for non-document inputs, confidence gating routing decisions, Result instead of throws.
