Every week, the accountant at Vivaio Ventures would open a PDF, squint at a foreign invoice, and manually type the same eight fields into a spreadsheet. Invoice number, date, amount, currency, vendor name. Then repeat. For every non-Italian provider they worked with.

Since November 2025, I’ve been embedded with Vivaio Ventures to help them build the internal tooling they need for their operations. The business is inherently international, which means invoices from vendors across Europe and the US, each formatted differently. In Italy, domestic invoices flow through the Sistema di Interscambio, the national electronic invoicing platform, as structured XML files. But invoices from foreign providers are just PDFs, with no standard, no schema, and no consistency.

When they came to me with this problem, I thought: invoice extraction using an LLM can definitely work. I was right, but the path from that instinct to a production-grade invoice extractor had more twists than I expected.

The First Question: What Do You Feed the Model?

Before writing a single line of production code, I set up a standalone prototyping repo. This is a rule I follow religiously: always prototype in isolation. You iterate faster, you don’t need to integrate anything to test it, and you don’t risk breaking existing systems.

I picked 10 invoices from different providers, deliberately choosing ones with very different layouts, and started experimenting with what to actually send to the LLM.

I tested two approaches. Text-only means extracting the text layer from the PDF and passing it as a string. Fast, cheap, and surprisingly decent. Multimodal means converting every PDF page to a PNG image, extracting the text layer, and sending both to Gemini, pixels and text together.

After testing both on the same invoices, multimodal won by a significant margin. The reason came down to spatial context. A PDF invoice might contain three different dollar amounts scattered across the page: a subtotal, a tax amount, and a grand total. The text layer gives you all three numbers, but loses the layout entirely. Without seeing where those numbers appear relative to labels like “Net Amount” or “Total Due,” the model frequently guessed wrong. Images give it the map, the text layer gives it the precision, and together they’re far more reliable than either alone.
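The multimodal request itself is straightforward to assemble. Here's a minimal sketch of combining page images and the text layer into the array-of-parts shape the Gemini API expects; `buildMultimodalParts` and the base64-encoded PNGs are my own illustrative names, not the post's actual code:

```typescript
// A request "part" is either plain text or inline base64 image data.
type Part =
  | { text: string }
  | { inlineData: { mimeType: string; data: string } };

// Combine page images (layout) with the extracted text layer (precision).
function buildMultimodalParts(
  pagePngsBase64: string[],
  pdfText: string,
  prompt: string
): Part[] {
  const imageParts: Part[] = pagePngsBase64.map((data) => ({
    inlineData: { mimeType: "image/png", data },
  }));
  const textPart: Part = {
    text: `${prompt}\n\nExtracted PDF text layer:\n${pdfText}`,
  };
  return [...imageParts, textPart];
}
```

The images carry the spatial layout; the appended text layer carries the exact characters.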

The Prompt Is the Product

Once the input strategy was settled, the real work began: building the extraction prompt.

I defined a strict output schema using Zod:

const InvoiceSchema = z.object({
  invoice_number: z.string(),
  invoice_date: z.string().regex(/^\d{4}-\d{2}-\d{2}$/),
  invoice_currency: z.string(),
  vendor_name: z.string(),
  country: z.string(),
  amount_net_VAT: z.number(),
  VAT_amount: z.number(),
  amount_with_VAT: z.number(),
});

Eight fields. Sounds simple. But each one revealed its own edge cases once I started running real invoices through it.

Character ambiguity was the first surprise. In many invoice fonts, the digit 0 and the capital letter O are indistinguishable. Same with I, l, and 1. Invoice codes like INV-01O1 were being mangled. I added explicit instructions telling the model to use context to disambiguate: amounts and codes typically use numerals, company names use letters.
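In the real system that rule lives in the prompt, but it can be illustrated as a hypothetical post-processing sketch: within runs of digits, map letter lookalikes back to the numerals they resemble, while leaving ordinary words alone:

```typescript
// Hypothetical sketch of the disambiguation rule the prompt expresses
// in words: inside digit-dominated runs, O becomes 0 and I/l become 1.
function normalizeDigitRuns(code: string): string {
  // Only touch runs of two or more digit-or-lookalike characters,
  // so isolated letters in names are left untouched.
  return code.replace(/[0-9OIl]{2,}/g, (run) =>
    run.replace(/O/g, "0").replace(/[Il]/g, "1")
  );
}
```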

Vendor aliases were another problem. AgileBits Inc. operates commercially as 1Password, and an invoice might display both names. I added a rule to always prioritize the legal entity name, including the company type suffix (Inc., Ltd., GmbH, S.r.l.).
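The rule is easy to express in code. This is an illustrative helper (the suffix list and function name are my assumptions): given the candidate names found on an invoice, prefer the one carrying a legal company-type suffix:

```typescript
// Known legal entity suffixes; illustrative, not exhaustive.
const LEGAL_SUFFIXES = ["Inc.", "Ltd.", "GmbH", "S.r.l.", "LLC", "B.V."];

// Prefer the candidate ending in a legal suffix; fall back to the first.
function pickVendorName(candidates: string[]): string {
  const legal = candidates.find((name) =>
    LEGAL_SUFFIXES.some((suffix) => name.endsWith(suffix))
  );
  return legal ?? candidates[0];
}
```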

Date formats nearly broke me. The date 03/04/2024 means March 4th in the US and April 3rd in most of Europe. I solved this with a spatial proximity rule: the model reads country indicators near the vendor name to determine which convention applies. I also had to explicitly instruct it to ignore dates under “Transaction” headings, since those are bank processing dates, not invoice dates.
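The convention rule reduces to this: the same slash-separated string parses month-first for US vendors and day-first for most European ones. A minimal sketch, with an illustrative country set:

```typescript
// Countries that read 03/04/2024 as month/day/year. Illustrative set.
const MONTH_FIRST = new Set(["US"]);

// Normalize a slash-separated date to YYYY-MM-DD given the vendor country.
function parseInvoiceDate(raw: string, country: string): string {
  const [a, b, year] = raw.split("/");
  const [month, day] = MONTH_FIRST.has(country) ? [a, b] : [b, a];
  return `${year}-${month.padStart(2, "0")}-${day.padStart(2, "0")}`;
}
```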

The most important mechanism I built was a self-correction loop. When Zod validation fails, I don’t just retry with the same prompt. I pass the actual validation errors back to the model in the next request:

if (previousError) {
  promptText +=
    `\nThe previous response failed validation with these errors:\n` +
    `${JSON.stringify(previousError)}\n` +
    `Please ensure all fields match the exact types specified.`;
}

Up to five attempts, each one informed by what went wrong before. Cheap insurance, and it handles the long tail of edge cases that would otherwise require manual intervention.

The Invoice Extractor, Under the Hood

Here’s the core extraction flow in pseudocode:

Step 1: Prepare inputs

images = convertPDFPagesToPNG(pdfPath)   // multimodal: visual context
pdfText = extractTextLayer(pdfPath)       // precision for numbers and codes

lastError = null
totalTokens = { prompt: 0, candidates: 0, total: 0, calls: 0 }

Step 2: Attempt extraction with self-correction loop

for attempt in 1..maxRetries:

  prompt = buildPrompt(pdfText, previousError=lastError)
    // If lastError exists, prompt includes the Zod validation errors
    // from the previous attempt, asking the model to fix them

  response = callLLM(images + prompt)
  totalTokens += response.tokenUsage   // accumulate across all attempts

  // Step 3: Parse and validate
  try:
    json = extractJSONFromResponse(response.text)
    // handles both raw JSON and ```json code blocks

    validated = ZodSchema.parse(json)
    // strict type checking: dates must be YYYY-MM-DD,
    // amounts must be numbers, all fields required

    return { data: validated, tokenUsage: totalTokens }

  catch ZodError:
    lastError = error.details   // feed errors back into next attempt
    continue

  catch JSONParseError:
    lastError = [{ message: "Invalid JSON: " + error.message }]
    continue

throw MaxRetriesExceeded(lastError)
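The `extractJSONFromResponse` step is worth making concrete, since models sometimes wrap their output in a fenced code block. A minimal TypeScript sketch of one way to handle both cases (the regex is built by string concatenation only to avoid nesting backtick fences inside this post):

```typescript
// Accept either raw JSON or JSON wrapped in a ```json fenced block.
function extractJSONFromResponse(text: string): unknown {
  const fence = "`".repeat(3); // literal triple backtick
  const re = new RegExp(fence + "(?:json)?\\s*([\\s\\S]*?)" + fence);
  const match = text.match(re);
  const payload = match ? match[1] : text;
  // JSON.parse throws on invalid input, which feeds the retry loop.
  return JSON.parse(payload.trim());
}
```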

A couple of implementation details worth calling out. Token usage is accumulated across all retry attempts for a single invoice, not just the successful one. This gives you accurate cost tracking per invoice, including the overhead of self-correction. And each retry is truly informed: the model sees specifically which fields failed validation and why, rather than just being told to try again.

You Can’t Improve What You Don’t Measure

This is where most LLM blog posts stop. “I built a thing, it works, here’s the code.” I wanted actual confidence numbers before putting this near production.

So I built an eval framework: 19 invoices spanning four vendor categories (US SaaS providers and EU vendors among them) and three difficulty levels (easy, medium, hard). Each invoice is manually labelled with ground truth for all eight fields.

The key decision was to run every evaluation 10 times, not once. LLMs are non-deterministic, and a single measurement tells you almost nothing. Running the same set of invoices through the invoice extractor repeatedly is the only way to know whether good results are real or just luck. You’re not looking for a perfect score on one run, but for consistent scores across all of them.

How the Eval Works, Under the Hood

The framework loops through every invoice in the dataset, calls the invoice extractor, and checks each extracted field against the manually labelled ground truth. It knows that dates need to match exactly, that amounts are close enough if they’re within a cent, and that vendor names should be compared without worrying about case or trailing punctuation.
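Those per-field comparison rules can be sketched directly. The helper names below are illustrative, but the tolerances match what's described above:

```typescript
// Amounts match if they are within a cent of each other.
function amountsMatch(expected: number, actual: number): boolean {
  return Math.abs(expected - actual) <= 0.01;
}

// Vendor names match ignoring case and trailing punctuation.
function vendorsMatch(expected: string, actual: string): boolean {
  const clean = (s: string) => s.trim().replace(/[.,]+$/, "").toLowerCase();
  return clean(expected) === clean(actual);
}

// Dates must match exactly (both already normalized to YYYY-MM-DD).
function datesMatch(expected: string, actual: string): boolean {
  return expected === actual;
}
```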

Beyond accuracy, it also tracks how much each extraction actually costs — in tokens and in API calls. If the self-correction loop had to retry three times before getting a valid result, those extra calls show up in the bill. There’s no hiding failure overhead.

Run it 10 times, and you get a clear picture: which invoices are consistently easy, which ones are consistently tricky, which fields the model tends to get wrong, and whether the system behaves the same way every time or surprises you. That last part is what matters most for production.
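Aggregating across runs is simple once each run records per-field pass/fail results. A sketch of the kind of roll-up that surfaces consistently weak fields (the shapes here are my assumptions, not the framework's actual types):

```typescript
// One run's outcome: field name -> did it match ground truth?
type RunResult = Record<string, boolean>;

// Success rate per field across all runs, to spot the weak spots.
function fieldSuccessRates(runs: RunResult[]): Record<string, number> {
  const rates: Record<string, number> = {};
  for (const field of Object.keys(runs[0] ?? {})) {
    const passed = runs.filter((run) => run[field]).length;
    rates[field] = passed / runs.length;
  }
  return rates;
}
```

A field sitting at 1.0 across 10 runs is trustworthy; one hovering at 0.6 is where prompt work should go next.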

The Results

Running 19 invoices × 10 times:

  • 100% success rate — every invoice processed without errors
  • ~3,077 tokens per invoice on average — which is practically free
  • Consistent behaviour across all runs, with no surprising drops between one run and the next

Boring Problems, Real Impact

The accountant at Vivaio Ventures no longer opens PDFs and types numbers into spreadsheets.

LLMs earn their keep on boring, repetitive tasks where reliability matters more than creativity. Invoice extraction isn’t glamorous, but it’s exactly the kind of problem where this technology makes a measurable difference in someone’s workday, not by doing something impossible, but by reliably doing something tedious.

The lasting asset from this project isn’t the LLM call or even the prompt. It’s the eval framework. Gemini Flash works well today, but models change, pricing shifts, and better options appear. When that happens, I swap the model, run the eval, compare the numbers, and decide. The model is a pluggable component, and the eval is the source of truth.

If you’re building LLM features for production, start with the eval. Not the prompt, not the model selection, not the architecture. Build the thing that lets you measure whether your system actually works, then build the system.


Hey! I’m Peppe, and I’m a freelance Product Engineer.

Are you shipping fast but unsure of your impact? My Product Experiments as a Service helps you validate ideas, learn quickly, and build what truly matters, before full development.

Ready to transform your workflow and get real answers?

📅 Book a call: https://cal.com/giuseppe-silletti/virtual-coffee-peppe
✉️ Email: himself@peppesilletti.io