RUMAZA Studio
AI for business

Document data extraction: from PDF to ERP without typing

Invoices, delivery notes, contracts, and forms in different formats. AI reads, structures, and validates — you only review the questionable parts.

The problem

Every day, documents arrive in different formats: native PDF, scans, mobile photos, emails with attachments. Someone opens each one, locates the supplier, date, amount, lines, tax ID, and types them into the ERP. A decimal error or a wrongly copied tax ID can cost hours of reconciliation.

Classic OCR returns plain text without structure. A human is still needed to interpret where each field is when the layout changes between suppliers or template versions.

Enterprise document capture solutions are often expensive, slow to configure, and rigid with new documents. Startups hire someone just to 'input invoices'.

Generative AI promises to read any document, but without validation, schemas, and confidence thresholds, you introduce garbage data into critical systems. Worse than the manual process: the error goes unnoticed until the accounting close.

The volume keeps growing: more suppliers, more email attachments, more traceability requirements. Scaling based on templates doesn't work when each document is slightly different.

In regulated sectors — healthcare, construction, food — the wrong document is not just an inconvenience: it’s a fine or loss of certification. Assisted extraction with validation reduces risk as well as time.

Teams that 'already have OCR' still have 3 FTEs reviewing output because the text is unstructured. The value leap is in the validated JSON schema, not in reading characters.

Organizational change matters: support, IT, and business must agree on what gets automated and what requires human judgment. Without that agreement, the project generates internal friction even if the technology works.

Credit notes and refunds break naive parsers. The system must understand document type and amount sign.

Detail lines with decimals, line discounts, and multiple VAT types require post-extraction mathematical validation.
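As an illustration of that post-extraction check, here is a minimal sketch. It assumes each extracted line arrives as a dict with the hypothetical keys quantity, unit_price, discount_pct, and vat_pct, and that VAT is computed per rate on the summed net amount. Some suppliers round per line instead, which is why the comparison uses a tolerance.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def line_net(quantity, unit_price, discount_pct):
    """Net amount of one line after its percentage discount, rounded to cents."""
    gross = Decimal(quantity) * Decimal(unit_price)
    net = gross * (Decimal("1") - Decimal(discount_pct) / Decimal("100"))
    return net.quantize(CENT, rounding=ROUND_HALF_UP)


def check_totals(lines, stated_total, tolerance=Decimal("0.02")):
    """Recompute net + VAT per rate and compare with the total printed on the document."""
    net_by_rate = {}
    for line in lines:
        rate = Decimal(line["vat_pct"])
        net = line_net(line["quantity"], line["unit_price"], line["discount_pct"])
        net_by_rate[rate] = net_by_rate.get(rate, Decimal("0")) + net
    vat = sum(
        ((net * rate / Decimal("100")).quantize(CENT, rounding=ROUND_HALF_UP)
         for rate, net in net_by_rate.items()),
        Decimal("0"),
    )
    computed = sum(net_by_rate.values(), Decimal("0")) + vat
    difference = abs(computed - Decimal(stated_total))
    return {"computed_total": computed, "ok": difference <= tolerance}


# Hypothetical extracted lines: amounts are kept as strings to avoid float rounding.
example_lines = [
    {"quantity": "2", "unit_price": "10.00", "discount_pct": "10", "vat_pct": "21"},
    {"quantity": "1", "unit_price": "5.00", "discount_pct": "0", "vat_pct": "10"},
]
result = check_totals(example_lines, "27.28")
```

A document whose recomputed total falls outside the tolerance would go to the review queue rather than to the ERP.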

Malicious attachments in invoice emails: the pipeline must scan and isolate before OCR. Security and extraction go hand in hand.

RUMAZA does not sell licenses: we build a system that you can measure, maintain, and expand. If the core of the problem is not automatable with available data, we tell you in the first meeting — saving months and budget.

Multi-page documents and annexes: a multi-page invoice may carry its terms on page 4; the pipeline must join context across pages without mixing totals.

Foreign currency and exchange rates: keep separate fields for the original amount and for the amount normalized to EUR, if applicable.

Comparing three quotes without a common specification is pointless: scope, integrations, and acceptance metrics must be identical to decide with criteria.

Electronic invoices in XML can be parsed without AI. A hybrid approach saves costs: rules for structured documents, a model for the rest.
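A minimal sketch of that routing decision, assuming the pipeline receives the raw attachment bytes. The function name choose_extractor and the labels "rules" and "model" are hypothetical. A real router would also check the root element or namespace of the XML before trusting it as an invoice.

```python
import xml.etree.ElementTree as ET


def choose_extractor(payload):
    """Return "rules" for well-formed XML and "model" for everything else."""
    # Drop a UTF-8 byte order mark and leading whitespace before sniffing.
    stripped = payload.lstrip(b"\xef\xbb\xbf \t\r\n")
    if not stripped.startswith(b"<"):
        return "model"  # PDFs, images, plain text
    try:
        # Note: for untrusted attachments a hardened parser is advisable;
        # the standard library parser is used here only to keep the sketch self-contained.
        ET.fromstring(stripped)
    except ET.ParseError:
        return "model"  # looks like markup but is not well-formed XML
    return "rules"
```

Only the documents that fall through to "model" incur OCR and model costs.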

Iteration with real data from the first fortnight in production: adjusting thresholds, prompts, and rules with client metrics, not lab assumptions.

Project success is defined in the kickoff meeting: base volume, current time per case, manual error rate, and hourly cost — with that we calculate ROI before writing a line of code.
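The arithmetic behind that estimate can be sketched as below. Volume, time per document, hourly cost, and error rate come from the kickoff figures. The cost per error, the automation rate, and the monthly running cost are assumptions added for illustration, and the model simply scales savings with the share of documents that no longer need typing.

```python
def monthly_roi(docs_per_month, minutes_per_doc, hourly_cost, error_rate,
                cost_per_error, automation_rate, monthly_run_cost):
    """Rough monthly estimate; every input is a figure agreed with the client."""
    manual_hours = docs_per_month * minutes_per_doc / 60
    labour_cost = manual_hours * hourly_cost
    error_cost = docs_per_month * error_rate * cost_per_error
    manual_cost = labour_cost + error_cost
    # Simplifying assumption: savings are proportional to the automated share.
    savings = manual_cost * automation_rate
    return {
        "manual_cost": manual_cost,
        "savings": savings,
        "net": savings - monthly_run_cost,
    }


# Hypothetical numbers, not a promise of results.
estimate = monthly_roi(
    docs_per_month=600, minutes_per_doc=5, hourly_cost=20,
    error_rate=0.02, cost_per_error=30,
    automation_rate=0.5, monthly_run_cost=300,
)
```

With these made-up inputs the manual process costs 1,360 per month and the net monthly benefit is 380. The point is the structure of the calculation, not the figures.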

Training at closure: we do not deliver software that only IT understands. The business user knows how to use the system, escalate, and report issues with screenshots and real examples from their day-to-day work.

Go-live checklist: permissions, backups, rollback, escalation contacts, and hypercare window agreed in writing — this way production starts without surprises over the weekend.

What is data extraction with AI (no fluff)

It is a pipeline that receives a document (PDF, image, email), extracts text with OCR if needed, uses vision and language models to identify relevant fields, and returns structured data (JSON) ready for validation and import.

It’s not just OCR. It’s layout understanding: knowing that 'Total' near the bottom right corner is the final amount, that detail lines go in a table, and that the sender's tax ID is not the recipient's.

The production flow includes: image preprocessing, extraction, normalization (dates, currencies, decimals), cross-validation (sum of lines = total), business rules (known supplier, duplicates), and a human review queue when confidence is low.
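The normalization step can be sketched as follows. This is a minimal example that assumes day-first dates and treats the last separator in an amount as the decimal mark. The function names and the list of accepted date formats are hypothetical.

```python
from datetime import date, datetime
from decimal import Decimal

# Assumed formats, day-first; extend per supplier sample.
DATE_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d", "%d.%m.%Y")


def parse_amount(text):
    """Turn '1.234,56 €' or '1,234.56' into a Decimal.

    Known limit: '1.234' with no decimals is ambiguous and is read as 1.234,
    so such values are better sent to review.
    """
    cleaned = "".join(ch for ch in text if ch in "0123456789,.-")
    last_comma, last_dot = cleaned.rfind(","), cleaned.rfind(".")
    if last_comma > last_dot:
        # European style: dot groups thousands, comma marks decimals.
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        # English style: comma groups thousands.
        cleaned = cleaned.replace(",", "")
    return Decimal(cleaned)


def parse_date(text):
    """Return a date for the first matching format, or None if nothing matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

Returning None instead of guessing lets the later validation step flag the field as missing.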

It works best when you define the output schema: which fields are mandatory, types, ranges, and what to do if one is missing. Unstructured free text is not suitable for integration.
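One way to express such a schema is sketched below. The field names, the FieldSpec class, and the validate function are hypothetical, and the rule that a total cannot be negative applies to this invoice schema only. Credit notes would need their own.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class FieldSpec:
    type_: type
    required: bool = True
    minimum: Optional[Decimal] = None


# Hypothetical schema for a supplier invoice.
INVOICE_SCHEMA = {
    "supplier_tax_id": FieldSpec(str),
    "invoice_date": FieldSpec(date),
    "total": FieldSpec(Decimal, minimum=Decimal("0")),
    "purchase_order": FieldSpec(str, required=False),
}


def validate(record, schema=INVOICE_SCHEMA):
    """Return a list of problems; an empty list means the record fits the schema."""
    problems = []
    for name, spec in schema.items():
        value = record.get(name)
        if value is None:
            if spec.required:
                problems.append(f"{name}: missing")
            continue
        if not isinstance(value, spec.type_):
            problems.append(f"{name}: expected {spec.type_.__name__}")
        elif spec.minimum is not None and value < spec.minimum:
            problems.append(f"{name}: below {spec.minimum}")
    return problems
```

A record with any problem is held back from import, and the problem list tells the reviewer which fields to look at.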

It integrates with incoming email, shared folders, SFTP, or manual upload. The destination can be ERP, staging spreadsheet, database, or third-party API.

Straight-through processing (STP) is the goal: documents that pass through without anyone touching them. You don’t start at 90%; you calibrate thresholds with real data over weeks.

Critical fields — tax ID, IBAN, total, date — require algorithmic validation in addition to the model. The Spanish tax ID carries a check character; use it.
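A minimal sketch of such a check for the Spanish NIF of individuals and the NIE, where the final letter is derived from the number modulo 23. Company tax IDs follow a different scheme that this sketch does not cover: it returns False for them, so they need a separate check. The function name is hypothetical.

```python
import re

NIF_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"
NIE_PREFIX = {"X": "0", "Y": "1", "Z": "2"}


def is_valid_nif(raw):
    """Check the control letter of a personal NIF (8 digits + letter) or an NIE."""
    # Remove spaces, dots, and hyphens that OCR or humans tend to add.
    value = re.sub(r"[\s.-]", "", raw).upper()
    if re.fullmatch(r"[XYZ]\d{7}[A-Z]", value):
        # NIE: the leading letter maps to a digit before the same calculation.
        value = NIE_PREFIX[value[0]] + value[1:]
    if not re.fullmatch(r"\d{8}[A-Z]", value):
        return False
    return NIF_LETTERS[int(value[:8]) % 23] == value[8]
```

A single misread digit changes the expected letter in most cases, so this catches many OCR slips before they reach the ERP.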

Versioning of models and prompts: when a large supplier's layout changes, you adjust without rewriting the entire pipeline.

Gradual deployment: pilot with one channel or one type of query, measurement for two weeks, expansion based on data — no big bang that overwhelms the team and the client.

Field confidence: not a global score. You can auto-approve if tax ID and total are high even if a line description is questionable.
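That rule can be sketched as follows, assuming the extractor returns a confidence between 0 and 1 for every field. The threshold values and field names are hypothetical and would be calibrated on production data.

```python
# Hypothetical per-field minimums for the fields that block auto-approval.
CRITICAL_THRESHOLDS = {"supplier_tax_id": 0.98, "total": 0.98, "invoice_date": 0.95}


def route(fields, thresholds=CRITICAL_THRESHOLDS):
    """Decide auto-approval from the critical fields only.

    `fields` maps a field name to {"value": ..., "confidence": float}.
    Non-critical fields never block approval in this sketch.
    """
    blocking = [
        name for name, minimum in thresholds.items()
        if name not in fields or fields[name]["confidence"] < minimum
    ]
    return {
        "decision": "review" if blocking else "auto_approve",
        "blocking_fields": blocking,
    }


extracted = {
    "supplier_tax_id": {"value": "12345678Z", "confidence": 0.99},
    "total": {"value": "27.28", "confidence": 0.99},
    "invoice_date": {"value": "2024-12-31", "confidence": 0.97},
    "line_1_description": {"value": "Srvice fee", "confidence": 0.40},
}
outcome = route(extracted)
```

Here the document is auto-approved despite the weak line description, because only the critical fields are compared against thresholds.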

Export to accounting formats (CSV, Sage API, Holded) with configurable account mapping by supplier or category.

History of corrections by supplier: if the same layout always fails, we add specific rules for it without retraining the entire model.

RUMAZA criteria: specific problem, accessible data, success metric, and closed scope. Without these four pillars there is no project, only an experiment that pays off for the consultant and not for the client.

Webhook upon extraction completion: ERP or n8n receives JSON and triggers the next step of the workflow without polling.

Audit: who approved, when, and what version of the model extracted each field — traceability for ISO or audits.

Evolutionary maintenance — new intents, suppliers, languages — is budgeted separately from the MVP to avoid surprises or zombie projects.

Review UI with keyboard shortcuts for operators processing dozens of documents per hour — productivity matters.

Post-launch support with a direct channel and agreed SLA: critical issues during business hours resolved on the same day — no eternal ticket.

We document assumptions, known limits, and expansion plans in the delivery — total transparency about what the system does today and what remains for a phase two if the numbers justify it.

Architecture ready for expansion: new channels, languages, or documents without starting from scratch — modular extension, not a fragile monolith.

Alignment with security and legal from the design: DPIA when applicable, record of processing activities, and clauses with cloud model subprocessors.

Retrospective meeting at 30 and 60 days: what worked, what to adjust, if phase two is advisable — decision based on data, not budget inertia.

We prioritize deliverables that the business notices in the first week: a resolved query, a processed document, or a useful draft. These early wins build confidence in the rest of the roadmap.

When it makes sense

Criteria
  • More than 50 documents/month with data to transcribe — with volume and data that justify it.
  • Multiple input formats that prevent fixed templates
  • Manual errors with accounting or legal costs
  • Purchasing or accounting cycle times that are too long
  • You need traceability: original document + extracted fields
  • You want to scale volume without doubling administrative staff

What can be built

01

Supplier invoice capture

Email → extraction → matching with order → draft in ERP. Alert if duplicate or amount out of range. Includes logs, confidence thresholds, and human review in the initial phase until metrics are calibrated in production.

02

Processing delivery notes and receipts

Compares extracted quantities with order; marks discrepancies before approving receipt.

03

Extraction of contracts and clauses

Identifies parties, dates, automatic renewal, penalties. Structured summary for legal.

04

Heterogeneous forms and records

Onboarding suppliers, internal requests, or work orders: variable fields to unified schema.

How RUMAZA would build it

01
Document sample
50–100 real anonymized documents to measure variability and define schema. Deliverable documented and reviewed with you before the next step.
02
Schema and validations
Fields, types, cross rules, and list of known suppliers.
03
OCR + AI pipeline
Preprocessing, extraction with vision/language model, post-processing, and confidence scoring.
04
Destination integration
ERP API, staging DB, or CSV export with idempotency.
05
Review UI
Side-by-side screen: document and editable fields. Learning from frequent corrections.
06
Metrics
Straight-through processing rate, post-import errors, time per document.
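The three metrics in the last step can be computed from a simple processing log. This is a sketch that assumes one record per document with the hypothetical keys human_touched, post_import_error, and seconds.

```python
def summarize(documents):
    """Compute STP rate, post-import error rate, and average time per document."""
    total = len(documents)
    if total == 0:
        return {"stp_rate": 0.0, "post_import_error_rate": 0.0, "avg_seconds": 0.0}
    untouched = sum(1 for d in documents if not d["human_touched"])
    errors = sum(1 for d in documents if d["post_import_error"])
    seconds = sum(d["seconds"] for d in documents)
    return {
        "stp_rate": untouched / total,
        "post_import_error_rate": errors / total,
        "avg_seconds": seconds / total,
    }


# Hypothetical log entries.
log = [
    {"human_touched": False, "post_import_error": False, "seconds": 10},
    {"human_touched": False, "post_import_error": False, "seconds": 20},
    {"human_touched": True, "post_import_error": False, "seconds": 90},
    {"human_touched": True, "post_import_error": True, "seconds": 120},
]
report = summarize(log)
```

Tracking the same three numbers week over week shows whether threshold changes actually move documents out of the review queue.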

Possible technologies

  • Python
  • OpenAI GPT-4V / Anthropic
  • Tesseract / Azure Document Intelligence
  • PyMuPDF / pdfplumber
  • Django / FastAPI
  • PostgreSQL
  • Celery
  • SAP / Holded / custom ERP APIs

Hypothetical application scenarios

Scenario 1

Supplier PDFs with different formats

Each supplier sends their invoice with a different design. Flexible extraction can normalize key fields before importing to accounting.

Scenario 2

Work orders or delivery notes on paper or photo

Handwritten or scanned information that someone transcribes into the system. OCR + structured validation reduces typing and errors.

Scenario 3

Contracts and records with repetitive fields

Clauses, dates, or identifying data need to be checked in many documents. It makes sense to extract them and compare against a checklist rather than reading each document one by one.

Common mistakes

Avoid
  • Importing to ERP without validating duplicates
  • Trusting 100% without initial review queue
  • Schema too ambitious in the first version
  • Ignoring scan quality and blurry photos
  • Not saving the source document linked to the record
  • Measuring only 'processed documents', not field accuracy
  • Not reviewing the project at 90 days with real metrics and adjusting or closing what does not contribute.
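The first mistake in the list, importing without validating duplicates, can be guarded against with an idempotency key. The sketch below assumes that supplier tax ID, invoice number, and total identify an invoice. The in-memory ImportLedger class is a hypothetical stand-in for what would normally be a unique constraint in the database.

```python
import hashlib
from decimal import Decimal


def idempotency_key(supplier_tax_id: str, invoice_number: str, total: Decimal) -> str:
    """Stable key for one invoice, insensitive to case and surrounding spaces."""
    parts = (
        supplier_tax_id.strip().upper(),
        invoice_number.strip().upper(),
        f"{total:.2f}",
    )
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


class ImportLedger:
    """In-memory stand-in for a table with a unique index on the key."""

    def __init__(self):
        self._seen = {}

    def import_once(self, key, record, send):
        """Call `send(record)` only the first time a key is seen.

        Returns (result, created): created is False for a repeated key.
        """
        if key in self._seen:
            return self._seen[key], False
        result = send(record)
        self._seen[key] = result
        return result, True
```

Retrying a failed batch then becomes safe: documents that already reached the ERP return their earlier result instead of creating a second entry.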

Frequently asked questions

Does it work with invoices in other languages?

Yes. Multilingual models handle ES, EN, FR, DE. We validate with a real sample from your suppliers. We define this in scope according to your systems, volume, and legal restrictions — without promising generic figures.

What accuracy is realistic?

For standard invoices, 85–95% per field with good scanning. Highly variable documents require more human review at the start.

Does it replace my accounting software?

No. It feeds your ERP or accounting with structured data. The tax logic remains in your system.

Does it comply with electronic invoice requirements?

Extraction complements Facturae, PDF, and email. For structured XML, parsing is sometimes enough; AI comes into play for unstructured documents.

Where are the documents stored?

In your infrastructure or bucket with encryption. Retention according to your policy and GDPR.

How long does a pilot take?

3–4 weeks with one document type (e.g., supplier invoices) and one output integration.

Updated: 2026-06-29 · Author: Rubén Maestre

Are you still typing data from PDFs?

Send me anonymized examples and I’ll tell you the expected automation rate and architecture.