We Process 10,000 Scanned PDFs a Month. Here's What We Learned About OCR Accuracy.

We build PDF2TEXT (pdf2text.ai) — an AI document processing tool that extracts data from scanned PDFs, invoices, and bank statements. Built for accountants, mortgage brokers, and finance teams who deal with messy real-world documents.

Last year, our team at pdf2text.ai hit a milestone: 10,000 scanned documents processed in a single month. Bank statements, invoices, receipts, tax forms — the full spectrum of messy, real-world paper that somehow still dominates accounting in 2026.

Along the way, we broke our OCR pipeline about six times. Each failure taught us something about why extracting text from scanned documents is so much harder than it looks.

This post covers the specific problems we ran into and how we solved them. If you're building anything that touches document processing, or if you're an accountant tired of copy-pasting numbers from PDFs, some of this might save you a few headaches.

The 95% Problem

Most OCR engines advertise 95-99% character accuracy. Sounds great until you do the math.

A typical invoice has around 500 characters. At 95% accuracy, that's 25 wrong characters per document. If one of those wrong characters lands in an amount field — say, $1,285.00 becomes $1,285.80 — you've got a bookkeeping error that might not surface for weeks.
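To make that math concrete, here is a minimal Python sketch. The function names are made up for illustration, and the second function assumes every character fails independently, which real OCR errors do not (they cluster around smudges and faint print), so treat its result as a rough guide only.

```python
def expected_wrong_chars(num_chars: int, char_accuracy: float) -> float:
    """Expected number of misread characters in one document."""
    return num_chars * (1.0 - char_accuracy)


def prob_field_has_error(field_len: int, char_accuracy: float) -> float:
    """Chance that at least one character in a field is misread.

    Assumption: each character is misread independently of the others.
    """
    return 1.0 - char_accuracy ** field_len


# The 500-character invoice at 95% accuracy: about 25 wrong characters.
print(round(expected_wrong_chars(500, 0.95), 2))

# A 9-character amount such as "$1,285.00": roughly a 37% chance of an error.
print(round(prob_field_has_error(len("$1,285.00"), 0.95), 2))
```

Under that independence assumption, more than a third of amount fields of this length would contain at least one wrong character at 95% accuracy.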

We learned this the hard way. Early on, we were using a standard OCR pipeline and our users kept reporting small discrepancies. The numbers were almost right, which made them harder to catch than completely wrong values. An accountant can spot $0.00 where there should be a number. They're less likely to catch $1,285.80 vs $1,285.00 at 4pm on a Friday.

The fix wasn't just "get better OCR." It was adding a verification layer that cross-checks extracted values against each other. Does the line item total match the sum of individual amounts? Does the tax calculation make sense given the subtotal? These sanity checks catch errors that raw character accuracy metrics miss entirely.
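A minimal sketch of those two sanity checks in Python. This is an illustration, not the pdf2text.ai implementation: the function name, the one-cent tolerance, and the list of expected tax rates are all assumptions.

```python
from decimal import Decimal

# Hypothetical tax rates a deployment might expect; not taken from the article.
PLAUSIBLE_TAX_RATES = (Decimal("0.05"), Decimal("0.08"), Decimal("0.20"))


def cross_check(line_items, subtotal, tax, cent=Decimal("0.01")):
    """Return a list of inconsistencies found (empty if the values agree)."""
    problems = []

    # Check 1: do the individual amounts add up to the extracted subtotal?
    items_sum = sum(line_items, Decimal("0"))
    if abs(items_sum - subtotal) > cent:
        problems.append(f"line items sum to {items_sum}, subtotal reads {subtotal}")

    # Check 2: does the tax match any expected rate applied to the subtotal?
    if subtotal > 0:
        if not any(abs(subtotal * rate - tax) <= cent for rate in PLAUSIBLE_TAX_RATES):
            problems.append(f"tax {tax} matches no expected rate for subtotal {subtotal}")

    return problems


items = [Decimal("785.00"), Decimal("500.00")]

# Subtotal misread as 1,285.80 instead of 1,285.00: both checks fire.
print(cross_check(items, Decimal("1285.80"), Decimal("102.80")))

# Correctly extracted values: no problems reported.
print(cross_check(items, Decimal("1285.00"), Decimal("102.80")))
```

Using Decimal rather than float keeps the comparisons exact to the cent, which matters when the whole point is to notice an 80-cent discrepancy.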

Scanned PDFs Are Not Created Equal

Here's something that surprised us early on: the same document, scanned on two different machines, can produce wildly different OCR results.

A brand-new office scanner at 300 DPI? Clean text, sharp edges, near-perfect extraction. That same invoice photographed with a phone camera under fluorescent lighting? Good luck.

We started categorizing input quality into three tiers:

High quality — proper flatbed scans, 200+ DPI, minimal skew. OCR accuracy above 98% with standard engines. About 40% of what we see.

Medium quality — older scanners, slight rotation, some noise or compression artifacts. Accuracy drops to 90-95% without preprocessing. Roughly 45% of documents.

Low quality — phone photos, faxes (yes, faxes still exist in accounting), documents that have been printed, scanned, faxed, and scanned again. Below 85% accuracy without serious intervention. The remaining 15%.

That bottom 15% is where we spend most of our engineering time. Preprocessing steps like deskewing, denoising, and contrast normalization can pull a bad scan from below 85% accuracy to around 95%. But the specific preprocessing needed varies per document, so we built an adaptive pipeline that detects input quality and adjusts accordingly.
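The idea of an adaptive pipeline can be sketched as a function that maps measured scan properties to a tier and a list of preprocessing steps. Everything here is hypothetical except the 200 DPI cutoff from the tier list above: the ScanMetrics fields, the noise scale, and the other thresholds are assumptions chosen for illustration, and the steps are returned as names rather than applied to an image.

```python
from dataclasses import dataclass


@dataclass
class ScanMetrics:
    dpi: int
    skew_degrees: float
    noise_level: float  # hypothetical scale: 0.0 (clean) to 1.0 (very noisy)


def quality_tier(m: ScanMetrics) -> str:
    """Assign one of the three tiers. Thresholds other than 200 DPI are assumed."""
    if m.dpi >= 200 and abs(m.skew_degrees) < 1.0 and m.noise_level < 0.1:
        return "high"
    if m.dpi >= 150 and m.noise_level < 0.4:
        return "medium"
    return "low"


def preprocessing_steps(m: ScanMetrics) -> list:
    """Pick only the steps this particular scan appears to need."""
    steps = []
    if abs(m.skew_degrees) >= 1.0:
        steps.append("deskew")
    if m.noise_level >= 0.1:
        steps.append("denoise")
    if quality_tier(m) == "low":
        steps.append("contrast_normalization")
    return steps


flatbed = ScanMetrics(dpi=300, skew_degrees=0.2, noise_level=0.02)
phone_photo = ScanMetrics(dpi=96, skew_degrees=4.0, noise_level=0.6)

print(quality_tier(flatbed), preprocessing_steps(flatbed))
print(quality_tier(phone_photo), preprocessing_steps(phone_photo))
```

Choosing steps per document rather than running every filter on every page avoids degrading clean scans with processing they do not need.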

Table Detection Is Where Most Tools Fall Apart

Character recognition gets all the attention, but table detection is the real bottleneck for financial documents.

Think about a bank statement. It's basically a big table: date, description, amount, balance. Humans read it effortlessly because we understand the spatial layout. We know that the number in the right column on the same row as "Direct Deposit" is the deposit amount.

OCR engines see a flat stream of text. Without understanding the table structure, they might output "03/15 Direct Deposit 2,450.00 15,678.32" as a single line, and your parsing code has to figure out which number is the transaction and which is the running balance.
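Here is a sketch of the kind of parsing code that flat line forces you to write. The regular expression and the parse_row name are illustrative, and the pattern bakes in a fragile assumption: the last two numbers on the line are always the transaction amount followed by the running balance.

```python
import re
from decimal import Decimal

# Assumes the layout: date, free-text description, amount, balance.
ROW = re.compile(
    r"^(?P<date>\d{2}/\d{2})\s+"
    r"(?P<description>.+?)\s+"
    r"(?P<amount>-?[\d,]+\.\d{2})\s+"
    r"(?P<balance>-?[\d,]+\.\d{2})$"
)


def parse_row(line: str):
    """Split one flattened statement line into fields, or return None."""
    match = ROW.match(line.strip())
    if match is None:
        return None
    return {
        "date": match.group("date"),
        "description": match.group("description"),
        "amount": Decimal(match.group("amount").replace(",", "")),
        "balance": Decimal(match.group("balance").replace(",", "")),
    }


print(parse_row("03/15 Direct Deposit 2,450.00 15,678.32"))
```

This works for the example line, but a statement with separate debit and credit columns, or one that omits the balance on some rows, would be parsed wrongly or not at all. That is why column positions from the page layout matter more than the text alone.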

We tried three approaches before landing on one that worked:

Rule-based detection — Look for grid lines, consistent spacing, aligned columns. Works on clean, well-structured tables. Falls apart on statements where columns are separated by whitespace alone, which is most of them.

ML-based detection — Train a model to identify table regions in the document image. Better at handling varied layouts, but needs a lot of labeled training data. We spent two months labeling table boundaries on 5,000 documents before this approach started outperforming the rule-based system.

Hybrid approach — Use the ML model to find table regions and estimate column boundaries, then apply rule-based logic within each region to extract cell values. This is what we ship today in pdf2text.ai, and it handles about 92% of table layouts correctly on the first pass.

The remaining 8% usually involves nested tables, merged cells, or tables that span page breaks. We're still working on those.
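To give a flavor of the rule-based step for whitespace-separated columns, here is a simplified sketch. It works on already-aligned text rows rather than on page images, and the column_spans name and the min_gap parameter are assumptions for illustration. It treats any run of at least min_gap character positions that are blank in every row as a column boundary.

```python
def column_spans(rows, min_gap=2):
    """Return half-open (start, end) character spans for each detected column."""
    if not rows:
        return []
    width = max(len(r) for r in rows)
    padded = [r.ljust(width) for r in rows]
    # A position is blank only if it is a space in every row.
    blank = [all(r[i] == " " for r in padded) for i in range(width)]

    spans, start, gap = [], None, 0
    for i, is_blank in enumerate(blank):
        if not is_blank:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:  # gap is wide enough to end the column
                spans.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        spans.append((start, width - gap))
    return spans


def make_row(date, desc, amount, balance):
    # Build aligned sample rows: left-aligned text, right-aligned numbers.
    return f"{date:<7}{desc:<17}{amount:>8}  {balance:>9}"


rows = [
    make_row("03/15", "Direct Deposit", "2,450.00", "15,678.32"),
    make_row("03/16", "Coffee Shop", "-4.50", "15,673.82"),
]
spans = column_spans(rows)
cells = [[row[a:b].strip() for a, b in spans] for row in rows]
print(spans)
print(cells)
```

The min_gap of 2 keeps the single space inside "Direct Deposit" from being mistaken for a column break. It also shows why pure rules are brittle: one row with a long description that spills into the gap is enough to merge two columns.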

The AI Verification Layer

After extraction, we run every document through what we call the verification layer. It's the single biggest improvement we've made to accuracy.

The idea is simple: use a language model to check whether the extracted data makes sense in context. Not to do the OCR — the vision model handles that — but to catch errors the OCR missed.

For example, if we extract an invoice with a subtotal of $1,200, tax of $96, and a total of $1,396, the verification layer flags that $1,200 + $96 = $1,296, not $1,396. Either the subtotal, tax, or total was misread.
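The arithmetic behind that flag can be sketched in a few lines of Python. The verification layer described here uses a language model; this deterministic stand-in only shows the consistency check itself, and the reconcile name is made up for illustration. When the numbers disagree, it lists the value each field would need to take, with the other two held fixed, for the invoice to add up.

```python
from decimal import Decimal


def reconcile(subtotal: Decimal, tax: Decimal, total: Decimal) -> dict:
    """Return {} if subtotal + tax == total, else one candidate fix per field."""
    if subtotal + tax == total:
        return {}
    return {
        "subtotal": total - tax,    # value needed if the subtotal was misread
        "tax": total - subtotal,    # value needed if the tax was misread
        "total": subtotal + tax,    # value needed if the total was misread
    }


candidates = reconcile(Decimal("1200"), Decimal("96"), Decimal("1396"))
print(candidates)
```

For the example above the candidates are a subtotal of 1300, a tax of 196, or a total of 1296. The check alone cannot say which field is wrong, which is where extra context such as line items or OCR confidence would have to come in.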

This catches about 60% of the errors that slip past our OCR pipeline. The other 40% are cases where the error doesn't create an obvious inconsistency — like a vendor name misspelling, or a date that's off by one digit but still plausible.

We also use the verification layer for field classification. A number like "45,000" sitting in a document could be a dollar amount, a quantity, an account number, or a zip code. Context tells you which. The model looks at surrounding text, document type, and field position to make that call.

What Accountants Actually Need

We spent our first six months building features nobody asked for. Batch processing for 500 documents at once. Export to twelve different formats. A dashboard with charts.

Then we talked to actual accountants. What they wanted was much simpler:

They wanted to drag a PDF into something and get the numbers in a format they could paste into QuickBooks or Xero. That's it. No learning curve, no setup wizard, no 14-day trial with a credit card.

The documents they struggle with most aren't complex financial instruments. They're bank statements from small business clients who use regional banks with non-standard statement formats. Receipts from international vendors with mixed languages. Scanned tax forms where someone filled in the blanks by hand.

We rebuilt our entire flow around that use case. Upload a document, get structured data back in under 30 seconds, copy it to your accounting software. Everything else — batch processing, API access, custom templates — comes later, if the core experience works.

Numbers We Track

Since we started measuring properly last quarter, here's where we are:

Processing time averages 8-12 seconds per page, depending on complexity. Bank statements with dense tables take longer than single-page invoices.

Accuracy on clean scans (our "high quality" tier) sits at 99.1% character-level. Medium quality documents hit 96-97% after preprocessing. Low quality is the most variable — anywhere from 91% to 97% depending on how bad the source is.

Table extraction accuracy is 92% fully correct on first pass. "Fully correct" means every cell value landed in the right row and column. Partial extraction (most cells correct, one or two misaligned) bumps that to 97%.

The verification layer catches errors in about 12% of documents and correctly fixes them 83% of the time. It flags the other 17% for human review rather than silently inserting a wrong value.
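As a rough illustration of what those rates mean in volume, the sketch below applies them to a 10,000-document month. Combining the headline volume with these percentages is an assumption made for illustration, not a reported figure, and the function name is hypothetical.

```python
def verification_breakdown(documents: int, caught_pct: int = 12, fixed_pct: int = 83) -> dict:
    """Split a monthly document count using the rates quoted above."""
    caught = documents * caught_pct // 100
    auto_fixed = caught * fixed_pct // 100
    return {
        "caught": caught,
        "auto_fixed": auto_fixed,
        "human_review": caught - auto_fixed,
    }


print(verification_breakdown(10_000))
```

At that volume the rates work out to about 1,200 documents with a caught error, of which roughly 996 are fixed automatically and 204 go to a person.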

What We're Working On Next

Handwriting recognition is the big one. A surprising number of documents we see have handwritten notes, corrections, or filled-in fields. Our current pipeline ignores handwritten text entirely, which means we miss data that's sometimes the most important part of the document.

We're also investing in document type detection. Right now, users tell us what they're uploading — invoice, bank statement, receipt. We want to figure that out automatically and adjust our extraction pipeline accordingly. A bank statement needs different table detection logic than an invoice.

If you deal with scanned financial documents and want to try our approach, check out pdf2text.ai. We have a free tier that lets you process up to 20 pages per month — enough to see if it handles your specific document types before committing.


Built by the team at AIMAN SOFTWARE LTD. We make document processing tools for accountants and finance teams.