April 1, 2026 · 3 min read

Your scanned PDF is lying to you

Ctrl+F says the word isn't there. You can literally see it on the page. Here's what's going on, a five-second way to check, and how to fix it.

Open a PDF. Hit Ctrl+F. Search for a word you absolutely know is on page seven. Nothing.

If that's happened to you, the PDF isn't broken. It's scanned.

PDFs are a wrapper, not a format

The name sounds like it implies text. It doesn't. A PDF is a container. What's inside can be actual text (exported from Word, LaTeX, a browser, whatever), images (scans from a scanner or phone), or both in the same file.

When you scan a stack of paperwork with your phone, what you get is a PDF full of pictures of paper. Your eye sees words. The computer sees pixels.

How to tell in five seconds

Open the PDF in any reader. Try to click and drag to select a word. If you can, it's text-based: Ctrl+F works, conversions to Word or plain text have something to work with, and summarizers have something to read.

If your cursor won't select anything, or it selects the whole page as one solid block that you can't break up, you've got a scan.
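
If you'd rather check a whole folder than click through every file, the same test can be roughly scripted. This is a minimal sketch, not a definitive detector. It assumes the third-party pypdf package is installed, and the names classify_pages and classify_pdf plus the 20-character threshold are my own illustrative choices.

```python
from typing import Iterable


def classify_pages(page_texts: Iterable[str], min_chars: int = 20) -> str:
    """Label a document 'text', 'scan', 'mixed', or 'empty'.

    page_texts holds the text extracted from each page. min_chars is an
    arbitrary threshold (an assumption) so a stray page number alone does
    not count as a real text layer.
    """
    has_text = [len(text.strip()) >= min_chars for text in page_texts]
    if not has_text:
        return "empty"   # no pages at all
    if all(has_text):
        return "text"    # every page has selectable text
    if not any(has_text):
        return "scan"    # pictures of paper only
    return "mixed"       # some pages are text, some are images


def classify_pdf(path: str) -> str:
    """Classify a PDF file on disk using pypdf's text extraction."""
    # Imported here so classify_pages works even without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return classify_pages((page.extract_text() or "") for page in reader.pages)
```

Calling classify_pdf("invoice.pdf") on a phone scan should come back as "scan". One caveat: a scan that already went through bad OCR will still report "text", so treat the answer as a hint.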

The fix is called OCR

Optical Character Recognition looks at the pixels, figures out which shapes are letters, and writes a transparent text layer on top of the image. After OCR, the PDF looks identical but Ctrl+F works, you can copy text out of it, and other tools actually see content instead of pictures.
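
If you prefer the command line, one common open-source option is OCRmyPDF, which adds that hidden text layer to an existing PDF. The sketch below assumes the ocrmypdf executable is installed and on your PATH. The function names and file names are placeholders of mine, not part of any tool.

```python
import shutil
import subprocess


def build_ocr_command(src: str, dst: str) -> list[str]:
    """Build an OCRmyPDF command line for one input and one output file."""
    # --skip-text leaves pages that already have a text layer untouched,
    # so files that mix real text and scans are still processed.
    return ["ocrmypdf", "--skip-text", src, dst]


def add_text_layer(src: str, dst: str) -> bool:
    """Run OCR on src and write dst. Return True only if the tool succeeded."""
    if shutil.which("ocrmypdf") is None:
        return False  # OCRmyPDF is not installed, nothing was written
    result = subprocess.run(build_ocr_command(src, dst), check=False)
    return result.returncode == 0
```

Something like add_text_layer("scan.pdf", "searchable.pdf") writes a new file and leaves the original alone, which is handy if the OCR result turns out worse than expected.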

Three things affect how well OCR works:

Scan resolution. 300 DPI is the floor. Below that, recognition gets spotty on small print.
Skew. If pages went through the scanner crooked, error rates climb. Some OCR engines auto-straighten, some don't.
Language. English and other European languages are rock solid. Arabic, Chinese, and Japanese all work, but you need the right language model loaded.
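
Here is a rough sketch of how those three factors turn into numbers and settings. It assumes OCRmyPDF's command-line options (-l for Tesseract language codes such as eng, ara, chi_sim, or jpn, and --deskew for straightening). The helper names are hypothetical.

```python
def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Resolution of a scanned page: pixels across divided by inches across."""
    return pixel_width / page_width_inches


def ocr_options(languages: list[str], deskew: bool = True) -> list[str]:
    """Build OCRmyPDF options for the given Tesseract language codes."""
    # Several languages are joined with '+', for example 'eng+jpn'.
    options = ["-l", "+".join(languages)]
    if deskew:
        # Ask the tool to straighten crooked pages before recognition.
        options.append("--deskew")
    return options
```

A US Letter page is 8.5 inches wide, so a scan that is 2550 pixels across works out to effective_dpi(2550, 8.5) = 300.0, right at the floor. The same page at 1275 pixels is only 150 DPI and worth rescanning if the print is small.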

Why this matters more than it sounds

I have a folder of scanned invoices from a vendor that went out of business. No emails, no accounting system access, just a stack of PDFs on a drive. I spent an hour hunting for a specific line item because Ctrl+F was giving me nothing. Then I OCR'd the whole folder and found the line in ten seconds.

It felt like magic. It's just character recognition, but the difference between not-searchable and searchable when you need something specific is the difference between five minutes and fifty.

If you've got a pile of scans, run them through OCR once and thank yourself later. The OCR PDF tool on this site works in your browser and the output is a normal PDF that everything else on your computer can read.