PDFs Are Awful

The PDF format was designed to look identical on any screen or printer. It was technology-agnostic, a universal container for the printed page. Visual presentation was all that mattered: as long as a file rendered correctly, its internal representation was beside the point.

As a result, the inside of your average PDF is chaos. It is just a bunch of items (words, characters, shapes, images) and their coordinates, with little or no regard for the relationships among them. What reads as one cohesive line of text could be three groups of words that happen to sit next to each other at the same y-value. You have probably experienced this if you have ever tried to select text in a PDF and found yourself highlighting disconnected words on four different lines.

The primary challenge of getting your data out of a PDF is imposing some order on this code soup. A lot of hard-working folks have developed methods and tools to accomplish this over the years, finding ingenious ways to infer meaning from the relative positions of all the elements of a document. And AI can be very helpful here. It doesn't take a particularly advanced LLM to interpret a fairly thorny PDF.
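
To make "inferring meaning from relative positions" concrete, here is a minimal sketch of one classic heuristic: treat words with nearly the same y-value as one line, then read each line left to right. The `Word` type, the `group_into_lines` helper, the tolerance value, and the sample data are all made up for illustration; this is not Petey's code or any particular parser's algorithm.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # horizontal position of the word
    y: float  # vertical position of the word


def group_into_lines(words: list[Word], y_tolerance: float = 2.0) -> list[str]:
    """Cluster words whose y-values are close together, then read left to right."""
    lines: list[list[Word]] = []
    for word in sorted(words, key=lambda w: w.y):
        # Join the current line if this word sits at roughly the same height
        # as the line's first word; otherwise start a new line.
        if lines and abs(word.y - lines[-1][0].y) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines]


# Words stored out of reading order, as they might be inside a PDF.
soup = [
    Word("Due:", 300.0, 100.4),
    Word("Invoice", 72.0, 100.0),
    Word("Total", 72.0, 130.0),
    Word("#1042", 120.0, 99.8),
    Word("$58.00", 300.0, 130.2),
    Word("March", 340.0, 100.1),
]

for line in group_into_lines(soup):
    print(line)
```

A fixed tolerance like this falls apart quickly on multi-column layouts, rotated text, and tables, which is part of why real documents need smarter tools.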

But models need infrastructure, and not everyone has time to teach themselves how to be an amateur ML engineer. That's why I made Petey.

Petey Is Better

Petey helps you leverage AI to get better results out of tried-and-true tools for PDF extraction, with as little or as much tinkering as you see fit.

The Schema is the key to getting good results. It's a structured definition of the data you want to extract: field names, data types, and descriptions that tell Petey exactly what to look for and how to interpret it. You can build one from scratch, let Petey infer one from a sample PDF, or import one you've already made.
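
As a rough picture of what "field names, data types, and descriptions" can look like, here is a sketch of a schema for a simple invoice, plus a check that an extracted record matches it. The `Field` class, the field names, and `check_record` are hypothetical and only illustrate the idea; Petey's actual schema format may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str         # key that will appear in the extracted record
    dtype: type       # expected Python type of the value
    description: str  # hint describing what to look for and how to read it


# Hypothetical schema for a simple invoice; not Petey's actual format.
INVOICE_SCHEMA = [
    Field("invoice_number", str, "Identifier printed near the top, e.g. 'INV-1042'"),
    Field("total_due", float, "Final amount owed, in dollars, without the currency symbol"),
    Field("due_date", str, "Payment due date in YYYY-MM-DD format"),
]


def check_record(record: dict, schema: list[Field]) -> list[str]:
    """Return a list of problems found in an extracted record (empty if none)."""
    problems = []
    for spec in schema:
        if spec.name not in record:
            problems.append(f"missing field: {spec.name}")
        elif not isinstance(record[spec.name], spec.dtype):
            problems.append(f"wrong type for {spec.name}: expected {spec.dtype.__name__}")
    return problems


# The total came back as text instead of a number, so the check flags it.
extracted = {"invoice_number": "INV-1042", "total_due": "58.00", "due_date": "2026-03-01"}
print(check_record(extracted, INVOICE_SCHEMA))
```

The descriptions do most of the work: the more precisely a field says what to look for, the less the model has to guess.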

Under the hood, Petey is also a Python package with much more flexibility. You can plug in additional parsers and models and customize it in other ways. The larger goal is a framework that strikes a balance between sophistication and accessibility. PDF data extraction is an extremely common problem that is easier to solve than most people think. I would like Petey to help close that gap.

API Keys

For the most part, Petey doesn't do anything with your documents itself. It relies on external parsers and AI services to extract and interpret what is in your files. To use these services, you need what is called an API key for each one. An API key is like a password that lets Petey talk to the service on your behalf.

The reason you need to bring your own API keys is that these services cost money. They are very affordable (pennies per document), and you only pay for what you use.

Getting an API key is never very complicated. All you have to do is create a free account, set up your billing, and follow a few steps in the account settings. Once you have a key, paste it into Petey's Settings page.
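
For the curious, the sketch below shows what "talking to the service on your behalf" means in practice: the key travels as a header on each request. The header names follow OpenAI's and Anthropic's public HTTP APIs, and the environment variable names are the ones their official SDKs look for. How Petey stores and passes keys internally is not described here, so the `auth_headers` and `mask` helpers are assumptions for illustration only.

```python
import os


def auth_headers(provider: str) -> dict[str, str]:
    """Build the authentication headers each provider's HTTP API expects."""
    if provider == "openai":
        key = os.environ["OPENAI_API_KEY"]
        return {"Authorization": f"Bearer {key}"}
    if provider == "anthropic":
        key = os.environ["ANTHROPIC_API_KEY"]
        return {"x-api-key": key, "anthropic-version": "2023-06-01"}
    raise ValueError(f"unknown provider: {provider}")


def mask(key: str) -> str:
    """Hide all but the last four characters so a key can be shown safely."""
    return "*" * max(len(key) - 4, 0) + key[-4:]


# A placeholder value, not a real key.
print(mask("sk-example-1234"))
```

Because anyone holding the key can spend against your account, treat it like a password: keep it out of shared documents and screenshots.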

Which keys do I need?

Required (pick one): OpenAI or Anthropic. These are the AI models that actually read your documents and extract data. You need at least one.

Optional: Datalab or Unstructured. These are AI-powered PDF parsers that do a better job reading complex layouts, scanned documents, and handwriting. The built-in parser (PyMuPDF) is free and fast, but it can struggle with tabular data and doesn't handle scanned or image-based PDFs well.

Setup

OpenAI

LLM extraction (GPT-4.1, GPT-4.1 Mini, GPT-4o, GPT-4o Mini)
  1. Go to platform.openai.com and create an account. This is separate from a regular ChatGPT account — it's OpenAI's developer platform, but you don't need to be a developer to use it.
  2. Add a payment method under Settings → Billing. API usage is pay-as-you-go — you're only charged for what you use.
  3. Go to API Keys and click "Create new secret key". Give it a name like "Petey".
  4. Copy the key and paste it into Petey Settings.
Typical cost: $0.01-0.05 per document with GPT-4.1 Mini. GPT-4o is more expensive but sometimes more accurate on complex layouts.

Anthropic

LLM extraction (Claude Sonnet 4, Claude Haiku 4.5)
  1. Go to console.anthropic.com and create an account. Like OpenAI, this is a separate account from the regular Claude chat — it's Anthropic's API console.
  2. Add credits under Settings → Billing. You prepay for credits and they're drawn down as you use the API.
  3. Go to API Keys and click "Create Key".
  4. Copy the key and paste it into Petey Settings.
Claude models are competitive with GPT on extraction tasks. Haiku 4.5 is fast and cheap; Sonnet 4 is more capable.

Datalab

AI-powered PDF parser — best for scanned docs, complex layouts, and handwriting
  1. Go to datalab.to and click "Sign Up" to create an account.
  2. Once logged in, go to API Keys in your dashboard.
  3. Click "Create Key", give it a name, and copy the key.
  4. Paste it into Petey Settings under "Datalab API Key".
Datalab is the recommended parser for scanned or handwritten documents. Free tier available with generous limits.

Unstructured

Alternative AI-powered PDF parser
  1. Go to unstructured.io and click "Get Started" to create an account.
  2. Once logged in, go to API Keys in your dashboard.
  3. Generate a new key and copy it.
  4. Paste it into Petey Settings under "Unstructured API Key".
An alternative to Datalab. Useful if you're already using Unstructured in other workflows. Free tier available.

Costs

These are approximate costs per page of a typical document. Actual costs vary with document length and complexity. You are only charged for a parser if you use one; PyMuPDF is free and built in.

Service    Model             Cost / page  Speed   Notes
OpenAI     GPT-4.1 Mini      ~$0.01       Fast    Best all-around value. Great starting point.
OpenAI     GPT-4o Mini       ~$0.01       Fast    Similar to 4.1 Mini. Slightly older model.
OpenAI     GPT-4.1           ~$0.03       Medium  More capable on complex or ambiguous documents.
OpenAI     GPT-4o            ~$0.05       Medium  Strong on complex layouts. Higher cost.
Anthropic  Claude Haiku 4.5  ~$0.01       Fast    Comparable to GPT-4.1 Mini. Fast and cheap.
Anthropic  Claude Sonnet 4   ~$0.04       Medium  Very capable. Good for nuanced extraction.
Datalab    Datalab           ~$0.005      Medium  AI parser. Best for tables, scans, and handwriting. Free tier available.
PyMuPDF    -                 Free         Fast    Built in. Good for clean digital PDFs. Struggles with tables and scans.

Costs are estimates based on provider pricing as of early 2026 and may change. Check each provider's billing page for current rates. A typical extraction job (one document, one page) costs about 1-5 cents.
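
If you want to budget a larger job, the arithmetic is simple enough to sketch. The per-page figures below are copied from the table above; the `estimate_cost` function and the assumption that parser and model costs simply add up per page are mine, so treat the result as a ballpark rather than a quote.

```python
# Approximate per-page costs in US dollars, copied from the cost table.
MODEL_COST_PER_PAGE = {
    "GPT-4.1 Mini": 0.01,
    "GPT-4o Mini": 0.01,
    "GPT-4.1": 0.03,
    "GPT-4o": 0.05,
    "Claude Haiku 4.5": 0.01,
    "Claude Sonnet 4": 0.04,
}
PARSER_COST_PER_PAGE = {"PyMuPDF": 0.0, "Datalab": 0.005}


def estimate_cost(pages: int, model: str, parser: str = "PyMuPDF") -> float:
    """Estimate a job's cost, assuming each page pays for the parser plus the model."""
    per_page = MODEL_COST_PER_PAGE[model] + PARSER_COST_PER_PAGE[parser]
    return round(pages * per_page, 4)


# 100 one-page documents with the built-in parser, then with Datalab.
print(estimate_cost(100, "GPT-4.1 Mini"))
print(estimate_cost(100, "GPT-4.1 Mini", "Datalab"))
```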

FAQ

Do API keys cost anything to create?
No. Creating an account and generating an API key is free for all providers. You only pay when Petey actually uses the service to process a document.
Which key should I start with?
OpenAI with GPT-4.1 Mini is the best starting point: it's among the cheapest and fastest models, and it works well on most documents. You can always try other models later.
Do I need a parser key?
It depends on your documents. The built-in parser (PyMuPDF) is fast and free, and works great for clean, digitally created PDFs with mostly text. But if your documents have tables, scanned pages, or handwriting, an AI parser like Datalab will give significantly better results. The table extraction demos all use Datalab for this reason.
What if I lose my API key?
No problem. Just log back into the provider's website and generate a new key. Then replace the old one in Petey Settings. There's no penalty for creating a new key.
Are my API keys stored securely?
When running Petey locally (desktop app or Docker), your keys are stored on your own machine and never leave it except to go directly to the provider. On the hosted version (petey.cc), keys are stored in your account settings and only used server-side to make requests on your behalf.
How do I know how much I'm spending?
Each provider has a billing dashboard where you can track usage in real time. Most also let you set spending limits or alerts. For reference, processing 100 one-page documents with GPT-4.1 Mini costs about $1.
Can I use the same key for ChatGPT / Claude chat?
No. The API keys used by Petey are from the provider's developer platform, which is separate from their consumer chat products (ChatGPT, Claude). The accounts and billing are separate too — an API key won't charge your ChatGPT subscription, and vice versa.

Cost estimates. Any cost estimates shown by Petey are approximations based on published provider pricing and typical document characteristics. Actual charges depend on document length, complexity, and current provider rates. Petey does not guarantee the accuracy of cost estimates.

Third-party charges. Petey connects to external services (OpenAI, Anthropic, Datalab, Unstructured) using API keys you provide. You are responsible for all charges incurred through these services. Petey does not control pricing, billing, or account management for any third-party provider.

Data handling. When you run an extraction, your document text is sent to the external parser and/or AI model you selected. Petey transmits data securely, but does not control how third-party providers store, process, or retain data after receipt. Review each provider's privacy policy and terms of service for details.

No personal data collection. Petey does not collect, store, or transmit personal information beyond the API keys and run history needed to operate. When running locally (desktop app or Docker), all data stays on your machine.

Accuracy. Petey makes no guarantees about the accuracy or completeness of extracted data. AI-powered extraction is probabilistic — results should be reviewed before use in any critical application.

Open source. Petey is open source software provided as-is, without warranty of any kind. See the license for full terms.