PDFs Are Awful
The PDF format was designed to look identical on any screen or printer. It was technology agnostic, a universal container for the printed page. Its visual presentation was all that mattered. As long as it rendered correctly, the internal representation didn't matter.
As a result, the inside of your average PDF is chaos. It is just a bunch of items (words, characters, shapes, images) and their coordinates, with little or no regard for the relationship between anything. What reads as one cohesive line of text could be three groups of words that happened to be positioned sequentially with the same y-value. You have probably experienced this if you have ever tried to select text in a PDF and found yourself highlighting disconnected words on four different lines.
The primary challenge of getting your data out of a PDF is imposing some order on this code soup. A lot of hard-working folks have developed methods and tools to accomplish this over the years, finding ingenious ways to infer meaning from the relative positions of all the elements of a document. And AI can be very helpful here. It doesn't take a particularly advanced LLM to interpret a fairly thorny PDF.
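To make the y-value problem concrete, here is a minimal sketch of how a tool might rebuild lines of text from loose words and coordinates. It is not how Petey or any particular parser works. The `Word` class, the `group_into_lines` function, and the 2-point tolerance are all invented for illustration, and the sketch assumes y grows as you move down the page.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word (assumed units: points)
    y: float  # vertical position (assumed to increase down the page)


def group_into_lines(words, y_tolerance=2.0):
    """Cluster words with nearly equal y-values, then order each line by x."""
    lines = []
    for word in sorted(words, key=lambda w: w.y):
        # Join the current line if this word's y is close to the previous
        # word's y; otherwise start a new line.
        if lines and abs(word.y - lines[-1][-1].y) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines]


# Words arrive in arbitrary order, as they might be stored inside a PDF.
words = [
    Word("Due:", 300.0, 100.4),
    Word("Invoice", 50.0, 100.0),
    Word("Total", 50.0, 120.0),
    Word("#1042", 110.0, 100.2),
    Word("$250.00", 300.0, 120.1),
]
lines = group_into_lines(words)
print(lines)
```

Even this toy version has to guess at a tolerance, which hints at why real documents with columns, tables, and skewed scans are so much harder.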
But models need infrastructure, and not everyone has time to teach themselves how to be an amateur ML engineer. That's why I made Petey.
Petey Is Better
Petey helps you leverage AI to get better results out of tried and true tools for PDF extraction, with as little or as much tinkering as you see fit.
The Schema is the key to getting good results. It's a structured definition of the data you want to extract: field names, data types, and descriptions that tell Petey exactly what to look for and how to interpret it. You can build one from scratch, let Petey infer one from a sample PDF, or import one you've already made.
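As a rough illustration of what "field names, data types, and descriptions" can look like in practice, here is a small sketch in plain Python. This is not Petey's actual schema format. The `Field` class, the `invoice_schema` list, and the `validate` helper are hypothetical names chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str         # what the value is called in the output
    dtype: type       # the Python type the value should have
    description: str  # a hint describing what to look for in the document


# Hypothetical schema for an invoice; not Petey's real schema format.
invoice_schema = [
    Field("invoice_number", str, "The identifier printed near the top of the invoice"),
    Field("invoice_date", str, "Date the invoice was issued, as YYYY-MM-DD"),
    Field("total_due", float, "Final amount owed, including tax"),
]


def validate(record, schema):
    """Return a list of problems found when checking a record against the schema."""
    problems = []
    for field in schema:
        if field.name not in record:
            problems.append(f"missing field: {field.name}")
        elif not isinstance(record[field.name], field.dtype):
            problems.append(f"wrong type for {field.name}")
    return problems


good = {"invoice_number": "1042", "invoice_date": "2026-01-15", "total_due": 250.0}
bad = {"invoice_number": 1042, "invoice_date": "2026-01-15"}
good_problems = validate(good, invoice_schema)
bad_problems = validate(bad, invoice_schema)
print(good_problems, bad_problems)
```

The descriptions do most of the work: they tell the model which of several similar-looking numbers or dates you actually want.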
Under the hood, Petey is also a Python package with much more flexibility. It allows for integration of more parsers, models, and other customization. The larger goal is a framework that strikes a balance between sophistication and accessibility. PDF data extraction is an extremely common problem that is easier to solve than most people think. I would like Petey to help close that gap.
API Keys
For the most part, Petey doesn't do anything with your documents itself. It relies on external parsers and AI services to extract and interpret what is in your files. To use these services, you need what is called an API key for each one. An API key is like a password that lets Petey talk to the service on your behalf.
The reason you need to bring your own API keys is that these services cost money. They are very affordable (pennies per document), and you only pay for what you use.
Getting an API key is never very complicated. All you have to do is create a free account, set up your billing, and follow a few steps in the account settings. Once you have a key, paste it into Petey's Settings page.
Which keys do I need?
Required (pick one): OpenAI or Anthropic. These are the AI models that actually read your documents and extract data. You need at least one.
Optional: Datalab or Unstructured. These are AI-powered PDF parsers that do a better job reading complex layouts, scanned documents, and handwriting. The built-in parser (PyMuPDF) is free and fast, but it can struggle with tabular data and doesn't handle scanned or image-based PDFs well.
Setup
OpenAI
- Go to platform.openai.com and create an account. This is separate from a regular ChatGPT account — it's OpenAI's developer platform, but you don't need to be a developer to use it.
- Add a payment method under Settings → Billing. API usage is pay-as-you-go — you're only charged for what you use.
- Go to API Keys and click "Create new secret key". Give it a name like "Petey".
- Copy the key and paste it into Petey Settings.
Anthropic
- Go to console.anthropic.com and create an account. Like OpenAI, this is a separate account from the regular Claude chat — it's Anthropic's API console.
- Add credits under Settings → Billing. You prepay for credits and they're drawn down as you use the API.
- Go to API Keys and click "Create Key".
- Copy the key and paste it into Petey Settings.
Datalab
- Go to datalab.to and click "Sign Up" to create an account.
- Once logged in, go to API Keys in your dashboard.
- Click "Create Key", give it a name, and copy the key.
- Paste it into Petey Settings under "Datalab API Key".
Unstructured
- Go to unstructured.io and click "Get Started" to create an account.
- Once logged in, go to API Keys in your dashboard.
- Generate a new key and copy it.
- Paste it into Petey Settings under "Unstructured API Key".
Costs
These are approximate costs per page of a typical document. Actual costs vary depending on document length and complexity. Parsers are only charged if you use them — PyMuPDF is free and built in.
| Service | Model | Cost / page | Speed | Notes |
|---|---|---|---|---|
| OpenAI | GPT-4.1 Mini | ~$0.01 | Fast | Best all-around value. Great starting point. |
| OpenAI | GPT-4o Mini | ~$0.01 | Fast | Similar to 4.1 Mini. Slightly older model. |
| OpenAI | GPT-4.1 | ~$0.03 | Medium | More capable on complex or ambiguous documents. |
| OpenAI | GPT-4o | ~$0.05 | Medium | Strong on complex layouts. Higher cost. |
| Anthropic | Claude Haiku 4.5 | ~$0.01 | Fast | Comparable to GPT-4.1 Mini. Fast and cheap. |
| Anthropic | Claude Sonnet 4 | ~$0.04 | Medium | Very capable. Good for nuanced extraction. |
| Datalab | Datalab | ~$0.005 | Medium | AI parser. Best for tables, scans, and handwriting. Free tier available. |
| PyMuPDF | PyMuPDF | Free | Fast | Built in. Good for clean digital PDFs. Struggles with tables and scans. |
Costs are estimates based on provider pricing as of early 2026 and may change. Check each provider's billing page for current rates. A typical extraction job (one document, one page) costs about 1-5 cents.
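If you want a rough budget before a large batch, the arithmetic is simple. The sketch below uses a few of the approximate per-page figures from the table above. It assumes that model and parser costs simply add up per page, which is a simplification of mine rather than a statement about how any provider bills. The `estimate_cost` function is hypothetical and not part of Petey.

```python
# Approximate per-page costs in US dollars, taken from the table above.
COST_PER_PAGE = {
    "GPT-4.1 Mini": 0.01,
    "GPT-4.1": 0.03,
    "Claude Sonnet 4": 0.04,
    "Datalab": 0.005,
    "PyMuPDF": 0.0,  # built-in parser, free
}


def estimate_cost(pages, model, parser="PyMuPDF"):
    """Estimate a job's cost as pages * (model cost + parser cost).

    Assumes the two costs add per page; real bills depend on document
    length, complexity, and current provider rates.
    """
    return round(pages * (COST_PER_PAGE[model] + COST_PER_PAGE[parser]), 4)


small_job = estimate_cost(1, "GPT-4.1 Mini")
big_job = estimate_cost(200, "GPT-4.1 Mini", parser="Datalab")
print(f"1 page: ${small_job:.2f}, 200 pages with Datalab: ${big_job:.2f}")
```

Treat the result as a ballpark only, and check your provider dashboards for what you were actually charged.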
FAQ
Do API keys cost anything to create?
Which key should I start with?
Do I need a parser key?
What if I lose my API key?
Are my API keys stored securely?
How do I know how much I'm spending?
Can I use the same key for ChatGPT / Claude chat?
Legal
Cost estimates. Any cost estimates shown by Petey are approximations based on published provider pricing and typical document characteristics. Actual charges depend on document length, complexity, and current provider rates. Petey does not guarantee the accuracy of cost estimates.
Third-party charges. Petey connects to external services (OpenAI, Anthropic, Datalab, Unstructured) using API keys you provide. You are responsible for all charges incurred through these services. Petey does not control pricing, billing, or account management for any third-party provider.
Data handling. When you run an extraction, your document text is sent to the external parser and/or AI model you selected. Petey transmits data securely, but does not control how third-party providers store, process, or retain data after receipt. Review each provider's privacy policy and terms of service for details.
No personal data collection. Petey does not collect, store, or transmit personal information beyond the API keys and run history needed to operate. When running locally (desktop app or Docker), all data stays on your machine.
Accuracy. Petey makes no guarantees about the accuracy or completeness of extracted data. AI-powered extraction is probabilistic — results should be reviewed before use in any critical application.
Open source. Petey is open source software provided as-is, without warranty of any kind. See the license for full terms.