PDF data extraction

Your PDFs in.
Your data out.

Petey turns PDFs into structured data. Drop one in, get a row out. Drop a thousand in, get a table. No code. No upload.

A desktop app for a common problem.

PDF data extraction is everywhere, and easier to solve than most people think. The pieces (parsers, AI models, structured outputs) have existed for years; the hard part is getting them to work together. Petey is the desktop app that does the wiring, balancing sophistication with accessibility.

Four steps. Same every time.

Under the hood, the same four-step pipeline runs everywhere — desktop app, web app, Docker container, Python library. Pluggable parsers, pluggable LLMs. Same blueprints, same outputs. Open source, AGPL, free.

Open source · AGPL
In the package. Audit it. Fork it.
Files
PDF, image, scanned
Parse
PyMuPDF, Marker, Tesseract, Datalab, 10+ more
Comprehend
OpenAI, Anthropic, Google, Ollama
Export
CSV / JSON to your filesystem
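
To make the four steps concrete, here is a minimal sketch of the shape such a pipeline could take. It is not Petey's actual code: the `Pipeline` class, the `Parser` and `Comprehender` types, and the toy stand-in functions are all hypothetical names invented for illustration, and a plain list of field names stands in for a blueprint.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interfaces for illustration only; not Petey's real API.
Parser = Callable[[bytes], str]                              # file bytes -> text
Comprehender = Callable[[str, List[str]], Dict[str, str]]    # text + fields -> row


@dataclass
class Pipeline:
    parse: Parser              # pluggable: swap in any parser
    comprehend: Comprehender   # pluggable: swap in any LLM backend
    fields: List[str]          # stands in for a blueprint

    def run(self, files: Dict[str, bytes]) -> List[Dict[str, str]]:
        rows = []
        for name, data in files.items():               # 1. Files
            text = self.parse(data)                    # 2. Parse
            row = self.comprehend(text, self.fields)   # 3. Comprehend
            rows.append({"file": name, **row})
        return rows


def to_csv(rows: List[Dict[str, str]]) -> str:         # 4. Export
    """Serialize rows to CSV text; one row per input file."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Toy stand-ins so the sketch runs without a real parser or model.
def toy_parse(data: bytes) -> str:
    return data.decode("utf-8")


def toy_comprehend(text: str, fields: List[str]) -> Dict[str, str]:
    found = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            found[key.strip().lower()] = value.strip()
    # Missing fields come back as empty strings so every row has the same columns.
    return {field: found.get(field, "") for field in fields}


pipeline = Pipeline(parse=toy_parse, comprehend=toy_comprehend, fields=["name", "total"])
rows = pipeline.run({"a.pdf": b"Name: Ada\nTotal: 42", "b.pdf": b"Name: Bob"})
csv_text = to_csv(rows)
```

One file in gives one row; many files give a table. Swapping the parser or the model means passing a different function, with the same fields and the same output shape.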

Need something extra? We'll build it with you.

Sources, evaluation, destinations, custom blueprints, on-prem help. We'll prioritize building what a customer actually needs. Same AGPL license; once it's built, it's in the package for everyone.

Sources
Where the PDFs come from
  • SharePoint
  • OneDrive
  • S3 / Drive
  • Watch folders
  • Email inbox
Evaluation
Trust the output
  • Per-field confidence scores
  • Review UI for flagged fields
  • Append-only audit log
Export
Where the data lands
  • Postgres, SQLite, MongoDB
  • Webhooks
  • Streaming endpoints
  • Custom API integrations
Blueprints & deploy
Built around your work
  • Custom blueprints for your doc types
  • On-prem deployment assist
  • VPC / regulated-environment setup

A real Python framework, under the hood.

The app is built on a library you can use directly — same parsers, same blueprints, same extraction logic. Drop Petey into your existing pipeline, or wire it into agents and automations.

  • Plugin architecture. Register new parsers and LLM backends in YAML. Swap them per run.
  • CLI and Python API. Run batch jobs from a Makefile, CI, scheduler, or your own scripts.
  • MCP server. Expose Petey to Claude Desktop, Cursor, and other agents — documents stay on the user's machine.
extract.py
# $ pip install petey

from petey import extract

results = extract(
    blueprint="blueprints/cms1500.bpt",
    files="claims/*.pdf",
)
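
The plugin architecture mentioned above can be pictured as a name-to-function registry that a per-run config selects from. The sketch below is an assumption about the general pattern, not Petey's real plugin API: `PARSERS`, `register_parser`, and `parser_for` are hypothetical names, and a plain dict stands in for a config that would normally be loaded from YAML.

```python
from typing import Callable, Dict

# Hypothetical registry; Petey's real registration mechanism may differ.
PARSERS: Dict[str, Callable[[bytes], str]] = {}


def register_parser(name: str):
    """Decorator that adds a parser function to the registry under `name`."""
    def decorator(fn: Callable[[bytes], str]) -> Callable[[bytes], str]:
        PARSERS[name] = fn
        return fn
    return decorator


@register_parser("plain")
def plain_parser(data: bytes) -> str:
    return data.decode("utf-8")


@register_parser("upper")
def upper_parser(data: bytes) -> str:
    return data.decode("utf-8").upper()


def parser_for(config: dict) -> Callable[[bytes], str]:
    """Look up the parser named in a run config (a dict standing in for YAML)."""
    name = config["parser"]
    try:
        return PARSERS[name]
    except KeyError:
        raise ValueError(
            f"unknown parser {name!r}; registered: {sorted(PARSERS)}"
        ) from None


# Two runs, two configs, same calling code.
run_a = {"parser": "plain"}
run_b = {"parser": "upper"}
```

The calling code never names a concrete parser, which is what makes swapping them per run a config change rather than a code change.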

Run it where your work lives.

Same code, same blueprints, three deployment shapes. Pick the one that fits the environment your work actually runs in.

Local
On your machine
PyMuPDF, Tesseract, Ollama. Nothing leaves your laptop. Best for sensitive work and tight feedback loops while you build a blueprint.
$ petey extract docs/ --blueprint my.bpt
Container
Your cloud, your cluster
One Docker image runs anywhere — GCP, AWS, your own Kubernetes. Bring your own API keys; Petey never proxies, never logs document contents.
$ docker run afriedman412/petey
VPC / On-prem
Inside your firewall
Drop the whole stack inside your environment. No external calls except the ones you authorize. Built for regulated and compliance-bound work.
$ docker-compose up
Privacy
You choose where your data goes.

Petey is a tool, not a service. We never see your documents — and you decide how much of your data leaves your environment, if any.

Tier 01 · always true
We never see your documents.
Petey-the-company never accesses, stores, collects, or trains on user documents. Architectural commitment, not a privacy-policy line item.
Tier 02 · when you self-host
Your documents never leave your environment.
Run the stack on your laptop, in your VPC, or on-prem. Any external APIs (LLMs, parsers) are your vendor relationships, with your keys. Petey is never in the middle.
Tier 03 · fully local mode
Your documents never touch any external service.
Pair local parsers (PyMuPDF, Tesseract) with a local LLM (Ollama, MLX). Available today for digital PDFs on Apple Silicon; expanding as small local models improve.
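
If you want to enforce fully local mode in your own scripts, one option is a guard that runs before any document is read. The sketch below is hypothetical and not part of Petey: the function name and the lowercase identifier strings are assumptions, and the two sets contain only the parsers and LLM runtimes this page names as local.

```python
# Local components as named on this page; the lowercase identifiers are assumed.
LOCAL_PARSERS = {"pymupdf", "tesseract"}
LOCAL_LLMS = {"ollama", "mlx"}


def is_fully_local(parser: str, llm: str) -> bool:
    """Return True only if both the parser and the LLM are in the local sets."""
    return parser.lower() in LOCAL_PARSERS and llm.lower() in LOCAL_LLMS


def require_fully_local(parser: str, llm: str) -> None:
    """Raise before any document is read if the pairing would call out."""
    if not is_fully_local(parser, llm):
        raise RuntimeError(
            f"{parser} + {llm} is not a fully local pairing; refusing to run"
        )
```

Anything not in the two sets is treated as external, so an unrecognized component fails closed rather than open.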