PDF data extraction

Your PDFs in.
Your data out.

Petey turns PDFs into structured data. Drop one in, get a row out. Drop a thousand in, get a table. No code. No upload.

Download for Mac & Windows, or try it in the browser.

The basics

A desktop app for a common problem.

PDF data extraction is a problem almost every team runs into, and it is easier to solve than most people think. The pieces (parsers, AI models, structured outputs) have existed for years; the hard part is getting them to work together. Petey is the desktop app that does that wiring, so you get capable extraction without writing code.

The pipeline

Four steps. Same every time.

Under the hood, the same four-step pipeline runs everywhere — desktop app, web app, Docker container, Python library. Pluggable parsers, pluggable LLMs. Same blueprints, same outputs. Open source, AGPL, free.

Open source · AGPL
In the package. Audit it. Fork it.

Files: PDF, image, scanned
→ Parse: PyMuPDF, Marker, Tesseract, Datalab, 10+ more
→ Comprehend: OpenAI, Anthropic, Google, Ollama
→ Export: CSV / JSON to your filesystem
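
For readers who think in code, the four steps above can be sketched in plain Python. This is an illustration of the shape of the pipeline only, not Petey's actual API: the names `run_pipeline`, `export_csv`, `toy_parse`, and `toy_comprehend` are hypothetical, and the stand-in stages replace a real parser and a real LLM so the sketch runs with nothing installed.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List

# Hypothetical stage signatures, for illustration only (not Petey's API).
Parser = Callable[[bytes], str]                  # file bytes -> plain text
Comprehender = Callable[[str], Dict[str, str]]   # text -> structured fields


def run_pipeline(
    files: Iterable[bytes], parse: Parser, comprehend: Comprehender
) -> List[Dict[str, str]]:
    """Files -> Parse -> Comprehend: one row per input file."""
    return [comprehend(parse(data)) for data in files]


def export_csv(rows: List[Dict[str, str]]) -> str:
    """Export: serialize rows to CSV text, using the first row's keys as the header."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Stand-in stages so the sketch runs without any parser or model installed.
def toy_parse(data: bytes) -> str:
    return data.decode("utf-8")


def toy_comprehend(text: str) -> Dict[str, str]:
    # Pretend "model": split "key: value" lines into fields.
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


rows = run_pipeline(
    [b"name: Ada\ntotal: 12.50", b"name: Bo\ntotal: 7.00"],
    toy_parse,
    toy_comprehend,
)
csv_text = export_csv(rows)
```

Because each stage is just a callable, swapping a parser or a model means passing a different function; the surrounding steps stay the same. One file in gives one row, many files in give a table.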

Available on request

Need something extra? We'll build it with you.

Sources, evaluation, destinations, custom blueprints, on-prem help. We'll prioritize building what a customer actually needs. Same AGPL license; once it's built, it's in the package for everyone.

Sources

Where the PDFs come from

SharePoint
OneDrive
S3 / Drive
Watch folders
Email inbox

Evaluation

Trust the output

Per-field confidence scores
Review UI for flagged fields
Append-only audit log
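
As a rough idea of what per-field confidence and review flagging could look like, here is a minimal sketch. Everything in it is an assumption for illustration: the `FieldResult` type, the `flag_for_review` helper, the 0.8 threshold, and the sample values are made up, and the real feature may be shaped quite differently once it is built.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FieldResult:
    """Hypothetical container for one extracted field."""
    value: str
    confidence: float  # assumed to be in [0.0, 1.0]


def flag_for_review(row: Dict[str, FieldResult], threshold: float = 0.8) -> List[str]:
    """Return the names of fields whose confidence falls below the threshold."""
    return sorted(name for name, field in row.items() if field.confidence < threshold)


# Made-up sample row; values and scores are illustrative only.
row = {
    "patient_name": FieldResult("Ada Lovelace", 0.97),
    "claim_total": FieldResult("125.00", 0.62),
    "service_date": FieldResult("2024-01-03", 0.79),
}
flagged = flag_for_review(row)
```

Only the flagged fields would need a human look; the rest of the row passes through untouched.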

Destinations

Where the data lands

Postgres, SQLite, MongoDB
Webhooks
Streaming endpoints
Custom API integrations

Blueprints & deploy

Built around your work

Custom blueprints for your doc types
On-prem deployment assist
VPC / regulated-environment setup

For engineers

A real Python framework, under the hood.

The app is built on a library you can use directly — same parsers, same blueprints, same extraction logic. Drop Petey into your existing pipeline, or wire it into agents and automations.

→ Plugin architecture. Register new parsers and LLM backends in YAML. Swap them per run.
→ CLI and Python API. Run batch jobs from a Makefile, CI, scheduler, or your own scripts.
→ MCP server. Expose Petey to Claude Desktop, Cursor, and other agents; documents stay on the user's machine.
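
The plugin idea in the first bullet can be sketched as a small name-to-callable registry. This is a generic pattern, not Petey's implementation: `PARSERS`, `register_parser`, and `parser_from_config` are hypothetical names, the two toy parsers stand in for real ones, and the plain dict stands in for whatever a YAML config would load to.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a plugin name to a parser callable.
PARSERS: Dict[str, Callable[[bytes], str]] = {}


def register_parser(name: str):
    """Decorator that adds a parser to the registry under the given name."""
    def decorator(func: Callable[[bytes], str]) -> Callable[[bytes], str]:
        PARSERS[name] = func
        return func
    return decorator


@register_parser("utf8")
def utf8_parser(data: bytes) -> str:
    # Toy parser: treat the file as UTF-8 text.
    return data.decode("utf-8")


@register_parser("upper")
def upper_parser(data: bytes) -> str:
    # Second toy parser, so there is something to swap to.
    return data.decode("utf-8").upper()


def parser_from_config(config: Dict[str, str]) -> Callable[[bytes], str]:
    """Look up the parser named in a config mapping (e.g. one loaded from YAML)."""
    name = config["parser"]
    if name not in PARSERS:
        raise KeyError(f"unknown parser {name!r}; registered: {sorted(PARSERS)}")
    return PARSERS[name]


# Swapping the parser per run is just a different config value.
run_a = parser_from_config({"parser": "utf8"})(b"invoice 42")
run_b = parser_from_config({"parser": "upper"})(b"invoice 42")
```

The same pattern would extend to LLM backends: a second registry keyed by backend name, selected by a second config key.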

Available on GitHub, PyPI, and Docker Hub.

extract.py

# $ pip install petey

from petey import extract

results = extract(
    blueprint="blueprints/cms1500.bpt",
    files="claims/*.pdf",
)

Deploy

Run it where your work lives.

Same code, same blueprints, three deployment shapes. Pick the one that fits the environment your work actually runs in.

Local

On your machine

PyMuPDF, Tesseract, Ollama. Nothing leaves your laptop. Best for sensitive work and tight feedback loops while you build a blueprint.

$ petey extract docs/ --blueprint my.bpt

Container

Your cloud, your cluster

One Docker image runs anywhere — GCP, AWS, your own Kubernetes. Bring your own API keys; Petey never proxies, never logs document contents.

$ docker run afriedman412/petey

VPC / On-prem

Inside your firewall

Drop the whole stack inside your environment. No external calls except the ones you authorize. Built for regulated and compliance-bound work.

$ docker-compose up

Privacy

You choose where your data goes.

Petey is a tool, not a service. We never see your documents — and you decide how much of your data leaves your environment, if any.

Tier 01 · always true

We never see your documents.

Petey-the-company never accesses, stores, collects, or trains on user documents. Architectural commitment, not a privacy-policy line item.

Tier 02 · when you self-host

Your documents never leave your environment.

Run the stack on your laptop, in your VPC, or on-prem. Any external APIs (LLMs, parsers) are your vendor relationships, with your keys. Petey is never in the middle.

Tier 03 · fully local mode

Your documents never touch any external service.

Pair local parsers (PyMuPDF, Tesseract) with a local LLM (Ollama, MLX). Available today for digital PDFs on Apple Silicon; expanding as small local models improve.