Schema Tutorial

Understanding Schemas

A schema tells Petey exactly what data to extract from your documents. It's the most important part of the process.

A schema is a list of fields. Each field has three parts:

Anatomy of a field

Name: becomes the column header in your results.
Type: tells the model what kind of data to expect.
Description: guides the model's interpretation. This is where you give instructions.

If you've used ChatGPT or Claude, you already know the basics — a schema is essentially a structured prompt. Instead of writing a paragraph asking an AI to "find the patient's name, age, and gender", you define each piece of data as a field. Petey turns your schema into a prompt behind the scenes.
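To make the "structured prompt" idea concrete, here is a minimal Python sketch of how a list of fields could be rendered as prompt text. The `Field` class and the `schema_to_prompt` function are hypothetical names invented for illustration; Petey's actual prompt format is not described in this tutorial and may look quite different.

```python
from dataclasses import dataclass


@dataclass
class Field:
    """One schema field: a name, a type, and a description (hypothetical model)."""
    name: str
    type: str
    description: str


def schema_to_prompt(fields: list[Field]) -> str:
    """Render each field as one instruction line (illustrative format only)."""
    lines = ["Extract the following fields from the document:"]
    for f in fields:
        lines.append(f"- {f.name} ({f.type}): {f.description}")
    return "\n".join(lines)


schema = [
    Field("name", "Text", "Patient name"),
    Field("age", "Number", "Patient age"),
]
print(schema_to_prompt(schema))
```

The point of the sketch is only that each field contributes its own small instruction, which is why the description matters as much as the name and type.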

We'll build a schema for a clinical note step by step. By the end, you'll understand when to use each field type and how descriptions shape your results.

Text fields: copying data

The simplest kind of extraction — find a value and copy it.

A Text field tells the model to extract a string value. For something like a patient's name, the model just finds it in the document and copies it directly.

Field 1

name
Text
Patient name

The description here is simple — "Patient name" — because there's no ambiguity. The model knows exactly what to look for. Not every field needs a complex description.

The result for this field will be something like "Margaret Ellison" — copied straight from the document.

Number fields

Use Number when you want a clean numeric value.

A Number field tells the model to return a numeric value. Even if the document says "34 years old", the model will extract just 34.

Fields so far

name
Text
Patient name

age
Number
Patient age

Number is the right choice here because age is always numeric and we want a clean value for analysis. If a document gives ages like "thirty-four" or "3 months", the model will still return a number, so use the description to state the unit you expect (for example, "Patient age in years").
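As a rough picture of what "a clean numeric value" means, the Python sketch below pulls the first number out of a phrase. It is an illustration under loud assumptions, not how Petey or the model works internally: `first_number` is a hypothetical helper, and it only understands digits, so a spelled-out age like "thirty-four" is exactly the kind of case that still needs the model.

```python
import re
from typing import Optional


def first_number(text: str) -> Optional[float]:
    """Return the first numeric value found in text, or None if there is none."""
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


print(first_number("34 years old"))  # 34.0
print(first_number("thirty-four"))   # None: this sketch handles digits only
```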

Category fields

Constrain the output to a fixed set of values.

A Category field gives the model a list of allowed values. Instead of returning free text, it must pick from your list. This standardizes the output, so you avoid "M" vs "Male" vs "male" inconsistencies.

Fields so far

name
Text
Patient name

age
Number
Patient age

gender
Category
Infer from pronouns if not obvious

Allowed values: Male, Female, Non-binary

Notice the description: "Infer from pronouns if not obvious". This is where descriptions shine. The document might not say "Gender: Female" — but if it uses "she/her" pronouns, the model knows what to do. The description is an instruction, not just a label.
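One practical benefit of a fixed list is that results are easy to check afterwards. The Python sketch below assumes the extracted value arrives as a plain string (or as None when nothing could be determined). `ALLOWED_GENDERS` and `check_category` are hypothetical names used for illustration and are not part of Petey.

```python
from typing import Optional

# Allowed values copied from the gender field above.
ALLOWED_GENDERS = ("Male", "Female", "Non-binary")


def check_category(value: Optional[str], allowed: tuple[str, ...]) -> Optional[str]:
    """Return the canonical allowed value matching `value`, ignoring case.

    None passes through unchanged to represent an undetermined field.
    Anything else outside the list raises ValueError.
    """
    if value is None:
        return None
    for option in allowed:
        if option.lower() == value.strip().lower():
            return option
    raise ValueError(f"{value!r} is not one of {allowed}")


print(check_category("female", ALLOWED_GENDERS))  # Female
```

A check like this would reject "M" rather than guess at it, which mirrors the reason for using a Category field in the first place.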

Date fields

Extract dates in a consistent format.

A Date field extracts date values. Documents express dates in all sorts of ways — "March 5, 2024", "3/5/24", "05-MAR-2024". The Date type standardizes them, and you can use the description to specify the format you want.
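To show what standardizing means for the three example strings, here is a small Python sketch using only the standard library. It is not Petey's implementation; `INPUT_FORMATS` and `to_iso_date` are hypothetical names, the target format YYYY-MM-DD is chosen for illustration, and reading "3/5/24" as month first is an assumption (US convention). Month names are parsed according to the current locale, which is assumed to be English.

```python
from datetime import datetime

# Candidate input formats, tried in order. Month-first for "3/5/24" is an
# assumption; a day-first source would need "%d/%m/%y" instead.
INPUT_FORMATS = ("%B %d, %Y", "%m/%d/%y", "%d-%b-%Y")


def to_iso_date(raw: str) -> str:
    """Convert a date string in one of INPUT_FORMATS to YYYY-MM-DD."""
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next candidate format
    raise ValueError(f"Unrecognized date format: {raw!r}")


for raw in ("March 5, 2024", "3/5/24", "05-MAR-2024"):
    print(raw, "->", to_iso_date(raw))
```

The ambiguity around "3/5/24" is a good reason to spell out the expected format in the field description rather than leave it to chance.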

Fields so far

name
Text
Patient name

age
Number
Patient age

gender
Category
Infer from pronouns if not obvious

visit_date
Date
Date of the visit in YYYY-MM-DD format

visit_outcome
Text
Outcome of the visit

The schema now has a second Text field, visit_outcome. Compare it with name: "Patient name" means find and copy, while "Outcome of the visit" means read, understand, and summarize. Same type, completely different behavior. Think of the description as a mini-prompt: it's your instruction to the AI for that specific field.

The result might be something like "discharged with salbutamol inhaler" — not a direct quote from the document, but a concise summary the model produced by reading the full note.

Additional instructions

Give the model extra context that applies to the whole extraction.

Field descriptions are per-field instructions. But sometimes you need to tell the model something that applies to every field — or to the document as a whole. That's what the Additional Instructions box is for.

Think of it like adding a note at the top of a prompt. For our clinical notes, we might write:

Additional Instructions

These are emergency department clinical notes. Use medical terminology where appropriate. If a field cannot be determined from the text, return null rather than guessing.

This does three things: it tells the model the type of document it's reading (so it knows what context to apply), sets a tone (medical terminology), and establishes a default behavior (null over guessing). These instructions are sent to the model alongside your schema and the document text — it's all part of the same prompt.

Good additional instructions answer questions like: What kind of documents are these? Should the model use specific terminology? What should it do when the answer isn't clear?
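The idea that instructions, schema, and document text travel together in one prompt can be pictured with a short Python sketch. Everything here is an assumption made for illustration: `assemble_prompt`, the section labels, and the ordering are invented, and the real prompt Petey builds is not shown in this tutorial.

```python
def assemble_prompt(instructions: str, schema_text: str, document: str) -> str:
    """Join the three parts into one prompt; the section labels are invented."""
    parts = [
        ("Additional instructions", instructions),
        ("Schema", schema_text),
        ("Document", document),
    ]
    # Skip empty parts so a blank instructions box adds nothing to the prompt.
    return "\n\n".join(
        f"{label}:\n{body.strip()}" for label, body in parts if body.strip()
    )


prompt = assemble_prompt(
    "These are emergency department clinical notes. Use medical terminology "
    "where appropriate. If a field cannot be determined from the text, "
    "return null rather than guessing.",
    "- name (Text): Patient name\n- age (Number): Patient age",
    "(document text goes here)",  # placeholder, not real data
)
print(prompt)
```

Because the instructions sit in the same prompt as every field, a rule such as "return null rather than guessing" applies to all fields without being repeated in each description.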

Your schema

Here's the complete schema we built, plus a few extra fields.

ed_clinical_note

name
Text
Patient name

age
Number
Patient age

gender