API Reference

REST API for the same workflows as the app: create projects, upload documents, define agents, launch extraction, and read results with source citations.

Examples use Python 3.9+ with requests, but the API is plain JSON over HTTPS — use any client you like.

Base URL

https://api.parsewise.ai/api/v1

OpenAPI schema

The schema is the source of truth for request/response shapes, field names, and enum values.


Authentication

Every request must include your API key in the X-API-Key header.

import os
import requests

resp = requests.get(
    "https://api.parsewise.ai/api/v1/projects/",
    headers={"X-API-Key": os.environ["PARSEWISE_API_KEY"]},
)
resp.raise_for_status()
print(resp.json())
  • Keys are prefixed pw_live_ and scoped to one organisation.
  • Manage keys (create, rotate, revoke) on the Developer page.
  • Treat keys as secrets — never commit them or ship them to clients.
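
As a small sketch of the key-handling advice above, the helper below reads the key from the environment and fails fast when it is missing or lacks the pw_live_ prefix. The function names load_api_key and auth_headers, and the choice to raise RuntimeError, are illustrative assumptions rather than part of the API.

```python
import os


def load_api_key(env=None):
    """Return the key from PARSEWISE_API_KEY, failing fast if it looks wrong."""
    env = os.environ if env is None else env
    key = env.get("PARSEWISE_API_KEY", "").strip()
    if not key:
        raise RuntimeError("PARSEWISE_API_KEY is not set")
    if not key.startswith("pw_live_"):
        # Keys are documented as prefixed pw_live_; anything else is
        # most likely a copy/paste mistake.
        raise RuntimeError("PARSEWISE_API_KEY does not look like a Parsewise key")
    return key


def auth_headers(env=None):
    """Build the header dict that every request must carry."""
    return {"X-API-Key": load_api_key(env)}
```

Passing a plain dict as env keeps the helper easy to exercise without touching the real environment.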

Endpoints

All paths are relative to the base URL above. Tables are generated from the OpenAPI schema — see it for full request and response shapes.
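
Since every path in the tables is relative to the base URL and ends with a trailing slash, a small URL builder can help avoid typos. This is only a sketch; the endpoint helper is a hypothetical name, not something the API or an SDK provides.

```python
from urllib.parse import quote

BASE_URL = "https://api.parsewise.ai/api/v1"


def endpoint(*segments):
    """Join path segments onto the base URL and keep the trailing slash."""
    parts = [quote(str(s).strip("/"), safe="") for s in segments]
    parts = [p for p in parts if p]  # drop empty segments
    return "/".join([BASE_URL, *parts]) + "/"
```

For example, endpoint("projects", project_id, "documents") builds the URL for the "List documents" and "Upload documents" operations.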

Projects

| Operation | Path | Method |
| --- | --- | --- |
| List projects | /projects/ | GET |
| Get a project | /projects/{id}/ | GET |
| Get output schema | /projects/{project_id}/schema/ | GET |
| Get project processing status | /projects/{project_id}/status/ | GET |
| Create a project | /projects/ | POST |
| Partially update a project | /projects/{id}/ | PATCH |
| Update a project | /projects/{id}/ | PUT |
| Set output schema | /projects/{project_id}/schema/ | PUT |
| Delete a project | /projects/{id}/ | DELETE |

Documents

| Operation | Path | Method |
| --- | --- | --- |
| List documents | /projects/{project_id}/documents/ | GET |
| Get a document | /projects/{project_id}/documents/{document_id}/ | GET |
| Get a document page | /projects/{project_id}/documents/{document_id}/pages/{page_number}/ | GET |
| Get a page image | /projects/{project_id}/documents/{document_id}/pages/{page_number}/image/ | GET |
| Upload documents | /projects/{project_id}/documents/ | POST |
| Delete a document | /projects/{project_id}/documents/{document_id}/ | DELETE |

Agents

| Operation | Path | Method |
| --- | --- | --- |
| List agents | /projects/{project_id}/agents/ | GET |
| Get agent processing status | /projects/{project_id}/agents/status/ | GET |
| Get an agent | /projects/{project_id}/agents/{agent_id}/ | GET |
| Create an agent | /projects/{project_id}/agents/ | POST |
| Launch the agent pipeline | /projects/{project_id}/agents/launch/ | POST |
| Partially update an agent | /projects/{project_id}/agents/{agent_id}/ | PATCH |
| Update an agent | /projects/{project_id}/agents/{agent_id}/ | PUT |
| Delete an agent | /projects/{project_id}/agents/{agent_id}/ | DELETE |

Results

| Operation | Path | Method |
| --- | --- | --- |
| List extraction results | /projects/{project_id}/results/ | GET |
| Get results in schema format | /projects/{project_id}/results/schema/ | GET |
| Get a single extraction result | /projects/{project_id}/results/{resolution_result_id}/ | GET |
| List extractions for a resolution result | /projects/{project_id}/results/{resolution_result_id}/extractions/ | GET |

Dimensions

| Operation | Path | Method |
| --- | --- | --- |
| List dimensions | /projects/{project_id}/dimensions/ | GET |
| Get a dimension | /projects/{project_id}/dimensions/{dimension_id}/ | GET |
| Create a dimension | /projects/{project_id}/dimensions/ | POST |
| Partially update a dimension | /projects/{project_id}/dimensions/{dimension_id}/ | PATCH |
| Delete a dimension | /projects/{project_id}/dimensions/{dimension_id}/ | DELETE |

Extract

| Operation | Path | Method |
| --- | --- | --- |
| Extract structured data from documents | /extract/ | POST |

Extractions

| Operation | Path | Method |
| --- | --- | --- |
| Get bounding boxes for an extraction | /projects/{project_id}/extractions/{extraction_id}/bounding-boxes/ | GET |

File Edit

| Operation | Path | Method |
| --- | --- | --- |
| Download the edited file | /projects/{project_id}/file-edit/download/ | GET |
| Get file edit job status | /projects/{project_id}/file-edit/status/ | GET |
| Create a file edit job | /projects/{project_id}/file-edit/ | POST |
| Delete the file edit job | /projects/{project_id}/file-edit/status/ | DELETE |

Webhooks

| Operation | Path | Method |
| --- | --- | --- |
| List webhook subscriptions | /webhooks/ | GET |
| List available webhook event types | /webhooks/events/ | GET |
| Retrieve a webhook subscription | /webhooks/{id}/ | GET |
| Create a webhook subscription | /webhooks/ | POST |
| Send a synthetic test event | /webhooks/{id}/test/ | POST |
| Update a webhook subscription | /webhooks/{id}/ | PATCH |
| Delete a webhook subscription | /webhooks/{id}/ | DELETE |

Schema-driven extract (convenience endpoint)

If you already know the output shape you want, POST /extract/ collapses the entire create → upload → configure → launch flow into a single multipart request. You supply the files and a JSON Schema; Parsewise creates a project, auto-generates agents from the schema, and runs the full pipeline in the background.

When to use it: you have a target JSON Schema and want results shaped to it without hand-tuning individual agents first. Use the step-by-step flow below when you need per-agent control.

Request

curl -X POST \
  -H "X-API-Key: $PARSEWISE_API_KEY" \
  -F 'files=@report.pdf' \
  -F 'files=@accounts.xlsx' \
  -F 'schema={"type":"object","properties":{"revenue":{"type":"number"},"ceo":{"type":"string"}}}' \
  -F 'project_name=API Test Project' \
  "https://api.parsewise.ai/api/v1/extract/"
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| files | file(s) | yes | One or more document files (repeat the field for multiple). |
| schema | JSON string | yes | A valid JSON Schema (Draft 2020-12) describing the desired output. |
| project_name | string | no | Name for the auto-created project. Defaults to "API Extraction". |

Response (202 Accepted)

{
  "project_id": "<uuid>",
  "status_url": "/api/v1/projects/<uuid>/status/",
  "results_url": "/api/v1/projects/<uuid>/results/schema/"
}
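
The status_url and results_url values in this response are paths rather than full URLs. A minimal sketch for turning them into absolute URLs, assuming they are served from the same host as the base URL (the absolute_urls helper is a hypothetical name):

```python
from urllib.parse import urljoin

BASE_URL = "https://api.parsewise.ai/api/v1"


def absolute_urls(accepted):
    """Resolve the relative URLs from the 202 response against the API host."""
    # urljoin replaces the whole path when the second argument starts
    # with "/", so the /api/v1 prefix is not duplicated.
    return {
        "status": urljoin(BASE_URL, accepted["status_url"]),
        "results": urljoin(BASE_URL, accepted["results_url"]),
    }
```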

Poll and read results

Poll GET /projects/{project_id}/status/ until pipeline_running is false and schema_status is "success", then fetch:

GET /projects/{project_id}/results/schema/

The response body is a JSON object shaped to the schema you submitted, with values populated from the documents.

Enriched results

Append ?enrich=true to get per-field metadata alongside each value. For every scalar leaf in the output, two sibling keys are added:

| Sibling key | Description |
| --- | --- |
| <field>_consistency | Resolution status for the field: one of Resolved, Requires attention, Not resolved, No result, or Ignored. |
| <field>_parsewise_url | Deep link into the Parsewise UI for the underlying resolution result. |
GET /projects/{project_id}/results/schema/?enrich=true

Example (plain vs enriched):

// Plain (?enrich omitted or false)
{ "revenue": 42000000, "ceo": "Jane Doe" }

// Enriched (?enrich=true)
{
  "revenue": 42000000,
  "revenue_consistency": "Resolved",
  "revenue_parsewise_url": "https://app.parsewise.ai/projects/<project-uuid>/agents/<agent-uuid>/<result-uuid>",
  "ceo": "Jane Doe",
  "ceo_consistency": "Resolved",
  "ceo_parsewise_url": "https://app.parsewise.ai/projects/<project-uuid>/agents/<agent-uuid>/<result-uuid>"
}
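
If you want the plain values and the enrichment metadata separately, the sibling keys can be peeled off on the client. The sketch below handles one flat object only (nested objects and arrays would need recursion) and assumes no real schema field is itself named <other_field>_consistency or <other_field>_parsewise_url. The split_enriched name is hypothetical.

```python
SUFFIXES = ("_consistency", "_parsewise_url")


def split_enriched(obj):
    """Split one flat enriched object into (values, metadata-by-field)."""
    values, meta = {}, {}
    for key, val in obj.items():
        for suffix in SUFFIXES:
            base = key[: -len(suffix)]
            # Only treat the key as metadata when its base field exists.
            if key.endswith(suffix) and base in obj:
                meta.setdefault(base, {})[suffix[1:]] = val
                break
        else:
            values[key] = val
    return values, meta
```

Applied to the enriched example above, values holds only revenue and ceo, while meta["revenue"] holds the consistency and parsewise_url entries for that field.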

Minimal Python example

import json
import os
import time

import requests

BASE = "https://api.parsewise.ai/api/v1"
H = {"X-API-Key": os.environ["PARSEWISE_API_KEY"]}

schema = {
    "type": "object",
    "properties": {
        "revenue": {"type": "number"},
        "ceo": {"type": "string"},
    },
}

# Open the file in a context manager so the handle is closed after upload.
with open("report.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE}/extract/",
        headers=H,
        files=[("files", f)],
        data={"schema": json.dumps(schema), "project_name": "API Test Project"},
    )
resp.raise_for_status()
project_id = resp.json()["project_id"]

# Poll until the pipeline has stopped and the schema results are ready.
while True:
    resp = requests.get(f"{BASE}/projects/{project_id}/status/", headers=H)
    resp.raise_for_status()
    status = resp.json()
    if not status["pipeline_running"] and status.get("schema_status") == "success":
        break
    time.sleep(30)

resp = requests.get(f"{BASE}/projects/{project_id}/results/schema/", headers=H)
resp.raise_for_status()
results = resp.json()
print(json.dumps(results, indent=2))

Step-by-step full example

Shared setup

import os
import time
import requests

API_KEY = os.environ["PARSEWISE_API_KEY"]
BASE_URL = "https://api.parsewise.ai/api/v1"
HEADERS = {"X-API-Key": API_KEY}

1. Create a project

resp = requests.post(
    f"{BASE_URL}/projects/",
    headers=HEADERS,
    json={"name": "Q4 leases", "description": "Lease extraction"},
)
resp.raise_for_status()
project_id = resp.json()["id"]

2. Upload a document

with open("lease.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/projects/{project_id}/documents/",
        headers=HEADERS,
        files={"file": f},
    )
resp.raise_for_status()

3. Create an agent

resp = requests.post(
    f"{BASE_URL}/projects/{project_id}/agents/",
    headers=HEADERS,
    json={
        "name": "Annual rent (USD)",
        "extraction_instructions": "Extract the annual rent in USD as a number.",
        "value_type": "number",
        "unit": "USD",
    },
)
resp.raise_for_status()
agent_id = resp.json()["id"]

4. Launch extraction

resp = requests.post(
    f"{BASE_URL}/projects/{project_id}/agents/launch/",
    headers=HEADERS,
)
resp.raise_for_status()  # returns 202 Accepted with no body

5. Poll for progress

Poll with exponential backoff (e.g. 2s → 4s → 8s, capped at ~30s). Stop only when both:

  • pipeline_running is false, and
  • every agent’s extraction_status is Processed.
def wait_for_run(project_id: str, max_wait_seconds: int = 1800) -> dict:
    delay = 2
    deadline = time.time() + max_wait_seconds
    while time.time() < deadline:
        resp = requests.get(
            f"{BASE_URL}/projects/{project_id}/agents/status/",
            headers=HEADERS,
        )
        resp.raise_for_status()
        status = resp.json()
        running = status.get("pipeline_running", False)
        all_processed = all(
            a.get("extraction_status") == "Processed"
            for a in status.get("agents", [])
        )
        if not running and all_processed:
            return status
        time.sleep(delay)
        delay = min(delay * 2, 30)
    raise TimeoutError("Run did not finish in time")

wait_for_run(project_id)

See the FAQ for why pipeline_running=false alone is not enough.

6. Read results

List rows (paginated):

resp = requests.get(
    f"{BASE_URL}/projects/{project_id}/results/",
    headers=HEADERS,
)
resp.raise_for_status()
rows = resp.json()["results"]
for row in rows:
    print(row["agent_name"], "→", row["resolution_result"]["value"])

Fetch full detail for a single row (includes document-level citations):

resolution_result_id = rows[0]["resolution_result"]["id"]
resp = requests.get(
    f"{BASE_URL}/projects/{project_id}/results/{resolution_result_id}/",
    headers=HEADERS,
)
resp.raise_for_status()
detail = resp.json()
for source in detail.get("sources", []):
    print(source["document_name"], "p.", source["page_number"])

Webhooks

Instead of polling agents/status/ to find out when an extraction run finishes, you can register a webhook and let Parsewise call you. We POST a JSON event to your URL when an asynchronous pipeline reaches a terminal state.

Manage webhooks (create, test, delete) on the Developer page or via /api/v1/webhooks/ — the OpenAPI schema linked above is the source of truth for request/response shapes.

Event types

The current event registry is exposed at GET /api/v1/webhooks/events/. As of this writing it contains:

| Event | When it fires |
| --- | --- |
| agent.completion.succeeded | The agent extraction pipeline for a project finished successfully. |
| agent.completion.failed | The agent extraction pipeline for a project failed. |
| webhook.ping | A synthetic event you can fire from POST /api/v1/webhooks/<id>/test/. Useful for verifying connectivity without launching a real pipeline. |

Envelope

Every webhook body is a JSON envelope of the same shape — only the data field varies per event:

{
  "id": "f7d6...",          // unique delivery id (also Parsewise-Delivery-Id header)
  "event": "agent.completion.succeeded",
  "occurred_at": "2026-05-01T17:23:11.123456+00:00",
  "subscription_id": "b1...",
  "data": { /* event-specific, see below */ }
}

Per-event data payloads

agent.completion.succeeded:

{
  "project_id": "5fa1...",
  "run_number": 4,
  "agents": [
    { "id": "8c2e...", "name": "Annual rent (USD)", "extraction_status": "Processed" },
    { "id": "9d11...", "name": "Term length (months)", "extraction_status": "No Result" }
  ]
}

Each entry in agents reflects that agent’s terminal extraction_status for the run (the same values as on GET /projects/{project_id}/agents/status/, typically Processed or No Result).

agent.completion.failed:

{
  "project_id": "5fa1...",
  "run_number": 4,
  "failure_reason": "internal_error"
}

failure_reason is currently always internal_error (an unexpected pipeline failure). The schema reserves additional values (validation_failed, extraction_failed, expired) for future use — handle unknown values gracefully.
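
One way to handle unknown failure_reason values gracefully is to look them up with a fallback instead of branching on an exhaustive list. This is a sketch only: the describe_failure helper and its message wording are assumptions, and the reserved values carry no documented meaning yet.

```python
KNOWN_FAILURE_REASONS = {
    "internal_error": "unexpected pipeline failure",
    # Reserved by the schema for future use; meanings not documented yet.
    "validation_failed": "reserved value",
    "extraction_failed": "reserved value",
    "expired": "reserved value",
}


def describe_failure(data):
    """Summarise an agent.completion.failed payload without assuming the reason."""
    reason = data.get("failure_reason", "unknown")
    detail = KNOWN_FAILURE_REASONS.get(reason, "unrecognised reason")
    return f"project {data['project_id']} run {data['run_number']} failed: {reason} ({detail})"
```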

webhook.ping (synthetic test event):

{ "message": "This is a test delivery from Parsewise." }

Request headers

Every delivery includes:

  • Content-Type: application/json
  • Parsewise-Delivery-Id: <uuid> — unique per delivery; use it for idempotency.
  • Parsewise-Event: <event_name> — convenience header; the same value also appears in the body.

Plus any custom headers you registered on the subscription.

Retries

Failed deliveries are retried with exponential backoff and jitter. We treat 5xx, 408, 429, connection errors, and read timeouts as retryable. 2xx is success; other 4xx responses stop retrying. After the final attempt, the delivery row is marked failed and shows up in your delivery history.
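
When deciding which status code your receiver should return, it can help to write the rules above down as a function. The sketch below only restates the paragraph; delivery_outcome is a hypothetical name, and 1xx/3xx responses are marked unspecified because the text does not cover them.

```python
def delivery_outcome(status_code):
    """Classify a receiver's HTTP status per the retry rules described above."""
    if 200 <= status_code < 300:
        return "success"
    if status_code in (408, 429) or 500 <= status_code < 600:
        return "retry"
    if 400 <= status_code < 500:
        return "stop"
    return "unspecified"  # not covered by the documented rules
```

In practice this suggests answering 5xx for transient problems you want retried, and a non-retryable 4xx for requests you never want to see again.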

Authenticating webhooks

Configure an Authorization header (or any custom request header) on the subscription. Parsewise stores the value encrypted at rest and sends it on every delivery, so you can verify it on receipt the way you would any inbound request. Retryable failures (see Retries) mean the same delivery can land more than once, so make your handler idempotent on Parsewise-Delivery-Id.
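
A framework-agnostic sketch of a receiver that checks the shared Authorization value and deduplicates on Parsewise-Delivery-Id. Everything here is an assumption for illustration: the handle_delivery name, the in-memory seen_ids set (a real service would use durable storage), the status codes returned, and the expectation that your web framework hands over header names in the casing shown.

```python
import hmac
import json


def handle_delivery(headers, body, expected_auth, seen_ids, on_event):
    """Return the HTTP status code a receiver might answer with."""
    supplied = headers.get("Authorization", "")
    # Constant-time comparison avoids leaking the secret through timing.
    if not hmac.compare_digest(supplied.encode(), expected_auth.encode()):
        return 401
    delivery_id = headers.get("Parsewise-Delivery-Id")
    if not delivery_id:
        return 400
    if delivery_id in seen_ids:
        return 200  # duplicate delivery: acknowledge, do no work
    try:
        envelope = json.loads(body)
    except ValueError:
        return 400
    try:
        on_event(envelope)
    except Exception:
        return 500  # retryable, so the delivery can be attempted again
    seen_ids.add(delivery_id)  # record only after successful processing
    return 200
```

Recording the id only after on_event succeeds means a crash mid-processing leads to a retry rather than a silently dropped event.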

Testing locally

POST /api/v1/webhooks/<id>/test/ synchronously delivers one webhook.ping event and returns the HTTP status, duration, and any error from the destination. No retries — the response reflects the single attempt. Use it to verify your endpoint is reachable and that TLS is set up correctly before any real event fires.


Iterating

Common follow-ups:

  • New documents arrive → upload them, then re-launch.
  • An agent is wrong → PATCH the agent, then re-launch.
  • A new column is needed → create another agent, then re-launch.

Updating an agent is a single PATCH:

resp = requests.patch(
    f"{BASE_URL}/projects/{project_id}/agents/{agent_id}/",
    headers=HEADERS,
    json={"extraction_instructions": "new task text"},
)
resp.raise_for_status()

The body is PatchedV1AgentRequest — every field is optional. Follow the update with a launch to recompute.

Launch cost model

Launches are incremental, not a full recompute:

  • Parsing runs only on documents still in Pending.
  • Extraction is keyed per agent/document. Already-resolved pairs are skipped; only new documents and invalidated agents do work.
  • Agent edits invalidate that agent’s data. Changing any of extraction_instructions, value_type, examples, unit, resolution_instructions, inconsistency_instructions, enable_complex_calculations_in_resolution, or enable_web_search clears the agent’s extractions on save, so the next launch re-runs that agent across every document. Other agents are untouched.
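
Because edits to the fields listed above trigger a full re-run for that agent, it may be worth checking a PATCH body before sending it. A minimal sketch, with invalidating_fields as a hypothetical helper; it only looks at which keys are present and does not compare against the agent's current values.

```python
INVALIDATING_FIELDS = frozenset({
    "extraction_instructions",
    "value_type",
    "examples",
    "unit",
    "resolution_instructions",
    "inconsistency_instructions",
    "enable_complex_calculations_in_resolution",
    "enable_web_search",
})


def invalidating_fields(patch_body):
    """Return the keys in a PATCH body that would clear the agent's extractions."""
    return sorted(INVALIDATING_FIELDS.intersection(patch_body))
```

An empty list suggests the edit (for example, a rename) should not force that agent to re-run across every document.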

Reading results

Two endpoints return results (see the schema for the full field set):

  • GET /projects/{project_id}/results/ — paginated list. Each row has agent_name, value_type, extraction_status, resolution_result (resolved value + metadata), and dimension_instances.
  • GET /projects/{project_id}/results/{resolution_result_id}/ — full detail for one row, including document-level citations in sources[].

When consuming responses:

  • Parse resolution_result.value by the agent’s value_type (string or number today — see the FAQ).
  • For high-confidence values only, gate on resolution_result.resolution_status == "Resolved".
  • Use resolution_result.references for inline citations ([document_name, page_number]; page_number is None for web sources). The detail endpoint’s sources[] gives document-level citations.
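
The three points above can be sketched as a single pass over the rows from the list endpoint. The row shape follows the field names given in this section; that each reference is a [document_name, page_number] pair, and that numeric values may arrive either as numbers or as numeric strings, are assumptions. high_confidence_values is a hypothetical helper.

```python
def high_confidence_values(rows):
    """Keep Resolved rows, parse values by value_type, and format citations."""
    out = []
    for row in rows:
        result = row.get("resolution_result") or {}
        if result.get("resolution_status") != "Resolved":
            continue
        value = result.get("value")
        if row.get("value_type") == "number" and value is not None:
            value = float(value)  # accepts both 120000 and "120000"
        citations = [
            name if page is None else f"{name} p.{page}"
            for name, page in result.get("references", [])
        ]
        out.append((row["agent_name"], value, citations))
    return out
```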

Limits & gotchas

  • Launch is project-scoped. There is no “only these documents” or “only this agent” option. Prove an agent on a small test project before pointing it at thousands of production documents.
  • Don’t stack launches. A second launch issued while a run is in progress is queued silently behind it — wait for the current one to finish.
  • per_document_mode / per_tag_mode are set at project creation. They can’t be toggled on an existing project via PATCH. See the FAQ below.
  • Dimension templates aren’t yet creatable via the API. See the FAQ.

FAQ

What do per_document_mode and per_tag_mode on a project actually do?

They’re mutually exclusive “row shape” modes for the project. With per_document_mode on, a system-managed Document dimension is attached to every agent, producing one result row per uploaded document. per_tag_mode does the same but keyed by document tags. Both default to off. Leave them off unless you need that row shape — custom dimensions on individual agents are usually more flexible.
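
Since the two modes are mutually exclusive and fixed at creation, a client can guard against sending both. The sketch below builds a body for POST /projects/; it assumes both modes are boolean fields on that request body, and project_payload is a hypothetical helper.

```python
def project_payload(name, description="", per_document_mode=False, per_tag_mode=False):
    """Build a create-project body, rejecting both row-shape modes at once."""
    if per_document_mode and per_tag_mode:
        raise ValueError("per_document_mode and per_tag_mode are mutually exclusive")
    payload = {"name": name, "description": description}
    # Both modes default to off, so only send a flag that is switched on.
    if per_document_mode:
        payload["per_document_mode"] = True
    if per_tag_mode:
        payload["per_tag_mode"] = True
    return payload
```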

Can I turn per_document_mode on or off on an existing project via the API?

No — set the mode at project creation. PATCH /projects/{id}/ doesn’t run the required cleanup (wiping stale results, attaching or removing the system Document dimension across every agent), so flipping it that way leaves the project inconsistent.

To change the mode on a project that already has data, recreate the project and re-upload, or toggle it from the Agents page in the UI (which calls an internal endpoint that performs the cleanup).

Can I create or list dimension templates via the API?

No. The v1 agent payload accepts dimension_template_id, but v1 has no endpoint to create or list templates.

  • For row-per-document or row-per-tag output, use per_document_mode or per_tag_mode at project creation instead.
  • For true custom dimensions (per clause, per party, per region), the workflow isn’t available in v1 yet. Unknown dimension_template_id values are silently skipped. Contact support@parsewise.ai if you’re blocked.

The project status endpoint reports parsing_state: stuck. What does that mean?

At least one document has had a parsing run in flight for over an hour. A background monitor auto-retries parsing once around 15 minutes in; stuck means that retry didn’t clear it. There’s no v1 endpoint to force another retry — delete and re-upload the affected documents, or contact support.

How do I tell, strictly, that a run finished successfully?

Check two signals together: pipeline_running=false on agents/status/ and every agent reporting extraction_status=Processed. pipeline_running also flips to false on validation failures and cancelled runs, so false on its own doesn’t prove success. If any agent is still Pending while pipeline_running=false, the run failed, so don’t treat the results as final.
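
The check described above can be sketched as a small classifier over the agents/status/ response. The "finished_with_gaps" and "unknown" labels are assumptions added for cases this answer does not spell out (agents ending in No Result, or an empty agent list); classify_run is a hypothetical helper.

```python
def classify_run(status):
    """Classify an agents/status/ response using both signals together."""
    if status.get("pipeline_running", False):
        return "running"
    statuses = [a.get("extraction_status") for a in status.get("agents", [])]
    if not statuses:
        return "unknown"  # nothing to judge by
    if any(s == "Pending" for s in statuses):
        return "failed"  # stopped while an agent never ran
    if all(s == "Processed" for s in statuses):
        return "succeeded"
    return "finished_with_gaps"  # e.g. some agents ended in No Result
```

A polling loop can stop on any result other than "running", which avoids waiting for a timeout when a run ends without every agent reaching Processed.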

What do the resolution_status values on a result mean?

  • Resolved — sources agreed (or disagreements were reconciled) and the resolver produced a single canonical value. Safe to consume.
  • Requires attention — inconsistencies couldn’t be auto-resolved. A value is still set, but exclude these rows when you need high-confidence values only.
  • Not resolved — the resolver hasn’t run yet. Usually means the pipeline is still in progress; poll again.
  • No result — extraction produced no usable candidates (value absent, low confidence, or no matching dimension instances).
  • Ignored — a user or resolution rule excluded every candidate. Treat as intentional.

For high-confidence pipelines, gate on resolution_status == "Resolved".

What do the extraction_status values on an agent or result mean?

  • Pending — not yet run, or in flight.
  • Processed — finished successfully.
  • No Result — finished but produced no candidates. On an agent, this usually means the extraction task doesn’t match the documents — iterate and re-launch.

What do the four pipeline stages do?

  • parsing — extract text and layout from uploaded documents, page by page.
  • extraction — each agent runs its instructions against the relevant pages to find candidate values.
  • web_search — supplements candidates with public web data, only for agents with web search enabled.
  • resolution — consolidates per-page, per-source candidates into one final value per result cell, flagging inconsistencies.

What value_type values are supported end-to-end today?

Two, for new agents:

  • string (default) — free text. Use for dates, booleans, codes, names, clauses — anything non-numeric. Specify the format in the extraction task (e.g. “ISO 8601 date YYYY-MM-DD”).
  • number — numeric values you’ll aggregate or compare. Pair with unit (e.g. USD, %, days).

The schema also lists bool, date, and datetime for backwards compatibility, but they aren’t fully wired end-to-end — stick to string or number.
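
To stay on the two supported types, a client can validate value_type before creating an agent. A minimal sketch using the field names from the create-agent example earlier on this page; agent_payload is a hypothetical helper, and rejecting the legacy types on the client side is a design choice, not API behaviour.

```python
SUPPORTED_VALUE_TYPES = ("string", "number")


def agent_payload(name, extraction_instructions, value_type="string", unit=None):
    """Build a create-agent body restricted to the fully supported value types."""
    if value_type not in SUPPORTED_VALUE_TYPES:
        raise ValueError(
            f"value_type {value_type!r} is not fully supported; use 'string' or 'number'"
        )
    payload = {
        "name": name,
        "extraction_instructions": extraction_instructions,
        "value_type": value_type,
    }
    if unit is not None:
        payload["unit"] = unit
    return payload
```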


Support

