API Reference

REST API for the same workflows as the app: create projects, upload documents, define agents, launch extraction, and read results with source citations.

Examples use Python 3.9+ with requests, but the API is plain JSON over HTTPS — use any client you like.

Base URL

https://api.parsewise.ai/api/v1

OpenAPI schema

The schema is the source of truth for request/response shapes, field names, and enum values. It’s public and requires no API key.


Authentication

Every request must include your API key in the X-API-Key header.

import os
import requests

resp = requests.get(
    "https://api.parsewise.ai/api/v1/projects/",
    headers={"X-API-Key": os.environ["PARSEWISE_API_KEY"]},
)
resp.raise_for_status()
print(resp.json())

  • Keys are prefixed pw_live_ and scoped to one organisation.
  • Manage keys (create, rotate, revoke) on the Developer page.
  • Treat keys as secrets — never commit them or ship them to clients.
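
Rather than passing the header on every call, a `requests.Session` can carry it for you. A minimal sketch:

```python
import requests

def make_session(api_key: str) -> requests.Session:
    """Build a session that sends the X-API-Key header on every request."""
    session = requests.Session()
    session.headers.update({"X-API-Key": api_key})
    return session

# Usage:
# session = make_session(os.environ["PARSEWISE_API_KEY"])
# resp = session.get("https://api.parsewise.ai/api/v1/projects/")
```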

Endpoints

All paths are relative to the base URL above. Tables are generated from the OpenAPI schema — see it for full request and response shapes.

Projects

Method Path Description
GET /projects/ List projects
GET /projects/{id}/ Get a project
GET /projects/{project_id}/status/ Get project processing status
POST /projects/ Create a project
PATCH /projects/{id}/ Partially update a project
PUT /projects/{id}/ Update a project
DELETE /projects/{id}/ Delete a project

Documents

Method Path Description
GET /projects/{project_id}/documents/ List documents
GET /projects/{project_id}/documents/{document_id}/ Get a document
GET /projects/{project_id}/documents/{document_id}/pages/{page_number}/ Get a document page
GET /projects/{project_id}/documents/{document_id}/pages/{page_number}/image/ Get a page image
POST /projects/{project_id}/documents/ Upload documents
DELETE /projects/{project_id}/documents/{document_id}/ Delete a document
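
A sketch of the page-image endpoint above, where `session` is a `requests.Session` carrying the X-API-Key header. It assumes the response body is raw image bytes; check the schema for the actual content type.

```python
BASE_URL = "https://api.parsewise.ai/api/v1"

def page_image_path(project_id: str, document_id: str, page_number: int) -> str:
    # Relative path, built from the Documents table above.
    return f"/projects/{project_id}/documents/{document_id}/pages/{page_number}/image/"

def save_page_image(session, project_id, document_id, page_number, out_path):
    resp = session.get(BASE_URL + page_image_path(project_id, document_id, page_number))
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)  # assumes raw image bytes in the body
```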

Agents

Method Path Description
GET /projects/{project_id}/agents/ List agents
GET /projects/{project_id}/agents/status/ Get agent processing status
GET /projects/{project_id}/agents/{agent_id}/ Get an agent
POST /projects/{project_id}/agents/ Create an agent
POST /projects/{project_id}/agents/launch/ Launch the agent pipeline
PATCH /projects/{project_id}/agents/{agent_id}/ Partially update an agent
PUT /projects/{project_id}/agents/{agent_id}/ Update an agent
DELETE /projects/{project_id}/agents/{agent_id}/ Delete an agent

Results

Method Path Description
GET /projects/{project_id}/results/ List extraction results
GET /projects/{project_id}/results/{resolution_result_id}/ Get a single extraction result
GET /projects/{project_id}/results/{resolution_result_id}/extractions/ List extractions for a resolution result

Extractions

Method Path Description
GET /projects/{project_id}/extractions/{extraction_id}/bounding-boxes/ Get bounding boxes for an extraction
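
Getting bounding boxes requires an extraction_id, which comes from the extractions listing under Results. A sketch of that chain, where `session` is a `requests.Session` sending the X-API-Key header; it assumes the extractions listing returns a JSON list of objects with an `id` field, so verify the exact shape against the schema.

```python
BASE_URL = "https://api.parsewise.ai/api/v1"

def fetch_bounding_boxes(session, project_id, resolution_result_id):
    """Map each extraction id in a result row to its bounding boxes."""
    resp = session.get(
        f"{BASE_URL}/projects/{project_id}/results/{resolution_result_id}/extractions/"
    )
    resp.raise_for_status()
    boxes = {}
    for extraction in resp.json():
        ex_id = extraction["id"]
        box_resp = session.get(
            f"{BASE_URL}/projects/{project_id}/extractions/{ex_id}/bounding-boxes/"
        )
        box_resp.raise_for_status()
        boxes[ex_id] = box_resp.json()
    return boxes
```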

Full example

Shared setup

import os
import time
import requests

API_KEY = os.environ["PARSEWISE_API_KEY"]
BASE_URL = "https://api.parsewise.ai/api/v1"
HEADERS = {"X-API-Key": API_KEY}

1. Create a project

resp = requests.post(
    f"{BASE_URL}/projects/",
    headers=HEADERS,
    json={"name": "Q4 leases", "description": "Lease extraction"},
)
resp.raise_for_status()
project_id = resp.json()["id"]

2. Upload a document

with open("lease.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/projects/{project_id}/documents/",
        headers=HEADERS,
        files={"file": f},
    )
resp.raise_for_status()
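
For several files, the same endpoint can be called in a loop. This sketch sends one file per request (whether the endpoint accepts multiple files in a single request is not stated here; see the schema); `session` is a `requests.Session` carrying the X-API-Key header.

```python
def upload_documents(session, base_url, project_id, paths):
    """Upload each file in its own request; return the response bodies."""
    uploaded = []
    for path in paths:
        with open(path, "rb") as f:
            resp = session.post(
                f"{base_url}/projects/{project_id}/documents/",
                files={"file": f},
            )
        resp.raise_for_status()
        uploaded.append(resp.json())
    return uploaded
```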

3. Create an agent

resp = requests.post(
    f"{BASE_URL}/projects/{project_id}/agents/",
    headers=HEADERS,
    json={
        "name": "Annual rent (USD)",
        "extraction_instructions": "Extract the annual rent in USD as a number.",
        "value_type": "number",
        "unit": "USD",
    },
)
resp.raise_for_status()
agent_id = resp.json()["id"]

4. Launch extraction

resp = requests.post(
    f"{BASE_URL}/projects/{project_id}/agents/launch/",
    headers=HEADERS,
)
resp.raise_for_status()  # returns 202 Accepted with no body

5. Poll for progress

Poll with exponential backoff (e.g. 2s → 4s → 8s, capped at ~30s). Stop only when both:

  • pipeline_running is false, and
  • every agent’s extraction_status is Processed.

def wait_for_run(project_id: str, max_wait_seconds: int = 1800) -> dict:
    delay = 2
    deadline = time.time() + max_wait_seconds
    while time.time() < deadline:
        resp = requests.get(
            f"{BASE_URL}/projects/{project_id}/agents/status/",
            headers=HEADERS,
        )
        resp.raise_for_status()
        status = resp.json()
        running = status.get("pipeline_running", False)
        all_processed = all(
            a.get("extraction_status") == "Processed"
            for a in status.get("agents", [])
        )
        if not running and all_processed:
            return status
        time.sleep(delay)
        delay = min(delay * 2, 30)
    raise TimeoutError("Run did not finish in time")

wait_for_run(project_id)

See the FAQ for why pipeline_running=false alone is not enough.

6. Read results

List rows (paginated):

resp = requests.get(
    f"{BASE_URL}/projects/{project_id}/results/",
    headers=HEADERS,
)
resp.raise_for_status()
rows = resp.json()["results"]
for row in rows:
    print(row["agent_name"], "→", row["resolution_result"]["value"])
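
Since the list is paginated, a helper can walk every page. This sketch assumes a DRF-style envelope where each page carries `results` and a `next` URL (null on the last page); confirm the envelope in the schema. `session` is a `requests.Session` with the X-API-Key header.

```python
def iter_results(session, first_url):
    """Yield every result row, following pagination links."""
    url = first_url
    while url:
        resp = session.get(url)
        resp.raise_for_status()
        page = resp.json()
        yield from page["results"]
        url = page.get("next")  # assumed: absolute URL of the next page, or None

# Usage:
# for row in iter_results(session, f"{BASE_URL}/projects/{project_id}/results/"):
#     print(row["agent_name"])
```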

Fetch full detail for a single row (includes document-level citations):

resolution_result_id = rows[0]["resolution_result"]["id"]
resp = requests.get(
    f"{BASE_URL}/projects/{project_id}/results/{resolution_result_id}/",
    headers=HEADERS,
)
resp.raise_for_status()
detail = resp.json()
for source in detail.get("sources", []):
    print(source["document_name"], "p.", source["page_number"])

Iterating

Common follow-ups:

  • New documents arrive → upload them, then re-launch.
  • An agent is wrong → PATCH the agent, then re-launch.
  • A new column is needed → create another agent, then re-launch.

Updating an agent is a single PATCH:

resp = requests.patch(
    f"{BASE_URL}/projects/{project_id}/agents/{agent_id}/",
    headers=HEADERS,
    json={"extraction_instructions": "new task text"},
)
resp.raise_for_status()

The body is PatchedV1AgentRequest — every field is optional. Follow the update with a launch to recompute.

Launch cost model

Launches are incremental, not a full recompute:

  • Parsing runs only on documents still in Pending.
  • Extraction is keyed per agent/document. Already-resolved pairs are skipped; only new documents and invalidated agents do work.
  • Agent edits invalidate that agent’s data. Changing any of extraction_instructions, value_type, examples, unit, resolution_instructions, inconsistency_instructions, enable_complex_calculations_in_resolution, or enable_web_search clears the agent’s extractions on save, so the next launch re-runs that agent across every document. Other agents are untouched.

Reading results

Two endpoints return results (see the schema for the full field set):

  • GET /projects/{project_id}/results/ — paginated list. Each row has agent_name, value_type, extraction_status, resolution_result (resolved value + metadata), and dimension_instances.
  • GET /projects/{project_id}/results/{resolution_result_id}/ — full detail for one row, including document-level citations in sources[].

When consuming responses:

  • Parse resolution_result.value by the agent’s value_type (string or number today — see the FAQ).
  • Gate on resolution_result.resolution_status for high-confidence values only.
  • Use resolution_result.references for inline citations ([document_name, page_number]; page_number is None for web sources). The detail endpoint’s sources[] gives document-level citations.
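
The consumption rules above can be sketched as one pure function; field names are taken from this section, and the number parse is a plain float() for illustration.

```python
def high_confidence_value(row):
    """Return the resolved value when safe to consume, else None."""
    rr = row.get("resolution_result") or {}
    if rr.get("resolution_status") != "Resolved":
        return None  # gate on Resolved for high-confidence pipelines
    value = rr.get("value")
    if value is None:
        return None
    if row.get("value_type") == "number":
        return float(value)  # parse by the agent's value_type
    return str(value)
```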

Limits & gotchas

  • Launch is project-scoped. There is no “only these documents” or “only this agent” option. Prove an agent on a small test project before pointing it at thousands of production documents.
  • Don’t stack launches. A second launch issued while a run is in progress is queued silently behind it — wait for the current one to finish.
  • per_document_mode / per_tag_mode are set at project creation. They can’t be toggled on an existing project via PATCH. See the FAQ below.
  • Dimension templates aren’t yet creatable via the API. See the FAQ.
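
Because a second launch queues silently behind a running one, it can help to check agents/status/ before launching. A minimal sketch, where `session` is a `requests.Session` carrying the X-API-Key header:

```python
BASE_URL = "https://api.parsewise.ai/api/v1"

def launch_if_idle(session, project_id):
    """Launch only when no run is in progress; return whether we launched."""
    status = session.get(f"{BASE_URL}/projects/{project_id}/agents/status/")
    status.raise_for_status()
    if status.json().get("pipeline_running", False):
        return False  # a run is already in progress; don't stack launches
    resp = session.post(f"{BASE_URL}/projects/{project_id}/agents/launch/")
    resp.raise_for_status()
    return True
```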

FAQ

What do per_document_mode and per_tag_mode on a project actually do?

They’re mutually exclusive “row shape” modes for the project. With per_document_mode on, a system-managed Document dimension is attached to every agent, producing one result row per uploaded document. per_tag_mode does the same but keyed by document tags. Both default to off. Leave them off unless you need that row shape — custom dimensions on individual agents are usually more flexible.

Can I turn per_document_mode on or off on an existing project via the API?

No — set the mode at project creation. PATCH /projects/{id}/ doesn’t run the required cleanup (wiping stale results, attaching or removing the system Document dimension across every agent), so flipping it that way leaves the project inconsistent.

To change the mode on a project that already has data, recreate the project and re-upload, or toggle it from the Agents page in the UI (which calls an internal endpoint that performs the cleanup).

Can I create or list dimension templates via the API?

No. The v1 agent payload accepts dimension_template_id, but v1 has no endpoint to create or list templates.

  • For row-per-document or row-per-tag output, use per_document_mode or per_tag_mode at project creation instead.
  • For true custom dimensions (per clause, per party, per region), the workflow isn’t available in v1 yet. Unknown dimension_template_id values are silently skipped. Contact support@parsewise.ai if you’re blocked.

The project status endpoint reports parsing_state: stuck. What does that mean?

At least one document has had a parsing run in flight for over an hour. A background monitor auto-retries parsing once around 15 minutes in; stuck means that retry didn’t clear it. There’s no v1 endpoint to force another retry — delete and re-upload the affected documents, or contact support.

How do I tell, strictly, that a run finished successfully?

Check two signals together: pipeline_running=false on agents/status/ and every agent reporting extraction_status=Processed. pipeline_running also flips to false on validation failures and cancelled runs, so on its own it doesn’t prove success. If any agent is still Pending while pipeline_running=false, the run failed — don’t treat the results as final.
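
As a pure check, assuming the status payload shape used in the polling example above:

```python
def run_finished_ok(status: dict) -> bool:
    """Strict success: pipeline idle AND every agent Processed."""
    if status.get("pipeline_running", False):
        return False
    agents = status.get("agents", [])
    return bool(agents) and all(
        a.get("extraction_status") == "Processed" for a in agents
    )
```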

What do the resolution_status values on a result mean?

  • Resolved — sources agreed (or disagreements were reconciled) and the resolver produced a single canonical value. Safe to consume.
  • Requires attention — inconsistencies couldn’t be auto-resolved. A value is still set, but exclude these rows when you need high confidence.
  • Not resolved — the resolver hasn’t run yet. Usually means the pipeline is still in progress; poll again.
  • No result — extraction produced no usable candidates (value absent, low confidence, or no matching dimension instances).
  • Ignored — a user or resolution rule excluded every candidate. Treat as intentional.

For high-confidence pipelines, gate on resolution_status == "Resolved".

What do the extraction_status values on an agent or result mean?

  • Pending — not yet run, or in flight.
  • Processed — finished successfully.
  • No Result — finished but produced no candidates. On an agent, this usually means the extraction task doesn’t match the documents — iterate and re-launch.

What do the four pipeline stages do?

  • parsing — extract text and layout from uploaded documents, page by page.
  • extraction — each agent runs its instructions against the relevant pages to find candidate values.
  • web_search — supplements candidates with public web data, only for agents with web search enabled.
  • resolution — consolidates per-page, per-source candidates into one final value per result cell, flagging inconsistencies.

What value_type values are supported end-to-end today?

Two, for new agents:

  • string (default) — free text. Use for dates, booleans, codes, names, clauses — anything non-numeric. Specify the format in the extraction task (e.g. “ISO 8601 date YYYY-MM-DD”).
  • number — numeric values you’ll aggregate or compare. Pair with unit (e.g. USD, %, days).

The schema also lists bool, date, and datetime for backwards compatibility, but they aren’t fully wired end-to-end — stick to string or number.


Support