---
title: Document analysis workflow
status: draft
note: AI-generated first-pass transcript pending video production + SME review.
---
Document analysis is one of the highest-value Studio surfaces for a
service team — and one of the easiest to get wrong without provenance
discipline. This lesson walks through the right way to do it.
## What "document analysis" means here
You upload a document — a vendor security questionnaire response, an
SoW from a partner, an audit report — and Studio extracts structured
data: parties, dates, claims, commitments, redlines. The output is data
you can act on: build a ticket, populate a record, escalate.
## The three artifacts every run produces
When you run document analysis, the platform stores three things:
1. **The original document** — versioned, immutable, with the upload
timestamp + the user who uploaded it.
2. **The extracted data** — structured JSON with each field tagged with
which page and which paragraph it came from.
3. **The provenance record** — model used, prompt template, run id,
token counts, latency, cost. Visible in Langfuse.
Without all three, you can't audit. Studio refuses to run analysis if
provenance recording is unavailable.
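The three artifacts can be pictured as one record. The sketch below is illustrative only: the field names, model name, and values are hypothetical, not the actual Studio schema.

```python
import json

# Hypothetical shapes for the three artifacts of one analysis run.
# Every name and value here is illustrative, not the real schema.
run = {
    "document": {
        "id": "doc_123",
        "version": 1,
        "uploaded_at": "2024-05-01T12:00:00Z",
        "uploaded_by": "analyst@example.com",
    },
    "extracted": [
        # Each field carries its source location for click-through audit.
        {"field": "counterparty", "value": "Acme Corp", "page": 1, "paragraph": 2},
        {"field": "effective_date", "value": "2024-06-01", "page": 1, "paragraph": 4},
    ],
    "provenance": {
        "model": "example-model-v1",        # placeholder model name
        "prompt_template": "doc-analysis",  # placeholder template id
        "run_id": "run_789",
        "tokens_in": 18200,
        "tokens_out": 940,
        "latency_ms": 5400,
        "cost_usd": 0.21,
    },
}

# An audit check mirroring the platform's rule: a run is only complete
# if all three artifacts are present.
assert {"document", "extracted", "provenance"} <= run.keys()
print(json.dumps(run["provenance"], indent=2))
```

The per-field `page`/`paragraph` tags are what make the spot-checks described later in this lesson possible.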
## Healthcare guard
If the document contains PHI signatures (US SSN format, MRN-shaped
strings, name+DOB pairs in proximity) and the tenant is **not** flagged
`healthcare_mode` with a BAA in place, the no-PHI gate redacts the
signature region and surfaces a warning. **Redaction is a backstop, not
a workflow.** If you're routinely processing PHI, the operator needs
to enable `healthcare_mode` and sign the BAA _first_.
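The signature checks can be sketched as simple pattern matches. This is a toy approximation: the real gate's detectors, including the name+DOB proximity check, are more involved, and the regexes below are illustrative.

```python
import re

# Toy PHI-signature heuristics; the production gate is more involved.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")          # US SSN format
MRN_RE = re.compile(r"\bMRN[:#\s]*\d{6,10}\b", re.I)   # MRN-shaped string
DOB_RE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")          # date-of-birth shape

def phi_signatures(text: str) -> list[str]:
    """Return which PHI signature classes appear in the text."""
    hits = []
    if SSN_RE.search(text):
        hits.append("ssn")
    if MRN_RE.search(text):
        hits.append("mrn")
    if DOB_RE.search(text):
        hits.append("dob")
    return hits

sample = "Patient John Doe, DOB 03/14/1985, MRN: 00123456"
print(phi_signatures(sample))  # → ['mrn', 'dob']
```

When any class fires and the tenant lacks `healthcare_mode`, the matched region is what gets redacted.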
## Idempotency
Document analysis is expensive. Set an `Idempotency-Key` per analysis
request. If the request fails — network blip, queue restart, anything
— resend with the same key. The platform will return the cached
successful result if one exists, or pick up where it left off. Never
generate a fresh idempotency key on retry; you'll burn budget.
## Failure modes you'll see
- **Document too large.** Studio chunks but has a per-document ceiling.
Above that, split the document yourself before upload.
- **Citations to the wrong page.** This is rare but it happens. Always
spot-check at least one extracted field by clicking through to the
source page.
- **Extraction looks confident but is wrong.** Especially on tables.
Tables with merged cells, multi-row headers, or implicit column
meaning are still hard. If a table extraction is critical, have a
human verify before downstream use.
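For the too-large case, client-side splitting can be very simple. The sketch below assumes a per-document page ceiling; the real ceiling and its units are platform-defined.

```python
def split_pages(pages: list[str], max_pages: int) -> list[list[str]]:
    """Split a list of page texts into uploads of at most max_pages each."""
    return [pages[i:i + max_pages] for i in range(0, len(pages), max_pages)]

doc = [f"page {n}" for n in range(1, 11)]  # a 10-page document
parts = split_pages(doc, max_pages=4)
print([len(p) for p in parts])  # → [4, 4, 2]
```

Splitting on natural section boundaries (rather than raw page counts) usually gives better extractions, since related context stays in one upload.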
## Hands-on
In the sandbox, the **forge-doc-analysis-starter** seed gives you four
sample documents:
- A clean vendor questionnaire (extraction should be confident + correct).
- A scanned PDF with OCR artifacts (you'll see lower confidence).
- A document with a deliberately ambiguous table (verify before trusting).
- A document with embedded PHI (you'll see the gate fire).
Run each one. Observe the difference in the analysis outputs.
Module 3 of 5 · Hands-on sandbox: forge · seed `forge-doc-analysis-starter` · 60 min