Cavaridge Academy
Cavaridge AI for Service Teams
Module 3 of 5

Document analysis workflow

Upload, analyze, and extract structured data from a document with audit-ready provenance.

Video — pending production
Read the transcript below. Once recording is complete, the video will replace this notice.
---
title: Document analysis workflow
status: draft
note: AI-generated first-pass transcript pending video production + SME review.
---

Document analysis is one of the highest-value Studio surfaces for a service team — and one of the easiest to get wrong without provenance discipline. This lesson walks through the right way to do it.

## What "document analysis" means here

You upload a document — a vendor security questionnaire response, an SoW from a partner, an audit report — and Studio extracts structured data: parties, dates, claims, commitments, redlines. The output is data you can act on: build a ticket, populate a record, escalate.

## The three artifacts every run produces

When you run document analysis, the platform stores three things:

1. **The original document** — versioned, immutable, with the upload timestamp and the user who uploaded it.
2. **The extracted data** — structured JSON, with each field tagged with the page and paragraph it came from.
3. **The provenance record** — model used, prompt template, run id, token counts, latency, cost. Visible in Langfuse.

Without all three, you can't audit. Studio refuses to run analysis if provenance recording is unavailable.

## Healthcare guard

If the document contains PHI signatures (US SSN format, MRN-shaped strings, name+DOB pairs in proximity) and the tenant is **not** flagged `healthcare_mode` with a BAA in place, the no-PHI gate redacts the signature region and surfaces a warning.

**Redaction is a backstop, not a workflow.** If you're routinely processing PHI, the operator needs to enable `healthcare_mode` and sign the BAA _first_.

## Idempotency

Document analysis is expensive. Set an `Idempotency-Key` per analysis request. If the request fails — network blip, queue restart, anything — resend with the same key. The platform will return the cached successful result if one exists, or pick up where it left off. Never generate a fresh idempotency key on retry; you'll burn budget.
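The idempotency rule above is easy to get wrong in client code. A minimal sketch of the pattern — the `submit_analysis` callable, its signature, and the response shape are hypothetical stand-ins for whatever client you use; only the key handling reflects the lesson:

```python
import uuid


def run_analysis(document_id, submit_analysis, max_attempts=3):
    """Submit an analysis request, retrying transient failures safely."""
    # Generate the key ONCE per logical analysis request...
    key = str(uuid.uuid4())
    last_error = None
    for _ in range(max_attempts):
        try:
            # ...and reuse the SAME key on every retry, so the platform can
            # return the cached result or resume the interrupted run instead
            # of billing a brand-new analysis.
            return submit_analysis(document_id, headers={"Idempotency-Key": key})
        except ConnectionError as exc:  # network blip, queue restart, etc.
            last_error = exc
    raise last_error
```

The one thing to audit in a code review: the key must be created outside the retry loop. Moving `uuid.uuid4()` inside the loop silently turns every retry into a fresh, full-price run.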
## Failure modes you'll see

- **Document too large.** Studio chunks documents automatically but has a per-document ceiling. Above that, split the document yourself before upload.
- **Citations to the wrong page.** This is rare, but it happens. Always spot-check at least one extracted field by clicking through to the source page.
- **Extraction looks confident but is wrong.** Especially on tables. Tables with merged cells, multi-row headers, or implicit column meaning are still hard. If a table extraction is critical, have a human verify it before downstream use.

## Hands-on

In the sandbox, the **forge-doc-analysis-starter** seed gives you four sample documents:

- A clean vendor questionnaire (extraction should be confident and correct).
- A scanned PDF with OCR artifacts (you'll see lower confidence).
- A document with a deliberately ambiguous table (verify before trusting).
- A document with embedded PHI (you'll see the gate fire).

Run each one. Observe the difference in the analysis outputs.
Hands-on sandbox
forge · seed: forge-doc-analysis-starter · 60 min

Knowledge check

  1. Question 1 · select one
    After analysis, what artifact should accompany the extracted data for audit?
  2. Question 2 · select one
    Is uploading a document with PHI to the Service-Teams workspace allowed?
  3. Question 3 · select all that apply
    A document analysis run failed mid-way. Which of these are appropriate next steps?