title: "Architecture" description: "How AssayPDF's spec layer, generator, harness, scorer, and reports fit together — from spec ingestion through deterministic PDF generation to per-engine accuracy reports." group: "Getting started" order: 4

Architecture

How the pieces fit together.

spec/gwg-2022-spec.xlsx
        │
        │ assay ingest
        ▼
spec/requirement-ids.json
        │
        ▼
src/assay_pdf/generator/
  ├ variants.py        ← 23 variant configs
  ├ rules.py           ← 39 rule generator functions
  ├ base.py            ← PDF/X-4 scaffolding (reportlab + pikepdf)
  ├ injectors.py       ← shared content-stream helpers
  ├ determinism.py     ← byte-identical metadata override
  └ orchestrator.py    ← runs all (rule × variant) combos
        │
        │ assay generate
        ▼
corpus/
  ├ manifest.json      ← committed; SHA-256 + applicability per file
  ├ positive/          ← 23 PDFs, gitignored
  └ negative/          ← 39 PDFs, gitignored
        │
        ▼
src/assay_pdf/harness/
  ├ runners/{codexpdf,lintpdf,pitstop,pdftoolbox}.py   ← CLI runners
  ├ runners/{pitstop_report,callas}.py                 ← ingestion runners
  ├ rule_maps/*.json   ← engine message → R-id
  ├ scorer.py          ← per-(rule, variant) confusion matrix
  └ driver.py          ← benchmark() drives a CLI / ingest_reports() parses reports
        │
        │ assay benchmark --engine X        (CLI runners)
        │ assay ingest-report --engine X    (ingestion runners)
        ▼
results/<engine>-<ts>.{json,score.json}
        │
        ▼
src/assay_pdf/reports/
  ├ generator.py       ← aggregates score JSONs
  └ templates/{markdown.j2,html.j2}
        │
        │ assay report --format md|html
        ▼
REPORT.md / REPORT.html

Layer responsibilities

Spec layer (`src/assay_pdf/spec/`)

Reads spec/gwg-2022-spec.xlsx and produces typed RequirementManifest JSON. Pure data parsing — no PDF construction.

Generator layer (`src/assay_pdf/generator/`)

Builds the PDF corpus deterministically. The flow per rule:

1. base.build_base_pdfx4() constructs a minimal valid PDF/X-4 with the correct output intent.
2. The rule generator injects exactly one violation via pikepdf mutations + content-stream snippets.
3. determinism.stamp_deterministic() overrides /Info dates, /ID array, and producer string with values derived from a fixed seed.
4. The result is added to corpus/manifest.json with its SHA-256 + applicable variants.

Rule generators are registered via @register("Rxxxx") decorators in rules.py. Adding a rule = adding a generator function. The orchestrator picks them up automatically.
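The registration pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in rules.py: the registry name RULE_GENERATORS, the generate_all helper, the bytes-in/bytes-out signature, and the rule ID "R0001" are all assumptions made for the example.

```python
from typing import Callable, Dict

# Hypothetical generator signature: takes base PDF bytes, returns mutated bytes.
Generator = Callable[[bytes], bytes]

# Hypothetical registry mapping a rule ID (e.g. "R0001") to its generator.
RULE_GENERATORS: Dict[str, Generator] = {}


def register(rule_id: str) -> Callable[[Generator], Generator]:
    """Return a decorator that records the decorated function under rule_id."""
    def decorator(func: Generator) -> Generator:
        if rule_id in RULE_GENERATORS:
            raise ValueError(f"duplicate generator for {rule_id}")
        RULE_GENERATORS[rule_id] = func
        return func
    return decorator


@register("R0001")
def inject_example_violation(base_pdf: bytes) -> bytes:
    # Placeholder mutation; a real generator would edit the PDF with pikepdf.
    return base_pdf + b"\n% violation R0001"


def generate_all(base_pdf: bytes) -> Dict[str, bytes]:
    """Run every registered generator in sorted rule-ID order for stable output."""
    return {rid: RULE_GENERATORS[rid](base_pdf) for rid in sorted(RULE_GENERATORS)}
```

Because registration happens at import time, an orchestrator written this way only needs to import the rules module and iterate the registry; no central list has to be edited when a rule is added.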

Harness layer (`src/assay_pdf/harness/`)

Runs engines against the corpus. Two runner families share the Runner ABC and both produce the same EngineResult for the scorer:

CLI runners (codexpdf, lintpdf, pitstop, pdftoolbox) drive a licensed engine binary, implementing:

binary_path() — locate the engine's CLI.
engine_version() — report the version (recorded in results).
run(pdf_path, variant_kebab) — invoke the engine, parse output, return EngineResult.
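As a rough picture of that contract, the sketch below declares the three methods on an abstract base class and implements them in a toy runner. The EngineResult fields, the EchoRunner class, and the binary path are invented for illustration; the real Runner ABC and models may look different.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class EngineResult:
    """Simplified stand-in for the real EngineResult model (fields assumed)."""
    engine: str
    engine_version: str
    file: str
    variant: str
    rule_ids: List[str] = field(default_factory=list)


class Runner(ABC):
    name: str = "abstract"

    @abstractmethod
    def binary_path(self) -> Path:
        """Locate the engine's CLI."""

    @abstractmethod
    def engine_version(self) -> str:
        """Report the engine version recorded in results."""

    @abstractmethod
    def run(self, pdf_path: Path, variant_kebab: str) -> EngineResult:
        """Invoke the engine on one PDF and return its parsed result."""


class EchoRunner(Runner):
    """Hypothetical runner that never shells out; it only shows the contract."""
    name = "echo"

    def binary_path(self) -> Path:
        return Path("/opt/echo-engine/bin/echo-preflight")  # made-up path

    def engine_version(self) -> str:
        return "0.0.0"

    def run(self, pdf_path: Path, variant_kebab: str) -> EngineResult:
        # A real runner would call subprocess.run([...]) here and parse the output.
        return EngineResult(self.name, self.engine_version(), pdf_path.name, variant_kebab)
```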

Ingestion runners (pitstop-report, callas) subclass IngestRunner and parse a preflight report the engine already produced — they implement parse_report(report_path, *, file, profile) instead of shelling out to a binary, so AssayPDF can score any engine's output, not just the four whose CLIs it can drive. callas auto-detects JSON vs XML; report shapes mirror lint-pdf's pitstop_xml / callas_json / callas_xml importers. A report that can't be parsed raises IngestParseError — never a silent zero-findings result.
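The "auto-detect, then fail loudly" behaviour might look something like the sketch below. The report shapes (a JSON "findings" list and XML finding elements with a message attribute) are invented for the example and are not the real PitStop or callas formats; IngestParseError is redefined locally as a stand-in, and detect_and_parse is a hypothetical helper name.

```python
import json
import xml.etree.ElementTree as ET
from typing import List


class IngestParseError(Exception):
    """Stand-in for the harness error raised when a report cannot be parsed."""


def detect_and_parse(report_text: str) -> List[str]:
    """Return raw engine messages from a JSON or XML report (shapes assumed)."""
    stripped = report_text.lstrip()
    if not stripped:
        raise IngestParseError("empty report")
    if stripped[0] in "{[":
        try:
            data = json.loads(stripped)
            # Assumed JSON shape: {"findings": [{"message": "..."}]}
            return [finding["message"] for finding in data["findings"]]
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            raise IngestParseError(f"unusable JSON report: {exc!r}") from exc
    if stripped[0] == "<":
        try:
            root = ET.fromstring(stripped)
        except ET.ParseError as exc:
            raise IngestParseError(f"invalid XML report: {exc}") from exc
        # Assumed XML shape: <report><finding message="..."/></report>
        return [el.get("message", "") for el in root.iter("finding")]
    # Anything else is an error, never an empty findings list.
    raise IngestParseError("unrecognized report format")
```

Raising on unrecognized input matters for scoring: an empty list would be indistinguishable from an engine that genuinely found nothing, and would be counted as a run of missed violations.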

Engine messages are translated to GWG rule IDs via rule_maps/<engine>.json — a regex-pattern → R-id map. Adding a CLI engine = subclassing Runner + writing the rule map JSON; adding an ingestion engine = subclassing IngestRunner + writing its rule map.
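A regex-to-rule-ID translation of this kind can be sketched as below. The patterns, the R-ids, and the first-match-wins policy are assumptions for illustration; the actual rule_maps/*.json files may be structured differently.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Hypothetical rule map, shaped like a JSON object of pattern -> R-id.
RULE_MAP: Dict[str, str] = {
    r"font .* not embedded": "R0001",
    r"image resolution below \d+ ppi": "R0002",
}


def compile_rule_map(raw: Dict[str, str]) -> List[Tuple[Pattern[str], str]]:
    """Compile each pattern once, case-insensitively, preserving map order."""
    return [(re.compile(pattern, re.IGNORECASE), rule_id) for pattern, rule_id in raw.items()]


def map_messages(messages: List[str], compiled: List[Tuple[Pattern[str], str]]) -> List[str]:
    """Translate engine messages to rule IDs; unmatched messages are dropped."""
    hits: List[str] = []
    for message in messages:
        for pattern, rule_id in compiled:
            if pattern.search(message):
                hits.append(rule_id)
                break  # first matching pattern wins (an assumption of this sketch)
    return hits
```

For example, `map_messages(["Font Helvetica is not embedded"], compile_rule_map(RULE_MAP))` yields `["R0001"]`, while a message that matches no pattern contributes nothing.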

Color-conformance layer (`src/assay_pdf/color/` + `harness/runners/colorconformance.py`)

A separate, measurement-driven deliverable that computes G7 / ISO 12647 / Fogra color-conformance metrics over measured color patches (not a PDF):

color/color_math.py — dependency-free CIE76 / CIEDE2000 ΔE + summaries (local implementation; codex is not a dependency here).
color/models.py — MeasurementSet / MeasurementSample inputs, ColorTolerances threshold inputs, and the metric/report models.
color/references.py — the FOGRA51 anchor aims, the ISO 12647-2 TVI aim curve, and the G7 neutral aim.
color/conformance.py — compute_color_conformance(...), the metric math.
harness/runners/colorconformance.py — ColorConformanceRunner, following the runner shape but returning a ColorConformanceReport instead of GWG rule hits.

Thresholds are inputs, never hardcoded verdicts — see docs/methodology.md for the metric definitions and references.
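To make "thresholds are inputs" concrete, here is a small sketch of CIE76 ΔE (the Euclidean distance between two CIELAB values) with the tolerance passed in by the caller. The function names and the Lab values used in the example are made up; they are not taken from color_math.py or from any FOGRA51 data.

```python
import math
from typing import Dict, Sequence, Tuple, Union

Lab = Tuple[float, float, float]  # (L*, a*, b*)


def delta_e_76(measured: Lab, aim: Lab) -> float:
    """CIE76 color difference: Euclidean distance in CIELAB space."""
    return math.sqrt(sum((m - a) ** 2 for m, a in zip(measured, aim)))


def summarize(
    measured: Sequence[Lab], aims: Sequence[Lab], max_allowed: float
) -> Dict[str, Union[float, bool]]:
    """Summarize per-patch differences; the tolerance is supplied by the caller."""
    if not measured or len(measured) != len(aims):
        raise ValueError("measured and aims must be non-empty and the same length")
    deltas = [delta_e_76(m, a) for m, a in zip(measured, aims)]
    return {
        "mean": sum(deltas) / len(deltas),
        "max": max(deltas),
        "within_tolerance": max(deltas) <= max_allowed,
    }
```

The same measurements can pass under one tolerance and fail under another, which is why the verdict is computed from a caller-supplied threshold rather than baked into the metric.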

Scorer (`src/assay_pdf/harness/scorer.py`)

Walks corpus/manifest.json and a list of EngineResult. For each (manifest entry, engine result) pair, increments TP/FP/FN/TN counters per (rule_id, variant). Stub negatives are skipped.

Misattribution (the engine flags the wrong rule) is penalized twice: it counts as both an FN for the targeted rule and an FP for the rule the engine flagged.
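The counting scheme, including the double penalty for misattribution, can be sketched as follows. The tuple-based input, the rule universe argument, and the way TN is counted (every rule that was neither targeted nor flagged) are assumptions of this example, and stub handling is omitted.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Sequence, Set, Tuple

Key = Tuple[str, str]  # (rule_id, variant)
Entry = Tuple[Optional[str], str, Set[str]]  # (targeted rule or None, variant, flagged rules)


def score(entries: Iterable[Entry], all_rules: Sequence[str]) -> Dict[Key, Dict[str, int]]:
    """Build a per-(rule, variant) confusion matrix from engine results."""
    matrix: Dict[Key, Dict[str, int]] = defaultdict(
        lambda: {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    )
    for target, variant, flagged in entries:
        for rule in all_rules:
            expected = rule == target   # this PDF was built to violate `rule`
            reported = rule in flagged  # the engine reported `rule`
            if expected and reported:
                outcome = "TP"
            elif expected:
                outcome = "FN"
            elif reported:
                outcome = "FP"
            else:
                outcome = "TN"
            matrix[(rule, variant)][outcome] += 1
    return dict(matrix)
```

With a negative PDF that targets R0001 and an engine that reports only R0002, this yields one FN under R0001 and one FP under R0002, which is the misattribution penalty described above.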

Reports (`src/assay_pdf/reports/`)

Loads every results/*.score.json and renders via Jinja2. Two templates:

markdown.j2 — README-embeddable, GitHub-friendly.
html.j2 — standalone, Tailwind via CDN, navy/lime branding.

Models (`src/assay_pdf/models.py`)

Pydantic v2 models are the canonical data shapes. Every persisted artifact is one of these models serialized to JSON. Change the model = change the schema everywhere.

Key models:

RequirementManifest — the spec
CorpusManifest + ManifestEntry — the corpus
EngineResult + RuleHit — one engine run on one PDF
ScoreReport + RuleScore — aggregated benchmark output

Determinism

Three sources of nondeterminism in PDFs:

1. Creation timestamp.
2. /ID array, usually MD5 of file contents.
3. Object stream ordering when pdf.save() rewrites cross-reference tables.

determinism.py handles 1 and 2 by overriding them with fixed values. pikepdf's deterministic_id=True save option handles 3.

Tested in tests/test_determinism.py — generates the same PDF twice and asserts SHA-256 equality.
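The idea behind that test can be illustrated without any PDF library. In the sketch below, build_fake_pdf is a stand-in for the real generator, and the seed value, the fixed date string, and the derive_id helper are all hypothetical; the point is only that when every variable field comes from a fixed seed, two builds hash identically.

```python
import hashlib

SEED = b"assaypdf-fixed-seed"  # hypothetical seed value


def derive_id(seed: bytes, label: str) -> str:
    """Derive a stable 32-hex-character identifier from the seed and a label."""
    return hashlib.sha256(seed + label.encode("utf-8")).hexdigest()[:32]


def build_fake_pdf(rule_id: str) -> bytes:
    """Stand-in for the generator: no wall-clock time, no random values."""
    file_id = derive_id(SEED, rule_id)
    creation_date = "D:20000101000000Z"  # fixed date instead of the current time
    return f"%PDF-1.6\n% id={file_id}\n% created={creation_date}\n".encode("ascii")


def sha256_hex(data: bytes) -> str:
    """Hex digest of the kind a manifest could store per file."""
    return hashlib.sha256(data).hexdigest()


def test_same_input_same_bytes() -> None:
    first = sha256_hex(build_fake_pdf("R0001"))
    second = sha256_hex(build_fake_pdf("R0001"))
    assert first == second
```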

Why pikepdf + reportlab + ghostscript

Each library handles what it's best at:

reportlab — page layout, vector primitives, font metrics. Easy to draw stuff.
pikepdf — low-level PDF object manipulation (resources, content streams, OutputIntents, OCGs, /ID arrays). Hard to do with reportlab alone.
ghostscript (deferred to v0.1.1 for harder rules) — rendering complex stuff (transparency groups, OPM modes) where direct PDF construction is brittle.

The split keeps each rule generator small. Most are <50 lines.

CI strategy

ci.yml runs on every push — pytest, ruff, mypy, schema-only assay validate. No engine binaries in CI (commercial licenses).
url-liveness.yml weekly cron — verifies all GWG asset URLs still respond 200. Files an issue if not.
release.yml triggers on v*.*.* tags — builds wheel/sdist, drafts a GitHub release.

Engine benchmark scores are run locally and committed manually. CI never runs assay benchmark.
