title: "Methodology"
description: "How AssayPDF scores a preflight engine: confusion-matrix counts per rule and variant, rule-map translation, misattribution penalties, and reproducibility guarantees."
group: "Reference"
order: 6

Methodology

How AssayPDF actually scores a preflight engine.

What gets measured

Every PDF in the corpus is one of:

Positive baseline — a minimal PDF/X-4 file for one of the 23 GWG 2022 variants. Should pass every applicable rule.
Negative test — a PDF that violates exactly one rule cleanly, leaving every other rule satisfied.

For each (engine × variant × rule) combination, the scorer computes a confusion matrix:

	Engine flagged the rule	Engine did not flag
Negative test for that rule	true positive (TP)	false negative (FN)
Positive baseline	false positive (FP)	true negative (TN)

From these:

Accuracy = (TP + TN) / total
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 · precision · recall / (precision + recall)
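
As a minimal sketch of the four formulas above, the snippet below derives them from one confusion-matrix cell. The `Confusion` class and `metrics` function are illustrative names, not AssayPDF's API, and returning `None` when a denominator is zero is an assumption about how undefined ratios might be handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Confusion:
    """Counts for one (engine, variant, rule) cell. Hypothetical container."""
    tp: int = 0
    fp: int = 0
    fn: int = 0
    tn: int = 0


def _ratio(num: float, den: float) -> Optional[float]:
    # Assumption: an empty denominator means "undefined" (None), not 0.0.
    return num / den if den else None


def metrics(c: Confusion) -> dict:
    """Accuracy, precision, recall and F1 exactly as defined in the text."""
    total = c.tp + c.fp + c.fn + c.tn
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)
    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": _ratio(c.tp + c.tn, total),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

For example, `metrics(Confusion(tp=8, fp=2, fn=2, tn=8))` gives 0.8 (up to floating-point rounding) for all four values, while an empty cell yields `None` throughout instead of raising a division error.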

What "engine flagged the rule" means

Preflight engines (PitStop, pdfToolbox) emit issue messages, not GWG rule IDs. AssayPDF translates each engine's message catalogue into GWG rule IDs via src/assay_pdf/harness/rule_maps/<engine>.json — a list of regex patterns mapped to rule IDs.

When an engine emits a message that matches no pattern in its rule map, the resulting finding has a null rule_id. That finding cannot count as a TP for any specific rule, but it still counts toward overall noise: if it appears on a positive baseline, it is scored as an FP for the unmapped rule.

This means rule-map quality directly affects scores. Rule maps are versioned in the repo and improved iteratively as actual benchmark output reveals patterns we missed.
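
The translation step can be pictured with a short sketch. Everything here is an assumption for illustration: the entry layout (`pattern` / `rule_id` keys), the first-match-wins order, and the pairing of message text to rule IDs are invented, and the real rule_maps/<engine>.json schema may differ.

```python
import re
from typing import Optional

# Hypothetical rule-map entries. The message patterns and their pairing with
# these rule IDs are invented for illustration only.
RULE_MAP = [
    {"pattern": r"(?i)image resolution .* below", "rule_id": "R0007"},
    {"pattern": r"(?i)font .* not embedded", "rule_id": "R0014"},
]

# Compile once so each engine message is tested against ready-made regexes.
COMPILED = [(re.compile(entry["pattern"]), entry["rule_id"]) for entry in RULE_MAP]


def map_message(message: str) -> Optional[str]:
    """Return the rule ID of the first matching pattern, or None if unmapped."""
    for pattern, rule_id in COMPILED:
        if pattern.search(message):
            return rule_id
    return None  # corresponds to the null rule_id described above
```

A message that no pattern covers comes back as `None`, which is why a gap in the rule map shows up in the scores as unmapped noise and not as a hit for the intended rule.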

Two ways an engine reaches the scorer

There are two harness families, and both end at the same EngineResult → scorer path — the score is computed identically regardless of how the findings arrived:

CLI runners drive a licensed engine binary against each corpus PDF (assay benchmark --engine <name>). These need the engine installed locally with a valid license: codexpdf, lintpdf, pitstop, pdftoolbox.
Ingestion runners consume a preflight report the engine already produced (assay ingest-report --engine <name> --reports-dir <dir>) and normalize it into the same finding model. No binary is invoked; AssayPDF only parses the report file. This is what makes AssayPDF a vendor-neutral litmus test: any engine that can export a recognized report format can be scored against the GWG 2022 spec without AssayPDF ever driving its CLI.

Supported ingestion formats

Ingestion engine	Format(s)	Report shape (mirrors lint-pdf's importers)
`pitstop-report`	PitStop XML	`<PitStopReport>` / `<Results>` with `<Error>` / `<Warning>` / `<Info>` rows carrying `<CheckID>`, `<Description>`, `<Page>`, `<BBox>` (text or `llx/lly/urx/ury` attrs), `<ObjectID>`, `<ISO>` — namespace-aware.
`callas`	callas (pdfToolbox) JSON and XML	JSON: top-level object with `hits[]` (`severity`, `comment`/`message`, `rule_id`, `page`, `geometry.bbox`). XML: `<preflight_report>` / `<hit>` with the same fields. Format auto-detected from the report bytes.

Ingested findings are translated to GWG rule IDs through the same rule_maps/<engine>.json mechanism (pitstop-report.json, callas.json) and scored with the identical confusion-matrix + misattribution logic — there is no separate or softer scoring path for ingested reports.

A report AssayPDF cannot parse fails cleanly with an IngestParseError naming what broke; a misparse is never allowed to surface as an empty (zero-findings) result, which would let the caller believe the input was clean. This mirrors lint-pdf's "no format autopsies" rule.
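
A rough sketch of the fail-loudly behavior for the callas JSON shape described above. The `IngestParseError` class here is a stand-in for the one the text names, and `parse_callas_json` plus the exact validation steps are assumptions, not the shipped parser.

```python
import json
from typing import Any


class IngestParseError(ValueError):
    """Stand-in for the error named in the text; the real class may differ."""


def parse_callas_json(raw: bytes) -> list[dict[str, Any]]:
    """Normalize a callas-style JSON report, refusing to guess on bad input."""
    try:
        doc = json.loads(raw)
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise IngestParseError(f"report is not valid JSON: {exc}") from exc

    # A missing 'hits' key is a parse failure, never "zero findings".
    if not isinstance(doc, dict) or "hits" not in doc:
        raise IngestParseError("expected a top-level object with a 'hits' array")
    hits = doc["hits"]
    if not isinstance(hits, list):
        raise IngestParseError("'hits' must be an array")

    findings = []
    for i, hit in enumerate(hits):
        if not isinstance(hit, dict):
            raise IngestParseError(f"hits[{i}] is not an object")
        text = hit.get("comment") or hit.get("message")
        if not text:
            raise IngestParseError(f"hits[{i}] has neither 'comment' nor 'message'")
        findings.append({
            "severity": hit.get("severity"),
            "message": text,
            "rule_id": hit.get("rule_id"),
            "page": hit.get("page"),
        })
    return findings
```

The distinction that matters is between `{"hits": []}`, which is a well-formed report with no findings, and a document without `hits` at all, which raises instead of silently returning an empty list.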

Why this is honest

No vendor mode — every engine runs on the same corpus with the same scoring rules.
No engine-specific tuning — we don't suppress an engine's noise; we count it.
Misattribution is penalized — if the engine flags rule R0007 on a PDF that violates R0014, that's both an FN for R0014 and an FP for R0007. Wrong rule attribution is a real quality issue.
Stub negatives are excluded — the v0.1.0 corpus has 18 rules where the negative is structurally a baseline (no violation injected). The scorer skips these. They're listed in the report as a coverage gap, not an engine failure.
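
The misattribution and stub-exclusion rules above can be sketched as a small classifier for one corpus PDF. The function name, argument shapes, and the decision to omit TN bookkeeping are assumptions made to keep the example short; this is not the scorer's actual code.

```python
from typing import Optional


def classify(
    violated: Optional[str],
    flagged: set[str],
    stub_rules: frozenset[str] = frozenset(),
) -> list[tuple[str, str]]:
    """Return (rule_id, outcome) pairs for one PDF.

    violated   -- the rule a negative test breaks, or None for a positive baseline
    flagged    -- mapped rule IDs the engine reported for this PDF
    stub_rules -- rules whose negative carries no injected violation (skipped)
    """
    if violated in stub_rules:
        return []  # coverage gap, not an engine failure

    outcomes: list[tuple[str, str]] = []
    if violated is not None:
        outcomes.append((violated, "TP" if violated in flagged else "FN"))

    # Every other flagged rule is noise, including a wrong attribution.
    for rule in sorted(flagged - {violated}):
        outcomes.append((rule, "FP"))
    return outcomes
```

With these assumptions, `classify("R0014", {"R0007"})` returns `[("R0014", "FN"), ("R0007", "FP")]`, which is the double penalty described in the misattribution bullet.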

Color conformance verification (G7 / ISO 12647 / Fogra)

AssayPDF also computes color-conformance metrics over measurement data. This is a separate deliverable from the GWG rule scoring above: the input is a set of measured color patches (a CGATS-style measurement export, ingested as a JSON MeasurementSet), not a PDF, and the output is a ColorConformanceReport of metrics, not a pass/fail verdict.

These are checks, not transforms: AssayPDF measures and reports; it never modifies a PDF and never re-renders color. It is also policy-free — thresholds are inputs (ColorTolerances), not hardcoded verdicts. When a tolerance is supplied for a dimension, the metric carries within_tolerance = true/false; when none is supplied, it carries within_tolerance = null (computed-but-unjudged). Pass/fail policy lives downstream (lint owns thresholds), matching how the GWG scorer defers severity policy to the spec.
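
The three-state within_tolerance value can be illustrated with a tiny sketch. The `Metric` class and `judge` function are hypothetical, and treating a tolerance as an inclusive upper bound is an assumption; the real ColorTolerances type may express limits differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Metric:
    """One reported number plus an optional judgement. Hypothetical shape."""
    name: str
    value: float
    within_tolerance: Optional[bool]  # None means computed but unjudged


def judge(name: str, value: float, tolerance: Optional[float]) -> Metric:
    # Assumption: a tolerance is an inclusive upper bound on the metric value.
    verdict = None if tolerance is None else value <= tolerance
    return Metric(name, value, verdict)
```

Calling `judge("delta_e_mean", 2.1, None)` still reports the value 2.1 but leaves the verdict as `None`, so a downstream consumer can tell "not judged" apart from "judged and failed".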

The metrics

Fogra ΔE (FOGRA51) — each measured patch is matched to the nearest nominal-CMYK patch in the FOGRA51 reference and CIEDE2000 ΔE*00 is taken patch-by-patch, summarised as mean / max / p95. The four solid primaries and the substrate white get their own ΔE callouts. FOGRA51 (PSO Coated v3, ISO 12647-2:2013, premium coated, M1) is the one Fogra reference shipped in this first PR.
ISO 12647 tone value increase (TVI / dot gain) — per process channel (C/M/Y/K), the Murray-Davies tone value of each single-channel patch is derived from its measured L* (relative luminance reconstructed from CIE L*, density D = −log10(Y/Yn)), and TVI = measured_TV − nominal_TV. The mid-tone (40/50%) deviation vs the ISO 12647-2 coated aim curve drives the optional judgement.
G7 grayscale / highlight-range (HR) — over the CMY gray-balance ramp, the weighted ΔL* (lightness vs the NPDC aim) and weighted Δch (a*/b* neutrality vs the G7 a*=b*=0 aim) — the two numbers G7 uses for grayscale conformance — plus the highlight-range subset (the lighter half of the ramp). The step weighting is a normalised triangular weight peaking mid-ramp, which down-weights the paper/max extremes where neutral error is least visible.
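
To make the TVI derivation concrete, here is a sketch of the L* to luminance to density to Murray-Davies chain. Function names are illustrative, and taking densities relative to the measured paper white is an assumption about how the paper and solid patches enter the formula.

```python
import math


def luminance_from_lstar(l_star: float) -> float:
    """Relative luminance Y/Yn from CIE L* (inverse of the CIE 1976 L* function)."""
    if l_star > 8.0:
        return ((l_star + 16.0) / 116.0) ** 3
    return l_star * (3.0 / 29.0) ** 3  # linear segment near black


def density(l_star: float) -> float:
    """D = -log10(Y/Yn). Requires L* > 0, since log10(0) is undefined."""
    return -math.log10(luminance_from_lstar(l_star))


def murray_davies_tv(l_tint: float, l_solid: float, l_paper: float) -> float:
    """Tone value in percent from the L* of a tint, the solid, and the paper."""
    d_tint = density(l_tint) - density(l_paper)    # assumed paper-relative
    d_solid = density(l_solid) - density(l_paper)
    return 100.0 * (1.0 - 10.0 ** -d_tint) / (1.0 - 10.0 ** -d_solid)


def tvi(nominal_tv: float, l_tint: float, l_solid: float, l_paper: float) -> float:
    """TVI = measured tone value minus nominal tone value, as in the text."""
    return murray_davies_tv(l_tint, l_solid, l_paper) - nominal_tv
```

With made-up readings of L* = 60 for a nominal 50% tint, L* = 20 for the solid, and L* = 95 for the paper, the measured tone value comes out at roughly 70%, so the TVI is roughly 20 points. A tint that measures the same as the paper gives a tone value of 0, and one that measures the same as the solid gives 100.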

ΔE implementation note

The CIEDE2000 / CIE76 math (src/assay_pdf/color/color_math.py) is implemented locally from the CIE 15:2004 and Sharma et al. (2005) canonical formulae. codex-pdf's color/color_math.py is the org's source of truth for the same math, but codex is not a dependency of this repo — the existing codex integration is subprocess-only (see harness/runners/codexpdf.py), so importing or vendoring codex's module would cross that boundary. The local implementation is verified against the Sharma Table-1 vectors and agrees with codex's output to 4 decimal places (including the documented mean-hue-branch pair, where both yield 4.3065 rather than the table's 7.2195).
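
For readers unfamiliar with the two formulae, the following is an independent sketch written from the published CIEDE2000 steps (Sharma et al., 2005). It is not the code in color_math.py, and the function names and signatures are assumptions.

```python
import math
from typing import Sequence

Lab = Sequence[float]  # (L*, a*, b*)


def delta_e_76(lab1: Lab, lab2: Lab) -> float:
    """CIE76: plain Euclidean distance in CIELAB."""
    return math.dist(lab1, lab2)


def delta_e_2000(lab1: Lab, lab2: Lab, k_l=1.0, k_c=1.0, k_h=1.0) -> float:
    """CIEDE2000 colour difference with parametric factors k_l, k_c, k_h."""
    l1, a1, b1 = lab1
    l2, a2, b2 = lab2

    # Step 1: rescale a* by a chroma-dependent factor, then get C' and h'.
    c_bar = (math.hypot(a1, b1) + math.hypot(a2, b2)) / 2.0
    g = 0.5 * (1.0 - math.sqrt(c_bar**7 / (c_bar**7 + 25.0**7)))
    a1p, a2p = (1.0 + g) * a1, (1.0 + g) * a2
    c1p, c2p = math.hypot(a1p, b1), math.hypot(a2p, b2)
    h1p = math.degrees(math.atan2(b1, a1p)) % 360.0
    h2p = math.degrees(math.atan2(b2, a2p)) % 360.0

    # Step 2: lightness, chroma and hue differences.
    d_l = l2 - l1
    d_c = c2p - c1p
    if c1p * c2p == 0.0:
        d_h_deg = 0.0
    else:
        d_h_deg = h2p - h1p
        if d_h_deg > 180.0:
            d_h_deg -= 360.0
        elif d_h_deg < -180.0:
            d_h_deg += 360.0
    d_h = 2.0 * math.sqrt(c1p * c2p) * math.sin(math.radians(d_h_deg / 2.0))

    # Step 3: means, including the branchy mean-hue rule.
    l_bar = (l1 + l2) / 2.0
    c_barp = (c1p + c2p) / 2.0
    h_sum = h1p + h2p
    if c1p * c2p == 0.0:
        h_bar = h_sum
    elif abs(h1p - h2p) <= 180.0:
        h_bar = h_sum / 2.0
    elif h_sum < 360.0:
        h_bar = (h_sum + 360.0) / 2.0
    else:
        h_bar = (h_sum - 360.0) / 2.0

    t = (1.0
         - 0.17 * math.cos(math.radians(h_bar - 30.0))
         + 0.24 * math.cos(math.radians(2.0 * h_bar))
         + 0.32 * math.cos(math.radians(3.0 * h_bar + 6.0))
         - 0.20 * math.cos(math.radians(4.0 * h_bar - 63.0)))
    d_theta = 30.0 * math.exp(-(((h_bar - 275.0) / 25.0) ** 2))
    r_c = 2.0 * math.sqrt(c_barp**7 / (c_barp**7 + 25.0**7))
    s_l = 1.0 + 0.015 * (l_bar - 50.0) ** 2 / math.sqrt(20.0 + (l_bar - 50.0) ** 2)
    s_c = 1.0 + 0.045 * c_barp
    s_h = 1.0 + 0.015 * c_barp * t
    r_t = -math.sin(math.radians(2.0 * d_theta)) * r_c

    # Step 4: combine the weighted terms, including the rotation term.
    tl, tc, th = d_l / (k_l * s_l), d_c / (k_c * s_c), d_h / (k_h * s_h)
    return math.sqrt(tl * tl + tc * tc + th * th + r_t * tc * th)
```

A quick sanity check is the first pair of the Sharma test data, (50, 2.6772, -79.7751) against (50, 0, -82.7485), whose published ΔE*00 is 2.0425. Identical inputs give 0.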

Downstream consumer contract (the serialized report)

The ColorConformanceReport is measurement output; the pass/fail policy lives downstream (lint-pdf's color_conformance analyzer turns each metric into an advisory/warning/error verdict against its own configurable thresholds). Because lint consumes the serialized report as a plain JSON dict — it does not import AssayPDF (the codex/engine integration is subprocess-only) — the wire shape is a cross-repo contract: the key names, the nesting, and the nullable within_tolerance are what a consumer reads.

That shape is pinned by tests/test_color_conformance_contract.py against a committed golden, tests/fixtures/color/conformance-report-fogra51.json (produced from the deterministic measurement fixture, generated_at excluded as the one non-deterministic field). A consumer can use that golden as the canonical sample of the report it will receive. If a metric definition legitimately changes, regenerate the golden in the same commit and bump COLOR_CONFORMANCE_VERSION so the change is reviewed rather than silently broken at a consumer's runtime.
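
A consumer-side check in the same spirit might look like the sketch below. It assumes generated_at is a top-level key, and the sample report keys used in the usage note are invented; the committed golden file is the authority on the real shape.

```python
import copy

# Fields excluded from comparison because they change on every run.
VOLATILE_KEYS = ("generated_at",)


def normalized(report: dict) -> dict:
    """Copy of the report with volatile top-level fields removed."""
    out = copy.deepcopy(report)
    for key in VOLATILE_KEYS:
        out.pop(key, None)
    return out


def matches_golden(report: dict, golden: dict) -> bool:
    """True when key names, nesting and values agree, ignoring volatile fields."""
    return normalized(report) == normalized(golden)
```

Under these assumptions, two reports that differ only in generated_at compare equal, while a renamed key or a within_tolerance that flips from null to false makes the comparison fail, which is the kind of drift the contract test is meant to surface.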

References

G7 — IDEAlliance G7 Specification (NPDC gray-balance method; a*=b*=0 neutral aim).
ISO 12647-2:2013 — process control for offset; mid-tone TVI aim curves + tolerances.
FOGRA51 / PSO — Fogra characterisation data (PSO Coated v3, M1) for ΔE conformance.

Scope of the first PR (explicit follow-up)

One Fogra reference (FOGRA51) ships as a small documented anchor subset — substrate, solids, and the CMY gray ramp — not the full ~1600-patch IT8.7/4 characterisation file (whose redistribution licensing needs its own review). Adding the full set + more references (FOGRA39/52, GRACoL, SWOP) is follow-up.
Measurement ingest is JSON (MeasurementSet); native CGATS/CxF parsing is follow-up.
The metrics are computed and reported; no color-conformance entries are wired into the engine-vs-engine scoreboard yet.

What it doesn't measure

AssayPDF measures structural conformance to the GWG 2022 spec rules — does the engine catch the cases the spec says should be errors or warnings, and does it stay quiet on cases the spec says should pass — plus the color-conformance metrics above.

Things AssayPDF does not measure (in v0.1.0):

Color accuracy of a rendering — whether an ICC profile produces the same printed result (the color metrics above grade measurement data, not an ICC transform).
Performance under load — only single-file runtime is captured.
Workflow integration — JDF/JMF, PitStop Connect, callas Switch, etc.
GUI usability — only CLI invocations (and ingested report files) are benchmarked, never an interactive UI.

If you want any of those, file an issue with a methodology proposal.

Reproducibility guarantee

Same input → same output, byte-identical:

Corpus generation is deterministic. Two runs of uv run assay generate produce the same SHA-256 for every PDF (see tests/test_determinism.py).
Spec parsing is deterministic. The XLSX is committed; running uv run assay ingest produces the same requirement-ids.json.
Vendor assets are pinned by SHA-256 in vendor/checksums.json.
Engine versions are recorded in every results file. If the engine version changes, re-run the benchmark and tag the report with the new version.
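
One way to check the byte-identical claim locally is to hash two generated corpora and compare the digests. This is a sketch with hypothetical helper names; it does not reproduce tests/test_determinism.py or the layout of vendor/checksums.json.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path) -> str:
    """SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_tree(root: Path) -> dict[str, str]:
    """Map each PDF's path (relative to root) to its SHA-256 digest."""
    return {
        pdf.relative_to(root).as_posix(): sha256_file(pdf)
        for pdf in sorted(root.rglob("*.pdf"))
    }
```

If two corpus directories were produced by separate generate runs, `digest_tree(run_a) == digest_tree(run_b)` should hold when generation is deterministic. Any differing or missing entry identifies the file to attach to a reproducibility issue.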

If you can't reproduce a published score, file an issue with your local engine version, the corpus version, and the score JSON.