AssayPDF

Open-source GWG 2022 conformance assay for PDF preflight engines.

Python 3.12+ | License: AGPL v3 | Spec: GWG 2022

What this is

AssayPDF is a benchmark kit that:

  1. Generates a deterministic PDF test corpus (62 files) derived from the Ghent Workgroup 2022 Specification: 39 negative files, each targeting exactly one of the 39 rules in the spec, plus 23 positive baselines, one for each of the 23 GWG 2022 variants.
  2. Runs that corpus against any preflight engine (lintPDF, Enfocus PitStop Server, callas pdfToolbox) through a uniform harness, either by driving the engine's CLI or by ingesting a preflight report the engine already produced (PitStop XML, callas JSON/XML). Either way, AssayPDF scores the engine's output against the same vendor-neutral litmus.
  3. Scores TP / FP / FN / TN per rule, per variant, per engine, and produces reproducible markdown + HTML accuracy reports.
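The per-rule scoring in step 3 can be illustrated with a minimal sketch. It assumes each corpus file yields an expected/reported pair per rule; the `Observation` type and the `classify` and `score` functions are hypothetical names for illustration, not AssayPDF's actual API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One (file, rule) outcome. Hypothetical shape, not AssayPDF's own type."""
    rule_id: str    # GWG rule ID, e.g. "R0007"
    expected: bool  # True if the corpus file is built to violate this rule
    reported: bool  # True if the engine flagged this rule on the file


def classify(obs: Observation) -> str:
    """Map one expected/reported pair to a confusion-matrix cell."""
    if obs.expected and obs.reported:
        return "TP"  # violation present and caught
    if obs.expected and not obs.reported:
        return "FN"  # violation present but missed
    if not obs.expected and obs.reported:
        return "FP"  # clean file flagged anyway
    return "TN"      # clean file, engine stayed silent


def score(observations: list[Observation]) -> dict[str, Counter]:
    """Tally TP/FP/FN/TN counts per rule ID."""
    per_rule: dict[str, Counter] = {}
    for obs in observations:
        per_rule.setdefault(obs.rule_id, Counter())[classify(obs)] += 1
    return per_rule


sample = [
    Observation("R0007", expected=True, reported=True),
    Observation("R0007", expected=False, reported=False),
    Observation("R0014", expected=True, reported=False),
    Observation("R0014", expected=False, reported=True),
]
totals = score(sample)
print(dict(totals["R0007"]))  # {'TP': 1, 'TN': 1}
print(dict(totals["R0014"]))  # {'FN': 1, 'FP': 1}
```

Keying the tally by rule ID first keeps the per-rule view cheap; a per-variant or per-engine breakdown would add those fields to the key in the same way.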

Why this exists

The GWG 2015 Compliancy Test Suite is gated to GWG vendor members. The GWG 2022 spec ships with no public test corpus at all. AssayPDF closes that gap so anyone can self-benchmark a preflight engine without paying for vendor membership.

It also serves as the credibility layer for lintPDF (Think Neverland's open-source PDF preflight engine, hosted at lintpdf.com): it publishes accuracy comparisons against incumbent engines, which the incumbents do not publish themselves.

Quick start

git clone https://github.com/printwithsynergy/assay-pdf.git
cd assay-pdf
uv sync --all-extras                                      # install deps + Python 3.12
uv run assay fetch                                        # download GWG vendor assets (~183 MB)
uv run assay generate                                     # build the 62-file PDF/X-4 corpus
uv run assay validate                                     # verify every PDF passes verapdf
uv run assay benchmark --engine pdftoolbox --profile sheetcmyk-cmyk
uv run assay report --format md > REPORT.md

What you get

corpus/
├── manifest.json          # every file's expected outcome, rule mapping, sha256
├── positive/              # 23 PDFs — one per GWG 2022 variant — pass every applicable rule
└── negative/              # 39 PDFs — one per rule, targeting that rule's failure mode

Every PDF passes verapdf PDF/X-4 validation (or has a documented exception in the manifest). Every PDF is generated deterministically: the same code and the same seed produce byte-identical output.

Coverage

| Spec area | Rule IDs | Negatives |
|---|---|---|
| Page geometry | R0001–R0006 | 6 |
| Overprint | R0007–R0013 | 7 |
| Fonts | R0014 | 1 |
| Black, registration | R0015–R0019 | 5 |
| Spot colors | R0020–R0024 | 5 |
| Total ink coverage | R0025–R0026 | 2 |
| Color space binding | R0027–R0030 | 4 |
| Image resolution | R0031–R0033 | 3 |
| Optional content | R0034, R0036 | 2 |
| Output intent | R0035 | 1 |
| Sign/display scaling | R0037 | 1 |
| Processing steps | R1001–R1002 | 2 |

39 negatives total — one per rule — of which 18 are stub negatives pending full failure-mode implementation. Plus 23 positive baselines, one per variant. (A larger boundary-stress corpus is planned for a future release.)

Engine support

AssayPDF reaches an engine two ways. CLI runners (assay benchmark) drive a licensed engine binary against the corpus. Ingestion runners (assay ingest-report) parse a preflight report the engine already produced and score it against the same GWG litmus — so AssayPDF is a vendor-neutral yardstick for any engine's output, not only the ones whose CLI it can drive.

| Engine | Mode | Status | Notes |
|---|---|---|---|
| callas pdfToolbox | CLI (pdftoolbox) | working | Trial license; CLI invocation |
| Enfocus PitStop Server | CLI (pitstop) | working | Trial license; CLI invocation |
| lintPDF | CLI (lintpdf) | working | lint-pdf CLI (codex-cluster); runner wired |
| codexPDF | CLI (codexpdf) | working | Codex extraction CLI; runner wired |
| Enfocus PitStop | ingest (pitstop-report) | working | Ingests a PitStop XML preflight report; no binary needed |
| callas pdfToolbox | ingest (callas) | working | Ingests a callas JSON or XML preflight report; no binary needed |

Adding a CLI engine = implementing one Runner subclass + a rule_maps/<engine>.json mapping. Adding an ingestion engine = implementing one IngestRunner subclass (parse_report) + its rule map. See docs/methodology.md.
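As a rough illustration of the ingestion path, the sketch below pairs a `parse_report` method with a rule map. Everything except the names `IngestRunner` and `parse_report` is assumed: the base class here is a local stand-in (the real signature in AssayPDF may differ), and the XML layout and engine check IDs are invented. docs/methodology.md remains the authority.

```python
import json
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod


class IngestRunner(ABC):
    """Local stand-in for the base class; the real interface may differ."""

    def __init__(self, rule_map: dict[str, str]) -> None:
        # rule_map: engine-native check ID -> GWG rule ID
        self.rule_map = rule_map

    @abstractmethod
    def parse_report(self, report_text: str) -> set[str]:
        """Return the GWG rule IDs that the report flags."""


class ExampleXmlRunner(IngestRunner):
    """Parses an invented <report><hit check="..."/></report> layout."""

    def parse_report(self, report_text: str) -> set[str]:
        root = ET.fromstring(report_text)
        flagged: set[str] = set()
        for hit in root.iter("hit"):
            rule = self.rule_map.get(hit.get("check", ""))
            if rule is not None:  # checks with no GWG mapping are ignored
                flagged.add(rule)
        return flagged


# Stand-in for the contents of rule_maps/<engine>.json (check IDs are invented).
rule_map = json.loads(
    '{"vendor.check.overprint": "R0007", "vendor.check.fonts": "R0014"}'
)
report = """<report>
  <hit check="vendor.check.fonts"/>
  <hit check="vendor.check.unmapped"/>
</report>"""

flagged = ExampleXmlRunner(rule_map).parse_report(report)
print(sorted(flagged))  # ['R0014']
```

Keeping the mapping in a separate JSON file means an engine's check names can change without touching the parsing code.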

Reproducibility

This is not a one-off study. Every claim AssayPDF makes is reproducible:

  • Spec assets fetched from GWG canonical URLs with SHA-256 verification (vendor/checksums.json)
  • Corpus generated deterministically from a seed; manifest records expected SHA-256 per file
  • CI runs assay validate on every commit
  • A weekly cron job verifies all upstream URLs are still alive

Anyone with the same engine versions and licenses can run AssayPDF and reproduce the published accuracy numbers byte-for-byte.

Legal posture

AssayPDF never redistributes GWG copyrighted materials. Vendor assets (GOS 5.0 suites, processing-steps test suite) are fetched from the official GWG endpoints. The corpus AssayPDF generates is original work derived from spec rules, not copies of the GWG 2015 test suite.

See the archived legal-positioning.md (in the pdf-rd repo) for the comparative-advertising / nominative-fair-use stance.

Contributing

See CONTRIBUTING.md. New rule generators, new engine runners, and new boundary-case test files are all welcome.

By participating in this project you agree to abide by the Code of Conduct.

Support and security

  • Usage questions or bug reports: see SUPPORT.md.
  • Security vulnerabilities: see SECURITY.md — please do not open a public issue.

License

AGPL-3.0-or-later — see LICENSE.

ICC profiles bundled under src/assay_pdf/generator/icc/ are redistributed under their respective upstream terms; see src/assay_pdf/generator/icc/README.md.

Sister projects

Print With Synergy's PDF tooling family:

  • lint-pdf — open-source PDF preflight engine (500+ checks). This is the primary engine AssayPDF benchmarks. Hosted at lintpdf.com.
  • lens-pdf — open-source embeddable PDF viewer (React/TypeScript). Renders PDFs with preflight overlays.
  • codex-pdf — central PDF extraction engine; source of truth for fonts, images, color spaces, annotations, and findings.