Who this questionnaire is for
Engineering teams, ML engineers, platform teams, and technical leads building or operating AI systems.
What it assesses
The technical maturity of AI systems — including evaluation practices, failure-path testing, monitoring, logging, drift detection, and separation of retrieval vs reasoning.
How it helps
This questionnaire reveals whether systems are engineered for reliability, not just demos. It identifies gaps that let behaviour drift between releases, obscure failures, and hinder effective debugging and audit review. Results help teams prioritise engineering work that directly improves trustworthiness and operational resilience.
Best used when
- Preparing systems for production
- Investigating unexplained failures or regressions
- Strengthening monitoring and evaluation pipelines
AI Engineering Maturity (Systems)
Scoring updates automatically as you answer. The fields at the top are optional: they support internal benchmarking but do not affect the score.
Section A — Tasking, Retrieval & Data Discipline
1) Tasking: Are tasks typed and routed (e.g., summarise vs extract vs decide) with different policies and prompts?
2) Retrieval: Is retrieval constrained to approved sources (domain/corpus allowlist) with metadata filters?
3) Freshness: Do you control freshness (document dates, snapshots, versioning) and prevent outdated retrieval?
4) Provenance: Do you store retrieval context (top-k, scores, chunk IDs) per response for audit/debug?
5) Change control: Are prompts/tools treated as code (reviews, tests, versioning, rollback)?
Section B — Verification, Refusal & Safety Behaviour
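To make Q4 (Provenance) in Section A concrete, here is a minimal sketch of a per-response retrieval record that can be serialised into structured audit logs. All class and field names are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class RetrievedChunk:
    chunk_id: str   # stable ID of the chunk within the corpus
    source: str     # document or URL the chunk came from
    score: float    # retriever similarity score

@dataclass
class RetrievalRecord:
    response_id: str
    query: str
    top_k: int
    chunks: list    # list[RetrievedChunk]

    def to_json(self) -> str:
        # Serialise for structured, exportable audit logs.
        return json.dumps(asdict(self), sort_keys=True)

record = RetrievalRecord(
    response_id="resp-001",
    query="What is the refund policy?",
    top_k=2,
    chunks=[RetrievedChunk("doc7#c3", "policies/refunds.md", 0.91),
            RetrievedChunk("doc7#c4", "policies/refunds.md", 0.84)],
)
print(record.to_json())
```

Storing one such record per response is what makes later audit and debugging tractable: a reviewer can reconstruct exactly what the model saw.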
6) Refusal: Does the system reliably refuse or defer when evidence is missing or contradictory?
7) Gates: Are there automated checks (schema, citation presence, groundedness, policy gates) before returning an answer?
8) Metrics: Do you separate retrieval quality from reasoning quality in evaluation (stage-separated metrics)?
9) High-stakes: Are “high-stakes” requests detected and handled with stronger controls (human review, higher refusal threshold, stricter evidence)?
10) Trace: Can the system produce an answer trace (what it used, what checks ran, and why it answered/refused)?
Section C — Observability, Drift & Operations
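As an illustration of the pre-return checks asked about in Q7 (Gates), a minimal sketch of a gate function running schema, citation-presence, and a naive groundedness check. The checks and function name are illustrative assumptions; real gates would be richer.

```python
def passes_gates(answer: dict, retrieved_texts: list) -> tuple:
    """Run cheap automated checks before returning an answer.

    Returns (ok, reasons). Illustrative only, not exhaustive.
    """
    reasons = []
    # Schema gate: required keys present and correctly typed.
    if not isinstance(answer.get("text"), str) or not answer["text"].strip():
        reasons.append("schema: missing answer text")
    if not isinstance(answer.get("citations"), list):
        reasons.append("schema: citations must be a list")
    # Citation-presence gate: the answer must cite something.
    elif not answer["citations"]:
        reasons.append("citations: none provided")
    # Naive groundedness gate: each citation must index a retrieved chunk.
    else:
        for c in answer["citations"]:
            if c not in range(len(retrieved_texts)):
                reasons.append(f"groundedness: citation {c} not in retrieval set")
    return (not reasons, reasons)

ok, why = passes_gates(
    {"text": "Refunds take 14 days.", "citations": [0]},
    ["Refunds are processed within 14 days."],
)
```

Failing any gate should route the response to refusal, retry, or human review rather than returning it to the user.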
11) Ops: Do you monitor key runtime signals (refusal rate, verification pass rate, latency, tool errors) with alerting?
12) Testing: Do you test failure paths (empty retrieval, conflicts, prompt injection, tool failure) before release?
13) Eval: Do you run structured evaluations on representative tasks before/after changes (not just ad-hoc demos)?
14) IR: Is there an incident loop with rollback, postmortems, and documented remediation?
15) Logs: Are logs structured and exportable for audit (inputs, retrieval, checks, outputs, versions)?
Tip: When engineering and governance disagree on maturity, it usually means controls exist but are not enforced or not measurable.
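The runtime-signal monitoring asked about in Q11 (Ops) can be sketched as a rolling-window monitor with alert thresholds. Class name, window size, and thresholds are illustrative assumptions to be tuned per system.

```python
from collections import deque

class SignalMonitor:
    """Rolling-window monitor for refusal and verification-pass rates."""

    def __init__(self, window=100, max_refusal_rate=0.3, min_pass_rate=0.9):
        self.events = deque(maxlen=window)  # (refused, verified) pairs
        self.max_refusal_rate = max_refusal_rate
        self.min_pass_rate = min_pass_rate

    def record(self, refused: bool, verified: bool):
        self.events.append((refused, verified))

    def alerts(self):
        n = len(self.events)
        if n == 0:
            return []
        refusal_rate = sum(r for r, _ in self.events) / n
        pass_rate = sum(v for _, v in self.events) / n
        out = []
        if refusal_rate > self.max_refusal_rate:
            out.append(f"refusal rate {refusal_rate:.2f} above threshold")
        if pass_rate < self.min_pass_rate:
            out.append(f"verification pass rate {pass_rate:.2f} below threshold")
        return out

mon = SignalMonitor(window=50)
mon.record(refused=False, verified=True)
print(mon.alerts())  # prints []
```

In production the same signals would typically feed an existing metrics stack; the point of Q11 is that they are computed and alerted on at all, not the mechanism.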
