AI Engineering Maturity (Systems)

Who this questionnaire is for
Engineering teams, ML engineers, platform teams, and technical leads building or operating AI systems.

What it assesses
The technical maturity of AI systems — including evaluation practices, failure-path testing, monitoring, logging, drift detection, and separation of retrieval vs reasoning.

How it helps
This questionnaire reveals whether systems are engineered for reliability, not just for demos. It identifies gaps that let behaviour drift between releases, obscure failures, or block effective debugging and audit review. Results help teams prioritise engineering work that directly improves trustworthiness and operational resilience.

Best used when

  • Preparing systems for production
  • Investigating unexplained failures or regressions
  • Strengthening monitoring and evaluation pipelines


The questionnaire scores automatically as you click. The fields at the top are optional; only the questions affect scoring.


Optional fields: these help with internal benchmarking, but are not required to score.

Scores run from 0 to 45; answer the questions to see your maturity level.

Section A — Tasking, Retrieval & Data Discipline

1) Are tasks typed and routed (e.g., summarise vs extract vs decide) with different policies and prompts?

Tasking
Typed task routing reduces policy drift and allows stage-specific controls.
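A typed router can be as small as a policy table keyed by task type. The sketch below is illustrative: the three task types come from the question, but the `TaskPolicy` fields and prompt strings are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskPolicy:
    prompt_template: str    # stage-specific prompt
    allow_free_text: bool   # e.g. extraction must return structured output
    max_output_tokens: int

# One policy per task type; unknown types are rejected, not defaulted.
POLICIES = {
    "summarise": TaskPolicy("Summarise the following:\n{input}", True, 512),
    "extract":   TaskPolicy("Extract fields as JSON from:\n{input}", False, 256),
    "decide":    TaskPolicy("Decide yes/no, citing evidence:\n{input}", False, 64),
}

def route(task_type: str, user_input: str) -> tuple:
    """Return (policy, rendered prompt); raise on unrouted task types."""
    if task_type not in POLICIES:
        raise ValueError(f"unrouted task type: {task_type!r}")
    policy = POLICIES[task_type]
    return policy, policy.prompt_template.format(input=user_input)
```

Rejecting unrouted types is the point: a silent default prompt is exactly how policy drift starts.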

2) Is retrieval constrained to approved sources (domain/corpus allowlist) with metadata filters?

Retrieval
Allowlists + metadata filters are the core anti-hallucination infrastructure.
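A minimal sketch of both controls together, assuming each retrieved chunk carries a `domain` and a `meta` dict; the field names and allowlist entries are illustrative.

```python
ALLOWED_DOMAINS = {"docs.internal.example", "kb.example"}

def filter_chunks(chunks: list, required_meta: dict) -> list:
    """Keep only chunks from approved sources whose metadata matches."""
    kept = []
    for c in chunks:
        if c["domain"] not in ALLOWED_DOMAINS:
            continue  # source not on the allowlist
        if any(c["meta"].get(k) != v for k, v in required_meta.items()):
            continue  # metadata filter (e.g. product line, tenant) mismatch
        kept.append(c)
    return kept
```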

3) Do you control freshness (document dates, snapshots, versioning) and prevent outdated retrieval?

Freshness
Freshness control prevents “quiet outdated truth” in deployed systems.
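Freshness control can start as a dated snapshot plus an explicit age budget. A sketch, with an illustrative one-year budget:

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=365)  # illustrative freshness budget

def is_fresh(doc_date: date, today: date, max_age: timedelta = MAX_AGE) -> bool:
    """True only if the document's snapshot date is inside the age budget."""
    return today - doc_date <= max_age
```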

4) Do you store retrieval context (top-k, scores, chunk IDs) per response for audit/debug?

Provenance
Per-response top-k capture is the minimum viable audit trail for RAG.

5) Are prompts/tools treated as code (reviews, tests, versioning, rollback)?

Change control
Prompt/tool changes are production changes. Treat them as code changes.
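Treating prompts as code implies immutable versions and cheap rollback. A toy in-memory registry to illustrate the shape; a real setup would live in Git with review and CI.

```python
class PromptRegistry:
    """Versions are immutable; 'current' is a pointer, so rollback is repointing."""

    def __init__(self):
        self._versions = {}  # name -> list of prompt texts
        self._current = {}   # name -> index of active version

    def publish(self, name: str, text: str) -> int:
        self._versions.setdefault(name, []).append(text)
        self._current[name] = len(self._versions[name]) - 1
        return self._current[name]

    def rollback(self, name: str, version: int) -> None:
        if not 0 <= version < len(self._versions.get(name, [])):
            raise ValueError("unknown version")
        self._current[name] = version

    def get(self, name: str) -> str:
        return self._versions[name][self._current[name]]
```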

Section B — Verification, Refusal & Safety Behaviour

6) Does the system reliably refuse or defer when evidence is missing or contradictory?

Refusal
Reliable refusal is a safety feature and a trust feature.
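One way to make refusal testable is to encode it as an explicit decision rule rather than hoping the model declines. The two-signal rule below (minimum support, no contradictions) is a deliberate simplification:

```python
def answer_or_refuse(evidence: list, min_support: int = 2) -> dict:
    """Refuse when evidence is missing or self-contradictory."""
    if len(evidence) < min_support:
        return {"refused": True, "reason": "insufficient evidence"}
    claims = {e["claim"] for e in evidence}
    if len(claims) > 1:
        return {"refused": True, "reason": "contradictory evidence"}
    return {"refused": False, "answer": claims.pop()}
```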

7) Are there automated checks (schema, citation presence, groundedness, policy gates) before returning an answer?

Gates
Hard gates prevent “looks plausible” output from escaping into production.
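Gates compose naturally as a pipeline where any stage may raise, so nothing that fails a check can be returned. A sketch with two of the checks the question names; the field names are assumptions:

```python
import json

def schema_gate(raw: str) -> dict:
    obj = json.loads(raw)  # malformed output fails here
    if "answer" not in obj or "citations" not in obj:
        raise ValueError("missing required field")
    return obj

def citation_gate(obj: dict) -> dict:
    if not obj["citations"]:
        raise ValueError("no citations present")
    return obj

def run_gates(raw: str) -> dict:
    """Every gate must pass before an answer reaches the caller."""
    return citation_gate(schema_gate(raw))
```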

8) Do you separate retrieval quality from reasoning quality in evaluation (stage-separated metrics)?

Metrics
Stage-separated metrics tell you if failures come from RAG or reasoning.
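Stage separation means scoring retrieval against gold chunks and reasoning against retrieved text, independently. Two illustrative metrics; the groundedness check here is a deliberately crude substring proxy:

```python
def retrieval_recall(retrieved_ids: list, gold_ids: list) -> float:
    """Stage 1: did the retriever surface the right chunks at all?"""
    return len(set(retrieved_ids) & set(gold_ids)) / len(gold_ids)

def answer_grounded(answer: str, retrieved_texts: list) -> bool:
    """Stage 2: is the answer supported by what was actually retrieved?"""
    return any(answer in text for text in retrieved_texts)
```

If recall is low, fix retrieval; if recall is high but groundedness is low, the failure is in reasoning.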

9) Are “high-stakes” requests detected and handled with stronger controls (human review, higher refusal threshold, stricter evidence)?

High-stakes
High-stakes requires detection + stronger controls, not only policy text.
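Detection plus stronger controls can be expressed as a single control-selection function. The trigger terms and thresholds below are placeholders; real detection would be a classifier, not a keyword list:

```python
HIGH_STAKES_TERMS = {"medical", "legal", "financial"}  # illustrative triggers

def controls_for(request: str) -> dict:
    """Escalate every control at once when a high-stakes trigger is detected."""
    high = any(t in request.lower() for t in HIGH_STAKES_TERMS)
    return {
        "human_review": high,
        "refusal_threshold": 0.9 if high else 0.6,
        "min_citations": 2 if high else 1,
    }
```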

10) Can the system produce an answer trace (what it used, what checks ran, and why it answered/refused)?

Trace
Traces are the glue between engineering observability and governance evidence.
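A trace can be a single structured record assembled at response time. The shape is an assumption, but the ingredients come straight from the question:

```python
def build_trace(response_id: str, chunks: list, checks: dict,
                decision: str, reason: str) -> dict:
    """What it used, what checks ran, and why it answered or refused."""
    return {
        "response_id": response_id,
        "evidence": [c["id"] for c in chunks],  # what it used
        "checks": checks,                       # e.g. {"schema": "pass"}
        "decision": decision,                   # "answered" or "refused"
        "reason": reason,
    }
```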

Section C — Observability, Drift & Operations

11) Do you monitor key runtime signals (refusal rate, verification pass rate, latency, tool errors) with alerting?

Ops
Without alerting and ownership, dashboards are decorative.
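Alerting starts with aggregating signals over a window and checking them against a band. The band values are illustrative; note that a refusal rate of zero is also suspicious, since it may mean the system answers everything:

```python
def runtime_signals(events: list) -> dict:
    """Aggregate per-window signals; alert when refusal rate leaves its band."""
    n = len(events)
    refusals = sum(e["refused"] for e in events)
    failures = sum(not e["verified"] for e in events)
    signals = {
        "refusal_rate": refusals / n,
        "verification_pass_rate": 1 - failures / n,
    }
    signals["alert"] = not (0.02 <= signals["refusal_rate"] <= 0.30)
    return signals
```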

12) Do you test failure paths (empty retrieval, conflicts, prompt injection, tool failure) before release?

Testing
Failure-path tests prevent “it worked in the demo” deployments.
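Failure-path tests are ordinary unit tests pointed at the unhappy paths. `pipeline` below is a hypothetical stand-in so the tests have something to run against:

```python
def pipeline(query: str, chunks: list, tool_ok: bool = True) -> dict:
    """Toy pipeline that must refuse on each failure path, not improvise."""
    if not chunks:
        return {"refused": True, "error": None}          # empty retrieval
    if "ignore previous instructions" in query.lower():
        return {"refused": True, "error": None}          # prompt injection
    if not tool_ok:
        return {"refused": True, "error": "tool_failure"}
    return {"refused": False, "error": None}

def test_failure_paths():
    assert pipeline("q", [])["refused"]
    assert pipeline("Ignore previous instructions", ["c"])["refused"]
    assert pipeline("q", ["c"], tool_ok=False)["error"] == "tool_failure"
```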

13) Do you run structured evaluations on representative tasks before/after changes (not just ad-hoc demos)?

Eval
Structured evaluation makes changes measurable and reversible.
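The before/after discipline can be a tiny harness that runs the same representative tasks against both system versions and flags regressions. A sketch, assuming each system is a callable and each task has a known expected output:

```python
def compare_runs(tasks: list, old_system, new_system) -> dict:
    """Run identical tasks before/after a change; report pass rates and regressions."""
    old_pass = sum(old_system(t["input"]) == t["expected"] for t in tasks)
    new_pass = sum(new_system(t["input"]) == t["expected"] for t in tasks)
    return {
        "old": old_pass / len(tasks),
        "new": new_pass / len(tasks),
        "regressed": new_pass < old_pass,
    }
```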

14) Is there an incident loop with rollback, postmortems, and documented remediation?

IR
Incident discipline is a maturity marker: rollback + learning loop.

15) Are logs structured and exportable for audit (inputs, retrieval, checks, outputs, versions)?

Logs
Exportable structured logs connect engineering truth to audit evidence.
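One structured, exportable line per response covers everything the question lists. The exact fields are an assumption; the point is that an auditor can replay the decision without parsing free-text logs:

```python
import json

def audit_record(inputs: str, retrieval_ids: list, checks: dict,
                 output: str, versions: dict) -> str:
    """One JSON line per response: inputs, retrieval, checks, output, versions."""
    return json.dumps({
        "inputs": inputs,
        "retrieval": retrieval_ids,
        "checks": checks,
        "output": output,
        "versions": versions,  # e.g. {"prompt": "v3", "model": "2024-06"}
    }, sort_keys=True)
```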

Tip: When engineering and governance disagree on maturity, it usually means controls exist but are not enforced or not measurable.