Who this questionnaire is for
Engineering teams, ML engineers, platform teams, and technical leads building or operating AI systems.
What it assesses
The technical maturity of AI systems — including evaluation practices, failure-path testing, monitoring, logging, drift detection, and separation of retrieval vs reasoning.
How it helps
This questionnaire reveals whether systems are engineered for reliability, not just demos. It identifies gaps that let behaviour drift between releases, obscure failures, and hinder effective debugging and audit review. Results help teams prioritise engineering work that directly improves trustworthiness and operational resilience.
Best used when
- Preparing systems for production
- Investigating unexplained failures or regressions
- Strengthening monitoring and evaluation pipelines
AI Engineering Maturity (Systems)
Scoring updates automatically as you answer. The fields at the top are optional: they support internal benchmarking but do not affect the score.
Section A — Tasking, Retrieval & Data Discipline
1) Tasking: Are tasks typed and routed (e.g., summarise vs extract vs decide) with different policies and prompts?
2) Retrieval: Is retrieval constrained to approved sources (domain/corpus allowlist) with metadata filters?
3) Freshness: Do you control freshness (document dates, snapshots, versioning) and prevent outdated retrieval?
4) Provenance: Do you store retrieval context (top-k, scores, chunk IDs) per response for audit/debug?
5) Change control: Are prompts/tools treated as code (reviews, tests, versioning, rollback)?
Section B — Verification, Refusal & Safety Behaviour
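To make Q4 (Provenance) in Section A concrete, here is a minimal sketch of a per-response retrieval record that can be serialised into structured audit logs. All class and field names are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class RetrievedChunk:
    chunk_id: str   # stable ID of the chunk within the corpus
    source: str     # document or URL the chunk came from
    score: float    # retriever similarity score

@dataclass
class RetrievalRecord:
    response_id: str
    query: str
    top_k: int
    chunks: list    # list[RetrievedChunk]

    def to_json(self) -> str:
        # Serialise for structured, exportable audit logs.
        return json.dumps(asdict(self), sort_keys=True)

record = RetrievalRecord(
    response_id="resp-001",
    query="What is the refund policy?",
    top_k=2,
    chunks=[RetrievedChunk("doc7#c3", "policies/refunds.md", 0.91),
            RetrievedChunk("doc7#c4", "policies/refunds.md", 0.84)],
)
print(record.to_json())
```

Storing one such record per response is what makes later audit and debugging tractable: a reviewer can reconstruct exactly what the model saw.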
6) Refusal: Does the system reliably refuse or defer when evidence is missing or contradictory?
7) Gates: Are there automated checks (schema, citation presence, groundedness, policy gates) before returning an answer?
8) Metrics: Do you separate retrieval quality from reasoning quality in evaluation (stage-separated metrics)?
9) High-stakes: Are “high-stakes” requests detected and handled with stronger controls (human review, higher refusal threshold, stricter evidence)?
10) Trace: Can the system produce an answer trace (what it used, what checks ran, and why it answered/refused)?
Section C — Observability, Drift & Operations
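As an illustration of the pre-return checks asked about in Q7 (Gates), a minimal sketch of a gate function running schema, citation-presence, and a naive groundedness check. The checks and function name are illustrative assumptions; real gates would be richer.

```python
def passes_gates(answer: dict, retrieved_texts: list) -> tuple:
    """Run cheap automated checks before returning an answer.

    Returns (ok, reasons). Illustrative only, not exhaustive.
    """
    reasons = []
    # Schema gate: required keys present and correctly typed.
    if not isinstance(answer.get("text"), str) or not answer["text"].strip():
        reasons.append("schema: missing answer text")
    if not isinstance(answer.get("citations"), list):
        reasons.append("schema: citations must be a list")
    # Citation-presence gate: the answer must cite something.
    elif not answer["citations"]:
        reasons.append("citations: none provided")
    # Naive groundedness gate: each citation must index a retrieved chunk.
    else:
        for c in answer["citations"]:
            if c not in range(len(retrieved_texts)):
                reasons.append(f"groundedness: citation {c} not in retrieval set")
    return (not reasons, reasons)

ok, why = passes_gates(
    {"text": "Refunds take 14 days.", "citations": [0]},
    ["Refunds are processed within 14 days."],
)
```

Failing any gate should route the response to refusal, retry, or human review rather than returning it to the user.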
11) Ops: Do you monitor key runtime signals (refusal rate, verification pass rate, latency, tool errors) with alerting?
12) Testing: Do you test failure paths (empty retrieval, conflicts, prompt injection, tool failure) before release?
13) Eval: Do you run structured evaluations on representative tasks before/after changes (not just ad-hoc demos)?
14) IR: Is there an incident loop with rollback, postmortems, and documented remediation?
15) Logs: Are logs structured and exportable for audit (inputs, retrieval, checks, outputs, versions)?
Tip: When engineering and governance disagree on maturity, it usually means controls exist but are not enforced or not measurable.
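The runtime-signal monitoring asked about in Q11 (Ops) can be sketched as a rolling-window monitor with alert thresholds. Class name, window size, and thresholds are illustrative assumptions to be tuned per system.

```python
from collections import deque

class SignalMonitor:
    """Rolling-window monitor for refusal and verification-pass rates."""

    def __init__(self, window=100, max_refusal_rate=0.3, min_pass_rate=0.9):
        self.events = deque(maxlen=window)  # (refused, verified) pairs
        self.max_refusal_rate = max_refusal_rate
        self.min_pass_rate = min_pass_rate

    def record(self, refused: bool, verified: bool):
        self.events.append((refused, verified))

    def alerts(self):
        n = len(self.events)
        if n == 0:
            return []
        refusal_rate = sum(r for r, _ in self.events) / n
        pass_rate = sum(v for _, v in self.events) / n
        out = []
        if refusal_rate > self.max_refusal_rate:
            out.append(f"refusal rate {refusal_rate:.2f} above threshold")
        if pass_rate < self.min_pass_rate:
            out.append(f"verification pass rate {pass_rate:.2f} below threshold")
        return out

mon = SignalMonitor(window=50)
mon.record(refused=False, verified=True)
print(mon.alerts())  # prints []
```

In production the same signals would typically feed an existing metrics stack; the point of Q11 is that they are computed and alerted on at all, not the mechanism.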
