There is a step between 'it works in testing' and 'go live' that most enterprises skip

AI pre-production validation is the missing step between model testing and production. Shadow mode testing and human review gates protect industrial AI from costly failures.

Santhosh Sagar Reddy

Jun 23, 2026

Pre-production sign-off is the control layer between a technically validated model and a production AI system. It is where technical validation meets business accountability — and where the most expensive shortcuts in enterprise AI are usually taken.

There is a particular moment in industrial AI programmes that we have learned to watch for. The data science team has built a model. The metrics on the test set are strong. The proof-of-concept demo went well in the conference room. The business sponsors are visibly enthusiastic. And then, almost always at this point, someone asks the question that decides whether the programme succeeds or quietly unravels over the next six months: can we deploy it next week?

In our experience, the answer almost always should be no. Not because the model is bad, but because there is a stage between testing and go-live that most enterprises skip entirely. Stage 04 of the CoffeeBeans AI Productionization Value Chain exists for exactly this reason. It is the validation layer that confirms the model is not only technically sound, but also operationally usable, business-relevant, and safe enough to influence real decisions in a working mine or processing plant.

Over the last six articles in this series, we have argued that mining operations have a data readiness problem rather than a data shortage, that the historian is not the foundation, that raw sensor data is not a feature, that train/serve consistency determines whether features hold in production, that the algorithm is rarely the hardest part of predictive maintenance, and that a model that cannot be reproduced cannot be productionised. Each of those arguments concerns work that precedes deployment. This week, we focus on the discipline that immediately precedes deployment — and the most common point at which industrial AI programmes take an expensive shortcut.

KEY POINT: A model that works in testing is not automatically ready for production. Pre-production sign-off is the difference.

The shortcut that costs enterprises the most

Strong testing metrics are treated as evidence of operational readiness

The pattern is recognisable in almost every enterprise AI programme we have seen. The model performs well on historical data. The data science team presents strong evaluation metrics. The demo is convincing. Leadership reviews the work, asks reasonable questions, and approves the next step. The model is moved toward production. Within a few weeks of go-live, operational realities expose gaps that the testing process never surfaced. Alerts behave differently in live operating conditions. Maintenance teams cannot act on the recommendations in the available window. The plant control room loses confidence in the output. The model keeps running in the technical sense, but nobody acts on what it says.

The shortcut is not that anyone is careless. The shortcut is structural. The organisation treats a successful test as evidence that the model is ready, when in fact the test only confirms that the model performs well on controlled data. The conditions a working SAG mill, a primary crusher, or a haul truck fleet will present in live operation are not present in the test environment. Sensor delays, manual overrides, shift transitions, abnormal ore characteristics, planned maintenance windows, and the constant operational improvisation of a working plant are all absent from the controlled dataset.

Pre-production sign-off is the discipline that closes that gap before the model goes live, not after.

A strong demo is not the same as production readiness. The cost of confusing the two falls on operations, not on the project team that approved the move.

What AI pre-production validation should include

The core checks that turn a tested model into a deployable one

AI pre-production validation is not a single test. It is a structured set of checks that confirm the model is safe and operationally appropriate to deploy. In our experience working with mining and industrial organisations, the following are the components that matter most:

Business metric validation. Model performance is evaluated against the operational decision it is meant to support, not only against statistical accuracy.
Technical metric validation. Performance is re-confirmed against a holdout dataset that the model has not been tuned against, with attention to precision, recall, lead time, and false-alarm cost.
Data quality review. The data sources feeding the live pipeline are reviewed for freshness, completeness, and consistency with the training environment.
Feature pipeline review. The production feature pipeline is confirmed to generate features identical to those used in training. This is the Stage 02 train/serve consistency check executed at deployment time.
Model reproducibility review. The experiment that produced the model can be re-executed against the original dataset and feature set, with the same result.
Edge case and stress testing. The model is exposed to abnormal operating conditions, sensor outages, missing data, and shift transitions to confirm it behaves sensibly.
Risk and bias review. The model is evaluated for systematic errors across operating modes, asset classes, shifts, and ore types.
Human review process. Maintenance, operations, and reliability leaders review the model output for operational sense and actionability.
Deployment readiness and rollback plan. Operational ownership, monitoring, escalation procedures, and rollback paths are confirmed before go-live.
Documented go/no-go criteria. The conditions under which the model will be approved for production are agreed in advance and applied without renegotiation at the decision point.
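
The feature pipeline review in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any particular tool: the function name feature_mismatches, the feature names, and the values are all hypothetical. It compares the features produced for the same asset and timestamp by the training pipeline and the production pipeline, and reports any that differ or are missing on either side.

```python
import math


def feature_mismatches(training_row, serving_row, rel_tol=1e-9, abs_tol=1e-12):
    """Return the names of features that differ between the two pipelines.

    A feature counts as mismatched if it is missing on either side, or if
    the two values differ by more than the given tolerances.
    """
    mismatched = []
    for name in sorted(set(training_row) | set(serving_row)):
        if name not in training_row or name not in serving_row:
            mismatched.append(name)
            continue
        if not math.isclose(training_row[name], serving_row[name],
                            rel_tol=rel_tol, abs_tol=abs_tol):
            mismatched.append(name)
    return mismatched


# Hypothetical feature values for one asset at one timestamp.
training_features = {
    "bearing_temp_mean_1h": 71.4,
    "vibration_rms_10m": 3.20,
    "power_draw_kw": 8450.0,
}
serving_features = {
    "bearing_temp_mean_1h": 71.4,
    "vibration_rms_10m": 3.35,  # e.g. a different window alignment in production
    "power_draw_kw": 8450.0,
}

mismatches = feature_mismatches(training_features, serving_features)
print(mismatches)  # ['vibration_rms_10m']
```

In practice a check like this would run over a representative sample of timestamps and assets rather than a single row, and any mismatch would be investigated before sign-off.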

Shadow mode testing in mining operations

Run the model alongside the plant before letting it influence the plant

Shadow mode testing is one of the most practical and underused controls in industrial machine learning. The concept is straightforward: the model runs in parallel with live operations on real production data, but its outputs do not yet drive any decision or action. The model is producing predictions. The operation is producing outcomes. The two are observed side by side over a defined period.

In a mining context, shadow mode looks like the following:

A predictive maintenance model generates SAG mill failure risk scores in real time, but maintenance planning continues to use the existing rule-based and operator-led process. The model's predictions are logged and reviewed weekly against actual outcomes.
A conveyor downtime model produces alerts that are sent only to the data science team and reliability engineers. The alerts are not surfaced to operations until alert quality has been confirmed.
A fleet optimisation model produces recommendations that are compared against dispatcher decisions. The model is judged on whether its recommendations would have improved on the dispatcher's decisions had they been followed.

Shadow mode answers a question that controlled testing cannot answer: does the model behave sensibly when it meets the conditions a working operation actually presents? The cost of running a model in shadow mode is modest. The cost of skipping shadow mode and discovering the model behaves erratically in production is rarely modest.
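
To make the idea concrete, the following is a minimal Python sketch of a shadow mode review, assuming a simple log in which each record pairs a model risk score with what actually happened in that period. The ShadowRecord structure, the summarise_shadow_log function, the 0.7 threshold, and the sample records are hypothetical and chosen only for illustration. Nothing in the sketch triggers an action: it only counts what the model would have done against what the operation observed.

```python
from dataclasses import dataclass


@dataclass
class ShadowRecord:
    period: str             # label for the review period, e.g. a week
    risk_score: float       # model output, logged but not acted on
    failure_occurred: bool  # outcome observed by the operation


def summarise_shadow_log(records, alert_threshold):
    """Compare would-be alerts with actual outcomes over the shadow period."""
    true_alerts = false_alarms = missed = 0
    for record in records:
        would_alert = record.risk_score >= alert_threshold
        if would_alert and record.failure_occurred:
            true_alerts += 1
        elif would_alert:
            false_alarms += 1
        elif record.failure_occurred:
            missed += 1
    alerts = true_alerts + false_alarms
    failures = true_alerts + missed
    return {
        "true_alerts": true_alerts,
        "false_alarms": false_alarms,
        "missed_failures": missed,
        "precision": true_alerts / alerts if alerts else None,
        "recall": true_alerts / failures if failures else None,
    }


# Hypothetical six-week shadow log for a single asset.
shadow_log = [
    ShadowRecord("W01", 0.82, True),
    ShadowRecord("W02", 0.10, False),
    ShadowRecord("W03", 0.75, False),
    ShadowRecord("W04", 0.30, True),
    ShadowRecord("W05", 0.91, True),
    ShadowRecord("W06", 0.20, False),
]

summary = summarise_shadow_log(shadow_log, alert_threshold=0.7)
print(summary)
```

With this sample log, the model would have raised two true alerts and one false alarm, and missed one failure. A weekly review would examine each of those cases with the reliability engineers rather than rely on the counts alone.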

KEY POINT: Shadow mode testing lets the model meet operational reality before operational reality has to rely on the model.

Human review gates are a safety mechanism, not a delay

Operational accountability cannot be automated away

In high-consequence industrial settings, human review gates are a non-negotiable component of pre-production sign-off. The model is technical. The decision to deploy it is not. Maintenance leaders, operations managers, reliability engineers, plant teams, data science teams, IT and OT leadership, and the business owner of the use case each have a role in confirming whether the model is appropriate to deploy.

A useful human review process answers a small number of operationally critical questions:

Does the recommendation make operational sense to the people who run the plant?
Are the alerts actionable within the time the maintenance or operations team has to respond?
Is the false-alarm rate manageable, or will it erode trust with operators?
Are missed detections acceptable given the operational and safety consequences?
Who owns the decision when the model recommends an action?
What is the operational protocol when the model is wrong?

In our experience, review gates that are designed thoughtfully do not slow deployment. They protect it. They surface the operational concerns that would otherwise emerge after go-live, when correction is far more expensive and trust is far harder to rebuild.

Human review gates are not a delay. They are the moment at which the operation accepts accountability for the model. That moment matters.

Why business metric validation matters more than accuracy

The model must be judged by the decision it supports

A technically accurate model that does not support the operational decision well is not a successful model. Enterprise model validation disciplines exist specifically to anchor model performance to the business outcome the model is intended to improve. In a mining environment, that outcome is rarely 'accuracy.' It is some combination of downtime reduction, maintenance planning improvement, alert usefulness, false-alarm burden, lead time before failure, throughput preservation, safety risk reduction, and operator trust.

What technical validation answers	What business validation must also answer
What is the model's accuracy on the test set?	Does the model support the operational decision the business case promised?
What is the AUC across all classes?	How much lead time does the model give before failure?
What is the cross-validation score?	What is the false-alarm burden on the operations team?
What is the precision on the held-out sample?	Are the alerts actionable within the available response window?
	Does the model materially improve on the existing baseline?
	Will operators and maintenance teams trust the output?

For leadership teams, the practical implication is that pre-production sign-off must include the operational stakeholders, not only the data science team. The decision to deploy is a business decision, anchored in technical evidence but accountable to operational reality.
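
Several of the business validation questions in the table above can be answered with simple arithmetic once alerts have been matched to outcomes. The sketch below is one possible way to do that in Python. It assumes each alert has already been paired with the failure it preceded, or with None if no failure followed. The function name business_summary, the 24-hour response window, and the sample timestamps are hypothetical.

```python
import statistics
from datetime import datetime


def business_summary(alerts, response_window_hours, weeks_observed):
    """Summarise lead time, actionability, and false-alarm burden.

    alerts is a list of (alert_time, failure_time_or_None) pairs.
    """
    lead_times = [
        (failure - alert).total_seconds() / 3600.0
        for alert, failure in alerts
        if failure is not None
    ]
    false_alarms = sum(1 for _, failure in alerts if failure is None)
    # An alert is actionable only if the team had enough time to respond.
    actionable = sum(1 for hours in lead_times if hours >= response_window_hours)
    return {
        "true_alerts": len(lead_times),
        "actionable_alerts": actionable,
        "median_lead_time_hours": statistics.median(lead_times) if lead_times else None,
        "false_alarms_per_week": false_alarms / weeks_observed,
    }


# Hypothetical alerts over a four-week observation period.
alerts = [
    (datetime(2026, 3, 2, 6, 0), datetime(2026, 3, 4, 6, 0)),     # 48 h lead time
    (datetime(2026, 3, 10, 12, 0), datetime(2026, 3, 10, 18, 0)), # 6 h lead time
    (datetime(2026, 3, 15, 0, 0), None),                          # false alarm
    (datetime(2026, 3, 20, 8, 0), datetime(2026, 3, 23, 8, 0)),   # 72 h lead time
    (datetime(2026, 3, 25, 0, 0), None),                          # false alarm
]

result = business_summary(alerts, response_window_hours=24, weeks_observed=4)
print(result)
```

In this sample, all three true alerts would count toward precision, but only two arrive early enough to act on. Statistical accuracy alone would not show that difference.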

KEY POINT: Skipping pre-production validation is often the most expensive shortcut in enterprise AI.

The leadership mistake we see repeatedly

Asking 'is the model accurate enough?' when the right question is something else

The most common leadership question at the deployment threshold is: 'is the model accurate enough to deploy?' The question is reasonable. It is also incomplete. Accuracy on a controlled dataset is a necessary condition for deployment, not a sufficient one. The question that determines whether the deployment succeeds is different.

Has the model demonstrated, under conditions resembling production, that it can support the operational decision safely, reliably, and accountably? That question requires evidence from shadow mode testing, human review, business metric validation, and an explicit go/no-go decision. It cannot be answered by training metrics alone. The organisations that scale industrial AI successfully are the ones that ask this question routinely and treat it as the deployment gate.
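
An explicit go/no-go decision is easier to apply without renegotiation when the criteria are written down in a form that can be evaluated mechanically. The following Python sketch shows one way this could look. The Criterion class, the go_no_go function, the criterion names, and every threshold are hypothetical illustrations, not recommended values. The criteria are frozen once defined, and missing evidence counts as a failure rather than a pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    threshold: float
    higher_is_better: bool = True

    def passes(self, observed):
        if self.higher_is_better:
            return observed >= self.threshold
        return observed <= self.threshold


def go_no_go(criteria, evidence):
    """Return ("GO" or "NO-GO", list of failed criterion names)."""
    failed = []
    for criterion in criteria:
        observed = evidence.get(criterion.name)
        # Missing evidence is treated as a failed criterion.
        if observed is None or not criterion.passes(observed):
            failed.append(criterion.name)
    return ("GO" if not failed else "NO-GO", failed)


# Hypothetical criteria, agreed before the evidence is collected.
criteria = [
    Criterion("shadow_precision", 0.60),
    Criterion("median_lead_time_hours", 24.0),
    Criterion("false_alarms_per_week", 1.0, higher_is_better=False),
    Criterion("human_review_signoffs", 3),
]

# Hypothetical evidence gathered during pre-production validation.
evidence = {
    "shadow_precision": 0.67,
    "median_lead_time_hours": 48.0,
    "false_alarms_per_week": 1.5,
    "human_review_signoffs": 3,
}

decision, failed = go_no_go(criteria, evidence)
print(decision, failed)  # NO-GO ['false_alarms_per_week']
```

Here the model clears three of the four criteria and is still not approved, because the false-alarm burden exceeds what was agreed in advance. The decision then turns on fixing that criterion, not on relaxing it at the gate.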

How Stage 04 connects to deployment and beyond

The bridge between experimentation and operational AI

Stage 04 of the CoffeeBeans AI Productionization Value Chain is structurally a bridge. It connects Stage 03 (model building and experimentation) to Stage 05 (deployment and serving) and Stage 06 (model health and performance). When Stage 04 is skipped, the downstream consequences are predictable.

Stage 05, Model Deployment and Serving, inherits a model whose operational behaviour has not been confirmed. The deployment is technically successful and operationally precarious from day one. Rollback and remediation cycles become routine.

Stage 06, Model Health and Performance, begins without a confirmed operational baseline. When performance degrades, the team cannot distinguish whether the model is drifting from its trained behaviour, or whether the trained behaviour itself was never operationally appropriate to begin with.

Stage 04, executed properly, prevents both of these failure modes. It is not optional infrastructure. It is the discipline that turns model experimentation into production AI that operations teams can trust.

How CoffeeBeans helps

Building the pre-production discipline that industrial AI requires

CoffeeBeans works with mining and industrial organisations to build AI pre-production validation discipline into the productionisation chain. Our Stage 04 engagements begin with the design of the validation framework itself: which checks apply to which use cases, who participates in human review, what the documented go/no-go criteria look like, and how the validation outputs flow into the model registry and deployment pipeline.

From there, we operationalise the discipline. Shadow mode testing is configured against live operational data. Business metric validation is structured around the specific operational decision each model is meant to support. Human review gates are designed to be substantive without becoming bureaucratic. And the deployment readiness checks, including rollback procedures, monitoring plans, and ownership clarity, are completed before, not after, go-live.

The objective of enterprise model validation is not to slow industrial AI programmes. It is to make sure the models that reach production are the ones that deserve to be there, and that the operational teams who depend on them have good reasons to trust them.

Is your AI moving from testing straight to production?

If your industrial AI programme has models that perform well in testing but lose trust quickly after deployment, the gap is almost certainly in Stage 04. CoffeeBeans can help your team design the AI pre-production validation framework, operationalise shadow mode testing, and embed the human review gates and go/no-go criteria that enterprise model validation requires. Talk to our Enterprise AI practice about pre-production sign-off in your operation.
