Demystifying and Debunking ML Part 3: Model Development, Optimisation, Fitting & Testing

  • AI
  • 29th January 2026

Article by Hugh Stitt, Robert Gallen, Joe Emerson and Carl Jackson

Hugh Stitt, Joe Emerson, Carl Jackson and Robert Gallen outline practical principles for building and validating predictive models in chemical engineering

Quick read

  • Model choice and evaluation must be guided by domain knowledge and the intended use case: Algorithm selection, performance metrics and validation strategies should reflect the problem type, data characteristics and real-world decision context
  • Generalisation, not training performance, defines model quality: Robust validation, appropriate metrics and careful scrutiny of underfitting, overfitting and generalisation gaps are essential to ensure models perform reliably on unseen data
  • Data leakage is a critical and often overlooked risk: Test contamination, target leakage and careless preprocessing can severely inflate apparent performance, making rigorous data handling and sceptical validation practices essential

THIS third article in the series introduces core principles of predictive modelling for small- to medium-sized tabular datasets common in chemical engineering. These principles underpin a wide range of supervised learning applications, including design optimisation, process control and soft sensing.

Algorithm selection

When beginning a data-driven modelling project, the first step is to decide which types of algorithms to consider. Clearly, we cannot test every algorithm and some are more suited to certain problems and datasets than others. This is where domain knowledge becomes valuable in narrowing the options down to a handful. As an example, a linear regression model will never capture the non-linearities of Arrhenius reaction data, whereas a Random Forest model can (see Figure 1). Factors such as problem type (eg regression or time series forecasting), dataset size and nature, relationship complexity (including non-linearities) and needs such as regularisation or uncertainty estimation should all inform algorithm selection.

Figure 1: Comparison of linear vs non-linear model fitting Arrhenius reaction data
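As a minimal sketch of this point, the snippet below fits a linear model and a random forest to synthetic Arrhenius-type rate data; the pre-exponential factor, activation energy and noise level are illustrative, and scikit-learn is assumed as the toolkit:

```python
# Sketch: linear vs non-linear model on synthetic Arrhenius-type rate data.
# Rate parameters and noise level are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
T = rng.uniform(300.0, 600.0, 300)                     # temperature / K
k = 1e6 * np.exp(-50_000.0 / (8.314 * T))              # Arrhenius rate constant
k_obs = k * (1 + 0.05 * rng.standard_normal(T.size))   # add 5% multiplicative noise

X_tr, X_te, y_tr, y_te = train_test_split(T.reshape(-1, 1), k_obs, random_state=0)
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_tr, y_tr)
    print(f"{type(model).__name__}: test R2 = {r2_score(y_te, model.predict(X_te)):.2f}")
```

The linear model cannot follow the exponential curvature, while the random forest captures it with no change to the workflow.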

Model performance appraisal and optimisation

Model performance evaluation and optimisation are central to any ML project: they are essential for selecting between models and pipelines and for assessing whether a model is fit for purpose. Avoiding the pitfalls described in this article requires rigorous methods for testing a model’s predictive capability. The goal is to establish models with good “generalisation” ability, meaning they perform well on new, unseen data, not just on the data used in training. Such models achieve a good fit to the data while avoiding underfitting or overfitting the training dataset.

As an extreme example of a model that has fallen into these traps, consider a model that predicts product yield across a continuous fixed-bed reactor using the mean reactor temperature and pressure, the H2 feed ratio, residence time and the raw material batch number. Assessing the parity plot and scores (see Figure 2), the training performance looks good (R2 = 0.82). However, on unseen data the performance is far worse (R2 = 0.27). The learning curve (see Figure 3) shows that the gap between the scores narrows slightly with more data, but the training error remains significantly lower than the test error.

Figure 2: Parity plot of an overfitted yield prediction model across a fixed-bed reactor
Figure 3: Effect of dataset size on our yield model performance, highlighting the generalisation gap
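A learning curve of this kind can be generated with scikit-learn’s learning_curve utility; the sketch below uses a synthetic stand-in dataset, so the numbers are illustrative only:

```python
# Sketch: learning curve exposing the gap between training and validation scores.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

# Stand-in dataset; in practice X would hold reactor conditions and y the yield.
X, y = make_regression(n_samples=400, n_features=5, noise=20.0, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(random_state=0), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring="r2")

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train R2={tr:.2f}  validation R2={va:.2f}  gap={tr - va:.2f}")
```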

This second example highlights the importance of proper performance appraisal and assessing model generalisability. The model was trained to predict product particle size (d50) from a mill, yet the test results show clear clustering (see Figure 4), rendering the model unfit for any decision-making. Robust appraisal and optimisation would have identified these issues early on. In our opinion, the only thing worse than a poorly performing model is one that you think performs well but in fact generalises disastrously.

Figure 4: Poor performance on truly unseen test data of a particle size prediction model

Performance metrics

Defining a model’s performance metric is a key first task. The metric shapes assessment, comparison and suitability for the final application. Choice depends on the ML task (eg classification vs regression) and dataset characteristics. Poorly chosen metrics can be misleading.

A common example is using “accuracy” on an imbalanced classification task when we care about all classes equally. Imagine you were training a classification model on a reactor dataset, aiming to classify when the vessel is at risk of thermal runaway. Would you be happy with an accuracy of 99.5%? You probably shouldn’t be! In Figure 5, the model always predicts “Normal Operation”. Accuracy mostly reflects the model’s performance on the dominant class with the most samples (“Normal Op”), giving very little weight to the infrequent but vital thermal runaway class. Since it is critical to detect runaways, a much better choice of metric is one that emphasises detection of runaway, albeit at the cost of more false positives. Thinking your reactor might blow up when it is actually safe is a better option than believing you are in a safe operating mode when you aren’t.

Ultimately, the chosen metric should reflect what matters most in the final application. Well-chosen metrics highlight the aspects of performance that matter and expose weaknesses rather than hiding them.

Figure 5: Confusion matrix of a bad thermal runaway detection model, showing the number of correct and incorrect predictions
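The sketch below, on a hypothetical dataset in which only 0.5% of samples are runaways, shows how a model that always predicts “Normal Operation” still scores 99.5% accuracy while its recall on the runaway class is zero:

```python
# Sketch: why accuracy misleads on an imbalanced runaway-detection task.
# The class balance and the "always predict Normal Operation" baseline are illustrative.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = np.array([0] * 995 + [1] * 5)   # 1 = thermal runaway (rare)
y_pred = np.zeros_like(y_true)           # model that always predicts "Normal Op"

print("accuracy:", accuracy_score(y_true, y_pred))        # 0.995, looks great
print("runaway recall:", recall_score(y_true, y_pred))    # 0.0, every runaway missed
print(confusion_matrix(y_true, y_pred))
```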

Underfitting and overfitting

Underfitting occurs when a model is too simple to capture data patterns, performing poorly on both training and test sets, often because it makes strong assumptions that do not hold true. In the Figure 1 example, modelling reaction rate as a linear function of temperature is obviously inadequate. The Arrhenius equation tells us the relationship is nonlinear, so a linear model will miss the curvature entirely.

Overfitting occurs when a model is too complex, learning noise as signal. It performs well on training data but fails on unseen data. The model might pass through every training point but completely misses the actual pattern.

Underfitting and overfitting are illustrated in Figure 6 a) and c), respectively. The optimal balance between bias and variance results in a model that is neither overfitted nor underfitted, capturing the underlying trend in the data while rejecting noise (Figure 6 b)).

Either way, it is the test dataset performance that reveals the problem. That is why test scores are the most reliable way to judge how well a model is doing. These issues are closely tied to the bias-variance trade-off, which comes into play when choosing between models and tuning hyperparameters. Other factors matter too, like the quality and quantity of the data and how much useful variation it contains compared to noise; the “Vs” of the data.

Figure 6: Example of a model with a) high bias, b) a good balance between bias and variance and c) high variance
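One way to reproduce the behaviour in Figure 6 numerically is to fit polynomials of increasing degree to the same noisy curve; the degrees and noise level below are chosen purely for illustration:

```python
# Sketch of the bias-variance trade-off: polynomial models of increasing degree.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + 0.2 * rng.standard_normal(60)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=1)

for degree in (1, 4, 15):   # underfit, balanced, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x_tr, y_tr)
    print(f"degree {degree:2d}: "
          f"train MSE={mean_squared_error(y_tr, model.predict(x_tr)):.3f}  "
          f"test MSE={mean_squared_error(y_te, model.predict(x_te)):.3f}")
```

The low-degree model has high training and test error (bias), while the high-degree model has near-zero training error but poor test error (variance).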

Evaluating model performance

Evaluating the performance of predictive models starts with splitting your data. The basic idea is to train the model on one subset and test it on another that the model has not seen. This helps you understand how well the model generalises to new data. But depending on the dataset, things can get more nuanced and it is surprisingly easy to introduce data leakage if the split is not handled properly.

Figure 7 a) shows a simple train-validation-test split, which works well for large tabular datasets. Validation data are typically used to compare models and tune hyperparameters, aiming for the best performance on unseen data. This process helps balance the bias-variance trade-off and improve predictive accuracy.

Figure 7: Illustration of data splitting approaches for a) large datasets and b) small to medium size datasets

However, because validation scores are used to guide model choices, they can end up as overly optimistic estimates of predictive performance. That is why it is good practice to keep a separate test set, untouched during training and tuning, to obtain an unbiased estimate of generalisation error.

For larger datasets, a split like 60% training, 20% validation and 20% testing (or 80:10:10) usually works well, since the samples are rich enough to be representative. With smaller datasets, the splits can result in subsets with noticeably different distributions. In those cases, K-fold cross-validation, shown in Figure 7 b), is preferable because it smooths out the variability and gives a more reliable estimate of model performance. K-fold cross-validation works by dividing the training data into K subsets. The model is trained K times, each time leaving out one of the subsets as a validation set. The final score is the average across all K runs.

Cross-validation is very effective at maximising the usage of your available data but there are numerous pitfalls we would like to highlight. The first is “cherry-picking” the best score from your folds rather than using the average. The second is ignoring validation variance: large differences between fold scores can indicate bad data, outliers or an overly complex model. And don’t stop at just the validation scores: the training and validation scores together provide an opportunity to assess the trade-off between fit and overfit.
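A minimal K-fold sketch, assuming scikit-learn and a synthetic dataset, that reports the mean and spread of the fold scores rather than cherry-picking the best fold:

```python
# Sketch: K-fold cross-validation, reporting mean score and spread across folds.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=6, noise=15.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=cv, scoring="r2")

print("fold R2 scores:", scores.round(2))
print(f"mean = {scores.mean():.2f}, std = {scores.std():.2f}")   # report these, not max()
```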

The difference or ratio between the two scores indicates the “generalisation gap”, or the extent of overfit. A plot we use for assessing the fit vs overfit trade-off is one which we first saw from Barry Wise at Eigenvector.1 Referring to the same project as Figure 4, predicting the d50 of milled powder, Figure 8 shows model overfit, quantified as the ratio of the training score relative to the validation score on the y-axis, and the fit on the x-axis is simply the validation score. The colour indicates the maximum tree depth of the different random forest models. As the model becomes more complex, the fit improves at the cost of small increases in the overfit. At a depth greater than 3, the overfit ratio begins to increase significantly. The optimal region is the bottom left but specifically which model you select depends on your acceptable trade-off.

Figure 8: Effect of model hyperparameter (max tree depth) on fit and overfit
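A sweep of this kind can be scripted with scikit-learn’s cross_validate, which returns both training and validation scores; the dataset and depth values below are illustrative:

```python
# Sketch: fit vs overfit sweep — overfit ratio (train / validation R2) against
# validation R2 for increasing random forest depth.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=300, n_features=8, noise=25.0, random_state=0)

for depth in (2, 3, 5, 8, None):
    res = cross_validate(RandomForestRegressor(max_depth=depth, random_state=0),
                         X, y, cv=5, scoring="r2", return_train_score=True)
    fit = res["test_score"].mean()               # validation score ("fit")
    overfit = res["train_score"].mean() / fit    # ratio well above 1 signals overfitting
    print(f"max_depth={depth}: fit={fit:.2f}, overfit ratio={overfit:.2f}")
```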

While validation and testing scores are critical and ratios of training to testing or validation scores are useful diagnostics, visual assessment of model fit, predictions and residuals is often more informative and intuitive. As an example, we trained a regression model to predict the hardness of catalyst pellets from process variables measured by the pelleting machine sensors.2 Predictions on a holdout test set are shown in Figure 9. The test R2 score is 0.72 and the RMSE is 0.55 for data ranging from −3 to 2, suggesting reasonable predictive performance. However, visual inspection reveals strong heteroscedasticity in the residuals; that is, the error magnitude varies systematically across the prediction range. The model discriminates well between very low-strength pellets and the remainder but performs poorly when distinguishing between medium- and high-strength pellets. A contributing factor here is the reliability of the strength measurements themselves, which are known to have a ±30% margin of error.3

Data quality should always be questioned when interpreting model performance. Depending on the intended use case, this level of performance may be insufficient. This example highlights the importance of scrutinising models from multiple perspectives and reinforces that visual diagnostics should not be overlooked.

Figure 9: Performance of a pellet strength regression model visualised with a) parity plot showing actual vs predicted strength, and b) residuals vs predicted strength
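Parity and residual plots are straightforward to produce; the sketch below uses placeholder arrays with deliberately heteroscedastic errors in place of real pellet-strength data:

```python
# Sketch: parity and residual plots for a regression model. The arrays y_test
# and y_pred are synthetic placeholders for real holdout targets and predictions.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
y_test = rng.normal(size=100)                                        # placeholder targets
y_pred = y_test + (0.2 + 0.4 * np.abs(y_test)) * rng.standard_normal(100)  # heteroscedastic errors
residuals = y_test - y_pred

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(y_pred, y_test, s=12)
ax1.axline((0, 0), slope=1, color="k", lw=1)                         # parity line
ax1.set(xlabel="Predicted", ylabel="Actual", title="Parity plot")
ax2.scatter(y_pred, residuals, s=12)
ax2.axhline(0, color="k", lw=1)
ax2.set(xlabel="Predicted", ylabel="Residual", title="Residuals vs predicted")
plt.tight_layout()
plt.show()
```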

Model and hyperparameter optimisation

Most ML models have several hyperparameters to tune (see Table 1 for common regression models). These hyperparameters control aspects of the model’s algorithm that often influence its complexity and tendency to under- or overfit. With Random Forests, for example, you can tune the number of decision trees in the forest and the maximum depth of those trees, among other hyperparameters. The hyperparameter settings can make a huge difference to model performance, often comparable to or larger than the differences between model types, as shown in Figure 10.

A common pitfall is a lack of thorough hyperparameter optimisation. Indeed, papers may not report having conducted any hyperparameter tuning at all, or state only that they “manually tuned” their model. This is a shortfall for the reasons given in this section.

Grid searches are the most basic approach to hyperparameter tuning: all possible combinations of hyperparameters are tried and the combination giving the best validation performance (eg K-fold CV scores or simple validation set scores) is selected. This is appropriate for a simple model such as Partial Least Squares Regression, with few hyperparameters and low computational demand. More efficient options exist for complex models with numerous hyperparameters and high computational cost. Our preference is Optuna, which uses Bayesian optimisation under the hood to sample each new set of hyperparameters to try.4 Regardless of the approach, not following a hyperparameter optimisation protocol is leaving gas in the tank as far as we are concerned and is a common pitfall.

Table 1: Example machine learning models and example hyperparameters
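A hedged sketch of an Optuna search for a random forest is shown below; the hyperparameter ranges and the cross-validated R2 objective are illustrative choices, not a prescription:

```python
# Sketch: Bayesian hyperparameter search with Optuna, scoring each trial by
# cross-validated R2. Dataset and search ranges are illustrative only.
import optuna
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)

def objective(trial):
    model = RandomForestRegressor(
        n_estimators=trial.suggest_int("n_estimators", 50, 400),
        max_depth=trial.suggest_int("max_depth", 2, 12),
        min_samples_leaf=trial.suggest_int("min_samples_leaf", 1, 10),
        random_state=0)
    return cross_val_score(model, X, y, cv=5, scoring="r2").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params, round(study.best_value, 3))
```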

In smaller data science projects, we often see practitioners rely solely on cross-validation on the full dataset. The problem is that the same data have then been used to select the best hyperparameters/model AND to evaluate the model, so the evaluation will inevitably be over-optimistic. If a test dataset cannot be spared, a solution is to use nested cross-validation, sketched below. We will not cover it in detail here, but it provides the maximum usage of your dataset at the cost of much greater compute time. Table 2 can be used as an approximate guide.

Figure 10: Comparison of model performance on predicting reactor yield on a fixed-bed reactor dataset. RF is random forest and ANN is artificial neural network
Table 2: Starter data split approaches for different dataset sizes
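For completeness, a minimal nested cross-validation sketch (inner loop for tuning, outer loop for evaluation) might look like this, assuming scikit-learn and a synthetic dataset:

```python
# Sketch: nested cross-validation — the inner loop tunes hyperparameters,
# the outer loop estimates generalisation error on data never used for tuning.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=6, noise=15.0, random_state=0)

inner = KFold(n_splits=5, shuffle=True, random_state=0)
outer = KFold(n_splits=5, shuffle=True, random_state=1)

tuner = GridSearchCV(RandomForestRegressor(random_state=0),
                     param_grid={"max_depth": [2, 3, 5, None]},
                     cv=inner, scoring="r2")
outer_scores = cross_val_score(tuner, X, y, cv=outer, scoring="r2")
print(f"nested CV R2 = {outer_scores.mean():.2f} ± {outer_scores.std():.2f}")
```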

Data leakage

Data leakage is a plague of the machine learning literature and can produce over-optimistic testing performance or invalidate the model. Forms include:

1. Test contamination – testing data leaks into training
2. Target leakage – target variable is used in calculating a feature

Either of these inevitably provides a good, albeit invalid fit. We will deliberately implement these for illustration on the fixed-bed reactor dataset from Figure 1.

Test contamination can occur through explicit duplication of samples, or more subtly through improper data splitting. To illustrate, suppose we add 5% duplicate rows with slight noise in the yield. These duplicates are present in both training and test data, so the model will have memorised them when tested. Alternatively, say you want to predict reactor yield for new raw material batches, where inter-batch variability affects reactor performance. Your data-splitting strategy must reflect that, testing under those conditions. The incorrect approach is a typical random split of the data into training and testing sets, ignoring the raw material batch number: a given batch then appears in both training and testing sets (Figure 11 a)) and information about the test batches has leaked into the training data. The correct approach is to maintain batch integrity when splitting (Figure 11 b)), enabling genuine testing of predictivity on unseen batches, or on future batches (Figure 11 c)).

Figure 11: Data splitting to demonstrate test contamination (top), group splitting by batch number (middle) and time series group splitting (bottom)
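Group-aware splitting is available off the shelf; the sketch below uses scikit-learn’s GroupShuffleSplit with synthetic batch labels to guarantee that no batch spans both sets (GroupKFold plays the same role in cross-validation):

```python
# Sketch: splitting by raw material batch so that no batch appears in both
# training and test sets. Features and batch labels are synthetic.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))               # reactor conditions
batch = np.repeat(np.arange(12), 10)        # 12 raw material batches, 10 runs each

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=batch))

assert set(batch[train_idx]).isdisjoint(batch[test_idx])   # no shared batches
print("test batches:", sorted(set(batch[test_idx])))
```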

Another common form of test contamination occurs during preprocessing. While building a model, suppose you standardise the features in the dataset (subtract the mean and divide by the standard deviation), split the data into training and test sets, and then train and test the model on the respective data. That seems fine? Well, no! You have over-optimistically evaluated your model’s generalisation error due to data leakage: the scaling parameters were calculated from the whole dataset, including the test set, leaking the test data mean into the training data. Clean separation between training and testing data is paramount. This applies to all preprocessing, including feature selection, which must be fitted on the training dataset only. If you are doing cross-validation, all preprocessing must be fitted on the training fold data only.
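The simplest safeguard is to wrap preprocessing and model in a single pipeline, so the scaler is only ever fitted on training data; a minimal sketch, assuming scikit-learn and synthetic data:

```python
# Sketch: a Pipeline ensures the scaler is fitted on the training data only,
# both for a simple split and within each cross-validation fold.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), Ridge())
model.fit(X_tr, y_tr)                      # scaler sees training data only
print("test R2:", round(model.score(X_te, y_te), 2))

# Inside CV, the whole pipeline is refitted per fold, so no fold leaks into another.
print(cross_val_score(model, X_tr, y_tr, cv=5, scoring="r2").round(2))
```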

The other form of data leakage is target leakage. The algebraic equivalent is letting the target sneak into the right-hand side of the y = f(x) equation, via a feature that is itself calculated from y:

y = f(x1, x2, g(y))

In effect, the variable you are trying to predict is being used, directly or indirectly, to calculate one of the model inputs. We have replicated this by engineering the feature “deviation from target yield”, defined as the difference between the true yield and the target yield for a specific raw material batch. This feature relies on knowing the yield value that the model is supposed to predict, making the resulting model invalid. Unfortunately, mistakes like this are not uncommon.5

The impact of some of these data leakage mechanisms on model performance is shown in Figure 12. The properly validated model achieved a test R2 score of 0.55: unlikely to be sufficient for actual deployment. The data leakage cases result in inflated test performance, which we have coined the “self-delusion gap”. This is the performance you think you will get; very different from reality. The consequences of undetected data leakage should not be underplayed, as it could lead to an incapable model being deployed where it may influence business decisions and plant operation.

Figure 12: Effect of data leakage on model performance predicting reaction yield
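The self-delusion gap is easy to reproduce on synthetic data: the sketch below adds a leaky “deviation from target yield” feature computed from the target itself and compares the apparent test scores (all numbers are illustrative):

```python
# Sketch: a deliberately leaky feature, computed from the target, inflates the
# apparent test score. Data and the "target yield" offset are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                                        # process conditions
y = 70 + X @ np.array([5.0, -3.0, 2.0, 0.5]) + rng.normal(0, 5.0, 300)  # yield

leaky = (y - 75.0).reshape(-1, 1)     # "deviation from target yield" uses y itself
X_leaky = np.hstack([X, leaky])

for name, features in [("clean", X), ("leaky", X_leaky)]:
    X_tr, X_te, y_tr, y_te = train_test_split(features, y, random_state=0)
    model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
    print(f"{name}: test R2 = {r2_score(y_te, model.predict(X_te)):.2f}")
```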

The above examples used synthetic data, but there is no shortage of real examples in the literature: there is a recognised “leakage and reproducibility crisis”. In a published example where there is open correspondence, the authors use the “stacking” in series of two ML regression models, each with an R2 in the order of 20%, to generate an ensemble model with an R2 of 90%,6 graphically represented in Figure 13. Given that the two constituent models individually give essentially no fit, this is a surprising result.

It has been suggested that this resulted from “target leakage”,7 which the original authors have acknowledged, even identifying the offending line of code.8 This underscores how easily such errors can be introduced and the importance of due diligence.

To avoid data leakage, or to catch it early, practitioners must remain wary of it during validation and testing and try to replicate precisely the scenario of predicting future data once the model is deployed. Remember the adage that “if it sounds too good to be true, then it probably is”.

Figure 13: The stacking of two low-fit regression models yields a surprising result

Data augmentation

Data augmentation is sometimes useful for tabular datasets,9 but we generally caution against uncritical use. Many augmentation methods implicitly assume linear relationships between features and target variables – an assumption that often does not hold, or else augmentation would not be necessary. While augmentation increases the number of data points, it does not add new information, as the synthetic data are dependent on the original dataset. The main risk, as discussed above, is test contamination leading to data leakage between training and testing sets, which can produce overly optimistic performance estimates. Our recommendation is not to avoid augmentation entirely but to apply it with caution.
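If augmentation is used, the key safeguard is to generate synthetic samples from the training split only, after the train/test split; a minimal sketch with simple noise jitter (the augmentation method itself is purely illustrative):

```python
# Sketch: augment the training split only, after splitting, so the test set
# stays untouched and truly unseen. Noise-jitter augmentation is illustrative.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
y = X.sum(axis=1) + rng.normal(0, 0.5, 80)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Synthetic samples are derived from training rows only.
X_aug = np.vstack([X_tr, X_tr + 0.01 * rng.standard_normal(X_tr.shape)])
y_aug = np.concatenate([y_tr, y_tr])
print(X_aug.shape, X_te.shape)
```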

Summary

ML models require rigorous scrutiny, and many flaws are only revealed by examining a model from multiple angles. The good news is that a plethora of techniques exist for evaluating models and their behaviour under the hood, and best practices are well established. Good ML work that you can trust should include details of the model scrutiny undertaken. Absence of such detail should prompt questions regarding the metrics used, test data separation and the relevance of those metrics to the application.

Next issue

In the next article we will look at how ML models can be made more transparent to users, with methods for interrogating purely data-driven models. We also highlight ML crimes, with a reference list to help you spot “red flags” that are likely indicative of poor data processing, maths and modelling.

References

1. Evaluating Models: Hating on R-squared: bit.ly/eigenvector-evaluating-models
2. J Emerson, V Vivacqua, EH Stitt (2022), Johnson Matthey Technology Review, 66, 154–163
3. M Trower, JT Emerson et al (2023), Powder Technology, 427, 118745
4. T Akiba et al (2019), Optuna: A Next-generation Hyperparameter Optimization Framework, Proc 25th ACM SIGKDD Conf on Knowledge Discovery & Data Mining, 2623–2631
5. S Kapoor, A Narayanan (2023), Leakage and the reproducibility crisis in machine-learning-based science, Patterns, 4, 100804
6. H Xiao et al (2024), Chem. Eng. Sci., 265, 119579
7. C Jackson, EH Stitt (2025), Chem. Eng. Sci., 302, 120831
8. H Xiao et al (2025), Chem. Eng. Sci., under review
9. A Ishikawa (2022), Heterogeneous catalyst design by generative adversarial network and first-principles based microkinetics: https://go.nature.com/3YrpcD4

Article By

Hugh Stitt

Senior research fellow at Johnson Matthey


Robert Gallen

Principal engineer, Johnson Matthey


Joe Emerson

Senior digital chemical engineer, Johnson Matthey


Carl Jackson

Senior digital chemical engineer, Johnson Matthey

