Hugh Stitt, Joe Emerson, Carl Jackson and Robert Gallen focus on understanding input data and why quality, pre-processing and feature engineering drive 80% of ML modelling effort
IN A MODEL derived solely from input data, the quality of those data is paramount. If the input data are not adequate or sufficient, then any resulting model will be flawed. The key question, then, is what makes a dataset fit for purpose? Regardless of how the data are gathered – from experiments, simulations or plant historians – the same fundamental considerations apply.
A simple and common way of assessing a data set is the “Vs of data” framework, which highlights the characteristics that most strongly influence model robustness and interpretability. While sources list anywhere between three and 17 Vs,1 the core ideas are consistent. In practical terms, engineers must look beyond sheer data quantity.
More data do not automatically make a better model. What matters is whether the dataset has sufficient coverage, accuracy and diversity to describe the domain of interest – and whether it captures enough variation to avoid bias. In process engineering, this often means recognising that most projects work not with “big data” but with what might be called meso-data: datasets that are modest in size but rich in meaning.
Even so, many engineering datasets fall short on one or more of the Vs. In practice, data are rarely perfect – experiments are expensive, plant operating ranges are narrow and sensors fail or drift. What we are often dealing with, therefore, is information-poor data: datasets that are small, skewed or noisy, yet still the only evidence available to build a useful model – we have done ML modelling with as few as 15–30 data points.
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. John Tukey, US statistician
When data are limited, the emphasis shifts from collecting more to extracting the most from what exists. To do this, the non-volume Vs must also be given due consideration. Volume and variance are probably the most important and widely explored of the Vs. The common descriptors for data characteristics presented in Table 1, with data types shown in italics, provide a useful way of thinking about and classifying data – and each may have different pre-processing requirements. Four challenging characteristics of data can be identified in Figure 1.2
The upper right quadrant of Figure 1 is evidently the ideal space. Experimental data, because experiments are expensive and time consuming, will tend towards the lower right quadrant. Using Design of Experiments (DoE) to uniformly fill the region of interest, providing variance and excluding bias, is therefore essential. Adaptive DoE can be enormously beneficial in directing additional experiments into the optimal part of the experimental domain.
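Space-filling designs such as Latin hypercube sampling are one practical way to spread runs uniformly over the region of interest. A minimal sketch using SciPy's quasi-Monte Carlo module follows; the two factors, their bounds and the run count are purely illustrative.

```python
from scipy.stats import qmc

# Latin hypercube design for two factors (eg temperature and pressure)
sampler = qmc.LatinHypercube(d=2, seed=42)
unit_design = sampler.random(n=12)          # 12 runs in the unit hypercube [0, 1)^2

# Scale onto the (illustrative) experimental region of interest
l_bounds = [60.0, 1.0]                      # temperature (degC), pressure (bar)
u_bounds = [90.0, 5.0]
design = qmc.scale(unit_design, l_bounds, u_bounds)
print(design.round(2))
```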
In 1973, statistician Francis Anscombe demonstrated that four datasets with identical means and standard deviations could have entirely different distributions – now famous as Anscombe’s Quartet.3 See Datasaurus Dozen (bit.ly/datasaurus-dozen) for a recent, humorous take on this.
“Series” data, such as time domain signals, spectra (eg infra-red, acoustic) and particle size distributions (PSD), are inherently sampled: time series data may be recorded at, say, 1 Hz or once per minute. What, though, is the optimal sampling strategy? Under-sampling masks important features, while over-sampling unnecessarily increases model complexity and may, perversely, disguise them.
By analogy, in process control for a system that has a response time of five minutes there is no benefit, for example, in measuring and adjusting every second. Indeed, this complicates the control problem. Conversely, under-sampling, such as every 30 minutes, would evidently not enable adequate control.
As an example, consider a PSD. This may be rendered into, say, 100 size bins by the particle size measurement software, with smoothing commonly applied during the mathematical reconstruction. Using the traditional d10, d50 and d90 might be under-sampling (especially for a multi-modal distribution), but should the full 100 bins be used? The volume fraction values for a given bin of course have an error margin. Figure 3 shows this scenario: the error bars extend beyond the means of the adjacent bins, so the values for adjacent bins are not actually independent. This dataset could thus be “down-sampled” to fewer bins with no loss of information, as sketched below. Maintaining 100 bins adds unnecessary complexity and will not contribute to improving model quality.
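A rough sketch of that down-sampling, assuming only that volume fractions in adjacent bins are additive; the 100-bin distribution here is synthetic.

```python
import numpy as np

# Synthetic 100-bin PSD expressed as volume fractions that sum to 1
rng = np.random.default_rng(0)
psd_100 = rng.random(100)
psd_100 /= psd_100.sum()

# Down-sample to 20 bins by summing adjacent groups of 5 bins;
# total volume is preserved, so no information is invented or lost
psd_20 = psd_100.reshape(20, 5).sum(axis=1)
assert np.isclose(psd_20.sum(), 1.0)
```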
A study based on time domain signals, spectra, PSDs and the like should therefore explore and rationalise the sampling frequency. This may mean down-sampling, up-sampling by interpolation, or re-collecting the raw data at a more granular level. Autocorrelation plots can be useful here: if the value at time t is highly correlated with the value at t-1, but that correlation quickly drops off after a few lags, then there is probably oversampling.
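A minimal sketch of that check with pandas and statsmodels; the 1 Hz signal is synthetic and the 10-second resampling window is purely illustrative.

```python
import numpy as np
import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf

# Synthetic 1 Hz signal behaving like a first-order process with a ~10 s time constant
rng = np.random.default_rng(1)
phi = np.exp(-1 / 10)                       # AR(1) coefficient for a 10 s time constant
noise = rng.standard_normal(3600)
x = np.zeros(3600)
for t in range(1, 3600):
    x[t] = phi * x[t - 1] + noise[t]
ts = pd.Series(x, index=pd.date_range("2024-01-01", periods=3600, freq="1s"))

print(f"Lag-1 autocorrelation: {ts.autocorr(lag=1):.2f}")   # high: adjacent samples are redundant
plot_acf(ts, lags=60)                                       # correlation collapses after a few tens of lags

# One option: down-sample by averaging over 10-second windows
ts_downsampled = ts.resample("10s").mean()
```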
Figure 4 shows the typical activities that may occur during data processing, split into collection, pre-processing and feature engineering activities.
One of the first steps in working with data is cleaning, and a common pitfall is failing to identify or handle duplicates and missing values. Duplicates occur for a few reasons, for example the pesky autumn clock change in the UK, when an hour of timestamps is repeated. Missing values probably have a broader range of more complex causes (see Table 3).
Good practice is to consider the frequency and pattern (systematic or random) of missing data on a feature-by-feature basis to determine the appropriate handling; applying a blanket approach is poor practice. There are a number of ways to deal with missing values, from simply dropping them, to inferring the values from other correlated variables, to building a separate data-driven model to predict them. Understanding how and why values are missing is, however, key to making an informed decision about how to handle them.
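A minimal pandas sketch of this first cleaning pass; the file name and column names are hypothetical, and the per-feature choices shown are examples rather than recommendations.

```python
import pandas as pd

# Hypothetical historian export with a timestamp column
df = pd.read_csv("plant_historian_export.csv", parse_dates=["timestamp"])

# Duplicates, eg repeated timestamps from the autumn clock change
print("Duplicate timestamps:", df.duplicated(subset="timestamp").sum())
df = df.drop_duplicates(subset="timestamp", keep="first")

# Missing values: inspect the frequency per feature before deciding how to handle them
print(df.isna().mean().sort_values(ascending=False))    # fraction missing per column

# Handling decided feature by feature (illustrative choices)
df["flow"] = df["flow"].interpolate(limit=5)    # bridge short sensor dropouts only
df = df.dropna(subset=["lab_assay"])            # drop rows that are useless without this value
```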
While feature scaling should be implemented in the model training pipeline (this will be discussed in part 3 of this series), it can be explored and understood in the pre-processing phase. Scaling is important for most ML models to avoid bias towards large features. To illustrate, features like pressure (in thousands) can overshadow others like valve position (0.0–1.0) and thus will dominate any gradient-based ML model. A prevalent scaling method is standardisation (aka z-score normalisation), whereby for each feature (x) its mean (μ) is subtracted and the result divided by its standard deviation (σ):

z = (x − μ) / σ   (Equation 1)
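In scikit-learn, StandardScaler applies Equation 1 column by column; a minimal sketch with made-up pressure and valve-position values follows (fitting the scaler only on training data is covered in part 3).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1500.0, 0.20],    # pressure (kPa), valve position - illustrative values
              [1750.0, 0.55],
              [2100.0, 0.90]])

X_scaled = StandardScaler().fit_transform(X)        # Equation 1, feature by feature
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # approximately 0 and exactly 1 per feature
```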
Most ML models have an underlying assumption that the features are normally distributed with little skew. An often-forgotten fact is that the described standardisation in Equation 1 does not change the distribution. Transformations, such as taking the log, using power transforms, or the more complex Box-Cox transforms, should therefore also be explored. Using some reaction time data as an example, we can see in Figure 5 that the log transformation has removed the positive skew while standardising using Equation 1 has had no impact. Our message is that effort is required in the pre-processing to assess and address distributions – the model performance will benefit.
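A short sketch of that assessment using SciPy; the positively skewed “reaction time” data here are synthetic.

```python
import numpy as np
from scipy import stats

# Synthetic, positively skewed reaction-time data (hours) - illustrative only
rng = np.random.default_rng(2)
reaction_time = rng.lognormal(mean=1.0, sigma=0.6, size=500)

print("Skew of raw data:          ", round(float(stats.skew(reaction_time)), 2))
print("Skew after standardisation:", round(float(stats.skew(stats.zscore(reaction_time))), 2))  # unchanged
print("Skew after log transform:  ", round(float(stats.skew(np.log(reaction_time))), 2))

# Box-Cox finds a power transform automatically (data must be strictly positive)
transformed, lam = stats.boxcox(reaction_time)
print("Skew after Box-Cox:        ", round(float(stats.skew(transformed)), 2), "lambda =", round(lam, 2))
```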
Outliers impair model performance. Detecting them during data cleaning has the added benefit of enhancing understanding of the data; the outliers themselves may indicate errors or rare events of significance. The effect of an outlier on a simple linear regression line is clear to most, but the impact is even greater when training gradient-based models like Neural Networks. Despite its importance, outlier detection is often skipped or treated superficially.
A common pitfall is using a single univariate outlier detection approach, for instance removing outliers beyond three standard deviations from the mean. While there are limitations with that specific method (eg assuming normality), there are broader impacts from considering only univariate approaches.
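A synthetic illustration of that broader impact follows: a point that is unremarkable in either variable on its own, yet clearly anomalous once the correlation between the two is considered. The variables and values are made up.

```python
import numpy as np
from scipy import stats
from scipy.spatial.distance import mahalanobis

# Two strongly correlated variables, eg a feed flow and the product flow it drives
rng = np.random.default_rng(3)
feed = rng.normal(100, 5, 500)
data = np.column_stack([feed, 2 * feed + rng.normal(0, 1, 500)])

# Add a point that sits near the centre of each variable but breaks their relationship
data = np.vstack([data, [100.0, 180.0]])

# Univariate screen: |z| > 3 per column does not flag the added point (index 500)
z = np.abs(stats.zscore(data, axis=0))
print("Univariate flags:", np.where((z > 3).any(axis=1))[0])

# Multivariate screen: Mahalanobis distance makes it the clear worst offender
centre = data.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
dist = np.array([mahalanobis(row, centre, inv_cov) for row in data])
print("Largest Mahalanobis distance at index:", int(dist.argmax()))
```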
The widely used Tennessee Eastman Process (TEP)4 benchmark dataset illustrates this. A number of process and sensor faults are introduced into the simulated plant data, some of which are notoriously hard to detect. As an example, Fault #8 is a variation in the ratios of the feed components. Univariate approaches, considering each stream only in isolation, are either slow to detect this fault (outlier data in the present context) or simply fail.
Principal Component Analysis (PCA), the bedrock of fault detection in process engineering and beyond,5 successfully identifies the above fault (see Figure 6). This period of anomalous operation may be useful for modelling, or it may need discarding. Critically, though, outliers need to be detected first, and multivariate methods are generally required.
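A rough sketch of the idea behind PCA-based monitoring, using Hotelling's T-squared statistic on synthetic data; this is a simplified stand-in for the full TEP analysis, with an empirical control limit rather than the usual distribution-based one.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic "normal operation": six measured variables driven by two latent factors
rng = np.random.default_rng(4)
mixing = rng.normal(size=(2, 6))
X_normal = rng.normal(size=(500, 2)) @ mixing + 0.1 * rng.normal(size=(500, 6))

pca = PCA(n_components=2).fit(X_normal)

def hotelling_t2(X):
    # Squared PCA scores, each weighted by the variance that component
    # explains under normal operation, summed per sample
    scores = pca.transform(X)
    return np.sum(scores**2 / pca.explained_variance_, axis=1)

limit = np.percentile(hotelling_t2(X_normal), 99)   # simple empirical control limit

# A faulty sample: the underlying factor drifts far outside its normal range
X_fault = np.array([[6.0, 0.0]]) @ mixing
print(hotelling_t2(X_fault) > limit)                # expected: [ True ]
```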
Feature selection and engineering are, in some ways, two sides of the same coin. The first involves removing non-informative features, as having too many of these can artificially reduce variance within the dataset. The second focuses on creating more informative features – embedding useful information directly into the data that would otherwise need to be learned – often improving model performance and interpretability in the process.
There are myriad ways to do feature selection, and leaning on domain expertise is certainly not a bad start. A common oversight is failing to identify and remove features that are highly correlated with one another, a problem known as collinearity. For example, the molar flow rate of A is collinear with the feed molar flow rate and the feed mole fraction of A, hence the redundancy: over-specified, in old terminology. Collinear features offer little new information, complicate model interpretation and feature selection, and can dilute coefficients. Before modelling, collinear features should be identified and removed; dimensionality reduction techniques, like PCA, can be effective.
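A minimal sketch of a collinearity check using a correlation matrix and variance inflation factors (VIF); the three features mirror the molar-flow example above, but the numbers are synthetic.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic data: molar flow of A is (nearly) the product of the other two features
rng = np.random.default_rng(5)
feed_flow = rng.normal(100, 10, 200)
frac_A = rng.uniform(0.2, 0.4, 200)
df = pd.DataFrame({
    "feed_molar_flow": feed_flow,
    "feed_mole_frac_A": frac_A,
    "molar_flow_A": feed_flow * frac_A + rng.normal(0, 0.5, 200),
})

print(df.corr().round(2))                 # pairwise linear correlations

X = sm.add_constant(df)                   # VIF needs an intercept column
vif = {col: round(variance_inflation_factor(X.values, i), 1)
       for i, col in enumerate(X.columns) if col != "const"}
print(vif)                                # values well above ~5-10 flag collinearity
```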
Often, we see removing features that are not correlated with the quantity being modelled used as the sole feature selection method. The problem with relying on univariate correlation-based feature selection is, of course, that it is univariate: it ignores features that may be extremely informative when combined with others.
Another limitation is demonstrated in Figure 7. Correlation alone misses non-linear relationships, so the quadratic feature would have been dropped, whereas a measure such as mutual information retains it.
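A short sketch of that comparison with scikit-learn, using a synthetic, purely quadratic relationship.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(6)
x = rng.uniform(-3, 3, 500)
y = x**2 + rng.normal(0, 0.5, 500)        # purely quadratic relationship plus noise

pearson = np.corrcoef(x, y)[0, 1]
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson correlation: {pearson:.2f}")   # near zero - the feature would be dropped
print(f"Mutual information:  {mi:.2f}")        # clearly non-zero - the feature is kept
```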
Overall, it is better ML practice to diversify and explore a range of feature selection techniques, such as recursive feature elimination, selection mechanisms inherent in your model (eg regularisation), or ranking by the feature weights or importances your model provides, if available.
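A minimal recursive feature elimination sketch with scikit-learn; the dataset is synthetic, with only the first five of ten features actually informative.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic regression problem: 10 features, only the first 5 carry information
X, y = make_friedman1(n_samples=300, n_features=10, random_state=0)

# Recursive feature elimination: fit, drop the weakest feature, refit, repeat
rfe = RFE(RandomForestRegressor(n_estimators=200, random_state=0),
          n_features_to_select=5).fit(X, y)
print(rfe.support_)      # boolean mask of the features retained
```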
The main pitfall in feature engineering, arguably even more so than in feature selection, is neglecting domain expertise. Chemical engineers have been doing feature engineering simultaneously with dimensional reduction for over a century: dimensionless numbers. Using dimensional analysis and physical laws to pre-process the data allows us to simultaneously encode the governing physics (feature engineering) while using fewer variables (feature reduction).
By way of example, in a fermentation project, we knew oxygen transfer was important. By asking the question, “What phenomena are relevant to what I am trying to model?”, we introduced the overall gas liquid mass transfer coefficient (kLa) using the agitation speed, air flow rate and impeller dimensions through the Van’t Riet empirical correlation.6 A simple linear ML model was now able to predict final biomass yield across different scales. Without this feature engineering step, significantly more data and more complex non-linear ML algorithms would have been needed to uncover the relationships embedded in the kLa correlation (a caveat: using individual dimensionless numbers such as Reynolds, Weber or specific power as inputs is ineffective, as they are interdependent and all vary with speed and diameter). This highlights the advantage of having domain experts lead the data science effort – or, at the very least, remain closely involved in it.
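A rough sketch of how such an engineered feature might be computed. The exponents and pre-factor shown are the commonly quoted Van't Riet values for coalescing (water-like) systems and are illustrative defaults, not the values used in the project; the helper name and column names are hypothetical.

```python
import numpy as np

def kla_vant_riet(power_w, liquid_volume_m3, gas_flow_m3_s, tank_diameter_m,
                  a=0.026, alpha=0.4, beta=0.5):
    """Engineered kLa feature (1/s) from a Van't Riet-style correlation:
    kLa = a * (P/V)**alpha * v_s**beta.
    Defaults are the commonly quoted coalescing-system values; fit or look
    up constants appropriate to your own broth and vessel."""
    specific_power = power_w / liquid_volume_m3                  # W/m3
    v_s = gas_flow_m3_s / (np.pi * tank_diameter_m**2 / 4)       # superficial gas velocity, m/s
    return a * specific_power**alpha * v_s**beta

# Example usage on a (hypothetical) table of fermentation runs:
# df["kLa"] = kla_vant_riet(df["power_W"], df["volume_m3"],
#                           df["air_flow_m3_s"], df["tank_diameter_m"])
```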
In the next article we divert our attention to the more glamorous model development and testing and look at possibly the largest pitfall of all: data leakage.
1. The 17 V’s of Big Data: bit.ly/17-vs-of-big-data
2. Maximizing Information from Chemical Engineering Data Sets: Applications to Machine Learning: https://doi.org/10.1016/j.ces.2022.117469
3. Graphs in Statistical Analysis: FJ Anscombe (1973), The American Statistician, 27(1), 17–21
4. A Plant-wide Industrial Process Control Problem: JJ Downs, EF Vogel (1993): https://www.abo.fi/~khaggblo/RS/Downs.pdf
5. Data-driven Methods for Batch Data Analysis – A Critical Overview and Mapping on the Complexity Scale: bit.ly/data-driven-methods
6. Review of Measuring Methods and Results in Nonviscous Gas-Liquid Mass Transfer in Stirred Vessels: K Van’t Riet (1979): bit.ly/review-of-measuring-methods