Data-driven predictor variable selection for industrial soft-sensors

Supervisor: Dr. Tim Offermans

Industrial processes need to be carefully controlled, which often starts with monitoring a key performance indicator (KPI) of the process, such as product quality or purity. However, measuring such KPIs at high frequency can be costly. In such cases, it is common to use a multivariate regression model that predicts the KPI from physical process variables that are already being measured, such as temperatures, pressures and flow rates. Such a model is called a soft-sensor and can be a key enabler of accurate and timely control of a production process, but only if it is well calibrated.

Key to calibrating an accurate soft-sensor is to select as predictors for the regression model only those process variables that are actually predictive of the KPI. Many methods for performing such variable selection have been explored in the literature, but these studies generally fall short in two respects.

  1. They compare different methods for approximating the best variable selection, but they do not compare these methods to the actual best variable selection (golden truth). Finding this golden truth is computationally intensive, which is likely why it is not implemented at actual production plants and approximation methods are used instead. For the sake of benchmarking the less intensive methods, however, it should be included.
  2. They compare different methods for variable selection in terms of the validated performance of the eventual model (prediction quality), but not in terms of calculation time. There is a growing interest in the field in soft-sensors that recalibrate automatically at regular intervals. As a result, computation time has become an important criterion when choosing a variable selection method.

In this internship, you will review the literature on variable selection methods and apply the most common ones to a few example datasets. You will also determine the golden variable selection for each dataset using a full factorial experiment, in which every possible combination of candidate variables is evaluated. All these analyses will be automated in MATLAB, and you will critically compare the variable selection methods in terms of prediction quality and computation speed. You will report your findings both orally and in writing during your internship.
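
To give an impression of what such a full factorial experiment involves, the sketch below evaluates every non-empty subset of p candidate variables (2^p - 1 subsets in total, so the workload doubles with each added variable) and records the computation time. It is only an illustration under stated assumptions: the dataset is synthetic, ordinary least squares stands in for whichever regression model is eventually used, prediction quality is measured as a 5-fold cross-validated RMSE, and all variable names are hypothetical. Only base MATLAB functions are used.

```matlab
% Illustrative sketch: exhaustive (full factorial) variable selection.
% Synthetic data and all names are assumptions made for this example.
rng(1);                                  % reproducible random numbers
n = 120;  p = 6;  k = 5;                 % samples, candidate variables, CV folds
X = randn(n, p);                         % stand-in for measured process variables
y = 2*X(:,1) - 1.5*X(:,3) + 0.3*randn(n, 1);   % KPI depends on variables 1 and 3 only

fold = mod(randperm(n), k)' + 1;         % random fold label (1..k) for each sample
nSubsets = 2^p - 1;                      % every non-empty subset of the p variables
rmsecv = zeros(nSubsets, 1);

tic;                                     % start timing the full search
for s = 1:nSubsets
    sel = logical(bitget(s, 1:p));       % bit j of s: is variable j included?
    sqErr = 0;
    for f = 1:k
        test  = (fold == f);
        train = ~test;
        A = [ones(sum(train), 1), X(train, sel)];        % intercept + selected variables
        b = A \ y(train);                                % least-squares fit on training folds
        yHat = [ones(sum(test), 1), X(test, sel)] * b;   % predict the held-out fold
        sqErr = sqErr + sum((y(test) - yHat).^2);
    end
    rmsecv(s) = sqrt(sqErr / n);         % cross-validated RMSE for this subset
end
elapsed = toc;                           % computation time of the full search

[bestRmsecv, best] = min(rmsecv);
bestVars = find(bitget(best, 1:p));      % indices of the variables in the best subset
fprintf('Best subset: %s (RMSECV %.3f), %d subsets in %.2f s\n', ...
        mat2str(bestVars), bestRmsecv, nSubsets, elapsed);
```

An approximate selection method can then be benchmarked by comparing both its cross-validated error and its run time against the values found by this exhaustive search.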

This internship will help to increase your:

  • Experience with both fundamental and advanced chemometric models
  • Ability to practically handle large datasets
  • Programming experience (in MATLAB)
  • Ability to critically review scientific literature
  • Skill in communicating research both orally and in writing

The length and specific objectives of the internship can be adapted for both bachelor’s and master’s students, in consultation with the supervisor at the start of the internship.