Training machine learning models

Each data scientist develops their own approach to training machine learning models. Training generally involves identifying the use case, preparing the data, selecting training algorithms and analyzing the results. The following best practices were developed by Shehab for PwC:

  • Start simple. Model training should begin with the simplest approach. Complexity can then be added in the form of more model features, more sophisticated features and more advanced learning algorithms. The simpler model serves as a baseline for determining whether the performance gained through the added complexity is worth the additional investment in time and technical costs.
  • Create a consistent model development process. Because model development is highly iterative, it should follow a consistent process supported by tools that provide comprehensive experiment tracking, so data scientists can more readily pinpoint where their models can be improved.
  • Identify the right problem to solve. Look for improperly defined objectives, wrong areas of focus and unrealistic expectations, all of which are often responsible for a model's poor performance or failure to produce tangible value. A clearly defined problem provides the solid grounding needed to properly assess the model's development.
  • Understand the historical data. The model is only as good as the data it will be trained on, so start with a firm understanding of how that data behaves, the overall quality and completeness of the data, important trends or elements of the data set related to the task at hand and any biases that may be present.
  • Ensure accuracy. To avoid introducing bias, providing the model with inappropriate feedback or reinforcing the wrong behavior, carefully set measurable benchmarks for model performance. A machine learning algorithm learns through feedback from an objective or outcome set in the training data. If the calculations that generate the feedback aren't carefully defined and aligned to the expected values, the result could be a poor or non-functioning model.
  • Focus on explainability. Data scientists who focus on why a model performs the way it does will produce better models. This approach requires more comprehensive model validation and testing. Explainability also provides insights into why a model underperforms, hypotheses about how to enhance performance and a global view of how the model functions, which helps build trust among consumers.
  • Continue training. Model training continues over the life of the model, including the production stage, so that the model can be continuously improved.
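
The "start simple" practice above can be illustrated with a small sketch. It is not taken from the PwC guidance; it is one possible way to compare a trivial baseline against a slightly more complex model on held-out data. The data is synthetic, and the function names, the 80/20 split and the `min_gain` threshold are all hypothetical choices made for the illustration.

```python
import random
from statistics import mean


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def fit_mean_baseline(xs, ys):
    """Simplest possible model: always predict the training mean."""
    y_bar = mean(ys)
    return lambda x: y_bar


def fit_linear(xs, ys):
    """One step up in complexity: least-squares line on a single feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def compare_to_baseline(xs, ys, min_gain=0.10, seed=0):
    """Evaluate both models on a held-out split and report the gain.

    min_gain is an assumed threshold: the relative error reduction the
    more complex model must deliver to justify its extra cost.
    """
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    cut = int(0.8 * len(idx))  # assumed 80/20 train/test split
    train, test = idx[:cut], idx[cut:]
    x_tr, y_tr = [xs[i] for i in train], [ys[i] for i in train]
    x_te, y_te = [xs[i] for i in test], [ys[i] for i in test]

    baseline = fit_mean_baseline(x_tr, y_tr)
    linear = fit_linear(x_tr, y_tr)

    baseline_mse = mse(y_te, [baseline(x) for x in x_te])
    linear_mse = mse(y_te, [linear(x) for x in x_te])
    gain = (baseline_mse - linear_mse) / baseline_mse
    return {
        "baseline_mse": baseline_mse,
        "linear_mse": linear_mse,
        "relative_gain": gain,
        "keep_complex": gain >= min_gain,
    }


# Synthetic data for illustration only: y depends linearly on x plus noise.
data_rng = random.Random(42)
xs = [data_rng.uniform(0, 10) for _ in range(200)]
ys = [3 * x + 2 + data_rng.gauss(0, 1) for x in xs]

result = compare_to_baseline(xs, ys)
print(result)
```

Logging the returned dictionary for every run, together with the settings that produced it, is also a lightweight form of the experiment tracking mentioned above. In practice a dedicated tracking tool would usually take over that role.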