Record scratch.

Freeze frame.

My boss’s career is probably over: the business lost more than $500,000 because our fraud detection system wrongly flagged fraudulent transactions as legitimate. The guy in the chair is an engineer. You might be wondering how we got here.

My story began with an image that has been seen over a thousand times.

Figure: Model monitoring in the ML lifecycle. Source: Author

After months of perfecting our model, we finally deployed to production a few months ago. I told myself and my colleague that our hard work had paid off. It was so exciting.

Our model was returning results in batches and serving them in real-time.

That was not the whole story, as we came to realize in a relatively dramatic fashion.

I am not going to bore you with the usual reasons why the typical way of deploying software does not work for machine learning applications. I am still recovering from the injuries my boss left on me, so the least I can do is help you not end up in a hospital bed after a “successful” model deployment.

In this article, I will tell you all about it:

  • Why deployment isn’t the final step for you.
  • Why you need to own and monitor models in production.
  • How to monitor models in production and what to look out for.
  • Different platforms for monitoring and observability.
  • Logging and alerting.
  • Challenges and best practices for monitoring your models.
  • And a lot more!

By the end of this article, you should know how to monitor your models in production, how to spot problems, and how to approach the “life” of your model beyond monitoring.

It doesn’t stop at deployment.

Take a look at your traditional software application after deployment.

You almost don’t have to worry. It should work as expected because you have tested it and deployed it. As you upgrade to meet new system requirements or new business needs, your team may decide on a steady and periodic release of new versions.

You might be concerned about system metrics, error rates, traffic volume, app loading times, infrastructure, and things like that. Your application scales to a lot of users and works as it was intended to solve the problem it was built to solve.

That is conventional software. Now, enter the machine learning world.

Deploying your model was likely a hassle in itself. Conventional software might have a team that monitors and maintains the system in production, but with machine learning applications you can’t just hand your deployed models off to the ops team.

Modified by the author and adapted from the source.

Machine learning models degrade over time. They are sensitive to real changes in the real world.

Models are similar to cars in this respect. The value of a new car goes down 10% the moment it leaves the dealership. The same goes for your new model.

The moment you deploy your model to production, it begins to degrade in terms of performance. A model is at its best just before it is deployed to production, which is why deployment should not be your final step.

There needs to be a way for you to consistently report back to business stakeholders on whether or not your machine learning solution is solving any problems.

Your model’s performance in production won’t be fully justified by validation results. To make sure your models keep performing as well as they are supposed to, you have to monitor them after deployment.

There could be situations where you deploy models in multiple geographic regions. How do you make sure that your models serve the intended results for each client?

There is a lot of work to be done after deployment, and you shouldn’t hand off your models too easily. Deployment is not the end of working on your model.

You need to keep an eye on your machine learning models.

I hope I have convinced you that models perform worse in production than they do in development. As a data scientist or machine learning engineer, you need to be aware of this challenge. Monitoring is the next step you take after deployment, and it is where you will find the solutions to this challenge in production.

Why do you need to keep an eye on your models?

Let’s look at some of the challenges your model will face in production.

Production challenge: key question

  • Data distribution changes: Why are there sudden changes in the values of my features?
  • Model ownership in production: Who is in charge of the model in production?
  • Training-serving skew: Why is the model giving poor results in production despite being tested?
  • Model/concept drift: Why did my model’s performance decline over time?
  • Black box models: How can I explain my model’s predictions to relevant stakeholders?
  • Concerted adversaries: How can I make sure my model is protected?
  • Model readiness: Is it possible to compare results from a newer version of my model against the in-production version?
  • Pipeline health issues: Why does a retraining job take so long to complete?
  • Underperforming system: Why am I getting different latencies for different models?
  • Cases of extreme events: How will I be able to track the effect and performance of my model during such events?
  • Data quality issues: How can I make sure the production data is the same as the training data?

This should give you an idea of the challenges that you may encounter after deployment, and why you need to continue your good work by monitoring your model in production.

The goals of monitoring your models in production are:

  • To detect problems with your model and the system that serves it before they start to generate negative business value.
  • To take action by triaging models in production or the inputs and systems that enable them.
  • To ensure their predictions and results can be reported and explained.
  • To ensure the model’s prediction process is transparent for proper governance.
  • To give a path for maintaining and improving the model.

Let us get more specific.

Functional monitoring and operational monitoring can be used to monitor production challenges.

Monitoring the progress and quality of your machine learning application over the course of its lifetime is important.

By machine learning application, I mean everything that enables your model in production, including infrastructure, input data, resources, and other upstream and/or downstream services.

You should think about two things if you are wondering about what to watch.

  1. What could go right?
  2. What could go wrong?

What could go right?

The planning phase of your machine learning project is where you want to plan around the user and the business.

Everything that could go right guarantees that the business objective and user needs are met.

Depending on the business use case, how do you monitor and measure this?

  • How do you define success in your business?
  • What were the expectations before deployment?
  • How quickly should my model return a prediction to a client?

You need metric selection criteria to answer the questions. Success for a business could mean an entire system in which your model only plays a small role.

To get specific with your metric selection criteria, you can follow some of the best practices.

  • Pick a metric that is comparable across models.
  • It is simple and easy to understand.
  • It can be collected in real-time.
  • It allows for alerting on problems.

To think about what success means for a business, you have to think about a good user experience. Think about how your model contributes to a good user experience in the context of the entire system.

If we were building a loan approval system, a good business objective could be:

Quickly approving a loan for a customer that will repay it at a certain time.

It would be difficult to monitor how well our model contributes to a goal like this. There are a lot of variables that affect whether a customer pays back a loan or not, and we don’t have the ability to monitor them.

A good metric to monitor could be:

How quickly does the model return a prediction?

With constraints like these:

  • Our model should return a prediction (likelihood) score between 0 and 1. Scores from 0.6 to 0.7 are loans that should come with a high interest rate to mitigate risk, scores from 0.71 to 0.90 are loans that should come with a mid-level interest rate, and higher scores are loans that should come with a low interest rate. We can’t get real-time feedback on whether this customer will actually pay, so we’re going to use the model’s score as the real-time metric to monitor.
  • A successful response to the client.
  • A response time of 100 ms would be considered a good one.
  • A service level agreement (SLA) for a microservice architecture.

I picked the metric this way because it:

  • Can be compared across models.
  • Is simple and easy to understand.
  • Can be collected in real-time.
  • Allows for alerting on issues that may arise in production.

Everything that could go wrong with this metric should be considered a bad user experience. For example, if the model returns a likelihood score of 0.55 for a loan request, that should either alert someone who can take a look at the loan request or alert a different system to load up another model for scoring.
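The loan example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `route_score` and `Decision` are hypothetical, the 0.90 cutoff for the low-interest band and the decision to escalate every score below 0.6 are assumptions, and the 100 ms budget is the response time mentioned in the constraints.

```python
from dataclasses import dataclass

LATENCY_BUDGET_MS = 100  # response time the article treats as good


@dataclass
class Decision:
    tier: str         # interest-rate tier, or "escalate"
    alert: bool       # True when a human or a fallback model should step in
    latency_ok: bool  # True when the response met the latency budget


def route_score(score: float, latency_ms: float) -> Decision:
    """Map a likelihood score to an interest-rate tier (hypothetical helper)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score > 0.90:          # assumed cutoff for the low-interest band
        tier = "low_interest"
    elif score > 0.70:
        tier = "mid_interest"
    elif score >= 0.60:
        tier = "high_interest"
    else:
        tier = "escalate"     # assumption: scores under 0.6 are escalated
    return Decision(
        tier=tier,
        alert=(tier == "escalate"),
        latency_ok=latency_ms <= LATENCY_BUDGET_MS,
    )


# A score of 0.55 falls outside every band, so it raises an alert.
print(route_score(0.55, latency_ms=80))
```

Because both the score and the latency are available at prediction time, a check like this can run on every request without waiting for ground truth.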

Now to the fun part: monitoring the things that could go wrong.

What could go wrong?

You can monitor what could go wrong with your machine learning model at two levels:

  • Functional level monitoring: monitoring model performance, inputs, and outputs.
  • Operational level monitoring: monitoring at system and resource level.

Functional level monitoring

You are primarily responsible for monitoring at this level as a data scientist. You monitor the performance of your model in relation to inputs, as well as the prediction results and what goes on in the model while learning in production.

Figure: Functional monitoring. Source: author

Functional monitoring is important in production because your model reacts to its inputs. If the inputs aren’t what your model expects, performance will most likely suffer. Monitoring and measuring input level challenges is the first step in solving functional performance issues.

There are three cases you may want to keep an eye on.

There are issues with data quality.

Data quality issues are usually caused by changes in the data pipeline. We have to monitor certain metrics based on data properties to verify production data integrity. If the input data isn’t what the model expects, an alert can be triggered for the data team or service owner.

The primary goal of monitoring here is to flag any data quality issues before the data is sent to your model, which would cause unreliable predictions in response.

There are some causes of data quality issues.

Preprocessing production data.

In some cases, you will be able to ingest data from multiple sources. It is easy for a change in one source to cause a break in the data preprocessing step.

There are changes to the source data.

When valid changes have been made to the data at the source, preprocessing works just fine, but it is not the sort of input configuration the model was trained on.

For example, a database admin may rename a feature column or add another column to capture new data. The model will struggle to predict on this input and will likely give a partial response, because it has to be updated before it can map the relationship between the new feature column and the old ones.

Data loss and corruption can be found at the source.

Changes in a data source upstream, or data not being recorded, may cause a data pipeline to not be able to ingest data from one source. There is a chance that the upstream data source has corrupted or missing features. It is important to monitor these issues as they will affect the overall performance of your system.

Detecting data quality issues

Data quality issues can be detected by writing tests. Data quality checks can include:

  • Testing for duplicate data.
  • Testing data for missing values.
  • Catching errors.
  • Catching data type and format errors.
  • Checking feature names for semantic errors.
  • Profiling data effectively for complex dependencies.
  • Checking whether the data meets the requirements of downstream services or consumers.
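Some of these checks can be sketched in plain Python. The function name `check_data_quality`, the record format (a list of dicts), and the schema format (feature name mapped to expected type) are all assumptions made for illustration; a real pipeline would more likely use a dedicated validation library.

```python
def check_data_quality(rows, schema):
    """Return a list of human-readable issues found in `rows`.

    rows:   list of dicts, one per record
    schema: dict mapping expected feature name -> expected Python type
    """
    issues = []
    seen = set()
    for i, row in enumerate(rows):
        # Feature-name check: unexpected or missing columns.
        extra = set(row) - set(schema)
        absent = set(schema) - set(row)
        if extra:
            issues.append(f"row {i}: unexpected features {sorted(extra)}")
        if absent:
            issues.append(f"row {i}: missing features {sorted(absent)}")
        # Missing-value and data-type checks for the expected features.
        for name, expected_type in schema.items():
            if name not in row:
                continue
            value = row[name]
            if value is None:
                issues.append(f"row {i}: missing value for '{name}'")
            elif not isinstance(value, expected_type):
                issues.append(f"row {i}: '{name}' has type {type(value).__name__}")
        # Duplicate check on the full record.
        key = tuple(sorted((k, repr(v)) for k, v in row.items()))
        if key in seen:
            issues.append(f"row {i}: duplicate record")
        seen.add(key)
    return issues


schema = {"amount": float, "country": str}
rows = [
    {"amount": 10.0, "country": "NG"},
    {"amount": 10.0, "country": "NG"},   # duplicate of the first record
    {"amount": None, "country": "US"},   # missing value
    {"amount": "12", "country": "US"},   # wrong data type
    {"amt": 5.0, "country": "US"},       # renamed feature column
]
for issue in check_data_quality(rows, schema):
    print(issue)
```

The returned list can feed an alert to the data team or service owner before the batch ever reaches the model.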

Possible solutions after detecting data quality issues:

  • Provide an alert after a change.
  • Make sure data owners implement proper data validation practices.
  • Ensure everyone is aware of their role in getting the data to the pipeline and enable effective communication between data owners so that when a change is made at the data source, the model owner(s) and other service owners are aware.

Data/feature drift.

Monitoring your input is perhaps the most important aspect of functional monitoring. It could tell you about the changing landscape and context of the business case you are working on. Unless they are constantly retrained and updated, models aren’t smart enough to adjust to a changing world.

Data drift is a change in distribution between training and production data. Changes in input data distribution affect model performance over time, although more slowly than data quality issues do.

While it is possible to monitor this drift at the level of the entire dataset, it is advisable to monitor it at the feature level.

There is a feature/attribute drift.

Monitoring at the feature level is the best way to detect issues. It is helpful when you are looking for explanations for model performance and behavior.

You can detect changes in the statistical properties of each feature’s values over time. These properties include standard deviation, average, frequency, and so on.

Changes in the preferences of business customers can cause feature drift. An example of feature drift is shown below, where a historical set of attributes are used as a baseline and newer attributes are compared so that changes in the distribution of the attributes can be detected.

Modified by the author and adapted from this source.

The changes that degrade model performance the most are changes to the most important features, the ones the model uses to connect the dots and make predictions. Monitoring input drift closely can give you a heads-up on model drift. When the distribution of the data changes, features drift, or other issues happen with the input, you can be alerted before they start degrading the performance of your model.

Data drift detection techniques

Distribution tests are used to detect data drift.

  • Basic statistical metrics such as the mean/average value, standard deviation, minimum and maximum values can be used to test drift between historical and current features.
  • Divergence and distance tests can be used for continuous features.
  • The chi-squared test is used for categorical features.
  • Out-of-the-box monitoring solutions for outlier detection are now being provided by some platforms.
  • If the features are large, as is the case with a lot of datasets, you may want to perform a statistical test on them.
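As one concrete instance of a distribution test for a continuous feature, here is a minimal sketch of the Population Stability Index (PSI) between a historical baseline and current production values. The binning scheme (equal-width bins over the baseline range) and the small floor used to avoid dividing by zero are assumptions of this sketch, not a standard prescribed by the article.

```python
import math


def population_stability_index(baseline, current, bins=10):
    """PSI between two samples of one continuous feature.

    Bins are equal-width over the baseline range; production values that
    fall outside that range are clamped into the first or last bin.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant baseline

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[max(0, min(bins - 1, idx))] += 1
        eps = 1e-6  # floor so that empty bins do not cause log(0)
        return [max(c / len(values), eps) for c in counts]

    p = proportions(baseline)
    q = proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


baseline = [i / 100 for i in range(100)]        # roughly uniform on [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]   # mass moved to the upper half
print(population_stability_index(baseline, baseline))
print(population_stability_index(baseline, shifted))
```

A PSI of zero means the binned distributions match. A commonly quoted rule of thumb treats values above roughly 0.2 as a significant shift, but the right alerting threshold depends on your data and should be tuned per feature.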

Aparna provides a conceptual and detailed look into these statistical checks in an article.

Most monitoring platforms give some of the metrics out-of-the-box so you don’t have to script them.

Possible solutions after data drift detection:

  • Triggering an alert and sending a notification to the service owner is the most likely solution. If the distribution change is really large, you might want to build another model with your new data.
  • Chances are that your new data won’t be large enough for retraining. In that case, you could combine your new data with the training data and, during retraining, assign higher weights to the features that drifted.
  • You may be lucky to have enough production data for the task. You can build a challenger model, deploy it, and use shadow testing or A/B testing to determine if it is better than the current model.


Monitoring outliers in your data is very difficult. You are monitoring events that are one-off or a group of one-off events. This can affect model performance, as outliers don’t have enough learnable structure across the entire dataset, which will cause the model to return an unreliable response.

Outlier detection.

Rather than setting specific monitors to detect outliers, you can:

  • Use the tests we discussed in the previous section to determine if values and distributions of features are different from normal benchmark periods.
  • Perform statistical distance tests on a small number of recent events.
  • Analyze whether the most important features your model learned after training have changed a lot.
  • Use any of the suitable distribution tests to determine how far off the features are from the features in the training set.
  • Use unsupervised learning methods to categorize model inputs and predictions. This makes it possible to detect outliers that your tests can’t catch.
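To make "how far off the features are from the training set" concrete, here is a minimal sketch that flags production values of a single feature by their z-score against the training data. The z-score is a simple stand-in chosen for illustration, not a technique the article prescribes, and the function name and the default threshold of 3 standard deviations are assumptions.

```python
import statistics


def flag_outliers(training_values, production_values, z_threshold=3.0):
    """Return production values that lie far from the training distribution.

    A value is flagged when it is more than `z_threshold` training standard
    deviations away from the training mean.
    """
    mean = statistics.fmean(training_values)
    std = statistics.pstdev(training_values)
    if std == 0:
        # Constant training feature: anything different is unusual.
        return [v for v in production_values if v != mean]
    return [v for v in production_values if abs(v - mean) / std > z_threshold]


training = [10, 12, 11, 9, 10, 11, 10, 9, 12, 11]
recent = [10, 11, 40]
print(flag_outliers(training, recent))
```

Running this on a small window of recent events, for the features your model is most sensitive to, gives a cheap first signal before heavier distribution tests are applied.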

Possible solutions after outlier detection:

  • Use data slicing methods to check model performance. You can use your monitoring tool to automate this process.
  • If your model keeps performing poorly based on your metrics, you might want to consider training a new challenger model.
  • Document the issue and track whether it is a seasonal outlier or a one-off outlier, so you can figure out how to handle it in the future.
  • If the performance of the model can’t be improved after retraining or the new model can’t cut it, you might want to reconsider the model’s performance benchmark and perhaps have a human in the loop, assisting the decision process for that period.

The model is the most important piece to monitor. With your model, you monitor its performance in production, its components, its version, and security threats.

Monitoring model drift.

Model drift occurs when the relationship between features and labels no longer holds because the learned relationship/patterns have changed over time. It is when a model returns unreliable and less accurate results over time.

Consider a deployed sentiment classification model. People’s sentiment about a topic changes over time. If you trained your model on positive or negative sentiment with certain words and topics, some sentiment that was positive may evolve to be negative over time.

Figure: Concept drift. Modified and adapted from source.

The answers to business questions are always evolving, and model drift happens because the real world changes. We are expected to reflect this fact in our machine learning applications.

Model drift can be gradual, like when the business climate changes naturally, and it can also be sudden.

There are different ways in which model drift can happen.

  • Sudden model drift is a sudden drop in model performance. It could be caused by a bug that introduces data quality issues, a model being deployed in a new domain, or a global crisis.
  • Gradual model drift is the natural consequence of a changing business landscape. It could happen as a result of user preferences changing over time, a new demographic of customers adopting your product, or newly introduced features that skew the underlying pattern in the data.
  • Recurring model drift is the result of seasonal events that are periodic and recurring over a year, such as holidays and yearly discounts. It can also happen when user preferences are seasonal or when one model serves different regions.
  • Temporary model drift is difficult to detect using rule-based methods. It happens for reasons such as users using the product in a way that was not intended, a model temporarily serving newer clients, or system performance issues.

Model drift detection

  • The same statistical tests used for data drift can be used to detect model/concept drift.
  • Predictive performance is reduced over time. By setting a predictive metrics threshold, you can confirm whether your model consistently returns unreliable results.
  • Monitoring data drift can tell you if you should analyze your model for degradations or drifts.
  • You can monitor label drift when you can compare ground truth/actual labels to your model’s predictions to analyze trends and new interpretations of data.
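The idea of a predictive metrics threshold can be sketched as a rolling accuracy check. The class name `AccuracyMonitor`, the default threshold of 0.8, and the window size are assumptions for illustration; the sketch also assumes that actual labels eventually arrive for each prediction.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks accuracy over the most recent `window` labelled predictions."""

    def __init__(self, threshold=0.8, window=100):
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)  # True where prediction == actual

    def record(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def drifting(self):
        """True once the window is full and accuracy is below the threshold."""
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy() < self.threshold


monitor = AccuracyMonitor(threshold=0.8, window=4)
for prediction, actual in [(1, 1), (0, 0), (1, 1), (1, 0)]:
    monitor.record(prediction, actual)
print(monitor.accuracy(), monitor.drifting())
```

Waiting for a full window before flagging drift avoids alerting on the first few predictions, at the cost of reacting more slowly to a sudden drop.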

Possible solutions after detecting model drift:

  • Keep monitoring and retraining deployed models according to your business reality. If your business objectives and environment change frequently, you may want to automate your system to schedule and execute retraining at shorter preset intervals than a more stable business would need.
  • If retraining doesn’t improve performance, you may want to rethink your models.
  • If you are working on larger scale projects with a good budget and little trade-off between cost and performance, you may want to consider online learning algorithms for your project.

Model configuration and artifacts

The model configuration file and artifacts contain all the components that were used to build that model, including:

  • Training dataset location and version.
  • Test dataset location and version.
  • Hyperparameters used.
  • Default feature values.
  • Dependencies and their versions, so that you can easily find them when model failure is caused by dependency changes.
  • Environment variables.
  • Model type (classification vs regression).
  • Model author.
  • Target variable name.
  • Features selected from the data.
  • Code and data for testing scenarios.
  • Code for the model and its preprocessing.

Track the configurations for relevance, especially the hyperparameter values used by the model.
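One way to make such a configuration trackable is to store it as a structured record and diff it between versions. The sketch below is only an illustration: the `ModelConfig` fields are a subset of the items listed above, and every value shown (paths, names, version strings, hyperparameters) is hypothetical.

```python
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class ModelConfig:
    model_version: str
    model_type: str       # e.g. "classification" or "regression"
    author: str
    target: str
    train_data: str       # location and version of the training dataset
    test_data: str        # location and version of the test dataset
    hyperparameters: dict = field(default_factory=dict)
    dependencies: dict = field(default_factory=dict)


def config_diff(old, new):
    """Return {field: (old_value, new_value)} for every field that changed."""
    a, b = asdict(old), asdict(new)
    return {name: (a[name], b[name]) for name in a if a[name] != b[name]}


# Hypothetical values, for illustration only.
v1 = ModelConfig("1.0", "classification", "jane", "is_fraud",
                 "data/train/v3", "data/test/v3",
                 hyperparameters={"max_depth": 6})
v2 = ModelConfig("1.1", "classification", "jane", "is_fraud",
                 "data/train/v3", "data/test/v3",
                 hyperparameters={"max_depth": 8})
print(config_diff(v1, v2))
```

Logging the diff alongside each deployment makes it easier to tie a change in behavior to the hyperparameter or dataset change that caused it.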

Model versions

If you want to be sure that the right version is deployed, you need to monitor model versions in production.

It is possible to monitor model versions by configuring your retraining program to automatically report versions. One way to make problems easier to tie to model versions is to log version history to an evaluation store.

Concerted adversaries.

Every business is vulnerable to security threats. With machine learning applications becoming the central decision system of most companies, you have to be concerned about the security of your model in production. Machine learning models are vulnerable to attacks.

For example, fraudsters may attempt to fool a model that is tasked with detecting suspicious credit card transactions.

Other applications that are vulnerable to attacks include:

  • Applications that use artificial intelligence to detect fake news and other inappropriate content.
  • Applications that deal with audio or image recognition.

Concerted adversaries are the systems or hackers that deliberately engage your system through attacks. They cause the system to make mistakes by misleading it with adversarial examples. This is a specific problem in the safety of your machine learning application in production that needs to be monitored.

You can keep an eye on your system for attacks.

  • The same steps you use to flag inputs with outlier events are used to flag atypical events.

There are possible solutions after detecting adversaries.

Most of the research on methods that could defend models from threats is still at an early stage. For now, atypical inputs that are detected should be routed to a human supervisor before their predictions are used. Subject matter experts can study these cases and use them to defend the model against further threats.

Speed is important in business-critical applications. How fast experts can detect a threat, study it, patch it, and redeploy the model may be the difference between success and failure.

It is worth checking out the Adversarial Robustness Toolbox.

The predicted output.

Monitoring model output in production is not just the best indicator of model performance; it also tells you whether business needs are being met. When it comes to model predictions, the most important thing to monitor is model performance in line with business metrics.

Evaluation metrics for model evaluation.

Monitoring your model in production involves using metrics to evaluate performance. There are different metrics that can be used here.

When you have a ground truth/label to compare your model’s predictions with, you typically evaluate the model using model scoring metrics.

There are ground truth/actual labels.

The correct solution to the problem your model tries to solve in the real world is your ground truth/actual label.

An example would be :

Will this user click on the ad ?

The actual label is yes if a user clicks on the ad. An overview of how well your model is doing in production can be provided by an aggregation of the comparison between your model ‘s predictions and what the correct solution is. Real-time feedback is available in this case because your system can tell almost immediately if a user clicked on an ad. A comparison could look like this.

It is difficult to compare actuals to predictions in production. If you don’t have dedicated and expert annotators to log the actual labels for you, they’re often impossible to get in real-time. Loan repayment might take months, even years, so a loan approval system wouldn’t give you “correct solutions” in real-time. Measuring the performance of such a model involves a sophisticated feedback loop.

Sometimes, your ground truth can be influenced by your model ‘s predictions.

If you build a model to predict which customers will repay a loan, it may appear to perform well because it approves loans for customers that do pay them back. But you never find out whether the customers it rejected would have repaid. The ground truth is therefore not the most appropriate source of truth when it comes to the performance of such a model.

Scoring models when ground truth is available

If you want to gauge the functional performance of your model in production, you need to compare its predictions with what actually happens in the real world.

A concrete idea about performance in a production environment is given by model predictions and ground truth. An example of how you can set up a monitoring system that collects ground truth data from various sources is below.

Modified by the author and adapted from source

The ground truth service typically involves real-time ground truth generated by your system, for example, logging if a user clicked on an ad when the model predicted they would, a human label annotator, or other.

The event id that tracks prediction and model details is tagged with a ground truth event. The monitoring platform computes the model performance metric based on the model prediction and actual label.
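The matching step can be sketched as a join on event id. The dictionary-based storage and the function name `match_and_score` are assumptions for illustration; a monitoring platform would typically do this join inside its evaluation store.

```python
def match_and_score(predictions, ground_truth):
    """Join predictions to actual labels by event id and compute accuracy.

    predictions:  dict mapping event id -> predicted label
    ground_truth: dict mapping event id -> actual label (may lag behind)
    """
    matched = [
        (predictions[event_id], ground_truth[event_id])
        for event_id in predictions
        if event_id in ground_truth
    ]
    pending = len(predictions) - len(matched)  # events still waiting for a label
    if not matched:
        return {"matched": 0, "pending": pending, "accuracy": None}
    correct = sum(predicted == actual for predicted, actual in matched)
    return {
        "matched": len(matched),
        "pending": pending,
        "accuracy": correct / len(matched),
    }


predictions = {"e1": 1, "e2": 0, "e3": 1, "e4": 1}
ground_truth = {"e1": 1, "e2": 1, "e3": 1}  # label for e4 has not arrived yet
print(match_and_score(predictions, ground_truth))
```

Reporting the number of pending events next to the metric matters, because a metric computed on a small or unrepresentative set of labelled events can be misleading.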

You probably already know the metrics for a classification model:

  • Accuracy.
  • Confusion matrix.
  • ROC-AUC score.
  • Precision and recall scores.
  • F1-score.

Metrics for a regression model:

  • Root mean square error (RMSE).
  • R-squared and adjusted R-squared metrics.
  • Mean absolute error (MAE).
  • Mean absolute percentage error (MAPE).

When the ground truth is available, it ‘s possible to calculate the model metrics above.
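For reference, several of these metrics are simple enough to compute directly from matched predictions and actual labels. The sketch below uses only the standard library; the function names are hypothetical, the classification part assumes binary labels, and the MAPE calculation assumes that no actual value is zero.

```python
import math


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def regression_metrics(y_true, y_pred):
    """RMSE, MAE, and MAPE (in percent); assumes no actual value is zero."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    n = len(errors)
    return {
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "mae": sum(abs(e) for e in errors) / n,
        "mape": 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }


print(classification_metrics([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))
print(regression_metrics([100, 200], [110, 180]))
```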

Scoring models when ground truth is not available

When the ground truth isn’t available or is compromised, we use the prediction results distribution as a performance proxy.

Hopefully, your monitoring platform is set up in a way that model predictions are also logged to a model evaluation store, which holds the model’s response to every piece of input data for every model version. You will be able to monitor model predictions over time and compare the distributions using metrics such as Hellinger Distance, Kullback-Leibler Divergence, and Population Stability Index.
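Two of these distances can be sketched directly on binned prediction scores. This is a minimal illustration that assumes scores lie in [0, 1] and are binned into equal-width buckets; the function names and the small smoothing constant in the KL divergence are assumptions.

```python
import math


def histogram(scores, bins=10):
    """Proportion of scores in each equal-width bin; scores assumed in [0, 1]."""
    counts = [0] * bins
    for s in scores:
        counts[min(int(s * bins), bins - 1)] += 1
    return [c / len(scores) for c in counts]


def hellinger(p, q):
    """Hellinger distance: 0 for identical distributions, 1 for disjoint ones."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q), smoothed with `eps` so empty bins in q do not divide by zero."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


last_month = histogram([0.1, 0.2, 0.3, 0.4])
this_week = histogram([0.85, 0.9, 0.95, 0.99])
print(hellinger(last_month, this_week))
print(kl_divergence(last_month, this_week))
```

Note that the KL divergence is not symmetric, so it matters which period you treat as the reference; the Hellinger distance is symmetric and bounded, which makes thresholds easier to reason about.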

There is more than one use case for monitoring models in production. The above covers how model performance in production can be monitored.

Operational level monitoring

IT operations people are primarily responsible for monitoring at the operations and system level, but it has to be a shared responsibility between you and the ops team. Alerts at this level are typically forwarded to the ops team to act on, and you can opt to get a copy as well.

At this level, you monitor the resources your model runs on and make sure they are healthy. System performance metrics, such as I/O, disk utilization, memory, and traffic, are things that ops people typically care about.

Figure: Operational monitoring. Cc* means “carbon copy” in the context of who gets an alert. Source: author

Performance monitoring for models in production.

Putting machine learning into production is fundamentally an infrastructure problem. Monitoring model performance and data is not enough on its own, because your infrastructure is the foundation of a successfully deployed and maintained machine learning solution.

Monitoring your application performance can help you answer questions such as:

  • Does the application meet the requirements?
  • Is it serving requests quickly enough?
  • Is it using resources efficiently and saving costs?
  • Can it handle changes in code dependencies?
  • Does it meet the requirements?
  • What are its limitations?

System performance metrics.

What we monitor here is influenced by the challenges we expect our model and data to face during utilization. You need to be aware of certain metrics that can give you an indication of how your model performs in line with the rest of the application stack. For example, the latency with which the model returns predictions will affect the speed of the whole system.

The system/application performance metrics that will give you an idea of model performance include:

  • CPU/GPU utilization while the model is computing predictions on incoming data from each call, which tells you how much your model is consuming per request.
  • Memory utilization, which matters for faster I/O performance when the model or input data is cached in memory.
  • The number of failed requests for an event.
  • The total number of calls.
  • The response time of the model server or prediction service.
There is an example Neptune dashboard that tracks some notable performance metrics.

Figure: Neptune dashboard showing application performance metrics.

System reliability.

Here, we are monitoring the infrastructure and network: which machines are running, how many clusters are up and running at any given time, and which prediction services get the most requests.

This isn’t our primary concern, but don’t you want to be aware of the reliability of the entire system that powers your model?

Pipelines

You can monitor the health of your data and model pipelines. Unhealthy data pipelines can affect data quality.

Data pipelines

Data quality issues can arise from bad or unhealthy data pipelines, so it’s important to monitor their health. This is very difficult to monitor for your IT ops team and may require empowering your data engineering team to monitor and fix issues.

It has to be a shared responsibility. Communication with your DataOps team can help you tighten up your system and drive positive results.

If you are charged with the responsibility of monitoring your data pipeline, here are some metrics and factors you may want to track.

  • Input data: do you have data validation tests and checks in place to make sure the data is correct?
  • Intermediate job steps: are the inputs and outputs of each task what you expect in terms of the number of files and file types, and how long does a task take to run?
  • Output data: what is the typical file size for an output file?
  • Data quality metrics: these are tracked based on the data that flows in. They could be basic statistical properties of the data such as mean, standard deviation, correlation, and so on. The metrics to track will depend on the dimensions of the data expected, whether a couple of features or several features.
  • Job details: how long the job took to run and the state of the job.

Model pipeline

Track the key factors that can cause your model to break in production. These include:

  • Dependencies: you don’t want a situation where your model was built with one release of Tensorflow 2 and a dependency update by someone else on your team, bundled with Tensorflow 2.4, causes part of your retraining script to fail. Dependency updates that cause failures are easier to debug if you log the version of each dependency.
  • Jobs: the actual time a retraining job was triggered, how long it took the retraining job to run, the resource usage of the job, and the state of the job.
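One way to log dependency versions as pipeline metadata is sketched below. It relies on Python's standard `importlib.metadata` module. The function names `dependency_versions` and `changed_dependencies` are hypothetical helpers for illustration, not part of any monitoring platform.

```python
import sys
from importlib import metadata


def dependency_versions(packages):
    """Return {name: installed version} for the given distribution names."""
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence instead of failing the whole pipeline run.
            versions[name] = "not installed"
    return versions


def changed_dependencies(previous, current):
    """Return {name: (old, new)} for every dependency whose version differs."""
    return {
        name: (previous.get(name), version)
        for name, version in current.items()
        if previous.get(name) != version
    }
```

If you store the output of `dependency_versions` with every retraining run, `changed_dependencies` can compare the last successful run with a failed one and point at the update that is the likely culprit.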

You need to keep an eye out for how much it costs to host your entire machine learning application, including data storage and compute costs, retraining, or other types of orchestrated jobs. If they are not being tracked, these costs can add up fast. It takes a lot of computational power for your models to make predictions.

Your application may be hosted with a cloud vendor. If so, you can track your bills and how much each service is costing you with the help of that vendor’s tools. Monitoring system usage and cost can help you analyze what part of the application is most costly, and see what compromises can be made.

Your application may also have service level agreements (SLAs). The prediction service, for example, may have a set SLA. Make sure to set a threshold that triggers an alert when the SLA is not met.
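As a rough sketch of an SLA threshold, the snippet below counts how many prediction requests exceeded a latency limit and raises an alert flag when too many did. The function name `sla_report`, the latency-based SLA, and the default 1% violation budget are all assumptions made for illustration. Your actual SLA terms may look different.

```python
def sla_report(latencies_ms, sla_ms, max_violation_rate=0.01):
    """Summarize how a prediction service is doing against a latency SLA.

    latencies_ms: observed request latencies in milliseconds.
    sla_ms: the agreed latency limit per request.
    max_violation_rate: share of requests allowed to exceed the limit
        before an alert should fire (an assumed budget).
    """
    if not latencies_ms:
        raise ValueError("no latencies recorded for this window")
    violations = sum(1 for latency in latencies_ms if latency > sla_ms)
    rate = violations / len(latencies_ms)
    return {
        "requests": len(latencies_ms),
        "violations": violations,
        "violation_rate": rate,
        "alert": rate > max_violation_rate,
    }
```

Running this over a fixed time window, for example the last five minutes of requests, keeps the alert tied to recent behaviour rather than to the whole history of the service.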

In the next section, you will learn what to monitor at your current maturity level and this will give you an actionable idea of where to start.

Monitoring and observability in machine learning systems.

Monitoring a system and observing the same system are two different approaches.

Observability is your ability to look at the metrics you have been monitoring and perform root-cause analysis on them to understand why they are a certain way, and what threat they pose to the overall performance of your system.

Monitoring is pretty much everything that happens before observability:

  • Collecting performance metrics.
  • Tracking them.
  • Detecting problems.
  • Alerting the right user.

You can monitor without observing, but you cannot observe your system’s performance without monitoring it. Monitoring is about collecting the dots; observability is about connecting them.

Christopher wrote an article about observability and monitoring for machine learning production systems.

How to start monitoring your machine learning models.

After you deploy your models, you need to start monitoring them. Now that you know why you need to monitor, you should take a step-by-step approach.

Step-by-Step Monitoring Approach | Source: author

What to watch for based on your maturity level.

Monitoring a model in production is a sophisticated process because there are a large number of things to track, and different users track them. Tools and platforms are needed to make things simpler. So how do you know what the right tools and platforms are?

The landscape is constantly evolving, and I can assure you that there are no universally good tools.

Laying out requirements and figuring out what your needs are upfront can be very difficult, so I thought it would be simpler if I met you at your current maturity stage. I put an illustration in the “Bringing Them All Together…” section.

You are at level 0 in your maturity stage.

You’re training and deploying models manually if you’re at this level. At this stage, you probably aren’t even thinking of monitoring your model yet, perhaps just finding a way to validate your model on the test set and hand it off to your IT ops or software developers.

I know because I was there. As I mentioned at the beginning of the article, I celebrated when I handed it off, and it ended in tears on a hospital bed a couple of months later.

To avoid that, I propose that you prioritize the lowest-hanging fruit. Although it is less informative and won’t help you monitor model performance, it can still serve as a reasonable performance proxy to tell you if your general application is working as intended.

You don’t want to spend long hours focusing on monitoring your model’s metrics, or trying to justify its performance in line with a business KPI, when your workflow is still in its manual deployment stage. Such metrics will get easier to measure and analyze as your system matures.

You are at level 1 in your maturity stage.

Being at this level means that you have automated the machine learning pipeline to enable continuous training of your machine learning models based on criteria or a defined threshold.

At this level, I think you should focus more on monitoring the following:

  • The business metric used to gauge your model’s performance, unless it turns out to be too difficult to measure. This matters most if you can’t spend resources on getting ground truth for monitoring model metrics.
  • The properties of your production data and your model’s performance in production, which can be used to decide when to retrain models with new production data.
  • The properties of a retrained model. If you are going to manually deploy a retrained model, you want to make sure you can monitor the properties of that model before redeploying it to production.
  • Your production pipeline health, since retraining steps are automated.
  • How much your continuous training process is costing, so you don’t wake up with a huge bill one day.

You are at level 2 in your maturity stage.

Being at this level shows that you are mature in your implementation and that the entire pipeline is a robust, automated CI/CD system. Your training, validation, and deployment phases are all automated in a complementary feedback loop.

At this stage, you should keep an eye on everything, but make sure that your team is focused on the more informative metrics and not on the less informative ones.

Bringing them all together…

Here is what I think you should focus on monitoring, depending on the maturity of your project.

Essential Signals to Monitor | Original concept here. Source: author

While this isn’t a hard and fast rule and can take a bit of thinking to figure out with your team, the way I think of it is for you to ask some questions.

  • What do we want from a tool?
  • At this time, what are the most important metrics?
  • Are these insights in line with improving the performance of our application to ensure continuous positive business value?
  • What’s 20% of effort we can put in right now to get 80% of the result? The 80% in this case is positive business value.
  • Do we have a dedicated team of ITOps engineers?
  • How easy is it to integrate a monitoring solution into a deployment?

These are difficult to think through, but they have to be done. I think you should start with the question, “What’s 20% of effort we can put in right now to get 80% of the result?”

Tracking the necessary metrics is hard enough, so start with a simple solution: one that helps you use fewer tools to monitor your metrics. As you learn about the tools and your needs, you can add richer tools if you need them and know how to use them.

Survey existing tools.

Now you know what to look out for based on your maturity level. Next, survey your organization for existing tools that your engineering team uses to monitor deployed applications.

For example, you could be in an organization where your ops team already uses Prometheus and Grafana for monitoring system performance and other metrics. In that case, you don’t have to worry much about looking for a platform: you just set up and define your metrics, and write test cases for your model and data.

This is rarely the case. You may have to be the one to set out and look for a platform that can help you monitor the metrics you need based on your needs analysis.

Before you start looking for a platform that will fit your needs, be sure to survey your company, understand engineering and ops culture, and talk to the ops team if you have one.

Selecting a monitoring/observability platform.

It can be difficult to choose an observability platform. There are costs, security concerns, and other challenges.

Thankfully, it can be simplified.

  • By this point in the article, you probably understand your monitoring needs.
  • Knowing what you have available, as well as the budget on the ground for you to work with, is important. Some platforms are expensive with features that you do n’t need, and there are also open-source solutions available.
  • Take into account the necessary qualities a monitoring/observability platform should have.

Before selecting a monitoring/observability platform, there are some things to consider.

For now, principles that have been used for monitoring and observing software in production are being used for monitoring data and machine learning models.

Some platforms, such as Monte Carlo and Anomalo, are most suited for monitoring and observing data quality. Others, such as Arize, are machine learning observability platforms.

Arize monitoring platform | Source

You should check out Neptune if you want a model monitoring solution for your experimentation and training processes. It is free to use and you can try it out.

Customized dashboard in Neptune

Your monitoring/observability platform should:

  • Be easy and intuitive to use, and configurable by an individual service owner instead of a central ops team. Putting a model into production is hard enough; don’t make things harder for yourself.
  • Provide out-of-the-box metrics, with smart defaults for metrics on your registered model.
  • Integrate your production data with its systems through a variety of methods.
  • Offer out-of-the-box visualization and dashboards, with prepopulated dashboards and graphs. You should also be able to add new panels, create visualizations with the necessary information, and create custom queries.
  • Offer flexible access through a management console or user interface.
  • Be customizable. You should be able to define your metrics, write your test cases and checks, create custom dashboards and visualizations, set custom thresholds, and generally set up custom integrations.
  • Be collaborative. This isn’t a one-person job. You should be able to share reports and data with service owners, and the platform should give each stakeholder insights into how best they can continue to provide positive business value based on your model’s predictions.
  • Help you explain model decisions, for example through data slicing.
  • Let you test model predictions to detect bias and threats.
  • Work regardless of what technology you used in building your model or what environment it is running on in production.
  • Integrate easily with your organization’s existing tools.
  • Detect outliers automatically, which is something platforms are beginning to do.
  • Offer cloud or on-premise deployment. Most platforms offer one or the other; some offer both.
  • Be production-grade. Production is a very hard place for models, and you don’t want a platform that can’t keep up with models that make thousands to millions of predictions per second.
  • Give you a detailed look at your data and model performance. Can you drill into each feature in your production data to detect problems and drift at the feature level?
  • Support log auditing across model, data, and other system logs.

Picking one tool requires a lot of analysis. Some of the tools I have seen tick all or most of the boxes above. If you want to learn more about the deployment platforms you can use, Dr. Ori Cohen has explained most of them in his article.

Challenges of monitoring Machine Learning systems in production.

Challenges at the input level.

Monitoring your input data in production can be a challenge for several reasons.

  1. Your data sources are scattered and unreliable, so it might be difficult to unify them into a single source of truth or data mesh.
  2. You don’t have clear data requirements in terms of what the data structure should be, or what an acceptable service level agreement is for the upstream data source.
  3. Someone makes an update to a data source and doesn’t inform others, which causes miscommunication between teams.
  4. The metadata for your production data workflows is not discoverable, which makes tracking data lineage very difficult. You also wouldn’t know the producer and consumer of the data in question, which makes logging events harder.

Some of the challenges can be cultural as well. If data is treated as a product, then:

  • People own the product and are assigned roles.
  • Team members communicate with each other.
  • They record them through logging.
  • Everyone can help solve cultural issues by seeing that data is accessed and utilized effectively.

You can learn more about data as a product here.

There are some challenges when monitoring model quality.

Monitoring your model quality can also be a challenge in production.

  • If you don’t have real-time ground truth collection, you might have to rely on upstream systems or even human annotators to collect actual labels.
  • There are model biases that stem from biases in the data.
  • There are black box models that could cause compliance issues.
  • Model data is not tracked in production. This doesn’t have to be an issue when you use a store like Neptune. Learn more here.

Data slicing can be used to analyze segments of your model ‘s predictions and test for bias. Some tools can do this for you. Production data validation test cases can prevent and/or eliminate biases peculiar to the dataset. The model metrics can be used to detect biases peculiar to the model.
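To illustrate data slicing, here is a small sketch that computes accuracy per segment and flags segments that lag far behind the best one. The record layout (`prediction`, `label`, and a slice key such as `region`) and the 10-point gap are assumptions for the example. A real bias analysis would use metrics that your team agrees are appropriate.

```python
from collections import defaultdict


def accuracy_by_slice(records, slice_key):
    """Compute accuracy separately for each value of slice_key.

    records: dicts containing slice_key, "prediction", and "label".
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        segment = record[slice_key]
        totals[segment] += 1
        if record["prediction"] == record["label"]:
            correct[segment] += 1
    return {segment: correct[segment] / totals[segment] for segment in totals}


def lagging_slices(slice_accuracy, max_gap=0.1):
    """Return slices whose accuracy trails the best slice by more than max_gap."""
    best = max(slice_accuracy.values())
    return sorted(
        segment for segment, accuracy in slice_accuracy.items()
        if best - accuracy > max_gap
    )
```

A slice that shows up in `lagging_slices` is not proof of bias, but it is a good place to start looking.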

According to its makers, Know Your Data helps researchers, engineers, product teams, and decision-makers understand datasets to improve data quality, and helps mitigate fairness and bias issues.

Best practices for monitoring Machine Learning models in production.

There is still a lot of work to be done after you deploy your model to production.

Monitoring will help you tell whether your model is performing as expected, whether it needs to be retrained, or whether you need to redefine what good business performance looks like. I recommend you start with some of these best practices.

General monitoring best practices.

  • First, focus on people. If you build a culture where data is treated as the product in your organization, people will most likely be inclined to take ownership of the product to ensure it serves its intended purpose end-to-end. You can learn a lot from cultural change.
  • If possible, don’t give the application’s monitoring power to one person. If you have a cross-functional team, let everyone handle their own service and communicate effectively. That way, no one will be overwhelmed when use cases scale.
  • Take a lean approach and don’t use too many tools. Decentralize the team so that everyone stays on top of a task.
  • Monitoring begins when you begin experimentation. Monitoring model experimentation metrics, logs, and so on should be built into the model development stage.
  • When you encounter a crucial decision-making point, always consider what is optimal for the productivity of your team.
  • Encourage your team to document their troubleshooting process and create a framework for effective model maintenance.

Best practices for data monitoring.

  • Use the same process for both batch and streaming data so that it’s easier to figure out what’s wrong with the data.
  • For more insight, don’t only check for drift across an entire dataset; also look at drift feature by feature.
  • Invest in a global data catalog. It will help you tackle challenges with streaming and maintaining reliable data quality, and it will make tracking lineage easier.
  • Before moving your model to production, perform a pre-launch validation on your evaluation set.

Best practices for model monitoring.

  • A big dip in performance is often indicative of something wrong. You can use tools that detect this automatically.
  • Use shadow deployment and testing with the challenger model vs the champion model, and log the predictions so that performance of the new model can be tracked alongside the current model in production.
  • You can use a store like Neptune to store hyperparameters for models that have been retrained in production.
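The shadow deployment idea from the list above can be sketched roughly as follows. Here the champion and challenger are plain Python callables and the log is an in-memory list, which are simplifying assumptions. The helper names `serve_with_shadow` and `agreement_rate` are hypothetical.

```python
def serve_with_shadow(features, champion, challenger, shadow_log):
    """Serve the champion's prediction and record the challenger's alongside it.

    Only the champion's output is returned to the caller. The challenger
    runs "in the shadow", so a failure there must never affect users.
    """
    champion_prediction = champion(features)
    try:
        challenger_prediction = challenger(features)
        error = None
    except Exception as exc:
        challenger_prediction = None
        error = repr(exc)
    shadow_log.append({
        "features": features,
        "champion": champion_prediction,
        "challenger": challenger_prediction,
        "challenger_error": error,
    })
    return champion_prediction


def agreement_rate(shadow_log):
    """Share of logged requests where both models agreed, or None if no data."""
    compared = [entry for entry in shadow_log if entry["challenger_error"] is None]
    if not compared:
        return None
    agreed = sum(entry["champion"] == entry["challenger"] for entry in compared)
    return agreed / len(compared)
```

Agreement alone does not say which model is better. Once ground truth arrives, the same log lets you score both models on identical requests.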

Best practices for monitoring predictions.

  • When ground truth isn’t available to collect, prediction drift can be a good performance proxy, but it shouldn’t be used as the sole metric.
  • Track unreasonable outputs from your model. For example, your classification model predicting the wrong class for a set of inputs with a high confidence score, or your regression model predicting a negative score for a given set of features.
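The two examples of suspicious outputs mentioned above could be checked with something like the sketch below. The function names, the record fields (`prediction`, `label`, `confidence`), and the 0.9 confidence cut-off are illustrative assumptions.

```python
def out_of_range(predictions, lower=0.0, upper=None):
    """Return (index, value) pairs for regression outputs outside [lower, upper]."""
    flagged = []
    for index, value in enumerate(predictions):
        if value < lower or (upper is not None and value > upper):
            flagged.append((index, value))
    return flagged


def confident_mistakes(records, min_confidence=0.9):
    """Return labelled records that were predicted wrongly with high confidence.

    records: dicts with "prediction", "label" (None if unknown), "confidence".
    """
    return [
        record for record in records
        if record["label"] is not None
        and record["prediction"] != record["label"]
        and record["confidence"] >= min_confidence
    ]
```

The second check needs ground truth, so it can only run once labels have been collected for at least part of your traffic.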

Sidebar: setting alerts the right way.

Alerting is an important part of monitoring. You definitely want to be notified when something goes wrong.

Different things with different degrees of priority will go wrong. How do you separate the wheat from the chaff? Some tools offer out-of-the-box, smart alerting features, but sometimes it will boil down to what makes sense to a particular business, based on that application.

Setting up notifications in the correct way.

Ernest and Peco Karayanev gave me the best set of advice on actionable alerts.

  1. Test your alerts before they go into production.

In the case of data drift, you should write test cases that simulate the statistical metrics you are using to monitor distribution changes between a development dataset and production data. Some platforms automatically apply these statistical checks for you, so all you have to do is register your development data and integrate the production data you are using for simulation. Then set a threshold for a particular statistical check.

Image modified by the author and adapted from this source.

  2. Monitor the primary metrics identified in your needs analysis.
  3. Agree on the medium for each alert so that every service owner is comfortable with theirs. Team chats? E-mails?
  4. Add context to the alert: include descriptive information and the action the service owner should take.
  5. Set up a feedback loop that makes your monitoring better. If data drift is greater than a set threshold, you might want to use a tool like Apache Airflow to kick off retraining. Alternatively, the team might have to manually retrain on the new data.
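A drift alert can be tested before production by simulating drift and asserting that the alert fires, as in the sketch below. It uses the population stability index (PSI) as the statistical check, which is one common choice and not necessarily what your platform uses. The 0.2 threshold is a widely quoted rule of thumb and should be treated as an assumption to agree on with your team. The function names are hypothetical, and the retraining step itself (for example, an orchestrator job) is left out.

```python
import math


def population_stability_index(reference, production, bins=10):
    """PSI between a reference (development) sample and a production sample."""
    if not reference or not production:
        raise ValueError("both samples must be non-empty")
    low, high = min(reference), max(reference)
    if low == high:
        raise ValueError("reference sample has no spread to bin")
    width = (high - low) / bins

    def proportions(values):
        counts = [0] * bins
        for value in values:
            # Values outside the reference range fall into the edge bins.
            index = int((value - low) / width)
            counts[min(max(index, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(count / len(values), 1e-6) for count in counts]

    ref = proportions(reference)
    prod = proportions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))


def drift_action(psi, threshold=0.2):
    """Decide what the feedback loop should do for a given PSI value."""
    return "trigger_retraining" if psi > threshold else "no_action"


def test_alert_fires_on_simulated_drift():
    """Simulate drift by shifting the reference data, then check the alert."""
    reference = [i / 100 for i in range(1000)]
    shifted = [value + 5 for value in reference]
    same = population_stability_index(reference, reference)
    drifted = population_stability_index(reference, shifted)
    assert drift_action(same) == "no_action"
    assert drift_action(drifted) == "trigger_retraining"
```

A test like `test_alert_fires_on_simulated_drift` can run in your CI pipeline, so a change to the threshold or the check itself cannot silently disable the alert.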

Best practices for alerting.

  • Make sure you and your team are clear on who gets what alert. If possible, data quality alerts should go to the data engineering team, model quality alerts to the data scientist, and system performance alerts to the team that owns system performance.
  • Only set alerts when you know there is a condition that requires intervention. For example, agree on a threshold for data quality, and trigger alerts only when that threshold condition is not met.
  • Understand what has real business impact and alert on that, not on just anything that goes wrong with the application.
  • The team should document the actions they took as well as the outcomes of any alert they get.
  • You want to avoid “ alert hell ”, a flurry of irrelevant alerts that may have you losing track of the real, business-impacting alert in the noise.
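Two of the practices above, agreeing on who gets which alert and avoiding a flurry of repeats, can be sketched in a few lines. The `AlertRouter` class, the team names, and the one-hour cooldown are hypothetical choices for illustration. Real alerting tools usually provide their own routing and deduplication features.

```python
class AlertRouter:
    """Send each alert type to its owning team and suppress recent repeats."""

    def __init__(self, owners, cooldown_seconds=3600):
        self.owners = owners  # e.g. {"data_quality": "data-engineering"}
        self.cooldown_seconds = cooldown_seconds
        self._last_sent = {}

    def route(self, alert_type, source, message, now):
        """Return the alert to deliver, or None if it is a recent repeat.

        now: current time in seconds, passed in so the logic is easy to test.
        """
        if alert_type not in self.owners:
            raise KeyError(f"no owner agreed for alert type {alert_type!r}")
        key = (alert_type, source)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return None
        self._last_sent[key] = now
        return {
            "to": self.owners[alert_type],
            "type": alert_type,
            "source": source,
            "message": message,
        }
```

Raising an error for an alert type without an owner is a deliberate choice here: it forces the team to decide who is responsible before the alert can exist.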

Be on the lookout for this feature because the best tools will help you manage your alerts properly.

You should log everything.

If I told you that you could save a lot of time putting out a fire outbreak with your machine learning application in production by just knowing where the source of the fire is, what would you think ?

Log monitoring can give you a huge head start in figuring out what ‘s wrong with your entire system.

Why should you log?

Simple, because you want to make sure you can put out the fire quickly, identify the source of the fire, and take measures to prevent future fires.

In machine learning, you log because you want to be able to quickly identify the cause or potential cause of a problem, take the necessary measures to solve it, and patch the loophole in your system.

Being able to log your data and model events will help create data and model lineage that can be traced and audited for compliance requirements and standards.

What should you log?

If you log every activity of every component in your system, the volume will always end up being a problem. Storage and resources will be used to host and process the log files.

You should be very careful about what you log. You probably don’t have the budget to log every activity of your system, and you shouldn’t log everything. What you log should have a real effect on the business value of your application.

Meme modified by the author from this source.

You should log some objects for the data and model components.

  • Data pipeline events.
  • Production data, with its metadata alongside if possible.
  • Model metadata, including the model version and configuration details.
  • Prediction results from the model.
  • Prediction results from shadow tests, if applicable to your system.
  • The ground truth label, if it’s available.
  • General operational performance, as tracked by standard monitoring systems.
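As an illustration of the objects listed above, a single prediction could be captured in one log record like the sketch below. The `PredictionEvent` class and its field names are hypothetical. Adapt them to whatever your system actually records.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class PredictionEvent:
    """One logged prediction, with the context needed to audit it later."""

    model_version: str
    features: dict
    prediction: Any
    shadow_prediction: Any = None  # filled in only if a shadow test is running
    ground_truth: Any = None       # filled in later, if a label ever arrives
    prediction_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self):
        # Values must be JSON-serializable (numbers, strings, lists, dicts).
        return json.dumps(asdict(self), sort_keys=True)
```

The `prediction_id` is what lets you join a prediction with its ground truth label when the label shows up days or weeks later.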

Keeping a close eye on the volume is the most important thing to note. You want to make sure that the budget you set out at the beginning includes the amount of logged files that can grow up to gigabytes.

Some useful logging tools are Neptune, New Relic, Honeycomb.io, and Datadog.

Best practices for logging.

  • For your pipeline, log runs from scheduled time to start time, end time, job failure errors, the number of runs, and so on, to make your pipeline easier to troubleshoot.
  • For your models, log the predictions along with the ground truth, details on a prediction call, and the time the model was deployed to production.
  • For your application, log the number of requests served by the champion model in production.
  • For your data, log the version of every preprocessed dataset that was successful so that it can be audited and its lineage can be traced.
  • If you want your logs to be easily searched, consider using a JSON format with an actual structure.
  • Consider rotating log files for better management, as well as deletion of unneeded logs that you wo n’t need again for auditing or other reasons.
  • The best logs are formatted, have good time stamps and severity levels, and provide plenty of context to a user.
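Structured JSON logs with rotation can be set up with Python's standard `logging` module, as in this minimal sketch. `JsonFormatter` and `build_logger` are hypothetical names. The logger name, file size limit, and number of backups are placeholder values to adjust to your own storage budget.

```python
import json
import logging
from datetime import datetime, timezone
from logging.handlers import RotatingFileHandler


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "time": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Extra context passed as logger.info(..., extra={"context": {...}}).
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload)


def build_logger(path, max_bytes=10_000_000, backups=5):
    """Create a logger that writes JSON lines and rotates the file by size."""
    logger = logging.getLogger("ml_monitoring")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger
```

A call such as `logger.error("retraining job failed", extra={"context": {"job_id": "hypothetical-id"}})` would then produce a searchable record with a timestamp, a severity level, and the job context.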

It does not stop at monitoring.

There is a reason you deployed your model to production in the first place: to continue providing positive business value. That is why you continuously monitor it.

Negative business value is being fined a hefty amount because regulators demand to know how your system comes up with decisions and you can’t tell them.

Or it is your recommendation system serving the same recommendation to all users for 30 days.

For monitoring to be truly complete, you need to:

  • Make sure the model doesn’t slope towards negative business value.
  • Make sure the model can be audited and governed.
  • Make sure the model can be audited and governed.

You can learn more about model management in production.

Getting practical in a live environment.

We have covered a lot of practical concepts in this article. It is time to put them to use by checking out, working with, and/or configuring a monitoring solution of your own.

  1. Try a cloud-based solution for machine learning monitoring. You can use a live sandbox to check metrics from events like data drift, model performance, and so on.
  2. If you want to use an open-source solution, Prometheus is perhaps your best bet.
  3. DoorDash has a great article on how a company approaches Machine Learning monitoring and observability.


Monitoring is something I didn’t do, which is how I ended up with the DevOps guy trying to not get involved in all the insanity.

Meme modified by the author from this source.

I hope you were able to understand why deployment is never the final step, and you now have enough clarity on where to start in terms of monitoring your models.

I would like to give solutions to the challenges from an earlier section.

Production challenge: Data distribution changes
Key question: Why are there sudden changes in the values of my features?
Solution: Use statistical checks to detect data drift.

Production challenge: Model ownership in production
Key question: Who is in charge of the model in production?
Solution: Model ownership in production is a shared responsibility.

Production challenge: Training-serving skew
Key question: Why is the model giving poor results in production despite being tested?
Solution: Make sure your production and training data are processed the same way.

Production challenge: Model or concept drift
Key question: Why did my model’s performance decline over time?
Solution: If the model does not work on new data, try another model.

Production challenge: Black box models
Key question: How can I explain my model’s predictions to relevant stakeholders?
Solution: Look at segments of model predictions.

Production challenge: Concerted adversaries
Key question: How can I make sure my model is protected?
Solution: Use statistical checks to protect your system from security threats.

Production challenge: Model readiness
Key question: Is it possible to compare results from a newer version of my model against the in-production version?
Solution: Use shadow testing to test the challenger model vs the champion model.

Production challenge: Pipeline health issues
Key question: Why does a retraining job take so long to complete?
Solution: Use logs to audit errors and alert the service owner.

Production challenge: Underperforming system
Key question: Why am I getting different latencies for different models?
Solution: Use logs to audit services that are not meeting requirements.

Production challenge: Cases of extreme events
Key question: How will I be able to track the effect and performance of my model?
Solution: Before taking action, determine whether it is an instantaneous drift.

Production challenge: Data quality issues
Key question: How can I make sure the production data is the same as the training data?
Solution: Write data integrity tests.

Happy monitoring!

Source: https://nhadep247.net