Pipeline monitoring

For data scientists, the ability to comprehensively understand and monitor machine learning executions is paramount to delivering successful and impactful models.

This page is a resource that equips data scientists with the necessary knowledge and tools to track, monitor, and analyze the executions of their machine learning models on the Craft AI platform.

This page delves into essential topics, including obtaining execution details, tracking input and output, accessing metrics and logs, and comparing multiple executions. By mastering these techniques, data scientists can make informed decisions, optimize their models, and unlock the true potential of the MLOps platform.

How to find an execution and obtain its details?

When using the platform for experimentation or production, you can find the list of all your executions on the Execution > Execution Tracking page (remember to select a project first).

On this page you will find the list of all executions in all environments of the selected project. All executions are listed, whether they are in progress, failed or succeeded, and whether they were triggered by a run, an endpoint or a CRON.

Warning

Please note that deleting the pipeline or deployment will delete all attached executions.

Get general information on an execution

Once you are on the Execution Tracking page, select an environment using the selector at the top left. When you hover over an environment, a popup appears on the right with two lists side by side:

  • The first contains the list of pipelines that have runs attached to them.
  • The second contains the list of deployments that have executions attached to them.

Once a pipeline or deployment has been selected, the list of executions appears in the left-hand column, from the most recent to the oldest.

Tip

You can click on an environment directly to get all the associated executions.

understand_ML_exec_1

Info

You can also retrieve all this information using the sdk.get_pipeline_execution(execution_id) function via the SDK.
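
For instance, a minimal sketch of retrieving an execution's details with the SDK; the client constructor arguments and the execution identifier below are placeholders to adapt to your own environment:

```python
# Minimal sketch: fetch the details of one execution via the SDK.
# The constructor arguments below are placeholders; use your own credentials.
from craft_ai_sdk import CraftAiSdk

sdk = CraftAiSdk(
    sdk_token="YOUR_SDK_TOKEN",
    environment_url="YOUR_ENVIRONMENT_URL",
)

# "my-execution-id" is a hypothetical execution identifier taken from the
# Execution Tracking page.
execution = sdk.get_pipeline_execution("my-execution-id")
print(execution)
```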

Track the inputs and outputs of an execution

If you want to see the inputs and outputs of an execution, you can view them in the tab of the same name. The inputs/outputs of the pipeline are displayed in a table with their:

  • Name
  • Type
  • Source/destination type (where the value entered for this execution comes from)
  • Source/destination value (the value used for this execution)

understand_ML_exec_2

Info

For the SDK, this information can be obtained with the function mentioned above, sdk.get_pipeline_execution(execution_id).

More detail on individual inputs and outputs can be obtained with:

  • sdk.get_pipeline_execution_input(execution_id, input_name)
  • sdk.get_pipeline_execution_output(execution_id, output_name)
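
For example, a sketch of retrieving one input and one output by name; the names learning_rate and model are hypothetical and depend on your own pipeline definition:

```python
# Sketch: retrieve a specific input and output of an execution by name.
from craft_ai_sdk import CraftAiSdk

sdk = CraftAiSdk(sdk_token="YOUR_SDK_TOKEN", environment_url="YOUR_ENVIRONMENT_URL")

execution_id = "my-execution-id"  # hypothetical execution identifier

# "learning_rate" and "model" are hypothetical input/output names.
learning_rate = sdk.get_pipeline_execution_input(execution_id, "learning_rate")
trained_model = sdk.get_pipeline_execution_output(execution_id, "model")

print(learning_rate)
print(trained_model)
```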

Get the metrics and logs of an execution

In the metrics tab, you can retrieve the pipeline metrics if you have defined them in your code.

Note that the 'simple metrics' are shown in a table, while the 'list metrics' are shown as graphs so that you can see how they change during execution. For example, here we follow the evolution of loss and accuracy over the epochs of our model training.

understand_ML_exec_3

Info

It is also possible to retrieve this information from the SDK using the functions sdk.get_metrics(name, pipeline_name, deployment_name, execution_id) and sdk.get_list_metrics(name, pipeline_name, deployment_name, execution_id); more information is available here.
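
As a sketch, retrieving both kinds of metrics for one execution could look like this; the metric names accuracy and loss are hypothetical, and the keyword arguments assume the parameter names quoted above, with the other filters left out:

```python
# Sketch: retrieve simple metrics and list metrics for one execution.
from craft_ai_sdk import CraftAiSdk

sdk = CraftAiSdk(sdk_token="YOUR_SDK_TOKEN", environment_url="YOUR_ENVIRONMENT_URL")

execution_id = "my-execution-id"  # hypothetical execution identifier

# Simple metric: a single value per execution (e.g. the final accuracy).
final_accuracy = sdk.get_metrics(name="accuracy", execution_id=execution_id)

# List metric: a series of values recorded during the execution (e.g. loss per epoch).
loss_history = sdk.get_list_metrics(name="loss", execution_id=execution_id)

print(final_accuracy)
print(loss_history)
```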

Finally, the execution logs are available in the dedicated tab. Note that the logs, like the other information, are not refreshed automatically in the web interface, hence the arrow buttons for refreshing the page.

Info

Here again, an SDK function is available with sdk.get_pipeline_execution_logs(execution_id, from_datetime, to_datetime, limit).
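
For instance, a sketch of fetching the last hour of logs for an execution; the time window and the limit are illustrative values:

```python
# Sketch: retrieve the logs of an execution over a time window.
from datetime import datetime, timedelta

from craft_ai_sdk import CraftAiSdk

sdk = CraftAiSdk(sdk_token="YOUR_SDK_TOKEN", environment_url="YOUR_ENVIRONMENT_URL")

logs = sdk.get_pipeline_execution_logs(
    execution_id="my-execution-id",                     # hypothetical execution identifier
    from_datetime=datetime.now() - timedelta(hours=1),  # illustrative time window
    to_datetime=datetime.now(),
    limit=100,                                          # illustrative limit
)

for log_line in logs:
    print(log_line)
```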

How to compare multiple executions?

Compare executions with a comparison table

Let's go back to the source code of the pipeline from the beginning. This code is designed to train a deep learning model. The model has three hyperparameters (the learning rate, the number of epochs, and the batch size), which are exposed as inputs of the pipeline.

We can also see that the pipeline code records both simple metrics and list metrics to track performance during and at the end of training.

Here, we'll vary the hyperparameters over several training sessions to find the best values; the learning rate and the batch size will take the following values:

| Learning rate | Number of epochs | Batch size |
| ------------- | ---------------- | ---------- |
| 0.01          | 10               | 32         |
| 0.01          | 10               | 64         |
| 0.01          | 10               | 128        |
| 0.001         | 10               | 32         |
| 0.001         | 10               | 64         |
| 0.001         | 10               | 128        |
| 0.0001        | 10               | 32         |
| 0.0001        | 10               | 64         |
| 0.0001        | 10               | 128        |

Each line in this table represents an execution with its hyperparameters, and therefore a training run of the model.
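
One way to produce these nine executions is to launch one run per row from the SDK. The sketch below assumes a run function such as sdk.run_pipeline(pipeline_name, inputs=...) and input names (learning_rate, num_epochs, batch_size) matching the pipeline inputs; adapt both to your own pipeline definition.

```python
# Hedged sketch: launch one training run per hyperparameter combination.
# sdk.run_pipeline and the input names are assumptions; adjust them to your pipeline.
from itertools import product

from craft_ai_sdk import CraftAiSdk

sdk = CraftAiSdk(sdk_token="YOUR_SDK_TOKEN", environment_url="YOUR_ENVIRONMENT_URL")

learning_rates = [0.01, 0.001, 0.0001]
batch_sizes = [32, 64, 128]

for learning_rate, batch_size in product(learning_rates, batch_sizes):
    sdk.run_pipeline(
        pipeline_name="training-pipeline",  # hypothetical pipeline name
        inputs={
            "learning_rate": learning_rate,
            "num_epochs": 10,
            "batch_size": batch_size,
        },
    )
```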

Instead of looking at the executions one by one in Execution Tracking, we go to Executions > Execution Comparison, then select the environment to display the table with all the executions.

understand_ML_exec_4

There are 3 important elements on this page:

  1. The table, the central element of the page. In this table, each row represents an execution, except for the first row, which is the column header. Each column represents information about the executions.
  2. These selectors can be used to add more or less information to the table, allowing inputs, metrics, etc. to be displayed or not. The more information you select, the more columns the table will have.
  3. Another tab is available on this page for viewing the metrics lists, but we'll come back to this later.

To start with, I'm going to select the metadata, inputs, simple metrics, and list metrics. We don't need the outputs in this case. If I want, I can even select precisely the inputs and metrics I've used in my executions.

Then, in the header of the table, I'll filter the pipeline names so that I only have the executions from the pipeline I've used.

Note

All filter settings are available from the Filters button at the top right of the screen.

Finally, I'll sort according to accuracy by clicking on the little arrow in the column header.

That's it, I've sorted my executions to find the parameters that give me the best accuracy for my model. In my case, the best result is obtained with a learning rate of 0.001 and a batch size of 128.

understand_ML_exec_5

We could also have sorted according to list metrics; the only difference is that you have to select a calculation mode before sorting, since the table displays a single number representing the list (the average, the last value, the minimum, etc.). You do this by clicking on the tag at the top of the column (here set to last in the screenshot below):

understand_ML_exec_6

Compare the list metrics of several executions

If you want to see all the values of a list metric, you can also display the list metrics in their entirety to compare them between executions. To do this, select the executions you want to view using the eye icon on the left of the table, then go to the Visualize tab (top right).

Note

Only executions that have list metrics have a selectable eye icon.

On this screen, there is a graph for each available list metric, and each execution is represented by a color. You can therefore compare the evolution of your metrics between training sessions.

understand_ML_exec_7

Info

You can hide executions by clicking on their names in the legend.

How to follow a pipeline in production?

Pipeline monitoring

Let's take the example of a classic Machine Learning prediction model. Once your model has been trained and selected, we're going to expose it through an endpoint so that it can be used from any application. To do this, we need the source code of an inference pipeline; this is the second Python code given at the beginning. Note that, for simplicity, this code reuses the validation dataset to do the inference. We can therefore score each prediction and record the result in a score metric (a minimal sketch follows the list below):

  • 1: The prediction is correct
  • 0: The prediction is incorrect
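
A minimal sketch of how the inference code could record this score, assuming the SDK exposes a metric-recording helper named record_metric_value (an assumption here) and hypothetical prediction / true_label variables coming from your own inference code:

```python
# Hedged sketch: score one prediction and store the result as a simple metric.
# record_metric_value is assumed to be the SDK helper for recording a metric value;
# `prediction` and `true_label` are hypothetical variables from the inference code.
from craft_ai_sdk import CraftAiSdk

sdk = CraftAiSdk(sdk_token="YOUR_SDK_TOKEN", environment_url="YOUR_ENVIRONMENT_URL")

def score_prediction(prediction, true_label):
    score = 1 if prediction == true_label else 0  # 1: correct, 0: incorrect
    sdk.record_metric_value("score", score)
    return score
```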

We create the pipeline and the endpoint deployment with the associated input/output. Your model is now ready to make predictions. For each prediction, we can track executions using the tools we've already seen.

In addition, you can also monitor your executions more globally over time by going to Monitoring > Pipeline metrics. On this page, you can see the evolution of the simple metrics over time for any selected deployment.

In our case, we can track, execution after execution, the prediction score (true or false) of the model:

understand_ML_exec_8