Choose an execution rule
A deployment is a way to run a Machine Learning pipeline in a repeatable, automated way.
For each deployment, you can configure an execution rule:
- by endpoint (web API): the pipeline is executed by a call to a web API. This API can also receive data as pipeline input and deliver the result of the pipeline as output. Access to the API can be securely shared with external users.
- by periodic trigger (CRON): rules can be configured to trigger the pipeline periodically.
Summary
Function name | Method | Return type | Description |
---|---|---|---|
create_deployment | create_deployment(pipeline_name, deployment_name, execution_rule, mode=DEPLOYMENT_MODES.ELASTIC, schedule=None, inputs_mapping=None, outputs_mapping=None, description=None, enable_parallel_executions=None, max_parallel_executions_per_pod=None, ram_request=None, gpu_request=None, timeout_s=180) | Dict | Deploys a pipeline by creating a deployment, which allows a user to trigger the pipeline execution |
Deploy with execution rule: Endpoint
Function definition
To create an auto-mapping deployment where all inputs and outputs are mapped to API calls, use the create_deployment function. To create a deployment with manual mapping, use the create_deployment function with the additional parameters inputs_mapping and outputs_mapping to specify the precise mapping between each input and its source (and each output and its destination).
CraftAiSdk.create_deployment(
pipeline_name,
deployment_name,
execution_rule="endpoint",
mode=DEPLOYMENT_MODES.ELASTIC,
inputs_mapping=None,
outputs_mapping=None,
description=None,
enable_parallel_executions=None,
max_parallel_executions_per_pod=None,
ram_request=None,
gpu_request=None,
timeout_s=180
)
Parameters
- deployment_name (str) – Name of the deployment, chosen by the user to refer to the endpoint.
- pipeline_name (str) – Name of the pipeline that will be run by the deployment / endpoint.
- execution_rule (str) – Execution rule of the deployment. Must be "endpoint" or "periodic". For convenience, members of the enumeration DEPLOYMENT_EXECUTION_RULES can be used.
- mode (str) – Mode of the deployment. Can be "elastic" or "low_latency". Defaults to "elastic". For convenience, members of the enumeration DEPLOYMENT_MODES can be used. This defines how computing resources are allocated for pipeline executions:
  - elastic: Each pipeline execution runs in a new isolated container ("pod"), with its own memory (RAM, VRAM, disk). No variables or files are shared between executions, and the pod is destroyed when the execution ends. This mode is simple to use because it automatically allocates computing resources for running executions, and each execution starts from an identical blank state. However, it takes time to create a new pod at the beginning of each execution (tens of seconds), and computing resources can become saturated when there are many executions.
  - low_latency: All pipeline executions for the same deployment run in a shared container ("pod") with shared memory. The pod is created when the deployment is created and deleted when the deployment is deleted. Shared memory means that if one execution modifies a global variable or a file, subsequent executions on the same pod will see the modified value. This mode allows executions to respond quickly (less than 0.5 seconds of overhead) because the pod is already up and running when an execution starts, and it is possible to preload or cache data. However, it requires care in the code because of possible interactions between executions. Additionally, computing resources must be managed carefully, as pods use resources continuously even when there is no ongoing execution, and the number of pods does not automatically adapt to the number of executions. During the lifetime of a deployment, a pod may be re-created by the platform for technical reasons (including if it tries to use more memory than available). This mode is not compatible with pipelines created with a container_config.dockerfile_path property in create_pipeline().
- inputs_mapping (List of InputSource instances, optional) – List of input mappings, to map pipeline inputs to different sources (such as constant values, endpoint inputs, data store paths or environment variables). See InputSource for more details. For endpoint rules, if an input of the pipeline is not explicitly mapped, it is automatically mapped to an endpoint input with the same name.
- outputs_mapping (List of OutputDestination instances, optional) – List of output mappings, with one OutputDestination object holding the mapping information for each output. See OutputDestination for more details.
- description (str, optional) – Description of the deployment, for users only.
- enable_parallel_executions (bool, optional) – Whether to run several executions at the same time in the same pod, if mode is "low_latency". Not applicable if mode is "elastic", where each execution always runs in a new pod. This is disabled by default, which means that for a deployment in "low_latency" mode, only one execution runs at a time on a pod and other executions are pending while waiting for the running one to finish. Enabling this may be useful for inference batching on a model that takes a lot of memory, so the model is loaded in memory only once and can be used for several inferences at the same time. If this is enabled, then global variables, GPU memory and disk files are shared between multiple executions, so you must be mindful of potential race conditions and concurrency issues. For each execution running on a pod, the main Python function is run either as an asyncio coroutine with await if the function was defined with async def (recommended), or in a new thread if the function was defined simply with def. Environment variables are updated whenever a new execution starts on the pod. Using some libraries with async/threaded methods in your code may cause logs to be associated with the wrong running execution (logs are associated with executions through Python contextvars).
- max_parallel_executions_per_pod (int, optional) – Only applies if enable_parallel_executions is True. The maximum number of executions that can run at the same time on a deployment's pod in "low_latency" mode: if a greater number of executions are requested at the same time, only max_parallel_executions_per_pod executions will actually be running on the pod, and the other ones will be pending until a running execution finishes. The default is 6.
- ram_request (str, optional) – The amount of memory (RAM) requested for the deployment, in KiB, MiB or GiB. The value must be a string with a number followed by a unit, for example "512MiB" or "1GiB". This is only available for "low_latency" deployments.
- gpu_request (int, optional) – The number of GPUs requested for the deployment. This is only available for "low_latency" deployments.
- timeout_s (int) – Maximum time (in seconds) to wait for the deployment to be ready. 3 minutes (180 seconds) by default, and at least 2 minutes (120 seconds).
Returns
Information about the created deployment, returned as a Python dict containing:
- name - Name of the deployment.
- endpoint_token - Token of the endpoint used to trigger the deployment. Note that this token is only returned if execution_rule is “endpoint”.
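The examples below assume an SDK instance named sdk. A minimal sketch of how it might be created (the constructor arguments sdk_token and environment_url, and their values, are assumptions; check the SDK reference for your version):
from craft_ai_sdk import CraftAiSdk

sdk = CraftAiSdk(
    sdk_token="<your-sdk-token>",  # placeholder credential
    environment_url="https://my-environment.example.com",  # hypothetical environment URL
)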
Example
Example auto mapping
sdk.create_deployment(
deployment_name="my_deployment",
pipeline_name="my_pipeline",
execution_rule="endpoint",
outputs_mapping=[],
inputs_mapping=[],
)
> {
> 'name': 'my_deployment',
> 'endpoint_token': 'S_xZOKU ... KHs'
> }
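The endpoint_token returned above can then be used to trigger the deployment over HTTP from outside the SDK. A minimal sketch, assuming the endpoint is exposed at <environment_url>/endpoints/<deployment_name> and that the token is passed in an Authorization header (the exact URL pattern and header format are assumptions; check your environment's endpoint documentation):
import requests  # plain HTTP client, not part of the craft-ai SDK

environment_url = "https://my-environment.example.com"  # hypothetical environment URL
endpoint_url = f"{environment_url}/endpoints/my_deployment"  # assumed URL pattern

response = requests.post(
    endpoint_url,
    headers={"Authorization": "EndpointToken S_xZOKU...KHs"},  # assumed header format, with the returned token
    json={"my_input": 42},  # hypothetical endpoint input matching a pipeline input name
)
print(response.status_code, response.json())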
Example manual mapping
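The mapping objects passed to the call below (seagull_endpoint_input, big_whale_input, salt_constant_input, prediction_endpoint_output) are instances of InputSource and OutputDestination. A possible way to build them is sketched here; the pipeline input/output names and the constructor arguments (pipeline_input_name, endpoint_input_name, datastore_path, constant_value, pipeline_output_name, endpoint_output_name) are illustrative, so check the InputSource and OutputDestination reference for your SDK version:
from craft_ai_sdk.io import InputSource, OutputDestination  # import path may vary with the SDK version

# Map the pipeline input "seagull" to an endpoint input of the same name
seagull_endpoint_input = InputSource(
    pipeline_input_name="seagull",
    endpoint_input_name="seagull",
)

# Map the pipeline input "big_whale" to a file on the data store (hypothetical path)
big_whale_input = InputSource(
    pipeline_input_name="big_whale",
    datastore_path="whales/big_whale.json",
)

# Map the pipeline input "salt" to a constant value
salt_constant_input = InputSource(
    pipeline_input_name="salt",
    constant_value=0.5,
)

# Expose the pipeline output "prediction" as an endpoint output
prediction_endpoint_output = OutputDestination(
    pipeline_output_name="prediction",
    endpoint_output_name="prediction",
)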
sdk.create_deployment(
deployment_name="my_deployment",
pipeline_name="my_pipeline",
execution_rule="endpoint",
inputs_mapping=[
seagull_endpoint_input,
big_whale_input,
salt_constant_input,
],
outputs_mapping=[prediction_endpoint_output],
)
> {
> 'name': 'my_deployment',
> 'endpoint_token': 'S_xZOkCI ... FIg'
> }
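The same function can also create a "low_latency" endpoint deployment, using the resource and concurrency parameters documented above. A minimal sketch with illustrative values:
sdk.create_deployment(
    deployment_name="my_low_latency_deployment",
    pipeline_name="my_pipeline",
    execution_rule="endpoint",
    mode="low_latency",  # or the corresponding member of DEPLOYMENT_MODES, if available
    enable_parallel_executions=True,     # share the pod between concurrent executions
    max_parallel_executions_per_pod=4,   # cap concurrency on the pod
    ram_request="1GiB",                  # only available in "low_latency" mode
    gpu_request=1,                       # only available in "low_latency" mode
)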
Deploy with execution rule: Periodic
Function definition
To create a deployment triggered periodically, use the create_deployment function with execution_rule="periodic". Because auto mapping is not available for periodic deployments, use the inputs_mapping and outputs_mapping parameters to specify the precise mapping between each input and its source (and each output and its destination).
CraftAiSdk.create_deployment(
pipeline_name,
deployment_name,
execution_rule="periodic",
mode=DEPLOYMENT_MODES.ELASTIC,
schedule=None,
inputs_mapping=None,
outputs_mapping=None,
description=None
)
Warning
Input and output mappings must always be explicit: auto mapping isn't available for periodic deployments.
Parameters
- deployment_name (str) – Name of the deployment, chosen by the user.
- pipeline_name (str) – Name of the pipeline that will be run by the deployment.
- description (str, optional) – Description of the deployment, for users only.
- execution_rule (str) – Execution rule of the deployment. Must be "endpoint" or "periodic". For convenience, members of the enumeration DEPLOYMENT_EXECUTION_RULES can be used.
- mode (str) – Mode of the deployment. Can be "elastic" or "low_latency". Defaults to "elastic". For convenience, members of the enumeration DEPLOYMENT_MODES can be used. This defines how computing resources are allocated for pipeline executions:
  - elastic: Each pipeline execution runs in a new isolated container ("pod"), with its own memory (RAM, VRAM, disk). No variables or files are shared between executions, and the pod is destroyed when the execution ends. This mode is simple to use because it automatically allocates computing resources for running executions, and each execution starts from an identical blank state. However, it takes time to create a new pod at the beginning of each execution (tens of seconds), and computing resources can become saturated when there are many executions.
  - low_latency: All pipeline executions for the same deployment run in a shared container ("pod") with shared memory. The pod is created when the deployment is created and deleted when the deployment is deleted. Shared memory means that if one execution modifies a global variable or a file, subsequent executions on the same pod will see the modified value. This mode allows executions to respond quickly (less than 0.5 seconds of overhead) because the pod is already up and running when an execution starts, and it is possible to preload or cache data. However, it requires care in the code because of possible interactions between executions. Additionally, computing resources must be managed carefully, as pods use resources continuously even when there is no ongoing execution, and the number of pods does not automatically adapt to the number of executions. During the lifetime of a deployment, a pod may be re-created by the platform for technical reasons (including if it tries to use more memory than available). This mode is not compatible with pipelines created with a container_config.dockerfile_path property in create_pipeline().
- schedule (str, optional) – Schedule of the deployment. Only required if execution_rule is "periodic". Must be a valid cron expression: the deployment will be executed periodically according to this schedule. The schedule must follow this format: <minute> <hour> <day of month> <month> <day of week>. Note that the schedule is in the UTC time zone, and "*" means all possible values. Here are some examples:
  - "0 0 * * *" will execute the deployment every day at midnight.
  - "0 0 5 * *" will execute the deployment at midnight on the 5th day of each month.
- inputs_mapping (List of InputSource instances, optional) – List of input mappings, to map pipeline inputs to different sources (such as constant values, endpoint inputs, or environment variables). See InputSource for more details. For endpoint rules, if an input of the pipeline is not explicitly mapped, it is automatically mapped to an endpoint input with the same name. For periodic rules, all inputs of the pipeline must be explicitly mapped.
- outputs_mapping (List of OutputDestination instances, optional) – List of output mappings, to map pipeline outputs to different destinations. See OutputDestination for more details. For endpoint execution rules, if an output of the pipeline is not explicitly mapped, it is automatically mapped to an endpoint output with the same name. For other rules, all outputs of the pipeline must be explicitly mapped.
- enable_parallel_executions (bool, optional) – Whether to run several executions at the same time in the same pod, if mode is "low_latency". Not applicable if mode is "elastic", where each execution always runs in a new pod. This is disabled by default, which means that for a deployment in "low_latency" mode, only one execution runs at a time on a pod and other executions are pending while waiting for the running one to finish. Enabling this may be useful for inference batching on a model that takes a lot of memory, so the model is loaded in memory only once and can be used for several inferences at the same time. If this is enabled, then global variables, GPU memory and disk files are shared between multiple executions, so you must be mindful of potential race conditions and concurrency issues. For each execution running on a pod, the main Python function is run either as an asyncio coroutine with await if the function was defined with async def (recommended), or in a new thread if the function was defined simply with def. Environment variables are updated whenever a new execution starts on the pod. Using some libraries with async/threaded methods in your code may cause logs to be associated with the wrong running execution (logs are associated with executions through Python contextvars).
- max_parallel_executions_per_pod (int, optional) – Only applies if enable_parallel_executions is True. The maximum number of executions that can run at the same time on a deployment's pod in "low_latency" mode: if a greater number of executions are requested at the same time, only max_parallel_executions_per_pod executions will actually be running on the pod, and the other ones will be pending until a running execution finishes. The default is 6.
- ram_request (str, optional) – The amount of memory (RAM) requested for the deployment, in KiB, MiB or GiB. The value must be a string with a number followed by a unit, for example "512MiB" or "1GiB". This is only available for "low_latency" deployments.
- gpu_request (int, optional) – The number of GPUs requested for the deployment. This is only available for "low_latency" deployments.
- timeout_s (int) – Maximum time (in seconds) to wait for the deployment to be ready. 3 minutes (180 seconds) by default, and at least 2 minutes (120 seconds).
Returns
Information about the created deployment, returned as a Python dict containing:
- name - Name of the deployment.
- schedule - Schedule of the deployment. Note that this schedule is only returned if execution_rule is “periodic”.
- human_readable_schedule - Human readable schedule of the deployment. Note that this schedule is only returned if execution_rule is “periodic”.
Example
Set up a deployment to be triggered automatically every day at 14:00 (UTC).
sdk.create_deployment(
deployment_name="my_deployment",
pipeline_name="my_pipeline",
execution_rule="periodic",
schedule="0 14 * * *"
)
> {
> 'name': 'my_deployment',
> 'schedule': '0 14 * * *',
> 'human_readable_schedule': 'Every day at 14:00'
> }
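Since all inputs and outputs of a periodic deployment must be explicitly mapped, a complete call typically passes inputs_mapping and outputs_mapping as well. A sketch assuming a pipeline with one input "dataset_path" and one output "report" (names, paths and constructor arguments are illustrative; see InputSource and OutputDestination for the exact API):
from craft_ai_sdk.io import InputSource, OutputDestination  # import path may vary with the SDK version

sdk.create_deployment(
    deployment_name="my_periodic_deployment",
    pipeline_name="my_pipeline",
    execution_rule="periodic",
    schedule="0 2 * * *",  # every day at 02:00 UTC
    inputs_mapping=[
        # Hypothetical input mapped to a constant value
        InputSource(pipeline_input_name="dataset_path", constant_value="datasets/daily.csv"),
    ],
    outputs_mapping=[
        # Hypothetical output written to the data store
        OutputDestination(pipeline_output_name="report", datastore_path="reports/daily_report.json"),
    ],
)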
Tips for Managing Hardware Resources
When deploying pipelines, especially with low-latency or multiple elastic executions, it’s essential to monitor and manage hardware resource usage effectively.
If RAM/VRAM is Fully Used
This is indicated by errors such as "Out of Memory". You can address this by:
- Reducing the number of parallel low-latency deployments.
- Decreasing the number of simultaneous elastic executions.
- Optimizing your pipeline to use less memory.
If CPU is Fully Used
This can be identified by abnormally slow execution times or observed in resource usage metrics. If this becomes an issue:
- Reduce the number of ongoing executions.
- Optimize your code to be more CPU-efficient.
- Consider upgrading your hardware resources if needed.