What has changed
Amazon SageMaker AI Async Inference now accepts payloads directly in the request body of the InvokeEndpointAsync API. For payloads up to 128,000 bytes, this removes the separate S3 upload step, which cuts network round trips and simplifies client code. (Source: https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-ai-async-inference-now-supports-inline-request-payloads/)
Detailed mechanism
The InvokeEndpointAsync API has a new Body parameter. When it is specified, the payload travels in the API request body itself. Body is mutually exclusive with InputLocation: a request that specifies both is rejected with a ValidationError. Output is still written to the S3 OutputLocation as before. The raw payload is limited to 128,000 bytes. (Source: https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-ai-async-inference-now-supports-inline-request-payloads/)
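The two rules above (mutual exclusivity and the size limit) can also be checked on the client before a request is sent. The following is a minimal sketch, not part of the SageMaker SDK: the function name `check_async_request` and the constant `MAX_INLINE_BYTES` are hypothetical, and raising `ValueError` is an assumption about how a caller might want to surface the problem.

```python
from typing import Optional

# Raw payload limit stated in the text above.
MAX_INLINE_BYTES = 128_000


def check_async_request(body: Optional[bytes], input_location: Optional[str]) -> None:
    """Raise ValueError if the arguments break the rules described above.

    Hypothetical client-side helper; the service performs its own validation.
    """
    # Body and InputLocation cannot be combined in one request.
    if body is not None and input_location is not None:
        raise ValueError("Body and InputLocation are mutually exclusive")
    # An inline payload must fit within the documented limit.
    if body is not None and len(body) > MAX_INLINE_BYTES:
        raise ValueError(
            f"Inline payload is {len(body)} bytes; limit is {MAX_INLINE_BYTES}"
        )
```

A local check like this fails fast without a network call, but the service-side validation remains the authority.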
Migration procedure and code example
Migrating from the S3 upload flow to the Body parameter looks like this:
Traditional code
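import boto3

s3 = boto3.client("s3")
sagemaker_runtime = boto3.client("sagemaker-runtime")

# Upload the payload to S3 first, then pass its S3 URI to the endpoint.
# input_key and payload are supplied by the caller.
s3.put_object(Bucket="my-async-bucket", Key=input_key, Body=payload)
input_location = f"s3://my-async-bucket/{input_key}"
response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName="my-async-endpoint",
    InputLocation=input_location,
    ContentType="application/json",
)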
New code
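# Assumes sagemaker_runtime = boto3.client("sagemaker-runtime").
# The payload is sent inline, so no S3 upload is needed.
response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName="my-async-endpoint",
    Body=payload,
    ContentType="application/json",
)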
This change eliminates the S3 upload step, reducing network overhead. (Source: https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-ai-async-inference-now-supports-inline-request-payloads/)
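Because the inline path only covers payloads up to 128,000 bytes, a client that handles mixed sizes may want to keep the S3 path as a fallback. The sketch below shows one way to do that. The function name `invoke_async` and its parameters are hypothetical, and the two clients are passed in as arguments (for example `boto3.client("sagemaker-runtime")` and `boto3.client("s3")`); it assumes an SDK version that already accepts the `Body` parameter described above.

```python
from typing import Any

# Inline payload limit stated in the text above.
MAX_INLINE_BYTES = 128_000


def invoke_async(
    runtime: Any,
    s3: Any,
    endpoint: str,
    payload: bytes,
    bucket: str,
    key: str,
    content_type: str = "application/json",
) -> Any:
    """Send small payloads inline; upload larger ones to S3 first.

    Hypothetical helper: runtime and s3 are client objects supplied by the caller.
    """
    if len(payload) <= MAX_INLINE_BYTES:
        # Inline path: the payload goes in the request body.
        return runtime.invoke_endpoint_async(
            EndpointName=endpoint,
            Body=payload,
            ContentType=content_type,
        )
    # Fallback path: the original upload-then-reference flow.
    s3.put_object(Bucket=bucket, Key=key, Body=payload)
    return runtime.invoke_endpoint_async(
        EndpointName=endpoint,
        InputLocation=f"s3://{bucket}/{key}",
        ContentType=content_type,
    )
```

Passing the clients in keeps the size check testable without AWS credentials. Either branch sends only one of `Body` or `InputLocation`, so the mutual-exclusivity rule is respected.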
Performance characteristics
For payloads up to 128,000 bytes, skipping the S3 upload removes network round trips, which is expected to reduce processing latency. The feature is available in 31 AWS regions and works with existing async endpoints. (Source: https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-ai-async-inference-now-supports-inline-request-payloads/)
Summary
- Use the Body parameter with the InvokeEndpointAsync API to avoid S3-based payload uploads
- Reduce network round trips for payloads up to 128,000 bytes, which can lower processing latency
- Available in 31 AWS regions and works with existing async endpoints
- Simplifies client code and reduces operational burden
- Engineers can check the official documentation for detailed setup procedures