What has changed

Amazon SageMaker AI Async Inference now lets clients send the payload directly in the request body of the InvokeEndpointAsync API. For payloads up to 128,000 bytes, this removes the S3 upload that was previously required, which saves a network round trip and simplifies client code. (Source: https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-ai-async-inference-now-supports-inline-request-payloads/)

Detailed mechanism

The InvokeEndpointAsync API has a new Body parameter. When it is specified, the payload is carried in the API request body itself. Body is mutually exclusive with InputLocation, and a request that specifies both is rejected with a ValidationError. Output is still written to the S3 OutputLocation, as before. The raw payload may be at most 128,000 bytes. (Source: https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-ai-async-inference-now-supports-inline-request-payloads/)
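The rules above can also be checked on the client before a request is sent. The following is a minimal sketch, assuming the 128,000-byte limit applies to the raw payload bytes as stated. The helper name validate_async_request and the constant MAX_INLINE_BYTES are hypothetical and are not part of any AWS SDK. The requirement that at least one of the two parameters be present is also an assumption.

```python
MAX_INLINE_BYTES = 128_000  # limit stated in the announcement (raw payload)


def validate_async_request(body=None, input_location=None):
    """Hypothetical client-side check mirroring the rules described above."""
    if body is not None and input_location is not None:
        # The service is described as rejecting this combination.
        raise ValueError("Body and InputLocation are mutually exclusive")
    if body is None and input_location is None:
        # Assumption: a request needs one of the two input sources.
        raise ValueError("Specify either Body or InputLocation")
    if body is not None:
        # Measure the encoded size, not the character count.
        raw = body.encode("utf-8") if isinstance(body, str) else body
        if len(raw) > MAX_INLINE_BYTES:
            raise ValueError(
                f"Inline payload is {len(raw)} bytes; limit is {MAX_INLINE_BYTES}"
            )
    return True
```

A check like this only fails fast on the client. The service remains the authority on what it accepts.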

Migration procedure and code example

The change from the traditional S3 upload code to the Body parameter is as follows:

Traditional code

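import boto3

s3 = boto3.client("s3")
sagemaker_runtime = boto3.client("sagemaker-runtime")

# Upload the payload to S3 first, then pass its S3 URI to the endpoint.
s3.put_object(Bucket="my-async-bucket", Key=input_key, Body=payload)
input_location = f"s3://my-async-bucket/{input_key}"
response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName="my-async-endpoint",
    InputLocation=input_location,
    ContentType="application/json",
)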

New code

# Send the payload inline via Body; no S3 upload or InputLocation is needed.
response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName="my-async-endpoint",
    Body=payload,
    ContentType="application/json",
)

This change eliminates the S3 upload step, reducing network overhead. (Source: https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-ai-async-inference-now-supports-inline-request-payloads/)
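Clients whose payloads sometimes exceed the inline limit could choose between the two paths per request. The following is a sketch of that pattern, assuming the Body parameter and the 128,000-byte limit described above. The names build_async_kwargs and upload_to_s3 are hypothetical. The payload is expected to be bytes.

```python
MAX_INLINE_BYTES = 128_000  # inline limit stated in the announcement


def build_async_kwargs(endpoint_name, payload, upload_to_s3,
                       content_type="application/json"):
    """Return keyword arguments for invoke_endpoint_async.

    upload_to_s3 is a caller-supplied (hypothetical) function that stores
    the bytes in S3 and returns the resulting s3:// URI.
    """
    kwargs = {"EndpointName": endpoint_name, "ContentType": content_type}
    if len(payload) <= MAX_INLINE_BYTES:
        kwargs["Body"] = payload  # inline path: no S3 upload
    else:
        kwargs["InputLocation"] = upload_to_s3(payload)  # S3 path
    return kwargs
```

The result would then be passed as sagemaker_runtime.invoke_endpoint_async(**kwargs). Because exactly one of Body or InputLocation is set, the two parameters are never sent together.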

Performance characteristics

For payloads up to 128,000 bytes, removing the S3 upload round trip is expected to decrease processing latency. The feature is available in 31 AWS regions and works with existing async endpoints. (Source: https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-ai-async-inference-now-supports-inline-request-payloads/)

Summary

  • Use the Body parameter of the InvokeEndpointAsync API to avoid uploading the payload to S3
  • For payloads up to 128,000 bytes, this saves a network round trip and is expected to decrease processing latency
  • Available in 31 AWS regions and works with existing async endpoints
  • Simplifies client code and reduces operational burden
  • See the official documentation for detailed setup procedures