NVIDIA Nemotron 3 Ultra Support on Amazon SageMaker JumpStart

NVIDIA Nemotron 3 Ultra is now available for one-click deployment on Amazon SageMaker JumpStart. The model uses a hybrid Transformer-Mamba Mixture-of-Experts (MoE) architecture with 550B total parameters, of which only 55B are active per forward pass.

Designed specifically for agentic AI workloads, it achieves 5 times faster inference and up to 30% lower cost than traditional models, according to the AWS announcement. It supports a maximum context length of 1 million tokens, which suits the complex reasoning and orchestration performed by autonomous agents that run for extended periods.

(Reference: AWS Machine Learning Blog)

How the Agent-Oriented Architecture Works

Unlike a single-turn model call, an agent's work does not end with one response. An agent repeats planning, tool invocation, delegation of tasks to sub-agents, and result verification over hundreds of turns. Because tokens and computation accumulate at every step, the metrics that matter are task completion accuracy, time to completion, and cost per task.
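
To make the accumulation concrete, here is a minimal sketch of how the number of processed input tokens grows when every turn re-reads the full conversation history. The function name and the per-turn token figure are illustrative assumptions, not measurements from the source, and the model deliberately ignores prompt caching and context truncation.

```python
def cumulative_input_tokens(turns: int, tokens_added_per_turn: int) -> int:
    """Total input tokens processed over `turns` turns.

    Assumes (hypothetically) that each turn appends a fixed number of tokens
    and that the whole history is re-read on every turn, with no caching.
    """
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_added_per_turn  # the context grows each turn
        total += history                  # the full context is processed again
    return total


if __name__ == "__main__":
    for turns in (1, 10, 100, 300):
        print(turns, cumulative_input_tokens(turns, 2_000))
```

Under these assumptions, 100 turns that each add 2,000 tokens process about 10 million input tokens in total, which is why per-token throughput weighs so heavily on task time and cost.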

The MoE architecture of Nemotron 3 Ultra activates only 55B of the 550B parameters per forward pass. This helps keep throughput high even at a context length of 1 million tokens, so agents can sustain planning, tool invocation, and self-correction loops over hundreds of turns.

The model is also optimized for the NVFP4 precision format, which makes hosting faster and more cost-efficient.
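
The parameter counts above can be turned into rough numbers. The sketch below computes the active share of the model and a back-of-the-envelope weight footprint at two storage widths. It assumes roughly 4 bits per weight for a 4-bit format and ignores scaling factors, any layers kept at higher precision, activations, and the KV cache, so treat the results as order-of-magnitude estimates rather than hardware requirements. The helper names are hypothetical.

```python
TOTAL_PARAMS = 550e9   # total parameters (from the text)
ACTIVE_PARAMS = 55e9   # parameters active per forward pass (from the text)


def active_fraction(total: float, active: float) -> float:
    """Share of the parameters used in a single forward pass."""
    return active / total


def weight_footprint_gb(params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    print(f"active fraction: {active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS):.0%}")
    # Assumption: every weight is stored at the stated width; real deployments
    # carry extra overhead that this estimate leaves out.
    print(f"16-bit weights: {weight_footprint_gb(TOTAL_PARAMS, 16):.0f} GB")
    print(f"4-bit weights:  {weight_footprint_gb(TOTAL_PARAMS, 4):.0f} GB")
```

The point of the comparison is only that a 4-bit format needs about a quarter of the weight memory of a 16-bit one, which is one plausible reason a lower-precision format reduces hosting cost.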

(Reference: AWS Machine Learning Blog)

Implementation Steps on SageMaker JumpStart

To deploy from SageMaker Studio using the GUI, select Nemotron 3 Ultra in the Foundation Models section of JumpStart and create an endpoint with one click.

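To deploy with the SageMaker Python SDK instead, use the following code:

import sagemaker
from sagemaker.jumpstart.model import JumpStartModel

# Select the JumpStart model by its ID and the IAM role the endpoint will use.
model = JumpStartModel(
    model_id="huggingface-reasoning-nvidia-nemotron-3-ultra-550b-a55b-nvfp4",
    role=sagemaker.get_execution_role(),  # resolves the role when run inside SageMaker
)

# accept_eula=True confirms acceptance of the model's EULA; deploy() creates the endpoint.
predictor = model.deploy(accept_eula=True)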

To run inference, send a chat-style request in the following format:

# A list of chat messages plus sampling parameters.
payload = {
    "messages": [{
        "role": "user",
        "content": "Break this task into subtasks, identify which tools are needed, and run them in sequence."
    }],
    "max_tokens": 20480,  # upper bound on the number of generated tokens
    "temperature": 0.6,
    "top_p": 0.95,
}
response = predictor.predict(payload)
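
The shape of the response depends on the serving container, and the source does not show it. As a minimal sketch, assuming an OpenAI-style chat completion body with a `choices` list (an assumption, not something confirmed here), the generated text could be pulled out with a small helper. `extract_text` is a hypothetical name, and the sample dictionary stands in for a real response.

```python
from typing import Any


def extract_text(response: Any) -> str:
    """Return the assistant text from an assumed OpenAI-style chat response.

    Raises ValueError if the response does not have the assumed shape, so a
    different schema is noticed immediately rather than silently ignored.
    """
    try:
        return response["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError) as exc:
        raise ValueError(f"unexpected response shape: {response!r}") from exc


if __name__ == "__main__":
    # Stand-in for the value returned by predictor.predict(payload).
    sample = {"choices": [{"message": {"role": "assistant", "content": "1. Plan ..."}}]}
    print(extract_text(sample))
```

If the endpoint returns a different schema, the helper fails loudly, which is the cue to inspect the raw response and adjust the lookup path.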

After use, delete the endpoint with predictor.delete_endpoint() to avoid continuous billing.

(Reference: AWS Machine Learning Blog)

Enterprise Use Cases and Cost Considerations

Nemotron 3 Ultra is best suited to workloads that require sustained, multi-step reasoning. Enterprise use cases include automating complex business processes, supporting long-horizon decision-making, and running multi-step analysis tasks.

When deploying, note that large GPU instances such as ml.p5en.48xlarge are billed at a high hourly rate (typically tens of dollars per hour rather than a few) for as long as the endpoint is in service. Check the Amazon SageMaker AI pricing page for your Region beforehand, and delete the endpoint as soon as you are done.
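
Because a real-time endpoint is billed for the time it is running, not for the requests it serves, a quick estimate is simply rate times hours. The sketch below uses a placeholder hourly rate that is not a real price; look up the actual figure on the pricing page. The function name is hypothetical.

```python
def endpoint_cost(hourly_rate_usd: float, hours_running: float, instance_count: int = 1) -> float:
    """Cost of keeping an endpoint up for the given number of hours."""
    return hourly_rate_usd * hours_running * instance_count


if __name__ == "__main__":
    HYPOTHETICAL_RATE = 50.0  # placeholder, NOT a real price; check the pricing page
    print(endpoint_cost(HYPOTHETICAL_RATE, 2))        # a two-hour experiment
    print(endpoint_cost(HYPOTHETICAL_RATE, 24 * 30))  # left running for 30 days
```

The gap between the two results shows why a forgotten endpoint is the main cost risk with instances of this size.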

Deploying the model requires accepting its End User License Agreement (EULA), which you do by passing accept_eula=True when you call deploy().

(Reference: AWS Machine Learning Blog)

Summary

  • Deploy Nemotron 3 Ultra from SageMaker JumpStart, through either the Studio GUI or the Python SDK, to start building agentic AI applications
  • Use the MoE architecture, which activates only 55B of the 550B parameters per forward pass, to run long, multi-step reasoning tasks that are hard to serve efficiently with traditional dense models
  • Lower cost relative to traditional dense models while maintaining comparable quality, which makes the model a practical fit for automating complex enterprise business processes