AWS Strands Evals’ New Multimodal Evaluation Feature

The multimodal evaluation feature of the newly announced Strands Evals SDK by AWS automates the quality verification of AI applications that combine images and text. Unlike traditional text-only evaluations, which may miss visual hallucinations or factual inaccuracies, this feature uses a judgment model that directly references images to identify such issues.

This feature provides four evaluation metrics: Overall Quality, Correctness, Faithfulness, and Instruction Following. Each evaluator processes image and text responses simultaneously, returning a score of 1-5 on the Likert scale or a binary score, along with the inference process.

According to Gartner’s predictions, by 2030, 80% of enterprise software will be multimodal, representing a significant increase from less than 10% in 2024. To adapt to this change, there is a growing need for image-based automatic evaluation systems.

(Reference: Multimodal evaluators: MLLM-as-a-judge for image-to-text tasks in Strands Evals)

Limitations of Traditional Text-Only Evaluations

Text-only LLM-as-a-Judge evaluations can overlook critical failures rooted in images. In applications such as invoice reading, dashboard summarization, and screenshot explanation, text evaluators can assess the fluency and structure of outputs but cannot determine if they accurately match the image content.

Specific failure examples include verifying if a caption accurately describes an image, if the extracted total from an invoice matches the document, or if a screen summary hallucinates non-existent buttons. Text-only judgments can approve outputs without verifying the truth within images.

Furthermore, even if overall quality evaluations yield low scores, it’s unclear what the specific issues are. Different failure modes, such as factual inaccuracies, fabricated details, or instruction disregard, require different remedies.

(Reference: Multimodal evaluators: MLLM-as-a-judge for image-to-text tasks in Strands Evals)

How the Multimodal Judgment Framework Works

The new framework takes images, text queries, and model-generated responses as inputs. It constructs a multimodal evaluation prompt that integrates these elements and sends it to an MLLM (Multimodal Large Language Model)-based judgment model.

During judgment, the model verifies the plausibility of text responses while directly referencing images. It handles cases with and without reference answers, returning scores and inference strings for debugging. This framework can be integrated as a drop-in replacement into Strands Evals’ existing CaseExperimentReport workflow.

By incorporating it into continuous integration (CI) pipelines, visual hallucinations, factual inaccuracies, and instruction violations can be automatically detected, replacing manual reviews or unreliable text-only proxy evaluations with automated multimodal evaluation.

(Reference: Multimodal evaluators: MLLM-as-a-judge for image-to-text tasks in Strands Evals)

Prerequisites and Setup for Implementation

To use this feature, installation of the strands-agents-evals and strands-agents packages is necessary. In an AWS environment, authentication setup via the aws configure command and InvokeModel permission for model services like Amazon Bedrock are required.

Evaluation execution follows the existing Strands Evals workflow in three stages: Case, Experiment, and Report. At each stage, multimodal evaluators can be used as alternatives to traditional text evaluators.

From creating evaluation cases that include images to generating result reports, a unified SDK interface manages the process. This allows the quality control process for AI applications that handle visual content to be integrated into existing development flows.

(Reference: Multimodal evaluators: MLLM-as-a-judge for image-to-text tasks in Strands Evals)

Summary

  • The four multimodal evaluators (Overall Quality, Correctness, Faithfulness, Instruction Following) in the Strands Evals SDK enable automatic verification of the quality of AI applications that combine images and text.
  • By integrating into the existing CaseExperimentReport workflow as a drop-in replacement, visual hallucinations and factual inaccuracies that cannot be detected by text-only evaluations can be automatically detected in CI/CD pipelines.
  • Ahead of 2030, when 80% of enterprise software is expected to be multimodal, introducing image-based automatic evaluation systems can help build scalable quality management systems, reducing dependence on manual reviews.