Efficiently evaluating image-text alignment, especially for long and complex text inputs, while reflecting human preferences across multiple aspects is a significant challenge for developing reliable vision-language applications. It is especially crucial in real-world scenarios where multiple valid descriptions exist depending on context or user needs. However, progress is hindered by the lack of comprehensive benchmarks and by existing evaluation predictors that lack one or more key properties: (1) alignment with humans, (2) inference efficiency, (3) long-text understanding, and (4) applicability to multi-objective scoring. To address these challenges, we propose MULTI-TAP (Multi-Objective Task-Aware Predictor), a reward model capable of both multi-objective and single-objective scoring. MULTI-TAP uses the last hidden states of large vision-language models (LVLMs) to produce a single overall score. In a second stage, the frozen hidden states are combined with a lightweight, task-aware ridge regression layer to produce fine-grained scores across multiple human-interpretable objectives. Our framework shows strong performance across different LVLM architectures and a variety of tasks, achieving stronger alignment with human judgments than existing metrics (e.g., +8.1 Kendall's τc over CLIP-Score on FlickrExp) and performance on par with the GPT-4o-based predictor, G-VEval. To validate our method in practical settings, we release EYE4ALL, a 1k text-image-to-text dataset evaluated by 25 human annotators. Each sample consists of a text request, an image, and a text response, assessed with multi-objective scores based on response quality and alignment with the request/image. EYE4ALL supports two benchmarking purposes: (1) EYE4ALLMulti, seven-dimensional human-rating alignment, and (2) EYE4ALLPref, pairwise preference alignment (chosen vs. rejected). Our contributions can guide future research on developing human-aligned predictors.
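To make the two-stage scoring idea concrete, below is a minimal sketch of the second stage described above: a lightweight ridge regression head fit on frozen LVLM hidden states to produce per-objective scores. All names, shapes, and hyperparameters here are illustrative assumptions, not the released MULTI-TAP implementation.

```python
# Sketch of a task-aware ridge regression head over frozen LVLM hidden states.
# Assumes hidden states were extracted offline (one vector per image-text pair);
# shapes and the 7-dimensional objective space mirror the EYE4ALLMulti setup
# only loosely and are hypothetical.
import numpy as np
from sklearn.linear_model import Ridge

n_train, hidden_dim, n_objectives = 800, 4096, 7  # illustrative sizes

rng = np.random.default_rng(0)
train_hidden = rng.normal(size=(n_train, hidden_dim))            # frozen LVLM features
train_scores = rng.uniform(1, 5, size=(n_train, n_objectives))   # human ratings per objective

# Multi-output ridge regression: one lightweight head yields a score per objective.
head = Ridge(alpha=1.0)
head.fit(train_hidden, train_scores)

# At inference, a new pair's frozen hidden state maps directly to fine-grained scores.
new_hidden = rng.normal(size=(1, hidden_dim))
fine_grained_scores = head.predict(new_hidden)  # shape: (1, n_objectives)
print(fine_grained_scores)
```

Because the LVLM stays frozen and only a closed-form ridge solution is fit, this stage adds negligible inference cost on top of the single overall score from the first stage.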
@misc{kim2025multiobjectivetaskawarepredictorimagetext,
      title={Multi-Objective Task-Aware Predictor for Image-Text Alignment},
      author={Eunki Kim and Na Min An and James Thorne and Hyunjung Shim},
      year={2025},
      eprint={2510.00766},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.00766}
}