TRUST-VL: A Unified and Explainable Vision–Language Model for General Multimodal Misinformation Detection
Zehong Yan, Peng Qi, Wynne Hsu and Mong Li Lee
National University of Singapore
EMNLP, 2025 / PDF / Project Page / Code / Data
We present TRUST-VL, a unified and explainable vision–language model that detects multimodal misinformation across textual, visual, and cross-modal distortions. It is enhanced by a Question-Aware Visual Amplifier (QAVA) module and trained on TRUST-Instruct, a large-scale instruction dataset of 198K structured reasoning samples.
Abstract
Multimodal misinformation, encompassing textual, visual, and cross-modal distortions, poses an increasing societal threat that is amplified by generative AI. Existing methods typically focus on a single type of distortion and struggle to generalize to unseen scenarios. In this work, we observe that different distortion types share common reasoning capabilities while also requiring task-specific skills. We hypothesize that joint training across distortion types facilitates knowledge sharing and enhances the model’s ability to generalize. To this end, we introduce TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection. TRUST-VL incorporates a novel Question-Aware Visual Amplifier module, designed to extract task-specific visual features. To support training, we also construct TRUST-Instruct, a large-scale instruction dataset containing 198K samples featuring structured reasoning chains aligned with human fact-checking workflows. Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, while also offering strong generalization and interpretability.
Highlights
- We propose TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection. It integrates a novel Question-Aware Visual Amplifier (QAVA) module to extract task-specific visual features and support reasoning across misinformation detection tasks.
- We construct TRUST-Instruct, a large-scale instruction dataset of 198K samples with structured reasoning chains aligned with human fact-checking workflows, enabling effective joint training across diverse distortion types.
- Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, with superior generalization and interpretability compared to existing detectors and general VLMs.
Method Overview
Given an image–text claim, TRUST-VL retrieves external evidence and performs structured, step-wise reasoning. Text, evidence, and targeted questions are encoded by a textual encoder, while the image is processed by a visual encoder with a general projector and the QAVA module. The resulting tokens are fed into an LLM to yield a final judgment and explanation.
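The sketch below outlines this pipeline as PyTorch-style pseudocode, assuming injected encoder, projector, and LLM modules; the class name, argument names, and the HuggingFace-style `inputs_embeds` call are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class TrustVLPipeline(nn.Module):
    """Structural sketch of the TRUST-VL forward pass (names are illustrative)."""

    def __init__(self, text_encoder, visual_encoder, general_projector, qava, llm):
        super().__init__()
        self.text_encoder = text_encoder            # encodes claim, evidence, and questions
        self.visual_encoder = visual_encoder        # e.g., a ViT backbone producing patch features
        self.general_projector = general_projector  # maps visual features into the LLM token space
        self.qava = qava                            # Question-Aware Visual Amplifier (see next section)
        self.llm = llm                              # autoregressive LLM producing verdict + explanation

    def forward(self, image, claim_ids, evidence_ids, question_ids):
        # 1. Encode textual inputs: the claim, retrieved evidence, and targeted questions.
        text_embeds = self.text_encoder(torch.cat([claim_ids, evidence_ids, question_ids], dim=1))
        question_embeds = self.text_encoder(question_ids)

        # 2. Encode the image and project it with the general projector.
        image_feats = self.visual_encoder(image)
        general_visual_tokens = self.general_projector(image_feats)

        # 3. Amplify task-specific visual cues conditioned on the questions.
        amplified_tokens = self.qava(image_feats, question_embeds)

        # 4. Feed all tokens to the LLM for a final judgment and explanation
        #    (HuggingFace-style `inputs_embeds` call, assumed here).
        llm_inputs = torch.cat([general_visual_tokens, amplified_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=llm_inputs)
```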
Question-Aware Visual Amplifier (QAVA)
QAVA inserts a small set of learnable, question-conditioned tokens. Via self-attention (on the question) and cross-attention (to image features), these tokens extract precise, task-relevant cues (e.g., subtle facial edits) without degrading performance on other distortion types. The enhanced features act as soft visual prompts to guide the LLM’s reasoning.
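Below is a minimal, self-contained sketch of a QAVA-style block built from standard PyTorch attention layers; the hidden size, number of learnable tokens, and layer arrangement are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class QAVA(nn.Module):
    """Question-conditioned learnable tokens that attend to image features (illustrative sizes)."""

    def __init__(self, dim=1024, num_tokens=32, num_heads=8):
        super().__init__()
        # Learnable tokens that will be conditioned on the question.
        self.tokens = nn.Parameter(torch.randn(1, num_tokens, dim) * 0.02)
        # Self-attention over [tokens; question] injects question semantics into the tokens.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-attention lets the question-conditioned tokens pull task-relevant image cues.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, image_feats, question_feats):
        # image_feats:    (B, N_img, dim) patch features from the visual encoder
        # question_feats: (B, N_q, dim)   encoded task-specific questions
        B = image_feats.size(0)
        tok = self.tokens.expand(B, -1, -1)

        # Self-attention on the concatenated [tokens; question] sequence.
        seq = torch.cat([tok, question_feats], dim=1)
        seq = seq + self.self_attn(seq, seq, seq, need_weights=False)[0]
        tok = self.norm1(seq[:, : tok.size(1)])

        # Cross-attention from the question-conditioned tokens to the image features.
        tok = tok + self.cross_attn(tok, image_feats, image_feats, need_weights=False)[0]
        tok = self.norm2(tok)

        # The amplified tokens act as soft visual prompts for the LLM.
        return tok + self.ffn(tok)


# Example shapes: batch of 2, 256 image patches, 16 question tokens.
qava = QAVA()
out = qava(torch.randn(2, 256, 1024), torch.randn(2, 16, 1024))
print(out.shape)  # torch.Size([2, 32, 1024])
```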
TRUST-Instruct
We construct TRUST-Instruct by prompting a strong VLM with a structured reasoning template and then verifying outputs for label consistency. Each reasoning chain begins with shared steps (text analysis, image description) and branches into task-specific checks for textual, visual, and cross-modal distortions.
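The following sketch illustrates this generate-then-verify idea; the prompt template, the `generate_with_vlm` and `parse_verdict` helpers, and the field names are hypothetical stand-ins, not the released TRUST-Instruct pipeline.

```python
from dataclasses import dataclass

# Hypothetical structured reasoning template: shared steps first, then a task-specific check.
REASONING_TEMPLATE = """You are a fact-checker. Claim: {claim}
Step 1 - Text analysis: ...
Step 2 - Image description: ...
Step 3 - Task-specific check ({distortion_type}): ...
Final verdict: real or fake, with a short explanation."""


@dataclass
class InstructSample:
    claim: str
    image_path: str
    distortion_type: str   # "textual" | "visual" | "cross-modal"
    gold_label: str        # ground-truth label from the source dataset
    reasoning: str         # structured reasoning chain produced by the teacher VLM


def build_sample(example, generate_with_vlm, parse_verdict):
    """Generate a reasoning chain and keep it only if its verdict matches the gold label."""
    prompt = REASONING_TEMPLATE.format(claim=example["claim"],
                                       distortion_type=example["distortion_type"])
    reasoning = generate_with_vlm(example["image_path"], prompt)  # teacher VLM call (assumed interface)
    if parse_verdict(reasoning) != example["gold_label"]:
        return None  # discard label-inconsistent chains
    return InstructSample(example["claim"], example["image_path"],
                          example["distortion_type"], example["gold_label"], reasoning)
```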
Performance Study
Main Results
TRUST-VL achieves state-of-the-art average accuracy across diverse datasets, with particularly large gains on fine-grained visual manipulation detection and robust generalization to out-of-domain benchmarks.
Case Studies
Representative examples across textual, visual, and cross-modal distortions show how TRUST-VL pinpoints the deceptive elements and justifies its verdict with interpretable, step-wise explanations.
BibTeX
If you find this work helpful, please cite:
@inproceedings{yan2025trustvl,
  title={TRUST-VL: A Unified and Explainable Vision–Language Model for General Multimodal Misinformation Detection},
  author={Yan, Zehong and Qi, Peng and Hsu, Wynne and Lee, Mong Li},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2025}
}