TRUST-VL: A Unified and Explainable Vision–Language Model for General Multimodal Misinformation Detection

Zehong Yan, Peng Qi, Wynne Hsu and Mong Li Lee
National University of Singapore


EMNLP, 2025 / PDF / Project Page / Code / Data

We present TRUST-VL, a unified and explainable vision–language model that handles multimodal misinformation detection across textual, visual, and cross-modal distortions within a single framework. It is enhanced by a Question-Aware Visual Amplifier (QAVA) module and trained on TRUST-Instruct, a large-scale dataset of 198K structured reasoning samples.


Abstract

Multimodal misinformation, encompassing textual, visual, and cross-modal distortions, poses an increasing societal threat that is amplified by generative AI. Existing methods typically focus on a single type of distortion and struggle to generalize to unseen scenarios. In this work, we observe that different distortion types share common reasoning capabilities while also requiring task-specific skills. We hypothesize that joint training across distortion types facilitates knowledge sharing and enhances the model’s ability to generalize. To this end, we introduce TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection. TRUST-VL incorporates a novel Question-Aware Visual Amplifier module, designed to extract task-specific visual features. To support training, we also construct TRUST-Instruct, a large-scale instruction dataset containing 198K samples featuring structured reasoning chains aligned with human fact-checking workflows. Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, while also offering strong generalization and interpretability.

Highlights

  • We propose TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection. It integrates a novel Question-Aware Visual Amplifier (QAVA) module to extract task-specific visual features and support reasoning across misinformation detection tasks.
  • We construct TRUST-Instruct, a large-scale instruction dataset of 198K samples with structured reasoning chains aligned with human fact-checking workflows, enabling effective joint training across diverse distortion types.
  • Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, with superior generalization and interpretability compared to existing detectors and general VLMs.

Method Overview

Given an image–text claim, TRUST-VL retrieves external evidence and performs structured, step-wise reasoning. Text, evidence, and targeted questions are encoded by a textual encoder, while the image is processed by a visual encoder with a general projector and the QAVA module. The resulting tokens are fed into an LLM to yield a final judgment and explanation.
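
The schematic below illustrates this flow in Python. The component names (retrieve_evidence, text_encoder, vision_encoder, general_projector, qava, llm) are placeholders for illustration, not the released API:

def trust_vl_forward(claim_text, image, retrieve_evidence, text_encoder,
                     vision_encoder, general_projector, qava, llm,
                     task_questions):
    # 1. Retrieve external evidence for the image-text claim.
    evidence = retrieve_evidence(claim_text, image)

    # 2. Encode the textual side: claim, evidence, and targeted questions.
    text_tokens = text_encoder(claim_text, evidence, task_questions)
    question_emb = text_encoder(task_questions)

    # 3. Encode the image and build two streams of visual tokens:
    #    general tokens via the shared projector, task-specific tokens via QAVA.
    patch_feats = vision_encoder(image)
    general_tokens = general_projector(patch_feats)
    amplified_tokens = qava(question_emb, patch_feats)

    # 4. The LLM consumes all tokens and produces a step-wise explanation
    #    ending in a veracity judgment.
    return llm(text_tokens, general_tokens, amplified_tokens)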

Question-Aware Visual Amplifier (QAVA)

QAVA inserts a small set of learnable, question-conditioned tokens. Via self-attention (on the question) and cross-attention (to image features), these tokens extract precise, task-relevant cues (e.g., subtle facial edits) without degrading performance on other distortion types. The enhanced features act as soft visual prompts to guide the LLM’s reasoning.
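
A minimal PyTorch sketch of a QAVA-style module is given below. The dimensions, number of learnable tokens, and exact attention layout are assumptions for illustration; the paper's implementation may differ:

import torch
import torch.nn as nn


class QAVASketch(nn.Module):
    def __init__(self, dim=1024, num_tokens=32, num_heads=8):
        super().__init__()
        # Learnable query tokens that will be conditioned on the question.
        self.query_tokens = nn.Parameter(torch.randn(1, num_tokens, dim) * 0.02)
        # Self-attention lets the query tokens absorb the question semantics.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-attention pulls question-relevant cues out of the image features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)  # maps amplified tokens to the LLM input space

    def forward(self, question_emb, image_feats):
        # question_emb: (B, Lq, dim) text-encoder features of the targeted question
        # image_feats:  (B, Lv, dim) visual-encoder patch features
        B = question_emb.size(0)
        q = self.query_tokens.expand(B, -1, -1)
        # Condition the learnable tokens on the question via self-attention
        # over the concatenation of tokens and question embeddings.
        x = torch.cat([q, question_emb], dim=1)
        x, _ = self.self_attn(x, x, x)
        q = self.norm1(q + x[:, : q.size(1)])
        # Extract task-relevant visual cues via cross-attention to image features.
        attended, _ = self.cross_attn(q, image_feats, image_feats)
        q = self.norm2(q + attended)
        # Return soft visual prompts to be prepended to the LLM input.
        return self.proj(q)


if __name__ == "__main__":
    qava = QAVASketch()
    question = torch.randn(2, 16, 1024)   # dummy question embeddings
    image = torch.randn(2, 256, 1024)     # dummy ViT patch features
    print(qava(question, image).shape)    # torch.Size([2, 32, 1024])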

TRUST-Instruct

We construct TRUST-Instruct by prompting a strong VLM with a structured reasoning template and then verifying outputs for label consistency. Each reasoning chain begins with shared steps (text analysis, image description) and branches into task-specific checks for textual, visual, and cross-modal distortions.
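
A minimal sketch of this construction loop is shown below. The template wording, the generate_reasoning callable, and the naive verdict parser are illustrative placeholders, not the released pipeline:

REASONING_TEMPLATE = """Analyze the claim step by step:
1. Text analysis: summarize what the text asserts.
2. Image description: describe what the image shows.
3. Task-specific check ({task}): examine textual / visual / cross-modal consistency.
4. Verdict: real or fake, with a brief justification."""


def build_instruct_sample(claim, image, gold_label, task, generate_reasoning):
    """Return one instruction-tuning sample, or None if label-inconsistent."""
    prompt = REASONING_TEMPLATE.format(task=task)
    chain = generate_reasoning(prompt, claim, image)  # call to a strong VLM

    # Verify label consistency: discard chains whose final verdict disagrees
    # with the ground-truth label (naive string check for illustration).
    predicted = "fake" if "fake" in chain.lower().split("verdict")[-1] else "real"
    if predicted != gold_label:
        return None

    return {"instruction": prompt, "input": claim, "output": chain,
            "label": gold_label, "task": task}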

Performance Study

Main Results

TRUST-VL achieves state-of-the-art average accuracy across diverse datasets, with particularly large gains on fine-grained visual manipulation detection, and generalizes robustly to out-of-domain benchmarks.

Case Studies

Representative examples across textual, visual, and cross-modal distortions show how TRUST-VL pinpoints the deceptive elements and justifies its verdict with interpretable, step-wise explanations.

BibTeX

If you find this work helpful, please cite:

@inproceedings{yan2025trustvl,
  title={TRUST-VL: A Unified and Explainable Vision–Language Model for General Multimodal Misinformation Detection},
  author={Yan, Zehong and Qi, Peng and Hsu, Wynne and Lee, Mong Li},
  booktitle={To appear},
  year={2025}
}


Written on August 29, 2025