TRUST-VL: A Unified and Explainable Vision–Language Model for General Multimodal Misinformation Detection

Zehong Yan, Peng Qi, Wynne Hsu and Mong Li Lee
National University of Singapore


EMNLP, 2025 / PDF / Project Page / Code / Data

We present TRUST-VL, a unified and explainable vision–language model that handles multimodal misinformation detection across textual, visual, and cross-modal distortions within a single framework. It is enhanced by a Question-Aware Visual Amplifier (QAVA) module and trained on TRUST-Instruct, a large-scale dataset of 198K structured reasoning samples.


Abstract

Multimodal misinformation, encompassing textual, visual, and cross-modal distortions, poses an increasing societal threat that is amplified by generative AI. Existing methods typically focus on a single type of distortion and struggle to generalize to unseen scenarios. In this work, we observe that different distortion types share common reasoning capabilities while also requiring task-specific skills. We hypothesize that joint training across distortion types facilitates knowledge sharing and enhances the model’s ability to generalize. To this end, we introduce TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection. TRUST-VL incorporates a novel Question-Aware Visual Amplifier module, designed to extract task-specific visual features. To support training, we also construct TRUST-Instruct, a large-scale instruction dataset containing 198K samples featuring structured reasoning chains aligned with human fact-checking workflows. Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, while also offering strong generalization and interpretability.

Highlights

  • We propose TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection. It integrates a novel Question-Aware Visual Amplifier (QAVA) module to extract task-specific visual features and support reasoning across misinformation detection tasks.
  • We construct TRUST-Instruct, a large-scale instruction dataset of 198K samples with structured reasoning chains aligned with human fact-checking workflows, enabling effective joint training across diverse distortion types.
  • Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, with superior generalization and interpretability compared to existing detectors and general VLMs.

Method Overview

Given an image–text claim, TRUST-VL retrieves external evidence and performs structured, step-wise reasoning. Text, evidence, and targeted questions are encoded by a textual encoder, while the image is processed by a visual encoder with a general projector and the QAVA module. The resulting tokens are fed into an LLM to yield a final judgment and explanation.
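
The schematic below illustrates this flow in Python. The component names (retrieve_evidence, text_encoder, vision_encoder, general_projector, qava, llm) are placeholders for illustration, not the released API:

def trust_vl_forward(claim_text, image, retrieve_evidence, text_encoder,
                     vision_encoder, general_projector, qava, llm,
                     task_questions):
    # 1. Retrieve external evidence for the image-text claim.
    evidence = retrieve_evidence(claim_text, image)

    # 2. Encode the textual side: claim, evidence, and targeted questions.
    text_tokens = text_encoder(claim_text, evidence, task_questions)
    question_emb = text_encoder(task_questions)

    # 3. Encode the image and build two streams of visual tokens:
    #    general tokens via the shared projector, task-specific tokens via QAVA.
    patch_feats = vision_encoder(image)
    general_tokens = general_projector(patch_feats)
    amplified_tokens = qava(question_emb, patch_feats)

    # 4. The LLM consumes all tokens and produces a step-wise explanation
    #    ending in a veracity judgment.
    return llm(text_tokens, general_tokens, amplified_tokens)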

Question-Aware Visual Amplifier (QAVA)

QAVA inserts a small set of learnable, question-conditioned tokens. Via self-attention (on the question) and cross-attention (to image features), these tokens extract precise, task-relevant cues (e.g., subtle facial edits) without degrading performance on other distortion types. The enhanced features act as soft visual prompts to guide the LLM’s reasoning.
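
A minimal PyTorch sketch of a QAVA-style module is given below. The dimensions, number of learnable tokens, and exact attention layout are assumptions for illustration; the paper's implementation may differ:

import torch
import torch.nn as nn


class QAVASketch(nn.Module):
    def __init__(self, dim=1024, num_tokens=32, num_heads=8):
        super().__init__()
        # Learnable query tokens that will be conditioned on the question.
        self.query_tokens = nn.Parameter(torch.randn(1, num_tokens, dim) * 0.02)
        # Self-attention lets the query tokens absorb the question semantics.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-attention pulls question-relevant cues out of the image features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)  # maps amplified tokens to the LLM input space

    def forward(self, question_emb, image_feats):
        # question_emb: (B, Lq, dim) text-encoder features of the targeted question
        # image_feats:  (B, Lv, dim) visual-encoder patch features
        B = question_emb.size(0)
        q = self.query_tokens.expand(B, -1, -1)
        # Condition the learnable tokens on the question via self-attention
        # over the concatenation of tokens and question embeddings.
        x = torch.cat([q, question_emb], dim=1)
        x, _ = self.self_attn(x, x, x)
        q = self.norm1(q + x[:, : q.size(1)])
        # Extract task-relevant visual cues via cross-attention to image features.
        attended, _ = self.cross_attn(q, image_feats, image_feats)
        q = self.norm2(q + attended)
        # Return soft visual prompts to be prepended to the LLM input.
        return self.proj(q)


if __name__ == "__main__":
    qava = QAVASketch()
    question = torch.randn(2, 16, 1024)   # dummy question embeddings
    image = torch.randn(2, 256, 1024)     # dummy ViT patch features
    print(qava(question, image).shape)    # torch.Size([2, 32, 1024])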

TRUST-Instruct

We construct TRUST-Instruct by prompting a strong VLM with a structured reasoning template and then verifying outputs for label consistency. Each reasoning chain begins with shared steps (text analysis, image description) and branches into task-specific checks for textual, visual, and cross-modal distortions.
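
A minimal sketch of this construction loop is shown below. The template wording, the generate_reasoning callable, and the naive verdict parser are illustrative placeholders, not the released pipeline:

REASONING_TEMPLATE = """Analyze the claim step by step:
1. Text analysis: summarize what the text asserts.
2. Image description: describe what the image shows.
3. Task-specific check ({task}): examine textual / visual / cross-modal consistency.
4. Verdict: real or fake, with a brief justification."""


def build_instruct_sample(claim, image, gold_label, task, generate_reasoning):
    """Return one instruction-tuning sample, or None if label-inconsistent."""
    prompt = REASONING_TEMPLATE.format(task=task)
    chain = generate_reasoning(prompt, claim, image)  # call to a strong VLM

    # Verify label consistency: discard chains whose final verdict disagrees
    # with the ground-truth label (naive string check for illustration).
    predicted = "fake" if "fake" in chain.lower().split("verdict")[-1] else "real"
    if predicted != gold_label:
        return None

    return {"instruction": prompt, "input": claim, "output": chain,
            "label": gold_label, "task": task}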

Performance Study

Main Results

TRUST-VL achieves state-of-the-art average accuracy across diverse datasets, with particularly large gains on fine-grained visual manipulation detection, and generalizes robustly to out-of-domain benchmarks.

Case Studies

Representative examples across textual, visual, and cross-modal distortions show how TRUST-VL pinpoints the deceptive elements and justifies its verdict with interpretable, step-wise explanations.

BibTeX

If you find this work helpful, please cite:

@inproceedings{yan2025trustvl,
  title={TRUST-VL: A Unified and Explainable Vision–Language Model for General Multimodal Misinformation Detection},
  author={Yan, Zehong and Qi, Peng and Hsu, Wynne and Lee, Mong Li},
  booktitle={To appear},
  year={2025}
}


Written on August 29, 2025