Benefiting from the powerful capabilities of Large Language Models (LLMs), pre-trained visual encoder models connected to LLMs form Vision Language Models (VLMs). However, recent research shows that the visual modality in VLMs is highly vulnerable, allowing attackers to bypass safety alignment in LLMs through visually transmitted content, launching harmful attacks. To address this challenge, we propose a progressive concept-based alignment strategy, PSA-VLM, which incorporates safety modules as concept bottlenecks to enhance visual modality safety alignment. By aligning model predictions with specific safety concepts, we improve defenses against risky images, enhancing explainability and controllability while minimally impacting general performance. Our method is obtained through two-stage training. The low computational cost of the first stage brings very effective performance improvement, and the fine-tuning of the language model in the second stage further improves the safety performance. Our method achieves state-of-the-art results on popular VLM safety benchmark.
In vision-language models (VLMs), safety alignment refers to ensuring that the model generates controlled and appropriate responses to multimodal inputs, particularly those that may include sensitive visual content. VLMs face specific vulnerabilities in their visual modality, allowing harmful or inappropriate content to bypass traditional language-based safety mechanisms. To address this issue, we propose PSA-VLM, a Progressive Safety Alignment method based on the Concept Bottleneck Model (CBM) framework. This approach enhances VLM safety by isolating safety-critical features through a controllable concept bottleneck, implemented within a layered, concept-driven architecture.
The core design of PSA-VLM revolves around the Concept Bottleneck (CB), which introduces a high-level safety concept layer between visual inputs and model outputs to achieve transparency and controllability.
During inference, PSA-VLM leverages the outputs of the safety head to dynamically intervene in visual content processing. By adjusting the generated text using joint conditional probabilities, it ensures safe responses to high-risk content. For example, when detecting explicit content, PSA-VLM offers actionable options, allowing downstream users to make decisions based on their needs, thereby enabling more flexible safety management.
Through these modules, PSA-VLM not only improves the model's ability to address unsafe content but also significantly enhances its interpretability and controllability, setting a new paradigm for safety alignment in multimodal systems. By designing efficient safety modules and alignment training strategies, PSA-VLM ensures that its enhanced safety capabilities do not compromise performance on standard tasks, achieving a balance between safety and general utility.
We evaluate the performance of vision-language models (VLMs) from two perspectives: safety performance and general domain performance. To ensure a fair comparison, we first assess our model using the RTVLM benchmark and GPT-4 scoring. Since the RTVLM dataset is limited and does not include sensitive data, we expanded the evaluation scope to incorporate additional risk datasets focusing on harmful political content, explicit material, and cyberbullying. We further combine subjective evaluations from GPT-4 and human experts to provide comprehensive and reliable assessments. For general domain performance, we evaluate the model using multiple benchmarks, including MMBench, SEEDBench, and MME.
We first used the RTVLM benchmark to analyze VLMs across various dimensions through GPT-4 scoring, covering four distinct categories to understand the models' safety capabilities. As shown in Table 1, we evaluated several open-source VLMs, GPT-4V, and our PSA-VLM. The results demonstrate that GPT-4V performed well across all categories, especially in safety-related scenarios like CAPTCHA and jailbreak detection. InternLM-XComposer2 also exhibited strong performance across multiple metrics. Aligned with these findings, the PSA-VLM based on LLaVA achieved remarkable results, particularly when leveraging LoRA to unfreeze the LLM. It attained the highest scores in political (8.36) and racial (8.43) categories. On average, PSA-VLM-7B (+LoRA) stood out with a leading score of 8.26, followed by PSA-VLM without unfreezing the LLM, scoring 8.18. Notably, the 13B model with LoRA achieved the highest average score of 8.46. These improvements highlight the effectiveness of the additional safety modules and the two-stage safety alignment strategy in enhancing the safety of PSA-VLM.
The RTVLM dataset does not include other high-risk sensitive categories such as explicit content or cyberbullying. Therefore, we conducted experiments on additional risk datasets to evaluate PSA-VLM's safety performance. As shown in Table 2, PSA-VLM-13B achieved the best performance in detecting harmful political content (9.49), explicit material (8.72), and cyberbullying (7.45), significantly outperforming the baseline model LLaVA-v1.5-13B, which scored 6.67, 1.11, and 6.16, respectively. Although PSA-VLM-7B with LoRA showed slight declines in certain tasks (e.g., 8.91 and 6.82), it still significantly outperformed LLaVA-v1.5-7B.
In PSA-VLM, enhancing safety performance does not come at the expense of general domain capabilities. Despite adopting enhanced safety measures, PSA-VLM-7B remained competitive on general benchmarks such as MMBench, SEEDBench, and MME. As shown in Table 3, PSA-VLM-7B demonstrated improvements on general benchmarks, achieving scores of 68.5 and 65.3 on MMBench and SEEDBench, respectively, indicating better general performance. Moreover, in multimodal benchmark evaluations, PSA-VLM-7B effectively identified and rejected responses to potentially risky images, showcasing its high sensitivity to unsafe content and underscoring the effectiveness of our safety alignment approach.
Images deemed unsafe were filtered, allowing for the evaluation of general performance using completely clean data. This capability to respond to unsafe content reflects PSA-VLM-7B's robust safety performance while maintaining its overall functional capacity.
@misc{liu2024enhancingvisionlanguagemodelsafety,
title={PSA-VLM: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment},
author={Zhendong Liu and Yuanbi Nie and Yingshui Tan and Xiangyu Yue and Qiushi Cui and Chongjun Wang and Xiaoyong Zhu and Bo Zheng},
year={2024},
eprint={2411.11543},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.11543},
}