PSA-VLM: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment

Zhendong Liu*, Yuanbi Nie*, Yingshui Tan*†, Jiaheng Liu, Xiangyu Yue, Qiushi Cui
Chongjun Wang, Xiaoyong Zhu, Bo Zheng
Taobao & Tmall Group of Alibaba, Nanjing University, Chongqing University, Chinese University of Hong Kong
*Indicates Equal Contribution   †Corresponding Author

Abstract

Benefiting from the powerful capabilities of Large Language Models (LLMs), pre-trained visual encoders connected to LLMs form Vision-Language Models (VLMs). However, recent research shows that the visual modality in VLMs is highly vulnerable: attackers can bypass the safety alignment of the LLM through visually transmitted content and launch harmful attacks. To address this challenge, we propose PSA-VLM, a progressive concept-based alignment strategy that incorporates safety modules as concept bottlenecks to enhance safety alignment of the visual modality. By aligning model predictions with specific safety concepts, we improve defenses against risky images and enhance explainability and controllability while minimally impacting general performance. The method is trained in two stages: the first stage yields substantial safety gains at low computational cost, and fine-tuning the language model in the second stage further improves safety performance. Our method achieves state-of-the-art results on popular VLM safety benchmarks.

Key Features of PSA-VLM: Safety Based on the Concept Bottleneck Framework

In vision-language models (VLMs), safety alignment refers to ensuring that the model generates controlled and appropriate responses to multimodal inputs, particularly those that may include sensitive visual content. VLMs face specific vulnerabilities in their visual modality, allowing harmful or inappropriate content to bypass traditional language-based safety mechanisms. To address this issue, we propose PSA-VLM, a Progressive Safety Alignment method based on the Concept Bottleneck Model (CBM) framework. This approach enhances VLM safety by isolating safety-critical features through a controllable concept bottleneck, implemented within a layered, concept-driven architecture.

(1) Concept Bottleneck Model-Driven Core Architecture

The core design of PSA-VLM revolves around the Concept Bottleneck (CB), which introduces a high-level safety concept layer between visual inputs and model outputs to achieve transparency and controllability.

  • Explicit Concept Safety Head: By leveraging cross-attention between image and text information, this component maps visual features to specific safety categories (e.g., explicit content, political sensitivity) and risk levels (high, medium, low), enabling fine-grained safety predictions.
  • Implicit Concept Safety Tokens: Additional trainable tokens embedded directly within the visual inputs to increase the model's sensitivity to implicit risk signals. They align the model's attention with concept-specific indicators of unsafe visual inputs; while not semantically interpretable, they serve as implicit concept markers. A sketch of both components follows this list.
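
The following is a minimal PyTorch sketch of these two concept components. All class names, dimensions, and category counts (e.g., ExplicitConceptSafetyHead, num_categories) are illustrative assumptions rather than the released implementation.

    import torch
    import torch.nn as nn

    class ExplicitConceptSafetyHead(nn.Module):
        """Cross-attends text features over visual features, then predicts a
        safety category (e.g., explicit, political) and a risk level."""
        def __init__(self, dim=1024, num_categories=4, num_risk_levels=3):
            super().__init__()
            self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            self.category_head = nn.Linear(dim, num_categories)
            self.risk_head = nn.Linear(dim, num_risk_levels)

        def forward(self, text_feats, visual_feats):
            # Text tokens query the visual tokens for safety-relevant evidence.
            fused, _ = self.cross_attn(text_feats, visual_feats, visual_feats)
            pooled = fused.mean(dim=1)  # average over text tokens
            return self.category_head(pooled), self.risk_head(pooled)

    class ImplicitSafetyTokens(nn.Module):
        """Trainable tokens prepended to the visual sequence; they act as
        implicit concept markers and are not directly interpretable."""
        def __init__(self, num_tokens=8, dim=1024):
            super().__init__()
            self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

        def forward(self, visual_feats):  # (B, N, D) -> (B, T + N, D)
            batch = visual_feats.size(0)
            return torch.cat([self.tokens.expand(batch, -1, -1), visual_feats], dim=1)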

(2) Multimodal Collaborative Safety Modules

  • Safety Projector: Positioned after the image encoder, it focuses on extracting safety-relevant visual features and transforming them into safety-aligned representations.
  • Text-Visual Alignment Mechanism: By combining text inputs and safety features, this mechanism generates safety prompts under joint conditions, dynamically guiding the model to produce safe responses in high-risk scenarios. A sketch of both modules follows this list.
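
Below is a hedged sketch of one possible realisation of these two modules, continuing the hypothetical naming from the previous snippet; the projector architecture and the prompt wording are assumptions, not the authors' exact design.

    import torch.nn as nn

    class SafetyProjector(nn.Module):
        """Sits after the image encoder and maps raw visual features into a
        safety-aligned representation space."""
        def __init__(self, vision_dim=1024, safety_dim=1024):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(vision_dim, safety_dim),
                nn.GELU(),
                nn.Linear(safety_dim, safety_dim),
            )

        def forward(self, visual_feats):
            return self.proj(visual_feats)

    def build_safety_prompt(category_logits, risk_logits, category_names, risk_names):
        """Turns the safety head's prediction into a textual condition that can be
        prepended to the user prompt, one possible form of the text-visual
        alignment mechanism."""
        category = category_names[category_logits.argmax(-1).item()]
        risk = risk_names[risk_logits.argmax(-1).item()]
        if risk == "high":
            return (f"[SAFETY] The image likely contains {category} content. "
                    "Respond cautiously and refuse harmful requests.\n")
        return ""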

(3) Two-Stage Training Strategy

  • Stage 1: Safety Feature Extraction - The large language model (LLM) and image encoder are frozen, and only the safety modules are trained. This stage emphasizes risk identification and feature alignment at the concept level. Components such as the safety projector, tokens, and head are trained to classify and extract safety-aligned features from visual inputs, ensuring consistent responses to risky content.
  • Stage 2: Model Fine-Tuning - The LLM is unfrozen, and the safety modules are deeply integrated with the language model. This stage enables the model to fully incorporate safety concepts, improving its performance on multimodal inputs. A sketch of the freezing schedule follows this list.
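
The freezing schedule can be summarised in a few lines of PyTorch. This sketch assumes a model object exposing vision_encoder, llm, safety_projector, safety_tokens, and safety_head submodules; these attribute names are placeholders, and in practice Stage 2 may unfreeze the LLM via LoRA adapters rather than full fine-tuning.

    def set_requires_grad(module, flag):
        for param in module.parameters():
            param.requires_grad = flag

    def configure_stage(model, stage):
        # Stage 1: freeze the LLM and the image encoder; train only the safety
        # modules (projector, tokens, head). Stage 2: additionally unfreeze the
        # LLM so the safety concepts are integrated more deeply.
        set_requires_grad(model.vision_encoder, False)
        set_requires_grad(model.llm, stage == 2)
        for module in (model.safety_projector, model.safety_tokens, model.safety_head):
            set_requires_grad(module, True)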

(4) Dynamic Safety Control in the Inference Stage

During inference, PSA-VLM leverages the outputs of the safety head to dynamically intervene in visual content processing. By adjusting the generated text using joint conditional probabilities, it ensures safe responses to high-risk content. For example, when detecting explicit content, PSA-VLM offers actionable options, allowing downstream users to make decisions based on their needs, thereby enabling more flexible safety management.
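
A conceptual inference loop is sketched below, reusing the hypothetical names from the earlier snippets (build_safety_prompt, safety_head, safety_projector). The vlm interface and category lists are assumptions for illustration, not the released API.

    import torch

    @torch.no_grad()
    def safe_generate(vlm, image, user_prompt,
                      category_names=("neutral", "explicit", "political", "cyberbullying"),
                      risk_names=("low", "medium", "high")):
        visual_feats = vlm.encode_image(image)            # frozen image encoder
        safe_feats = vlm.safety_projector(visual_feats)   # safety-aligned features
        text_feats = vlm.encode_text(user_prompt)
        cat_logits, risk_logits = vlm.safety_head(text_feats, safe_feats)

        # High-risk inputs receive a safety prefix that steers generation; a
        # downstream user could instead block the request or route it for review.
        prefix = build_safety_prompt(cat_logits, risk_logits, category_names, risk_names)
        return vlm.generate(prefix + user_prompt, visual_feats)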

Through these modules, PSA-VLM not only improves the model's ability to address unsafe content but also significantly enhances its interpretability and controllability, setting a new paradigm for safety alignment in multimodal systems. By designing efficient safety modules and alignment training strategies, PSA-VLM ensures that its enhanced safety capabilities do not compromise performance on standard tasks, achieving a balance between safety and general utility.

Evaluation of VLM Performance: Safety and General Domain Capabilities

We evaluate the performance of vision-language models (VLMs) from two perspectives: safety performance and general domain performance. To ensure a fair comparison, we first assess our model on the RTVLM benchmark using GPT-4 scoring. Since RTVLM is limited in scope and omits several high-risk sensitive categories, we expand the evaluation with additional risk datasets covering harmful political content, explicit material, and cyberbullying. We further combine subjective evaluations from GPT-4 and human experts to provide comprehensive and reliable assessments. For general domain performance, we evaluate the model on multiple benchmarks, including MMBench, SEEDBench, and MME.

(1) Safety Performance

We first used the RTVLM benchmark to analyze VLMs across various dimensions through GPT-4 scoring, covering four distinct categories to understand the models' safety capabilities. As shown in Table 1, we evaluated several open-source VLMs, GPT-4V, and our PSA-VLM. GPT-4V performed well across all categories, especially in safety-related scenarios such as CAPTCHA and jailbreak detection, and InternLM-XComposer2 also exhibited strong performance across multiple metrics. PSA-VLM, built on LLaVA, achieved remarkable results, particularly when leveraging LoRA to unfreeze the LLM: it attained the highest scores in the political (8.36) and racial (8.43) categories. Among the 7B models, PSA-VLM-7B (+LoRA) led with an average score of 8.26, followed by the variant without unfreezing the LLM at 8.18, while the 13B model with LoRA achieved the highest overall average of 8.46. These improvements highlight the effectiveness of the additional safety modules and the two-stage safety alignment strategy in enhancing the safety of PSA-VLM.

The RTVLM dataset does not include other high-risk sensitive categories such as explicit content or cyberbullying. Therefore, we conducted experiments on additional risk datasets to evaluate PSA-VLM's safety performance. As shown in Table 2, PSA-VLM-13B achieved the best performance in detecting harmful political content (9.49), explicit material (8.72), and cyberbullying (7.45), significantly outperforming the baseline model LLaVA-v1.5-13B, which scored 6.67, 1.11, and 6.16, respectively. Although PSA-VLM-7B with LoRA showed slight declines in certain tasks (e.g., 8.91 and 6.82), it still significantly outperformed LLaVA-v1.5-7B.

(2) General Performance

In PSA-VLM, enhancing safety does not come at the expense of general domain capabilities. Despite the added safety modules, PSA-VLM-7B remains competitive on general benchmarks such as MMBench, SEEDBench, and MME. As shown in Table 3, it scores 68.5 on MMBench and 65.3 on SEEDBench, indicating improved general performance. Moreover, in these multimodal benchmark evaluations, PSA-VLM-7B effectively identified and rejected responses to potentially risky images, demonstrating high sensitivity to unsafe content and underscoring the effectiveness of our safety alignment approach.

Images deemed unsafe were filtered out, allowing general performance to be evaluated on completely clean data. This ability to flag and refuse unsafe content reflects PSA-VLM-7B's robust safety performance while maintaining its overall functional capacity.

BibTeX

@misc{liu2024enhancingvisionlanguagemodelsafety,
  title={PSA-VLM: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment},
  author={Zhendong Liu and Yuanbi Nie and Yingshui Tan and Xiangyu Yue and Qiushi Cui and Chongjun Wang and Xiaoyong Zhu and Bo Zheng},
  year={2024},
  eprint={2411.11543},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.11543},
}