The Safety Illusion? Testing the Boundaries of Concept Removal in Diffusion Models

Bibliographic details
Published in: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - PP(2025), 17 Oct.
Main author: Pan, Yixiang (Author)
Other authors: Luo, Ting, Li, Yufeng, Xing, Wenpeng, Chen, Minjie, Han, Meng
Format: Online article
Language: English
Published: 2025
Collection: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects: Journal Article
Description
Abstract: Text-to-image diffusion models can produce high-quality images from textual descriptions, but they raise notable security concerns, including the generation of Not-Safe-For-Work (NSFW) content, the unauthorized replication of artists' styles, and the creation of deepfakes. Recent work has proposed concept erasure techniques that remove sensitive concepts from these models in order to mitigate the generation of undesirable content. Nevertheless, the robustness of these techniques against a wide range of adversarial inputs has not been comprehensively investigated. To address this gap, this study proposes a novel two-stage optimization attack framework based on adversarial perturbations, referred to as Concept Embedding Adversary (CEA). Leveraging the cross-modal alignment priors of the CLIP model, CEA iteratively adjusts adversarial embedding vectors to approximate the semantic expression of specific target concepts. This process yields deceptive adversarial prompts that compel diffusion models to regenerate previously erased concepts. Concept erasure methods are evaluated against diverse adversarial prompts targeting erased concepts such as NSFW content, artistic styles, and objects. Extensive experimental results demonstrate that existing concept erasure methods cannot completely eliminate target concepts; by exploiting residual vulnerabilities in the generative latent space through its two-stage optimization and precise cross-modal alignment, CEA attains a significantly higher attack success rate (ASR) in regenerating erased concepts.
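
The abstract describes CEA's first stage only at a high level: free adversarial embedding vectors are iteratively adjusted until a frozen text encoder maps them close to the embedding of the erased target concept. Below is a minimal, self-contained sketch of that optimization loop. The toy encoder stands in for CLIP's text transformer, and the cosine loss, step count, and all hyperparameters are illustrative assumptions, not the authors' implementation; the second, diffusion-side stage is omitted entirely.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class ToyTextEncoder(nn.Module):
    """Frozen stand-in for a CLIP-style text encoder: maps a sequence of
    token embeddings to one pooled semantic vector."""
    def __init__(self, dim: int = 64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, embeds: torch.Tensor) -> torch.Tensor:
        return self.encoder(embeds).mean(dim=1)  # mean-pool over tokens

dim, n_tokens = 64, 8
encoder = ToyTextEncoder(dim).eval()
for p in encoder.parameters():
    p.requires_grad_(False)  # encoder is frozen; only the prompt embeddings move

# Pooled embedding of the erased target concept. In a real attack this would be
# CLIP's embedding of the concept text (e.g., an NSFW phrase or artist name).
with torch.no_grad():
    target = encoder(torch.randn(1, n_tokens, dim))

# Stage 1: iteratively adjust adversarial embeddings toward the target semantics.
adv = torch.randn(1, n_tokens, dim, requires_grad=True)
opt = torch.optim.Adam([adv], lr=1e-2)
for _ in range(500):
    pooled = encoder(adv)
    loss = 1.0 - F.cosine_similarity(pooled, target).mean()  # semantic distance
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    sim = F.cosine_similarity(encoder(adv), target).item()
print(f"cosine similarity to erased-concept embedding: {sim:.3f}")

On the abstract's reading, the optimized embeddings would then serve as the deceptive conditioning (or be projected back to discrete prompt tokens) fed to the sanitized diffusion model, which is where the second optimization stage takes over.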
Description: Date Revised 17.10.2025
Published: Print-Electronic
Citation Status: Publisher
ISSN: 1941-0042
DOI: 10.1109/TIP.2025.3620665