The Safety Illusion? Testing the Boundaries of Concept Removal in Diffusion Models

Text-to-image diffusion models are capable of producing high-quality images from textual descriptions; however, they present notable security concerns. These include the potential for generating Not-Safe-For-Work (NSFW) content, replicating artists' styles without authorization, or creating deepfakes. Recent advancements have proposed concept erasure techniques to eliminate sensitive concepts from these models, aiming to mitigate the generation of undesirable content. Nevertheless, the robustness of these techniques against a wide range of adversarial inputs has not been comprehensively investigated. To address this challenge, a novel two-stage optimization attack framework based on adversarial perturbations, referred to as Concept Embedding Adversary (CEA), was proposed in the present study. By leveraging the cross-modal alignment priors of the CLIP model, CEA iteratively adjusts adversarial embedding vectors to approximate the semantic expression of specific target concepts. This process enables the construction of deceptive adversarial prompts that exploit diffusion models, compelling them to regenerate previously erased concepts. The performance of concept erasure methods was evaluated, specifically when dealing with diversified adversarial prompts targeting erased concepts, such as NSFW content, artistic styles, and objects. Extensive experimental results demonstrate that existing concept erasure methods are unable to completely eliminate target concepts. In contrast, the proposed CEA framework exploits residual vulnerabilities within the generative latent space through a two-stage optimization process. By achieving precise cross-modal alignment, CEA attains a significantly higher attack success rate (ASR) in regenerating erased concepts.
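The core attack loop the abstract describes, iteratively steering an adversarial embedding toward the embedding of a target concept, can be sketched in a heavily simplified form as gradient ascent on cosine similarity. All names and parameters below are illustrative; the paper's actual CEA method is a two-stage optimization through the CLIP text encoder and the diffusion model, which this toy stand-in does not implement.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def optimize_adversarial_embedding(target: np.ndarray,
                                   init: np.ndarray,
                                   steps: int = 200,
                                   lr: float = 0.1) -> np.ndarray:
    """Gradient ascent on cosine similarity toward a target concept embedding.

    Stands in for the alignment objective of an embedding-space attack; a real
    attack would back-propagate through the text encoder and generative model
    rather than optimize a free-floating vector.
    """
    e = init.copy()
    for _ in range(steps):
        na, nb = np.linalg.norm(e), np.linalg.norm(target)
        cos = e @ target / (na * nb)
        # Analytic gradient of cos(e, target) with respect to e.
        grad = target / (na * nb) - cos * e / na**2
        e += lr * grad
    return e
```

Run against a random unit "target concept" vector, the optimized embedding's cosine similarity to the target climbs close to 1, illustrating why an erased concept can still be approximated from nearby points in embedding space.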

Bibliographic details
Published in: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society. - 1992. - PP(2025), 17 Oct.
Main author: Pan, Yixiang (Author)
Other authors: Luo, Ting, Li, Yufeng, Xing, Wenpeng, Chen, Minjie, Han, Meng
Format: Online article
Language: English
Published: 2025
In collection: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Subjects: Journal Article
LEADER 01000naa a22002652c 4500
001 NLM394217152
003 DE-627
005 20251018232425.0
007 cr uuu---uuuuu
008 251018s2025 xx |||||o 00| ||eng c
024 7 |a 10.1109/TIP.2025.3620665  |2 doi 
028 5 2 |a pubmed25n1603.xml 
035 |a (DE-627)NLM394217152 
035 |a (NLM)41105541 
040 |a DE-627  |b ger  |c DE-627  |e rakwb 
041 |a eng 
100 1 |a Pan, Yixiang  |e verfasserin  |4 aut 
245 1 4 |a The Safety Illusion? Testing the Boundaries of Concept Removal in Diffusion Models 
264 1 |c 2025 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Date Revised 17.10.2025 
500 |a published: Print-Electronic 
500 |a Citation Status Publisher 
520 |a Text-to-image diffusion models are capable of producing high-quality images from textual descriptions; however, they present notable security concerns. These include the potential for generating Not-Safe-For-Work (NSFW) content, replicating artists' styles without authorization, or creating deepfakes. Recent advancements have proposed concept erasure techniques to eliminate sensitive concepts from these models, aiming to mitigate the generation of undesirable content. Nevertheless, the robustness of these techniques against a wide range of adversarial inputs has not been comprehensively investigated. To address this challenge, a novel two-stage optimization attack framework based on adversarial perturbations, referred to as Concept Embedding Adversary (CEA), was proposed in the present study. By leveraging the cross-modal alignment priors of the CLIP model, CEA iteratively adjusts adversarial embedding vectors to approximate the semantic expression of specific target concepts. This process enables the construction of deceptive adversarial prompts that exploit diffusion models, compelling them to regenerate previously erased concepts. The performance of concept erasure methods was evaluated, specifically when dealing with diversified adversarial prompts targeting erased concepts, such as NSFW content, artistic styles, and objects. Extensive experimental results demonstrate that existing concept erasure methods are unable to completely eliminate target concepts. In contrast, the proposed CEA framework exploits residual vulnerabilities within the generative latent space through a two-stage optimization process. By achieving precise cross-modal alignment, CEA attains a significantly higher attack success rate (ASR) in regenerating erased concepts. 
650 4 |a Journal Article 
700 1 |a Luo, Ting  |e verfasserin  |4 aut 
700 1 |a Li, Yufeng  |e verfasserin  |4 aut 
700 1 |a Xing, Wenpeng  |e verfasserin  |4 aut 
700 1 |a Chen, Minjie  |e verfasserin  |4 aut 
700 1 |a Han, Meng  |e verfasserin  |4 aut 
773 0 8 |i Enthalten in  |t IEEE transactions on image processing : a publication of the IEEE Signal Processing Society  |d 1992  |g PP(2025) vom: 17. Okt.  |w (DE-627)NLM09821456X  |x 1941-0042  |7 nnas 
773 1 8 |g volume:PP  |g year:2025  |g day:17  |g month:10 
856 4 0 |u http://dx.doi.org/10.1109/TIP.2025.3620665  |3 Volltext 
912 |a GBV_USEFLAG_A 
912 |a SYSFLAG_A 
912 |a GBV_NLM 
912 |a GBV_ILN_350 
951 |a AR 
952 |d PP  |j 2025  |b 17  |c 10