The Dual Power of Interpretable Token Embeddings

The Dual Power of Interpretable Token Embeddings: Jailbreaking Attacks and Defenses for Diffusion Model Unlearning

University of Michigan · Michigan State University

Abstract

Diffusion models excel at generating high-quality images but can memorize and reproduce harmful concepts when prompted. Although fine-tuning methods have been proposed to unlearn a target concept, they struggle to fully erase the concept while maintaining generation quality on other concepts, leaving models vulnerable to jailbreak attacks. Existing jailbreak methods demonstrate this vulnerability but offer limited insight into how unlearned models retain harmful concepts, which hinders progress on effective defenses. In this work, we show that the erased concept persists as a coherent, human-interpretable linear residual subspace of the token embedding space, and that both an attack and a defense follow directly from this structure. We introduce SubAttack, a novel jailbreaking attack that reads out this subspace by learning an orthogonal set of attack token embeddings, each a linear combination of human-interpretable textual elements. This reveals that unlearned models still retain the target concept through related textual components. The attack is also more powerful than prior attacks and transfers better across text prompts, initial noise, and unlearned models. Conversely, projecting out the same subspace yields SubDefense, a lightweight plug-and-play defense mechanism that suppresses the residual concept in unlearned models. SubDefense provides stronger robustness than existing defenses while better preserving safe generation quality. Extensive experiments across multiple unlearning methods, concepts, and attack types demonstrate that our approach advances both understanding and mitigation of vulnerabilities in diffusion unlearning.

@article{chen2025dualpower,
  title={The Dual Power of Interpretable Token Embeddings: Jailbreaking Attacks and Defenses for Diffusion Model Unlearning},
  author={Chen, Siyi and Zhang, Yimeng and Liu, Sijia and Qu, Qing},
  journal={arXiv preprint arXiv:2504.21307},
  year={2025}
}

**An erased concept survives as an interpretable residual subspace. SubAttack reads it out and SubDefense projects it out: two operations on the same structure.**

A Shared Interpretable Residual Subspace

Learning one interpretable attack token embedding: the concept persists along a subspace spanned by non-negative combinations of existing vocabulary tokens. SubAttack learns an orthogonal basis of this subspace; SubDefense removes it by orthogonal projection.
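
One way to write this structure down is sketched below. The notation is assumed here for illustration and is not taken from the paper: $E \in \mathbb{R}^{|V| \times d}$ stacks the vocabulary token embeddings as rows, $\alpha_i$ holds the non-negative weights of the $i$-th attack token, and $k$ is the number of attack tokens.

$$
e_i = E^\top \alpha_i, \qquad \alpha_i \in \mathbb{R}^{|V|}_{\ge 0}, \qquad \langle e_i, e_j \rangle = 0 \quad (i \ne j)
$$

$$
\mathcal{S} = \operatorname{span}\{e_1, \dots, e_k\}, \qquad P = I - \sum_{i=1}^{k} \frac{e_i e_i^\top}{\lVert e_i \rVert_2^2}
$$

Under these assumptions (mutually orthogonal, nonzero $e_i$), $P$ is the orthogonal projector onto the complement of $\mathcal{S}$: an attack reads out the directions $e_i$, while a defense maps an embedding $x$ to $Px$, which has no component left in $\mathcal{S}$.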

The Residual Is Human-Interpretable

Each learned attack token decomposes into human-readable concepts. Stronger unlearning mutes explicit keywords (e.g., “nude,” “naked”) but leaves implicit associations (e.g., “slave,” “nip,” “babes”).
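
As a rough illustration of how such a decomposition could be read, the sketch below ranks vocabulary tokens by their weight in one attack token. The function name `top_components`, the toy vocabulary, and the weight values are all hypothetical; only the idea of non-negative weights over existing vocabulary tokens comes from the text above.

```python
import numpy as np

def top_components(weights, vocab, k=3):
    """Return the k vocabulary tokens with the largest combination weights."""
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(-weights)[:k]          # indices sorted by descending weight
    return [(vocab[i], float(weights[i])) for i in order]

# Toy example: hypothetical weights, not values reported by the authors.
vocab = ["tree", "nude", "car", "naked", "babes"]
alpha = np.array([0.0, 0.6, 0.05, 0.3, 0.05])

print(top_components(alpha, vocab, k=2))
```

Listing the highest-weight tokens this way is what would make an attack token readable: the explicit or implicit words that dominate the combination can be inspected directly.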

SubAttack: Reading Out the Subspace

SubAttack generates the target concept with a high attack success rate (ASR) while staying faithful to the text prompt. For example, it produces the concept across different backgrounds, where CCE fails to follow the prompt. The attack tokens remain fully interpretable throughout.

The Residual Is Inherited & Transferable

Attack tokens transfer reliably across prompts, initial noise, and different unlearned models, and remain effective when transferred back to the original diffusion model. This indicates that the residual is inherited from the base model rather than formed independently in each unlearned model.

SubDefense: Projecting Out the Subspace

Removing the same subspace yields a lightweight, plug-and-play defense that lowers ASR across unlearned models and attack types. It also composes with existing unlearners, including adversarially fine-tuned ones such as STEREO.
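
A minimal NumPy sketch of the projection step follows, assuming the learned attack token embeddings are available as rows of a matrix and are linearly independent. The function names and the choice of QR for orthonormalization are illustrative assumptions, and where exactly the projection is applied in the text-to-image pipeline is not specified here.

```python
import numpy as np

def orthonormal_basis(attack_tokens):
    """Orthonormal basis (d, k) for the span of k attack token embeddings (k, d)."""
    tokens = np.asarray(attack_tokens, dtype=float)
    q, _ = np.linalg.qr(tokens.T)             # reduced QR: q has orthonormal columns
    return q

def project_out(embeddings, basis):
    """Remove from each row of embeddings (n, d) its component in span(basis)."""
    embeddings = np.asarray(embeddings, dtype=float)
    return embeddings - (embeddings @ basis) @ basis.T

# Toy example with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
attack_tokens = rng.normal(size=(3, 8))       # 3 hypothetical attack tokens, d = 8
prompt_embeddings = rng.normal(size=(5, 8))   # 5 hypothetical prompt token embeddings

basis = orthonormal_basis(attack_tokens)
cleaned = project_out(prompt_embeddings, basis)
print(np.allclose(cleaned @ basis, 0.0))      # no component left in the subspace
```

Because the operation is a fixed linear map, it needs no retraining of the unlearned model, which is consistent with the plug-and-play description above.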

SubDefense better preserves safe generation quality (lower FID, higher CLIP score) than prior defenses.
