Authors
Mohamed Yacine Djema, Hacene Fouchal, Olivier Flauzac, University of Reims Champagne-Ardenne, France
Abstract
Large language models (LLMs) remain vulnerable to adversarial prompting, yet state-of-the-art certified defenses such as Erase-and-Check (EC) are too slow for real-time use because they must re-evaluate hundreds of prompt variants. We investigate whether a single, attribution-guided deletion can approximate EC's robustness at a fraction of the cost and propose two variants. Method A keeps an external safety filter but replaces EC's exhaustive search with one SHAP/feature-ablation pass, erasing the k most influential tokens before a single re-check. Method B removes the filter entirely: we compute SHAP scores inside the generator (Vicuna-7B), excise the top r% of tokens once, and re-generate. On the AdvBench suite with Greedy-Coordinate-Gradient suffixes (|α| ≤ 20), Method A detects up to 75% of attacks when 55% of tokens are removed, using two forward passes instead of EC's linear-to-combinatorial explosion of re-checks, and SHAP consistently outperforms feature ablation. Method B, guided solely by SHAP, cuts harmful completions from 100% to 5% after deleting the top 20% of tokens and sustains single-digit harm rates for deletion budgets of 15–45%, narrowing the safety gap to EC while adding negligible latency. An explainer comparison shows that SHAP recovers nearly every adversarial token within the top 5% of importance ranks, whereas LIME is slightly noisier and feature ablation trails far behind. These findings expose a tunable speed–safety trade-off: attribution-guided, single-pass excision delivers large latency gains with a bounded drop in worst-case guarantees. Careful explainer choice and deletion budgeting are critical, but attribution can transform explainability from a diagnostic tool into the backbone of practical, low-latency LLM defenses.
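A conceptual sketch of Method A's single-pass erase-and-check is shown below (illustrative Python, not the paper's exact implementation; safety_filter, attribution_scores, tokens, and k are hypothetical stand-ins for the external safety classifier, the SHAP or feature-ablation explainer, the tokenized prompt, and the deletion budget described above):

def single_pass_erase_and_check(tokens, safety_filter, attribution_scores, k):
    """Flag a prompt as adversarial if it is unsafe as-is or after erasing
    the k tokens the explainer ranks as most influential."""
    if not safety_filter(tokens):                   # first forward pass; filter returns True if the prompt is safe
        return True                                 # already flagged as unsafe
    scores = attribution_scores(tokens)             # one SHAP / feature-ablation attribution pass
    top_k = set(sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k])
    pruned = [tok for i, tok in enumerate(tokens) if i not in top_k]
    return not safety_filter(pruned)                # single re-check on the pruned prompt

This replaces EC's many filter calls with exactly two, plus one attribution pass over the prompt.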
Keywords
Large Language Models, LLMs, Adversarial Prompting, Jailbreak Attacks, Explainable AI, Greedy Coordinate Gradient, Safety Certification and Robustness