Authors
Mohamed Yacine Djema, Hacene Fouchal, Olivier Flauzac, University of Reims Champagne-Ardenne, France
Abstract
Large language models (LLMs) remain vulnerable to adversarial prompting, yet state-of-the-art certified defenses such as Erase-and-Check (EC) are too slow for real-time use because they must re-evaluate hundreds of prompt variants. We investigate whether a single, attribution-guided deletion can approximate EC's robustness at a fraction of the cost and propose two variants. Method A keeps an external safety filter but replaces EC's exhaustive search with one SHAP/feature-ablation pass, erasing the k most influential tokens before a single re-check. Method B removes the filter entirely: we compute SHAP scores inside the generator (Vicuna-7B), excise the top r% of tokens once, and re-generate. On the AdvBench suite with Greedy-Coordinate-Gradient suffixes (|α| ≤ 20), Method A detects up to 75% of attacks when 55% of tokens are removed, using two forward passes instead of EC's linear-to-combinatorial explosion of re-checks, and SHAP consistently outperforms feature ablation. Method B, guided solely by SHAP, cuts harmful completions from 100% to 5% after deleting the top 20% of tokens and sustains single-digit harm rates for deletion budgets of 15–45%, narrowing the safety gap to EC while adding negligible latency. An explainer comparison shows that SHAP recovers nearly every adversarial token within the top 5% of importance ranks, whereas LIME is slightly noisier and feature ablation trails far behind. These findings expose a tunable speed–safety trade-off: attribution-guided, single-pass excision delivers large latency gains with a bounded drop in worst-case guarantees. Careful explainer choice and deletion budgeting are critical, but attribution can transform explainability from a diagnostic tool into the backbone of practical, low-latency LLM defenses.
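A conceptual sketch of Method A's single-pass erase-and-check is shown below (illustrative Python, not the paper's exact implementation; safety_filter, attribution_scores, tokens, and k are hypothetical stand-ins for the external safety classifier, the SHAP or feature-ablation explainer, the tokenized prompt, and the deletion budget described above):

def single_pass_erase_and_check(tokens, safety_filter, attribution_scores, k):
    """Flag a prompt as adversarial if it is unsafe as-is or after erasing
    the k tokens the explainer ranks as most influential."""
    if not safety_filter(tokens):                   # first forward pass; filter returns True if the prompt is safe
        return True                                 # already flagged as unsafe
    scores = attribution_scores(tokens)             # one SHAP / feature-ablation attribution pass
    top_k = set(sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k])
    pruned = [tok for i, tok in enumerate(tokens) if i not in top_k]
    return not safety_filter(pruned)                # single re-check on the pruned prompt

This replaces EC's many filter calls with exactly two, plus one attribution pass over the prompt.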
Keywords
Large Language Models, LLMs, Adversarial Prompting, Jailbreak Attacks, Explainable AI, Greedy Coordinate Gradient, Safety Certification and Robustness