An Interpretability System for Natural Language Processing Models: Enhancing Robustness Through Adversarial Attack Analysis
Authors: Mohsen Rahimi, Giulia De Poli, Andrea Masella, Matteo Bregonzio
Abstract:
Interpretability refers to the ability to understand how a natural language processing (NLP) model arrives at its predictions and why it makes particular decisions, and it is a prerequisite for trustworthy AI. It is especially important in NLP because it helps to uncover potential biases and errors in a model and can guide improvements to its overall performance. This paper discusses the increasing complexity of NLP models and the need for interpretability to ensure their reliability, impartiality, and accuracy. It proposes an interpretability system that analyses and interprets the predictions of black-box NLP models using adversarial examples. The system follows a hybrid approach, combining local and global interpretability methods to build a more comprehensive picture of the model's behaviour. The proposed system offers a state-of-the-art solution to the challenge of understanding how NLP models reach their decisions and when they can fail, thereby improving trust between humans and machines in real-world applications. It can be used to identify potential biases and errors and to build more robust, trustworthy, and accurate models.
Keywords: Interpretability, trustworthy AI, XAI, Explainable AI, NLP, Natural Language Processing.
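To give a concrete flavour of the kind of analysis the abstract describes, the following is a minimal sketch (not the authors' implementation) of probing a black-box sentiment classifier with a local explainer and a simple adversarial perturbation. The model name, the character-level perturbation rule, and the choice of LIME and Hugging Face pipelines are illustrative assumptions; the paper's actual system, including its global interpretability component, is not reproduced here.

```python
# Illustrative sketch: local explanation + adversarial probe of a black-box
# NLP model. Library and model choices are assumptions, not the paper's method.
import numpy as np
from lime.lime_text import LimeTextExplainer
from transformers import pipeline

# Black-box model: we only rely on its predicted probabilities.
clf = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    top_k=None,
)

def predict_proba(texts):
    """Return class probabilities as an (n_samples, 2) array [NEGATIVE, POSITIVE]."""
    outputs = clf(list(texts))
    probs = []
    for scores in outputs:
        by_label = {s["label"]: s["score"] for s in scores}
        probs.append([by_label["NEGATIVE"], by_label["POSITIVE"]])
    return np.array(probs)

text = "The film was surprisingly good and the acting was solid."

# Local interpretability: which tokens drive this single prediction?
explainer = LimeTextExplainer(class_names=["NEGATIVE", "POSITIVE"])
explanation = explainer.explain_instance(text, predict_proba, num_features=6)
print(explanation.as_list())

# Adversarial probe: perturb an influential word (hypothetical typo attack)
# and check whether the predicted probabilities shift or the label flips.
perturbed = text.replace("good", "goood")
print(predict_proba([text, perturbed]))
```

In a fuller system along the lines the paper describes, such per-example explanations and adversarial probes would be aggregated across a dataset to obtain a global view of where and why the model fails.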