Systematic Generation of Adversarial Datasets with Controllable Noise Levels

Mohammad Reza Norouzi

Authors

Mohammad Reza Norouzi * M.Sc. Student, Department of Computer Engineering, Kish International Campus, University of Tehran, Kish, Iran mrnorouzi@ut.ac.ir

Keywords:

Persian sentiment analysis, adversarial data, controllable noise, natural language processing, model robustness

Abstract

The rapid proliferation of user-generated textual content on social networks and digital platforms has created significant challenges for sentiment analysis systems. These challenges are more pronounced in the Persian language due to the scarcity of high-quality datasets, orthographic variability, and the high sensitivity of models to noise. One of the most critical issues is the vulnerability of machine learning models to textual noise and adversarial attacks, which can lead to substantial performance degradation. The objective of this study is to propose a systematic approach for generating adversarial textual datasets with controllable noise levels in order to evaluate and enhance the robustness of Persian sentiment analysis models. In this research, a baseline Persian sentiment analysis dataset was first preprocessed. Subsequently, a framework was designed to introduce targeted noise types, including word substitution, deletion, insertion, and permutation. For each type of noise, an intensity parameter was defined to enable precise control over the degree of perturbation. The adversarial data were generated independently of any specific model, and each instance was annotated not only with its sentiment label but also with metadata specifying the type and level of noise applied. The performance of several sentiment analysis models was then evaluated before and after training with the adversarial dataset. The results indicated that models trained exclusively on clean data experienced significant performance degradation when exposed to adversarial samples, particularly under substitution and deletion noise. In contrast, training with the generated adversarial dataset led to a considerable improvement in noise robustness and performance stability. 
The findings suggest that the systematic generation of adversarial data with controllable noise constitutes an effective instrument for sensitivity analysis and robustness enhancement in Persian sentiment analysis models and can play a critical role in the development of reliable systems under real-world conditions.
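The abstract does not give implementation details, so the following is only a minimal sketch of how a model-independent noise generator of this kind might look. It assumes whitespace tokenization, an intensity parameter `level` read as the fraction of tokens perturbed, and a replacement vocabulary drawn from the dataset itself. All names (`NoisyExample`, `perturb`, `make_adversarial`, `accuracy_by_condition`) and the toy sentence are hypothetical and are not taken from the study.

```python
import random
from dataclasses import dataclass

# The four noise types named in the abstract.
NOISE_TYPES = ("substitution", "deletion", "insertion", "permutation")


@dataclass
class NoisyExample:
    text: str           # perturbed text
    label: str          # original sentiment label, left unchanged
    noise_type: str     # metadata: which perturbation was applied
    noise_level: float  # metadata: intensity in [0, 1]


def perturb(tokens, noise_type, level, vocab, rng):
    """Return a perturbed copy of tokens.

    Assumption: level is the fraction of tokens affected (rounded).
    """
    if noise_type not in NOISE_TYPES:
        raise ValueError(f"unknown noise type: {noise_type}")
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must lie in [0, 1]")
    out = list(tokens)
    n = len(out)
    k = min(n, round(level * n))
    if k == 0:
        return out
    if noise_type == "substitution":
        # Replace k random positions with random vocabulary words
        # (a draw may occasionally equal the original word).
        for i in rng.sample(range(n), k):
            out[i] = rng.choice(vocab)
    elif noise_type == "deletion":
        # Drop k random positions, but always keep at least one token.
        drop = set(rng.sample(range(n), min(k, n - 1)))
        out = [tok for i, tok in enumerate(out) if i not in drop]
    elif noise_type == "insertion":
        # Insert k random vocabulary words at random positions.
        for _ in range(k):
            out.insert(rng.randint(0, len(out)), rng.choice(vocab))
    else:
        # Permutation: shuffle the tokens found at k random positions.
        positions = rng.sample(range(n), k)
        values = [out[i] for i in positions]
        rng.shuffle(values)
        for i, value in zip(positions, values):
            out[i] = value
    return out


def make_adversarial(dataset, levels=(0.1, 0.3, 0.5), seed=0):
    """Build one noisy copy per (example, noise type, level) combination."""
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    vocab = sorted({tok for text, _ in dataset for tok in text.split()})
    examples = []
    for text, label in dataset:
        tokens = text.split()
        for noise_type in NOISE_TYPES:
            for level in levels:
                noisy = perturb(tokens, noise_type, level, vocab, rng)
                examples.append(
                    NoisyExample(" ".join(noisy), label, noise_type, level)
                )
    return examples


def accuracy_by_condition(examples, predict):
    """Accuracy of predict(text) -> label, grouped by (noise type, level)."""
    totals, hits = {}, {}
    for ex in examples:
        key = (ex.noise_type, ex.noise_level)
        totals[key] = totals.get(key, 0) + 1
        hits[key] = hits.get(key, 0) + (predict(ex.text) == ex.label)
    return {key: hits[key] / totals[key] for key in totals}


# Toy illustration only; not data from the study.
toy = [("این محصول بسیار خوب است", "positive")]
adversarial = make_adversarial(toy, levels=(0.2, 0.6))
```

Because the generator never queries a classifier, the same noisy set can be reused across models. Passing any `predict` function to `accuracy_by_condition` then gives a per-noise-type, per-level breakdown of the kind the abstract describes for sensitivity analysis.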

