16th International Conference on Information and Knowledge Technology
Robustness Gap in NLP Models for Vulnerability Descriptions: Benchmarking and Data Augmentation
Authors :
AmirHossein Majd (1), Mahdi Yousefikia (2), Saghar Ghasemzadeh (3), Amirreza Asari (4), Arya Khoshnavataher (5), Seyedeh Leili Mirtaheri (6)
1- University of Calabria
2- Kharazmi University
3- Kharazmi University
4- Kharazmi University
5- Kharazmi University
6- University of Calabria
Keywords :
Software Vulnerabilities, Natural Language Processing, Robustness Benchmark, Noise Injection, Exploitability Prediction, Data Augmentation, Cybersecurity
Abstract :
Software vulnerability descriptions from CVE/NVD are the primary corpus for analysis, prioritization, and risk management in cybersecurity. Yet natural noise (typos, synonym substitutions, lexical variety) and adversarial perturbations undermine the accuracy and trustworthiness of NLP models. This paper presents, to our knowledge, the first systematic benchmark of NLP robustness on vulnerability descriptions. We train nine diverse architectures—lightweight transformers (MiniLM, MPNet, SBERT), hybrid models (BERT-LSTM, TextRCNN), and classical recurrent networks (BiLSTM, LSTM)—on a balanced dataset of over 56,000 real-world records from NVD and Exploit-DB, and fine-tune them for exploitability prediction. For comprehensive evaluation, we inject three noise families into test sets at levels from 10% to 80%: character-level edits (substitutions/swaps), synonym replacements using WordNet, and composite adversarial attacks generated with TextAttack. Performance declines across all models as noise rises, but vulnerability profiles differ: MiniLM attains the strongest clean-data score (F1 ≈ 0.933) yet is most brittle under character noise, whereas TextRCNN, despite a lower baseline, preserves comparatively higher stability in heavily perturbed conditions. Finally, we test a pragmatic hardening strategy—data augmentation with noisy variants followed by retraining—which consistently narrows robustness gaps across architectures without materially sacrificing clean-data accuracy. The benchmark and code enable reproducible evaluation and future robust modeling in cybersecurity.
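The character-level noise family described in the abstract (substitutions and swaps applied at a chosen noise level) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `inject_char_noise`, its parameters, and the interpretation of the noise level as a per-character perturbation probability are all assumptions made for illustration.

```python
import random
import string


def inject_char_noise(text, rate, seed=0):
    """Perturb alphabetic characters in `text` with probability `rate` each.

    Hypothetical helper: each selected character is either swapped with the
    next alphabetic neighbor or replaced by a different lowercase letter.
    Non-alphabetic characters (spaces, digits, punctuation) are left in place.
    """
    rng = random.Random(seed)  # seeded so that noisy test sets are reproducible
    chars = list(text)
    i = 0
    while i < len(chars):
        if chars[i].isalpha() and rng.random() < rate:
            can_swap = i + 1 < len(chars) and chars[i + 1].isalpha()
            if can_swap and rng.random() < 0.5:
                # Swap: transpose two adjacent letters, then skip past both.
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                i += 2
                continue
            # Substitution: pick a lowercase letter different from the current one.
            options = [c for c in string.ascii_lowercase if c != chars[i].lower()]
            chars[i] = rng.choice(options)
        i += 1
    return "".join(chars)


if __name__ == "__main__":
    description = "Buffer overflow in the parser allows remote code execution."
    # Noise levels spanning the 10% to 80% range mentioned in the abstract.
    for level in (0.1, 0.4, 0.8):
        print(level, inject_char_noise(description, level, seed=42))
```

Under the same assumptions, the augmentation strategy mentioned in the abstract would amount to applying such a function to training descriptions and retraining on the union of clean and noisy variants.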