Internship Title
Enhancing Software Vulnerability Detection using Graph Neural Networks (GNN) and Large Language Models (LLMs)
Internship Location
DEI-FCTUC
Background
Software vulnerabilities are a persistent threat to software development and, when exploited, can lead to data loss, financial damage, and other serious consequences. Existing vulnerability detection techniques are typically divided into static (e.g., source code analysis) and dynamic (e.g., execution-time analysis) approaches. However, these methods often suffer from incomplete coverage and high false positive or false negative rates.
Prior research has demonstrated the potential of using Deep Graph Convolutional Neural Networks (DGCNNs) to detect vulnerable code by leveraging graph-based representations like Control Flow Graphs (CFGs). A previous study within this research line used DGCNNs to classify vulnerable C functions, but was limited in the diversity of analyzed projects and in the extracted features. While promising, these graph-only representations may lack deeper semantic understanding.
Recent breakthroughs in Large Language Models (LLMs), such as CodeBERT or GPT, have opened up new opportunities for extracting rich semantic features directly from raw code. These models, pre-trained on vast amounts of source code, can be leveraged to produce contextual embeddings that complement or even rival graph-based representations.
This research aims to explore and evaluate the integration of graph-based and language-based representations to improve the detection of software vulnerabilities. Graph representations will be extracted using tools like Joern, while LLMs will be employed to generate semantic embeddings from code. The hypothesis is that a combination of structural and semantic embeddings may outperform traditional approaches.
This hybrid approach will be evaluated across different datasets and with multiple classifiers (e.g., VGG, Random Forest, SVM), with a focus on measuring generalizability and reducing false positives. The outcome will help determine whether and how LLMs can enhance vulnerability detection, used either alone or in conjunction with graph-based models.
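As a minimal illustration of the hybrid idea, the sketch below concatenates a structural embedding and a semantic embedding per function and trains a Random Forest on the result. The arrays are random stand-ins for the real DGCNN and LLM embeddings (the dimensions 128 and 768 are illustrative assumptions, not fixed by the proposal):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 functions, a 128-dim structural (DGCNN-style)
# embedding and a 768-dim semantic (CodeBERT-sized) embedding each,
# with binary vulnerability labels.
structural = rng.normal(size=(200, 128))
semantic = rng.normal(size=(200, 768))
labels = rng.integers(0, 2, size=200)

# Hybrid representation: concatenate both embeddings per function.
hybrid = np.concatenate([structural, semantic], axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    hybrid, labels, test_size=0.25, random_state=0
)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(hybrid.shape)  # (200, 896)
```

With real embeddings, the same pipeline lets graph-only, LLM-only, and hybrid inputs be compared by simply swapping the feature matrix.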
Objective
The primary learning objectives of this research are as follows:
• Investigate the applicability of Large Language Models (LLMs) such as CodeBERT and GPT-based models for vulnerability detection.
• Compare the effectiveness of graph-only, LLM-only, and hybrid (combined) representations.
• Apply dimensionality reduction and/or fusion strategies to align the different embedding types.
• Evaluate performance using standard classifiers (Random Forest, SVMs, etc.) and explore alternatives.
• Propose a concrete integration pipeline of LLM and GNN embeddings for future research.
• Assess whether LLM embeddings can reveal vulnerabilities missed by GNN-based models.
• Test generalization on unrelated datasets (e.g., SARD, NVD) and analyze how LLMs impact false positives and false negatives.
Work Plan - Semester 1
T1. [09/09/2025 to 30/09/2025] Literature Review and Tool Familiarization.
Survey advanced uses of LLMs in code representation and vulnerability detection. Set up transformer models (CodeBERT, GraphCodeBERT, GPT-4 via API or open models). Review methods for embedding comparison and combination.
T2. [01/10/2025 to 31/10/2025] LLM-Based Embedding Extraction
Extract contextual embeddings from each code function using LLMs. Evaluate different tokenization strategies and embedding layers.
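A common way to turn per-token LLM outputs into one fixed-size function embedding is mean pooling over the non-padding tokens. The sketch below shows the pooling step only, using a small NumPy array as a stand-in for a CodeBERT hidden-state matrix:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions.

    token_embeddings: (seq_len, hidden) array, e.g. a transformer hidden state.
    attention_mask:   (seq_len,) array with 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[:, None].astype(float)
    summed = (token_embeddings * mask).sum(axis=0)
    return summed / mask.sum()

# Toy example: 4 tokens (last one is padding), hidden size 3.
tokens = np.array([[1.0, 0.0, 2.0],
                   [3.0, 0.0, 0.0],
                   [2.0, 0.0, 4.0],
                   [9.0, 9.0, 9.0]])   # padding row, must be ignored
mask = np.array([1, 1, 1, 0])
embedding = mean_pool(tokens, mask)
print(embedding)  # [2. 0. 2.]
```

Alternatives worth comparing in T2 include using the [CLS] token vector or pooling over different hidden layers.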
T3. [01/11/2025 to 30/11/2025] Fusion Strategy Design
Design and implement methods to combine LLM embeddings with existing DGCNN embeddings (from DDGs, ASTs, CFGs). Prepare both early and late fusion setups.
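The two fusion setups can be contrasted in a few lines. In this hedged sketch (random stand-in features, logistic regression instead of the eventual classifiers), early fusion trains one model on concatenated features, while late fusion trains one model per representation and averages their predicted probabilities:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
graph_emb = rng.normal(size=(100, 16))   # stand-in for DGCNN embeddings
llm_emb = rng.normal(size=(100, 32))     # stand-in for LLM embeddings
y = rng.integers(0, 2, size=100)

# Early fusion: a single classifier over the concatenated feature vector.
early = LogisticRegression(max_iter=1000).fit(
    np.concatenate([graph_emb, llm_emb], axis=1), y
)

# Late fusion: one classifier per representation, probabilities averaged.
clf_graph = LogisticRegression(max_iter=1000).fit(graph_emb, y)
clf_llm = LogisticRegression(max_iter=1000).fit(llm_emb, y)
late_probs = (clf_graph.predict_proba(graph_emb)
              + clf_llm.predict_proba(llm_emb)) / 2
late_pred = late_probs.argmax(axis=1)
print(late_pred.shape)  # (100,)
```

Early fusion lets the classifier learn cross-representation interactions; late fusion keeps the models independent and is easier to ablate.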
T4. [01/12/2025 to 10/01/2026] Write the intermediate report.
Work Plan - Semester 2
T5. [11/01/2026 to 28/02/2026] Model Integration and Classifier Design
Integrate hybrid embeddings into existing classifier architectures. Compare separate, fused, and attention-based variants of the model.
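The attention-based variant can be reduced to its core operation: learn a relevance score per modality, normalize the scores with a softmax, and take the weighted sum of (same-dimension) modality embeddings. The sketch below shows only that mechanism, with hand-picked vectors and scores standing in for learned ones:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_fuse(embeddings, scores):
    """Softmax-weighted sum of same-dimension modality embeddings.

    embeddings: list of (d,) vectors (e.g. projected graph and LLM embeddings).
    scores:     raw relevance scores, one per modality (learned in practice).
    """
    weights = softmax(np.asarray(scores, dtype=float))
    return sum(w * e for w, e in zip(weights, embeddings))

graph_vec = np.array([1.0, 0.0])
llm_vec = np.array([0.0, 1.0])
fused = attention_fuse([graph_vec, llm_vec], scores=[0.0, 0.0])
print(fused)  # equal scores -> equal weights -> [0.5 0.5]
```

In a full model the scores would come from a small learned layer, and the modality embeddings would first be projected to a shared dimension.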
T6. [01/03/2026 to 31/03/2026] Training and Evaluation
Train and evaluate the models using precision, recall, F1-score, and confusion matrices. Focus on understanding performance shifts when LLMs are included.
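The planned metrics are standard and available in scikit-learn; a small worked example on toy labels makes the definitions concrete (1 = vulnerable, 0 = non-vulnerable):

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score,
                             f1_score, confusion_matrix)

# Toy ground truth and predictions for five functions.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

precision = precision_score(y_true, y_pred)   # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)         # TP / (TP + FN) = 2 / 3
f1 = f1_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)         # rows: true class, cols: predicted
print(precision, recall, f1)
print(cm)  # [[1 1]
           #  [1 2]]
```

The confusion matrix makes the false positive and false negative counts explicit, which is exactly what T7's analysis of LLM impact needs.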
T7. [01/04/2026 to 30/04/2026] Cross-Dataset Validation
Evaluate models on datasets beyond the training set, including SARD and NVD. Focus on performance degradation or improvements related to unseen data and noise.
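Cross-dataset validation amounts to fitting on one corpus and scoring on another; the sketch below mimics this with two synthetic datasets, where the second distribution is deliberately shifted to imitate the drift between corpora such as SARD and NVD (all data here is fabricated for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(2)

# Stand-in "training" dataset and a shifted "unseen" dataset.
X_train = rng.normal(size=(300, 64))
y_train = (X_train[:, 0] > 0).astype(int)
X_other = rng.normal(loc=0.5, size=(300, 64))   # distribution shift
y_other = (X_other[:, 0] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

in_domain_f1 = f1_score(y_train, clf.predict(X_train))
cross_f1 = f1_score(y_other, clf.predict(X_other))
print(in_domain_f1, cross_f1)
```

Comparing the two F1 scores quantifies the degradation (or robustness) on unseen data, which is the central question of T7.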
T8. [01/05/2026 to 30/06/2026] Report and Documentation.
Produce a comprehensive thesis and technical documentation. Emphasize the proposed methodology for future researchers combining structural and semantic code representations.
Conditions
- You will have a workplace in the SSE Laprie Lab (G4.5)
- Proposal in the scope of the CSLab (Cybersecurity Laboratory)
- Computational infrastructure will be provided for the work
Remarks
Recommended Bibliography:
- José D'Abruzzo Pereira, Nuno Lourenço, and Marco Vieira. 2023. On the Use of Deep Graph CNN to Detect Vulnerable C Functions. In Proceedings of the 11th Latin-American Symposium on Dependable Computing (LADC '22). Association for Computing Machinery, New York, NY, USA, 45–50. https://doi.org/10.1145/3569902.3569913
- Jiaqi Yan, Guanhua Yan, and Dong Jin. 2019. Classifying Malware Represented as Control Flow Graphs using Deep Graph Convolutional Neural Network. In 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). 52–63. https://doi.org/10.1109/DSN.2019.00020
- Xuan, C. D. (2023). A new approach to software vulnerability detection based on CPG analysis. Cogent Engineering, 10(1). https://doi.org/10.1080/23311916.2023.2221962
- Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1536–1547, Online. Association for Computational Linguistics.
Supervisor
José Alexandre D'Abruzzo Pereira
josep@dei.uc.pt