Internship Title
Labelling Vulnerability Categories of Reported Vulnerabilities
Internship Location
Coimbra
Background
Software vulnerabilities are a persistent threat to software systems and, when exploited, can lead to data loss, financial damage, and other serious consequences. Existing vulnerability detection techniques are typically divided into static (e.g., source code analysis) and dynamic (e.g., execution-time analysis) approaches. However, these methods rely on predefined rules that match patterns in the source code or in the running system to flag security violations.
Well-known software vulnerability databases (e.g., CVE Details and NVD) make the known vulnerabilities of many software systems publicly available. Additional information, such as the vulnerability type, the CWE (Common Weakness Enumeration) identifier, and the CVSS (Common Vulnerability Scoring System) score, is usually present in these databases. However, this information is not always available when a software vulnerability is disclosed, and without it, Software Vulnerability Detection (SVD) mechanisms cannot be developed properly.
This research aims to explore and evaluate the automatic labelling of software vulnerabilities that have no assigned vulnerability category, using only the information available in the databases at disclosure time. Supervised and semi-supervised learning algorithms should be used, complemented by an evaluation using Large Language Models (LLMs). An analysis of the vulnerabilities without a category should also be performed, both to understand what data are usually disclosed along with them and to understand how long it usually takes for a vulnerability to have a category assigned.
To this end, a software vulnerability dataset [1] should be used. It includes vulnerabilities from large open-source C/C++ projects and can be extended if needed. As this dataset contains vulnerabilities collected from CVE Details, most of them are already labelled with vulnerability categories, which can be used to validate the experiments.
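As an illustration of the intended labelling approach, the sketch below applies self-training (scikit-learn's SelfTrainingClassifier) over TF-IDF features of the vulnerability descriptions. It is a minimal sketch, assuming a hypothetical CSV export of the dataset with description and category columns; the actual features and algorithms are to be chosen during the work.

    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.semi_supervised import SelfTrainingClassifier

    # Hypothetical CSV export: 'description' text plus a 'category' column
    # that is empty for still-unlabelled vulnerabilities.
    df = pd.read_csv("vulnerabilities.csv")

    # scikit-learn's semi-supervised API marks unlabelled samples with -1.
    mask = df["category"].notna()
    codes, categories = pd.factorize(df.loc[mask, "category"])
    labels = np.full(len(df), -1)
    labels[mask.to_numpy()] = codes

    X = TfidfVectorizer(max_features=20000).fit_transform(df["description"])

    # Self-training wraps a base classifier that exposes predict_proba and
    # iteratively pseudo-labels the unlabelled rows it is confident about.
    model = SelfTrainingClassifier(LogisticRegression(max_iter=1000),
                                   threshold=0.8)
    model.fit(X, labels)
    df["predicted_category"] = categories[model.predict(X)]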
Objective
The primary learning objectives of this research are as follows:
• Understand the software vulnerability lifecycle.
• Perform a study to understand the time it usually takes for a software vulnerability to have its vulnerability category assigned (see the sketch after this list).
• Gain practical knowledge of using open-source LLMs to assign a vulnerability category.
• Acquire hands-on experience in obtaining information from online software vulnerability datasets.
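As a minimal sketch of the assignment-delay study mentioned above, assuming hypothetical per-vulnerability timestamps for publication and for the first category assignment:

    import pandas as pd

    # Column names are placeholders for whatever the collected data provides.
    df = pd.read_csv("cve_history.csv",
                     parse_dates=["published", "cwe_assigned"])

    delay = (df["cwe_assigned"] - df["published"]).dt.days
    print(f"median delay: {delay.median():.0f} days")
    print(f"assigned within 90 days: {(delay <= 90).mean():.1%}")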
Work Plan - Semester 1
T1. [09/09/2025 to 15/10/2025] Literature Review.
During this initial phase, an extensive literature review will be conducted to understand the state of the art on the software vulnerability lifecycle and on semi-supervised learning algorithms for classification.
T2. [16/10/2025 to 30/11/2025] Dataset Setup and Preliminary Evaluation.
Set up the dataset and perform initial experiments to understand the time it takes for a vulnerability category to be assigned. It may be necessary to scrape data from the online databases.
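A minimal sketch of such data collection, using the public NVD REST API (v2.0), is shown below; the endpoint and field names follow the NVD documentation at the time of writing and should be re-checked.

    import requests

    NVD_URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"

    resp = requests.get(NVD_URL,
                        params={"resultsPerPage": 50, "startIndex": 0},
                        timeout=30)
    resp.raise_for_status()

    for item in resp.json().get("vulnerabilities", []):
        cve = item["cve"]
        # 'weaknesses' holds the CWE assignments; it is absent or empty for
        # vulnerabilities that have no category yet.
        cwes = [d["value"] for w in cve.get("weaknesses", [])
                for d in w.get("description", [])]
        print(cve["id"], cve.get("published"), cwes or "no CWE yet")

Note that NVD rate-limits unauthenticated clients, so bulk collection should paginate with delays or use an API key.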
T3. [01/12/2025 to 10/01/2026] Write the Intermediate Report.
Work Plan - Semester 2
T4. [11/01/2026 to 31/01/2026] Research and Design the Prompts to Be Used.
Research different prompt engineering techniques and design a set of prompts to be used in the experiments.
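For illustration only, one possible zero-shot template is sketched below; the category list, wording, and output format are placeholders to be refined during this task.

    # Illustrative zero-shot template; categories and wording are placeholders.
    PROMPT_TEMPLATE = """You are a security analyst. Classify the following
    vulnerability description into exactly one category from this list:
    {categories}

    Description: {description}

    Answer with the CWE identifier only (e.g., CWE-79)."""

    prompt = PROMPT_TEMPLATE.format(
        categories="CWE-79, CWE-89, CWE-119, CWE-416",  # placeholder subset
        description="Buffer overflow in the header parser allows ...",
    )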
T5. [01/02/2026 to 30/04/2026] Experiments with Open-Source LLMs and Fine-Tuning of Hyperparameters.
Perform experiments with different open-source LLMs to assign vulnerability categories, tuning their hyperparameters.
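Since most vulnerabilities in the dataset already carry a category from CVE Details (see the Background section), those labels can serve as ground truth for this task. A minimal sketch of the validation step, with hypothetical toy data:

    from sklearn.metrics import classification_report

    # Hypothetical toy example: gold categories from CVE Details vs. the
    # categories assigned by an LLM.
    y_true = ["CWE-79", "CWE-89", "CWE-119", "CWE-79", "CWE-416"]
    y_pred = ["CWE-79", "CWE-79", "CWE-119", "CWE-79", "CWE-416"]

    # Per-category precision/recall/F1 plus macro averages, which are more
    # informative than plain accuracy when categories are imbalanced.
    print(classification_report(y_true, y_pred, zero_division=0))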
T6. [01/05/2026 to 31/05/2026] Write a Technical Paper.
Write a paper reporting the main findings of this research for submission to a journal or conference.
T7. [01/06/2026 to 30/06/2026] Report and Documentation.
The final phase will involve documenting the research findings, methodology, and results in a comprehensive report summarizing the research outcomes.
Conditions
- You will have a position in the SSE Laprie Lab
- Proposal in the scope of the CSLab (Cybersecurity Laboratory)
- Computational infrastructure will be provided for the work
Remarks
Recommended Bibliography:
1. J. D. Pereira, J. H. Antunes and M. Vieira, "A Software Vulnerability Dataset of Large Open Source C/C++ Projects," 2022 IEEE 27th Pacific Rim International Symposium on Dependable Computing (PRDC), Beijing, China, 2022, pp. 152-163, doi: 10.1109/PRDC55274.2022.00029.
2. W. X. Zhao et al., “A Survey of Large Language Models”, 2023. https://arxiv.org/abs/2303.18223
3. X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, "Large Language Models for Software Engineering: A Systematic Literature Review," ACM Trans. Softw. Eng. Methodol., vol. 33, no. 8, Article 220, November 2024. https://doi.org/10.1145/3695988
4. J. Gennari, S. Lau, S. J. Perl, J. Parish (OpenAI), and G. Sastry (OpenAI), "Considerations for Evaluating Large Language Models for Cybersecurity Tasks," Software Engineering Institute, February 2024. https://insights.sei.cmu.edu/library/considerations-for-evaluating-large-language-models-for-cybersecurity-tasks/
This work is co-supervised by André Grégio (UFPR/Brazil) and Paulo Ricardo Lisboa de Almeida (UFPR/Brazil). André works in cybersecurity, and Paulo works in intelligent systems.
Supervisor
José Alexandre D'Abruzzo Pereira
josep@dei.uc.pt