Assigned Proposals 2024/2025

DEI - FCTUC
Generated on 2024-07-17 08:16:34 (Europe/Lisbon).

Internship Title

On the Use of Historical Static Data to Predict Software Vulnerable Code

Internship Location

DEI-SSE

Background

Software vulnerabilities are caused by design flaws or implementation bugs (OWASP). When they are exploited, they can lead to consequences such as unauthorized access, data loss, and financial loss, among others. Current techniques can detect vulnerabilities by analyzing the source code (static techniques) or by executing the software (dynamic techniques) (B. Liu, L. Shi, Z. Cai, and M. Li, “Software Vulnerability Discovery Techniques: A Survey,” in 2012 Fourth International Conference on Multimedia Information Networking and Security, Nov 2012). However, these techniques are not fully effective, as they cannot reveal all vulnerabilities. Additionally, they raise alerts that may correspond to actual vulnerabilities or to false alarms (false positives).

Machine Learning (ML) algorithms have been employed to detect software vulnerabilities in code units (e.g., functions). However, they suffer from the same issue as other static techniques (e.g., static analysis tools): they report a high number of false positives (FPs). FPs are items reported as vulnerable that are not actually vulnerable code units. Consequently, software development teams must spend time separating the actual vulnerable code units from the non-vulnerable ones. Additionally, ML techniques treat all vulnerable code units in the same manner. For example, a vulnerable code unit disclosed 20 years ago has the same importance as one disclosed a few months ago. However, technology evolves to address widely known vulnerabilities, so old vulnerabilities should be treated differently from the newest ones.

Through this research, we aim to study the use of ML algorithms to detect vulnerable software code units, considering the historical range of software vulnerable code units. To do so, an environment to run ML algorithms should be configured. Adaptations to the algorithms may need to be performed to increase the weight of more recent code units and decrease the weight of older ones. The outcome of this study will help to understand whether using historical information provides better detection of vulnerable code units than when such information is not considered.
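One possible way to realize this weighting is to pass per-sample weights, derived from the age of each vulnerability, to an off-the-shelf classifier. The following is a minimal sketch assuming scikit-learn and pandas; the file name, the "disclosure_year" and "vulnerable" columns, and the decay factor are illustrative assumptions rather than properties of the datasets referenced below.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical dataset: one row per code unit, with static-metric feature
# columns, a binary "vulnerable" label, and the year the vulnerability was disclosed.
df = pd.read_csv("vulnerability_dataset.csv")
X = df.drop(columns=["vulnerable", "disclosure_year"])
y = df["vulnerable"]

# Exponential time decay: a sample disclosed 20 years ago weighs far less
# than one disclosed a few months ago (the decay factor is an assumption to be tuned).
current_year = 2024
decay = 0.9
sample_weight = decay ** (current_year - df["disclosure_year"])

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y, sample_weight=sample_weight)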

Objective

The primary learning objectives of this research are as follows:

• Understand the most frequent types of vulnerabilities per project and programming language.
• Gain practical knowledge of software vulnerability detection techniques.
• Acquire hands-on experience in evaluating the performance of ML algorithms to detect vulnerable code units.
• Develop practical skills in adapting ML algorithms to reflect the historical properties of software vulnerabilities.
• Use software vulnerability datasets as input to ML algorithms, updating them (extracting new features) if needed.

Work Plan - Semester 1

T1. [09/09/2024 to 15/10/2024] Literature Review.
During this initial phase, an extensive literature review will be conducted to understand the state of the art on detecting vulnerable code units while considering historical information.


T2. [16/10/2024 to 30/11/2024] Tool Setup and Preliminary Evaluation.
Set up the experiments with already available datasets, using a sliding window for training and testing, as sketched below. At this stage, all code units (samples) are available for training and have the same weight. Additionally, experiments that do not consider time-specific information should also be evaluated.
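A minimal sketch of such a sliding-window evaluation is given below, assuming scikit-learn and a pandas DataFrame with a "disclosure_year" column and a binary "vulnerable" label; the window sizes and column names are illustrative assumptions.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def sliding_window_eval(df, feature_cols, train_years=5, test_years=1):
    """Train on a window of past years and test on the following year(s)."""
    years = sorted(df["disclosure_year"].unique())
    scores = []
    for start in range(len(years) - train_years - test_years + 1):
        train_span = years[start:start + train_years]
        test_span = years[start + train_years:start + train_years + test_years]
        train = df[df["disclosure_year"].isin(train_span)]
        test = df[df["disclosure_year"].isin(test_span)]
        # At this stage every sample has the same weight (no time decay yet).
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        clf.fit(train[feature_cols], train["vulnerable"])
        preds = clf.predict(test[feature_cols])
        scores.append(f1_score(test["vulnerable"], preds))
    return scores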


T3. [01/12/2024 to 10/01/2025] Write the intermediate report.

Work Plan - Semester 2

T4. [11/01/2025 to 28/02/2025] Experiments considering historical information.
Perform experiments assigning different weights to each code unit (sample), as sketched below. Adaptations to the ML algorithms may be needed. A comparison with the preliminary evaluation should be performed.
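A minimal sketch of this comparison follows, reusing the assumptions of the earlier sketches (feature columns, "vulnerable" label, and time-decayed weights); the metrics shown, precision, recall, and the raw false-positive count, are only one possible choice.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score

def fit_and_score(train, test, feature_cols, sample_weight=None):
    """Train with optional per-sample weights and report FP-oriented metrics."""
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(train[feature_cols], train["vulnerable"], sample_weight=sample_weight)
    preds = clf.predict(test[feature_cols])
    tn, fp, fn, tp = confusion_matrix(test["vulnerable"], preds).ravel()
    return {
        "precision": precision_score(test["vulnerable"], preds),
        "recall": recall_score(test["vulnerable"], preds),
        "false_positives": int(fp),
    }

# Baseline (equal weights) vs. time-decayed weights computed as in the first sketch:
# baseline = fit_and_score(train, test, feature_cols)
# weighted = fit_and_score(train, test, feature_cols, sample_weight=weights)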


T5. [01/03/2025 to 30/04/2025] Evaluation using other datasets.
Expand the evaluation to other vulnerability datasets and understand whether the approach using historical information works in the same manner across different programming languages.


T6. [01/05/2025 to 30/06/2025] Report and Documentation.
The final phase will involve documenting the research findings, methodologies, and results. A comprehensive report summarizing the research outcomes, including the adaptations performed on the ML algorithms, will be prepared.

Conditions

- You will have a position in the SSE Laprie Lab
- Computational infrastructure will be provided for the work

Remarks

Recommended Bibliography:
- J. D. Pereira, J. H. Antunes and M. Vieira, "A Software Vulnerability Dataset of Large Open Source C/C++ Projects," 2022 IEEE 27th Pacific Rim International Symposium on Dependable Computing (PRDC), Beijing, China, 2022, pp. 152-163, doi: 10.1109/PRDC55274.2022.00029.
- J. R. Campos, M. Vieira and E. Costa, "Propheticus: Machine Learning Framework for the Development of Predictive Models for Reliable and Secure Software," 2019 IEEE 30th International Symposium on Software Reliability Engineering (ISSRE), Berlin, Germany, 2019, pp. 173-182, doi: 10.1109/ISSRE.2019.00026.
- J. Walden, J. Stuckman and R. Scandariato, "Predicting Vulnerable Components: Software Metrics vs Text Mining," 2014 IEEE 25th International Symposium on Software Reliability Engineering (ISSRE), 2014.

Reference data sources to be used during the master's:
- Links to the data source (also available in the first reference): https://vulnerabilitydataset.dei.uc.pt and https://github.com/JoaoRafaelHenriques/CVEDetailsScrapeDataset

Co-supervised by Professor João Campos (jrcampos@dei.uc.pt)

Supervisor

José Alexandre D'Abruzzo Pereira
josep@dei.uc.pt