Proposals with students

Generated on 2025-07-18 23:32:18 (Europe/Lisbon).

Internship Title

On the use of Deep Graph Convolution Neural Networks (DGCNN) to Detect Software Vulnerabilities

Internship Location

DEI-FCTUC

Background

Software vulnerabilities are a major concern in software development. When exploited, they can lead to consequences such as unauthorized access, data loss, and financial loss, among others. Current techniques detect vulnerabilities either by analyzing the source code (static techniques) or by executing the software (dynamic techniques) (B. Liu, L. Shi, Z. Cai, and M. Li, “Software Vulnerability Discovery Techniques: A Survey,” in 2012 Fourth International Conference on Multimedia Information Networking and Security, Nov 2012). However, these techniques are not fully effective, as they cannot reveal all vulnerabilities. Additionally, the alerts they raise may correspond to actual vulnerabilities or to false alarms (false positives).

A previous study created a dataset of C/C++ projects with information about their vulnerabilities and software metrics computed from the source code (J. D. Pereira, J. H. Antunes and M. Vieira, "A Software Vulnerability Dataset of Large Open Source C/C++ Projects," PRDC 2022) to support the development of software vulnerability detection techniques. One way of identifying vulnerabilities through source code analysis is to use software metrics and other static properties extracted from the source code (José D'Abruzzo Pereira, Nuno Lourenço, and Marco Vieira. 2023. “On the Use of Deep Graph CNN to Detect Vulnerable C Functions”. LADC 2022). Previous studies extracted properties from the Control Flow Graphs (CFG) of functions to feed a Deep Graph Convolutional Neural Network (DGCNN), whose goal was to predict the presence of software vulnerabilities in those functions. However, such studies are limited in the number of evaluated projects and in the features extracted from the source code.

Through this research, we aim to evaluate the use of DGCNN to detect vulnerable code units (e.g., functions). To do so, features should be extracted from more projects, and new features should be defined and extracted. Classical static and dynamic techniques can be explored to improve the performance metrics (e.g., precision, recall, F-Measure) of the suggested approach. The outcomes of this research would help to understand whether an approach based on DGCNN can be used to detect vulnerable code units.
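
As an illustration of the performance metrics mentioned above, the following minimal sketch computes precision, recall, and F-Measure (F1) for binary predictions over code units. The function name and the example labels are hypothetical and are not taken from the referenced studies.

```python
def precision_recall_f1(true_labels, predicted_labels):
    """Compute precision, recall and F1 for binary labels (1 = vulnerable)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical labels for six functions (1 = vulnerable, 0 = not vulnerable).
truth = [1, 1, 1, 0, 0, 0]
preds = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(truth, preds)
```

A false positive lowers precision, while a missed vulnerability lowers recall. F1 is the harmonic mean of the two.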

Objective

The primary learning objectives of this research are as follows:

• Gain practical knowledge of static techniques to detect vulnerable code units.
• Acquire hands-on experience in evaluating the accuracy and efficiency of software vulnerability detection approaches.
• Develop practical skills in using DGCNN techniques to detect vulnerable code units, including the definition and extraction of features from the source code.
• Explore practical strategies and techniques to improve the performance metrics based on the use of classical static and dynamic techniques.

Work Plan - Semester 1

T1. [09/09/2024 to 30/09/2024] Literature Review and Tool Familiarization.
During this initial phase, an extensive literature review will be conducted to understand the existing DGCNN-based techniques to detect vulnerable code units, their capabilities, and their limitations. Other Machine Learning (ML) approaches can also be considered. Additionally, a hands-on evaluation of already available mechanisms will be performed.

T2. [01/10/2024 to 31/10/2024] Extraction of Features for Other Projects.
Features should be extracted for other projects from the dataset. These projects should be written in the same programming language as the project used in the initial evaluation, whose results were presented at LADC 2022.

T3. [01/11/2024 to 30/11/2024] Definition and Extraction of New Features for the Projects of the Dataset.
Additional features should be defined and extracted from the source code. These features can be extracted from structures such as the Abstract Syntax Tree (AST), Control Flow Graph (CFG), Program Dependence Graph (PDG), and Code Property Graph (CPG), the latter combining elements of ASTs, CFGs, and PDGs. Such features should consider either the structure of the functions or properties that indicate specific types of vulnerabilities. For example, input validation features can help identify vulnerabilities that result from improper or missing input validation.
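
To make the idea of structural features more concrete, the sketch below derives a few simple properties from a CFG stored as a dictionary of successors. The representation, the function name, and the example graph are assumptions made for illustration only. They are not the features used in the LADC 2022 study, and a real pipeline would obtain the CFG from a code analysis tool.

```python
def cfg_features(cfg):
    """Return simple structural features of a CFG given as {node: [successors]}."""
    nodes = set(cfg)
    for successors in cfg.values():
        nodes.update(successors)
    edges = [(src, dst) for src, succs in cfg.items() for dst in succs]
    out_degree = {n: 0 for n in nodes}
    in_degree = {n: 0 for n in nodes}
    for src, dst in edges:
        out_degree[src] += 1
        in_degree[dst] += 1
    return {
        "num_nodes": len(nodes),
        "num_edges": len(edges),
        # McCabe's cyclomatic complexity for a single connected CFG: E - N + 2.
        "cyclomatic_complexity": len(edges) - len(nodes) + 2,
        # Nodes with more than one successor (e.g., conditions and loop heads).
        "branch_nodes": sum(1 for n in nodes if out_degree[n] > 1),
        "in_degree": in_degree,
        "out_degree": out_degree,
    }


# Hypothetical CFG of a function with one if/else followed by one loop.
example_cfg = {
    "entry": ["check"],
    "check": ["then", "else"],
    "then": ["loop"],
    "else": ["loop"],
    "loop": ["body", "exit"],
    "body": ["loop"],
    "exit": [],
}
features = cfg_features(example_cfg)
```

Per-node values such as the in-degree and out-degree could serve as node attributes for a graph network, while graph-level values such as the cyclomatic complexity resemble classical software metrics.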

T4. [01/12/2024 to 10/01/2025] Write the intermediate report.

Work Plan - Semester 2

T5. [11/01/2025 to 28/02/2025] Creation of networks (e.g., DGCNN) to predict vulnerable code units.
In this hands-on phase, other networks will be defined, or the network currently in use will be adjusted.
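
For orientation, the sketch below shows two building blocks commonly associated with the DGCNN architecture: a graph convolution step of the form tanh(D^-1 (A + I) X W), and a SortPooling step that orders nodes by their last feature channel and keeps a fixed number of rows. It is a plain-Python illustration under these assumptions, with hypothetical names and toy values. It is not the network used in the referenced studies, and a practical implementation would rely on a deep learning framework.

```python
import math


def graph_conv(adjacency, features, weights):
    """One propagation step: tanh(D^-1 (A + I) X W), using plain lists."""
    n = len(adjacency)
    # Add self-loops so that each node keeps its own features.
    a_tilde = [[adjacency[i][j] + (1 if i == j else 0) for j in range(n)]
               for i in range(n)]
    in_dim, out_dim = len(features[0]), len(weights[0])
    output = []
    for i in range(n):
        degree = sum(a_tilde[i])
        # Average the features of node i and of its neighbours.
        aggregated = [
            sum(a_tilde[i][j] * features[j][c] for j in range(n)) / degree
            for c in range(in_dim)
        ]
        # Linear projection followed by the tanh non-linearity.
        output.append([
            math.tanh(sum(aggregated[c] * weights[c][o] for c in range(in_dim)))
            for o in range(out_dim)
        ])
    return output


def sort_pooling(node_embeddings, k):
    """Sort nodes by their last channel (descending) and return exactly k rows."""
    ordered = sorted(node_embeddings, key=lambda row: row[-1], reverse=True)
    width = len(node_embeddings[0])
    # Truncate to k rows, or pad with zero rows when the graph is smaller than k.
    padding = [[0.0] * width for _ in range(k - len(ordered))]
    return ordered[:k] + padding


# Toy graph with three nodes in a path (0 - 1 - 2) and one feature per node.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
node_features = [[1.0], [2.0], [3.0]]
weights = [[1.0, 0.5]]  # hypothetical 1 x 2 weight matrix
embeddings = graph_conv(adjacency, node_features, weights)
pooled = sort_pooling(embeddings, k=2)
```

Because SortPooling always returns k rows, graphs of different sizes are mapped to tensors of the same shape, which later layers can process.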

T6. [01/03/2025 to 31/03/2025] Optimization of the Hyperparameters of the DGCNN.
Using the networks created in the previous activity, the hyperparameters should be optimized to obtain better performance metrics for the models.
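
One simple way to approach this task is an exhaustive grid search, sketched below. The hyperparameter names, their candidate values, and the scoring function are all hypothetical. In particular, `placeholder_evaluate` is a stand-in that would be replaced by code that trains a model and returns a validation metric such as the F-Measure.

```python
from itertools import product


def grid_search(search_space, evaluate):
    """Try every combination in search_space and return the best one."""
    names = list(search_space)
    best_config, best_score = None, float("-inf")
    for values in product(*(search_space[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Hypothetical search space; names and values are illustrative only.
search_space = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "sort_pooling_k": [16, 32, 64],
    "dropout": [0.3, 0.5],
}


def placeholder_evaluate(config):
    """Stand-in for training a model and returning its validation score.

    This toy score peaks at learning_rate=1e-3, sort_pooling_k=32, dropout=0.5.
    """
    return (
        -abs(config["learning_rate"] - 1e-3)
        - abs(config["sort_pooling_k"] - 32) / 100
        + config["dropout"] / 10
    )


best_config, best_score = grid_search(search_space, placeholder_evaluate)
```

Grid search grows quickly with the number of hyperparameters, so random search or Bayesian optimization may be preferable when each training run is expensive.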

T7. [01/04/2025 to 30/04/2025] Use of Classical Static and Dynamic Techniques to Improve the Performance.
For a reduced number of samples (e.g., code units) of the dataset, classical detection techniques, such as static and dynamic analysis, will be used to improve the performance of the model. The main goal of this task is to reduce the number of false positives.
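
One possible way to combine both sources of evidence, shown here only as a sketch, is to keep a positive prediction of the model when it is either very confident or corroborated by an alert from a classical tool. The thresholds, function names, and data below are hypothetical, and the proposal does not prescribe this particular strategy.

```python
def confirm_predictions(model_scores, analyzer_alerts,
                        flag_threshold=0.5, trust_threshold=0.9):
    """Keep a positive prediction only if it is very confident or corroborated.

    model_scores: {function_name: predicted probability of being vulnerable}
    analyzer_alerts: set of function names flagged by a classical tool
    """
    confirmed = set()
    for name, score in model_scores.items():
        if score < flag_threshold:
            continue  # The model does not flag this function.
        if score >= trust_threshold or name in analyzer_alerts:
            confirmed.add(name)
    return confirmed


# Hypothetical model scores and analyzer alerts for four functions.
model_scores = {
    "parse_header": 0.95,
    "copy_buffer": 0.70,
    "log_message": 0.65,
    "init_state": 0.10,
}
analyzer_alerts = {"copy_buffer", "init_state"}
confirmed = confirm_predictions(model_scores, analyzer_alerts)
```

In this toy example, `log_message` is dropped because the model is only moderately confident and no tool corroborates it. Such a filter trades some recall for precision, so both metrics should be measured before and after applying it.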

T8. [01/05/2025 to 30/06/2025] Report and Documentation.
The final phase will involve documenting the research findings, methodologies, and practical recommendations. A comprehensive report summarizing the research outcomes, including the evaluation of techniques and the optimization of hyperparameters, will be prepared. The report will also include hands-on guidelines that software development teams can apply in real-world scenarios to avoid introducing vulnerabilities when writing source code.

Conditions

- You will have a position in the SSE Laprie Lab
- Computational infrastructure will be provided for the work

Remarks

Recommended Bibliography:
- José D'Abruzzo Pereira, Nuno Lourenço, and Marco Vieira. 2023. On the Use of Deep Graph CNN to Detect Vulnerable C Functions. In Proceedings of the 11th Latin-American Symposium on Dependable Computing (LADC '22). Association for Computing Machinery, New York, NY, USA, 45–50. https://doi.org/10.1145/3569902.3569913
- Jiaqi Yan, Guanhua Yan, and Dong Jin. 2019. Classifying Malware Represented as Control Flow Graphs using Deep Graph Convolutional Neural Network. In 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). 52–63. https://doi.org/10.1109/DSN.2019.00020
- C. D. Xuan. 2023. A new approach to software vulnerability detection based on CPG analysis. Cogent Engineering 10, 1. https://doi.org/10.1080/23311916.2023.2221962

Supervisor

José Alexandre D'Abruzzo Pereira
josep@dei.uc.pt