Internship Title
Data Curation and Analysis of a Python Vulnerability Dataset for Enhanced Reliability in Security Research
Areas of Specialization
Software Engineering
Internship Location
DEI
Background
The effectiveness of data-driven approaches in software security, particularly machine learning models for vulnerability detection, depends heavily on the quality and reliability of the underlying datasets. Existing vulnerability datasets often suffer from inconsistencies, inaccuracies, missing information, and labeling errors. Moreover, classifying and characterizing these vulnerabilities is a non-trivial task.
This research project focuses on curating a dataset of Python code vulnerabilities, specifically the VAITP dataset, available at https://netpack.pt/vaitp/dataset/. All vulnerabilities in this dataset are characterized manually, a process that is both error-prone and labor-intensive.
The work involves a meticulous process of data review, cleansing, correction, and validation to improve the dataset’s integrity and usefulness to the research community. Additionally, the project explores the application of AI models for the automatic classification of newly identified vulnerabilities.
Objective
Perform a systematic search and selection of relevant Python vulnerability datasets. The work relies primarily on our VAITP dataset, available at https://netpack.pt/vaitp/dataset/, which currently includes 1,236 Python vulnerabilities, but this step may also add new vulnerabilities to it.
Develop and apply rigorous data cleaning methodologies to identify and rectify inconsistencies, duplicates, and missing entries.
Conduct thorough data validation, potentially involving manual inspection and cross-referencing with established vulnerability databases (e.g., CVE), to verify the correctness of vulnerability types, locations, and associated code snippets. The data curation process, including the methodologies employed and the decisions made, should be carefully documented during this phase.
Perform exploratory data analysis on the curated dataset to uncover patterns, distributions, and characteristics of Python vulnerabilities. This phase may involve leveraging AI models to classify vulnerabilities (e.g., using the Orthogonal Defect Classification (ODC) scheme).
Prepare the enhanced dataset for public release or internal use, ensuring adherence to data-sharing best practices.
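The cleaning and validation objectives above could be prototyped along the following lines. This is a minimal sketch only: the record fields (`cve_id`, `category`, `code`), the helper `audit_records`, and the sample identifiers are hypothetical and do not reflect the actual VAITP schema. The only external fact assumed is the CVE identifier format, `CVE-YYYY-NNNN`, with four or more digits in the sequence number.

```python
import re
from collections import defaultdict

# Hypothetical field names; the real VAITP schema may differ.
REQUIRED_FIELDS = ("cve_id", "category", "code")
CVE_ID_PATTERN = re.compile(r"^CVE-\d{4}-\d{4,}$")


def audit_records(records):
    """Return indices of records with missing fields, malformed CVE IDs, or duplicated code."""
    report = {"missing": [], "bad_cve_id": [], "duplicates": []}
    seen = defaultdict(list)  # normalized code -> indices of records containing it

    for index, record in enumerate(records):
        # A field counts as missing when it is absent, None, or whitespace only.
        if any(not str(record.get(field) or "").strip() for field in REQUIRED_FIELDS):
            report["missing"].append(index)

        cve_id = str(record.get("cve_id") or "").strip()
        if cve_id and not CVE_ID_PATTERN.match(cve_id):
            report["bad_cve_id"].append(index)

        # Collapse whitespace runs and trim the ends so snippets that differ
        # only in that respect compare equal.
        code = " ".join(str(record.get("code") or "").split())
        if code:
            seen[code].append(index)

    # Keep the first occurrence of each snippet; flag the later ones.
    for indices in seen.values():
        report["duplicates"].extend(indices[1:])
    report["duplicates"].sort()
    return report


# Made-up sample records, for illustration only.
sample = [
    {"cve_id": "CVE-2099-0001", "category": "injection", "code": "eval(user_input)"},
    {"cve_id": "CVE-2099-0001", "category": "injection", "code": "  eval(user_input)\n"},
    {"cve_id": "2099-0002", "category": "", "code": "os.system(cmd)"},
]
print(audit_records(sample))
```

A report of indices, rather than automatic deletion, keeps each correction a documented manual decision, which matches the requirement to record the curation process.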
Work Plan - Semester 1
Perform a systematic review and selection of relevant Python vulnerability datasets.
Develop and apply rigorous data cleaning methodologies to identify and rectify inconsistencies, duplicates, and missing entries.
Work Plan - Semester 2
Conduct thorough data validation, potentially involving manual inspection and cross-referencing with vulnerability databases (e.g., CVE), to verify the correctness of vulnerability types, locations, and associated code snippets.
Document the data curation process, including the methodologies employed and decisions made.
Perform exploratory data analysis (EDA) on the curated dataset to uncover patterns, distributions, and characteristics of Python vulnerabilities. This phase involves the use of AI models.
Prepare the enhanced dataset for public release or internal use, adhering to data-sharing best practices.
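As an illustration of the exploratory analysis step, the distribution of vulnerability labels could be tabulated as below. This is a hedged sketch: the field name `odc_type`, the helper `label_distribution`, and the sample labels are illustrative placeholders, not the actual VAITP columns or classification results.

```python
from collections import Counter


def label_distribution(records, field="odc_type"):
    """Return (label, count, share) tuples sorted by descending count.

    Records whose label is missing or empty are grouped under 'UNLABELED'.
    """
    counts = Counter((record.get(field) or "UNLABELED") for record in records)
    total = sum(counts.values())
    return [(label, count, count / total) for label, count in counts.most_common()]


# Made-up sample records, for illustration only.
sample = [
    {"odc_type": "Checking"},
    {"odc_type": "Checking"},
    {"odc_type": "Assignment"},
    {"odc_type": None},
]
for label, count, share in label_distribution(sample):
    print(f"{label:<12} {count:>3} {share:6.1%}")
```

Counting unlabeled records explicitly makes gaps in the characterization visible alongside the label distribution itself.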
Conditions
The secondary area for this proposal is Intelligent Systems; therefore, the student who takes on this proposal will also have a supervisor with expertise in AI.
Candidate Profile:
- Strong data analysis and manipulation skills.
- Proficiency in Python, particularly with data analysis libraries (e.g., Pandas, NumPy).
- Experience with data cleaning, data wrangling, and database management is advantageous.
- Meticulous attention to detail and a systematic approach to problem-solving.
- Interest in Software Security, Cybersecurity, and Data Quality assurance.
- Proficiency in English (reading and writing).
The selected student will be integrated into the Software and Systems Engineering (SSE) group of CISUC. The work will be carried out in the facilities of the Department of Informatics Engineering at the University of Coimbra (CISUC - SSE and IS Groups), where a workplace and the necessary computing resources will be provided.
Remarks
Please contact the advisors with any questions or requests for clarification.
Supervisor
Naghmeh Ivaki
naghmeh@dei.uc.pt