Internship Title
Investigating LLMs for the verification of coding quality rules
Areas of Expertise
Software Engineering
Internship Location
DEI-SSE
Background
Coding conventions are guidelines for software development that impose constraints on how source code is written in a given programming language. In this project we are mostly interested in rules that enforce quality properties such as security and performance. In general, adherence to precise coding rules helps avoid introducing known bugs, and it is a fundamental practice for ensuring the reliability of complex software systems. Coding quality rules exist for different purposes; the SEI CERT Oracle Coding Standard for Java is a good example of a coding standard that focuses on security [1].
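As a concrete illustration, the short Java sketch below violates the CERT rule ERR08-J ("Do not catch NullPointerException or any of its ancestors"); the User class and method names are invented for the example.

    // Illustrative example of a violation of SEI CERT Java rule ERR08-J.
    class User {
        private final String email;
        User(String email) { this.email = email; }
        String getEmail() { return email; }
    }

    public class UserLookup {
        // Non-compliant: catching NullPointerException hides a missing null check.
        public String getEmail(User user) {
            try {
                return user.getEmail();
            } catch (NullPointerException e) {
                return "";
            }
        }

        // Compliant alternative: validate the reference explicitly.
        public String getEmailOrDefault(User user) {
            return (user == null) ? "" : user.getEmail();
        }
    }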
Some rules can be checked by static analysis tools (SATs), such as PMD [3] or SonarQube [4]. However, such tools support only a limited number of rules. Furthermore, coding conventions are not static artifacts: they evolve over time, following the introduction of new language features or the discovery of new vulnerabilities.
The term Large Language Models (LLMs) [2] was recently coined in the field of machine learning applied to natural language processing. It identifies machine learning models with a large number of parameters, typically on the scale of billions. These models have shown impressive performance on different tasks related to the manipulation of text, including the generation and understanding of both natural language and source code. ChatGPT is one of the most prominent applications of LLMs.
Research on the use of LLMs for software engineering tasks is emerging, for example for code refactoring or the writing of test cases. The objective of this project is to investigate the use of LLMs for the verification of coding rules. The idea is to exploit the generalization abilities of LLMs, and their capacity to handle textual data, to automate the process of checking whether a given piece of source code satisfies a coding rule specified in natural language.
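As a sketch of the intended workflow, the Java fragment below frames rule checking as a natural-language prompt to a model; the LlmClient interface and the prompt wording are assumptions made for illustration, not the API of any existing library.

    public class RuleChecker {

        // Placeholder for whatever LLM backend the project eventually selects.
        interface LlmClient {
            String complete(String prompt);
        }

        private final LlmClient llm;

        public RuleChecker(LlmClient llm) {
            this.llm = llm;
        }

        // Pairs a rule stated in natural language with a code snippet and
        // asks the model for a YES/NO verdict.
        public boolean violatesRule(String ruleText, String codeSnippet) {
            String prompt = "Coding rule: " + ruleText + "\n\n"
                    + "Source code:\n" + codeSnippet + "\n\n"
                    + "Does this code violate the rule? Answer YES or NO.";
            String answer = llm.complete(prompt);
            return answer != null && answer.trim().toUpperCase().startsWith("YES");
        }
    }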
The main objective of this research is to understand whether LLMs can be used as a tool to verify coding conventions in source code. In addition, a comparison with SATs should be performed, to understand which technique performs best in the software development life cycle (SDLC).
Objective
The primary learning objectives of this research are as follows:
• Understand the use of machine learning for the verification of coding rules.
• Gain practical knowledge of using open-source LLMs to identify coding rule violations.
• Acquire hands-on experience in evaluating the performance of ML algorithms for identifying coding rule violations, and compare the results with those of static analysis tools (see the evaluation sketch after this list).
• Use software vulnerability datasets as input to LLMs to identify coding rule violations.
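A minimal Java sketch of the evaluation step mentioned above, scoring checker verdicts against a labeled dataset with precision and recall; the Sample record and the hard-coded results are illustrative assumptions.

    import java.util.List;

    public class Evaluation {

        // One labeled sample: what the checker predicted vs. the ground truth.
        record Sample(boolean predictedViolation, boolean actualViolation) {}

        public static void main(String[] args) {
            // Stand-in data; in the project this would come from the dataset.
            List<Sample> results = List.of(
                    new Sample(true, true),    // true positive
                    new Sample(true, false),   // false positive
                    new Sample(false, true),   // false negative
                    new Sample(false, false)); // true negative

            long tp = results.stream().filter(s -> s.predictedViolation() && s.actualViolation()).count();
            long fp = results.stream().filter(s -> s.predictedViolation() && !s.actualViolation()).count();
            long fn = results.stream().filter(s -> !s.predictedViolation() && s.actualViolation()).count();

            double precision = tp / (double) (tp + fp);
            double recall = tp / (double) (tp + fn);
            System.out.printf("precision=%.2f recall=%.2f%n", precision, recall);
        }
    }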
The long-term research objective linked to this activity is to build a framework that can automate the implementation of checkers for new coding rules specified in natural language.
Work Plan - Semester 1
T1. [09/09/2024 to 15/10/2024] Literature Review.
During this initial phase, an extensive literature review will be conducted to understand the state of the art regarding the use of machine learning for the verification of coding rules.
T2. [16/10/2024 to 30/11/2024] Tool Setup and Preliminary Evaluation.
Select the available datasets and set up the experiments with open-source LLMs. Preliminary results should be analyzed and compared with those of static analysis tools.
T3. [01/12/2024 to 10/01/2025] Write the intermediate report.
Work Plan - Semester 2
T4. [11/01/2025 to 28/02/2025] Experiments with open-source LLMs and hyperparameter tuning.
Perform experiments with different LLMs, taking into account the idiosyncrasies of each project and of the programming language.
T5. [01/03/2025 to 30/04/2025] Use the output of LLMs for configuring Static Analysis Tools (SATs).
Use the output of LLMs to create rules for specific SATs, or to filter SAT results in order to improve the performance of SATs.
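A minimal Java sketch of the filtering idea, assuming a hypothetical Finding record and LlmClient interface; it does not reflect the actual APIs of PMD, SonarQube, or any LLM library.

    import java.util.List;
    import java.util.stream.Collectors;

    public class WarningFilter {

        // Hypothetical representation of one SAT warning.
        record Finding(String file, int line, String ruleId, String snippet) {}

        // Hypothetical LLM backend that double-checks a reported violation.
        interface LlmClient {
            boolean confirmsViolation(String ruleId, String snippet);
        }

        // Keeps only the warnings that the model also classifies as violations,
        // dropping likely false positives.
        static List<Finding> filter(List<Finding> satFindings, LlmClient llm) {
            return satFindings.stream()
                    .filter(f -> llm.confirmsViolation(f.ruleId(), f.snippet()))
                    .collect(Collectors.toList());
        }
    }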
T6. [01/05/2025 to 30/06/2025] Report and Documentation.
The final phase will involve documenting the research findings, methodologies, and results. A comprehensive report summarizing the research outcomes, including the configuration of LLMs, will be prepared.
Conditions
- You will have a position in the SSE Laprie Lab
- Computational infrastructure will be provided for the work
Co-supervised by Professor Leonardo Montecchi (NTNU, Norway / leonardo.montecchi@ntnu.no)
Remarks
Recommended Bibliography:
1. SEI CERT Oracle Coding Standard for Java. https://wiki.sei.cmu.edu/confluence/display/java/SEI+CERT+Oracle+Coding+Standard+for+Java
2. W. X. Zhao et al., "A Survey of Large Language Models", 2023. https://arxiv.org/abs/2303.18223
3. PMD: An extensible cross-language static code analyzer. https://pmd.github.io/
4. SonarQube. https://www.sonarsource.com/products/sonarqube/
Supervisor
José Alexandre D'Abruzzo Pereira
josep@dei.uc.pt