Submitted Proposals

DEI - FCTUC
Generated on 2025-07-07 05:16:17 (Europe/Lisbon).

Internship Title

Investigating LLMs for the generation of Static Code Analysis rules

Internship Location

DEI-FCTUC

Background

Coding conventions are guidelines for software development that impose constraints on how source code is written in a given programming language. In this project we are mostly interested in rules that enforce quality properties such as security or performance. In general, adherence to precise coding rules helps avoid introducing known bugs and is a fundamental practice for ensuring the reliability of complex software systems. Coding quality rules exist for different purposes; the SEI CERT Oracle Coding Standard for Java is a good example of a coding standard that focuses on security [1].

Some rules can be checked automatically by static analysis tools (SATs), such as PMD [3] or SonarQube [4]. Each SAT implements its rules in a different language, for example as XPath expressions or as Java code. However, such tools support only a limited number of rules. Furthermore, coding conventions are not static artifacts; rather, they evolve over time, following the introduction of new language features or the discovery of new vulnerabilities.
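For illustration, below is a minimal sketch of what a custom PMD rule written in Java can look like, assuming the PMD 6 API (class and method names differ in PMD 7, and the rule itself is an illustrative example). It flags direct calls to System.exit():

    import net.sourceforge.pmd.lang.java.ast.ASTName;
    import net.sourceforge.pmd.lang.java.rule.AbstractJavaRule;

    // Sketch of a custom PMD rule (PMD 6 API; names differ in PMD 7):
    // flag direct calls to System.exit().
    public class AvoidSystemExitRule extends AbstractJavaRule {

        @Override
        public Object visit(ASTName node, Object data) {
            // In the PMD 6 Java AST, a call such as System.exit(0)
            // contains an ASTName node whose image is "System.exit".
            if ("System.exit".equals(node.getImage())) {
                addViolation(data, node); // report a violation at this node
            }
            return super.visit(node, data);
        }
    }

An equivalent check could also be written as an XPath query over the program's AST; rule implementations of this kind are precisely the artifacts this project aims to generate automatically.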

Large Language Model (LLM) [2] is a recently coined term in the field of machine learning applied to natural language processing. The term identifies machine learning models with a large number of parameters, typically on the scale of billions. These models have shown impressive performance in different text-manipulation tasks, including generation and understanding of both natural language and source code. Examples of LLMs include ChatGPT (GPT-4), Llama, Gemini, DeepSeek, and Perplexity.

Research on the use of LLMs for software engineering tasks is emerging, for example for code refactoring or for writing test cases. The objective of this project is to investigate the use of LLMs for generating SAT rules from textual coding rules. The idea is to exploit the generalization abilities of LLMs and their capacity to handle textual data to automate the generation of rules, to be used in SATs, that check whether a piece of source code satisfies a coding rule specified in natural language.
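As a concrete illustration of this idea, the sketch below turns a coding rule stated in natural language into a prompt asking an LLM to produce an equivalent PMD XPath rule. The rule text and the prompt wording are illustrative assumptions; the call to an actual LLM endpoint is intentionally left out:

    // Sketch: build an LLM prompt from a natural-language coding rule.
    public class RulePromptBuilder {

        static String buildPrompt(String naturalLanguageRule) {
            return String.join("\n",
                "You are an expert in static code analysis for Java.",
                "Write a PMD XPath rule that detects violations of the",
                "following coding convention:",
                "",
                naturalLanguageRule,
                "",
                "Answer only with the XML <rule> element, with no explanation.");
        }

        public static void main(String[] args) {
            // Illustrative coding rule stated in natural language.
            String rule = "Do not call System.exit() outside the main method.";
            System.out.println(buildPrompt(rule)); // send to the chosen LLM
        }
    }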

The main objective of this research is to understand whether LLMs can be used as a tool to automate the generation of OCLs to verify coding conventions in source code. In addition, a comparison with using LLMs directly as checkers should be performed, to understand which technique performs best within the software development life cycle (SDLC).

Objective

The primary learning objectives of this research are as follows:

• Understand the use of LLMs for the generation of domain-specific languages (DSLs).
• Gain practical knowledge of using open-source LLMs to generate coding rules.
• Acquire hands-on experience in evaluating the performance of rules generated by LLMs, and compare it with the LLMs' ability to identify coding rule violations directly (a sketch of this evaluation is given below).
• Use software vulnerability datasets as input to LLMs to identify coding rules.
The long-term research objective linked to this activity is to build a framework that can automate the implementation of checkers for new coding rules specified in natural language.
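As an illustration of the evaluation mentioned in the objectives, the following minimal sketch (file names and violation locations are illustrative) compares the warnings produced by a generated rule against a ground truth of known violations, deriving precision and recall from the overlap:

    import java.util.Set;

    // Sketch: score an LLM-generated rule against labeled ground truth.
    public class RuleEvaluator {

        public static void main(String[] args) {
            // Known violation locations from a labeled dataset.
            Set<String> groundTruth = Set.of("Foo.java:12", "Foo.java:40", "Bar.java:7");
            // Locations flagged by the generated rule under evaluation.
            Set<String> flagged = Set.of("Foo.java:12", "Bar.java:7", "Bar.java:99");

            long truePositives = flagged.stream().filter(groundTruth::contains).count();
            double precision = (double) truePositives / flagged.size();
            double recall = (double) truePositives / groundTruth.size();

            System.out.printf("precision=%.2f recall=%.2f%n", precision, recall);
            // prints: precision=0.67 recall=0.67
        }
    }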

Work Plan - Semester 1

T1. [09/09/2025 to 15/10/2025] Literature Review.
During this initial phase, an extensive literature review will be conducted to understand the state of the art on the use of LLMs for the generation of DSLs.


T2. [16/10/2025 to 30/11/2025] Tool Setup and Preliminary Evaluation.
Select the available datasets and set up the experiments with open-source LLMs. Preliminary results should be analyzed and compared with rules written by human experts.


T3. [01/12/2025 to 10/01/2026] Write the intermediate report.

Work Plan - Semester 2

T4. [11/01/2026 to 28/02/2026] Experiments with open-source LLMs and fine-tuning of hyperparameters.
Perform experiments with different LLMs, considering the idiosyncrasies of each project and of the programming language, and fine-tune the LLMs accordingly.


T5. [01/03/2026 to 30/04/2026] Use the output of LLMs for configuring Static Analysis Tools (SATs).
Use the rules generated by the LLMs to configure specific SATs and analyze the results.


T6. [01/05/2026 to 30/06/2026] Report and Documentation.
The final phase will involve documenting the research findings, methodologies, and results. A comprehensive report summarizing the research outcomes, including the set of OCLs generated by the LLMs, will be prepared.

Conditions

- You will have a position in the SSE Laprie Lab
- The proposal is in the scope of the CSLab (Cybersecurity Laboratory)
- Computational infrastructure will be provided for the work

Remarks

Recommended Bibliography:
1. SEI CERT Oracle Coding Standard for Java. https://wiki.sei.cmu.edu/confluence/display/java/SEI+CERT+Oracle+Coding+Standard+for+Java
2. W. X. Zhao et al., "A Survey of Large Language Models", 2023. https://arxiv.org/abs/2303.18223
3. PMD: An extensible cross-language static code analyzer. https://pmd.github.io/
4. SonarQube. https://www.sonarsource.com/products/sonarqube/

This work is co-supervised by Leonardo Montecchi (Norwegian University of Science and Technology - NTNU).

Supervisor

José Alexandre D'Abruzzo Pereira
josep@dei.uc.pt