Internship Title
Automated Red-Teaming of LLMs through Prompt-Based Attack Simulation
Internship Location
DEI
Background
Large Language Models (LLMs) have seen rapid integration into a wide range of applications, from chatbots and code assistants to enterprise systems and cybersecurity operations. However, alongside this growth has come increased attention to their vulnerabilities—particularly their susceptibility to adversarial prompt-based attacks. These attacks can manipulate an LLM’s output through carefully crafted input text, with consequences ranging from biased or unsafe content to prompt injection, data leakage, and circumvention of usage policies.
As LLMs are integrated into critical workflows, the ability to systematically evaluate their robustness under adversarial conditions becomes essential. In conventional security practice, red-teaming refers to simulating real-world adversaries to test a system's resilience, and this concept is now being adapted to AI systems. Automated red-teaming of LLMs offers the opportunity to proactively uncover weaknesses before deployment and to iteratively improve the safety and alignment of these systems.
This thesis addresses the challenge of automating adversarial testing for LLMs using a modular and extensible framework. The focus is on prompt-based attack generation and execution, enabling structured and repeatable testing pipelines. The system will support attack strategies such as prompt injection, semantic confusion, and goal hijacking. It will evaluate how LLMs respond across tasks (e.g., summarization, classification, code generation) and usage contexts (e.g., chat agents, tool-augmented coding).
Through this work, we aim to bridge the gap between LLM development and real-world red-teaming practices by developing tools that allow LLMs to be stress-tested early and often. Additionally, this work will feed into responsible AI evaluation and hardening pipelines, offering developers insight into robustness and security attributes through empirical benchmarks.
Objectives
The main learning objectives of this thesis are:
1. Security and Red-Teaming Foundations: Understand foundational concepts in system security, adversarial testing, and red-teaming practices.
2. LLM Vulnerabilities: Explore known vulnerabilities of LLMs, including prompt injection, hallucination, and misuse.
3. Adversarial NLP Techniques: Study techniques for generating adversarial examples in the context of natural language processing and generation.
4. Framework Development: Design and implement a modular framework for automating red-teaming experiments against LLMs.
5. Experimental Research Design: Develop skills in designing, running, and analyzing rigorous experimental campaigns.
6. LLM Benchmarking: Develop evaluation metrics and protocols to quantify the robustness and reliability of LLMs under adversarial prompts.
Work Plan - Semester 1
Literature Review
Study foundational topics on LLMs, prompt engineering, red-teaming in AI, and adversarial attacks in NLP. Review existing security evaluation methods for LLMs.
[13/10/2025 to 09/11/2025] Threat Modeling and Attack Taxonomy
Define the taxonomy of prompt-based attacks (e.g., prompt injection, jailbreak, data leakage), attacker capabilities (black/gray/white-box), and threat models. Select representative LLMs and use cases for study.
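As a purely illustrative sketch, the resulting taxonomy could later be encoded as machine-readable attack descriptors that the framework iterates over; the Python class and field names below are placeholders, not the final taxonomy.

    # Illustrative sketch (Python): encoding the attack taxonomy as data.
    # All names are placeholders for the taxonomy to be defined in this phase.
    from dataclasses import dataclass
    from enum import Enum

    class AttackCategory(Enum):
        PROMPT_INJECTION = "prompt_injection"
        JAILBREAK = "jailbreak"
        DATA_LEAKAGE = "data_leakage"
        GOAL_HIJACKING = "goal_hijacking"

    class AttackerKnowledge(Enum):
        BLACK_BOX = "black-box"
        GRAY_BOX = "gray-box"
        WHITE_BOX = "white-box"

    @dataclass
    class ThreatModel:
        category: AttackCategory
        knowledge: AttackerKnowledge
        description: str

    # Example catalogue entry
    EXAMPLE = ThreatModel(
        AttackCategory.PROMPT_INJECTION,
        AttackerKnowledge.BLACK_BOX,
        "Instructions embedded in user-supplied text override the system prompt.",
    )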
[10/11/2025 to 07/12/2025] System Architecture and Design Specification
Design the modular framework structure: attack modules, evaluation engine, model interface layer. Specify requirements for automation, reproducibility, and extensibility.
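To make the intended separation of concerns concrete, a minimal sketch of the three layers is given below in Python; the interface names and method signatures are illustrative assumptions rather than a committed design.

    # Minimal architectural sketch (Python). All names are illustrative.
    from abc import ABC, abstractmethod

    class ModelInterface(ABC):
        """Model interface layer: hides provider-specific APIs behind one call."""
        @abstractmethod
        def query(self, prompt: str) -> str: ...

    class AttackModule(ABC):
        """Attack module: turns a benign task prompt into adversarial variants."""
        @abstractmethod
        def generate(self, base_prompt: str) -> list[str]: ...

    class EvaluationEngine(ABC):
        """Evaluation engine: decides whether a response constitutes a failure."""
        @abstractmethod
        def score(self, prompt: str, response: str) -> dict: ...

    def run_campaign(model: ModelInterface, attack: AttackModule,
                     evaluator: EvaluationEngine, base_prompt: str) -> list[dict]:
        """Orchestrates one attack run and returns per-prompt evaluation records."""
        results = []
        for adversarial_prompt in attack.generate(base_prompt):
            response = model.query(adversarial_prompt)
            record = evaluator.score(adversarial_prompt, response)
            record.update(prompt=adversarial_prompt, response=response)
            results.append(record)
        return results

Keeping the orchestration loop independent of any concrete attack or provider is what supports the automation, reproducibility, and extensibility requirements stated above.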
[08/12/2025 to --/01/2026] Dissertation Plan Draft
Write and submit the thesis plan detailing the scope, methods, and initial findings.
Work Plan - Semester 2
Implementation of Red-Teaming Framework
Implement core modules for prompt generation, attack simulation, model querying, and result logging. Include support for multiple LLM APIs (e.g., OpenAI, HuggingFace).
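As a sketch of how the model interface layer could absorb provider differences, the adapter below wraps the OpenAI Python client (openai>=1.0); the class name, default model, and configuration fields are assumptions, and equivalent adapters would be written for other providers such as HuggingFace.

    # Sketch of one provider adapter (Python, assuming the openai>=1.0 client).
    # Names and defaults are illustrative, not part of the final framework.
    from openai import OpenAI

    class OpenAIModel:
        """Adapter implementing the framework's model interface for OpenAI-hosted LLMs."""
        def __init__(self, model_name: str = "gpt-4o-mini", temperature: float = 0.0):
            self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
            self.model_name = model_name
            self.temperature = temperature

        def query(self, prompt: str) -> str:
            response = self.client.chat.completions.create(
                model=self.model_name,
                temperature=self.temperature,
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content or ""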
[02/03/2026 to 19/04/2026] Experimentation and Attack Campaign
Run large-scale adversarial campaigns against selected LLMs. Vary attack vectors, prompt templates, and attack models. Record model behavior and failures.
[20/04/2026 to 10/05/2026] Data Analysis and Benchmarking
Analyze experimental results to evaluate model robustness, identify vulnerabilities, and define key metrics (e.g., attack success rate, evasiveness, sensitivity).
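For instance, attack success rate (ASR) can be computed directly from the logged evaluation records; the sketch below assumes each record carries a boolean success flag set by the evaluation engine, which is an assumption about the logging format rather than a fixed design.

    # Illustrative metric computation (Python). Assumes each logged record has a
    # boolean "success" field produced by the evaluation engine.
    def attack_success_rate(records: list[dict]) -> float:
        """Fraction of adversarial prompts that elicited a policy-violating response."""
        if not records:
            return 0.0
        return sum(1 for r in records if r.get("success")) / len(records)

    # Example: 3 successful attacks out of 10 attempts -> ASR = 0.3
    print(attack_success_rate([{"success": i < 3} for i in range(10)]))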
[11/05/2026 to --/06/2026] Thesis Writing
Document methodology, framework architecture, experimental results, and conclusions. Discuss ethical concerns, limitations, and future work.
Conditions
This work takes place within the context of the AI-SSD project (2024.07660.IACDC); depending on the evolution of the internship, a studentship may be available to support its development. The work is to be carried out at the laboratories of CISUC's Software and Systems Engineering (SSE) Group and the Cyber Security Laboratory (CS-Lab).
Supervisor
Joao Campos
jrcampos@dei.uc.pt