Assigned Proposals 2024/2025

DEI - FCTUC
Generated on 2024-11-24 01:56:44 (Europe/Lisbon).

Internship Title

Exploring and Comparing Low-cost Alternatives to ChatGPT for Text Classification and Information Extraction

Areas of Specialisation

Sistemas Inteligentes

Internship Location

DEI / CISUC

Background

The release of ChatGPT marked a major shift in the adoption of Artificial Intelligence (AI) in society. Large Language Models (LLMs) are now used extensively, not only by individuals but also by companies that incorporate them into their applications, replacing older processes and creating new LLM-based applications.

LLMs are a product of Natural Language Processing (NLP) and excel in many tasks in this area, including Text Classification, where documents are assigned a category from a predefined set to support better indexing, and Information Extraction (IE), where structured information is extracted from documents, ready for database population. These tasks can be performed with zero-shot prompting, based exclusively on a natural language description of the task; with few-shot prompting, which also includes a small set of examples; or by fine-tuning the model on a larger dataset.
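As an illustration of the difference between zero-shot and few-shot prompting, the following is a minimal sketch of how a classification prompt could be assembled. The function names `build_prompt` and `parse_label`, the prompt wording, and the example labels are assumptions made for illustration; they are not part of the proposal or of any specific library.

```python
from typing import Optional, Sequence, Tuple


def build_prompt(
    text: str,
    labels: Sequence[str],
    examples: Optional[Sequence[Tuple[str, str]]] = None,
) -> str:
    """Build a classification prompt (hypothetical template).

    With no examples the prompt is zero-shot (task description only);
    with labelled examples it becomes few-shot.
    """
    lines = [
        "Classify the text into one of these categories: "
        + ", ".join(labels) + ".",
        "",
    ]
    # Few-shot part: each labelled example is shown before the target text.
    for example_text, example_label in examples or []:
        lines += [f"Text: {example_text}", f"Category: {example_label}", ""]
    # The target text comes last, leaving the category for the model to fill.
    lines += [f"Text: {text}", "Category:"]
    return "\n".join(lines)


def parse_label(completion: str, labels: Sequence[str]) -> Optional[str]:
    """Map a free-text model completion to one of the allowed labels.

    Returns None when the completion does not start with a known label.
    """
    answer = completion.strip().lower()
    for label in labels:
        if answer.startswith(label.lower()):
            return label
    return None
```

The resulting string can be sent to any text-generation model, and `parse_label` then maps the completion back to the predefined category set. The prefix match is deliberately simple and would need refinement if one label is a prefix of another.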

However, ChatGPT and GPT-4, the underlying model, are proprietary solutions, about which much is unknown (e.g., training data, number of parameters, additional layers), and have an associated utilisation cost. Additionally, GPT-4's impressive capabilities come with a high computational demand, resulting in a substantial ecological footprint.

On the other hand, there are smaller, open-source LLMs based on the same architecture (Transformer) that can be used for similar tasks. Most of these models are available from the Hugging Face Hub (https://huggingface.co/), which also provides open-source libraries that make using and fine-tuning these models straightforward. Among others, the following models are available from this hub: T5, Llama-3, Pythia, Mistral, Phi-3, Gemma and Falcon-2.

Objective

To reach well-supported conclusions on open and less expensive alternatives to proprietary LLMs, this dissertation will study and test solutions based on open-source LLMs in the tasks of Text Classification (TC) and Information Extraction (IE). Proposed solutions may rely on prompt-based strategies or on model fine-tuning. Their performance should be compared on popular datasets for TC (e.g., https://huggingface.co/datasets/valurank/News_Articles_Categorization, https://huggingface.co/datasets/go_emotions) and IE (https://huggingface.co/datasets/ProfessorBob/relation_extraction, https://github.com/NLP-CISUC/QA-4-Domain-Agnostic-IE-Dataset), most of which are also open and available from the Hugging Face Hub.

After reviewing the landscape of LLMs, experiments will be conducted to draw conclusions on the suitability of open solutions compared to proprietary ones. Besides (i) the actual performance of a set of open LLMs on the target tasks, measured on selected datasets, conclusions should also consider (ii) the consistency of results, (iii) monetary and computational cost, and (iv) the level of control.
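Aspects (i) and (ii) could be quantified in several ways. The sketch below shows one possibility, assuming macro-averaged F1 as the performance measure and majority agreement across repeated runs as the consistency measure. Both choices, as well as the function names `macro_f1` and `consistency`, are assumptions made for illustration and are not prescribed by the proposal.

```python
from collections import Counter
from typing import Sequence


def macro_f1(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over the classes present in gold."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def consistency(runs: Sequence[Sequence[str]]) -> float:
    """Mean share of runs that agree with the majority answer, per item.

    `runs` holds the predictions of the same model on the same items,
    one inner sequence per repeated run.
    """
    per_item = []
    for answers in zip(*runs):
        majority_count = Counter(answers).most_common(1)[0][1]
        per_item.append(majority_count / len(answers))
    return sum(per_item) / len(per_item)
```

A consistency of 1.0 means every run produced the same answer for every item; lower values indicate that repeated prompting of the same model yields diverging outputs.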

In addition, strategies for increasing control over the results and their trustworthiness should be considered. Among other possibilities, this may include experiments with the injection of knowledge from external sources, model alignment, or attempts to explain the obtained results.

Work Plan - Semester 1

* Literature review;
* Identification of benchmarks for TC and IE;
* Identification of LLMs to test, covering different types and sizes;
* Familiarisation with LLMs through preliminary experimentation;
* Writing of the intermediate report.

Work Plan - Semester 2

* Experimentation with prompt-based approaches for TC;
* Experimentation with prompt-based approaches for IE;
* Planning and possible fine-tuning of some of the models;
* Evaluation of target aspects;
* Exploration of strategies for increased control and trustworthiness;
* Writing of the Master Thesis.

Conditions

The workplace will be a CISUC laboratory, where there will be regular communication with the supervisors. Since the dissertation will contribute to a funded research project, the selected student may apply for a research grant of 990€ / month, with a duration of 6 to 9 months. Within the scope of the project, hardware is available for more computationally demanding experiments.

Supervisors

Hugo Oliveira, Catarina Silva, Bruno Ferreira
hroliv@dei.uc.pt