Submitted Proposals

DEI - FCTUC
Generated on 2025-06-25 12:12:02 (Europe/Lisbon).

Internship Title

Development of Algorithms for Detecting Similar Texts and Goal-Oriented Text Generation

Internship Location

Rua Dom João Castro n.12, 3030-384 Coimbra, Portugal

Background

With the rapid growth of textual data on the internet and in private repositories, there is a significant need for automatic tools that can analyze, compare, and generate text at the semantic level.

Detecting similar texts, or texts with similar goals, is important in applications such as document clustering, content recommendation, and text summarization. The key challenge is to go beyond superficial word matching and capture the meaning of a text as well as its intent or goal.

Text generation, in turn, plays a central role in applications such as content creation, automated response systems, and interactive storytelling. An algorithm capable of generating contextualized, purpose-driven text can improve the user experience, support content creation, and automate repetitive tasks such as drafting emails or reports.

This internship aims to develop algorithms to detect semantically similar texts or texts with similar goals, and to create a text-generation system that produces high-quality content based on specific prompts or purposes.

Objective

The main objective of this master’s internship is to develop and evaluate algorithms for text similarity detection and goal-oriented text generation. Specifically, the internship covers:
1. Text Similarity Detection: Develop an algorithm to detect similar texts and identify shared goals or intentions using semantic similarity models (e.g., BERT, Sentence-BERT, or other embeddings), even when the texts are not identical but convey similar objectives or meanings.
2. Goal-Oriented Text Generation: Create an algorithm for text generation based on specific goals or prompts, ensuring the generated content is relevant, contextually appropriate, and coherent, while experimenting with transformer-based models like GPT-3, GPT-4, or T5 to meet these objectives.
3. Evaluation and Metrics: Implement evaluation methods to assess the quality of both the text similarity detection and generated text, using metrics such as precision, recall, F1-score, and semantic accuracy.
4. Prototyping and Validation: Build a prototype application or tool to demonstrate the practical application of both the text similarity detection and generation algorithms.
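
As a rough illustration of the similarity-detection objective, the sketch below compares texts by the cosine similarity of their embedding vectors. The three-dimensional vectors, the function names `cosine_similarity` and `is_similar`, and the 0.8 threshold are hypothetical placeholders introduced here for illustration. In practice the vectors would come from a sentence-embedding model such as Sentence-BERT, and the threshold would be tuned on labeled data.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def is_similar(a, b, threshold=0.8):
    """Flag two embeddings as similar when their cosine reaches the threshold.

    The default threshold is an assumption, not a recommended value.
    """
    return cosine_similarity(a, b) >= threshold


# Toy, hand-written "embeddings" (hypothetical values, not model output).
reset_password = [0.9, 0.1, 0.2]   # "How do I reset my password?"
recover_account = [0.8, 0.2, 0.3]  # "I cannot log in to my account"
order_pizza = [0.1, 0.9, 0.1]      # "I would like to order a pizza"

print(is_similar(reset_password, recover_account))
print(is_similar(reset_password, order_pizza))
```

With these toy vectors the two account-related texts are flagged as similar and the unrelated one is not, even though the account-related texts share few words.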

Work Plan - Semester 1

1. Literature Review: Review the state of the art in text similarity detection, semantic matching, and text generation models (e.g., BERT, T5, GPT). Understand various semantic representation models and methods for text comparison.
2. Data Collection: Select or create datasets of goal-oriented texts, such as task descriptions, FAQs, or other domain-relevant data (e.g., customer support tickets or project goals).
3. Experimentation with Pre-trained Models: Test and compare pre-trained NLP models for detecting text similarity (e.g., Sentence-BERT, Universal Sentence Encoder). Develop initial prototypes for goal-oriented text generation using pre-trained transformer-based generative models (e.g., GPT-3 or T5).
4. Fine-Tuning: Where applicable, begin fine-tuning models for domain-specific tasks (e.g., fine-tune GPT or T5 on the collected dataset to generate domain-relevant text).
5. Write the Intermediate Thesis: Document the research findings, methodology, initial experiments, and the proposed solution.

Work Plan - Semester 2

1. Refinement of Algorithms: Improve the similarity detection algorithm by incorporating advanced techniques like semantic search or fine-tuning models to better capture goal-oriented intent. Enhance the text generation algorithm to make it more coherent, relevant, and specific to the user-defined goal.
2. Evaluation and Testing: Evaluate the similarity detection model using standard classification metrics (e.g., precision, recall, and F1-score computed over thresholded cosine-similarity scores). Evaluate the generated text for quality, coherence, and relevance to the goal using automatic metrics (e.g., BLEU score) and possibly human evaluation.
3. Prototyping: Develop a simple application (e.g., a web tool or API) that allows users to input texts and see similarity detection results or generate goal-oriented text.
4. Final Thesis: Complete and submit the final thesis with comprehensive details on methodology, experiments, results, and potential applications of the developed algorithms.
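
The evaluation step for similarity detection could be sketched as follows: similarity scores for labeled text pairs are thresholded into predictions, which are then scored with precision, recall, and F1. The scores, labels, the 0.8 threshold, and the function name `precision_recall_f1` are hypothetical values chosen for illustration, not results from this project.

```python
def precision_recall_f1(predicted, actual):
    """Compute precision, recall, and F1 for two equal-length boolean lists.

    `predicted[i]` is the model's decision for pair i and `actual[i]` is the
    ground-truth label. Undefined ratios are reported as 0.0.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)       # true positives
    fp = sum(1 for p, a in pairs if p and not a)   # false positives
    fn = sum(1 for p, a in pairs if not p and a)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical cosine-similarity scores for five text pairs, with
# ground-truth labels saying whether each pair really shares a goal.
scores = [0.95, 0.85, 0.40, 0.82, 0.30]
labels = [True, True, True, False, False]

threshold = 0.8  # assumed cut-off; would be tuned on a validation set
predictions = [score >= threshold for score in scores]

precision, recall, f1 = precision_recall_f1(predictions, labels)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Sweeping the threshold over a range of values and repeating this computation is one simple way to see the trade-off between precision and recall before fixing a cut-off.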

Conditions

The intern will be given everything needed to carry out the planned tasks and will be integrated into the research and development teams working on the European research projects in which OneSource is involved.

Supervisor

Luís Miguel Batista Rosa
luis.rosa@onesource.pt