Assigned proposals, academic year 2025/2026

DEI - FCTUC
Generated on 2025-08-31 01:42:22 (Europe/Lisbon).

Internship Title

Detecting Entities from the Web with Conditional Random Fields

Technology Area

Artificial Intelligence

Internship Location

DEI

Background

Named Entity Recognition (NER) is the task of identifying and classifying names in text. Supervised machine learning techniques (e.g., Conditional Random Fields and Conditional Markov Models), together with some heuristics, are used to detect place, person and organization names. NER involves two tasks: first, identifying proper names in text, and second, classifying these names into a set of predefined categories of interest, such as person names, organizations (companies, government organizations, committees, etc.), locations (cities, countries, rivers, etc.), and date and time expressions. For humans, NER is intuitively simple, because many named entities are proper names and most of them begin with a capital letter, which makes them easy to recognize; for machines, it is much harder.
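The two tasks described above (finding name boundaries, then assigning a category) are often encoded as a single per-token labelling problem. The following is a minimal sketch of one common encoding, the BIO scheme; the proposal does not prescribe a tagging scheme, and the function names, the label names (PER, LOC) and the example sentence are illustrative assumptions.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, entity type); illustrative type alias
Span = Tuple[int, int, str]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode entity spans as one BIO tag per token.

    B-<type> marks the first token of a name, I-<type> its continuation,
    and O marks tokens outside any name.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Decode BIO tags back into entity spans."""
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel closes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Tolerate an I- tag with no preceding B- by opening a new span.
            start, label = i, tag[2:]
    return spans


# Illustrative sentence, not taken from any data set.
tokens = ["Ana", "Alves", "works", "in", "Coimbra"]
spans = [(0, 2, "PER"), (4, 5, "LOC")]
tags = spans_to_bio(len(tokens), spans)
```

With this encoding, identification corresponds to predicting where B- and I- tags fall, and classification corresponds to the type suffix, so a sequence model such as a CRF can learn both at once.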

Since NER algorithms are mainly implemented and trained over natural language texts, we are interested in training and extending these state-of-the-art NER systems to extract named entities from web pages in Portuguese and English. This setting poses some challenges. For example, capitalization on the web is fairly erratic: words are often in all caps (“GERMANY”) or lower-cased (“germany”). Typical web pages are also much less dense in entities than running text, and they are much more structured, containing tables, lists, links, etc.
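One way to cope with erratic capitalization and page structure is to give the sequence model features that do not depend on case alone. The sketch below builds a per-token feature dictionary of the kind typically fed to a CRF toolkit; the feature names, the `html_tag` argument and the helper functions are assumptions made for illustration, not part of the proposal.

```python
from typing import Dict, List, Union


def word_shape(token: str) -> str:
    """Collapse a token to a coarse shape: X upper, x lower, d digit.

    Repeated shape characters are merged, so "Germany" -> "Xx" and
    "GERMANY" -> "X". str.isupper/islower also cover accented letters.
    """
    out: List[str] = []
    for ch in token:
        if ch.isupper():
            c = "X"
        elif ch.islower():
            c = "x"
        elif ch.isdigit():
            c = "d"
        else:
            c = ch
        if not out or out[-1] != c:
            out.append(c)
    return "".join(out)


def token_features(tokens: List[str], i: int, html_tag: str = "p") -> Dict[str, Union[str, bool]]:
    """Features for token i; html_tag is the (hypothetical) enclosing HTML element."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),            # case-insensitive identity
        "shape": word_shape(tok),        # case pattern kept as a separate signal
        "is_all_caps": tok.isupper(),
        "is_title": tok.istitle(),
        "html_tag": html_tag,            # e.g. "td", "li", "a" for structured content
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


caps = token_features(["visit", "GERMANY"], 1, html_tag="td")
lower = token_features(["visit", "germany"], 1, html_tag="td")
```

Here `caps` and `lower` share the same `lower` feature while differing in `shape`, so the model can learn how much to trust capitalization on web text instead of relying on it outright.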

Objective

The role of the student in this project is to implement a NER system that processes web page content in English and Portuguese. The extracted information will be used in other projects (Semantics and the City [Pereira et al., 2009]; Semantic Enrichment of Places [Alves et al., 2009]) that apply Web mining to descriptions of places and events. In addition, the system will be tested on well-known contest data sets (MUC and HAREM), with the aim of submitting a paper to a conference in the area.
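Testing on contest data sets usually means comparing predicted entities against gold annotations. The sketch below computes exact-match precision, recall and F1 over entity spans; it is a simplified assumption for illustration and does not reproduce the official MUC or HAREM scoring procedures, which have their own matching rules.

```python
from typing import Set, Tuple

# (start token index, end token index exclusive, entity type); illustrative type alias
Entity = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Entity], pred: Set[Entity]) -> Tuple[float, float, float]:
    """Exact-match scores: a prediction counts only if span and type both match."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative annotations: the second prediction has the right span but the wrong type.
gold = {(0, 2, "PER"), (4, 5, "LOC")}
pred = {(0, 2, "PER"), (4, 5, "ORG")}
p, r, f = precision_recall_f1(gold, pred)
```

Under exact matching, a name with the right boundaries but the wrong category counts as both a false positive and a false negative, which is why `p`, `r` and `f` are all 0.5 in this example.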

Work Plan - Semester 1

The tentative plan for this project (semester 1) is the following:
- October 15th - State of the art (1.5 months)
- October 31st - Experimentation with open NER systems (1 month)
- December 15th - Adapting and training a model for NER on the Web, Part I - English version (1.5 months)
- January 15th - Refinement and tests over the MUC data set (1 month)
- February 27th - Intermediate report. (1 month)

Work Plan - Semester 2

The tentative plan for this project (semester 2) is the following:
- March 31st - Adapting and training the model for NER on the Web, Part II - Portuguese version (1 month)
- April 30th - Refinement and tests over the HAREM data set (1 month)
- May 31st - Experiments report. Paper submission. (1 month)
- June 30th - MSc thesis delivery. (1 month)

Requirements

Strong skills in programming (Java), Machine Learning and Statistics.

Other interesting skills include Information Extraction techniques and Artificial Intelligence.

Supervisor

Prof. Francisco Câmara Pereira & Ana Alves
camara@dei.uc.pt