Internship Proposals 2014/2015 - Multi-year

DEI - FCTUC
Generated on 2024-03-29 09:24:29 (Europe/Lisbon).

Internship Title

Semantic Information Extraction from Documents

Technological Area

Artificial Intelligence

Internship Location

DEI - Artificial Intelligence Laboratory

Background

Keyword search answers only 33% of queries, which leaves considerable room for improving search engines. Semantic interpretation of documents can substantially improve the precision and accuracy of search engines, but it requires processing and analysing documents with Natural Language Processing techniques and Semantic Web technologies. Several kinds of information can be extracted from documents and then represented using Semantic Web technologies, including:
• Concept Identification.
• Named Entity Recognition.
• Event Recognition.
• Topic Extraction.
• Opinion Extraction.
• …
This thesis proposal explores the use of Semantic Web and Natural Language Processing mechanisms for information extraction from documents. The work involves using and creating tools and libraries that extract semantic information from documents and represent it with Semantic Web technologies so that it can be used, for example, by search engines. The work is intended to apply to documents in general, targeting the main document formats. The target languages are English and Portuguese; several extraction tools, libraries and resources already exist, so the candidate can build on these and improve the ones that need it.
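As a rough illustration of what "representing extracted information using Semantic Web technologies" could look like, the sketch below serialises a few extracted named entities as RDF triples in N-Triples syntax. It uses only the Python standard library. The `http://example.org/` namespace, the `Entity` structure, the `mentions` property and the `to_ntriples` helper are all hypothetical names chosen for this example; the proposal does not prescribe a vocabulary or a toolkit.

```python
from typing import Iterable, List, NamedTuple

# Hypothetical namespace, used here for illustration only.
EX = "http://example.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


class Entity(NamedTuple):
    """A named entity found in a document (hypothetical structure)."""
    identifier: str  # local name used to build the entity IRI
    label: str       # surface form as it appears in the text
    kind: str        # e.g. "Person", "Organisation", "Place"


def escape_literal(text: str) -> str:
    """Escape backslashes, quotes and line breaks inside an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def to_ntriples(doc_id: str, entities: Iterable[Entity]) -> List[str]:
    """Return one N-Triples line per statement about the document's entities."""
    doc_iri = f"<{EX}doc/{doc_id}>"
    lines: List[str] = []
    for entity in entities:
        ent_iri = f"<{EX}entity/{entity.identifier}>"
        # The entity's class, e.g. ex:Place.
        lines.append(f"{ent_iri} <{RDF_TYPE}> <{EX}{entity.kind}> .")
        # The human-readable label taken from the text.
        lines.append(f'{ent_iri} <{RDFS_LABEL}> "{escape_literal(entity.label)}" .')
        # Link from the document to the entity (hypothetical property).
        lines.append(f"{doc_iri} <{EX}mentions> {ent_iri} .")
    return lines


if __name__ == "__main__":
    for line in to_ntriples("d1", [Entity("coimbra", "Coimbra", "Place")]):
        print(line)
```

Lines in this format can generally be bulk-loaded into a triple store; a real system would more likely rely on an RDF library and an established ontology than on hand-built strings.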

Objective

The objective of this thesis is to address most of the topics mentioned above: Concept Identification, Named Entity Recognition, Event Recognition, Topic Extraction, Opinion Extraction, etc. The main result will be a system for the automatic extraction of semantic information from documents. This information must be represented using ontologies in Semantic Web formats and stored in triple stores. Some of the main requirements of the system to be developed are:
• Word Sense Disambiguation (WSD): selection of the most adequate sense of a word in a context.
• Information Extraction (IE): the generic task of automatically extracting structured information from unstructured natural language input. IE generally encompasses several subtasks, such as concept identification, event recognition, topic extraction, and sentiment analysis.
• Named Entity Recognition (NER): identification and (sometimes) classification of proper nouns, more precisely names of persons, organisations, places, events and pieces of art, expressions of time, quantities and monetary values, and sometimes even abstractions.
• Anaphora resolution: identification of anaphoras and determination of the expressions or entities they are referring to.
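To make the Word Sense Disambiguation requirement above more concrete, here is a minimal sketch of the simplified Lesk baseline: it picks the sense whose gloss shares the most words with the context. This is a well-known baseline, not necessarily the method the thesis would adopt. The tiny sense inventory, the stop-word list and the function names are assumptions made up for this example; a real system would draw senses from a lexical resource.

```python
import re
from typing import Dict, Optional, Set

# Toy sense inventory with hypothetical glosses written for this example.
SENSES: Dict[str, Dict[str, str]] = {
    "bank": {
        "bank.finance": "an institution that accepts deposits and lends money",
        "bank.river": "the sloping land beside a river or lake",
    }
}

# Small illustrative stop-word list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "and", "or", "that", "to", "in", "on", "at", "by"}


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of non-stop-word tokens."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def simplified_lesk(word: str, context: str,
                    senses: Dict[str, Dict[str, str]] = SENSES) -> Optional[str]:
    """Return the sense id whose gloss overlaps most with the context.

    Returns None when the word is not in the inventory. Ties are broken
    in favour of the sense listed first.
    """
    candidates = senses.get(word.lower())
    if not candidates:
        return None
    # Ignore the target word itself when measuring overlap.
    context_tokens = tokenize(context) - {word.lower()}
    best_sense, best_overlap = None, -1
    for sense_id, gloss in candidates.items():
        overlap = len(context_tokens & tokenize(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense_id, overlap
    return best_sense


if __name__ == "__main__":
    print(simplified_lesk("bank", "She went to deposit money at the bank"))
    print(simplified_lesk("bank", "They sat on the bank of the river"))
```

The overlap count uses exact word matches only, so "deposit" and "deposits" do not match; stemming or lemmatisation would be a natural next step.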

Work Plan - Semester 1

- State of the Art [Sep – Nov]
  - Natural Language Processing
  - Semantic Web Technologies
  - Tools and Resources
  - Related Work
- Analysis and Specification [Dec]
  - Definition of System Requirements
  - Use Case Definition
  - Design and Specification
- Thesis Proposal Writing [Dec – Feb]

Work Plan - Semester 2

- Prototype Development [Mar – Jun]
- Prototype Experimentation [Jun – Jul]
- Thesis Writing [Jun – Jul]

Conditions

This project has an associated scholarship.

Remarks

This thesis is part of the iCIS project (http://icis.uc.pt).
The research work will take place in the Knowledge and Intelligent Systems laboratory of the Cognitive and Multimedia Systems group of CISUC.

This is a scientific internship that will run in line with the European FP7 project ConCreTe, so the student may collaborate with other researchers and PhD students from UC and from other European universities who are working on this topic.

Supervisor

Paulo Gomes
pgomes@dei.uc.pt