Internship Title
Semantic Information Extraction from Documents
Technological Area
Artificial Intelligence
Internship Location
DEI - Artificial Intelligence Laboratory
Background
Only 33% of queries are answered by keyword search, which leaves considerable room for improving search engines. Semantic interpretation of documents can substantially improve the precision and accuracy of search engines. Doing so requires processing and analysing documents with Natural Language Processing techniques and Semantic Web technologies. Several kinds of information can be extracted from documents and then represented using Semantic Web technologies. Some of these are:
• Concept Identification.
• Named Entity Recognition.
• Event Recognition.
• Topic Extraction.
• Opinion Extraction.
• …
This thesis proposal explores the use of Semantic Web and Natural Language Processing mechanisms for information extraction from documents. The work includes using and creating several tools and libraries to extract semantic information from documents and to represent that information with Semantic Web technologies, so that it can be used by search engines, for example. The work is intended to apply to documents in general, targeting the main document formats. The target languages are English and Portuguese. Several extraction tools, libraries and resources already exist for these languages, which will allow the candidate to concentrate on improving those that need it.
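As a minimal sketch of what "representing extracted information using Semantic Web technologies" could look like, the following Python snippet turns one recognised entity into RDF triples and serialises them as N-Triples. The `Triple` class, the `entity_to_triples` helper and the `http://example.org/` namespace are illustrative assumptions, not part of the proposal; a real system would more likely use an RDF library and a project ontology.

```python
from dataclasses import dataclass

# Hypothetical namespace for extracted resources (an assumption for illustration).
EX = "http://example.org/"
# Standard RDF / RDFS vocabulary IRIs.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


@dataclass(frozen=True)
class Triple:
    subject: str            # IRI of the resource being described
    predicate: str          # IRI of the property
    obj: str                # IRI, or literal text when literal=True
    literal: bool = False

    def to_ntriples(self) -> str:
        """Serialise this triple as one N-Triples line."""
        if self.literal:
            # Escape backslash first, then the other special characters.
            escaped = (
                self.obj.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r")
            )
            o = f'"{escaped}"'
        else:
            o = f"<{self.obj}>"
        return f"<{self.subject}> <{self.predicate}> {o} ."


def entity_to_triples(entity_id: str, label: str, entity_type: str) -> list[Triple]:
    """Describe one extracted entity with a type and a human-readable label."""
    subject = EX + entity_id
    return [
        Triple(subject, RDF_TYPE, EX + entity_type),
        Triple(subject, RDFS_LABEL, label, literal=True),
    ]


if __name__ == "__main__":
    # Example: a place name found by a (hypothetical) NER step.
    for triple in entity_to_triples("Coimbra", "Coimbra", "Place"):
        print(triple.to_ntriples())
```

The resulting lines could be loaded into a triple store that accepts N-Triples; the choice of vocabulary and store is left open here.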
Objetivo
The objective of this thesis is to address most of the topics mentioned above: Concept Identification, Named Entity Recognition, Event Recognition, Topic Extraction, Opinion Extraction, etc. The main result of the thesis will be a system for the automatic extraction of semantic information from documents. This information must be represented using ontologies in a Semantic Web format and stored in triple stores. Some of the main requirements of the system to be developed are:
• Word Sense Disambiguation (WSD): selection of the most adequate sense of a word in a context.
• Information Extraction (IE): the generic task of automatically extracting structured information from unstructured natural language inputs. IE generally encompasses several subtasks, such as concept identification, event recognition, topic extraction, and sentiment analysis.
• Named Entity Recognition (NER): identification and (sometimes) classification of proper nouns, more precisely names of persons, organisations, places, events and works of art, expressions of time, quantities and monetary values, and sometimes even abstractions.
• Anaphora resolution: identification of anaphoric expressions and determination of the expressions or entities they refer to.
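To make the Word Sense Disambiguation requirement above more concrete, here is a minimal sketch of a simplified Lesk-style baseline: it picks the sense whose gloss shares the most words with the context. This is one classic baseline, not necessarily the approach the thesis will adopt. The `SENSES` inventory, the sense identifiers and the stopword list are toy assumptions; a real system would draw glosses from a lexical resource.

```python
import re

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or",
             "that", "is", "at", "for", "by", "with"}

# Hypothetical toy sense inventory: word -> {sense id: gloss}.
SENSES = {
    "bank": {
        "bank.financial": "an institution that accepts deposits and lends money to customers",
        "bank.river": "the sloping land beside a river or lake",
    }
}


def tokenize(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def simplified_lesk(word, context):
    """Return the sense of `word` whose gloss overlaps most with `context`.

    Ties are resolved in favour of the sense listed first in SENSES.
    """
    context_words = tokenize(context) - {word}
    best_sense, best_overlap = None, -1
    for sense_id, gloss in SENSES[word].items():
        overlap = len(context_words & tokenize(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense_id, overlap
    return best_sense


if __name__ == "__main__":
    print(simplified_lesk("bank", "She sat on the bank of the river and watched the water"))
    print(simplified_lesk("bank", "He went to the bank to deposit money"))
```

Exact word overlap is brittle (for example, "deposit" does not match "deposits"), which is one reason practical systems add lemmatisation or richer context models.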
Work Plan - Semester 1
- State of the Art [Sep – Nov]
  - Natural Language Processing
  - Semantic Web Technologies
  - Tools and Resources
  - Related Work
- Analysis and Specification [Dec]
  - Definition of System Requirements
  - Use Case Definition
  - Design and Specification
- Thesis Proposal Writing [Dec – Feb]
Work Plan - Semester 2
- Prototype Development [Mar – Jun]
- Prototype Experimentation [Jun – Jul]
- Thesis Writing [Jun – Jul]
Conditions
A scholarship is associated with this project.
Remarks
This thesis is part of the iCIS project (http://icis.uc.pt).
The research work will take place in the Knowledge and Intelligent Systems laboratory of the Cognitive and Multimedia Systems group of CISUC.
This is a research-oriented internship that will run in alignment with the European FP7 project ConCreTe, so the student may collaborate with other researchers and PhD students from UC and from other European universities who are working on this topic.
Supervisor
Paulo Gomes
pgomes@dei.uc.pt