Internship Title
Semantic Topic Modelling
Areas of Specialization
Sistemas Inteligentes
Internship Location
DEI/FCTUC
Background
Topic models have attracted considerable attention as a way of dealing with large text collections. Topics are multinomial distributions over words or key phrases that aim to capture the meaning of huge volumes of data in an unsupervised way. Classic algorithms, such as LDA [1], rely on the co-occurrence of surface words to generate topic distributions. A surface word is often strongly associated with more than one topic and may carry a different sense in each of them. LDA treats a surface word as identical in every context and relies on its co-occurrence with other words in that context to differentiate topics. Combining semantics [2,3] with topic modelling is an attempt to generate topic distributions in which synonymous and semantically similar words are unified as occurrences of the same concept.
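The idea of unifying synonymous surface words as one concept can be illustrated with a minimal sketch. It assumes a tiny hand-made lexicon (TOY_LEXICON, with invented concept identifiers) standing in for a real lexical resource, and it only shows the preprocessing step that turns a bag of words into a bag of concepts; it is not the model proposed here.

```python
from collections import Counter

# Hypothetical toy lexicon mapping surface words to concept identifiers.
# In practice this mapping would come from a lexical knowledge base.
TOY_LEXICON = {
    "car": "concept:automobile",
    "automobile": "concept:automobile",
    "auto": "concept:automobile",
    "concert": "concept:concert",
    "gig": "concept:concert",
}


def to_concepts(tokens, lexicon=TOY_LEXICON):
    """Replace each token by its concept id; unknown tokens are kept unchanged."""
    return [lexicon.get(tok, tok) for tok in tokens]


def bag_of_concepts(documents, lexicon=TOY_LEXICON):
    """Count concept occurrences per document instead of surface words."""
    return [Counter(to_concepts(doc, lexicon)) for doc in documents]


docs = [
    ["car", "parked", "near", "concert"],
    ["automobile", "traffic", "after", "gig"],
]

surface_counts = [Counter(doc) for doc in docs]
concept_counts = bag_of_concepts(docs)

# Vocabulary shared by the two documents, before and after unification.
shared_surface = set(surface_counts[0]) & set(surface_counts[1])
shared_concepts = set(concept_counts[0]) & set(concept_counts[1])

print(sorted(shared_surface))   # []
print(sorted(shared_concepts))  # ['concept:automobile', 'concept:concert']
```

At the surface level the two documents share no vocabulary, so a co-occurrence-based model has no direct evidence that they are related; at the concept level they share two items. This sketch only addresses synonymy: a word with several senses would still map to a single concept unless its sense is resolved first.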
This work is part of the InfoCrowds project, which explores interactions between online information on public events and mobility data to build explanatory and predictive models of people flows in a city.
Objective
With the current limitations of word representation in topic models in mind, the aim of this work is to incorporate semantic information into LDA, so that the algorithm outputs topic distributions over concepts rather than over surface words. To that end, related subtopics in Natural Language Processing and Semantic Processing must also be studied and implemented, such as Word Sense Disambiguation and Semantic Textual Similarity.
In InfoCrowds, information about events is automatically extracted from online resources for different cities, such as Lisbon and Boston. The semantic model must therefore be applied to texts in both languages, Portuguese and English. Whether it relies on external semantic resources such as WordNet (http://wordnet.princeton.edu/) or Onto.PT (http://ontopt.dei.uc.pt/), or works in a fully unsupervised way, the proposed model must be generic enough to be applied to different languages, even if some conditions must be met (e.g. the availability of semantic resources such as those mentioned above).
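Word Sense Disambiguation, named above as one of the subtopics to study, can be illustrated with a simplified Lesk-style heuristic: choose the sense whose gloss shares the most words with the context. The sketch below assumes a hypothetical sense inventory (SENSES, with invented sense identifiers and glosses) in place of a real resource, and a very small English stopword list. It is one possible baseline, not the approach this work commits to.

```python
import re

# Hypothetical sense inventory: the sense ids and glosses are invented for
# illustration and do not come from WordNet or Onto.PT.
SENSES = {
    "bank": {
        "bank.finance": "an institution that accepts deposits of money and gives loans",
        "bank.river": "sloping land beside a river or other body of water",
    }
}

# Minimal stopword list, assumed for this example only.
STOPWORDS = {"a", "an", "the", "of", "and", "or", "that", "to", "in", "on", "by", "is", "at"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def simplified_lesk(word, context, senses=SENSES):
    """Return the sense id whose gloss overlaps most with the context.

    Returns None when the word is not in the inventory. Ties are resolved
    in favour of the sense listed first.
    """
    candidates = senses.get(word)
    if not candidates:
        return None
    context_words = set(tokenize(context)) - {word}
    best_sense, best_overlap = None, -1
    for sense_id, gloss in candidates.items():
        overlap = len(context_words & set(tokenize(gloss)))
        if overlap > best_overlap:
            best_sense, best_overlap = sense_id, overlap
    return best_sense


print(simplified_lesk("bank", "She sat on the bank of the river watching the water"))
# bank.river
print(simplified_lesk("bank", "He went to the bank to deposit money and ask for loans"))
# bank.finance
```

Because the heuristic needs only a sense inventory with glosses and a tokenizer, the same code could in principle be pointed at a Portuguese or an English resource, which matches the language-independence requirement stated above. Exact word matching is a known weakness ("deposit" does not match "deposits"), so a real system would likely add lemmatization.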
The proposed model will be evaluated using available corpora and evaluation metrics for topic modelling, and will then be applied to information extracted from online resources for event modelling.
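The proposal does not name specific evaluation metrics. As an assumption, the sketch below uses one metric that is common in the topic modelling literature, often called UMass coherence: for each pair of top words in a topic, it takes the log of the smoothed number of documents containing both words, divided by the number of documents containing the higher-ranked word. The function name and the toy corpus are hypothetical.

```python
import math


def umass_coherence(topic_words, documents):
    """UMass-style coherence of one topic.

    topic_words: top words of the topic, most probable first.
    documents:   list of token lists (the reference corpus).
    Higher (closer to zero or positive) scores indicate words that tend
    to appear in the same documents.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(topic_words)):
        for l in range(m):
            w_m, w_l = topic_words[m], topic_words[l]
            d_l = doc_freq(w_l)
            if d_l == 0:
                raise ValueError(f"word {w_l!r} does not occur in the corpus")
            # The +1 smoothing avoids log(0) for pairs that never co-occur.
            score += math.log((doc_freq(w_m, w_l) + 1) / d_l)
    return score


corpus = [
    ["concert", "music", "stage"],
    ["concert", "music"],
    ["traffic", "road"],
    ["traffic", "road", "music"],
]

coherent = umass_coherence(["concert", "music"], corpus)
incoherent = umass_coherence(["concert", "road"], corpus)
print(coherent > incoherent)  # True
```

In the toy corpus, "concert" and "music" co-occur in two documents while "concert" and "road" never co-occur, so the first topic scores higher. For a concept-level model, the same computation could be run over concept identifiers instead of surface words.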
Work Plan - Semester 1
- Study of the state of the art in topic models and in approaches to integrating semantic information, and familiarization with lexical knowledge bases
- Definition of an approach to incorporate semantics into LDA
- Implementation of a prototype as proof of concept
- Preparation of the dissertation proposal
Work Plan - Semester 2
- Implementation of the semantic model
- Validation of the model using available corpora, for Portuguese and English, and evaluation metrics
- Writing of a scientific article
- Application of the developed model to the problem of event modelling
- Writing of the thesis
Conditions
This work will be carried out in the Ambient Intelligence Lab (AmIlab) of CISUC, with regular supervision and feedback from the supervisors.
This work will be paid through a BSc research grant.
Remarks
References
[1] Blei, D., Ng, A., Jordan, M. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022.
[2] Guo, W., Diab, M. 2012. Learning the latent semantics of a concept from its definition. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2 (ACL '12). ACL Press, Stroudsburg, PA, USA, 140-144.
[3] Tang et al. 2014. Topic Models Incorporating Statistical Word Senses. Computational Linguistics and Intelligent Text Processing, LNCS Volume 8403, 151-162.
Supervisors
Ana Oliveira Alves and Hugo Gonçalo Oliveira
hroliv@dei.uc.pt