Internship Title
Supervised Topic Models with Multiple Annotators
Areas of Specialization
Sistemas Inteligentes
Internship Location
AmIlab
Background
There is a growing need to analyze large collections of electronic text. The complexity of document corpora has led to considerable interest in applying hierarchical statistical models. Given the success of unsupervised topic models such as latent Dirichlet allocation (LDA) in modeling the words in documents, and given the need to model document labels as well as words, supervised variants have started to appear. However, a document is seldom classified by a single annotator. News articles, webpages, books, emails and many other kinds of textual content are often categorized online, but in different ways by different users. When disagreements arise, settling on a single label through traditional methods is inappropriate, since those methods rest on the incorrect assumption that all annotators are equally reliable. It is therefore essential to account for this uncertainty by weighting the labels of different annotators differently, giving more importance to the more reliable annotators.
Objective
With the current limitations of supervised topic models in mind, the aim of this thesis is to extend supervised LDA to account for multiple annotators with different levels of expertise. This will allow us to jointly learn the latent topics associated with each document and how each topic relates to the document's class, and to predict the class of new, unseen documents (as well as their topic assignments). The resulting probabilistic model will be a generalization of supervised LDA, which corresponds to the special case in which only a single, highly reliable annotator is available.
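To make the intended structure concrete, the following is a minimal sketch of one possible generative process for such a model. It is not the model the thesis will define: the softmax link from topic frequencies to the class, the per-annotator confusion matrices, and all names (`generate_document`, `eta`, `confusions`) are assumptions introduced for illustration only.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def generate_document(n_words, alpha, topics, eta, confusions, rng):
    """Sample one document, its latent true class, and one label per annotator.

    Hypothetical parameters: topics[k] is topic k's distribution over the
    vocabulary, eta[c][k] is the weight of topic k for class c, and
    confusions[r][c] is annotator r's distribution over reported labels
    when the true class is c.
    """
    n_topics = len(alpha)
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    z = rng.choices(range(n_topics), weights=theta, k=n_words)  # word-topic assignments
    words = [rng.choices(range(len(topics[k])), weights=topics[k])[0] for k in z]
    # The class depends on the empirical topic frequencies of the document.
    z_bar = [z.count(k) / n_words for k in range(n_topics)]
    class_probs = softmax([sum(w * f for w, f in zip(row, z_bar)) for row in eta])
    true_class = rng.choices(range(len(eta)), weights=class_probs)[0]
    # Each annotator reports a noisy version of the latent true class.
    labels = [rng.choices(range(len(conf[true_class])), weights=conf[true_class])[0]
              for conf in confusions]
    return words, z, true_class, labels


# Toy setting: 2 topics, 4 vocabulary words, 2 classes, 2 annotators.
rng = random.Random(0)
topics = [[0.7, 0.2, 0.1, 0.0], [0.0, 0.1, 0.2, 0.7]]
eta = [[4.0, -4.0], [-4.0, 4.0]]
confusions = [
    [[0.95, 0.05], [0.05, 0.95]],  # reliable annotator
    [[0.5, 0.5], [0.5, 0.5]],      # annotator who labels at random
]
words, z, true_class, labels = generate_document(20, [0.5, 0.5], topics, eta, confusions, rng)
```

With a single annotator whose confusion matrix is the identity, the reported label always equals the true class, which mirrors the special case described above.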
Given the probabilistic model, the goal is to perform Bayesian inference on all the latent variables of interest, such as the word-topic assignments, the latent (true) class values, the reliabilities of the different annotators and the topic distributions themselves. However, as with LDA and supervised LDA, exact Bayesian inference is intractable. This thesis will therefore explore approximate Bayesian inference techniques based on variational methods [Blei, 2003].
Once the model is implemented, it will be validated using artificial data and real multiple-annotator data obtained through Amazon Mechanical Turk. The final model will then be applied to the problem of event classification.
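As a small illustration of why annotator reliabilities matter for inferring the true class, the sketch below applies Bayes' rule to a single document under the assumption that the class prior and each annotator's confusion matrix are already known. This is not the variational algorithm the thesis will develop, in which these quantities would be estimated jointly with the topics; the function name and the numbers are hypothetical.

```python
def true_class_posterior(prior, confusions, labels):
    """Posterior over the true class given one label per annotator.

    prior[c] is the prior probability of class c, confusions[r][c][l] is the
    probability that annotator r reports l when the true class is c, and
    labels[r] is annotator r's reported label (None if the label is missing).
    """
    unnormalized = []
    for c, p in enumerate(prior):
        for conf, label in zip(confusions, labels):
            if label is not None:
                p *= conf[c][label]
        unnormalized.append(p)
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# One reliable annotator (90% accurate) and two weak ones (55% accurate).
confusions = [
    [[0.90, 0.10], [0.10, 0.90]],
    [[0.55, 0.45], [0.45, 0.55]],
    [[0.55, 0.45], [0.45, 0.55]],
]
# The reliable annotator says class 0; both weak annotators say class 1.
posterior = true_class_posterior([0.5, 0.5], confusions, [0, 1, 1])
# posterior[0] is roughly 0.86, although two of the three labels say class 1.
```

In this toy case an unweighted vote would pick class 1, while the reliability-aware posterior favors class 0.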
Work Plan - Semester 1
• Study of the state of the art in supervised topic models and variational inference
• Development of a variational inference algorithm for the proposed model
• Implementation of the approximate inference algorithm
• Writing of the intermediate report
Work Plan - Semester 2
• Validation of the model using simulated annotators on real data
• Validation of the model using real multiple-annotator data from Amazon Mechanical Turk
• Writing of a scientific article
• Application of the developed model to the problem of event classification
• Writing of a scientific article
• Writing of the thesis
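The first validation step of Semester 2, simulated annotators on real data, could be prototyped along the following lines. This is only a sketch under stated assumptions: each simulated annotator keeps the gold label with a fixed probability and otherwise picks a wrong class uniformly at random, and an unweighted majority vote serves as the baseline. The function names and the accuracy values are hypothetical.

```python
import random
from collections import Counter


def simulate_annotator(gold, p_correct, n_classes, rng):
    """Keep each gold label with probability p_correct, else pick a wrong class uniformly."""
    noisy = []
    for y in gold:
        if rng.random() < p_correct:
            noisy.append(y)
        else:
            noisy.append(rng.choice([c for c in range(n_classes) if c != y]))
    return noisy


def majority_vote(label_sets):
    """Per-item majority vote; ties go to the label that appears first."""
    return [Counter(item).most_common(1)[0][0] for item in zip(*label_sets)]


def agreement(pred, gold):
    """Fraction of items on which pred matches gold."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


rng = random.Random(0)
n_classes = 3
gold = [rng.randrange(n_classes) for _ in range(200)]  # stand-in for real gold labels
label_sets = [simulate_annotator(gold, p, n_classes, rng) for p in (0.9, 0.6, 0.4)]
baseline = agreement(majority_vote(label_sets), gold)
```

The baseline score gives a reference point against which a reliability-aware model can be compared on the same simulated labels.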
Conditions
This work will be carried out in the Ambient Intelligence Lab (AmIlab) of CISUC, with regular supervision and feedback from the supervisors.
This work will be funded by a BSc research grant from the InfoCROWDS project.
Supervisors
Bernardete Ribeiro, Filipe Rodrigues (bribeiro@dei.uc.pt, fmpr@dei.uc.pt)