Internship Title
Synthetic Data Augmentation for Biological Datasets
Internship Location
DEI-FCTUC
Background
Synthetic data is information that is artificially generated by capturing and sampling the underlying distributions and statistical properties present in real data. Synthetic data generation increases the amount of available information, which can support the development of more accurate models. Biological scenarios are notable examples where synthetic data generation is particularly relevant, as it may be difficult, or even impossible, to obtain large and representative datasets (e.g., for ethical reasons).
With the rise of Deep Learning and its success across a variety of problems, several approaches to synthetic data generation have been proposed in the literature [1,2].
1 - Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends® in Machine Learning, 2(1), 1-127.
2 - Xu, L., & Veeramachaneni, K. (2018). Synthesizing tabular data using generative adversarial networks. arXiv preprint arXiv:1811.11264.
Objective
The main goal of this dissertation is to design, implement and test a framework that learns the underlying distribution of real data from a dataset with a small number of entries, and samples from it to generate meaningful synthetic examples. Specifically, we will focus on approaches based on Generative Adversarial Networks (GANs) [4] and/or AutoEncoders [1,2]. The resulting synthetic data will be used to train Machine Learning models, and their performance will be evaluated.
The models will be tested on real biological data obtained by MitoXT, a group from the Center for Neuroscience and Cell Biology at the University of Coimbra. The biological problem concerns the metabolic characterization of cells from patients with amyotrophic lateral sclerosis (ALS). A set of experimental data was collected on cells from patients with ALS and also from a set of healthy controls.
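As a rough illustration of the workflow described above (fit a generative model to a small real dataset, sample synthetic entries, train a model on the augmented data, and evaluate it on held-out real data), the following minimal Python sketch may help. It is not the framework proposed here: a per-class independent Gaussian stands in for the GAN or AutoEncoder generator, a nearest-centroid classifier stands in for the downstream Machine Learning model, and the toy data, class labels, and all function names are assumptions invented for the example.

```python
import random
import statistics


def fit_gaussian(rows):
    """Estimate a per-feature mean and standard deviation from real rows."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.stdev(c) for c in columns]
    return means, stdevs


def sample_synthetic(params, n, rng):
    """Draw n synthetic rows from the fitted independent Gaussians."""
    means, stdevs = params
    return [[rng.gauss(m, s) for m, s in zip(means, stdevs)] for _ in range(n)]


def centroid(rows):
    """Feature-wise mean of a list of rows."""
    return [statistics.mean(c) for c in zip(*rows)]


def predict(centroids, row):
    """Nearest-centroid classifier: return the label of the closest centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], row))


def accuracy(centroids, labelled_rows):
    """Fraction of (label, row) pairs that the classifier gets right."""
    hits = sum(predict(centroids, row) == label for label, row in labelled_rows)
    return hits / len(labelled_rows)


rng = random.Random(0)


def make_real(center, n):
    """Toy stand-in for real measurements (invented, not MitoXT data)."""
    return [[rng.gauss(c, 1.0) for c in center] for _ in range(n)]


# Small "real" training set: two hypothetical classes, three features each.
real_train = {
    "patient": make_real([2.0, 0.0, 1.0], 8),
    "control": make_real([0.0, 1.0, -1.0], 8),
}
real_test = (
    [("patient", r) for r in make_real([2.0, 0.0, 1.0], 50)]
    + [("control", r) for r in make_real([0.0, 1.0, -1.0], 50)]
)

# Augment each class with synthetic rows sampled from its fitted model.
augmented = {
    label: rows + sample_synthetic(fit_gaussian(rows), 100, rng)
    for label, rows in real_train.items()
}

baseline = {label: centroid(rows) for label, rows in real_train.items()}
with_synth = {label: centroid(rows) for label, rows in augmented.items()}

# Both models are evaluated on held-out real data only.
print("real only      :", accuracy(baseline, real_test))
print("real+synthetic :", accuracy(with_synth, real_test))
```

Evaluating only on held-out real entries keeps the comparison honest: synthetic samples are used for training, never for scoring. With a generator this simple no improvement should be expected; the sketch only shows where a learned generator would plug in.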
Work Plan - Semester 1
1 - Review of literature
2 - Definition of the techniques and technologies that will be used
3 - System architecture design
4 - Implementation of the first version of the system
5 - Writing of the intermediate report
Work Plan - Semester 2
6 - Analysis of the first prototype and the obtained results
7 - Implementation of a second version of the prototype
8 - Validation and refinement
9 - Scientific article with the main results
10 - Writing of the dissertation
Conditions
The work will be conducted in the Evolutionary and Complex Systems (ECOS) group from CISUC.
There is a possibility of the student being awarded a scholarship (Bolsa de Investigação para Licenciado) for at least 6 months, renewable for an equal period by agreement between the advisor and the intern. The scholarship will follow the Fundação para a Ciência e Tecnologia (FCT) monthly stipend guidelines.
Advisor
Nuno Lourenço / Francisco B. Pereira
naml@dei.uc.pt