Submitted Proposals

DEI - FCTUC
Generated on 2024-03-28 09:52:35 (Europe/Lisbon).

Internship Title

The Impact of Synthetic Data on Feature Engineering

Areas of Specialization

Intelligent Systems

Internship Location

DEI-FCTUC

Background

The amount of data that people generate nowadays is immense. This enables the development of intelligent systems and services that can improve the quality and safety of day-to-day life. Fraud detection systems are examples of such state-of-the-art systems. They have benefited from the growing availability of data, which allows the application of Machine Learning (ML) algorithms that classify financial transactions as legitimate or illicit in real time. The data used to build these solutions is usually highly structured and contains features with complex distributions. However, fraudulent instances are scarce, which leads to highly unbalanced datasets. This, together with strict privacy restrictions, raises several challenges for the development of ML models.
With the rise of Deep Learning and its success on a variety of problems, many approaches have been proposed in the literature to tackle unbalanced data and privacy preservation. One possible approach is to develop a model that learns the underlying representation of the data and is then used to generate new synthetic examples that follow the same distributions but do not reveal any sensitive information about the original data [1,2,3,4].


1 - Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1-127.
2 - Phan, N., Wang, Y., Wu, X., & Dou, D. (2016, February). Differential privacy preservation for deep auto-encoders: an application of human behavior prediction. In Thirtieth AAAI Conference on Artificial Intelligence.
3 - Malekzadeh, M., Clegg, R. G., & Haddadi, H. (2018, April). Replacement autoencoder: A privacy-preserving algorithm for sensory data analysis. In 2018 IEEE/ACM Third International Conference on Internet-of-Things Design and Implementation (IoTDI) (pp. 165-176). IEEE.
4 - Xu, L., & Veeramachaneni, K. (2018). Synthesizing tabular data using generative adversarial networks. arXiv preprint arXiv:1811.11264.

Objective

Building upon the work that we have been conducting, the main goal of this dissertation is to analyse and evaluate at which point in the ML pipeline artificial data should be generated. Specifically, we aim to understand the impact of generating synthetic samples before versus after a feature engineering procedure has been applied. In addition, considering the problem of unbalanced datasets in fraud detection, we aim to develop new approaches for dealing with the reduced number of positive instances in the data.
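
As a rough illustration of the two pipeline orders compared in this objective, the following minimal sketch contrasts "synthesize, then engineer" with "engineer, then synthesize". Everything in it is an assumption made for illustration only: the independent per-column Gaussian sampler is a toy stand-in for a real generative model, the two raw columns (transaction amount, account average amount) and the ratio feature are hypothetical, and the FRAUD_ROWS values are made up. It is not the method used in the project.

```python
import random
import statistics

# Hypothetical minority-class (fraud) rows: (amount, account_average_amount).
FRAUD_ROWS = [
    (950.0, 100.0),
    (1200.0, 110.0),
    (800.0, 90.0),
    (1500.0, 120.0),
    (1100.0, 95.0),
]


def fit_gaussian(rows):
    """Fit an independent Gaussian to each column (toy generative model)."""
    columns = list(zip(*rows))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]


def sample(params, n, rng):
    """Draw n synthetic rows, sampling each column independently."""
    return [tuple(rng.gauss(mu, sigma) for mu, sigma in params) for _ in range(n)]


def engineer(row):
    """Hypothetical feature engineering: append amount / account average."""
    amount, avg_amount = row[0], row[1]
    return (amount, avg_amount, amount / avg_amount)


def synth_then_engineer(rows, n, rng):
    """Order A: generate raw synthetic rows, then derive the feature."""
    raw = sample(fit_gaussian(rows), n, rng)
    return [engineer(r) for r in raw]


def engineer_then_synth(rows, n, rng):
    """Order B: derive the feature first, then generate all three columns."""
    engineered = [engineer(r) for r in rows]
    return sample(fit_gaussian(engineered), n, rng)


def max_ratio_error(rows):
    """Largest gap between the stored ratio and amount / average."""
    return max(abs(r[2] - r[0] / r[1]) for r in rows)


if __name__ == "__main__":
    rng = random.Random(0)
    order_a = synth_then_engineer(FRAUD_ROWS, 50, rng)
    order_b = engineer_then_synth(FRAUD_ROWS, 50, rng)
    print("Order A max ratio error:", max_ratio_error(order_a))
    print("Order B max ratio error:", max_ratio_error(order_b))
```

In this toy setting, order A keeps the engineered feature consistent with the raw columns by construction, while order B samples the engineered column independently, so the relationship can drift. Whether either effect helps or hurts a downstream fraud classifier is the kind of question the dissertation would need to evaluate empirically.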

To validate the approaches, we will use publicly available fraud detection datasets that we have already used in our project. Additionally, there is the possibility of evaluating the results on real-world data provided by one of the partners of the project.

Work Plan - Semester 1

1 - Review of the literature
2 - Familiarisation with the current system
3 - Definition of the techniques and technologies that will be used
4 - System architecture design
5 - Implementation of the first version of the system
6 - Writing of the intermediate report

Work Plan - Semester 2

7 - Analysis of the first prototype and the obtained results
8 - Implementation of a second version of the prototype
9 - Validation and refinement
10 - Scientific article with the main results
11 - Writing of the dissertation

Conditions

The student will work in the context of the interdisciplinary project CAMELOT. This project is led by Feedzai and involves Carnegie Mellon University, Universidade de Coimbra, Faculdade de Ciências da Universidade de Lisboa, and Instituto Superior Técnico.
The work will be conducted in the Evolutionary and Complex Systems (ECOS) and Systems and Software Engineering (SSE) groups of CISUC.
The student may be awarded a scholarship (Bolsa de Investigação para Licenciado) for at least 6 months, renewable for an equal period by agreement between the advisor and the intern. The scholarship will follow the monthly stipend guidelines of the Fundação para a Ciência e a Tecnologia (FCT).

Advisors

Nuno Lourenço / Bruno Cabral / João Paulo Fernandes
naml@dei.uc.pt