Submitted Proposals

Generated on 2025-11-24 06:51:22 (Europe/Lisbon).

Internship Title

Fraud Data Generator

Areas of Specialization

Intelligent Systems

Information Systems

Internship Location

Feedzai/Remote

Background

Feedzai's Machine Learning (ML) models process millions of transactions every day to produce risk scores for each transaction and stop fraud. This supervised ML problem is highly dependent on quality data to encode risky behavior (commonly known as fraud patterns) as well as normal behavior.

Such data patterns can be highly variable, changing across time, geography and use case. In particular, knowledge collected from one application over the years may be valuable for another application later, or even for the same application if a model is re-trained on more recent data. Furthermore, historical data is generally very valuable for research and development. The simplest way to preserve such knowledge is to save historical data (e.g., in a data lake). However, increasingly tight privacy constraints make saving data a liability, even if sensitive content has been suitably anonymized, whether because of contractual or legal obligations or because of reputational risks. Data retention policies therefore require data to be deleted frequently, which conflicts with data-hungry applications that rely on such information to perform well.

Objective

One way to resolve the conflict discussed above is to realise that, even though data cannot be retained forever, there may be many ways to summarise, describe or encode it without preserving detailed information and without making it possible to reconstruct the original data, thus mitigating the risks mentioned above. In fact, the whole point of ML algorithms is to generalize beyond specific examples, thus capturing some important aspects of the underlying statistical process.

Motivated by the discussion above, the goal of this project is to tackle the problem of encoding fraud datasets by learning a generative process that can faithfully generate synthetic datasets with similar properties. This would provide a way to retain relevant information when data retention policies require datasets to be deleted. The main challenges of this project will be:
- To preserve information on time dependencies, in particular data drifts across time.
- To learn dependencies among the various raw fields (e.g., amount, emails, addresses, user ids, etc.).
- To be able to remove all private information while preserving all the relevant data patterns that are usually extracted from such private information.
- To be able to prove that the system cannot be reverse engineered to reconstruct sensitive data.
- To validate that the synthetic data can be used for ML tasks without significant losses in performance compared with using the real data.
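
As a purely illustrative sketch of the "summarise, then regenerate" idea (not the method the project is expected to deliver), the toy example below reduces a dataset to a handful of aggregate statistics and samples synthetic rows from them. Everything in it is an assumption made for illustration: the two-field row layout `(amount, is_fraud)`, the log-normal model for amounts, and the helper names `fit_summary` and `sample_synthetic`. It ignores time dependencies and dependencies among other raw fields, and it provides no formal privacy guarantee.

```python
import math
import random
import statistics


def fit_summary(rows):
    """Summarise (amount, is_fraud) rows into a few aggregate statistics.

    Only the fraud rate and the per-class mean/std of log-amounts are kept,
    so the rows themselves could be deleted afterwards.
    """
    summary = {"fraud_rate": sum(label for _, label in rows) / len(rows)}
    for label in (0, 1):
        logs = [math.log(amount) for amount, lab in rows if lab == label]
        summary[label] = (statistics.mean(logs), statistics.stdev(logs))
    return summary


def sample_synthetic(summary, n, rng):
    """Draw n synthetic (amount, is_fraud) rows from the fitted summary."""
    rows = []
    for _ in range(n):
        label = 1 if rng.random() < summary["fraud_rate"] else 0
        mu, sigma = summary[label]
        rows.append((rng.lognormvariate(mu, sigma), label))
    return rows


# Hypothetical toy "real" data: fraud rows tend to have larger amounts.
rng = random.Random(0)
real = [(rng.lognormvariate(3.0, 0.5), 0) for _ in range(950)]
real += [(rng.lognormvariate(5.0, 0.7), 1) for _ in range(50)]

summary = fit_summary(real)                       # keep only this
synthetic = sample_synthetic(summary, 1000, rng)  # regenerate on demand
```

A realistic generator would need a far richer model than per-class summary statistics; the sketch only shows the shape of the workflow: fit, discard the original rows, sample.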

Work Plan - Semester 1

- Understand use cases and get acquainted with time series and concept drift.
- Literature review of generative modeling and data encoding (includes writing the relevant dissertation sections).
- Literature review of privacy-preserving data generation (includes writing the relevant dissertation sections).
- Design experiments and define evaluation.
- Intermediate report writing.

Work Plan - Semester 2

- Gathering and pre-processing datasets for experimentation on different domains: banking, merchants and merchant acquirers.
- Implement a framework/benchmarking tool to test the most promising state-of-the-art (SOTA) methods.
- Qualitative analysis of the generated data, iterating on the implemented solutions accordingly.
- Evaluation of privacy preservation between generated data and original data.
- Use generated data to train fraud detection models.
- Comparative analysis of detection models trained on real data and generated/encoded data.
- Document experiments.
- Dissertation writing.
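
The comparative and privacy evaluations listed above could, under strong simplifying assumptions, be sketched as follows. The example trains a one-feature threshold detector once on real data and once on synthetic data, scores both on held-out real data, and also computes a distance-to-closest-record heuristic. All names (`make_rows`, `fit_threshold`, `closest_record_distance`) and the toy data are hypothetical, and `synthetic` is only a placeholder for the output of whatever generator is eventually implemented. The distance heuristic can flag near-copies of real records but is not a proof that sensitive data cannot be reconstructed.

```python
import random


def make_rows(n_legit, n_fraud, rng):
    """Hypothetical stand-in for a dataset of (amount, is_fraud) rows."""
    rows = [(rng.lognormvariate(3.0, 0.5), 0) for _ in range(n_legit)]
    rows += [(rng.lognormvariate(5.0, 0.7), 1) for _ in range(n_fraud)]
    return rows


def balanced_accuracy(rows, threshold):
    """Mean of fraud recall and legitimate recall when flagging amount >= threshold."""
    fraud = [a for a, label in rows if label == 1]
    legit = [a for a, label in rows if label == 0]
    tpr = sum(a >= threshold for a in fraud) / len(fraud)
    tnr = sum(a < threshold for a in legit) / len(legit)
    return (tpr + tnr) / 2


def fit_threshold(rows):
    """Choose the observed amount that maximises balanced accuracy on rows."""
    return max((a for a, _ in rows), key=lambda t: balanced_accuracy(rows, t))


def closest_record_distance(synthetic, real):
    """Smallest amount gap between any synthetic row and any real row."""
    return min(abs(s - r) for s, _ in synthetic for r, _ in real)


rng = random.Random(0)
real_train = make_rows(450, 50, rng)
real_test = make_rows(450, 50, rng)
synthetic = make_rows(450, 50, rng)  # placeholder for a generator's output

# Train on real vs. train on synthetic, always test on held-out real data.
score_real = balanced_accuracy(real_test, fit_threshold(real_train))
score_synth = balanced_accuracy(real_test, fit_threshold(synthetic))
gap = score_real - score_synth

# Heuristic privacy check: a distance of 0 would indicate a copied record.
dcr = closest_record_distance(synthetic, real_train)
```

A small `gap` would suggest that little detection performance is lost by training on synthetic data; in practice the comparison would use the project's actual fraud detection models and metrics rather than this toy threshold rule.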

Conditions

The data sources are identified and available in the Feedzai Research data repository. The internship agreements ensure access to the Feedzai Research data.

Paid internship for the duration of the project (1000€/month), according to time allocation.

Remarks

References
An important part of the project will be to review the literature on Federated Learning. Here we just provide a couple of recent references that relate to the problems alluded to above.

Ji Liu, Jizhou Huang, Yang Zhou, Xuhong Li, Shilei Ji, Haoyi Xiong, Dejing Dou, From Distributed Machine Learning to Federated Learning: A Survey, arXiv:2104.14362

Frank W. Bentrem, Michael A. Corsello, Joshua J. Palm, Leveraging Sharing Communities to Achieve Federated Learning for Cybersecurity, arXiv:2104.11763

Chengliang Zhang, Junzhe Xia, Baichen Yang, Huancheng Puyang, Wei Wang, Ruichuan Chen, Istemi Ekin Akkus, Paarijaat Aditya, Feng Yan, Citadel: Protecting Data Privacy and Model Confidentiality for Collaborative Learning with SGX, arXiv:2105.01281

Supervisor

Jacopo Bono
jacopo.bono@feedzai.com