Internship Title
Fraud Data Generator
Internship Location
Feedzai/Remote
Background
Feedzai's Machine Learning (ML) models process millions of transactions every day, producing a risk score for each transaction in order to stop fraud. This supervised ML problem depends heavily on high-quality data that encodes both risky behavior (commonly known as fraud patterns) and normal behavior.
Such data patterns vary considerably across time, geography and use case. In particular, knowledge collected from one application over the years may be valuable for another application later, or even for the same application if a model is re-trained on more recent data. More generally, historical data is very valuable for research and development. The simplest way to preserve such knowledge is to save historical data (e.g., in a data lake). However, increasingly tight privacy constraints make saving data a liability, even when sensitive content has been suitably anonymized, whether because of contractual or legal obligations or because of reputational risks. Data retention policies therefore require data to be deleted frequently, which conflicts with data-hungry applications that rely on such information to perform well.
Objective
One way to resolve the conflict discussed above is to realize that, even though data cannot be retained forever, there may be many ways to summarize, describe or encode it without preserving any detailed information and without making it possible to reconstruct the original data, thus mitigating the risks mentioned above. In fact, the whole point of ML algorithms is to generalize beyond specific examples, thus capturing some important aspect of the underlying statistical process.
Motivated by the discussion above, the goal of this project is to tackle the problem of encoding fraud datasets by learning a generative process that can faithfully generate synthetic datasets with similar properties. This would provide a way to retain relevant information when data retention policies require us to delete datasets. The main challenges of this project will be:
- To preserve information on time dependencies, in particular data drifts across time.
- To learn dependencies among the various raw fields (e.g., amount, emails, addresses, user ids).
- To be able to remove all private information while preserving all the relevant data patterns that are usually extracted from such private information.
- To be able to prove that the system cannot be reverse engineered to reconstruct sensitive data.
- To validate that the synthetic data can be used for ML tasks without significant losses in performance compared with using the real data.
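The last challenge is often framed as "train on synthetic, test on real". The following is only a minimal sketch of that comparison, not the project's method. It assumes a toy one-feature dataset, a per-class Gaussian as a stand-in for the generative model, and a midpoint threshold as a stand-in for the fraud detection model. All function names and parameter values are hypothetical, and accuracy is used only for brevity.

```python
import random
import statistics


def make_real_data(n, rng):
    """Toy stand-in for real transactions: (log_amount, is_fraud) pairs."""
    rows = []
    for _ in range(n):
        is_fraud = rng.random() < 0.2          # assumed fraud rate
        mu = 6.0 if is_fraud else 3.0          # assumed class means
        rows.append((rng.gauss(mu, 1.0), int(is_fraud)))
    return rows


def fit_generator(rows):
    """Summarise the data as a fraud rate plus one Gaussian per class."""
    params = {}
    for label in (0, 1):
        xs = [x for x, y in rows if y == label]
        params[label] = (statistics.mean(xs), statistics.stdev(xs))
    fraud_rate = sum(y for _, y in rows) / len(rows)
    return fraud_rate, params


def sample(generator, n, rng):
    """Draw synthetic rows from the fitted summary (no real row is stored)."""
    fraud_rate, params = generator
    out = []
    for _ in range(n):
        y = int(rng.random() < fraud_rate)
        mu, sigma = params[y]
        out.append((rng.gauss(mu, sigma), y))
    return out


def fit_threshold(rows):
    """Tiny 'model': flag as fraud above the midpoint of the class means."""
    m0 = statistics.mean([x for x, y in rows if y == 0])
    m1 = statistics.mean([x for x, y in rows if y == 1])
    return (m0 + m1) / 2


def accuracy(threshold, rows):
    return sum(int(x > threshold) == y for x, y in rows) / len(rows)


rng = random.Random(0)
real_train = make_real_data(5000, rng)
real_test = make_real_data(5000, rng)
synthetic_train = sample(fit_generator(real_train), 5000, rng)

# Both models are evaluated on the same held-out real data.
acc_real = accuracy(fit_threshold(real_train), real_test)
acc_synth = accuracy(fit_threshold(synthetic_train), real_test)
print(f"trained on real: {acc_real:.3f}, trained on synthetic: {acc_synth:.3f}")
```

A small gap between the two scores suggests the synthetic data preserved the pattern the model needs. With real fraud data, a metric suited to class imbalance would likely replace accuracy.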
Work Plan - Semester 1
- Understand use cases and get acquainted with time series and concept drift.
- Literature review of generative modeling and data encoding (includes writing the relevant dissertation sections).
- Literature review of privacy-preserving data generation (includes writing the relevant dissertation sections).
- Design experiments and define evaluation.
Work Plan - Semester 2
- Gathering and pre-processing datasets for experimentation on different domains: banking, merchants and merchant acquirers.
- Implement a framework/benchmarking tool to test the most promising state-of-the-art (SOTA) methods.
- Qualitative analysis of the generated data, iterating on the implemented solutions accordingly.
- Evaluation of privacy preservation between generated data and original data.
- Use generated data to train fraud detection models.
- Comparative analysis of detection models trained on real data and generated/encoded data.
- Document experiments.
- Dissertation writing.
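For the privacy evaluation step, one simple heuristic is to measure how close each generated record lies to its nearest original record. The sketch below assumes numeric two-dimensional records and two hypothetical generators, one that samples fresh points and one that leaks real rows. It is an illustrative check only and does not amount to a proof that the original data cannot be reconstructed.

```python
import math
import random


def distance_to_closest_record(synthetic, real):
    """For each synthetic row, the Euclidean distance to its nearest real row."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]


def share_of_copies(synthetic, real, eps=1e-9):
    """Fraction of synthetic rows that are (near-)exact copies of a real row."""
    dcr = distance_to_closest_record(synthetic, real)
    return sum(d <= eps for d in dcr) / len(dcr)


rng = random.Random(0)
real = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]

# Hypothetical generator outputs: fresh samples vs. a mix that leaks real rows.
fresh = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
leaky = real[:50] + fresh[:150]

fresh_share = share_of_copies(fresh, real)
leaky_share = share_of_copies(leaky, real)
print(f"copied rows: fresh={fresh_share:.2f}, leaky={leaky_share:.2f}")
```

A non-zero share of copies is a clear warning sign, while a zero share only rules out the most direct form of leakage.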
Conditions
Paid internship.
Remarks
References
An important part of the project will be to review the literature on generative models. Here we provide only a few recent references that relate to the problems discussed above.
Zinan Lin, Alankar Jain, Chen Wang, Giulia Fanti, Vyas Sekar, "Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions", arXiv:1909.13403.
Jinsung Yoon, Daniel Jarrett, Mihaela van der Schaar, "Time-series Generative Adversarial Networks", NeurIPS 2019.
Jie Gui, Zhenan Sun, Yonggang Wen, Dacheng Tao, Jieping Ye, "A Review on Generative Adversarial Networks: Algorithms, Theory, and Applications", arXiv:2001.06937.
Supervisor
Jacopo Bono
jacopo.bono@feedzai.com