Internship Title
Federated Learning for Fraud Detection
Areas of Specialization
Intelligent Systems
Information Systems
Internship Location
Feedzai/Remote
Background
Feedzai's Machine Learning (ML) models process millions of transactions every day, producing a risk score for each transaction in order to stop fraud. This supervised ML problem depends heavily on high-quality data that encodes both risky behavior (commonly known as fraud patterns) and legitimate behavior.
One of the key insights from risk analysts who research fraud behavior is that fraud perpetrators often approach the task as a “full-time job” and try to commit fraud in various contexts (for example, a batch of stolen cards may be used to buy an array of different goods on various websites, to be sold later on the street). They may do this either to mask their activities more easily or simply to exploit the stolen resources more effectively. Furthermore, there are criminal communities, e.g., on the dark web, that share information, so fraud behavior can also follow time-varying social trends. Finally, fraud strategies are likely to evolve in similar ways across similar use cases (e.g., two different food-ordering apps).
The examples above suggest that sharing information on emerging data patterns in real time across different applications could be highly valuable and could uncover new risky behavior that cannot be detected while applications are siloed (e.g., a credit card is exploited at Bank1 following a certain fraud strategy, and the same strategy is later used on another card at Bank2). However, data sharing is constrained by data privacy regulations, contractual restrictions, and even the reputational risk a company faces when sharing sensitive data across different use cases.
Objective
One approach to working around the issues mentioned above is to not share data at all, and instead to encode the data patterns in a privacy-preserving manner before sharing them. Federated Learning is a framework for training an ML model collaboratively from data held at different nodes in a network, without sharing the data itself. In some approaches, each node in the network either trains its own local model or updates a global model using its local data. The resulting model is then sent to a master node, which combines the various updates to obtain an updated, stronger model. This procedure usually has to be carefully tailored to preserve privacy constraints, i.e., to ensure that the models do not memorize information that would allow sensitive data to be reverse engineered.
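The node/master procedure described above can be illustrated with a minimal sketch of federated averaging. This is only an illustration under stated assumptions, not the method the project will adopt: the logistic-regression model, the synthetic one-feature "transactions", and all function names (`local_update`, `federated_average`, `make_node_data`, `run_rounds`, `accuracy`) are hypothetical and chosen for brevity.

```python
import math
import random


def local_update(weights, data, lr=0.1, epochs=5):
    """One node: refine the global weights with logistic-regression SGD on local data."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for i, xi in enumerate(x):
                w[i] -= lr * (p - y) * xi
    return w


def federated_average(updates, sizes):
    """Master node: average the node weights, weighted by local sample count."""
    total = sum(sizes)
    dim = len(updates[0])
    return [sum(u[i] * n for u, n in zip(updates, sizes)) / total for i in range(dim)]


def make_node_data(rng, n):
    """Synthetic stand-in for one node's private data: ([feature, bias], label)."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append(([x, 1.0], 1 if x > 0.2 else 0))
    return data


def run_rounds(rounds=20, seed=0):
    """Run several communication rounds; only weights leave a node, never raw data."""
    rng = random.Random(seed)
    nodes = [make_node_data(rng, n) for n in (50, 80, 120)]
    global_w = [0.0, 0.0]
    for _ in range(rounds):
        updates = [local_update(global_w, d) for d in nodes]
        global_w = federated_average(updates, [len(d) for d in nodes])
    return global_w, nodes


def accuracy(w, data):
    """Fraction of examples on the correct side of the decision boundary z = 0."""
    correct = 0
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x))
        correct += int((z > 0) == (y == 1))
    return correct / len(data)
```

Calling `run_rounds()` returns the final global weights and the per-node datasets, so `accuracy` can be checked on each node separately. Note that this sketch shares plain model weights, which by itself does not guarantee that nothing about the local data can be inferred.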
The goal of this project is to investigate the feasibility and effectiveness of Federated Learning for fraud detection. The main challenges to address are:
- Which Federated Learning strategies are suitable for the different applications at Feedzai? (E.g., federated learning may make sense for sharing information across banks and across merchants, but less so between banks and merchants.)
- Can we share all relevant fraud patterns without sharing private information? Domain knowledge shows us that often feature extraction from sensitive fields is very important.
- Are there use cases where ML models trained with Federated Learning obtain a considerable performance boost?
- Will such a solution scale appropriately for very large clients and comply with engineering constraints in a streaming environment (e.g., latency and memory constraints)?
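As a concrete reference point for the question of sharing patterns without sharing private information, one widely used idea (not named in this proposal, so treat it as an assumption) is to bound the norm of each node's update and add random noise before the update leaves the node. The sketch below is hypothetical: the function names and the `max_norm` and `noise_std` values are illustrative, and noise added this way yields a formal differential-privacy guarantee only when it is calibrated and accounted for properly, which this sketch does not attempt.

```python
import math
import random


def clip_update(update, max_norm):
    """Scale the update down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in update]


def privatize_update(update, max_norm=1.0, noise_std=0.1, rng=None):
    """Clip the update, then add independent Gaussian noise to each coordinate."""
    rng = rng or random.Random()
    clipped = clip_update(update, max_norm)
    return [v + rng.gauss(0.0, noise_std) for v in clipped]
```

Clipping limits how much any single node can shift the combined model, and the noise trades model accuracy for a weaker link between the shared update and individual records. Whether that trade-off is acceptable for fraud detection is one of the questions the project would have to answer empirically.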
Work Plan - Semester 1
- Understand use cases and get acquainted with time series in the fraud detection domain.
- Literature review of federated learning (includes writing the relevant dissertation sections).
- Literature review of privacy-preserving data sharing and machine learning (includes writing the relevant dissertation sections).
- Design experiments and define evaluation.
- Intermediate report writing.
Work Plan - Semester 2
- Gathering and pre-processing datasets for experimentation on different domains: banking, merchants and merchant acquirers.
- Implement a framework/benchmarking tool to test the most promising state-of-the-art (SOTA) methods.
- Evaluate results and compare federated learning approaches with classical model training techniques.
- Assess whether the implemented technique is viable for the different business cases.
- Assess how well the data is actually protected and whether the sensitive attributes are safe with the implemented approach.
- Assess whether the implemented approaches are scalable, feasible to put into production in a distributed environment, and compliant with engineering constraints in a streaming environment (e.g., latency and memory constraints).
- Document experiments
- Dissertation writing.
Conditions
The data sources are identified and available in the Feedzai Research data repository. The internship agreements ensure access to the Feedzai Research data.
The internship is paid for the duration of the project (1000€/month), according to time allocation.
Remarks
References
An important part of the project will be to review the literature on Federated Learning. Here we provide only a few recent references that relate to the problems mentioned above.
Ji Liu, Jizhou Huang, Yang Zhou, Xuhong Li, Shilei Ji, Haoyi Xiong, Dejing Dou, From Distributed Machine Learning to Federated Learning: A Survey, arXiv:2104.14362
Frank W. Bentrem, Michael A. Corsello, Joshua J. Palm, Leveraging Sharing Communities to Achieve Federated Learning for Cybersecurity, arXiv:2104.11763
Chengliang Zhang, Junzhe Xia, Baichen Yang, Huancheng Puyang, Wei Wang, Ruichuan Chen, Istemi Ekin Akkus, Paarijaat Aditya, Feng Yan, Citadel: Protecting Data Privacy and Model Confidentiality for Collaborative Learning with SGX, arXiv:2105.01281
Supervisor
Ricardo Barata
ricardo.barata@gmail.com