Internship Title
Integrate Feedzai’s Data Science Framework with SparkSQL.
Areas of Specialization
Software Engineering
Intelligent Systems
Internship Location
Feedzai Office in Coimbra
Background
Feedzai's Data Science Studio is a tool that allows Fraud Analysts and Data Scientists to
prepare data, train Machine Learning models and validate rules for automatic fraud
detection. In order to deal with large datasets, our internal data science framework
(DSF) relies on an integration between Spark Core and our proprietary CEP engine
(PKernel) to execute heavy distributed Data Science jobs.
Over the past few years, Spark has evolved considerably, gaining many features and
performance improvements. Most of these, however, target
the SparkSQL layer. Feedzai’s DSF, as an early adopter of Spark, still relies on Spark
Core, which in some cases prevents DSF from taking advantage of such improvements.
Thus, Feedzai intends to extend its DSF so that it can execute a DSF Logical
Plan on a SparkSQL back-end engine while maintaining compatibility with the existing back-end
engines.
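As a rough illustration of the idea above, the following minimal sketch shows one logical plan being handed to two interchangeable back-ends. Everything in it is hypothetical: the node types (`Scan`, `Filter`, `Project`), the `BackendEngine` interface and both engine classes are invented for this example and do not reflect the actual DSF or PKernel code. The SQL back-end only compiles the plan to a query string; in a real integration that string (or an equivalent DataFrame pipeline) would presumably be handed to Spark, for example through `SparkSession.sql`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Tuple

Row = Dict[str, object]


# Hypothetical logical plan nodes (not the real DSF plan model).
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    child: object
    column: str
    min_value: float  # keeps rows where column > min_value


@dataclass(frozen=True)
class Project:
    child: object
    columns: Tuple[str, ...]


class BackendEngine(ABC):
    """Common interface, so new back-ends can be added without changing plans."""

    @abstractmethod
    def execute(self, plan):
        ...


class InMemoryEngine(BackendEngine):
    """Stand-in for an existing engine: evaluates the plan row by row."""

    def __init__(self, tables: Dict[str, List[Row]]):
        self.tables = tables

    def execute(self, plan) -> List[Row]:
        if isinstance(plan, Scan):
            return list(self.tables[plan.table])
        if isinstance(plan, Filter):
            rows = self.execute(plan.child)
            return [r for r in rows if r[plan.column] > plan.min_value]
        if isinstance(plan, Project):
            rows = self.execute(plan.child)
            return [{c: r[c] for c in plan.columns} for r in rows]
        raise TypeError(f"Unsupported plan node: {plan!r}")


class SqlEngine(BackendEngine):
    """Stand-in for a SQL back-end: compiles the same plan to a query string."""

    def execute(self, plan) -> str:
        if isinstance(plan, Scan):
            return f"SELECT * FROM {plan.table}"
        if isinstance(plan, Filter):
            inner = self.execute(plan.child)
            return f"SELECT * FROM ({inner}) AS t WHERE {plan.column} > {plan.min_value}"
        if isinstance(plan, Project):
            inner = self.execute(plan.child)
            return f"SELECT {', '.join(plan.columns)} FROM ({inner}) AS t"
        raise TypeError(f"Unsupported plan node: {plan!r}")


# Made-up sample data, for illustration only.
tables = {
    "transactions": [
        {"id": 1, "amount": 50, "country": "PT"},
        {"id": 2, "amount": 250, "country": "US"},
    ]
}
plan = Project(Filter(Scan("transactions"), "amount", 100), ("id", "amount"))

in_memory_result = InMemoryEngine(tables).execute(plan)
sql_query = SqlEngine().execute(plan)
```

Because the plan is plain data and each engine interprets it on its own, the existing engines keep working unchanged when a new one is added, which is the compatibility property the proposal asks for.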
Objective
The main goal of this internship is to implement, validate and benchmark a new DSF
plan back-end engine backed by SparkSQL (DataFrames) that will allow the product to
take advantage of the new Spark features and improvements. The design must
keep the product compatible with the current engine.
Thus, we have these goals:
● On-boarding on Feedzai’s Pulse and Data Science Studio.
● On-boarding on the current DSF framework implementation and how it integrates into
the product.
● Design and present possible solutions to integrate with the existing code base and
identify the impacts, advantages and drawbacks of the proposed solutions.
● Implement the chosen solution, considering:
○ Code implementation using best practices.
○ Unit and system test implementation.
○ Quality documentation.
● Benchmark and document the current engine against the new SparkSQL engine
running Data Science jobs.
● Integration in the product, if ready for it.
Work Plan - Semester 1
● On-boarding on Feedzai’s Pulse and Data Science Studio.
● On-boarding on the current DSF framework implementation and how it integrates into
the product.
● Design and present possible solutions to integrate with the existing code base and
identify the impacts, advantages and drawbacks of the proposed solutions.
● Write the intermediate report.
Work Plan - Semester 2
● Implement the chosen solution, considering:
○ Implementation using the company’s tooling, best practices and coding style.
○ Unit and integration tests that give high coverage of the
produced code.
○ Quality documentation using the company’s tools.
● Benchmark and document the current engine against the new SparkSQL engine for
Data Science jobs.
● Integration in the product, if ready for it.
● Write the thesis.
Conditions
The software and hardware required for the internship will be provided by
Feedzai. This is a paid internship. The exact duration of the internship is to be defined and the
remuneration will be 1000€ gross per month (full time).
Remarks
More information is available at www.feedzai.com.
Supervisor
Pedro Gandola
pedro.gandola@feedzai.com