Submitted Proposals

Generated on 2024-04-25 09:25:22 (Europe/Lisbon).

Internship Title

Benchmark and Tuning of Data Science Framework over Spark

Areas of Specialization

Software Engineering

Intelligent Systems

Internship Location

Rua Pedro Nunes – IPN, 3030-199 Coimbra

Background

In data science applied to fraud detection, working with datasets that span several terabytes is routine. Even simple ETL tasks such as data cleaning and transformation therefore require big data frameworks such as Hadoop MapReduce or Spark running over distributed file systems, so that data science tasks can be iterated on quickly.
Feedzai has a Data Science Framework that integrates with Spark to process large volumes of data across a large number of cluster nodes.
Optimizing and tuning a cluster or a big data framework such as Spark to be as efficient as possible is not a simple task, mainly because the best settings often depend on the actual workload. There are several aspects that an engineer can address in order to optimize a distributed job.
(continues in objectives)

Objective

In addition, big data frameworks such as Spark have several configuration parameters that control how communication between nodes is carried out: buffer sizes, compression algorithm, serialization algorithm, memory per node, sort algorithm, and several others.
Finding the best set of configuration parameters requires benchmarking and iterating over the parameter values, depending on the type of jobs being executed.
As such, the objective of this internship is to learn the most common workloads used in the Data Science Framework and to benchmark them on the cluster in order to obtain the set of Data Science Framework and Spark parameters that minimizes the time jobs take to complete. The student will tackle problems such as minimizing network I/O, garbage collection, and CPU bottlenecks.
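
As a rough illustration of what iterating over a set of parameters can look like, the sketch below enumerates a small grid of Spark settings and renders each combination as `spark-submit --conf` arguments. The property names are standard Spark configuration keys, but the candidate values, the `SEARCH_SPACE` name, and the helper functions are assumptions made for this example and are not part of the Data Science Framework.

```python
from itertools import product

# Hypothetical search space. The keys are real Spark properties; the
# candidate values are illustrative assumptions, not recommendations.
SEARCH_SPACE = {
    "spark.serializer": [
        "org.apache.spark.serializer.JavaSerializer",
        "org.apache.spark.serializer.KryoSerializer",
    ],
    "spark.io.compression.codec": ["lz4", "snappy", "zstd"],
    "spark.shuffle.file.buffer": ["32k", "64k"],
    "spark.executor.memory": ["4g", "8g"],
}


def candidate_configs(space):
    """Yield one dict per combination of parameter values (full grid)."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def to_submit_args(config):
    """Render a config dict as spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in config.items():
        args += ["--conf", f"{key}={value}"]
    return args


configs = list(candidate_configs(SEARCH_SPACE))
print(len(configs))                # 2 * 3 * 2 * 2 = 24 combinations
print(to_submit_args(configs[0]))  # flags for the first combination
```

A full grid grows multiplicatively with every parameter added, so in practice the search space would probably need to be pruned per workload type.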

The main goal of this internship is for the student to deliver important insights into the internals of Spark and how they can be exploited to optimize job execution.
In the end, the deliverable should be a report that explains the limitations identified in the current set of parameters used by the Data Science Framework and how they can be solved with a new configuration or even by rewriting the job with a different set of Spark primitives.
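
One commonly cited example of choosing a different primitive is aggregating by key with `reduceByKey` instead of `groupByKey`, since `reduceByKey` combines values inside each partition before the shuffle. The sketch below does not use Spark at all: it is a plain-Python simulation, with made-up partitions and hypothetical function names, that only counts how many records would cross the network in each case.

```python
from collections import defaultdict


def shuffled_without_combine(partitions):
    """Records shuffled when every (key, value) pair is sent as-is."""
    return sum(len(partition) for partition in partitions)


def shuffled_with_combine(partitions):
    """Records shuffled when each partition pre-aggregates per key first."""
    total = 0
    for partition in partitions:
        local = defaultdict(int)
        for key, value in partition:
            local[key] += value  # map-side combine inside the partition
        total += len(local)      # one record per distinct key is sent
    return total


# Illustrative input only: two partitions of (key, count) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("c", 1)],
]

print(shuffled_without_combine(partitions))  # 7
print(shuffled_with_combine(partitions))     # 4
```

The gap between the two counts widens as keys repeat more often within a partition, which is why the benefit depends on the workload.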

All findings should be documented and backed by analysis of the benchmarks.
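
A minimal sketch of how benchmark runs could be summarized is shown below, assuming each configuration is timed over several repetitions and compared by median wall-clock time. The function names and the sample durations are placeholders invented for this example; they are not measurements from any cluster.

```python
import statistics
import time


def time_runs(job, repeats=5, warmup=1):
    """Run `job` several times and return wall-clock durations in seconds."""
    for _ in range(warmup):
        job()  # warm-up runs are discarded
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        job()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Basic statistics to report alongside every benchmark."""
    return {
        "median": statistics.median(durations),
        "mean": statistics.mean(durations),
        "stdev": statistics.stdev(durations) if len(durations) > 1 else 0.0,
    }


def speedup(baseline, candidate):
    """Median-based speedup of candidate over baseline (> 1 means faster)."""
    return statistics.median(baseline) / statistics.median(candidate)


# Placeholder durations in seconds, for illustration only.
baseline_runs = [120.0, 118.0, 125.0]
tuned_runs = [90.0, 95.0, 88.0]

print(summarize(baseline_runs))
print(round(speedup(baseline_runs, tuned_runs), 2))
```

Reporting the spread together with the median helps distinguish a real improvement from run-to-run noise on a shared cluster.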

Work Plan - Semester 1

The stages of this internship for the 1st semester are as follows:

• Review the state of the Data Science Framework and its workloads.

• Review the state of the art in Spark tuning and configuration.

• Define a benchmarking methodology.

Work Plan - Semester 2

The stages of this internship for the 2nd semester are as follows:

• Implement the test-bed for all benchmarks.

• Benchmark and iterate.

• Documentation, final report, and presentation.

Conditions

Paid internship

Remarks

You can find more information at: www.feedzai.com

Supervisor

Pedro Pinto
pedro.pinto@feedzai.com
