Internship Title
Debuggable Distributed Systems
Areas of Specialization
Software Engineering
Internship Location
Coimbra, Lisbon, or Porto
Background
About Feedzai:
Feedzai is a company that makes bleeding-edge machine learning software.
The world’s mightiest payment networks, banks and retailers use us to prevent fraud when customers shop in store, online or via mobile devices. Backed by years of hardcore work and funding from amazing investors (Oak HC/FT, Sapphire Ventures, Data Collective) we’re at the inflection point of growth.
Context:
Feedzai’s systems are distributed by nature, designed to scale out horizontally and to be highly fault tolerant. They rely on well-known systems such as ZooKeeper, RabbitMQ, Cassandra, Spark, Hadoop and YARN, as well as several others developed and maintained internally.
One of the main challenges underpinning Feedzai’s deployments is that they are highly distributed, often comprising dozens of components that together process incoming transactions with high fault tolerance. This distributed nature makes debugging particularly hard because the root cause often lies in a different component from the one where the problem was observed.
Today, without an underlying framework focused on this problem, the typical process for finding the root cause of a problem is to go iteratively from one component to the next, guided by the logs inspected in the previous component, and to try to correlate the logging of one system with that of another.
For example, a monitoring system raises a warning that the latency of the transaction processing engine is very high for 0.1% of the transactions. An engineer looks into the logs of the transaction processing engine and identifies one transaction whose latency was indeed high. The logs show that the problem was in fetching data from a distributed NoSQL data store. The engineer now has to go to the 5 machines that make up that data store and walk through their (possibly several) logs to understand what was happening at the time the transaction was processed. On top of that, the root cause may turn out to be garbage collection pressure in the Java Virtual Machine at that exact time.
In essence, debugging a distributed system such as those in Feedzai’s platforms is hard, error-prone work.
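The manual correlation described in the example above can be illustrated with a minimal sketch. It assumes log lines from every component have already been collected into one list, and it simply selects everything logged within a time window around the slow transaction. The component names, messages and the `events_near` helper are hypothetical, and the approach is only as good as the clock synchronization between machines.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LogEvent:
    component: str       # hypothetical component name
    timestamp: datetime
    message: str


def events_near(events, moment, window=timedelta(seconds=1)):
    """Return events from any component within `window` of `moment`, oldest first."""
    nearby = [e for e in events if abs(e.timestamp - moment) <= window]
    return sorted(nearby, key=lambda e: e.timestamp)


# Made-up data: one slow transaction and logs from other components.
t0 = datetime(2017, 1, 1, 12, 0, 0)
events = [
    LogEvent("engine", t0, "transaction latency high"),
    LogEvent("nosql-node-3", t0 - timedelta(milliseconds=400), "read took 350 ms"),
    LogEvent("nosql-node-3-gc", t0 - timedelta(milliseconds=600), "GC pause 300 ms"),
    LogEvent("nosql-node-1", t0 - timedelta(minutes=5), "compaction finished"),
]

suspects = events_near(events, t0)
for e in suspects:
    print(e.timestamp.isoformat(), e.component, e.message)
```

Matching on time alone returns everything that happened nearby, related or not, which is why the objectives below call for an explicit operation identifier.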
Objective
Objectives:
The key objective of this internship is to create a framework, to be used by Feedzai platforms, that makes it easy to debug problems in its distributed systems.
With such a framework in place:
- It is possible to access a platform that aggregates the logs of different distributed components of a Feedzai deployment;
- Each operation in the system is uniquely identifiable across the components;
- All relevant parts of the logs are traceable from the identifier of an operation;
- The components and their associated hardware resources are tracked over time, so that their state can be associated with the period during which an operation occurred;
- The level of tracing should be controllable and dynamically changeable.
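A minimal single-process sketch of three of these goals (a unique operation identifier, logs traceable from that identifier, and a tracing level that can be changed at run time) is given below using only the Python standard library. The logger names, the `process` and `fetch` functions and the in-memory `AggregatingHandler` are hypothetical stand-ins. In a real deployment the identifier would have to travel between processes, for example in message headers, and the store would be an external log aggregation platform.

```python
import contextvars
import logging
import uuid
from collections import defaultdict

# Identifier of the operation currently being processed ("-" when none).
current_op = contextvars.ContextVar("current_op", default="-")


class OperationFilter(logging.Filter):
    """Attach the current operation id to every log record."""

    def filter(self, record):
        record.op_id = current_op.get()
        return True


class AggregatingHandler(logging.Handler):
    """Stand-in for a central log store: indexes messages by operation id."""

    def __init__(self):
        super().__init__()
        self.by_op = defaultdict(list)

    def emit(self, record):
        self.by_op[record.op_id].append((record.name, record.getMessage()))


store = AggregatingHandler()
store.addFilter(OperationFilter())
deployment_log = logging.getLogger("deployment")
deployment_log.addHandler(store)
deployment_log.setLevel(logging.INFO)
deployment_log.propagate = False

# Two hypothetical components sharing the same aggregation point.
engine = logging.getLogger("deployment.engine")
datastore = logging.getLogger("deployment.datastore")


def fetch(key):
    datastore.debug("cache miss for %s", key)
    datastore.info("read %s", key)


def process(txn):
    """Run one operation under a fresh identifier and return that identifier."""
    op_id = str(uuid.uuid4())
    token = current_op.set(op_id)
    try:
        engine.info("processing %s", txn)
        fetch(txn)
    finally:
        current_op.reset(token)
    return op_id


first = process("txn-1")
deployment_log.setLevel(logging.DEBUG)  # raise the tracing detail at run time
second = process("txn-2")
print(store.by_op[first])
print(store.by_op[second])
```

The second operation records the extra debug line because the level was raised between the two calls, without restarting or redeploying anything.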
Part of the difficulty in doing so is that:
- Some components are owned by Feedzai and can have their source code changed, whereas for others we cannot assume that and may need to rely on existing logging and / or instrumentation;
- Different components execute in different languages and runtimes (e.g., Java, Erlang, C++).
These difficulties should map onto different stages of the internship: initially, it should be possible to create a proof of concept that works for Feedzai’s components that execute in the Java Virtual Machine; ultimately, the framework should be applicable to all Feedzai deployments, making it a general-purpose framework of greater value.
The work should, as much as possible, rely on existing open source tools, and ensure that it builds on the existing state of the art on this challenging problem statement.
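For components whose source code cannot be changed, one possible approach is to parse their existing log output into a common event shape before aggregation. The sketch below assumes two invented line formats (a timestamped one labelled `jvm-service` and an epoch-millisecond one labelled `broker`) and assumes both log in UTC; real components will use different formats, and the `normalize` helper is hypothetical.

```python
import re
from datetime import datetime, timezone

# Hypothetical line formats, keyed by a made-up component type.
PATTERNS = {
    "jvm-service": re.compile(
        r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(?P<ms>\d{3}) "
        r"(?P<level>\w+) (?P<msg>.*)$"
    ),
    "broker": re.compile(r"^\[(?P<epoch>\d+)\] (?P<level>\w+): (?P<msg>.*)$"),
}


def normalize(component, line):
    """Convert one raw log line to a common dict, or None if it does not match."""
    match = PATTERNS[component].match(line)
    if match is None:
        return None
    fields = match.groupdict()
    if "epoch" in fields:
        # Epoch milliseconds, assumed to be UTC.
        ts = datetime.fromtimestamp(int(fields["epoch"]) / 1000, tz=timezone.utc)
    else:
        # Wall-clock timestamp with milliseconds, assumed to be UTC.
        ts = datetime.strptime(fields["ts"], "%Y-%m-%d %H:%M:%S").replace(
            microsecond=int(fields["ms"]) * 1000, tzinfo=timezone.utc
        )
    return {
        "component": component,
        "timestamp": ts,
        "level": fields["level"].upper(),
        "message": fields["msg"],
    }


a = normalize("jvm-service", "2017-06-01 10:00:00,250 WARN slow read")
b = normalize("broker", "[1496311200250] warning: queue backlog")
print(a)
print(b)
```

Once every component's lines share one shape and one time base, they can be sorted, filtered and joined regardless of the language or runtime that produced them.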
Work Plan - Semester 1
1st Semester:
During this semester there are two main objectives:
- Discovery of relevant state of the art in tracing, debugging and automating the correlation of events across distributed systems.
- Understanding the architecture and components of the Feedzai platform
Approach:
Debugging distributed systems has been studied more intensively in the past decade, particularly with the rise of the big internet companies that deploy distributed systems spanning hundreds or thousands of machines.
Therefore, a large part of this first semester must be spent understanding existing approaches and their applicability to Feedzai’s scenario. A recent publication in ACM Queue discusses the key works in the area [1] and is thus an adequate starting point in the literature.
In parallel, it is important to survey Feedzai’s components and architecture. Based on this survey, one subset of the system should be selected as the target for a first proof of concept of the framework.
Two possible examples of this would be Feedzai’s transaction processing engine (named Pulse, in particular its Runtime Workflow component and replication functionality) or its Data Science platform that executes mostly on top of Spark, YARN and HDFS.
To lay the foundations for the upcoming work, particular examples of problems in these systems should be identified (from real production environments) and replicated in a development environment by the student.
The result of this semester should be:
- Survey of the relevant state of the art literature
- Architecture documentation of the Feedzai platform
- Detailed documentation for the sub-set of the architecture subject to the upcoming proof of concept
- Definition of promising approaches and their architectures based on the state of the art
- Written report for the intermediate checkpoint of the internship
[1] https://queue.acm.org/detail.cfm?id=3074451
Work Plan - Semester 2
2nd Semester:
During this semester the key objective is to implement the proposed framework and integrate it in the Feedzai platform.
To do so, the student will follow an iterative development approach, based on Scrum, in which successive goals turn the initial proof of concept into a more realistic prototype with extended functionality and evaluation processes. A senior Feedzai engineer will follow the work, in the context of a Feedzai Engineering team, to ensure that development follows proper software engineering practices that guarantee the applicability and integration of the framework in Feedzai’s components.
This will entail the following tasks:
- Implement the distributed debugging framework
- Integrate it in the selected subset of the Feedzai platform as a proof of concept
- Evaluate the effectiveness of the debugging capabilities with Feedzai engineers, on problems from real production environments
- Write the internship report, following up on the checkpoint document, with the final architecture, implementation details and evaluation results
Conditions
PC
Paid Internship
Flexible work schedule
Possibility to do the internship in Coimbra, Porto or Lisbon
Supervisor
Nuno Diegues
nuno.diegues@feedzai.com