Proposal without an assigned student

Generated on 2024-05-07 23:27:55 (Europe/Lisbon).

Internship Title

Tools for distributed systems observation

Areas of Specialization

Software Engineering

Internship Location

DEI

Background

Breaking large software systems into smaller, functionally interconnected components is a trend on the rise. This architectural style, known as "microservices", simplifies the development, deployment and management of individual components at the expense of greater complexity in the system as a whole. System fragmentation produces a large number of possibly long and even unknown workflows that are difficult to trace, which makes the overall system hard to observe. Hence, in large-scale distributed systems, DevOps teams may find it particularly difficult to determine the set of microservices responsible for delaying client requests, since delays and errors may only surface at later stages of the execution. Components cannot be analyzed in isolation, and system operators lack the overall view of the system they need to identify anomalies and trace their root causes.

Under the DataScience4NP project, we have been working on several tools to improve the observability of distributed systems. These include visualization tools that expose data from individual services (such as response times, error codes, or number of replicas), dependency graphs (i.e., relations between services), the overall architecture, individual workflows, failures in the system, etc. By contrast, state-of-the-art tools such as Zipkin and Jaeger are quite limited in their ability to expose the internal details of a distributed system.

Objective

To overcome the aforementioned limitations, this internship aims to improve tools for the observability of distributed systems. This will involve the following work:
- Define a better instrumentation standard. Instrumentation with current standards, like OpenTracing, leaves too many fields undetermined, which makes it difficult for the distributed tracing system to recover the complete set of interactions in the system. For example, in some key-value pairs the key itself is arbitrary, which makes automatic use of the pair practically impossible. Value data types are also often left open, which further complicates their interpretation, and numeric data is represented inconsistently, among other problems.
- Improve the graph-based tools that extract data (i.e., microservice interactions) from the distributed tracing system. In other words, this part of the work will involve fetching data from Zipkin in order to represent service interactions as graphs. The tool should automate many of the tasks that DevOps teams currently perform manually. One of the big challenges of this task is processing large volumes of trace data efficiently.
- Improve the analytical tools that identify failures and determine their root causes. This work involves applying machine learning techniques to data extracted from distributed tracing systems like Zipkin.
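As an illustration of the graph-extraction task, the following is a minimal sketch in Python, not the project's actual tooling. It assumes spans shaped like Zipkin's v2 JSON (`traceId`, `id`, `parentId`, `localEndpoint.serviceName`), that every span has its own `id` (no span shared between client and server), and that the spans are already loaded in memory. The function name `service_edges`, the sample data and the service names are hypothetical.

```python
from collections import Counter

# Hypothetical spans, shaped like Zipkin v2 JSON; real data would be
# fetched from the tracing system rather than written by hand.
SAMPLE_SPANS = [
    {"traceId": "t1", "id": "a", "localEndpoint": {"serviceName": "frontend"}},
    {"traceId": "t1", "id": "b", "parentId": "a",
     "localEndpoint": {"serviceName": "orders"}},
    {"traceId": "t1", "id": "c", "parentId": "b",
     "localEndpoint": {"serviceName": "payments"}},
    # Internal span: parent and child belong to the same service.
    {"traceId": "t1", "id": "d", "parentId": "b",
     "localEndpoint": {"serviceName": "orders"}},
    {"traceId": "t2", "id": "e", "localEndpoint": {"serviceName": "frontend"}},
    {"traceId": "t2", "id": "f", "parentId": "e",
     "localEndpoint": {"serviceName": "orders"}},
]


def service_edges(spans):
    """Count caller -> callee service pairs using parent/child span links."""
    # Index spans by (trace, span id) so each parent lookup is O(1).
    by_id = {(s["traceId"], s["id"]): s for s in spans}
    edges = Counter()
    for span in spans:
        parent = by_id.get((span["traceId"], span.get("parentId")))
        if parent is None:
            continue  # root span, or the parent was not collected
        caller = parent["localEndpoint"]["serviceName"]
        callee = span["localEndpoint"]["serviceName"]
        if caller != callee:  # skip calls that stay inside one service
            edges[(caller, callee)] += 1
    return edges


if __name__ == "__main__":
    for (caller, callee), count in sorted(service_edges(SAMPLE_SPANS).items()):
        print(f"{caller} -> {callee}: {count}")
```

Indexing spans by identifier keeps the pass linear in the number of spans, which matters given the volume of trace data mentioned above. A real tool would still need to handle missing parents, shared spans and data that does not fit in memory.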

Work Plan - Semester 1

- Study the existing work (1 month);
- Define the set of tools to improve (1 month);
- Define the requirements of the tools (1 month);
- Define the architecture of the system (1 month);
- Write the intermediate report (1 month).

Work Plan - Semester 2

- Implement the tools (2 months);
- Improve the instrumentation standards (2 months);
- Write the final report (1 month).

Conditions

A 3-month scholarship of 745 euros per month, possibly extendable, is foreseen for this work.

Supervisors

Filipe Araujo and Jorge Cardoso
filipius@uc.pt