Internship Title
Failure injection in microservice applications
Areas of Specialization
Software Engineering
Internship Location
DEI
Background
Breaking large software systems into smaller, functionally interconnected components is a growing trend. This architectural style, known as "microservices", simplifies the development, deployment and management of individual components at the expense of greater complexity in the system as a whole. System fragmentation produces a large number of possibly long and even unknown workflows that are difficult to trace, which makes the whole system difficult to observe. Hence, in large-scale distributed systems, DevOps teams may find it particularly difficult to determine the set of microservices responsible for delaying client requests, because delays and errors may only emerge at later stages of the execution. Components cannot be analyzed in isolation, and system operators lack the overall view of the system they need to identify anomalies and trace their root causes. Under the DataScience4NP project, we have been working on several tools to improve the observability of distributed systems. These include visualization tools that expose data from individual services (such as response times, error codes, or number of replicas), dependency graphs (i.e., relations between services), the overall architecture, individual workflows, failures in the system, etc. State-of-the-art tools, like Zipkin and Jaeger, are quite limited in their ability to expose the internal details of a distributed system.
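To illustrate the dependency graphs and the search for services that delay requests mentioned above, the following is a minimal Python sketch, not the project's actual tooling. The span format (span id, parent id, service, duration), the service names, and the helper functions are all hypothetical, and the self-time calculation assumes that child calls run sequentially.

```python
from collections import defaultdict

# Hypothetical spans of one client request: (span_id, parent_id, service, duration_ms).
SPANS = [
    ("a", None, "web", 120),
    ("b", "a", "cart", 90),
    ("c", "b", "catalogue", 70),
    ("d", "a", "user", 10),
]


def dependency_edges(spans):
    """Derive caller -> callee edges between services from parent/child spans."""
    service_of = {span_id: service for span_id, _, service, _ in spans}
    edges = set()
    for _, parent_id, service, _ in spans:
        if parent_id is not None and service_of[parent_id] != service:
            edges.add((service_of[parent_id], service))
    return edges


def self_times(spans):
    """Time spent per service, excluding time spent in direct child spans.

    Assumes children execute one after another (no parallel calls).
    """
    child_total = defaultdict(int)
    for _, parent_id, _, duration in spans:
        if parent_id is not None:
            child_total[parent_id] += duration
    per_service = defaultdict(int)
    for span_id, _, service, duration in spans:
        per_service[service] += duration - child_total[span_id]
    return dict(per_service)


if __name__ == "__main__":
    print(sorted(dependency_edges(SPANS)))
    times = self_times(SPANS)
    print(max(times, key=times.get))  # service with the largest self time
```

In this toy trace most of the request time is spent inside the service at the end of the call chain, which is the kind of information that is hard to see when each component is inspected in isolation.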
Objective
One of the limitations we found in our current work is the lack of real systems and good data with which to build the analytic tools. To overcome this problem, the objective of this scholarship is to install, run and analyze a microservice application under normal and faulty operation. This will involve three steps: 1) install an existing application, like Stan's Robot Shop, from Instana, or a Music application we developed in the past; 2) use an established tool, from the "Simian Army" for example, to inject failures into the application; 3) finally, analyze the system using monitoring and end-to-end tracing, to improve our analytic tools. In simple terms, end-to-end tracing makes it possible to fully reconstruct every client invocation as it travels through the system, as well as every workflow. The analysis we intend to perform on the system will involve machine learning techniques based on multiple metrics. This will improve on our previous work, because this internship should enable the joint use of tracing and monitoring data, something we could not do so far, as we only have traces of real systems, without the corresponding monitoring data.
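As a rough illustration of step 2, the sketch below wraps a service call so that it fails or slows down on demand, and records one trace-like entry per request. It is an in-process stand-in written for this example only: it does not use the Simian Army or any other real fault-injection tool, and every name in it (`with_faults`, `InjectedFault`, `get_cart`, `run_experiment`) is hypothetical.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real service error when a fault is injected."""


def with_faults(call, error_rate=0.0, delay_s=0.0, rng=None):
    """Wrap `call` so that it is delayed by `delay_s` seconds and fails
    with probability `error_rate` before reaching the real code."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if delay_s > 0:
            time.sleep(delay_s)  # injected latency
        if rng.random() < error_rate:
            raise InjectedFault(f"injected failure in {call.__name__}")
        return call(*args, **kwargs)

    return wrapper


def get_cart(user_id):
    """Hypothetical stand-in for a call to a cart microservice."""
    return {"user": user_id, "items": []}


def run_experiment(call, requests):
    """Invoke `call` repeatedly and record status and duration per request."""
    records = []
    for request_id in range(requests):
        start = time.perf_counter()
        try:
            call(request_id)
            status = "ok"
        except InjectedFault:
            status = "error"
        duration_s = time.perf_counter() - start
        records.append({"request": request_id, "status": status, "duration_s": duration_s})
    return records


if __name__ == "__main__":
    faulty_cart = with_faults(get_cart, error_rate=0.3, delay_s=0.01, rng=random.Random(42))
    for record in run_experiment(faulty_cart, 10):
        print(record)
```

Records of this kind, collected under both normal and faulty operation and paired with monitoring metrics from the same time window, are the sort of labeled data the analytic tools would need.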
Work Plan - Semester 1
- Study of the existing work (1 month)
- Installation of the microservice application (1 month)
- Define the set of failures to inject (1 month)
- Define requirements for the analytic tools (1 month)
- Write intermediate report (1 month)
Work Plan - Semester 2
- Perform failure injection (2 months)
- Implement, improve and fine-tune new and existing analytic tools (2 months)
- Write final report (1 month)
Conditions
A 3-month scholarship of 745 euros per month, possibly extendable, is foreseen for this work.
Supervisors
Filipe Araujo, Jorge Cardoso (Huawei/U.C.) and Raul Barbosa
filipius@uc.pt