Proposals with students

Generated on 2024-04-28 14:45:54 (Europe/Lisbon).

Internship Title

Failure injection in microservice applications

Areas of Specialization

Software Engineering

Internship Location

DEI

Background

Breaking large software systems into smaller, functionally interconnected components is a growing trend. This architectural style, known as "microservices", simplifies development, deployment and management at the cost of increased system complexity and reduced observability. Microservice-based architectures and Function-as-a-Service (FaaS) platforms are favored for the flexibility they afford, and the trend is accelerated by the financial benefits and reduced development times promised by Platform-as-a-Service (PaaS) and serverless deployments. The benefits include faster development cycles, development team independence (because scope is limited and well defined), and ease of deployment, management, scaling and governance. Microservices and FaaS are the building blocks of modern, highly dynamic distributed systems. This trend is visible in many large international companies, such as LinkedIn, Netflix, Uber, eBay and Amazon, and in important national companies as well.

On the downside, companies face growing challenges in gaining visibility into the health and performance of their systems, as cloud and microservice architectures become more complex and dynamic, with more moving parts. The complexity moves from the components to their interactions and emergent behaviors. Since developers and operators lack a complete view of the system, observability is impaired, and debugging and monitoring become harder as systems grow with additional components and version iterations. In large-scale systems, it is particularly difficult to determine the set of microservices responsible for delaying a client's request. While each service involved in such a request might seem to be working properly, their combination in sequence may still produce slow requests. Cascading effects, where one module impacts several of the microservices that invoke it, are also possible. Components cannot be analyzed in isolation, and system operators lack the overall view of the system needed to determine bottlenecks and trace their root causes.

Under the AESOP project (https://www.cisuc.uc.pt/projects/show/284) and the past DataScience4NP project (https://osf.io/8dc2e/), we have been working on several tools to improve the observability of distributed systems. These include visualization tools that expose data from individual services (such as timings, error codes, or number of replicas), dependency graphs (i.e., relations between services), the overall architecture, individual workflows, failures in the system, etc. These tools rely on logging, monitoring and tracing data. In simple terms, end-to-end tracing enables the reconstruction of every client invocation as it travels through the system, as well as of all the workflows existing in the application.
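To make the link between tracing data and dependency graphs concrete, the following is a minimal sketch, not taken from the project's tools. It assumes a hypothetical span format (the fields `trace_id`, `span_id`, `parent_id` and `service` are illustrative, as are the service names) and derives the caller-to-callee relations between services from parent/child spans.

```python
from collections import defaultdict

# Hypothetical span records. Field and service names are illustrative
# assumptions, not the format of any specific tracing system.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "web"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "cart"},
    {"trace_id": "t1", "span_id": "c", "parent_id": "b", "service": "catalogue"},
    {"trace_id": "t2", "span_id": "d", "parent_id": None, "service": "web"},
    {"trace_id": "t2", "span_id": "e", "parent_id": "d", "service": "payment"},
]


def dependency_graph(spans):
    """Return {caller_service: [callee_service, ...]} derived from span parentage."""
    # Index spans by (trace, span id) so each child can find its parent.
    by_id = {(s["trace_id"], s["span_id"]): s for s in spans}
    graph = defaultdict(set)
    for span in spans:
        parent = by_id.get((span["trace_id"], span["parent_id"]))
        # Root spans have no parent; calls within one service are not edges.
        if parent is not None and parent["service"] != span["service"]:
            graph[parent["service"]].add(span["service"])
    return {caller: sorted(callees) for caller, callees in graph.items()}


if __name__ == "__main__":
    for caller, callees in dependency_graph(spans).items():
        print(caller, "->", ", ".join(callees))
```

Each trace corresponds to one client invocation, so grouping spans by `trace_id` would recover individual workflows in the same way.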

Objective

One of the limitations we found in our current work is the lack of real systems and good data on which to build the analytic tools. To overcome this problem, our team is working on a tool called "Defektor", which automatically injects failures and collects monitoring, logging and tracing information from the system. For example, the tool might deliberately delay the response to some requests, to enable observation of the effect on other requests. Similarly, it may switch off some parts of the architecture to check whether the overall application can cope with the situation.
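The two kinds of failure mentioned above can be illustrated with a small sketch. This is not how Defektor is implemented; it is a hypothetical in-process approximation in which the names `inject_delay`, `inject_outage` and `ServiceUnavailable` are assumptions made for illustration. One decorator delays a fraction of calls to a request handler, the other makes the handler fail while a simulated outage is active.

```python
import random
import time
from functools import wraps


class ServiceUnavailable(Exception):
    """Raised in place of a response while a simulated outage is active."""


def inject_delay(delay_s, probability, rng=None, sleep=time.sleep):
    """Delay a handler by delay_s seconds with the given probability per call."""
    rng = rng or random.Random()

    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            # rng.random() is in [0, 1): probability 1.0 always delays, 0.0 never.
            if rng.random() < probability:
                sleep(delay_s)
            return handler(*args, **kwargs)
        return wrapper
    return decorator


def inject_outage(is_down):
    """Make a handler fail whenever the is_down() callable returns True."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            if is_down():
                raise ServiceUnavailable(handler.__name__)
            return handler(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical handler: roughly 10% of calls are delayed by 200 ms.
@inject_delay(delay_s=0.2, probability=0.1)
def get_cart(user_id):
    return {"user": user_id, "items": []}
```

Passing `rng` and `sleep` as parameters keeps the injection reproducible and lets experiments run without actually waiting.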

The goal of this internship is to support the development of the Defektor tool. This will involve the following steps:
1. Support the installation of a distributed application, such as Stan's Robot Shop, from Instana.
2. Modify the application and create associated tools to automatically tag logging, monitoring, and tracing data according to the injected failures.
3. Finally, run the system and inject the failures to create tagged data. This will enable us to improve our analytic tools. The analysis we intend to perform on the system will involve machine learning techniques based on multiple metrics.
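Step 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual tagging tool: the `Injection` record, the `tag_records` function, and the `service` and `timestamp` fields are all hypothetical. Each log, metric or trace record is labeled with the failures that were active on its service at the time it was produced, which yields the tagged data that step 3 refers to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Injection:
    """One injected failure (hypothetical schema): target, kind, time window."""
    target: str   # service the failure was injected into
    kind: str     # e.g. "delay" or "outage"
    start: float  # window start, seconds since the start of the experiment
    end: float    # window end (exclusive)


def tag_records(records, injections):
    """Return copies of records with an 'injected_failures' label added."""
    tagged = []
    for record in records:
        active = [
            inj.kind
            for inj in injections
            if inj.target == record["service"]
            and inj.start <= record["timestamp"] < inj.end
        ]
        # An empty list marks a record produced under normal conditions.
        tagged.append({**record, "injected_failures": active})
    return tagged


# Illustrative data only: one delay injected into "cart" between t=10 and t=20.
injections = [Injection(target="cart", kind="delay", start=10.0, end=20.0)]
records = [
    {"service": "cart", "timestamp": 5.0, "latency_ms": 12},
    {"service": "cart", "timestamp": 12.0, "latency_ms": 230},
    {"service": "web", "timestamp": 12.0, "latency_ms": 260},
]
tagged = tag_records(records, injections)
```

Note that this sketch only labels records from the targeted service. Records from other services affected by cascading effects (the `web` record here) stay unlabeled, and deciding how to label them is part of the design work.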

Work Plan - Semester 1

- Study of the existing work and microservice application (2 months)
- Define the set of failures to inject (1 month)
- Define requirements for the analytic tools (1 month)
- Write intermediate report (1 month)

Work Plan - Semester 2

- Perform failure injection to collect data (2 months)
- Implement, improve and fine-tune new and existing analytic tools (2 months)
- Write final report (1 month)

Conditions

- The work should take place at the Centre for Informatics and Systems of the University of Coimbra (CISUC), at the Department of Informatics Engineering of the University of Coimbra. A 3-month scholarship of 745 euros per month, possibly extendable, is foreseen for this work.

Remarks

Supervisors

Filipe Araújo and Raul Barbosa
filipius@uc.pt