Submitted Proposals

DEI - FCTUC
Generated on 2024-07-17 09:38:39 (Europe/Lisbon).

Internship Title

A dataset of a microservices application for research purposes

Areas of Specialization

Software Engineering

Internship Location

DEI/CISUC

Background

Breaking large software systems into smaller, functionally interconnected components is a trend on the rise. This architectural style, known as "microservices", simplifies development, deployment, and management at the expense of increased complexity and reduced observability. Microservice-based architectures and Function-as-a-Service (FaaS) platforms are being favored for the flexibility they afford. This trend is only accelerated by the financial benefits and reduced development times promised by Platform-as-a-Service (PaaS) and serverless deployments. The benefits include faster development cycles, development team independence (because scope is limited and well defined), and ease of deployment, management, scaling, and governance. Microservices and FaaS are the building blocks of modern, highly dynamic distributed systems. We can see this trend in many large international companies, such as LinkedIn, Netflix, Uber, eBay, and Amazon, among many others, but the same trend can be found in important national companies as well.

On the downside, companies face increasing challenges in gaining visibility into the health and performance of their systems, as cloud and microservice architectures become more complex and dynamic, with more moving parts. The complexity moves from the components to their interactions and emergent behaviors. Since developers and operators lack a complete view of the system, the result is impaired observability, which turns debugging and monitoring into a challenge that keeps getting harder as systems grow larger with additional components and version iterations. In large-scale systems, it is particularly difficult to determine the set of microservices responsible for delaying a client's request. While every service involved in such a request might seem to be working properly in isolation, their composition in a long call chain is much more likely to produce slow requests. Cascading effects, where one module impacts several other invoking microservices, are also possible. Components cannot be analyzed in isolation, and system operators lack an overall view of the system to determine bottlenecks and trace their root causes.
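To make this compounding effect concrete, the short Python sketch below (an illustration added for clarity, with made-up numbers, not part of the proposed work) simulates a request that crosses a chain of services, each of which is slow only occasionally:

    # Illustrative sketch: each service misbehaves only rarely, yet a request that
    # crosses many of them is slow far more often, so no single service stands out.
    import random

    SLOW_PROBABILITY = 0.05   # assumed: each service is slow on 5% of its calls
    CHAIN_LENGTH = 10         # assumed: a request traverses 10 services in sequence

    def request_is_slow() -> bool:
        # The end-to-end request is slow if any hop in the chain happens to be slow.
        return any(random.random() < SLOW_PROBABILITY for _ in range(CHAIN_LENGTH))

    trials = 100_000
    slow = sum(request_is_slow() for _ in range(trials))
    print(f"per-service slow fraction: {SLOW_PROBABILITY:.0%}")
    print(f"end-to-end slow fraction:  {slow / trials:.0%}")  # roughly 1 - 0.95**10, about 40%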

Previous literature has shown that failures in large-scale distributed systems, namely in cloud systems, may result from many different sources, such as software faults, misconfigurations, operator errors, network problems, and more. Many faults are immediately mitigated by standard techniques at the microservice or service mesh level. For example, circuit breakers can prevent a service from halting due to very slow downstream services by blocking requests to them; some tools may automatically retry a request that fails for a transient reason; finally, health checks may automatically mark faulty replicas for replacement. However, other failures might evade these simple measures. For example, if a service is replaced by a new version or the old one is discontinued, all the clients depending on the old version will stop working. Another example could be a misconfiguration of a service or a sudden increase in load, preventing the auto-scaler from responding in time, or at all, with additional replicas. In this case, response times may increase sharply, with a very negative impact on the user experience. While these failures evade the simple fault-tolerance mechanisms, their causes are still relatively easy to find with the standard observation schemes in place, as these usually keep tight control of HTTP error status codes, such as 404 (Not Found), and of service response times. In this work, we want to go beyond these simple cases and explore subtler scenarios. In complex applications with long workflows, requests could take too long simply because there are too many chained services; some microservices might not work well with some invocation parameters; software bugs might get activated in some rare workflows; or resources might not be properly released. Cases like these can be very hard to identify with the simple monitoring measures that are often automatically put in place.
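As an illustration of how the first of these mechanisms works, the following minimal Python sketch shows a simplified circuit breaker. It is a didactic example only and does not reproduce the API of any particular service mesh or fault-tolerance library:

    # Minimal circuit-breaker sketch: after `max_failures` consecutive errors the
    # breaker opens and fails fast, shielding callers from a slow or failing
    # downstream service until a cool-down period has elapsed.
    import time

    class CircuitBreaker:
        def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0):
            self.max_failures = max_failures
            self.reset_after_s = reset_after_s
            self.failures = 0
            self.opened_at = None

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_after_s:
                    raise RuntimeError("circuit open: failing fast")
                self.opened_at = None  # half-open: allow one trial call
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()
                raise
            self.failures = 0
            return result

A caller would wrap each downstream invocation in breaker.call(...), so that repeated failures stop reaching the slow or broken service instead of piling up.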

As a consequence, observing microservices is a challenging task requiring considerable research. One of the current problems for academia is the lack of microservice applications and datasets that researchers can use to evaluate their tools and algorithms.

Objective

In this internship, we aim to create a dataset containing information pertaining to a set of failures occurring in a microservice application. The information in this dataset will afterwards be used to train a model to help identify the root cause of failures, but the training itself is outside the scope of this work. The information in the dataset will be obtained by executing a microservice application in which failures are injected. This implies, first, a prior step of identifying the relevant failures representative of this type of application and, second, the development of a case-study application to execute and observe in order to collect the intended data.
Thus, the main tasks in this work are as follows:
- Survey academic works on microservice applications to obtain a list, as complete as possible, of different failure cases, especially non-trivial ones.
- Obtain a case-study microservice-based application to serve as a failure data generator, into which the failures identified in the previous task will be injected in order to obtain relevant data for the dataset. The application may be based on a previously available application, adapted as necessary, or built from scratch.
- Once the case-study application is ready, inject failures following the traditional fault injection experiment methodology, where a target application or system is executed with a representative workload while failures are artificially injected in the application to observe or assess its behavior (see the sketch after this list). The resulting failure data will then be used to build the dataset.
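The sketch below outlines this experiment loop in Python. It is a simplified illustration only; the hook functions (inject_failure, run_workload, collect_metrics), the failure modes, and the record fields are placeholders for this proposal, not a final design:

    # Sketch of the fault injection experiment loop: for each failure mode, activate
    # the failure, exercise the application with a representative workload, collect
    # the observations, and append one record to the dataset.
    import csv
    import time

    FAILURE_MODES = ["none", "added_latency", "resource_leak", "bad_parameter"]  # placeholders

    def run_experiment(inject_failure, run_workload, collect_metrics, out_path="dataset.csv"):
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["failure_mode", "timestamp", "metrics"])
            for mode in FAILURE_MODES:
                inject_failure(mode)          # activate one failure in the target application
                run_workload()                # apply the representative workload
                metrics = collect_metrics()   # e.g. traces, response times, error codes
                writer.writerow([mode, time.time(), metrics])
                inject_failure("none")        # restore the fault-free configuration between runs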

Work Plan - Semester 1

- Understand the domain, evaluate the state of the art and prepare the list of failures (2 months).
- Select and prepare an application that can be used to observe the failures previously identified (2 months).
- Write intermediate report (1 month).

Work Plan - Semester 2

- Implement the necessary changes in the application and supporting tools to allow its execution with injected failures (2 months).
- Test and evaluate the cases in the failures list (1 month).
- Write final report and scientific paper (2 months).

Conditions

The work should take place at the Centre for Informatics and Systems of the University of Coimbra (CISUC), at the Department of Informatics Engineering of the University of Coimbra.

A scholarship of 990.98 euros per month is foreseen for 3 months. The award of the scholarship is subject to a public application process.

Observations

--

Supervisors

Filipe Araújo and João Durães
filipius@uc.pt