Propostas submetidas

Gerado a 2025-07-04 05:33:31 (Europe/Lisbon).

Voltar

Titulo Estágio

Big Data Cloud Services

Áreas de especialidade

Engenharia de Software

Local do Estágio

DEI-FCTUC

Enquadramento

Cloud computing comes in three main forms: Infrastructure as a Service (IaaS), where the service is virtualized hardware; Platform as a Service (PaaS), where the service is virtualized infrastructure software such as a DBMS; and Software as a Service (SaaS), where the service is a virtualized application such as a customer relationship management solution. From a data platform perspective, the ideal goal is a PaaS for data, where users can upload data to the cloud, query it as they do today over their on-premise SQL databases, and selectively share the data and results easily, all without worrying about how many instances to rent, what operating system to run on, how to partition databases across servers, or how to tune them. Despite the emergence of services such as Database.com from Salesforce.com, Big Query from Google, Redshift from Amazon, and Azure SQL Database from Microsoft, we have yet to achieve the full ideal. Here, we outline some of the critical challenges to realize the complete vision of a Data PaaS in the cloud.
In this internship, we propose to assess some of the critical challenges to realize the complete vision of a Data PaaS in the cloud: Elasticity, Data replication, System administration and tuning, Multitenancy, Data sharing, and Hybrid clouds.

Objetivo

In practice, the expected outcomes of this internship are contributions in the critical challenges:
- Elasticity: An open question is whether the same cloud storage service can support both transactions and analytics; how caching best fits into the overall picture is also unclear. To provide elasticity, database engines and analysis platforms in a Data PaaS will need to operate well on top of resources that can be allocated quickly during workload peaks but possibly preempted for users paying for premium service.
- Data replication. Latency across geographically distributed datacenters makes it difficult to keep replicas consistent yet offer good throughput and response time to updates. Multi-master replication is a good alternative, when conflicting updates on different replicas can be automatically synchronized. But the resulting programming model is not intuitive to mainstream programmers. Thus, the challenge is how best to trade-off availability, consistency performance, programmability, and cost.
System administration and tuning. In the world of Data PaaS, database and system administrators simply do not exist. Therefore, all administrative tasks must be automated, such as capacity planning, resource provisioning, and physical data management. Resource control parameters must also be set automatically and be highly responsive to changes in load, such as buffer pool size and admission control limits.
Multitenancy. To be competitive, a Data PaaS should be cheaper than an on-premises solution. This requires providers to pack multiple tenants together to share physical resources to smooth demand and reduce cost. This introduces several problems. First, the service must give security guarantees against information leakage across tenants. This can be done by isolating user databases in separate files and running the database engine in separate virtual machines (VMs). However, this is inefficient for small databases, and makes it difficult to balance resources between VMs running on the same server. An alternative is to have users share a single database and database engine instance. But then special care is needed to prevent cross-tenant accesses. Second, users want an SLA that defines the level of performance and availability they need. Data PaaS providers want to offer SLAs too, to enable tiered pricing. However, it is challenging to define SLAs that are understandable to users and implementable by PaaS providers. The implementation challenge is to ensure performance isolation between tenants, to ensure a burst of demand from one tenant does not cause a violation of other tenants' SLAs.
Data sharing. The cloud enables sharing at an unprecedented scale. One problem is how to support essential services such as data curation and provenance collaboratively in the cloud. Other problems include: how to find useful public data, how to relate self-managed private data with public data to add context, how to find high-quality data in the cloud, how to share data at fine-grained levels, how to distribute costs when sharing computing and data, and how to price data. The cloud also creates new life-cycle challenges, such as how to protect data if the current cloud provider fails and to preserve data for the long term when users who need it have no personal or financial connection to those who provide it. The cloud will also drive innovation in tools for data governance, such as auditing, enforcement of legal terms and conditions, and explanation of user policies.
Hybrid clouds. There is a need for interoperation of database services among the cloud, on-premise servers, and mobile devices. One scenario is off-loading. For example, users may run applications in their private cloud during normal operation, but tap into a public cloud at peak times or in response to unanticipated workload surges. Another is cyber-physical systems, such as the Internet of Things. For example, cars will gather local sensor data, upload some of it into the cloud, and obtain control information in return based on data aggregation from many sources. Cyber-physical systems involve data streaming from multiple sensors and mobile devices, and must cope with intermittent connectivity and limited battery life, which pose difficult challenges for real-time and perhaps mission-critical data management in the cloud.
- A research paper, to be submitted and presented at a top international conference, describing the approach and main results obtained from the experiments.

Plano de Trabalhos - Semestre 1

[Some tasks might overlap; M=Month]
T1 (M1 – M3): Knowledge transfer and state of the art literature review on Data PaaS in the Cloud.
T2 (M3) Design critical mechanisms for Data PaaS in the Cloud, using the information gathered in task T1 as basis.
T3 (M3) Identification of target systems to be used in the experiments.
T4 (M3 – M4) Implementation of a proof of concept prototype.
T5 (M5): Writing the Intermediate report.

Plano de Trabalhos - Semestre 2

[Some tasks might overlap; M=Month]
T1 (M6): Integration of the intermediate defense comments and completion of the Big Data Cloud Services.
T2 (M6 – M7): Implementation of the architecture and critical mechanisms for Data PaaS in the Cloud, and execution of tests (functional).
T3 (M8): Execution of experiments and analysis of results.
T4 (M9): Write a research paper and submission to a top international conference on Big Data and Cloud areas (IEEE Big Data Congress, IEEE Services Computing Conference, IEEE International Conference on Data Engineering – ICDE, etc.).
T5 (M10): Writing the thesis.

Condições

The work will be carried out in the facilities of the Department of Informatics Engineering at the University of Coimbra (CISUC - Software and Systems Engineering Group), where a work place and necessary computer resources will be provided.

Orientador

Jorge Bernardino
jorge@isec.pt 📩