Internship Title
Big Data Open Source Platforms
Areas of Specialization
Software Engineering
Internship Location
CISUC
Background
In the database world, parallel processing of large structured datasets has been a major success, leading to several generations of SQL-based products that are widely used by enterprises. Another success is data warehousing, where database researchers defined the key abstraction of the data cube (for online analytical processing, or OLAP) and strategies for querying it in parallel, along with support for materialized views and replication. The distributed computing field has succeeded in scaling up the processing of less structured data on large numbers of unreliable commodity machines, using constrained programming models such as MapReduce. Higher-level languages have been layered on top so that a broader audience of developers can use scalable big data platforms. The complexity and volume of big data software solutions have increased over the past decade, and one can now find a plethora of solutions targeting almost every aspect of data management. At the same time, setting up and maintaining big data infrastructures has become a task for experts. Today, open source platforms such as Hadoop (with its MapReduce programming model, large-scale distributed file system, and higher-level languages such as Pig and Hive) are seeing rapid adoption for processing less structured data, even in traditional enterprises. Newer platforms include Drill, Presto, and Spark.
In this internship, we intend to test and evaluate Big Data Open Source Platforms.
Objective
In practice, the expected outcomes of this internship are:
- Evaluate the existing Big Data Open Source Platforms using a benchmark.
- Test the best Big Data Open Source Platform in a real environment.
- Propose an integration architecture for Big Data.
- Write a research paper, to be submitted to and presented at a top international conference, describing the approach and the main results obtained from the experiments.
Work Plan - Semester 1
[Some tasks might overlap; M=Month]
T1 (M1 – M3): Knowledge transfer and state-of-the-art literature review on Open Big Data Infrastructures.
T2 (M3): Design of integration techniques for Big Data Infrastructures, using the information gathered in task T1 as a basis.
T3 (M3): Identification of the benchmark and target systems to be used in the experiments.
T4 (M3 – M4): Implementation of a proof-of-concept prototype.
T5 (M5): Writing the intermediate report.
Work Plan - Semester 2
[Some tasks might overlap; M=Month]
T6 (M6): Incorporation of the comments from the intermediate defense and completion of the Open Data Infrastructure for Big Data Systems.
T7 (M6 – M7): Implementation of the architecture for Big Data systems and execution of functional tests.
T8 (M8): Execution of experiments and analysis of results.
T9 (M9): Writing a research paper and submitting it to a top international conference in the database area (IEEE Big Data Congress, Database Systems for Advanced Applications - DASFAA, IEEE International Conference on Data Engineering - ICDE, etc.).
T10 (M10): Writing the thesis.
Conditions
The work will be carried out at the facilities of the Department of Informatics Engineering at the University of Coimbra (CISUC - Software and Systems Engineering Group), where a workspace and the necessary computing resources will be provided.
Supervisor
Jorge Bernardino
jorge@isec.pt