Titulo Estágio
Approximate Query Answering for Big Data Analytics
Áreas de especialidade
Engenharia de Software
Local do Estágio
CISUC
Enquadramento
Today, most modern online services make use of big data analytics systems to extract useful information from the raw digital data. The data normally arrives in huge volumes at a high speed. The cost of handling this massive data can be significant. Providing interactive latency in processing big data is often impractical due to the fact that the data is growing exponentially. Approximate computing has recently emerged as a promising solution to overcome this problem. Approximate computing is based on the observation that many modern applications are amenable to an approximate, rather than the exact output. Unlike traditional computing, approximate computing tolerates lower accuracy to achieve lower latency by computing over a partial subset instead of the entire input data.
The area of online aggregation [1] studies new algorithms that allow to return approximate results (with statistical guarantees) for analytical queries at early stages of the processing, so that the user can terminate it as soon as the accuracy is acceptable. Recent studies have shown encouraging results [2, 3, 4], but there is still much room for improvement:
(1) The existing algorithms have only used simple random sampling or sample random walks to sample from query results. More sophisticated techniques based on Markov Chain Monte Carlo might be more effective.
(2) The streaming algorithms community has developed many techniques to summarize large data sets into compact data structures while preserving properties of the data. These summarization techniques can also be useful in approximate query processing.
(3) Integrating these techniques into Big data processing engines is still a significant challenge.
In this internship, we intend to propose and evaluate algorithms and techniques to approximate query answering in Big Data systems.
References
[1] P. J. Haas and J. M. Hellerstein. Ripple joins for online aggregation. In SIGMOD, pages 287–298. ACM, 1999.
[2] J. Bernardino, P. Furtado, H. Madeira. Approximate Query Answering Using Data Warehouse Striping. Journal of Intelligent Information Systems (JIIS). 19(2): 145-167 (2002)
[3] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD, pages 171–182. ACM, 1997.
[4] F. Li, B. Wu, K. Yi, and Z. Zhao. Wander join: Online aggregation via random walks. In International Conference on Management of Data (SIGMOD), pages 615–629. ACM, 2016.
Objetivo
In practice, the expected outcomes of this internship are:
- Evaluate the existing algorithms that use simple random sampling or sample random walks to sample from query results.
- Evaluate summarization techniques used by streaming community in approximate query processing.
- Adapt these approximate query answering techniques for Big Data systems.
- Integrating these techniques into Big data processing engines such as Spark.
- A research paper, to be submitted and presented at a top international conference, describing the approach and main results obtained from the experiments.
Plano de Trabalhos - Semestre 1
Plano de Trabalhos 1º Semestre
[Some tasks might overlap; M=Month]
T1 (M1 – M3): Knowledge transfer and state of the art literature review on approximate query answering.
T2 (M3) Design new algorithms and techniques for Big Data Systems, using the information gathered in task T1 as basis.
T3 (M3) Identification of target systems to be used in the experiments.
T4 (M3 – M4) Implementation of a proof of concept prototype.
T5 (M5): Writing the Intermediate report.
Plano de Trabalhos - Semestre 2
Plano de Trabalhos 2º Semestre
[Some tasks might overlap; M=Month]
T6 (M6): Integration of the intermediate defense comments and completion of approximate query answering algorithms and techniques for Big Data Systems.
T7 (M6 – M7): Integrating these techniques into Big data processing engines.
T8 (M8): Execution of experiments and analysis of results.
T9 (M9): Write a research paper and submission to a top international conference on Database area (IEEE Big Data Congress, Database Systems for Advanced Applications - DASFAA, IEEE International Conference on Data Engineering – ICDE, etc.).
T10 (M10): Writing the thesis.
Condições
Plano de Trabalhos 2º Semestre
[Some tasks might overlap; M=Month]
T6 (M6): Integration of the intermediate defense comments and completion of approximate query answering algorithms and techniques for Big Data Systems.
T7 (M6 – M7): Integrating these techniques into Big data processing engines.
T8 (M8): Execution of experiments and analysis of results.
T9 (M9): Write a research paper and submission to a top international conference on Database area (IEEE Big Data Congress, Database Systems for Advanced Applications - DASFAA, IEEE International Conference on Data Engineering – ICDE, etc.).
T10 (M10): Writing the thesis.
Observações
The work will be carried out in the facilities of the Department of Informatics Engineering at the University of Coimbra (CISUC - Software and Systems Engineering Group), where a work place and necessary computer resources will be provided.
Orientador
Jorge Bernardino; Bruno Cabral
jorge@isec.pt 📩