Internship Title
Finding the Critical Feature Dimension of Big Datasets
Areas of Specialization
Intelligent Systems
Internship Location
Laboratório de Computação Adaptativa (LARN)
Background
Recent research on big data mining has revealed that many massive datasets that exist or are being constructed today possess a Critical Feature Dimension: the minimum number of features required for a given data analytic task (e.g., a machine learning process) executed on the dataset to meet a given performance threshold. Consequently, for data mining purposes, the data can be reduced by finding the critical dimension and identifying the features that constitute a critical-dimension feature set. Preliminary experiments indicated that, for some large datasets, the critical dimension may be much smaller than the original number of features, allowing very significant data reduction.
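One way to state this definition formally is sketched below. The symbols D, F, A, P, \tau and \mu are notation introduced here for illustration only; they are not taken from the project description, and the sketch assumes that a higher value of P means better performance.

```latex
% Assumed notation (not from the project description):
%   D      dataset with feature set F
%   A      data analytic task, e.g. a machine learning algorithm
%   D_S    D restricted to the feature subset S
%   P      performance measure (higher is better), \tau the required threshold
\[
  \mu(D, A, \tau) \;=\; \min \bigl\{\, |S| \;:\; S \subseteq F,\ P(A, D_S) \ge \tau \,\bigr\}
\]
```

Under this reading, a subset S that attains the minimum is a critical-dimension feature set. If no subset reaches \tau, not even S = F, the set is empty and the dataset has no critical feature dimension for that task and threshold.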
Objective
The project applies heuristic methods, developed specifically for studying this issue, in experiments with various machine learning and feature ranking algorithms on several massive datasets, in order to:
(1) verify whether a dataset possesses a critical feature dimension;
(2) study the effect of different combinations of machine learning and feature ranking algorithms on the critical feature dimension;
(3) compare the performance of data mining on the complete dataset vs. the reduced dataset containing only the critical features;
(4) design other heuristic algorithms for determining the critical feature dimension of datasets.
Many other research directions and objectives can be formulated and pursued, within the current research framework or as extensions of it.
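As an illustration of objective (1), the sketch below scans the prefixes of a feature ranking and returns the first size that meets the threshold. It is a minimal example, not the heuristic used in the previous experiments, which the description does not specify. The names `critical_feature_dimension`, `toy_evaluate` and `TOY_GAIN` are hypothetical, and the evaluation function is a made-up stand-in for training and scoring a real model.

```python
from typing import Callable, Optional, Sequence


def critical_feature_dimension(
    ranked_features: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
    threshold: float,
) -> Optional[int]:
    """Return the smallest k whose top-k ranked features meet the threshold.

    Returns None when no prefix of the ranking reaches the threshold.
    """
    for k in range(1, len(ranked_features) + 1):
        if evaluate(ranked_features[:k]) >= threshold:
            return k
    return None


# Toy stand-in for "train a model on these features and score it":
# each feature contributes a fixed, made-up amount of performance.
TOY_GAIN = {"f1": 0.50, "f2": 0.25, "f3": 0.125, "f4": 0.0625}


def toy_evaluate(features: Sequence[str]) -> float:
    return sum(TOY_GAIN[name] for name in features)


ranking = ["f1", "f2", "f3", "f4"]  # hypothetical output of a feature ranker
cfd = critical_feature_dimension(ranking, toy_evaluate, threshold=0.85)
```

The linear scan does not assume that performance grows monotonically with the number of features. Because only prefixes of one ranking are tried, the returned value is an upper bound on the critical feature dimension rather than a guaranteed minimum over all feature subsets.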
Work Plan - Semester 1
• Study the existing heuristic method used in previous experiments, and design alternative methods
• Study and select machine learning (ML) algorithms and feature selection (FS) algorithms for use in experiments
• Select, prepare, and preprocess a collection of large datasets for experiments
• Write the intermediate report
Work Plan - Semester 2
• Conduct experiments using various combinations of ML and FS algorithms to determine the critical feature dimension (CFD) of each dataset
• Analyze experimental results: e.g., compare discovered CFD values with previous results; observe the effects of the ML and FS algorithms on the CFD values; compare the performance of the reduced datasets with previous results
• Write a scientific article
• Write the thesis
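The experiments over combinations of ML and FS algorithms in the plan above could be organized as a simple grid, sketched below under stated assumptions. The function `run_grid`, the `WEIGHT` table, and the ranker and scorer entries are hypothetical placeholders; real experiments would plug in actual feature selection and learning code.

```python
from itertools import product
from typing import Callable, Dict, Optional, Sequence, Tuple

Ranker = Callable[[Sequence[str]], Sequence[str]]  # orders the features
Scorer = Callable[[Sequence[str]], float]          # scores a feature subset


def run_grid(
    features: Sequence[str],
    rankers: Dict[str, Ranker],
    scorers: Dict[str, Scorer],
    threshold: float,
) -> Dict[Tuple[str, str], Optional[int]]:
    """Map each (FS name, ML name) pair to the smallest qualifying prefix size."""
    results: Dict[Tuple[str, str], Optional[int]] = {}
    for (fs_name, rank), (ml_name, score) in product(rankers.items(), scorers.items()):
        ordered = list(rank(features))
        sizes = range(1, len(ordered) + 1)
        # First prefix size meeting the threshold, or None if there is none.
        results[(fs_name, ml_name)] = next(
            (k for k in sizes if score(ordered[:k]) >= threshold), None
        )
    return results


# Made-up stand-ins for FS and ML algorithms, for illustration only.
WEIGHT = {"a": 0.125, "b": 0.5, "c": 0.25, "d": 0.0625}
rankers = {
    "by_weight": lambda fs: sorted(fs, key=WEIGHT.get, reverse=True),
    "alphabetical": lambda fs: sorted(fs),
}
scorers = {
    "additive": lambda fs: sum(WEIGHT[f] for f in fs),
    "best_single": lambda fs: max(WEIGHT[f] for f in fs),
}

table = run_grid(["a", "b", "c", "d"], rankers, scorers, threshold=0.75)
```

Keeping the results keyed by the (FS, ML) pair makes it straightforward to tabulate how each choice affects the CFD value, and a `None` entry records that the pair never reached the threshold.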
Conditions
This work will be carried out in the Laboratory of Neural Networks (LARN) of CISUC, with regular supervision and feedback from the supervisor.
Familiarity with machine learning algorithms and software tools is essential. Participating students will acquire valuable knowledge and experience in mining massive datasets, skills that are currently in high demand among technology employers because of their relevance to many applications.
Remarks
Opportunities may become available in summer 2017 for a short-term paid visit to a U.S. university to collaborate on the project.
Logistics @Laboratory of Neural Networks (LARN)
DEI-FCTUC
Supervisors
Bernardete Ribeiro, Cesar Teixeira (bribeiro@dei.uc.pt, cteixei@dei.uc.pt)