Internship Title
Techniques for Analyzing and Curating Datasets to Improve Data Quality in Machine Learning Pipelines
Areas of Specialization
Intelligent Systems
Internship Location
DEI
Background
Machine learning (ML) systems are only as good as the data they are trained on. In many real-world applications, datasets suffer from various quality issues such as noise, missing values, mislabeled instances, class imbalance, and redundancies. These problems not only degrade model performance but can also lead to biased or unreliable results. Despite this, most ML efforts still concentrate primarily on model architecture and hyperparameter tuning, often overlooking the critical stage of data curation.
Data curation is the systematic process of analyzing, cleaning, validating, enriching, and maintaining datasets to ensure their quality and utility. As data becomes increasingly central to AI development, there is growing recognition of the need for robust techniques to analyze and curate datasets before model training.
Objective
The primary goal of this thesis is to explore and apply techniques for the analysis and curation of datasets in order to improve data quality and model performance.
Specific Objectives:
- Review and classify techniques used for dataset analysis and curation, including error detection, noise filtering, class balancing, and duplicate removal.
- Select and study real-world datasets with common issues (e.g., class imbalance, mislabeled samples, incomplete data).
- Implement a toolkit or pipeline that incorporates multiple curation techniques.
- Evaluate the impact of each curation step on data quality and model performance using standard ML tasks.
- Propose a methodology and guidelines for practitioners on how to approach dataset analysis and curation.
Work Plan - Semester 1
Months 1–2: Literature Review
- Study existing literature on data quality, curation techniques, and data-centric AI.
- Review tools and libraries (e.g., Cleanlab, Great Expectations, DataPrep, Snorkel).
Months 3–4: Dataset Selection and Preliminary Analysis
- Choose 1–2 datasets from different domains.
- Identify and document quality issues in the datasets (missing values, noisy labels, imbalances, outliers).
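
As an illustration of the preliminary analysis step, the sketch below counts missing values, class frequencies, and exact duplicates for a small tabular dataset. It is only a minimal, dependency-free example: the function name `profile_dataset`, the list-of-dicts input format, and the report keys are assumptions made for this sketch and are not part of the proposal.

```python
from collections import Counter
import math


def profile_dataset(rows, label_key):
    """Summarise missing values, class balance, and exact duplicates.

    rows: list of dicts, one per sample (hypothetical input format).
    A value counts as missing when it is None or a float NaN.
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    # Count missing entries per column.
    missing = Counter()
    for row in rows:
        for key, value in row.items():
            if is_missing(value):
                missing[key] += 1

    # Class frequencies over samples that have a label.
    labels = Counter(
        row[label_key] for row in rows if not is_missing(row.get(label_key))
    )
    # Imbalance ratio: majority class count divided by minority class count.
    ratio = max(labels.values()) / min(labels.values()) if labels else None

    # Exact duplicates: rows whose full content has already been seen.
    # repr() is used so that unhashable or NaN values can still be compared.
    seen = Counter(
        tuple(sorted((k, repr(v)) for k, v in row.items())) for row in rows
    )
    duplicates = sum(count - 1 for count in seen.values())

    return {
        "n_rows": len(rows),
        "missing_per_column": dict(missing),
        "class_counts": dict(labels),
        "imbalance_ratio": ratio,
        "n_exact_duplicates": duplicates,
    }


rows = [
    {"x": 1.0, "y": "a"},
    {"x": None, "y": "a"},
    {"x": 1.0, "y": "a"},
    {"x": 2.0, "y": "b"},
]
report = profile_dataset(rows, "y")
```

A report of this kind could serve as the documented record of quality issues for each selected dataset; noisy labels and outliers need dedicated methods and are not covered by this sketch.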
Months 5–6: Baseline Experiments
- Train ML models on the raw datasets and evaluate them using standard metrics (accuracy, F1, AUC, etc.).
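
To show why the baseline should report more than one metric, the following sketch computes accuracy and macro-averaged F1 from lists of true and predicted labels. The helper names `accuracy` and `macro_f1` are assumptions of this example; in practice an established library implementation would likely be used instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for c in set(y_true) | set(y_pred):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A classifier that always predicts the majority class on a 9:1 dataset.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 0.9 while macro F1 is below 0.5, because the minority class is never predicted. This is one reason a baseline on imbalanced data benefits from reporting several metrics side by side.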
Work Plan - Semester 2
Months 1–2: Implementation of Curation Techniques
- Apply various curation techniques such as:
  - Data profiling and statistical analysis
  - Label error detection and correction
  - Imbalance correction (e.g., SMOTE, class weighting)
  - Outlier detection (e.g., isolation forest, z-score)
  - Duplicate removal and feature consistency checks
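
Two of the simpler techniques listed above, z-score outlier detection and class weighting, can be sketched with the standard library alone. The function names and the threshold of 3.0 are assumptions of this example. The weighting formula, n_samples / (n_classes * class_count), is the heuristic scikit-learn documents for `class_weight="balanced"`. SMOTE and isolation forest are normally taken from libraries such as imbalanced-learn and scikit-learn and are not reproduced here.

```python
from collections import Counter
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return indices of values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing can be flagged
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}


# Twenty ordinary readings and one extreme value at index 20.
values = [10.0] * 20 + [100.0]
outlier_idx = zscore_outliers(values)

# Three samples of class "a" and one of class "b".
weights = balanced_class_weights(["a", "a", "a", "b"])
```

One caveat worth noting: the z-score rule assumes roughly symmetric, unimodal features, and with a population standard deviation the largest possible z-score in a sample of n points is sqrt(n - 1), so a threshold of 3.0 cannot flag anything when n is 10 or smaller.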
Months 3–4: Re-training and Comparative Evaluation
- Retrain models on curated datasets.
- Compare new results against the baseline.
- Analyze the contribution of each technique to performance improvement.
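
One possible way to attribute improvement to individual techniques is a leave-one-out ablation: run the full curation pipeline, then rerun it with each step removed, and compare the scores. The sketch below is a hypothetical harness, not the method the thesis commits to. The names `ablation_report`, `steps`, and `evaluate` are assumptions, and `evaluate` stands in for whatever retrain-and-score routine is eventually used.

```python
def ablation_report(data, steps, evaluate):
    """Score the pipeline with no steps, all steps, and each step left out.

    steps: dict mapping a step name to a function data -> curated data,
           applied in insertion order.
    evaluate: function data -> score (hypothetical; in practice this would
              retrain a model on the data and return a validation metric).
    """
    def run(active):
        curated = data
        for name, step in steps.items():
            if name in active:
                curated = step(curated)
        return evaluate(curated)

    all_steps = set(steps)
    report = {"baseline": run(set()), "full": run(all_steps)}
    for name in steps:
        report[f"without_{name}"] = run(all_steps - {name})
    return report


# Toy usage: the "score" is simply the number of remaining samples.
toy_data = [1, 1, 2, None, 100]
toy_steps = {
    "drop_missing": lambda d: [v for v in d if v is not None],
    "dedupe": lambda d: list(dict.fromkeys(d)),
}
toy_report = ablation_report(toy_data, toy_steps, evaluate=len)
```

The difference between `full` and each `without_<step>` entry gives a rough estimate of that step's contribution. Leave-one-out ablation does not capture interactions between steps, so the individual contributions need not sum to the overall gain.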
Month 5: Toolkit and Methodology Documentation
- Develop a reproducible, modular pipeline or script set.
- Write best practice guidelines for dataset analysis and curation.
Month 6: Thesis Writing and Defense Preparation
Conditions
n/a
Remarks
n/a
Supervisor
Pedro Henriques Abreu/Miriam Seoane Santos
pha@dei.uc.pt