Internship Title
Towards Fair and Privacy-Preserving Oversampling: A Novel SMOTE Variant Handling Outliers and Skewed Distributions
Areas of Specialization
Intelligent Systems
Internship Location
DEI
Background
Synthetic Minority Over-sampling Technique (SMOTE) and its variants have been widely used to address class imbalance in supervised learning. However, standard SMOTE approaches typically disregard three crucial aspects in modern data science:
-Fairness – Ensuring that synthetic data do not amplify or introduce biases with respect to sensitive attributes (e.g., gender, race).
-Privacy – Protecting individuals' data from being reverse-engineered or exposed through oversampling.
-Data Complexity – Handling real-world challenges such as outliers and skewed distributions that can compromise both the quality and utility of synthetic data.
This thesis proposes to develop a new oversampling algorithm that extends the SMOTE family to explicitly address fairness, privacy preservation, and data complexity (outliers and skewed distributions).
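For reference, the interpolation step that the SMOTE family builds on can be sketched as follows. This is a minimal illustration of the baseline idea only, not the algorithm proposed here; the function name `smote_sketch` and its parameters are hypothetical, and it assumes numeric features and at least two minority samples.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (hypothetical helper, baseline SMOTE idea only)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other samples

    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                    # seed samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                             # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])
```

Because every synthetic point lies on a segment between two real records, an outlier in the minority class can pull new samples into sparse regions, and a synthetic point with a small gap sits very close to a real individual. These are the data-complexity and privacy concerns listed above.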
Objectives
1-Literature Review: Analyze state-of-the-art oversampling algorithms, especially those addressing fairness and privacy.
2-Bias and Privacy Risk Analysis: Understand how current SMOTE variants may introduce bias or leak private information.
3-Algorithm Design: Develop a new oversampling method incorporating:
3.1 Fairness constraints (e.g., equalized odds, demographic parity)
3.2 Privacy-preserving mechanisms (e.g., differential privacy or distance-based privacy filtering)
3.3 Robustness to outliers and capability to handle skewed data distributions
4-Experimental Validation: Evaluate the proposed method on real-world datasets from multiple domains.
5-Comparison with Baselines: Benchmark against existing SMOTE variants and fairness-aware models using metrics for performance, fairness, and privacy.
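The two fairness criteria named in item 3.1 can be measured as gaps between groups. The sketch below is one possible formulation, assuming binary predictions, a binary sensitive attribute coded 0/1, and that every group contains both true labels; the function names are hypothetical and not taken from any specific library.

```python
import numpy as np

def demographic_parity_difference(y_pred, group):
    """Absolute gap in positive-prediction rate between group 0 and group 1."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equalized_odds_difference(y_true, y_pred, group):
    """Largest gap between the groups in true-positive or false-positive rate."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    gaps = []
    for label in (1, 0):  # label 1 gives the TPR gap, label 0 the FPR gap
        rates = [y_pred[(group == g) & (y_true == label)].mean() for g in (0, 1)]
        gaps.append(abs(rates[0] - rates[1]))
    return max(gaps)

# Toy example: four individuals per group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]
dp_gap = demographic_parity_difference(y_pred, group)
eo_gap = equalized_odds_difference(y_true, y_pred, group)
```

In this toy example both gaps equal 0.5; a value of 0 would mean the criterion is satisfied exactly. Comparing these gaps for a classifier trained before and after oversampling is one way to check whether the synthetic data amplify bias.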
Work Plan - Semester 1
T1. Literature Survey: Study existing SMOTE variants, fairness-aware ML, and privacy-preserving data generation
T2. Problem Formalization: Define fairness and privacy metrics relevant to oversampling; specify dataset requirements
T3. Exploratory Data Analysis: Analyze selected datasets for imbalance, skew, outliers, and bias
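The checks in T3 could start from simple per-dataset and per-feature statistics. The following is a sketch under assumed conventions (majority/minority count ratio for imbalance, the biased Fisher-Pearson coefficient for skew, Tukey's 1.5 × IQR rule for outliers); the helper names are hypothetical and other definitions are equally valid.

```python
import numpy as np

def imbalance_ratio(y):
    """Majority-class count divided by minority-class count."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.min()

def sample_skewness(x):
    """Biased Fisher-Pearson skewness: third central moment over std cubed."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

def iqr_outlier_mask(x, factor=1.5):
    """Flag values outside [Q1 - factor*IQR, Q3 + factor*IQR] (Tukey's rule)."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - factor * iqr) | (x > q3 + factor * iqr)
```

For example, a label vector with 90 negatives and 10 positives has an imbalance ratio of 9, and in the feature values `[1, 2, 3, 4, 100]` only `100` is flagged as an outlier.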
Work Plan - Semester 2
T4. Algorithm Development: Design and implement the novel oversampling algorithm
T5. Privacy & Fairness Integration: Integrate privacy-preserving techniques (e.g., DP-noise) and fairness-aware generation strategies
T6. Experimental Setup: Define benchmarks, performance metrics, fairness/privacy evaluation tools
T7. Evaluation and Tuning: Run experiments and refine the method based on empirical findings
T8. Documentation & Thesis Writing: Report results, prepare thesis manuscript, and finalize documentation
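As an illustration of the distance-based privacy filtering mentioned among the candidate mechanisms for T5, the sketch below discards synthetic points that fall too close to any real record. It is a heuristic under assumed conventions (Euclidean distance, features already on comparable scales, a user-chosen threshold `min_dist`), the function name is hypothetical, and it does not provide a formal differential-privacy guarantee.

```python
import numpy as np

def distance_privacy_filter(X_real, X_syn, min_dist):
    """Keep only synthetic rows whose nearest real record is at least
    min_dist away; also return each row's distance to its closest record."""
    X_real = np.asarray(X_real, dtype=float)
    X_syn = np.asarray(X_syn, dtype=float)
    # Distance from every synthetic point to every real record.
    d = np.linalg.norm(X_syn[:, None, :] - X_real[None, :, :], axis=2)
    closest = d.min(axis=1)
    keep = closest >= min_dist
    return X_syn[keep], closest

X_real = np.array([[0.0, 0.0], [1.0, 1.0]])
X_syn = np.array([[0.0, 0.01], [0.5, 0.5], [1.0, 1.0]])
kept, closest = distance_privacy_filter(X_real, X_syn, min_dist=0.1)
```

Here only the midpoint `[0.5, 0.5]` is kept: the first synthetic point is 0.01 away from a real record and the third is an exact copy of one. The distribution of `closest` can also be reported as a privacy metric in the experimental setup (T6).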
Conditions
n/a
Remarks
Supervisors:
• Pedro Abreu
• Penousal Machado
Supervisor
Pedro Henriques Abreu/Penousal Machado
pha@dei.uc.pt