Data Science and Machine Learning
(DAT)
- Coefficient: 3
- Total hours: 72h (including 42h supervised)
- Lab: 42h supervised
- Personal work outside scheduled hours: 30h
- Of which project: 21h supervised and 7h unsupervised
List of AATs
Description
This course builds on the Data Analysis course from Semester 7, moving from an inferential approach (modelling to understand, estimating parameters and testing hypotheses) to a predictive and algorithmic approach (learning from data to generalise to new observations). The course covers supervised learning using two complementary sets of tools: classical machine learning with scikit-learn (a complete pre-processing pipeline, linear and logistic regression from a predictive perspective with Ridge and Lasso regularisation, k-nearest neighbours, decision trees, random forests) and deep learning with PyTorch (deep neural networks: backpropagation mechanism, optimisation algorithms, activation functions, weight initialisation, regularisation techniques, training diagnostics). The systematic comparison between classical models and neural networks on identical problems enables students to develop an informed judgement regarding the choice of method.
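As an illustration of the predictive perspective described above, here is a minimal sketch of a scikit-learn pipeline combining pre-processing, Ridge regularisation and cross-validation. The synthetic dataset, the `alpha` value and the number of folds are assumptions chosen for the example, not course material.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, used only for illustration (assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold
# and never sees the corresponding validation fold.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# 5-fold cross-validated R^2 estimates how well the model generalises
# to observations it was not trained on.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
mean_r2 = scores.mean()
print(f"mean R^2 over 5 folds: {mean_r2:.3f}")
```

Replacing `Ridge` with `Lasso`, or with a tree-based estimator, leaves the rest of the pipeline unchanged. That makes it straightforward to compare several methods on the same problem.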
Learning Outcomes (AAv)
AAv1 [hours: 12, C1, B2, B3]: design and implement a complete supervised machine learning pipeline using scikit-learn
AAv2 [hours: 14, B2, B3, C1]: design and train a dense neural network using PyTorch, with an in-depth understanding of how it works
AAv3 [hours: 16, D3, D4, B3, C1]: rigorously evaluate a model’s performance, diagnose sources of error, and complete a personal end-to-end supervised learning project
Assessment methods
- Interim assessments on technical concepts (AAv1 and AAv2) during practical workshops
- Personal project (AAv3) assessed on the final submission (notebook, versioned repository) and an oral presentation
- Cross-disciplinary assessment via a critical validation test: using a notebook containing methodological errors (data leakage, inappropriate metrics, undiagnosed overfitting, choice of model or configuration inconsistent with the data), the student identifies the errors, justifies their identification and proposes appropriate corrections
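To make one of the error types listed above concrete, here is a minimal sketch of data leakage during standardisation, written in plain Python. The data, the 80/20 split and the `standardise` helper are hypothetical and serve only as an illustration. They are not taken from the actual test notebook.

```python
import random
import statistics

random.seed(0)

# Hypothetical one-feature dataset, split 80/20 into train and test.
values = [random.gauss(50.0, 10.0) for _ in range(100)]
train, test = values[:80], values[80:]

# Leaky version: the statistics are computed on ALL the data, so information
# from the test set influences the pre-processing.
leaky_mean = statistics.fmean(values)
leaky_std = statistics.pstdev(values)

# Correct version: the statistics come from the training set only.
train_mean = statistics.fmean(train)
train_std = statistics.pstdev(train)


def standardise(xs, mean, std):
    """Centre and scale a list of values with the given statistics."""
    return [(x - mean) / std for x in xs]


# The training statistics are reused, unchanged, on the test set.
train_scaled = standardise(train, train_mean, train_std)
test_scaled = standardise(test, train_mean, train_std)

print(f"leaky mean: {leaky_mean:.3f}, train-only mean: {train_mean:.3f}")
```

The same principle applies to any fitted pre-processing step (imputation, encoding, feature selection): it is fitted on the training data and then applied as-is to validation and test data.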
Keywords
- Supervised machine learning
- Deep learning, dense neural networks
- Backpropagation, gradient descent, optimisation (SGD, Adam)
- Regularisation (dropout, batch normalisation, weight decay, early stopping)
- Weight initialisation, activation functions
- Regression (Ridge, Lasso), classification, random forests
- scikit-learn, PyTorch, pandas
- Cross-validation, evaluation metrics, training diagnostics
- Data science project
Prerequisites
- S7 Data Analysis course (inferential statistics, linear regression from an inferential perspective, hypothesis testing, statistical quality assessment)
- Linear algebra (matrix product, dot product) and calculus (derivative, gradient, chain rule for backpropagation)
- Python programming, basics of NumPy, pandas and matplotlib
- Basic concepts of algorithms
