Overall Goal:
-----------------
To investigate how the initial choice of training data affects the performance and reliability of active learning methods, and to identify which methods are more robust when early data are biased or unrepresentative.
Details:
---------
Machine-learning models often start with only a small amount of labeled data, which is then expanded gradually as new data are collected. If the new data are selected strategically to best inform the model, this process is called “active learning”. In practice, the very first samples used for training can strongly influence the success of active learning methods and of the model itself. If these early samples are biased, incomplete, or unrepresentative, the model may perform poorly, even if more data are added later.
This project studies how sensitive different data-efficient learning strategies are to their initial training data. The student will systematically construct different types of “problematic” initial datasets (for example, datasets that miss certain regions of the feature space, over-represent common outcomes, or exclude rare but important cases) and measure how these choices affect learning performance.
Depending on the student’s interests, the project can focus on classification tasks, regression tasks (predicting continuous values), or both. We will start with a couple of sufficiently large but manageable datasets and use them to simulate active learning cycles under different strategies for drawing the initial sample.
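As a rough illustration of what such a simulation could look like, the sketch below compares an unbiased initial sample with one that misses a region of the feature space, and then runs a few active learning rounds from each. It is only a minimal sketch under stated assumptions: the synthetic two-class data, the nearest-centroid model, the uncertainty-based query rule, and all function names are illustrative choices made here, not the project's prescribed setup.

```python
import random

def make_data(n, rng):
    """Synthetic data (assumption): the class shifts the first feature only."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = (rng.gauss(2.0 * y - 1.0, 1.0), rng.gauss(0.0, 1.0))
        data.append((x, y))
    return data

def centroids(labeled):
    """Nearest-centroid model (assumption): per-class mean of labeled points."""
    model = {}
    for label in (0, 1):
        pts = [x for x, y in labeled if y == label]
        model[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return model

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def accuracy(model, data):
    """Fraction of points whose nearest centroid matches their label."""
    hits = 0
    for x, y in data:
        pred = min(model, key=lambda label: sq_dist(model[label], x))
        hits += pred == y
    return hits / len(data)

def initial_sample(pool, per_class, biased):
    """Take `per_class` points per class; if biased, only where x[1] < 0,
    so the upper half of the feature space is missing from the start."""
    chosen = []
    for label in (0, 1):
        candidates = [i for i, (x, y) in enumerate(pool)
                      if y == label and (not biased or x[1] < 0.0)]
        chosen.extend(candidates[:per_class])
    return chosen

def run_active_learning(pool, test, init_idx, rounds):
    """Each round, label the pool point closest to the decision boundary
    (a simple form of uncertainty sampling) and record test accuracy."""
    labeled_idx = set(init_idx)
    history = []
    for _ in range(rounds):
        model = centroids([pool[i] for i in labeled_idx])
        history.append(accuracy(model, test))
        unlabeled = [i for i in range(len(pool)) if i not in labeled_idx]
        query = min(unlabeled, key=lambda i: abs(
            sq_dist(model[0], pool[i][0]) - sq_dist(model[1], pool[i][0])))
        labeled_idx.add(query)
    model = centroids([pool[i] for i in labeled_idx])
    history.append(accuracy(model, test))
    return history

rng = random.Random(0)  # fixed seed so the comparison is repeatable
pool, test = make_data(300, rng), make_data(300, rng)
for biased in (False, True):
    init = initial_sample(pool, per_class=3, biased=biased)
    curve = run_active_learning(pool, test, init, rounds=20)
    name = "biased" if biased else "unbiased"
    print(name, [round(a, 2) for a in curve[::5]])
```

Comparing the two printed accuracy curves gives a first impression of how much the starting sample matters for one query strategy. In the project itself, the same loop would be repeated over real datasets, several kinds of bias, and several active learning methods.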
Supervisors
Primary Supervisor: Katharina Dost
Key qualifications and skills
- Background in data science, statistics, or machine learning
- Comfortable working in Python
- Prior knowledge of advanced machine-learning research topics is not required
- Passionate about research and data
Does the project come with funding
No - Student must be self-funded
Final date for receiving applications
Ongoing
How to apply
Apply by email to the primary supervisor with a CV, transcript, and motivation letter
Keywords
machine learning; AI; active learning