## School of Mathematics and Statistics

Project Number: 2019-9

Project Leader: Rachael Tappenden

Host Department: School of Mathematics and Statistics

Project Title: Algorithms for Data Science Applications

Project outline: Optimisation plays a crucial role in many modern, real-world applications, and its study is ever more important in this data science era. At the heart of many data science problems lies empirical risk minimisation (ERM), where the goal is to determine the underlying input given some (possibly noisy) output data. Mathematically, ERM can be posed as an optimisation problem whose objective is the sum of a 'data fidelity' term (which attempts to match the solution to the observed data) and a regularisation term that captures prior knowledge about the solution (such as known structural properties of the problem and/or solution). The formulation is general, in the sense that many particular data science and machine learning problems fit the ERM framework. Examples include compressed sensing (where the 1-norm is used to enforce sparsity in the solution), binary logistic regression (where the goal is to classify data points into two groups), and group lasso (a variant of compressed sensing that has applications in genetics). It is inherently challenging to devise algorithms for such problems because of (1) the generality of the objective formulation and (2) the very high dimensionality of the problem data. Reliable algorithms for such problems are highly sought after.
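
As an illustration, one common way to write the 'data fidelity plus regularisation' structure described above is the following. The notation is an assumption made for this sketch and is not taken from the project description: $x$ is the unknown, $f_i$ is the loss on the $i$-th of $m$ data points, $\psi$ is the regulariser, and $\lambda > 0$ balances the two terms.

$$
\min_{x \in \mathbb{R}^n} \; F(x) = \frac{1}{m} \sum_{i=1}^{m} f_i(x) + \lambda\, \psi(x)
$$

Under this notation, the sparsity-promoting problem mentioned above is typically obtained with a least-squares loss $f_i(x) = \tfrac{1}{2}(a_i^\top x - b_i)^2$ and $\psi(x) = \|x\|_1$; binary logistic regression with labels $b_i \in \{-1, +1\}$ uses $f_i(x) = \log\bigl(1 + \exp(-b_i\, a_i^\top x)\bigr)$; and group lasso replaces the 1-norm with a sum of 2-norms over groups of coordinates, $\psi(x) = \sum_{g} \|x_g\|_2$.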

This project will study efficient algorithms for solving very large-scale ERM problems. Both theoretical and practical aspects of existing algorithms will be considered. We will investigate questions such as: 'Do the algorithms have theoretical convergence guarantees?', 'What is their rate of convergence (i.e., how fast do they find the solution)?' and 'Do they work well in practice on real-world data?' We will also consider potential drawbacks of existing algorithms for ERM, and will use these to guide the development of new, fast, efficient, and reliable algorithms for data science applications.
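
To give a flavour of the kind of algorithm and convergence question involved, here is a minimal Python sketch of proximal gradient descent (often called ISTA) applied to a 1-norm regularised least-squares problem. The choice of algorithm, the function names `soft_threshold` and `ista`, and all parameters are assumptions made for illustration; the project description does not name any specific method.

```python
import numpy as np


def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (componentwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def ista(A, b, lam, num_iters=500):
    """Proximal gradient sketch for min_x 0.5*||Ax - b||^2 + lam*||x||_1.

    Returns the final iterate and the objective value after each iteration,
    so that the empirical rate of convergence can be inspected.
    """
    # Lipschitz constant of the gradient of the smooth term: ||A||_2^2.
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    history = []
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)                       # gradient of the data fidelity term
        x = soft_threshold(x - grad / L, lam / L)      # gradient step, then shrinkage
        objective = 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
        history.append(objective)
    return x, history
```

With the fixed step size 1/L used here, the recorded objective values should not increase from one iteration to the next, so plotting `history` against the iteration count gives a simple practical check on how quickly the method converges.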

A student with a solid background in linear algebra (equivalent to at least a MATH203 level) would be suitable for this project. Student tasks will involve coding algorithms for ERM, and studying their practical (and theoretical) performance on real-world benchmark test data.

Specific Requirements: Strong linear algebra skills to at least a MATH203 level.

Project Number: 2019-39

Project Leader: Alex James, Ann Brower

Host Department: Maths and Stats

Project Title: Is Science journalism sexist?

Project outline: In Science, citations are key. We acknowledge the authors of work religiously and have a strong code of ethics when it comes to giving credit for research.

What about the media? Do they follow our rules? Or is it, as one of NZ's top science journalists said, that "a great quote makes it into print", implying that authorship of the work is actually irrelevant? Is this observation compounded by implicit bias? A quick flick through the pages of some of our popular media outlets suggests that these great quotes too often come from men, whilst female lead authors go unmentioned.

In this project you will collect and analyse data from a number of different sources to find quantitative evidence to support or refute these observations.

Specific Requirements: No specific courses are needed, but diligence, a good eye for detail and an interest in science, gender and all things equitable are essential.

Project Number: 2019-57

Project Leader: Michael Plank, Alex James

Host Department: Mathematics and Statistics

Project Title: Biocontrol of California Thistle

Project outline: California thistle is a weed that impacts agricultural production in NZ. The beetle Cassida rubiginosa was released in NZ in 2007 as a biocontrol agent for California thistle. The beetle reduces the biomass of California thistle via herbivory. However, little is yet known about the interaction between the beetle's and the thistle's life cycles. The aim of this project is to produce a simple mathematical model of California thistle biomass in the presence of herbivory by C. rubiginosa. This will be a stage-structured model similar to a Leslie matrix. The model will be used to assess the relative effectiveness of alternative control strategies, such as combining mowing with biocontrol, and the optimal timing of these.
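
For readers unfamiliar with Leslie-type models, the following Python sketch shows the general shape such a model could take. Everything specific in it is hypothetical and chosen only for illustration: the three stages, the fecundity and survival values, the way herbivory scales survival, and the function names `projection_matrix`, `simulate` and `growth_rate`. None of these come from the project itself or from data on California thistle.

```python
import numpy as np


def projection_matrix(herbivory=0.0):
    """Leslie-type matrix for three hypothetical stages.

    Row 0 holds per-stage reproduction; the sub-diagonal holds survival from
    one stage to the next. Herbivory (a fraction in [0, 1]) is assumed to
    scale survival down proportionally.
    """
    fecundity = np.array([0.0, 0.5, 2.0])   # illustrative values only
    survival = np.array([0.6, 0.8])         # illustrative values only
    M = np.zeros((3, 3))
    M[0, :] = fecundity
    M[1, 0] = survival[0] * (1.0 - herbivory)
    M[2, 1] = survival[1] * (1.0 - herbivory)
    return M


def simulate(n0, steps, herbivory=0.0, mow_step=None, mow_fraction=0.0):
    """Iterate n_{t+1} = M n_t and return total biomass at each time step.

    If mow_step is given, a fraction mow_fraction of every stage is removed
    immediately after that step's update.
    """
    n = np.asarray(n0, dtype=float)
    M = projection_matrix(herbivory)
    totals = [n.sum()]
    for t in range(steps):
        n = M @ n
        if mow_step is not None and t == mow_step:
            n = (1.0 - mow_fraction) * n
        totals.append(n.sum())
    return np.array(totals)


def growth_rate(herbivory=0.0):
    """Long-run growth factor: the dominant eigenvalue modulus of the matrix."""
    return max(abs(np.linalg.eigvals(projection_matrix(herbivory))))
```

Comparing `growth_rate(0.0)` with `growth_rate(0.5)` indicates how much the assumed herbivory slows long-run growth. Note that in this time-invariant linear sketch a single mowing event rescales all later biomass by the same factor regardless of when it happens, so a model that addresses the timing of control would need something more, for example seasonally varying matrices.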

Specific Requirements: This project would suit a student who has taken some MATH courses to at least 200 level, including at least one of MATH201, MATH202 or MATH203. The student should also have some experience in computer programming, e.g. MATLAB or Python.