DATA420-20S1 (C) Semester One 2020

Scalable Data Science

15 points

Details:
Start Date: Monday, 17 February 2020
End Date: Sunday, 21 June 2020
Withdrawal Dates
Last Day to withdraw from this course:
  • Without financial penalty (full fee refund): Friday, 28 February 2020
  • Without academic penalty (including no fee refund): Friday, 29 May 2020

Description

This course will introduce students to core topics in scalable data science based on distributed-computing techniques. This is a very practical course, with students learning by experimenting on a computer cluster.

This course introduces students to new computational methods used in data science. We will look at methods for data from a range of contexts, including scalable methods for big data and distributed computing. Topics are drawn primarily from cloud computing, distributed computing, and machine learning. This is a very hands-on course, with students learning and experimenting on the School data science cluster. We will work in the computer lab, and students will have access to the cluster at any time to pursue additional projects.

The intent of the course is to provide an environment that is similar to what you will experience in a data science position in the real world, and to teach you to think carefully and to apply the appropriate tool for the task at hand.

Learning Outcomes

Concrete learning outcomes will include:
  • familiarity with map-reduce algorithms for processing big data, including robust data clean-up using regular expressions
  • basic skills to extract, transform and load data into distributed file systems such as the Hadoop Distributed File System (HDFS)
  • working with structured data using DataFrames and dynamic querying in Spark SQL, which is built on the Catalyst optimizer
  • basic applications of some of the standard learning algorithms in Spark's machine learning and distributed graph processing libraries
  • basic data science analytics pathways for the following common data types:
     - structured text data (logs generated by machines, tabular data from various open data sources)
     - geospatial data (and its integration with other types of data)
     - unstructured text data (a collection of text documents)
     - social media data

Students will be encouraged to showcase their completed labs by publishing them in public GitHub repositories, where potential employers can see them directly. The basic labs offer plenty of opportunities for creative extension, even after the course is completed.
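As a rough illustration of the first learning outcome above (map-reduce processing with regular-expression clean-up), the following is a minimal single-machine sketch in plain Python. It is not taken from the course labs and does not use Hadoop or Spark; the function names, the token pattern, and the sample lines are all assumptions made for illustration only.

```python
import re
from collections import defaultdict

# Hypothetical clean-up rule: keep runs of lower-case letters and apostrophes.
TOKEN = re.compile(r"[a-z']+")

def map_phase(line):
    """Map step: lower-case the line, extract tokens, emit (word, 1) pairs."""
    return [(word, 1) for word in TOKEN.findall(line.lower())]

def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle(pairs))

# Made-up sample input, for illustration only.
sample_lines = ["Big data, BIG clusters!", "Data on clusters..."]
counts = word_count(sample_lines)
print(counts)
```

On a real cluster the map and reduce steps would run in parallel across many machines and the shuffle would move data over the network; this sketch only shows the shape of the computation.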

Pre-requisites

Subject to approval of the Head of Department of Mathematics and Statistics.

Timetable 2020

Students must attend one activity from each section.

Lecture A
Activity  Day        Time           Location                                        Weeks
01        Friday     14:00 - 15:00  Eng Core 342 CAD Lab (21/2-20/3)                17 Feb - 22 Mar
                                    - (24/4-29/5)                                   20 Apr - 31 May

Computer Lab A
Activity  Day        Time           Location                                        Weeks
01        Tuesday    10:00 - 12:00  Eng Core 342 CAD Lab (18/2-17/3)                17 Feb - 29 Mar
                                    - (24/3, 21/4-26/5)                             20 Apr - 31 May

Drop in Class B
Activity  Day        Time           Location                                        Weeks
01        Monday     17:00 - 18:00  Rehua 008 Computer Lab (24/2-16/3)              24 Feb - 29 Mar
                                    - (23/3, 20/4, 4/5-25/5)                        20 Apr - 26 Apr
                                                                                    4 May - 31 May
02        Monday     16:00 - 17:00  Rehua 008 Computer Lab (24/2-16/3)              24 Feb - 29 Mar
                                    - (23/3, 20/4, 4/5-25/5)                        20 Apr - 26 Apr
                                                                                    4 May - 31 May
03        Thursday   17:00 - 18:00  Ernest Rutherford 212 Computer Lab (27/2-19/3)  24 Feb - 22 Mar
                                    - (23/4-28/5)                                   20 Apr - 31 May
04        Wednesday  15:00 - 16:00  Jack Erskine 038 Lab 4 (26/2-18/3)              24 Feb - 29 Mar
                                    - (25/3, 22/4-27/5)                             20 Apr - 31 May

Course Coordinator

For further information see Mathematics and Statistics Head of Department

Textbooks / Resources

No textbook required.

Indicative Fees

Domestic fee $1,022.00

* Fees include New Zealand GST and do not include any programme level discount or additional course related expenses.

For further information see Mathematics and Statistics.
