DATA422-20S2 (C) Semester Two 2020

Data Wrangling for Data Science

15 points

Details:
Start Date: Monday, 13 July 2020
End Date: Sunday, 8 November 2020
Withdrawal Dates
Last Day to withdraw from this course:
  • Without financial penalty (full fee refund): Friday, 24 July 2020
  • Without academic penalty (including no fee refund): Friday, 25 September 2020

Description

This course develops students' skills in data cleaning and processing, data integration techniques, and implementing data wrangling workflows for real-world datasets.

Data wrangling is the iterative process of transforming data from a source format into a format suitable for storage, analysis, visualization, or communication. The process is constrained by the requirement of preserving as much as possible of the relevant information contained in the dataset, as well as ensuring ethical treatment of the data subjects, e.g., protecting their security and privacy. The course aims to give students the tools to handle different source formats (CSVs, spreadsheets, web pages, APIs, …), some target formats (long / wide data frames, packages, …) and a variety of data kinds (dates, numbers, strings, text, …). Wherever possible, students will work on real-world datasets, and ethical facets of data wrangling will be explicitly discussed in class. R will be the default programming language for the course, and the use of JupyterLab and RStudio is strongly encouraged. References to other programming languages (Python, Julia) will be provided. Peer, group, and class interaction will be explicitly required during the course.

Learning Outcomes

Having engaged in learning during the course, students will be able to:
  • Access (read in) different data formats;
  • Interact with (manipulate) relational datasets (e.g., data frames) and hierarchical datasets (e.g., JSON);
  • Output (write to) different data formats;
  • Analyse a dataset in order to identify its format and possible errors;
  • Analyse a data wrangling problem: identify the available source format(s); define the suitable target format(s) and the relevant ethical / technical constraints; develop a workflow to transform data from source to target formats.

Pre-requisites

Subject to approval of the Head of Department of Mathematics and Statistics.

Timetable 2020

Students must attend one activity from each section.

Lecture A
Activity | Day | Time | Location | Weeks
01 | Tuesday | 14:00 - 16:00 | Meremere 108 Lecture Theatre | 13 Jul - 23 Aug, 7 Sep - 18 Oct

Lab A
Activity | Day | Time | Location | Weeks
01 | Thursday | 16:00 - 18:00 | Ernest Rutherford 212 Computer Lab | 13 Jul - 23 Aug, 7 Sep - 18 Oct
02 | Wednesday | 13:00 - 15:00 | Ernest Rutherford 464 Computer Lab | 13 Jul - 23 Aug, 7 Sep - 18 Oct
03 | Wednesday | 08:00 - 10:00 | Ernest Rutherford 212 Computer Lab | 13 Jul - 23 Aug, 7 Sep - 18 Oct
04 | Thursday | 10:00 - 12:00 | Beatrice Tinsley 105 (16/7-20/8), Ernest Rutherford 212 Computer Lab (10/9-15/10) | 13 Jul - 23 Aug, 7 Sep - 18 Oct

Course Coordinator / Lecturer

Heyang (Thomas) Li

Lecturer

Giulio Dalla Riva

Textbooks / Resources

Recommended Reading

Stephanie Locke; Statistical Computing with R.

Indicative Fees

Domestic fee $1,022.00

* Fees include New Zealand GST and do not include any programme level discount or additional course related expenses.

For further information see Mathematics and Statistics.
