DATA201-20S2 (C) Semester Two 2020

Data Wrangling

15 points

Details:
Start Date: Monday, 13 July 2020
End Date: Sunday, 8 November 2020
Withdrawal Dates
Last Day to withdraw from this course:
  • Without financial penalty (full fee refund): Friday, 24 July 2020
  • Without academic penalty (including no fee refund): Friday, 25 September 2020

Description

This course introduces students to data cleaning, standardisation, and the integration of disparate data sources and structures. Students will learn how to convert data from many different sources into a consistent format ready for analysis, and will learn about data quality, ethics, management, storage, and persistency.

Data comes in a variety of shapes and formats: text documents, images, tables, social network graphs, databases, webpages. It is used for a variety of purposes: archiving, analysis, visualisation, communication, and even art. Data wrangling is the process of reshaping data so that it can be used more efficiently. The process can be difficult because it is important to preserve, as much as possible, the relevant information contained in the dataset while at the same time ensuring ethical treatment of the data subjects, e.g., protecting people’s security and privacy. Data scientists thus need to make careful decisions, and it is estimated that up to 80% of a data scientist's working time is spent cleaning and wrangling data. Learning to do this efficiently is therefore essential across many disciplines and industries.

The course aims to provide students with the tools to handle different sources of data (CSVs, spreadsheets, web pages, APIs, …), some target formats (long / wide data frames, packages, …) and a variety of kinds of data (dates, numbers, strings, text, …). Wherever possible, students will work on real-world datasets, and ethical facets of data wrangling will be explicitly discussed in class. R will be the default programming language during the course, and the use of JupyterLab and RStudio is strongly encouraged. References to other programming languages, e.g. Julia, will be provided. Peer, group, and class interaction will be explicitly required during the course.

Learning Outcomes

Having engaged in learning during the course, students will be able to:

  • Access (read in) different data formats;
  • Interact with (manipulate) relational datasets (e.g., data frames) and hierarchical datasets;
  • Output (write to) different data formats;
  • Analyse a dataset in order to identify its format and possible errors;
  • Analyse a data wrangling problem: identify the available source format(s); define the suitable target format(s) and the relevant ethical / technical constraints; develop a flow to transform data from source to target formats.
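
As a rough illustration of what a source-to-target flow of this kind can look like, the sketch below reads a small wide-format CSV, reshapes it into long format, and writes the result back out as CSV. It is only a minimal sketch under stated assumptions: the course itself uses R by default, Python is used here purely as an example of another language, and the data, the column names, and the helper `wide_to_long` are hypothetical and not taken from the course materials.

```python
import csv
import io

# Hypothetical wide-format source: one row per student, one column per assessment.
WIDE_CSV = """student,quiz,exam
A,7,65
B,9,80
"""


def wide_to_long(rows, id_column):
    """Reshape wide rows into long rows: one (id, variable, value) record per cell."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every long row, not melted
            long_rows.append(
                {id_column: row[id_column], "variable": column, "value": value}
            )
    return long_rows


# Access: read the source format (CSV text) into a list of dictionaries.
wide_rows = list(csv.DictReader(io.StringIO(WIDE_CSV)))

# Manipulate: reshape from wide to long.
long_rows = wide_to_long(wide_rows, id_column="student")

# Output: write the target format (long CSV) to an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["student", "variable", "value"])
writer.writeheader()
writer.writerows(long_rows)
long_csv = buffer.getvalue()
```

Values are kept as strings, exactly as read from the CSV; deciding when and how to convert them (and what to do with missing or malformed cells) is part of the wrangling problem rather than something this sketch settles.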

Pre-requisites

15 points of 100-level COSC, MATH or STAT, or INFO125

Timetable 2020

Students must attend one activity from each section.

Lecture A
Activity  Day      Time           Location                      Weeks
01        Tuesday  14:00 - 16:00  Meremere 108 Lecture Theatre  13 Jul - 23 Aug, 7 Sep - 18 Oct
Computer Lab A
Activity  Day        Time           Location                            Weeks
01        Wednesday  10:00 - 12:00  Ernest Rutherford 212 Computer Lab  13 Jul - 23 Aug, 7 Sep - 18 Oct
02        Thursday   13:00 - 15:00  Rehua 008 Computer Lab              13 Jul - 23 Aug, 7 Sep - 18 Oct

Course Coordinator / Lecturer

Giulio Dalla Riva

Textbooks / Resources

Suggested Textbook:
Stephanie Locke, Data Manipulation in R

Indicative Fees

Domestic fee $777.00

International fee $3,375.00

* Fees include New Zealand GST and do not include any programme level discount or additional course related expenses.

For further information see Mathematics and Statistics.
