DATA301-20S1 (C) Semester One 2020

Big Data Computing and Systems

15 points

Details:
Start Date: Monday, 17 February 2020
End Date: Sunday, 21 June 2020
Withdrawal Dates
Last Day to withdraw from this course:
  • Without financial penalty (full fee refund): Friday, 28 February 2020
  • Without academic penalty (including no fee refund): Friday, 29 May 2020

Description

This course introduces distributed computational techniques, distributed algorithms, and the systems and programming support needed for large-scale data processing.

Learning Outcomes

    This course teaches parallel and distributed programming, algorithms, and systems principles that are relevant for large-scale processing of big data sets on high-performance computing clusters and cloud computing resources.

    At the end of this course, students will be able to:

  • Understand and explain the fundamentals of cloud computing systems (SaaS, PaaS, IaaS, storage and networking architectures, virtual machines and their management, job scheduling).
  • Understand and explain different programming models for parallel and distributed computing (shared memory, shared-nothing / message-passing architectures) and common design patterns for distributed computations on big data sets (e.g. leader/follower, Map/Reduce, Gossiping).
  • Understand the drawbacks and advantages of different cloud solutions and distributed programming models and select appropriate solutions for a given situation.
  • Understand and explain fundamental distributed algorithms (e.g. leader election, consensus) and their properties, as well as selected specialized algorithms for distributed processing of big data (e.g. matrix algorithms in parallel / distributed environments, distributed optimization).
  • Be able to design, implement and evaluate distributed processing programs for large data sets using appropriate software frameworks like MPI, CUDA, Hadoop or Apache SPARK.
  • Be able to communicate the results and argue from evidence.
  • Be able to work in teams.
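The Map/Reduce design pattern mentioned in the outcomes can be sketched in plain Python, with no cluster required; the same map-then-aggregate structure is what frameworks such as SPARK and Hadoop distribute across machines. The word-count example and function names below are purely illustrative, not part of the course material.

```python
from functools import reduce
from collections import Counter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the line.
    return [(word.lower(), 1) for word in line.split()]

def reducer(counts, pair):
    # Reduce phase: sum the counts for each word key.
    word, n = pair
    counts[word] += n
    return counts

lines = ["big data big systems", "data systems"]
pairs = [p for line in lines for p in mapper(line)]  # map
totals = reduce(reducer, pairs, Counter())           # reduce
print(totals["big"])  # 2
```

In a real distributed framework the map calls run in parallel on different workers and the pairs are shuffled by key before reduction; the functional structure is unchanged.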

Pre-requisites

Timetable 2020

Students must attend one activity from each section.

Lecture A
Activity 01: Monday, 14:00 - 15:00
Location: A8 Lecture Theatre (17/2-16/3); - (23/3, 20/4, 4/5-25/5)
Weeks: 17 Feb - 29 Mar; 20 Apr - 26 Apr; 4 May - 31 May

Lecture B
Activity 01: Thursday, 15:00 - 16:00
Location: Meremere 105 Lecture Theatre (20/2-19/3); - (23/4-28/5)
Weeks: 17 Feb - 22 Mar; 20 Apr - 31 May

Computer Lab A
Activity 01: Thursday, 11:00 - 13:00
Location: Jack Erskine 001 Computer Lab (20/2-19/3); - (23/4-28/5)
Weeks: 17 Feb - 22 Mar; 20 Apr - 31 May

Activity 02: Wednesday, 12:00 - 14:00
Location: Jack Erskine 001 Computer Lab (11/3-18/3); - (25/3, 22/4-27/5)
Weeks: 9 Mar - 29 Mar; 20 Apr - 31 May

Course Coordinator

James Atlas

Textbooks / Resources

Recommended Reading

Blaise Barney; Introduction to Parallel Computing (and other tutorials). https://hpc.llnl.gov/training/tutorials

NVIDIA; CUDA Toolkit Documentation; v10.0.130 (CUDA Programming Guide). https://docs.nvidia.com/cuda

Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman; Mining of Massive Datasets; 2nd ed.; Cambridge University Press, 2014. http://www.mmds.org

Additional Course Outline Information

Academic integrity

You are encouraged to discuss the general aspects of a problem with others. However, anything you submit for credit must be entirely your own work and not copied, with or without modification, from any other person. If you share details of your work with anybody else then you are likely to be in breach of the University's General Course and Examination Regulations and/or Computer Regulations (both of which are set out in the University Calendar) and/or the Computer Science Department's policy (see section 9). The Department treats cases of dishonesty very seriously and, where appropriate, will not hesitate to notify the University Proctor.

If you need help with specific details relating to your work, or are not sure what you are allowed to do, then contact your tutors or lecturer for advice.

Assessment and grading system

Lab assessment - 30%

In the labs, students will practise the design and implementation of distributed algorithms and gain practical experience with contemporary Big Data and Cloud Computing frameworks such as Apache SPARK, MPI, CUDA and Google Cloud / Amazon Web Services. Addresses LO2, LO4, LO5.


Project - 40%

In this series of artifacts, students will complete a short, application-focused project. Working in teams of two or three on an analysis task for a big data set, students will design, implement and test an appropriate distributed algorithm in an appropriate software framework, write progress reports, critique their design, and communicate the design and analysis results in a professional manner in a written report. Addresses LO3, LO5, LO6, LO7.


Final exam - 30%

The final exam provides a summative assessment of learning outcomes across the full semester, and may cover theoretical aspects, algorithms, programming, and techniques from lectures and assignments. Addresses LO1, LO2, LO3, LO4.

Course Outline

The topics covered in lectures will be organized generally with the following progression:

• Introduction: Big Data
• 5 Vs (Variety, Velocity, Volume, Veracity, Value)
• Storage and networking architectures
• Divide and Conquer, Map, Reduce, Map/Reduce functional programming in SPARK
• Algorithms in SPARK: Group By, Union, Intersection, Difference, Matrix-Vector and Matrix-Matrix Multiplication
• Systems: SaaS, PaaS, IaaS, Google Cloud / Amazon Web Services, storage and networking architectures, virtual machines and their management, job scheduling, cloud resources
• Algorithms in SPARK on cloud: Hashing, PageRank
• Data Processing: Distributed Data Structures, Graphs, Leader Election, Consensus
• Memory Hierarchy, Shared Memory, Shared-Nothing, distributed file systems, replication, communication cost, complexity theory
• Programming: Message-Passing (MPI)
• Programming: Threads, Locks and Atomics (CUDA)
• Programming: Work Queues, Schedulers, Streaming
• Heterogeneous Processing: Systems and Programming
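As a taste of the Matrix-Vector Multiplication topic above, the computation is typically expressed in the Map/Reduce style: map each matrix entry (i, j, m_ij) to a contribution m_ij * v_j keyed by row i, then reduce by summing per row. Plain Python stands in for SPARK in this sketch; the variable names and the small example data are illustrative only.

```python
from collections import defaultdict

# Sparse matrix as (row, col, value) triples, plus a dense vector v.
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
v = [4.0, 5.0]

# Map: entry (i, j, m_ij) contributes m_ij * v[j], keyed by row i.
contributions = [(i, m_ij * v[j]) for (i, j, m_ij) in M]

# Reduce: sum contributions per row key to obtain (M v)_i.
result = defaultdict(float)
for i, x in contributions:
    result[i] += x

print(dict(result))  # {0: 14.0, 1: 15.0}
```

In SPARK the same structure would be an RDD of triples transformed with a map followed by a by-key sum, which lets each row's partial sums be computed on different workers.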

Preparation

The course assumes that you are proficient in Python, as taught in COSC121, and in algorithm design and analysis, as taught in COSC262. If you are enrolling in DATA301 but haven't already passed COSC121 and COSC262 or the equivalents, you should consult the course supervisor before enrolling.

Indicative Fees

Domestic fee $777.00

International fee $3,375.00

* Fees include New Zealand GST and do not include any programme level discount or additional course related expenses.

For further information see Computer Science and Software Engineering.

All DATA301 Occurrences

  • DATA301-20S1 (C) Semester One 2020