DATA301-19S1 (C) Semester One 2019

Big Data Computing and Systems

15 points
18 Feb 2019 - 23 Jun 2019

Description

The course introduces distributed computational techniques, distributed algorithms and systems/programming support for large-scale processing of data.

Learning Outcomes

This course teaches parallel and distributed programming, algorithms, and systems principles that are relevant for large-scale processing of big data sets on high-performance computing clusters and cloud computing resources.

At the end of this course, students will be able to:

  • Understand and explain the fundamentals of cloud computing systems (SaaS, PaaS, IaaS, storage and networking architectures, virtual machines and their management, job scheduling).
  • Understand and explain different programming models for parallel and distributed computing (shared memory, shared-nothing / message-passing architectures) and common design patterns for distributed computations on big data sets (e.g. leader/follower, Map/Reduce, Gossiping).
  • Understand the drawbacks and advantages of different cloud solutions and distributed programming models and select appropriate solutions for a given situation.
  • Understand and explain fundamental distributed algorithms (e.g. leader election, consensus) and their properties as well as selected specialized algorithms for distributed processing of big data (e.g. matrix algorithms in parallel / distributed environments, distributed optimization).
  • Be able to design, implement and evaluate distributed processing programs for large data sets using appropriate software frameworks like MPI, CUDA, Hadoop or Apache Spark.
  • Be able to communicate the results and argue from evidence.
  • Be able to work in teams.

Pre-requisites

Timetable 2019

Students must attend one activity from each section.

Lecture A
Activity  Day      Time           Location                            Weeks
01        Tuesday  10:00 - 11:00  Jack Erskine 242                    18 Feb - 7 Apr, 29 Apr - 2 Jun

Lecture B
Activity  Day      Time           Location                            Weeks
01        Monday   13:00 - 14:00  Jack Erskine 242                    18 Feb - 7 Apr, 29 Apr - 2 Jun

Computer Lab A
Activity  Day      Time           Location                            Weeks
01        Tuesday  12:00 - 14:00  Ernest Rutherford 212 Computer Lab  18 Feb - 7 Apr, 29 Apr - 2 Jun

Course Coordinator

James Atlas

Textbooks

Recommended Reading

Blaise Barney; Introduction to Parallel Computing; (and other tutorials: https://hpc.llnl.gov/training/tutorials).

CUDA; CUDA Toolkit Documentation; v10.0.130; (CUDA Programming Guide: https://docs.nvidia.com/cuda).

Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman; Mining of Massive Datasets; 2nd ed.; Cambridge University Press, 2014 (http://www.mmds.org).

Additional Course Outline Information

Academic integrity

You are encouraged to discuss the general aspects of a problem with others. However, anything you submit for credit must be entirely your own work and not copied, with or without modification, from any other person. If you share details of your work with anybody else then you are likely to be in breach of the University's General Course and Examination Regulations and/or Computer Regulations (both of which are set out in the University Calendar) and/or the Computer Science Department's policy (see section 9). The Department treats cases of dishonesty very seriously and, where appropriate, will not hesitate to notify the University Proctor.

If you need help with specific details relating to your work, or are not sure what you are allowed to do, then contact your tutors or lecturer for advice.

Assessment and grading system

Lab assessment - 30%

In the labs, students will practice the design and implementation of distributed algorithms and gain practical experience with contemporary Big Data and Cloud Computing frameworks such as Apache Spark, MPI, CUDA and Google Cloud / Amazon Web Services. This assessment item addresses LO2, LO4, LO5.


Project - 40%

In this series of artifacts, students will complete a short, application-focused project. Students will work in teams of two or three on an analysis task for a big data set. The task requires them to design, implement and test an appropriate distributed algorithm in an appropriate software framework, write progress reports, critique their design, and communicate the design and analysis results professionally in a written report. This assessment item addresses LO3, LO5, LO6, LO7.


Final exam - 30%

The final exam provides a summative assessment of learning outcomes across the full semester. It can include theoretical aspects, algorithms, programming, and techniques covered in lectures and assignments. This assessment item addresses LO1, LO2, LO3, LO4.

Course Outline

The topics covered in lectures will generally follow this progression:

  • Introduction: Big Data
  • 5 Vs (Variety, Velocity, Volume, Veracity, Value)
  • Storage and networking architectures
  • Divide and Conquer, Map, Reduce, Map/Reduce functional programming in Spark
  • Algorithms in Spark: Group By, Union, Intersection, Difference, Matrix-Vector and Matrix-Matrix Multiplication
  • Systems: SaaS, PaaS, IaaS, Google Cloud / Amazon Web Services, storage and networking architectures, virtual machines and their management, job scheduling, cloud resources
  • Algorithms in Spark on cloud: Hashing, PageRank
  • Data Processing: Distributed Data Structures, Graphs, Leader Election, Consensus
  • Memory Hierarchy, Shared memory, Shared-nothing, distributed file systems, replication, communication cost, complexity theory
  • Programming: Message-Passing (MPI)
  • Programming: Threads, Locks and Atomics (CUDA)
  • Programming: Work Queues, Schedulers, Streaming
  • Heterogeneous Processing: Systems and Programming

Preparation

The course assumes that you are proficient in Python, as taught in COSC121, and in algorithm design and analysis, as taught in COSC262. If you are enrolling in DATA301 but haven't already passed COSC121 and COSC262 or the equivalents, you should consult the course supervisor before enrolling.

Indicative Fees

Domestic fee $761.00

International fee $3,188.00

* Fees include New Zealand GST and do not include any programme level discount or additional course related expenses.

For further information see Computer Science and Software Engineering.
