DATA301-22S1 (C) Semester One 2022

Big Data Computing and Systems

15 points

Details:
Start Date: Monday, 21 February 2022
End Date: Sunday, 26 June 2022
Withdrawal Dates
Last Day to withdraw from this course:
  • Without financial penalty (full fee refund): Sunday, 6 March 2022
  • Without academic penalty (including no fee refund): Sunday, 15 May 2022

Description

The course introduces distributed computational techniques, distributed algorithms and systems/programming support for large-scale processing of data.

2022 Covid-19 Update: Please refer to the course page on AKO | Learn for all information about your course, including lectures, labs, tutorials and assessments.

Learning Outcomes

This course teaches parallel and distributed programming, algorithms, and systems principles that are relevant for large-scale processing of big data sets on high performance computing clusters and cloud computing resources.

At the end of this course, students will be able to:

  • Understand and explain the fundamentals of cloud computing systems (SaaS, PaaS, IaaS, storage and networking architectures, virtual machines and their management, job scheduling).
  • Understand and explain different programming models for parallel and distributed computing (shared memory, shared-nothing / message-passing architectures) and common design patterns for distributed computations on big data sets (e.g. leader/follower, Map/Reduce, Gossiping).
  • Understand the drawbacks and advantages of different cloud solutions and distributed programming models and select appropriate solutions for a given situation.
  • Understand and explain fundamental distributed algorithms (e.g. leader election, consensus) and their properties as well as selected specialized algorithms for distributed processing of big data (e.g. matrix algorithms in parallel / distributed environments, distributed optimization).
  • Design, implement and evaluate distributed processing programs for large data sets using appropriate software frameworks like MPI, CUDA, Hadoop or Apache SPARK.
  • Communicate the results and argue from evidence.
  • Work in teams.

Prerequisites

Course Coordinator

James Atlas

Assessment

2022 Covid-19 Update: Please refer to the course page on AKO | Learn for all information about your course, including lectures, labs, tutorials and assessments.

Textbooks / Resources

Recommended Reading

Blaise Barney; Introduction to Parallel Computing, and other tutorials (https://hpc.llnl.gov/training/tutorials).

CUDA; CUDA Toolkit Documentation; v10.0.130 (CUDA Programming Guide: https://docs.nvidia.com/cuda).

Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman; Mining of Massive Datasets; 2nd ed.; Cambridge University Press, 2014 (http://www.mmds.org).

Additional Course Outline Information

Academic integrity

You are encouraged to discuss the general aspects of a problem with others. However, anything you submit for credit must be entirely your own work and not copied, with or without modification, from any other person. If you share details of your work with anybody else then you are likely to be in breach of the University's General Course and Examination Regulations and/or Computer Regulations (both of which are set out in the University Calendar) and/or the Computer Science Department's policy (see section 9). The Department treats cases of dishonesty very seriously and, where appropriate, will not hesitate to notify the University Proctor.

If you need help with specific details relating to your work, or are not sure what you are allowed to do, then contact your tutors or lecturer for advice.

Assessment and grading system

Lab assessment - 30%

In the labs, students will practice the design and implementation of distributed algorithms and gain practical experience with contemporary Big Data and Cloud Computing frameworks such as Apache SPARK, MPI, CUDA and Google Cloud / Amazon Web Services. This assessment item addresses LO2, LO4, LO5.


Project - 40%

In this series of artifacts, students will complete a short, application-focused project. Students will work in teams of two or three on an analysis task for a big data set. The task requires them to design, implement and test an appropriate distributed algorithm in an appropriate software framework, to write progress reports, to critique their design, and to communicate the design and analysis results professionally in a written report. This assessment item addresses LO3, LO5, LO6, LO7.


Final exam - 30%

The final exam provides a summative assessment of learning outcomes across the full semester. It can include theoretical aspects, algorithms, programming, and techniques covered in lectures and assignments. This assessment item addresses LO1, LO2, LO3, LO4.

Grade moderation

The Computer Science department's grading policy states that in order to pass a course you must meet two requirements:
1. You must achieve an average grade of at least 50% over all assessment items.
2. You must achieve an average mark of at least 45% on invigilated assessment items.

If you satisfy both of these criteria, your grade will be determined by the following University-wide scale for converting marks to grades: an average mark of 50% is sufficient for a C- grade, an average mark of 55% earns a C grade, 60% earns a C+ grade, and so forth. However, if you do not satisfy both passing criteria, you will be given either a D or an E grade, depending on your marks. Marks are sometimes scaled to achieve consistency between courses from year to year.
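As a rough illustration of how the two passing criteria interact, the following Python sketch checks them for a set of marks. It rests on assumptions that are not stated in this outline: that the "average over all assessment items" is weighted by the percentages listed above, and that the final exam is the only invigilated item. The names `WEIGHTS`, `INVIGILATED` and `indicative_grade` are hypothetical, and only the grade boundaries quoted in the text are included.

```python
# Assessment weights as listed in this outline (percent of the course).
WEIGHTS = {"lab": 30, "project": 40, "exam": 30}

# Assumption: the final exam is the only invigilated assessment item.
INVIGILATED = {"exam"}

# Only the boundaries quoted in the text; higher bands are not listed there.
STATED_BANDS = [(60, "C+ or higher"), (55, "C"), (50, "C-")]


def weighted_average(marks, items):
    """Weighted average mark (0-100) over the given assessment items."""
    total_weight = sum(WEIGHTS[item] for item in items)
    return sum(marks[item] * WEIGHTS[item] for item in items) / total_weight


def meets_pass_criteria(marks):
    """Both criteria must hold: 50% overall and 45% on invigilated items."""
    overall = weighted_average(marks, WEIGHTS)
    invigilated = weighted_average(marks, INVIGILATED)
    return overall >= 50 and invigilated >= 45


def indicative_grade(marks):
    """Return an indicative grade band, ignoring any scaling of marks."""
    if not meets_pass_criteria(marks):
        return "D or E"
    overall = weighted_average(marks, WEIGHTS)
    for floor, grade in STATED_BANDS:
        if overall >= floor:
            return grade


# An overall mark of 64% still fails if the exam mark is below 45%.
print(indicative_grade({"lab": 80, "project": 70, "exam": 40}))
print(indicative_grade({"lab": 60, "project": 55, "exam": 50}))
```

Under these assumptions, the first call shows that a strong lab and project result cannot compensate for an exam mark below 45%. Actual grades are determined by the Department and may involve scaling.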

Students may apply for special consideration if their performance in an assessment is affected by extenuating circumstances beyond their control.

Applications for special consideration should be submitted via the Examinations Office website within five days of the assessment.

Where an extension may be granted for an assessment, this will be decided by direct application to the Department and an application to the Examinations Office may not be required.

Special consideration is not available for items worth less than 10% of the course.

Students who are prevented by extenuating circumstances from completing the course after the final date for withdrawing may apply for special consideration for late discontinuation of the course. Applications must be submitted to the Examinations Office within five days of the end of the main examination period for the semester.

Course Outline

The topics covered in lectures will be organized generally with the following progression:

  • Introduction: Big Data
  • 5 Vs (Variety, Velocity, Volume, Veracity, Value)
  • Storage and networking architectures
  • Divide and Conquer, Map, Reduce, Map/Reduce functional programming in SPARK
  • Algorithms in SPARK: Group By, Union, Intersection, Difference, Matrix-Vector and Matrix-Matrix Multiplication
  • Systems: SaaS, PaaS, IaaS, Google Cloud / Amazon Web Services, storage and networking architectures, virtual machines and their management, job scheduling, cloud resources
  • Algorithms in SPARK on cloud: Hashing, PageRank
  • Data Processing: Distributed Data Structures, Graphs, Leader Election, Consensus
  • Memory Hierarchy, Shared memory, Shared-nothing, distributed file systems, replication, communication cost, complexity theory
  • Programming: Message-Passing (MPI)
  • Programming: Threads, Locks and Atomics (CUDA)
  • Programming: Work Queues, Schedulers, Streaming
  • Heterogeneous Processing: Systems and Programming

Preparation

The course assumes that you are proficient in Python, as taught in COSC121, and in algorithm design and analysis, as taught in COSC262. If you are enrolling in DATA301 but haven't already passed COSC121 and COSC262 or the equivalents, you should consult the course supervisor before enrolling.

Indicative Fees

Domestic fee $799.00

International fee $3,600.00

* All fees are inclusive of NZ GST or any equivalent overseas tax, and do not include any programme level discount or additional course-related expenses.

For further information see Computer Science and Software Engineering.
