STA 221 - Big Data & High Performance Statistical Computing

Subject: STA 221
Title: Big Data & High Performance Statistical Computing
Units: 4.0
School: College of Letters and Science LS
Department: Statistics STA
Effective Term: 2020 Spring Quarter

Learning Activities

  • Lecture - 3.0 hours
  • Discussion - 1.0 hours

Description

High-performance computing in high-level data analysis languages; different computational approaches and paradigms for efficient analysis of big data; interfaces to compiled languages; R and Python programming languages; high-level parallel computing; MapReduce; parallel algorithms and reasoning.

Prerequisites

STA 220

Expanded Course Description

Summary of Course Content:
This course explores how to scale statistical computing to large data sets and simulations. It
moves from identifying inefficiencies in code, to idioms for more efficient code, to interfacing with
compiled code for speed and memory improvements. It then focuses on high-level approaches
to parallel and distributed computing for data analysis and machine learning, and on the
general principles involved. The course also compares languages and frameworks
for statistical and machine learning, the concepts underlying them, and their
advantages and disadvantages. Finally, it introduces statistical methods
designed specifically for large data, e.g. the bag of little bootstraps.
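
As an illustration of the last point, the bag of little bootstraps can be sketched in a few
lines of Python. This is a minimal sketch under stated assumptions, not course material: the
function name blb_stderr_of_mean and the parameters gamma, n_subsets and n_resamples are
hypothetical, the estimator is fixed to the sample mean, and the data are assumed to be a list
of at least two numbers. Each small subset is resampled up to the full sample size n, so every
resample touches only b distinct points.

```python
import math
import random
from collections import Counter


def blb_stderr_of_mean(data, gamma=0.7, n_subsets=5, n_resamples=50, seed=0):
    """Estimate the standard error of the sample mean (illustrative sketch)."""
    rng = random.Random(seed)
    n = len(data)
    b = max(2, int(n ** gamma))  # subset size, much smaller than n (assumed choice)
    per_subset_se = []
    for _ in range(n_subsets):
        subset = rng.sample(data, b)  # b points drawn without replacement
        means = []
        for _ in range(n_resamples):
            # Multinomial counts: n draws spread over only b distinct points.
            counts = Counter(rng.choices(range(b), k=n))
            means.append(sum(subset[i] * c for i, c in counts.items()) / n)
        centre = sum(means) / n_resamples
        var = sum((m - centre) ** 2 for m in means) / (n_resamples - 1)
        per_subset_se.append(math.sqrt(var))
    # Average the per-subset standard errors to get the final estimate.
    return sum(per_subset_se) / n_subsets


if __name__ == "__main__":
    demo_rng = random.Random(1)
    sample = [demo_rng.gauss(0.0, 1.0) for _ in range(2000)]
    print(blb_stderr_of_mean(sample))
```

The subsets are independent of one another, so in a parallel setting each one could be
processed by a separate worker and only the per-subset standard errors would need to be
combined.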

Illustrative Reading:
● Advanced R, Wickham.
● Parallel R, McCallum & Weston.
● Python for Data Analysis, McKinney.
● Hadoop: The Definitive Guide, White.

Potential Course Overlap:
The course covers the same general topics as STA 141C, but at a more advanced level, and
includes additional topics on research-level tools. Examples of such tools are Scikit-learn
functions and key elements of deep learning (such as convolutional neural networks and
long short-term memory units). ECS 158 covers parallel computing, but uses different
technologies and has a more technical focus on machine-level details. ECS 145 covers Python,
but from a computer science and software engineering perspective rather than a data
analysis perspective.

Final Exam:
Yes