INFX 573: Introduction to Data Science

Winter 2013
University of Washington School of Information
Lectures: Tuesday and Thursday 1:30-3:20, MGH 420

Instructor: Prof. Joshua Blumenstock, Suite 370 Mary Gates Hall

Course Description

This course offers students a practical, hands-on introduction to the growing field of "Data Science" and to common methods for quantitative and computational analytics. As "big data" becomes the norm in modern business and research environments, there is a growing demand for individuals who can derive meaningful insight from large, unruly data. This requires a heterogeneous mix of skills, from data munging and ETL, to machine learning and econometrics, to effective visualization and communication. Through a combination of data-intensive exercises and guest lectures by experts in the field, this course provides an overview of key concepts, skills, and technologies used by data scientists. Interested students must have college-level exposure to statistics and programming.

Prerequisites

A data scientist is often described as someone who knows more statistics than a computer scientist and more computer science than a statistician. Most assignments for the course will use the R programming language, and students are strongly encouraged to familiarize themselves with R prior to the first day of class. Please note that students who do not meet the programming and statistics requirements below will find it extremely difficult to complete the required problem sets.

Programming: Students should be able to program comfortably in a high-level programming language like Java, Python, PHP, or C/C#/C++. Note that HTML, JavaScript, and VBA are not sufficient in this context. "Comfortably" implies that students should be able to write simple programs from scratch, like a web scraper, a text parser, or a simple game of Scrabble or tic-tac-toe.

Statistics: Students should have had introductory coursework in both probability and statistics prior to enrolling in this course. At a minimum, students should have an operational understanding of hypothesis testing, statistical significance, and regression analysis.

Course Outline

Section 1: Introduction to Data Science

Lecture 1: What is Data Science?

Lecture 2: A Crash Course in R

Section 2: Empirical Frameworks and Experimental Design

Lecture 3: Developing an Empirical Framework

Lecture 4: Business Experiments

Section 3: Working with Data

Lecture 5: Exploring Data, part 1

Lecture 6: Exploring Data, part 2

Section 4: Basic Analytics

Lecture 7: Distributions, t-tests, and the importance of basic statistics

Lecture 8: How far can basic statistics get you?

Lecture 9: Regression

Section 5: Advanced Analytics

Lecture 10: Machine Learning 1: Introduction

Lecture 11: Machine Learning 2: Supervised Learning

Lecture 12: Machine Learning 3: Unsupervised Learning

Lecture 13: Social Network Analysis

Section 6: Visualizing and Communicating Data

Lecture 14: Visualizing Quantitative Information

Lecture 15: Communicating Results Effectively

Section 7: Scaling to Terabytes and Petabytes

Lecture 17: Scaling: What works and what doesn’t (and what might in the future)

Lecture 18: Common applications and tools: MapReduce, Hadoop, Hive, and Alternatives

Section 8: Perspectives from Industry and Academia

Lecture 19: Group Project Presentations

Lecture 20: The future of Data Science and Big Data

Detailed Syllabus

Introduction to Data Science

What is Data Science?

A Crash Course in R

Empirical Frameworks and Experimental Design

Developing an Empirical Framework

Due: Background and interests survey 

Business Experiments, A/B Testing, and RCTs

Working With Data

Exploring Data

Analytics and Inference

Distributions, t-tests, and the importance of basic statistics

Due: Problem Set 1

How far can basic statistics get you? [Guest: Andres Monroy-Hernandez, Microsoft Research]

Regression [Guest: Desmond Murray, Cisco]

Advanced Analytics

Machine Learning 1: Introduction

Due: Problem Set 2

Machine Learning 2: Supervised Learning

Machine Learning 3: Unsupervised Learning

Due: Problem Set 3

Nov 21: Social Network Analysis

Visualizing and Communicating Data

Visualizing Quantitative Data

Scaling to Terabytes and Petabytes

Storage and Organization: Databases, Scalable SQL and NoSQL

Common Applications and tools: MapReduce, Hadoop, Hive, and Alternatives

Group Presentations
