Note: This is archived material from September 2013.


INFX 573: Introduction to Data Science

Winter 2013
University of Washington School of Information
Lectures: Tuesday and Thursday 1:30-3:20, MGH 420


Instructor: Prof. Joshua Blumenstock, Suite 370 Mary Gates Hall

Course Description

This course offers students a practical, hands-on introduction to the growing field of "Data Science" and to common methods for quantitative and computational analytics. As "big data" becomes the norm in modern business and research environments, there is a growing demand for individuals who can derive meaningful insight from large, unruly data. This requires a heterogeneous mix of skills, from data munging and ETL, to machine learning and econometrics, to effective visualization and communication. Through a combination of data-intensive exercises and guest lectures by experts in the field, this course provides an overview of key concepts, skills, and technologies used by data scientists. Interested students must have college-level exposure to statistics and programming.

Prerequisites


A data scientist is often described as someone who knows more statistics than a computer scientist and more computer science than a statistician. Most assignments for the course will use the R programming language, and students are strongly encouraged to familiarize themselves with R before the first day of class. Please note that students who do not meet the programming and statistics requirements below will find it extremely difficult to complete the required problem sets.

Programming: Students should be able to program comfortably in a high-level programming language such as Java, Python, PHP, or C/C#/C++. Note that HTML, JavaScript, and VBA are not sufficient in this context. "Comfortably" implies that students should be able to write simple programs from scratch, such as a web scraper, a text parser, or a simple game of Scrabble or tic-tac-toe.
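As a rough gauge of the level meant by "simple programs from scratch," the sketch below shows a minimal text parser that counts word frequencies. It is only an illustration and not part of the course materials; the function names `word_counts` and `top_words` and the sample sentence are hypothetical, and Python is used simply because it is one of the languages listed above.

```python
import re
from collections import Counter


def word_counts(text):
    """Return a Counter mapping each lowercase word in `text` to its frequency."""
    # Treat runs of letters and apostrophes as words; everything else is a separator.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def top_words(text, n=3):
    """Return the n most frequent words as (word, count) pairs."""
    return word_counts(text).most_common(n)


if __name__ == "__main__":
    # Hypothetical sample input, chosen only for illustration.
    sample = "Data science is the science of learning from data."
    for word, count in top_words(sample):
        print(word, count)
```

A student who could write something like this without reference material, and extend it to read from a file or a web page, would likely meet the programming prerequisite.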

Statistics: Students should have had introductory coursework in both probability and statistics prior to enrolling in this course. At a minimum, students should have an operational understanding of hypothesis testing, statistical significance, and regression analysis.

Course Outline:

Section 1: Introduction to Data Science

Lecture 1: What is Data Science?

Lecture 2: A Crash Course in R

Section 2: Empirical Frameworks and Experimental Design

Lecture 3: Developing an Empirical Framework

Lecture 4: Business experiments

Section 3: Working with Data

Lecture 5: Exploring Data, part 1

Lecture 6: Exploring Data, part 2

Section 4: Basic analytics

Lecture 7: Distributions, t-tests, and the importance of basic statistics

Lecture 8: How far can basic statistics get you?

Lecture 9: Regression

Section 5: Advanced analytics

Lecture 10: Machine Learning I: Introduction

Lecture 11: Machine Learning 2: Supervised Learning

Lecture 12: Machine Learning 3: Unsupervised Learning

Lecture 13: Social Network Analysis

Section 6: Visualizing and Communicating Data

Lecture 14: Visualizing Quantitative Information

Lecture 15: Communicating Results Effectively

Section 7: Scaling to terabytes and petabytes

Lecture 17: Scaling: What works and what doesn’t (and what might in the future)

Lecture 18: Common applications and tools: MapReduce, Hadoop, Hive, and Alternatives

Section 8: Perspectives from industry and academia

Lecture 19: Group Project Presentations

Lecture 20: The future of Data Science and Big Data

Detailed Syllabus

Introduction to Data Science

What is Data Science?

A Crash Course in R

  • Chapter 1 of Torgo, Data Mining with R [read this first]
  • Chapter 1 of Spector, Data Manipulation with R [follow along with all examples]
  • Chapter 2 of Spector, Data Manipulation with R [follow along with all examples]

Empirical Frameworks and Experimental Design

Developing an Empirical Framework

Due: Background and interests survey 

Business Experiments, A/B Testing, and RCTs

Working With Data

Exploring Data

  • Chapter 3 of Zumel & Mount, Practical Data Science with R
  • Getting Started with Charts in R [make sure you understand everything thoroughly]
  • Complete the ggplot2 tutorial on Canvas
  • Review other examples here.

Analytics and Inference

Distributions, t-tests, and the importance of basic statistics

Due: Problem Set 1

How far can basic statistics get you? [Guest: Andres Monroy-Hernandez, Microsoft Research]

  • Chapters 6 & 19 of Freedman, Pisani, and Purves: Statistics
  • Chapters 1 & 2 of Freedman: Statistical Models: Theory and Practice
  • Watch
  • Chapters TBD of Torgo: Data Mining with R: Learning with Case Studies

Regression [Guest: Desmond Murray, Cisco]

  • Chapter 6 of: A Handbook of Statistical Analyses Using R

Advanced Analytics

Machine Learning I: Introduction

Due: Problem Set 2

Machine Learning 2: Supervised Learning

Required Readings
  • Chapter 8 of The Signal and the Noise: “Less and Less and Less Wrong” (Bayes’ Theorem)
Optional Readings
  • C. Haruechaiyasak: “A Tutorial on Naive Bayes Classification”, 2008.
  • Watch this video [from 32:00]

Machine Learning 3: Unsupervised Learning

Due: Problem Set 3

Nov 21: Social Network Analysis

  • N. Godbole et al.: “Large-scale sentiment analysis for news and blogs”, International Conference on Weblogs and Social Media, 2007.
  • L. Page et al.: “The PageRank citation ranking: Bringing order to the web”, Stanford, 1999.
  • Albert R, Jeong H, Barabasi AL. (1999): “Diameter of the World Wide Web”. Nature, 401:130-131 

Visualizing and Communicating Data

Visualizing Quantitative Data

  • Excerpts from Edward Tufte, “The Visual Display of Quantitative Information”
  • Review the ggplot2 tutorial on Canvas
  • [Optional:] WSJ Guide to Information Visualization
  • [Optional:] “How to Lie with Charts” and “How to Lie with Maps”
  • [Optional:] Excerpts from Beautiful Data

Scaling to Terabytes and Petabytes

Storage and Organization: Databases, Scalable SQL and NoSQL

Common Applications and Tools: MapReduce, Hadoop, Hive, and Alternatives

Group Presentations

  • No required readings