Note: This is archived material from June 2016.

 

INFO 371: Core Methods in Data Science

Spring 2016
University of Washington School of Information
Lectures: Monday and Wednesday 3:30-5:20, SAV 156
Labs: Monday 6:30-7:20, MGH 430


Instructor: Prof. Joshua Blumenstock, Suite 370 Mary Gates Hall

Course Description

Introduces students to modern methods in applied data science. Emphasis is given to practical applications and analysis of real-world data, through a survey of common techniques in supervised and unsupervised machine learning, and methods for experimental design and causal inference. Students will learn functional, procedural, and statistical programming techniques for working with data.

Prerequisites

INFX 370 or permission of instructor.

Grading

Course grades will be based primarily on problem sets, quizzes, and overall classroom participation. Details on each of these assignments will be provided on Canvas. Each component will be weighted as follows:

  • Problem Sets: 70%
  • Exams, quizzes, and mini-assignments: 15%
  • Lab and classroom participation: 15%

Detailed Syllabus

(Re-) Introduction to Data Science

March 28: Introduction to the course

  • Introductions
  • Nuts and bolts of the class: structure, homework, policies, learning objectives
March 28 (Lab): Crash course in Python for data science (part 1)

Causal Inference

March 30: Econometrics I: Experimental Methods

  • Correlation and Causation
  • Counterfactuals and Control Groups
  • A-B testing, Business Experiments, Randomized Control Trials
  • Experimental design and statistical power
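To make the power idea concrete, here is a minimal sketch of a sample-size calculation for a two-group experiment, assuming the statsmodels library; the effect size, significance level, and power target are illustrative choices, not values from the readings:

    # Sketch: how many subjects per arm a two-group A-B test needs to detect
    # a small effect. Assumes statsmodels; all numbers are illustrative.
    from statsmodels.stats.power import TTestIndPower

    n_per_group = TTestIndPower().solve_power(
        effect_size=0.2,   # Cohen's d for a "small" effect
        alpha=0.05,        # significance level
        power=0.8)         # desired probability of detecting the effect
    print("subjects needed per group: %.0f" % n_per_group)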
Required Readings
  • Introduction (pp. 263-269) to: Bertrand, M.; Karlan, D.; Mullainathan, S.; Shafir, E.; Zinman, J. (2010). “What's advertising content worth? Evidence from a consumer credit marketing field experiment.” Quarterly Journal of Economics, 125(1), pp. 263-306
  • Anderson & Simester (2011). “A Step-By-Step Guide to Smart Business Experiments”, Harvard Business Review, pp. 99-105
Optional Readings
  • Pages 1-47 of: Duflo, M. Kremer and R. Glennerster (2006). "Using Randomization in Development Economics Research: A Toolkit"
  • Ariely (2004). “Why Businesses Don’t Experiment”, Harvard Business Review, p. 34
  • Kohavi, R., Longbotham, R., Sommerfield, D. and Henne, R. Controlled experiments on the Web: Survey and practical guide. Data Mining and Knowledge Discovery 18 (2009), 140–181.

April 4: Crash course in Python for data science (part 2)

  • Programming paradigms
  • Working with data
  • Crash course in Python
Required Readings
  • Chapters 3-5 and Chapter 9 of McKinney (2013): Python for Data Analysis. O’Reilly Media, Inc.
  • Strongly recommended: Read and complete lessons 1-7 of Learn Pandas (https://bitbucket.org/hrojas/learn-pandas)
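As a taste of the pandas material in these readings, a minimal sketch of the basic operations (the DataFrame contents are made up for illustration):

    # Sketch: core pandas operations of the kind covered in the Learn Pandas lessons.
    # Assumes pandas; the data below is made up.
    import pandas as pd

    df = pd.DataFrame({"name": ["Ada", "Bo", "Cy"],
                       "score": [91, 78, 85],
                       "group": ["a", "b", "a"]})
    print(df.describe())                        # summary statistics
    print(df[df["score"] > 80])                 # boolean filtering
    print(df.groupby("group")["score"].mean())  # split-apply-combine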
April 4 (Lab): NO LAB TODAY

April 6: Econometrics II: Regression and Impact Evaluation

  • Impact Evaluation
  • Regression
Required Readings
  • Sections 1-3 of Schultz: School subsidies for the poor
  • Chapters 2-3 of Khandker et al. (2010), “Handbook on Impact Evaluation”
Optional Readings
  • David Albouy: Lecture notes on Differences in Differences Estimation

April 11: Econometrics III: Heterogeneity and Fixed Effects

  • Interactions
  • Difference in difference
  • Fixed and Random effects models
Required Readings
  • Lecture notes on “Fixed Effects Models”
  • Chapter 5 of Khandker et al. (2010), “Handbook on Impact Evaluation”
April 11 (Lab): Simple statistical tests
  • Perform t-tests
  • Run a basic regression
  • Regression with dummy variables
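A minimal sketch of the kind of t-test and dummy-variable regression this lab describes, assuming numpy, pandas, scipy, and statsmodels, on simulated data:

    # Sketch: a two-sample t-test and the equivalent regression with a treatment dummy.
    # Assumes numpy, pandas, scipy, and statsmodels; the data is simulated.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from scipy import stats

    rng = np.random.RandomState(0)
    df = pd.DataFrame({"treated": np.repeat([0, 1], 50)})
    df["y"] = 2.0 + 1.5 * df["treated"] + rng.normal(size=100)

    # t-test comparing treatment and control group means
    t, p = stats.ttest_ind(df.loc[df.treated == 1, "y"],
                           df.loc[df.treated == 0, "y"])
    print("t = %.2f, p = %.3f" % (t, p))

    # the same comparison as a regression on a dummy variable
    print(smf.ols("y ~ treated", data=df).fit().summary())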

April 13: Econometrics IV: Non-experimental methods

  • Instrumental Variables
  • Regression discontinuity
Required Readings
  • Chapters 6 and 7 of Khandker et al. (2010), “Handbook on Impact Evaluation”
Optional Readings
  • Chapter 10 of Stock & Watson (2010) on “Instrumental Variables”
  • Buddelmeyer & Skoufias (2004). An Evaluation of the Performance of Regression Discontinuity Design on PROGRESA.
  • Duflo (2001). Schooling and Labor Market Consequences of School Construction in Indonesia: Evidence from an Unusual Policy Experiment

Supervised Learning

April 18: Design of Machine Learning experiments

  • Supervised and unsupervised learning
  • Training and test data
  • Cross-validation and bootstrapping
  • Evaluation and baselines
  • Generalization and overfitting
  • Features and feature selection
Required Readings
  • Chapter 1 of Daume (in preparation). A course in machine learning
  • Chapter 5 of Witten, Frank, Hall: Data Mining
Optional Readings
  • Syed, A. (2011). A review of cross validation and adaptive model selection.
April 18 (Lab): Regression and prediction
  • Generate random numbers
  • Create training and test data
  • Fit a regression on training data, evaluate performance on test data
  • Compare different measures of performance
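A minimal scikit-learn sketch of the train/test workflow outlined above; the synthetic data and the choice of metrics are illustrative:

    # Sketch: fit a regression on training data, evaluate on held-out test data.
    # Assumes numpy and scikit-learn; the data is simulated.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    rng = np.random.RandomState(0)
    X = rng.uniform(size=(200, 3))
    y = X.dot(np.array([1.0, -2.0, 0.5])) + rng.normal(scale=0.3, size=200)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                        random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    pred = model.predict(X_test)
    print("test MSE: %.3f" % mean_squared_error(y_test, pred))
    print("test R^2: %.3f" % r2_score(y_test, pred))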

April 20: Nearest neighbors

  • Instance-based learning
  • Nearest neighbors
  • Curse of dimensionality
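As a concrete illustration of instance-based learning, a minimal k-nearest-neighbors sketch with scikit-learn; the dataset and k = 5 are arbitrary choices:

    # Sketch: k-nearest-neighbors classification on scikit-learn's built-in iris data.
    # Assumes scikit-learn; k=5 is an illustrative choice.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print("test accuracy: %.2f" % knn.score(X_test, y_test))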
Required Readings
  • Chapter 2 of Daume (2015) A course in machine learning
Optional Readings
  • Chapter 13 (sections 13.1 - 13.3) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)
  • Chapter 6 of Provost & Fawcett: Data Science for Business

April 25: Gradient Descent

  • Cost functions
  • Gradient descent
  • Convexity
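A minimal numpy sketch of gradient descent on a convex (squared-error) cost with a single slope parameter; the data, learning rate, and iteration count are illustrative:

    # Sketch: gradient descent minimizing mean squared error for one slope parameter.
    # Assumes numpy; the simulated data has a true slope of 3.0.
    import numpy as np

    rng = np.random.RandomState(0)
    x = rng.uniform(size=100)
    y = 3.0 * x + rng.normal(scale=0.1, size=100)

    w, lr = 0.0, 0.1
    for step in range(200):
        grad = -2.0 * np.mean((y - w * x) * x)   # derivative of the cost w.r.t. w
        w -= lr * grad
    print("estimated slope: %.3f" % w)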
Required Readings
  • Reread section 4.6 of Witten, Frank, Hall: Data Mining
  • Chapter 6 of Daume (in preparation). A course in machine learning
Optional Readings
  • Chapter 5 of Schutt & O’Neill (2013): Doing Data Science
  • Zumel and Mount, Chapter 7
April 25 (Lab): Logistic regression
  • Compare regression to LASSO
  • Explore issues of overfitting
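A minimal sketch of the regression-versus-LASSO comparison this lab describes, using scikit-learn on synthetic data where most features are irrelevant:

    # Sketch: ordinary least squares vs. LASSO when only 3 of 50 features matter.
    # Assumes numpy and scikit-learn; the data and alpha are illustrative.
    import numpy as np
    from sklearn.linear_model import Lasso, LinearRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.RandomState(0)
    X = rng.normal(size=(100, 50))
    y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=100)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    for model in (LinearRegression(), Lasso(alpha=0.1)):
        model.fit(X_train, y_train)
        print(type(model).__name__,
              "train R^2: %.2f" % model.score(X_train, y_train),
              "test R^2: %.2f" % model.score(X_test, y_test))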

April 27 & May 2: Regularization and linear models

  • Regularization
  • Ridge and Lasso
  • Logistic regression
  • Support vector machines
  • Kernel methods
Required Readings
  • Chapter 6 of Daume (in preparation). A course in machine learning
  • Chapter 9 of Daume (in preparation). A course in machine learning
Optional Readings
  • Chapter 3 (sections 3.3 and 3.4), of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)
  • Zumel and Mount, Chapter 7
May 2 (Lab): Regularization
  • TBD

May 4: Naïve Bayes

  • Probability review: Bayes rule, independence, distributions
  • Generative models and Naive Bayes
  • Maximum likelihood estimation and smoothing
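A minimal sketch of a generative Naive Bayes classifier with add-one (Laplace) smoothing, assuming scikit-learn; the toy documents and labels are made up:

    # Sketch: multinomial Naive Bayes with Laplace smoothing on toy "spam" data.
    # Assumes scikit-learn; the documents and labels are made up.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = ["free money now", "meeting at noon", "win money free", "lunch meeting today"]
    labels = [1, 0, 1, 0]                           # 1 = spam, 0 = not spam

    vec = CountVectorizer()
    X = vec.fit_transform(docs)                     # bag-of-words counts
    clf = MultinomialNB(alpha=1.0).fit(X, labels)   # alpha=1.0 is add-one smoothing
    print(clf.predict(vec.transform(["free lunch today"])))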
Required Readings
  • Chapter 4 of Schutt & O’Neill (2013): Doing Data Science
  • Reread section 4.2 of Witten, Frank, Hall: Data Mining
  • Michael Collins's lecture notes on Naïve Bayes (especially pp. 1-4)
Optional Readings

May 9: Trees and Forests (Jevin West)

  • Decision trees
  • Adaboost: combining decision stumps
  • Random forests and combining classifiers
Required Readings
  • Chapter 4 of Witten, Frank, Hall: Data Mining
Optional Readings
  • Chapter 9 (section 9.2) and Chapter 15 of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)
May 9 (Lab): TBD

May 11: Neural Networks

  • Perceptrons
  • Biological origins
  • Model representation
  • Cost functions
  • Backpropagation
Required Readings
  • Chapter 8 of Daume (in preparation). A course in machine learning
Optional Readings
  • Schmidhuber (2014). “Deep Learning in Neural Networks: An Overview”. Technical Report
  • Edwards (2015). “Growing pains for deep learning.” Communications of the ACM

May 16: Supervised Learning Wrap-up

Required Readings
  • Wu et al. (2008) “Top 10 Algorithms in Data Mining”
  • Domingos, “A Few Useful Things to Know about Machine Learning” Communications of the ACM, 55 (10), 78-87, 2012.
May 16 (Lab): Neural networks in Python
  • Perceptrons
  • Neural networks in sklearn
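A minimal sketch of the perceptron portion, assuming scikit-learn (newer scikit-learn versions also provide MLPClassifier for multi-layer networks); the digits dataset is an arbitrary choice:

    # Sketch: a linear perceptron classifier on scikit-learn's built-in digits data.
    # Assumes scikit-learn; the dataset and settings are illustrative.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import Perceptron
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = Perceptron(max_iter=1000).fit(X_train, y_train)
    print("test accuracy: %.2f" % clf.score(X_test, y_test))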

Unsupervised Learning

May 18: Dimensionality Reduction

  • Dimensionality Reduction
  • Principal Component Analysis
  • Case study: Eigenfaces
  • Other methods for dimensionality reduction: SVD, NNMF, LDA
Required Readings
  • Chapter 7 (section 7.4) of Witten, Frank, Hall: Data Mining
  • Chapter 11 (sections 11.1 – 11.3) of Rajaraman et al.: Mining of Massive Datasets
Optional Readings
  • Watch Pedro Domingos talk about the curse of dimensionality: https://class.coursera.org/machlearning-001/lecture/215 (segment 4 of week 4)
  • Chapter 14 (sections 14.2, 14.5 - 14.10) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)
  • Turk & Pentland (1991) “Eigenfaces for Recognition”

May 23: Cluster Analysis

  • Introduction to unsupervised learning
  • Distance metrics
  • K-Means clustering
  • Hierarchical clustering
Required Readings
Optional Readings
  • Chapter 6 of Provost & Fawcett: Data Science for Business
May 23 (Lab): Unsupervised Learning
  • Clustering
  • PCA
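A minimal scikit-learn sketch of the clustering and PCA tasks above; the iris dataset and the number of clusters and components are arbitrary choices:

    # Sketch: k-means clustering and a 2-D PCA projection of the built-in iris data.
    # Assumes scikit-learn; 3 clusters and 2 components are illustrative choices.
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)
    labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)  # cluster assignments
    X_2d = PCA(n_components=2).fit_transform(X)                   # project to 2 dimensions
    print(labels[:10])
    print(X_2d[:3])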

Applications and Implementation

May 25: Recommender Systems

  • The Netflix challenge
  • Content-based methods
  • Learning features and parameters
  • Nearest-neighbor collaborative filtering
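To make the nearest-neighbor collaborative filtering idea concrete, a minimal numpy sketch on a made-up user-item ratings matrix (treating unrated items as zeros is a simplification):

    # Sketch: user-based collaborative filtering with cosine similarity.
    # Rows are users, columns are items; 0 means "not yet rated". Data is made up.
    import numpy as np

    R = np.array([[5, 4, 0, 1],
                  [4, 5, 1, 0],
                  [1, 0, 5, 4],
                  [0, 1, 4, 5]], dtype=float)

    def cosine(u, v):
        return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

    target, item = 0, 2                      # predict user 0's rating of item 2
    others = [u for u in range(len(R)) if u != target]
    sims = np.array([cosine(R[target], R[u]) for u in others])
    pred = sims.dot(R[others, item]) / sims.sum()   # similarity-weighted average
    print("predicted rating: %.2f" % pred)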
Required Readings
  • Chapter 8 of Schutt & O’Neill (2013): Doing Data Science
Optional Readings
  • Chapter 9 of Leskovec et al: Mining of Massive Datasets (freely available online)
  • Yehuda Koren (2009) “The BellKor Solution to the Netflix Grand Prize"
  • Resnick et al (1994) “GroupLens: an open architecture for collaborative filtering of netnews”, CSCW ’94, pp. 175-186
  • RM Bell, Y Koren (2007) “Lessons from the Netflix prize challenge”, ACM SIGKDD Explorations Newsletter

May 30: Scaling and Map-Reduce

  • Scaling
  • Map-Reduce
  • The Hadoop ecosystem
Optional Readings
  • Chapter 2 of Leskovec et al (2014) Mining of Massive Datasets (freely available online)
  • Dean and Ghemawat, “MapReduce: A Flexible Data Processing Tool”, Communications of the ACM. January 2010.
  • Hiemstra and Hauff, “MapReduce for Information Retrieval: Let’s Quickly Test This on 12 TB of Data”, In: M. Agosti et al. (Eds.): CLEF 2010, pp. 64-69, 2010
May 30 (Lab): Map-Reduce in Python
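A minimal pure-Python sketch of the map-reduce pattern as a word count (no Hadoop; the input documents are made up):

    # Sketch: the map-shuffle-reduce pattern as a word count in plain Python.
    # A real Hadoop job would distribute the map and reduce steps across machines.
    from itertools import groupby
    from operator import itemgetter

    docs = ["the cat sat", "the cat ran", "a dog ran"]

    pairs = [(word, 1) for doc in docs for word in doc.split()]   # map: emit (word, 1)
    pairs.sort(key=itemgetter(0))                                 # shuffle: group by key
    counts = {word: sum(c for _, c in group)                      # reduce: sum counts
              for word, group in groupby(pairs, key=itemgetter(0))}
    print(counts)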

June 1: Last Day of Class: Jeopardy + mini-Final

  • Review all lecture notes and readings for the quarter