## Note: This is archived material from June 2016.

# INFO 371: Core Methods in Data Science

**Spring 2016**

**University of Washington School of Information**

**Lectures: Monday and Wednesday 3:30-5:20, SAV 156**

**Labs: Monday 6:30-7:20, MGH 430**


Instructor: Prof. Joshua Blumenstock, Suite 370 Mary Gates Hall

Introduces students to modern methods in applied data science. Emphasis is given to practical applications and analysis of real-world data, through a survey of common techniques in supervised and unsupervised machine learning, and methods for experimental design and causal inference. Students will learn functional, procedural, and statistical programming techniques for working with data.

# Course Outline:

## (Re-)Introduction to Data Science

#### March 28: Introduction to the course

## Applications and Implementation

#### May 25: Recommender Systems

#### May 30: Scaling and Map-Reduce

#### June 1: Last Day of Class: Jeopardy + mini-Final

Course grades will be based primarily on problem sets, quizzes, and overall classroom participation. Details on each of these assignments will be provided on Canvas. Each component will be weighted as follows:

- Problem Sets: 70%
- Exams, quizzes, and mini-assignments: 15%
- Lab and classroom participation: 15%

# Detailed Syllabus

## (Re-) Introduction to Data Science

### March 28: Introduction to the course

- Introductions
- Nuts and bolts of the class: structure, homework, policies, learning objectives

##### March 28 (Lab): Crash course in Python for data science (part 1)

- Install the necessary software on your personal computer. I strongly recommend using Anaconda (Python 2.7): https://www.continuum.io/downloads. If you don’t use Anaconda, make sure to install Python, IPython, pandas, NumPy, SciPy, and Matplotlib.
- Read and complete at least the "Introduction" to the following Python tutorial: http://interactivepython.org/courselib/static/pythonds/index.html
- Watch 10-minute tour of pandas: https://vimeo.com/59324550
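To get a feel for what the pandas tour covers, here is a minimal sketch (the data and column names are made up for illustration) of building a small DataFrame and summarizing it by group:

```python
import pandas as pd

# A toy DataFrame: four students, two sections
df = pd.DataFrame({
    "student": ["a", "b", "c", "d"],
    "score": [88, 92, 79, 85],
    "section": ["x", "x", "y", "y"],
})

# Group by section and compute the mean score per group
means = df.groupby("section")["score"].mean()
```

Split-apply-combine operations like this `groupby` are the core workflow the tutorial and the video both emphasize.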

## Causal Inference

### March 30: Econometrics I: Experimental Methods

- Correlation and Causation
- Counterfactuals and Control Groups
- A-B testing, Business Experiments, Randomized Control Trials
- Experimental design and statistical power
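As a flavor of the power calculations we will discuss, here is a sketch using statsmodels (the effect size, power, and significance level are illustrative, not from any of the readings) to ask how many subjects per arm an experiment needs:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Sample size per arm to detect a 0.5-SD effect with 80% power at alpha = 0.05
n = analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05)
```

The answer (roughly 64 per arm) illustrates why detecting small effects requires large experiments: halving the effect size roughly quadruples the required sample.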

##### Required Readings

- Introduction (pp. 263-269) of: Bertrand, M.; Karlan, D.; Mullainathan, S.; Shafir, E.; Zinman, J. (2010). “What's advertising content worth? Evidence from a consumer credit marketing field experiment.” Quarterly Journal of Economics, 125(1), pp. 263-306
- Anderson & Simester (2011). “A Step-By-Step Guide to Smart Business Experiments”, Harvard Business Review, pp. 99-105

##### Optional Readings

- Pages 1-47 of: Duflo, E., R. Glennerster, and M. Kremer (2006). "Using Randomization in Development Economics Research: A Toolkit"
- Ariely (2004). “Why Businesses Don’t Experiment”, Harvard Business Review, p. 34
- Kohavi, R., Longbotham, R., Sommerfield, D. and Henne, R. Controlled experiments on the Web: Survey and practical guide. Data Mining and Knowledge Discovery 18 (2009), 140–181.

### April 4: Crash course in Python for data science (part 2)

- Programming paradigms
- Working with data
- Crash course in python

##### Required Readings

- Chapters 3-5 and Chapter 9 of McKinney (2013): Python for Data Analysis. O’Reilly Media, Inc.
- Strongly recommended: Read and complete lessons 1-7 of Learn Pandas (https://bitbucket.org/hrojas/learn-pandas)

##### April 4 (Lab): NO LAB TODAY

### April 6: Econometrics II: Regression and Impact Evaluation

- Impact Evaluation
- Regression

##### Required Readings

- Sections 1-3 of Schultz: School subsidies for the poor
- Chapters 2-3 of Khandker et al. (2010), “Handbook on Impact Evaluation"

##### Optional Readings

- David Albouy: Lecture notes on Differences in Differences Estimation

### April 11: Econometrics III: Heterogeneity and Fixed Effects

- Interactions
- Difference in difference
- Fixed and Random effects models

##### Required Readings

- Lecture notes on “Fixed Effects Models"
- Chapter 5 of Khandker et al. (2010), “Handbook on Impact Evaluation"

##### April 11 (Lab): Simple statistical tests

- Perform t-tests
- Run a basic regression
- Regression with dummy variables
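A sketch of what this lab involves, on simulated data (the variable names and effect sizes are invented for illustration): a two-sample t-test, and the same comparison run as a regression with a treatment dummy.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.RandomState(0)
treated = rng.normal(1.0, 1.0, 100)  # simulated outcomes, treatment group
control = rng.normal(0.0, 1.0, 100)  # simulated outcomes, control group

# Two-sample t-test for a difference in means
t_stat, p_val = stats.ttest_ind(treated, control)

# The same comparison as a regression on a treatment dummy
df = pd.DataFrame({
    "y": np.concatenate([treated, control]),
    "treated": [1] * 100 + [0] * 100,
})
fit = smf.ols("y ~ treated", data=df).fit()
coef = fit.params["treated"]
```

The dummy-variable coefficient equals the difference in group means exactly, which is why the regression and the t-test are two views of the same comparison.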

### April 13: Econometrics IV: Non-experimental methods

- Instrumental Variables
- Regression discontinuity

##### Required Readings

- Chapters 6 and 7 of Khandker et al. (2010), “Handbook on Impact Evaluation"

##### Optional Readings

- Chapter 10 of Stock & Watson (2010) on “Instrumental Variables”
- Buddelmeyer & Skoufias (2004). An Evaluation of the Performance of Regression Discontinuity Design on PROGRESA.
- Duflo (2001). Schooling and Labor Market Consequences of School Construction in Indonesia: Evidence from an Unusual Policy Experiment

## Supervised Learning

### April 18: Design of Machine Learning experiments

- Supervised and unsupervised learning
- Training and test data
- Cross-validation and bootstrapping
- Evaluation and baselines
- Generalization and overfitting
- Features and feature selection

##### Required Readings

- Chapter 1 of Daume (in preparation). A course in machine learning
- Chapter 5 of Witten, Frank, Hall: Data Mining

##### Optional Readings

- Syed, A. (2011). A review of cross validation and adaptive model selection.

##### April 18 (Lab): Regression and prediction

- Generate random numbers
- Create training and test data
- Fit a regression on training data, evaluate performance on test data
- Compare different measures of performance
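A minimal sketch of the lab's workflow with scikit-learn (the simulated data and split proportions are illustrative; the import paths are those of current scikit-learn and may differ from the course environment):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Generate random numbers: y = 3x plus noise
rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, (200, 1))
y = 3 * X[:, 0] + rng.normal(0, 0.5, 200)

# Create training and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit on training data, evaluate on held-out test data
model = LinearRegression().fit(X_train, y_train)
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
```

Comparing `train_mse` to `test_mse` is the simplest check for overfitting: a model that memorizes the training set shows a large gap between the two.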

### April 20: Nearest neighbors

- Instance-based learning
- Nearest neighbors
- Curse of dimensionality

##### Required Readings

- Chapter 2 of Daume (2015) A course in machine learning

##### Optional Readings

- Chapter 13 (sections 13.1 - 13.3) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)
- Chapter 6 of Provost & Fawcett: Data Science for Business

### April 25: Gradient Descent

- Cost functions
- Gradient descent
- Convexity
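The core idea of the lecture fits in a few lines: repeatedly step opposite the gradient of a convex cost. A sketch on the one-dimensional cost f(w) = (w - 3)², whose gradient is 2(w - 3) (the cost, starting point, and learning rate are illustrative):

```python
def grad(w):
    # Gradient of the convex cost f(w) = (w - 3)^2
    return 2 * (w - 3)

w, lr = 0.0, 0.1  # starting point and learning rate
for _ in range(100):
    w -= lr * grad(w)  # step downhill
```

Because the cost is convex, the iterates contract toward the unique minimum at w = 3; with a non-convex cost, the same update can stall in a local minimum, which is why convexity matters.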

##### Required Readings

- Reread section 4.6 of Witten, Frank, Hall: Data Mining
- Chapter 6 of Daume (in preparation). A course in machine learning

##### Optional Readings

- Chapter 5 of Schutt & O’Neill (2013): Doing Data Science
- Zumel and Mount, Chapter 7

##### April 25 (Lab): Logistic regression

- Compare regression to LASSO
- Explore issues of overfitting
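A sketch of the OLS-versus-LASSO comparison on simulated data (the dimensions, true coefficients, and penalty strength are made up for illustration): when most features are irrelevant, the L1 penalty zeroes most of them out while ordinary least squares keeps them all.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 20))
# Only the first two of twenty features actually matter
y = 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 0.5, 100)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# OLS assigns a (small) nonzero weight to every feature;
# the L1 penalty drives most irrelevant weights exactly to zero
n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(np.abs(lasso.coef_) < 1e-10))
```

This sparsity is why regularization helps with overfitting: the penalized model ignores noise features that OLS would happily fit.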

### April 27 & May 2: Regularization and linear models

- Regularization
- Ridge and Lasso
- Logistic regression
- Support vector machines
- Kernel methods

##### Required Readings

- Chapter 6 of Daume (in preparation). A course in machine learning
- Chapter 9 of Daume (in preparation). A course in machine learning

##### Optional Readings

- Chapter 3 (sections 3.3 and 3.4) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)
- Zumel and Mount, Chapter 7

##### May 2 (Lab): Regularization

- TBD

### May 4: Naïve Bayes

- Probability review: Bayes rule, independence, distributions
- Generative models and Naive Bayes
- Maximum likelihood estimation and smoothing
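The probability review can be sketched with the classic diagnostic-test example (the same setup as the Murphy link below; the numbers here are illustrative): even with a fairly accurate test, a rare condition stays unlikely after a positive result.

```python
# Bayes' rule: P(disease | positive test)
p_disease = 0.01            # 1% base rate
p_pos_given_disease = 0.8   # sensitivity
p_pos_given_healthy = 0.1   # false-positive rate

# Total probability of a positive test
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
```

The posterior is only about 7.5%, because the 99% of healthy people generate far more false positives than the 1% with the disease generate true positives; generative classifiers like Naive Bayes apply exactly this computation with class labels in place of diagnoses.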

##### Required Readings

- Chapter 4 of Schutt & O’Neill (2013): Doing Data Science
- Reread section 4.2 of Witten, Frank, Hall: Data Mining
- Michael Collins's lecture notes on Naïve Bayes (especially pp. 1-4)

##### Optional Readings

- Paul Graham (2002) “Better Bayesian Filtering" http://www.paulgraham.com/better.html
- Kevin Murphy's example of Bayes' Rule for medical diagnosis: http://www.cs.ubc.ca/~murphyk/Bayes/bayesrule.html

### May 9: Trees and Forests (Jevin West)

- Decision trees
- Adaboost: combining decision stumps
- Random forests and combining classifiers

##### Required Readings

- Chapter 4 of Witten, Frank, Hall: Data Mining

##### Optional Readings

- Chapter 9 (section 9.2) and Chapter 15 of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)

##### May 9 (Lab): TBD

### May 11: Neural Networks

- Perceptrons
- Biological origins
- Model representation
- Cost functions
- Backpropagation

##### Required Readings

- Chapter 8 of Daume (in preparation). A course in machine learning

##### Optional Readings

- Schmidhuber (2014). “Deep Learning in Neural Networks: An Overview”. Technical Report
- Edwards (2015). “Growing pains for deep learning.” Communications of the ACM

### May 16: Supervised Learning Wrap-up

##### Required Readings

- Wu et al. (2008), “Top 10 Algorithms in Data Mining"
- Domingos, “A Few Useful Things to Know about Machine Learning” Communications of the ACM, 55 (10), 78-87, 2012.

##### May 16 (Lab): Neural networks in python

- Perceptrons
- Neural networks in sklearn
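A sketch of the perceptron portion of this lab on a linearly separable toy problem (the data and parameter choices are illustrative; the `Perceptron` arguments follow current scikit-learn, which may differ slightly from the course-era version):

```python
import numpy as np
from sklearn.linear_model import Perceptron

# A linearly separable toy problem: label points by the sign of x1 + x2
rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, (200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# The perceptron learning rule converges on separable data
clf = Perceptron(max_iter=1000, tol=None, random_state=0).fit(X, y)
acc = clf.score(X, y)
```

Because the classes are separable by a line through the origin, the perceptron convergence theorem guarantees a perfect separator in finitely many updates; on non-separable data the same rule never settles, which motivates the logistic and neural-network alternatives.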

## Unsupervised Learning

### May 18: Dimensionality Reduction

- Dimensionality Reduction
- Principal Component Analysis
- Case study: Eigenfaces
- Other methods for dimensionality reduction: SVD, NNMF, LDA

##### Required Readings

- Chapter 7 (section 7.4) of Witten, Frank, Hall: Data Mining
- Chapter 11 (sections 11.1 – 11.3) of Rajaraman et al.: Mining of Massive Datasets

##### Optional Readings

- Watch Pedro Domingos talk about the curse of dimensionality: https://class.coursera.org/machlearning-001/lecture/215 (segment 4 of week 4)
- Chapter 14 (sections 14.2, 14.5 - 14.10) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)
- Turk & Pentland (1991), “Eigenfaces for Recognition"

### May 23: Cluster Analysis

- Introduction to unsupervised learning
- Distance metrics
- K-Means clustering
- Hierarchical clustering

##### Required Readings

- Chapter 13 of Daume (in preparation). A course in machine learning
- Chapter 7 (sections 7.1 - 7.3) of: Mining of Massive Datasets. Online at http://infolab.stanford.edu/~ullman/mmds/book.pdf

##### Optional Readings

- Chapter 6 of Provost & Fawcett: Data Science for Business

##### May 23 (Lab): Unsupervised Learning

- Clustering
- PCA
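A minimal sketch of both halves of this lab on simulated data (the blob positions and dimensions are made up for illustration): K-Means recovers two well-separated groups, and PCA projects the five-dimensional points down to two for plotting.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Two well-separated blobs in 5 dimensions
X = np.vstack([
    rng.normal(0, 0.3, (50, 5)),
    rng.normal(3, 0.3, (50, 5)),
])

# K-Means assigns each point to one of two clusters
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# PCA projects the data onto its top two principal components
X2 = PCA(n_components=2).fit_transform(X)
```

Because neither step sees the true group memberships, this is unsupervised: the structure is recovered from distances and variance alone.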

## Applications and Implementation

### May 25: Recommender Systems

- The Netflix challenge
- Content-based methods
- Learning features and parameters
- Nearest-neighbor collaborative filtering

##### Required Readings

- Chapter 8 of Schutt & O’Neill (2013): Doing Data Science

##### Optional Readings

- Chapter 9 of Leskovec et al: Mining of Massive Datasets (freely available online)
- Yehuda Koren (2009) “The BellKor Solution to the Netflix Grand Prize"
- Resnick et al (1994) “GroupLens: an open architecture for collaborative filtering of netnews”, CSCW ’94, pp. 175-186
- RM Bell, Y Koren (2007) “Lessons from the Netflix prize challenge”, ACM SIGKDD Explorations Newsletter

### May 30: Scaling and Map-Reduce

- Scaling
- Map-Reduce
- The Hadoop ecosystem

##### Optional Readings

- Chapter 2 of Leskovec et al (2014) Mining of Massive Datasets (freely available online)
- Dean and Ghemawat, “MapReduce: A Flexible Data Processing Tool”, Communications of the ACM. January 2010.
- Heimstra and Hauff, “MapReduce for Information Retrieval: Let’s Quickly Test This on 12 TB of Data”, In: M. Agosti et al. (Eds.): CLEF2010, pp. 64-69, 2010

##### May 30 (Lab): Map-Reduce in Python
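The map-reduce pattern can be sketched in pure Python with the canonical word-count example (the documents are made up; a real job would distribute the map, shuffle, and reduce phases across machines, as in Hadoop):

```python
from itertools import groupby

docs = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: emit a (word, 1) pair for every word occurrence
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: bring all pairs with the same key together
shuffled = groupby(sorted(mapped), key=lambda kv: kv[0])

# Reduce: sum the counts within each key's group
counts = {word: sum(c for _, c in pairs) for word, pairs in shuffled}
```

The point of the decomposition is that the map and reduce steps touch independent pieces of data, so each phase parallelizes trivially; only the shuffle requires moving data between workers.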

### June 1: Last Day of Class: Jeopardy + mini-Final

- Review all lecture notes and readings for the quarter