## Note: This is archived material from January 2016.

# INFX 574: Advanced methods in Data Science

**Winter 2016**

**University of Washington School of Information**

**Lectures: Tuesday and Thursday 3:30-5:20, BLD 070**


Instructor: Prof. Joshua Blumenstock, Suite 370 Mary Gates Hall

# Course Description

Provides a theoretical and practical introduction to modern techniques for the analysis of large-scale, heterogeneous data. Covers key concepts in inferential statistics, supervised and unsupervised machine learning, and network analysis. Students will learn functional, procedural, and statistical programming techniques for working with real-world data.

# Prerequisites

INFX 573 or permission of instructor. Students are expected to be comfortable programming in R, and to have mastery of one higher-level programming language such as Python, PHP, Java, C++, etc.

# Course Outline

## (Re-)Introduction to Data Science

#### January 5: Python for Data Science

#### January 7: Introduction to data science

## Causal Inference

#### January 12: Econometrics, Part I: Experimental Methods

#### January 14: Econometrics, Part II: Regression and Impact Evaluation

#### January 19: Econometrics, Part III: Heterogeneity and Fixed Effects

#### January 21: Econometrics, Part IV: Non-Experimental Methods

## Supervised Learning

#### January 26: Intro to Machine Learning

#### January 26 (part 2!): Design of Machine Learning Experiments

#### January 28: Nearest Neighbors

#### February 2: Gradient Descent

#### February 4: Regularization and linear models

#### February 9: Naive Bayes

#### February 11: Trees and Forests

#### February 16: Neural Networks

#### February 23: Supervised Learning Wrap-Up

## Unsupervised Learning

#### February 25: Dimensionality Reduction

#### March 1: Cluster Analysis

## Applications and Implementation

#### March 3: Recommender Systems

#### March 8: Special Topics

#### March 10: Wrap-up

# Grading

Course grades will be based primarily on problem sets, quizzes, and overall classroom participation. Details on each of these assignments will be provided on Canvas. Each component will be weighted as follows:

- Problem Sets: 93%
- Participation, quizzes, and mini-assignments: 3%

# Detailed Syllabus

## (Re-)Introduction to Data Science

### January 5: Python for Data Science

- Crash course in Python
- Pandas, Numpy, SciPy
- IPython and IPython Notebook

##### Required Readings:

- Chapters 3-5, and Chapter 9 of McKinney (2013): Python for Data Analysis. O’Reilly Media, Inc.
- Install python, IPython, and the numerical analysis libraries on your laptop and bring it to class. See course announcement for details.
- Read and complete at least the "Introduction" to the following Python tutorial: http://interactivepython.org/courselib/static/pythonds/index.html
- Watch 10-minute tour of pandas
- Strongly recommended: Read and complete lessons 1-7 of Learn Pandas

### January 7: Introduction to data science

- What is Data Science?
- Nuts and bolts of the class: structure, homework, policies, learning objectives
- Correlation and Causation
- Counterfactuals and Control Groups

##### Required Readings

- Introduction (pp. 263-269) of: Bertrand, M.; Karlan, D.; Mullainathan, S.; Shafir, E.; Zinman, J. (2010) “What's advertising content worth? Evidence from a consumer credit marketing field experiment” Quarterly Journal of Economics, 125(1), pp. 263-306
- Anderson & Simester (2011). “A Step-By-Step Guide to Smart Business Experiments”, Harvard Business Review, pp. 99-105
- Pages 1-19 of: E. Duflo, M. Kremer and R. Glennerster (2006). "Using Randomization in Development Economics Research: A Toolkit"

##### Optional Readings

- Andrew Gelman: “There are four ways to get fired from Caesars”

## Causal Inference

### January 12: Econometrics, Part I: Experimental Methods

- A-B testing, Business Experiments, Randomized Control Trials
- Experimental design and statistical power
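
As a preview of the power calculations covered in this session, here is a minimal sketch of the standard sample-size formula for a two-arm experiment, n ≈ 2(z for significance + z for power)² · σ²/δ² per arm. The function name and numbers are illustrative, not from the readings:

```python
from statistics import NormalDist

def sample_size_per_arm(effect, sd, alpha=0.05, power=0.80):
    """Approximate subjects per arm to detect a difference in means
    of `effect` with a two-sided test at the given alpha and power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    return 2 * ((z_alpha + z_beta) * sd / effect) ** 2

# Detecting a 0.2-standard-deviation effect takes roughly 393 subjects per arm
n = sample_size_per_arm(effect=0.2, sd=1.0)
```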

##### Required Readings

- Pages 19-47 of E. Duflo, M. Kremer and R. Glennerster (2006). "Using Randomization in Development Economics Research: A Toolkit"

##### Optional Readings

- List (2011). "Why Economists Should Conduct Field Experiments and 14 Tips for Pulling One Off." Journal of Economic Perspectives.
- Ariely (2004). “Why Businesses Don’t Experiment”, Harvard Business Review, p. 34
- Kohavi, R., Longbotham, R., Sommerfield, D. and Henne, R. Controlled experiments on the Web: Survey and practical guide. Data Mining and Knowledge Discovery 18 (2009), 140–181.
- Smith & Pell (2003). Parachute use to prevent death and major trauma related to gravitational challenge: Systematic review of randomised controlled trials.

### January 14: Econometrics Part II: Regression and Impact Evaluation

- Regression
- Impact Evaluation

##### Required Readings

- Sections 1-3 of Schultz: School subsidies for the poor
- David Albouy: Lecture notes on Differences in Differences Estimation

##### Optional Readings

- Chapters 2, 3, and 5 of Khandker (2010) Handbook on Impact Evaluation

### January 19: Econometrics III: Heterogeneity and Fixed Effects

- Interactions
- Difference in difference
- Fixed and Random effects models
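
The difference-in-differences idea reduces to four group means: the change in the treated group minus the change in the control group. A minimal sketch with hypothetical numbers:

```python
def did(treat_pre, treat_post, control_pre, control_post):
    """Difference-in-differences from four group means:
    (treated change) minus (control change)."""
    return (treat_post - treat_pre) - (control_post - control_pre)

# Treated gained 5, control gained 2, so the DiD estimate is 3
effect = did(treat_pre=10.0, treat_post=15.0, control_pre=9.0, control_post=11.0)
```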

##### Required Readings

- Lecture notes on Fixed Effects models

##### Optional Readings

- Chapters 6 and 7 of Khandker (2010) Handbook on Impact Evaluation

### January 21: Econometrics IV: Non-Experimental Methods

- Instrumental Variables
- Regression discontinuity
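
With a binary instrument, the instrumental-variables estimate reduces to the Wald ratio: the reduced-form difference in mean outcomes divided by the first-stage difference in mean treatment. A sketch on toy data (illustrative only):

```python
def wald_iv(y, x, z):
    """IV estimate with a binary instrument z: the ratio of the
    reduced-form to the first-stage differences in means."""
    mean = lambda v: sum(v) / len(v)
    y1 = [yi for yi, zi in zip(y, z) if zi == 1]
    y0 = [yi for yi, zi in zip(y, z) if zi == 0]
    x1 = [xi for xi, zi in zip(x, z) if zi == 1]
    x0 = [xi for xi, zi in zip(x, z) if zi == 0]
    return (mean(y1) - mean(y0)) / (mean(x1) - mean(x0))
```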

##### Optional Readings

- Chapters 6 and 7 of Khandker (2010) Handbook on Impact Evaluation
- Varian, Hal R. 2014. "Big Data: New Tricks for Econometrics" Journal of Economic Perspectives, 28(2): 3-28.
- Buddelmeyer & Skoufias (2004). An Evaluation of the Performance of Regression Discontinuity Design on PROGRESA.
- Duflo (2001). Schooling and Labor Market Consequences of School Construction in Indonesia: Evidence from an Unusual Policy Experiment

## Supervised Learning

### January 26: Intro to Machine Learning

- Training and test data
- Introduction to Machine Learning
- Supervised vs. Unsupervised Learning
- Key Issues in (Supervised) Machine Learning
- Philosophical Interlude

##### Required Readings

- P. Domingos, “A Few Useful Things to Know about Machine Learning.” Communications of the ACM, 55 (10), 78-87, 2012.

### January 26 (part 2!): Design of Machine Learning Experiments

- Training and test data
- Cross-validation and bootstrapping
- Evaluation and baselines
- Generalization and overfitting
- Features and feature selection
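
K-fold cross-validation partitions the data into k folds and holds each one out in turn for evaluation. A sketch of the index bookkeeping (illustrative; real data should be shuffled before splitting):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k nearly equal contiguous folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    return folds

folds = kfold_indices(10, 3)  # fold sizes are 4, 3, 3
```

Each fold serves once as the test set while the remaining folds form the training set.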

##### Optional Readings

- Chapter 1 of Daume (in preparation). A course in machine learning
- Chapter 5 of Witten, Frank, Hall: Data Mining

### January 28: Nearest Neighbors

- Instance-based learning
- Nearest neighbors
- Curse of dimensionality
- Locally-weighted regression
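
Nearest-neighbor classification needs only a distance function and a vote; a minimal sketch using Euclidean distance and a majority vote among the k closest training points (data and names are illustrative):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs; predicts the majority
    label among the k training points nearest to `query`."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"),
         ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
label = knn_predict(train, (5.5, 5.5), k=3)
```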

##### Required Readings

- Chapter 4 (especially section 4.7) of Witten, Frank, Hall: Data Mining
- Chapter 2 of Daume (in preparation). A course in machine learning

##### Optional Readings

- Chapter 13 (sections 13.1 - 13.3) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (10th edition)
- Chapter 6 (sections 6.1 - 6.6) on Kernel Estimation in Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (10th edition)
- Chapter 6 of Provost & Fawcett: Data Science for Business

### February 2: Gradient Descent

- Cost functions
- Gradient descent
- Convexity
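
Gradient descent repeatedly steps against the gradient of the cost function. A one-parameter least-squares sketch, where J(w) = (1/2n)·Σ(w·xᵢ − yᵢ)² has gradient (1/n)·Σ(w·xᵢ − yᵢ)·xᵢ (the data and learning rate are illustrative):

```python
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # generated by y = 2x, so the optimum is w = 2

w, lr = 0.0, 0.1
for _ in range(200):
    # gradient of the mean squared error with respect to w
    grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad     # step downhill
```

Because the cost is convex in w, the iterates converge to the unique minimizer.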

##### Required Readings

- Chapter 6 of Daume (in preparation). A course in machine learning
- Reread section 4.6 of Witten, Frank, Hall: Data Mining

##### Optional Readings

- Chapter 5 of Schutt & O’Neill (2013): Doing Data Science
- Chapter 3 (sections 3.1 and 3.2), Chapter 4 (especially section 4.4), and Chapter 6 (sections 6.1-6.3) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (10th edition)
- Zumel and Mount, Chapter 7.

### February 4: Regularization and linear models

- Regularization
- Ridge and LASSO
- Logistic regression
- Support vector machines
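
Regularization adds a penalty that shrinks coefficients toward zero. In the single-coefficient case, ridge regression has a closed form that makes the shrinkage explicit. A sketch (illustrative, not a library implementation):

```python
def ridge_1d(xs, ys, lam):
    """Closed-form ridge solution for one coefficient (no intercept):
    minimizing sum (w*x - y)^2 + lam * w^2 gives
    w = sum(x*y) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
w_ols = ridge_1d(xs, ys, lam=0.0)     # lam = 0 recovers least squares
w_ridge = ridge_1d(xs, ys, lam=14.0)  # larger lam shrinks w toward zero
```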

##### Required Readings

- Chapter 9 of Daume (in preparation). A course in machine learning

##### Optional Readings

- Chapter 3 (especial 3.4) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (10th edition)

### February 9: Naive Bayes

- Probability review: Bayes rule, independence, distributions
- Generative models and Naive Bayes
- Maximum likelihood estimation and smoothing
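
Naive Bayes combines class priors with per-word likelihoods under a conditional-independence assumption. A minimal sketch with Laplace (add-one) smoothing on a toy spam example (all data and names are illustrative):

```python
from collections import Counter
from math import log

def train_nb(docs):
    """docs: list of (word_list, label). Returns log-priors and
    Laplace-smoothed per-word log-likelihoods for each label."""
    labels = Counter(label for _, label in docs)
    vocab = {w for doc, _ in docs for w in doc}
    counts = {lab: Counter() for lab in labels}
    for doc, lab in docs:
        counts[lab].update(doc)
    priors = {lab: log(n / len(docs)) for lab, n in labels.items()}
    likelihood = {lab: {w: log((counts[lab][w] + 1) /
                               (sum(counts[lab].values()) + len(vocab)))
                        for w in vocab}
                  for lab in labels}
    return priors, likelihood

def predict_nb(priors, likelihood, doc):
    """Pick the label maximizing log-prior plus summed log-likelihoods."""
    scores = {lab: priors[lab] + sum(likelihood[lab].get(w, 0.0) for w in doc)
              for lab in priors}
    return max(scores, key=scores.get)

docs = [(["free", "money"], "spam"), (["free", "offer"], "spam"),
        (["meeting", "notes"], "ham"), (["project", "notes"], "ham")]
priors, likelihood = train_nb(docs)
pred = predict_nb(priors, likelihood, ["free", "offer"])
```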

##### Required Readings

- Chapter 4 of Schutt & O’Neill (2013): Doing Data Science.
- Reread section 4.2 of Witten, Frank, Hall: Data Mining
- Michael Collins's lecture notes on Naïve Bayes (especially pp. 1-4)

##### Optional Readings

- Paul Graham (2002) “Better Bayesian Filtering”. http://www.paulgraham.com/better.html
- Kevin Murphy's example of Bayes' Rule for medical diagnosis: http://www.cs.ubc.ca/~murphyk/Bayes/bayesrule.html
- Kanich, Chris, et al. (2008) "Spamalytics: An empirical analysis of spam marketing conversion." ACM conference on Computer and communications security

### February 11: Trees and Forests

- Decision trees
- Adaboost: combining decision stumps
- Random forests and combining classifiers

##### Required Readings

- Chapter 4 of Schutt & O'Neill (2013): Doing Data Science
- Reread section 4.2 of Witten, Frank, Hall: Data Mining
- Michael Collins's lecture notes on Naive Bayes (especially pp. 1-4)

##### Optional Readings

- Paul Graham (2002) "Better Bayesian Filtering". http://www.paulgraham.com/better.html
- Kanich, Chris, et al. (2008) "Spamalytics: An empirical analysis of spam marketing conversion" ACM conference on Computer and communications security.
- Kevin Murphy's example of Bayes' Rule for medical diagnosis

### February 16: Neural Networks

- Perceptrons
- Biological origins
- Model representation
- Cost functions
- Backpropagation
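
The perceptron, the simplest neural unit, updates its weights only on misclassified examples. A sketch on a small linearly separable (AND-style) dataset (names and data are illustrative):

```python
def train_perceptron(data, epochs=20, lr=1.0):
    """data: list of (features, label) with label in {-1, +1}.
    Applies the classic perceptron update on each mistake."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified: nudge toward y
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# AND-like data: positive only when both inputs are 1
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = train_perceptron(data)
predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```

Because the data are linearly separable, the perceptron convergence theorem guarantees the loop finds a separating hyperplane.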

##### Required Readings

- Chapter 8 of Daume (in preparation). A course in machine learning

##### Optional Readings

- Chapter 11 (sections 11.1-11.3) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (10th edition)
- Schmidhuber (2014). “Deep Learning in Neural Networks: An Overview”. Technical Report
- Egmont-Petersen et al. (2002). “Image processing with neural networks--a review” Pattern recognition.

### February 23: Supervised Learning Wrap-Up

##### Required Readings

- Wu et al (2008) “Top 10 Algorithms in Data Mining”

## Unsupervised Learning

### February 25: Dimensionality Reduction

- Dimensionality Reduction
- Principal Component Analysis
- Case study: Eigenfaces
- Other methods for dimensionality reduction: SVD, NNMF, LDA
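
Principal component analysis can be computed by eigendecomposition of the covariance matrix of the centered data. A NumPy sketch (illustrative; for large matrices the SVD route mentioned above is preferred):

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigendecomposition of the covariance matrix.
    Returns the projected data and the components (one per row)."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order].T
    return Xc @ components.T, components

# Points lying near the line y = x: the first component should point
# (up to sign) along the direction (1, 1) / sqrt(2)
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]])
Z, components = pca(X, n_components=1)
```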

##### Required Readings

- Chapter 7 (section 7.4) of Witten, Frank, Hall: Data Mining
- Chapter 11 (sections 11.1 – 11.3) of Rajaraman et al: Mining of Massive Datasets

##### Optional Readings

- Watch Pedro Domingos talk about the curse of dimensionality: https://class.coursera.org/machlearning-001/lecture/215 (segment 4 of week 4)
- Chapter 14 (sections 14.2, 14.5 - 14.10) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (10th edition)
- Turk & Pentland (1991) "Eigenfaces for Recognition"

### March 1: Cluster Analysis

- Introduction to unsupervised learning
- Distance metrics
- K-Means clustering
- Hierarchical clustering
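
K-means (Lloyd's algorithm) alternates between assigning each point to its nearest center and moving each center to the mean of its cluster. A sketch on two well-separated 2-D blobs (data and names are illustrative):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means for 2-D points; returns the final centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize from the data
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[d.index(min(d))].append(p)
        # update step: move each center to its cluster mean
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = kmeans(points, k=2)  # one center lands in each blob
```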

##### Required Readings

- Chapter 13 of Daume (in preparation). A course in machine learning
- Chapter 7 [Read 7.1, 7.2, 7.3] of: Mining of Massive Datasets. Online at http://infolab.stanford.edu/~ullman/mmds/book.pdf

##### Optional Readings

- Chapter 6 of Provost & Fawcett: Data Science for Business

## Applications and Implementation

### March 3: Recommender Systems

- The Netflix challenge
- Content-based methods
- Learning features and parameters
- Nearest-neighbor collaborative filtering
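
Nearest-neighbor collaborative filtering scores an unrated item by a similarity-weighted average of other users' ratings. A toy sketch using cosine similarity, with zeros marking unrated items (a simplification; real systems handle missing ratings more carefully):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# rows = users, columns = items; predict user 0's rating for item 2
ratings = [
    [5, 4, 0],  # target user: item 2 unrated
    [5, 4, 3],  # very similar taste
    [1, 1, 5],  # very different taste
]
target = ratings[0]
sims = [cosine(target, r) for r in ratings[1:]]
# similarity-weighted average of the neighbors' ratings for item 2
pred = sum(s * r[2] for s, r in zip(sims, ratings[1:])) / sum(sims)
```

The similar user pulls the prediction toward 3; the dissimilar user's rating of 5 carries much less weight.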

##### Required Readings

- Chapter 8 of Schutt & O’Neill (2013): Doing Data Science

##### Optional Readings

- Chapter 9 of Rajaraman et al.: Mining of Massive Datasets
- Yehuda Koren (2009) “The BellKor Solution to the Netflix Grand Prize”
- Resnick et al (1994) “GroupLens: an open architecture for collaborative filtering of netnews”, CSCW ’94, pp. 175-186
- RM Bell, Y Koren (2007) “Lessons from the Netflix prize challenge”, ACM SIGKDD Explorations Newsletter

### March 8: Special Topics

- Scaling
- Map-Reduce
- The Hadoop ecosystem
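
The canonical MapReduce example is word count: the map phase emits (word, 1) pairs, the shuffle groups pairs by key, and the reduce phase sums each group. A single-process sketch of the dataflow (illustrative; a real job distributes each phase across machines):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped counts for each word."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle(pairs))
```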

##### Optional Readings

- Chapter 2 of Leskovec et al (2014) Mining of Massive Datasets (freely available online)
- Dean and Ghemawat, “MapReduce: A Flexible Data Processing Tool”, Communications of the ACM. January 2010.
- Hiemstra and Hauff, “MapReduce for Information Retrieval: Let’s Quickly Test This on 12 TB of Data”, In: M. Agosti et al. (Eds.): CLEF2010, pp. 64-69, 2010