## Note: This is archived material from January 2016.

# INFX 574: Advanced methods in Data Science

**Winter 2016**

**University of Washington School of Information**

**Lectures: Tuesday and Thursday 3:30-5:20, BLD 070**


Instructor: Prof. Joshua Blumenstock, Suite 370 Mary Gates Hall

# Course Description

Provides a theoretical and practical introduction to modern techniques for the analysis of large-scale, heterogeneous data. Covers key concepts in inferential statistics, supervised and unsupervised machine learning, and network analysis. Students will learn functional, procedural, and statistical programming techniques for working with real-world data.

# Prerequisites

INFX 573 or permission of instructor. Students are expected to be comfortable programming in R, and to have mastery of one higher-level programming language such as Python, PHP, Java, C++, etc.

# Course Outline

## (Re-)Introduction to Data Science

#### January 5: Python for Data Science

#### January 7: Introduction to data science

## Causal Inference

#### January 12: Econometrics, Part I: Experimental Methods

#### January 14: Econometrics, Part II: Regression and Impact Evaluation

#### January 19: Econometrics, Part III: Heterogeneity and Fixed Effects

#### January 21: Econometrics, Part IV: Non-Experimental Methods

## Supervised Learning

#### January 26: Intro to Machine Learning

#### January 26 (part 2!): Design of Machine Learning Experiments

#### January 28: Nearest Neighbors

#### February 2: Gradient Descent

#### February 4: Regularization and linear models

#### February 9: Naive Bayes

#### February 11: Trees and Forests

#### February 16: Neural Networks

#### February 23: Supervised Learning Wrap-Up

## Unsupervised Learning

#### February 25: Dimensionality Reduction

#### March 1: Cluster Analysis

## Applications and Implementation

#### March 3: Recommender Systems

#### March 8: Special Topics

#### March 10: Wrap-up

# Grading

Course grades will be based primarily on problem sets, quizzes, and overall classroom participation. Details on each of these assignments will be provided on Canvas. Each component will be weighted as follows:

- Problem Sets: 93%
- Participation, quizzes, and mini-assignments: 3%

# Detailed Syllabus

## (Re-)Introduction to Data Science

### January 5: Python for Data Science

- Crash course in Python
- Pandas, Numpy, SciPy
- IPython and IPython Notebook

##### Required Readings:

- Chapters 3-5, and Chapter 9 of McKinney (2013): Python for Data Analysis. O’Reilly Media, Inc.
- Install python, IPython, and the numerical analysis libraries on your laptop and bring it to class. See course announcement for details.
- Read and complete at least the "Introduction" to the following Python tutorial: http://interactivepython.org/courselib/static/pythonds/index.html
- Watch 10-minute tour of pandas
- Strongly recommended: Read and complete lessons 1-7 of Learn Pandas

### January 7: Introduction to data science

- What is Data Science?
- Nuts and bolts of the class: structure, homework, policies, learning objectives
- Correlation and Causation
- Counterfactuals and Control Groups

##### Required Readings

- Introduction (pp. 263-269) of: Bertrand, M.; Karlan, D.; Mullainathan, S.; Shafir, E.; Zinman, J. (2010) “What's advertising content worth? Evidence from a consumer credit marketing field experiment” Quarterly Journal of Economics, 125(1), pp. 263-306
- Anderson & Simester (2011). “A Step-By-Step Guide to Smart Business Experiments”, Harvard Business Review, pp. 99-105
- Pages 1-19 of: E. Duflo, M. Kremer and R. Glennerster (2006). "Using Randomization in Development Economics Research: A Toolkit"

##### Optional Readings

- Andrew Gelman: “There are four ways to get fired from Caesars”

## Causal Inference

### January 12: Econometrics, Part I: Experimental Methods

- A-B testing, Business Experiments, Randomized Control Trials
- Experimental design and statistical power
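
As a preview of the power calculations covered in this session, here is a minimal sketch of the standard sample-size formula for a two-arm experiment, n ≈ 2(z for significance + z for power)² · σ²/δ² per arm. The function name and numbers are illustrative, not from the readings:

```python
from statistics import NormalDist

def sample_size_per_arm(effect, sd, alpha=0.05, power=0.80):
    """Approximate subjects per arm to detect a difference in means
    of `effect` with a two-sided test at the given alpha and power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    return 2 * ((z_alpha + z_beta) * sd / effect) ** 2

# Detecting a 0.2-standard-deviation effect takes roughly 393 subjects per arm
n = sample_size_per_arm(effect=0.2, sd=1.0)
```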

##### Required Readings

- Pages 19-47 of E. Duflo, M. Kremer and R. Glennerster (2006). "Using Randomization in Development Economics Research: A Toolkit"

##### Optional Readings

- List (2011). "Why Economists Should Conduct Field Experiments and 14 Tips for Pulling One Off." Journal of Economic Perspectives.
- Ariely (2004). “Why Businesses Don’t Experiment”, Harvard Business Review, p. 34
- Kohavi, R., Longbotham, R., Sommerfield, D. and Henne, R. Controlled experiments on the Web: Survey and practical guide. Data Mining and Knowledge Discovery 18 (2009), 140–181.
- Smith & Pell (2003). Parachute use to prevent death and major trauma related to gravitational challenge: Systematic review of randomised controlled trials.

### January 14: Econometrics Part II: Regression and Impact Evaluation

- Regression
- Impact Evaluation

##### Required Readings

- Sections 1-3 of Schultz: School subsidies for the poor
- David Albouy: Lecture notes on Differences in Differences Estimation

##### Optional Readings

- Chapters 2, 3, and 5 of Khandker (2010) Handbook on Impact Evaluation

### January 19: Econometrics III: Heterogeneity and Fixed Effects

- Interactions
- Difference in difference
- Fixed and Random effects models
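
The difference-in-differences idea reduces to four group means: the change in the treated group minus the change in the control group. A minimal sketch with hypothetical numbers:

```python
def did(treat_pre, treat_post, control_pre, control_post):
    """Difference-in-differences from four group means:
    (treated change) minus (control change)."""
    return (treat_post - treat_pre) - (control_post - control_pre)

# Treated gained 5, control gained 2, so the DiD estimate is 3
effect = did(treat_pre=10.0, treat_post=15.0, control_pre=9.0, control_post=11.0)
```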

##### Required Readings

- Lecture notes on Fixed Effects models

##### Optional Readings

- Chapters 6 and 7 of Khandker (2010) Handbook on Impact Evaluation

### January 21: Econometrics IV: Non-Experimental Methods

- Instrumental Variables
- Regression discontinuity
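
With a binary instrument, the instrumental-variables estimate reduces to the Wald ratio: the reduced-form difference in mean outcomes divided by the first-stage difference in mean treatment. A sketch on toy data (illustrative only):

```python
def wald_iv(y, x, z):
    """IV estimate with a binary instrument z: the ratio of the
    reduced-form to the first-stage differences in means."""
    mean = lambda v: sum(v) / len(v)
    y1 = [yi for yi, zi in zip(y, z) if zi == 1]
    y0 = [yi for yi, zi in zip(y, z) if zi == 0]
    x1 = [xi for xi, zi in zip(x, z) if zi == 1]
    x0 = [xi for xi, zi in zip(x, z) if zi == 0]
    return (mean(y1) - mean(y0)) / (mean(x1) - mean(x0))
```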

##### Optional Readings

- Chapters 6 and 7 of Khandker (2010) Handbook on Impact Evaluation
- Varian, Hal R. 2014. "Big Data: New Tricks for Econometrics" Journal of Economic Perspectives, 28(2): 3-28.
- Buddelmeyer & Skoufias (2004). An Evaluation of the Performance of Regression Discontinuity Design on PROGRESA.
- Duflo (2001). Schooling and Labor Market Consequences of School Construction in Indonesia: Evidence from an Unusual Policy Experiment

## Supervised Learning

### January 26: Intro to Machine Learning

- Training and test data
- Introduction to Machine Learning
- Supervised vs. Unsupervised Learning
- Key Issues in (Supervised) Machine Learning
- Philosophical Interlude

##### Required Readings

- P. Domingos, “A Few Useful Things to Know about Machine Learning.” Communications of the ACM, 55 (10), 78-87, 2012.

### January 26 (part 2!): Design of Machine Learning Experiments

- Training and test data
- Cross-validation and bootstrapping
- Evaluation and baselines
- Generalization and overfitting
- Features and feature selection
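
K-fold cross-validation partitions the data into k folds and holds each one out in turn for evaluation. A sketch of the index bookkeeping (illustrative; real data should be shuffled before splitting):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k nearly equal contiguous folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    return folds

folds = kfold_indices(10, 3)  # fold sizes are 4, 3, 3
```

Each fold serves once as the test set while the remaining folds form the training set.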

##### Optional Readings

- Chapter 1 of Daume (in preparation). A course in machine learning
- Chapter 5 of Witten, Frank, Hall: Data Mining

### January 28: Nearest Neighbors

- Instance-based learning
- Nearest neighbors
- Curse of dimensionality
- Locally-weighted regression
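
Nearest-neighbor classification needs only a distance function and a vote; a minimal sketch using Euclidean distance and a majority vote among the k closest training points (data and names are illustrative):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs; predicts the majority
    label among the k training points nearest to `query`."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"),
         ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
label = knn_predict(train, (5.5, 5.5), k=3)
```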

##### Required Readings

- Chapter 4 (especially section 4.7) of Witten, Frank, Hall: Data Mining
- Chapter 2 of Daume (in preparation). A course in machine learning

##### Optional Readings

- Chapter 13 (sections 13.1 - 13.3) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (10th edition)
- Chapter 6 (sections 6.1 - 6.6) on Kernel Estimation in Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (10th edition)
- Chapter 6 of Provost & Fawcett: Data Science for Business

### February 2: Gradient Descent

- Cost functions
- Gradient descent
- Convexity
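
Gradient descent repeatedly steps against the gradient of the cost function. A one-parameter least-squares sketch, where J(w) = (1/2n)·Σ(w·xᵢ − yᵢ)² has gradient (1/n)·Σ(w·xᵢ − yᵢ)·xᵢ (the data and learning rate are illustrative):

```python
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # generated by y = 2x, so the optimum is w = 2

w, lr = 0.0, 0.1
for _ in range(200):
    # gradient of the mean squared error with respect to w
    grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad     # step downhill
```

Because the cost is convex in w, the iterates converge to the unique minimizer.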

##### Required Readings

- Chapter 6 of Daume (in preparation). A course in machine learning
- Reread section 4.6 of Witten, Frank, Hall: Data Mining

##### Optional Readings

- Chapter 5 of Schutt & O’Neill (2013): Doing Data Science
- Chapter 3 (sections 3.1 and 3.2), Chapter 4 (especially section 4.4), and Chapter 6 (sections 6.1-6.3) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (10th edition)
- Zumel and Mount, Chapter 7.

### February 4: Regularization and linear models

- Regularization
- Ridge and LASSO
- Logistic regression
- Support vector machines
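
Regularization adds a penalty that shrinks coefficients toward zero. In the single-coefficient case, ridge regression has a closed form that makes the shrinkage explicit. A sketch (illustrative, not a library implementation):

```python
def ridge_1d(xs, ys, lam):
    """Closed-form ridge solution for one coefficient (no intercept):
    minimizing sum (w*x - y)^2 + lam * w^2 gives
    w = sum(x*y) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
w_ols = ridge_1d(xs, ys, lam=0.0)     # lam = 0 recovers least squares
w_ridge = ridge_1d(xs, ys, lam=14.0)  # larger lam shrinks w toward zero
```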

##### Required Readings

- Chapter 9 of Daume (in preparation). A course in machine learning

##### Optional Readings

- Chapter 3 (especial 3.4) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (10th edition)

### February 9: Naive Bayes

- Probability review: Bayes rule, independence, distributions
- Generative models and Naive Bayes
- Maximum likelihood estimation and smoothing
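
Naive Bayes combines class priors with per-word likelihoods under a conditional-independence assumption. A minimal sketch with Laplace (add-one) smoothing on a toy spam example (all data and names are illustrative):

```python
from collections import Counter
from math import log

def train_nb(docs):
    """docs: list of (word_list, label). Returns log-priors and
    Laplace-smoothed per-word log-likelihoods for each label."""
    labels = Counter(label for _, label in docs)
    vocab = {w for doc, _ in docs for w in doc}
    counts = {lab: Counter() for lab in labels}
    for doc, lab in docs:
        counts[lab].update(doc)
    priors = {lab: log(n / len(docs)) for lab, n in labels.items()}
    likelihood = {lab: {w: log((counts[lab][w] + 1) /
                               (sum(counts[lab].values()) + len(vocab)))
                        for w in vocab}
                  for lab in labels}
    return priors, likelihood

def predict_nb(priors, likelihood, doc):
    """Pick the label maximizing log-prior plus summed log-likelihoods."""
    scores = {lab: priors[lab] + sum(likelihood[lab].get(w, 0.0) for w in doc)
              for lab in priors}
    return max(scores, key=scores.get)

docs = [(["free", "money"], "spam"), (["free", "offer"], "spam"),
        (["meeting", "notes"], "ham"), (["project", "notes"], "ham")]
priors, likelihood = train_nb(docs)
pred = predict_nb(priors, likelihood, ["free", "offer"])
```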

##### Required Readings

- Chapter 4 of Schutt & O’Neill (2013): Doing Data Science.
- Reread section 4.2 of Witten, Frank, Hall: Data Mining
- Michael Collins's lecture notes on Naïve Bayes (especially pp. 1-4)

##### Optional Readings

- Paul Graham (2002) “Better Bayesian Filtering”. http://www.paulgraham.com/better.html
- Kevin Murphy's example of Bayes' Rule for medical diagnosis: http://www.cs.ubc.ca/~murphyk/Bayes/bayesrule.html
- Kanich, Chris, et al. (2008) "Spamalytics: An empirical analysis of spam marketing conversion." ACM conference on Computer and communications security

### February 11: Trees and Forests

- Decision trees
- Adaboost: combining decision stumps
- Random forests and combining classifiers

##### Required Readings

- Chapter 4 of Schutt & O'Neill (2013): Doing Data Science
- Reread section 4.2 of Witten, Frank, Hall: Data Mining
- Michael Collins's lecture notes on Naive Bayes (especially pp. 1-4)

##### Optional Readings

- Paul Graham (2002) "Better Bayesian Filtering". http://www.paulgraham.com/better.html
- Kanich, Chris, et al. (2008) "Spamalytics: An empirical analysis of spam marketing conversion" ACM conference on Computer and communications security.
- Kevin Murphy's example of Bayes' Rule for medical diagnosis

### February 16: Neural Networks

- Perceptrons
- Biological origins
- Model representation
- Cost functions
- Backpropagation
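
The perceptron, the simplest neural unit, updates its weights only on misclassified examples. A sketch on a small linearly separable (AND-style) dataset (names and data are illustrative):

```python
def train_perceptron(data, epochs=20, lr=1.0):
    """data: list of (features, label) with label in {-1, +1}.
    Applies the classic perceptron update on each mistake."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified: nudge toward y
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# AND-like data: positive only when both inputs are 1
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = train_perceptron(data)
predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```

Because the data are linearly separable, the perceptron convergence theorem guarantees the loop finds a separating hyperplane.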

##### Required Readings

- Chapter 8 of Daume (in preparation). A course in machine learning

##### Optional Readings

- Chapter 11 (sections 11.1-11.3) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (10th edition)
- Schmidhuber (2014). “Deep Learning in Neural Networks: An Overview”. Technical Report
- Egmont-Petersen et al. (2002). “Image processing with neural networks--a review” Pattern recognition.

### February 23: Supervised Learning Wrap-Up

##### Required Readings

- Wu et al (2008) “Top 10 Algorithms in Data Mining”

## Unsupervised Learning

### February 25: Dimensionality Reduction

- Dimensionality Reduction
- Principal Component Analysis
- Case study: Eigenfaces
- Other methods for dimensionality reduction: SVD, NNMF, LDA
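
Principal component analysis can be computed by eigendecomposition of the covariance matrix of the centered data. A NumPy sketch (illustrative; for large matrices the SVD route mentioned above is preferred):

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigendecomposition of the covariance matrix.
    Returns the projected data and the components (one per row)."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order].T
    return Xc @ components.T, components

# Points lying near the line y = x: the first component should point
# (up to sign) along the direction (1, 1) / sqrt(2)
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]])
Z, components = pca(X, n_components=1)
```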

##### Required Readings

- Chapter 7 (section 7.4) of Witten, Frank, Hall: Data Mining
- Chapter 11 (sections 11.1 – 11.3) of Rajaraman et al: Mining of Massive Datasets

##### Optional Readings

- Watch Pedro Domingos talk about the curse of dimensionality: https://class.coursera.org/machlearning-001/lecture/215 (segment 4 of week 4)
- Chapter 14 (sections 14.2, 14.5 - 14.10) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (10th edition)
- Turk & Pentland (1991) "Eigenfaces for Recognition"

### March 1: Cluster Analysis

- Introduction to unsupervised learning
- Distance metrics
- K-Means clustering
- Hierarchical clustering
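
K-means (Lloyd's algorithm) alternates between assigning each point to its nearest center and moving each center to the mean of its cluster. A sketch on two well-separated 2-D blobs (data and names are illustrative):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means for 2-D points; returns the final centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize from the data
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[d.index(min(d))].append(p)
        # update step: move each center to its cluster mean
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = kmeans(points, k=2)  # one center lands in each blob
```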

##### Required Readings

- Chapter 13 of Daume (in preparation). A course in machine learning
- Chapter 7 [Read 7.1, 7.2, 7.3] of: Mining of Massive Datasets. Online at http://infolab.stanford.edu/~ullman/mmds/book.pdf

##### Optional Readings

- Chapter 6 of Provost & Fawcett: Data Science for Business

## Applications and Implementation

### March 3: Recommender Systems

- The Netflix challenge
- Content-based methods
- Learning features and parameters
- Nearest-neighbor collaborative filtering
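
Nearest-neighbor collaborative filtering scores an unrated item by a similarity-weighted average of other users' ratings. A toy sketch using cosine similarity, with zeros marking unrated items (a simplification; real systems handle missing ratings more carefully):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# rows = users, columns = items; predict user 0's rating for item 2
ratings = [
    [5, 4, 0],  # target user: item 2 unrated
    [5, 4, 3],  # very similar taste
    [1, 1, 5],  # very different taste
]
target = ratings[0]
sims = [cosine(target, r) for r in ratings[1:]]
# similarity-weighted average of the neighbors' ratings for item 2
pred = sum(s * r[2] for s, r in zip(sims, ratings[1:])) / sum(sims)
```

The similar user pulls the prediction toward 3; the dissimilar user's rating of 5 carries much less weight.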

##### Required Readings

- Chapter 8 of Schutt & O’Neill (2013): Doing Data Science

##### Optional Readings

- Chapter 9 of Rajaraman et al.: Mining of Massive Datasets
- Yehuda Koren (2009) “The BellKor Solution to the Netflix Grand Prize”
- Resnick et al (1994) “GroupLens: an open architecture for collaborative filtering of netnews”, CSCW ’94, pp. 175-186
- RM Bell, Y Koren (2007) “Lessons from the Netflix prize challenge”, ACM SIGKDD Explorations Newsletter

### March 8: Special Topics

- Scaling
- Map-Reduce
- The Hadoop ecosystem
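
The canonical MapReduce example is word count: the map phase emits (word, 1) pairs, the shuffle groups pairs by key, and the reduce phase sums each group. A single-process sketch of the dataflow (illustrative; a real job distributes each phase across machines):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped counts for each word."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle(pairs))
```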

##### Optional Readings

- Chapter 2 of Leskovec et al (2014) Mining of Massive Datasets (freely available online)
- Dean and Ghemawat, “MapReduce: A Flexible Data Processing Tool”, Communications of the ACM. January 2010.
- Hiemstra and Hauff, “MapReduce for Information Retrieval: Let’s Quickly Test This on 12 TB of Data”, In: M. Agosti et al. (Eds.): CLEF2010, pp. 64-69, 2010