## Note: This is archived material from January 2016.

# INFX 574: Advanced methods in Data Science

**Winter 2016**

**University of Washington School of Information**

**Lectures: Tuesday and Thursday 3:30-5:20, BLD 070**


Instructor: Prof. Joshua Blumenstock, Suite 370 Mary Gates Hall

# Course Description

Provides a theoretical and practical introduction to modern techniques for the analysis of large-scale, heterogeneous data. Covers key concepts in inferential statistics, supervised and unsupervised machine learning, and network analysis. Students will learn functional, procedural, and statistical programming techniques for working with real-world data.

# Prerequisites

INFX 573 or permission of instructor. Students are expected to be comfortable programming in R, and to have mastery of one higher-level programming language such as Python, PHP, Java, C++, etc.

# Course Outline

## (Re-)Introduction to Data Science

#### January 5: Python for Data Science

#### January 7: Introduction to data science

## Causal Inference

#### January 12: Econometrics, Part I: Experimental Methods

#### January 14: Econometrics, Part II: Regression and Impact Evaluation

#### January 19: Econometrics, Part III: Heterogeneity and Fixed Effects

#### January 21: Econometrics, Part IV: Non-Experimental Methods

## Supervised Learning

#### January 26: Intro to Machine Learning

#### January 26 (part 2!): Design of Machine Learning Experiments

#### January 28: Nearest Neighbors

#### February 2: Gradient Descent

#### February 4: Regularization and linear models

#### February 9: Naive Bayes

#### February 11: Trees and Forests

#### February 16: Neural Networks

#### February 23: Supervised Learning Wrap-Up

## Unsupervised Learning

#### February 25: Dimensionality Reduction

#### March 1: Cluster Analysis

## Applications and Implementation

#### March 3: Recommender Systems

#### March 8: Special Topics

#### March 10: Wrap-up

# Grading

Course grades will be based primarily on problem sets, quizzes, and overall classroom participation. Details on each of these assignments will be provided on Canvas. Each component will be weighted as follows:

- Problem Sets: 93%
- Participation, quizzes, and mini-assignments: 3%

# Detailed Syllabus

## (Re-)Introduction to Data Science

### January 5: Python for Data Science

- Crash course in Python
- Pandas, Numpy, SciPy
- IPython and IPython Notebook

##### Required Readings:

- Chapters 3-5, and Chapter 9 of McKinney (2013): Python for Data Analysis. O’Reilly Media, Inc.
- Install python, IPython, and the numerical analysis libraries on your laptop and bring it to class. See course announcement for details.
- Read and complete at least the "Introduction" to the following Python tutorial: http://interactivepython.org/courselib/static/pythonds/index.html
- Watch 10-minute tour of pandas
- Strongly recommended: Read and complete lessons 1-7 of Learn Pandas

### January 7: Introduction to data science

- What is Data Science?
- Nuts and bolts of the class: structure, homework, policies, learning objectives
- Correlation and Causation
- Counterfactuals and Control Groups

##### Required Readings

- Introduction (pp. 263-269) of: Bertrand, M.; Karlan, D.; Mullainathan, S.; Shafir, E.; Zinman, J. (2010) “What's advertising content worth? Evidence from a consumer credit marketing field experiment” Quarterly Journal of Economics, 125(1), pp. 263-306
- Anderson & Simester (2011). “A Step-By-Step Guide to Smart Business Experiments”, Harvard Business Review, pp. 99-105
- Pages 1-19 of: E. Duflo, M. Kremer and R. Glennerster (2006). "Using Randomization in Development Economics Research: A Toolkit"

##### Optional Readings

- Andrew Gelman: “There are four ways to get fired from Caesars”

## Causal Inference

### January 12: Econometrics, Part I: Experimental Methods

- A-B testing, Business Experiments, Randomized Control Trials
- Experimental design and statistical power
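
As a preview of the power calculations covered in this session, here is a minimal sketch of the standard sample-size formula for a two-arm experiment, n ≈ 2(z for significance + z for power)² · σ²/δ² per arm. The function name and numbers are illustrative, not from the readings:

```python
from statistics import NormalDist

def sample_size_per_arm(effect, sd, alpha=0.05, power=0.80):
    """Approximate subjects per arm to detect a difference in means
    of `effect` with a two-sided test at the given alpha and power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    return 2 * ((z_alpha + z_beta) * sd / effect) ** 2

# Detecting a 0.2-standard-deviation effect takes roughly 393 subjects per arm
n = sample_size_per_arm(effect=0.2, sd=1.0)
```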

##### Required Readings

- Pages 19-47 of E. Duflo, M. Kremer and R. Glennerster (2006). "Using Randomization in Development Economics Research: A Toolkit"

##### Optional Readings

- List (2011). "Why Economists Should Conduct Field Experiments and 14 Tips for Pulling One Off." Journal of Economic Perspectives.
- Ariely (2004). “Why Businesses Don’t Experiment”, Harvard Business Review, p. 34
- Kohavi, R., Longbotham, R., Sommerfield, D. and Henne, R. Controlled experiments on the Web: Survey and practical guide. Data Mining and Knowledge Discovery 18 (2009), 140–181.
- Smith & Pell (2003). Parachute use to prevent death and major trauma related to gravitational challenge: Systematic review of randomised controlled trials.

### January 14: Econometrics Part II: Regression and Impact Evaluation

- Regression
- Impact Evaluation

##### Required Readings

- Sections 1-3 of Schultz: School subsidies for the poor
- David Albouy: Lecture notes on Differences in Differences Estimation

##### Optional Readings

- Chapters 2, 3, and 5 of Khandker (2010) Handbook on Impact Evaluation

### January 19: Econometrics III: Heterogeneity and Fixed Effects

- Interactions
- Difference in difference
- Fixed and Random effects models
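
The difference-in-differences idea reduces to four group means: the change in the treated group minus the change in the control group. A minimal sketch with hypothetical numbers:

```python
def did(treat_pre, treat_post, control_pre, control_post):
    """Difference-in-differences from four group means:
    (treated change) minus (control change)."""
    return (treat_post - treat_pre) - (control_post - control_pre)

# Treated gained 5, control gained 2, so the DiD estimate is 3
effect = did(treat_pre=10.0, treat_post=15.0, control_pre=9.0, control_post=11.0)
```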

##### Required Readings

- Lecture notes on Fixed Effects models

##### Optional Readings

- Chapters 6 and 7 of Khandker (2010) Handbook on Impact Evaluation

### January 21: Econometrics IV: Non-Experimental Methods

- Instrumental Variables
- Regression discontinuity
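
With a binary instrument, the instrumental-variables estimate reduces to the Wald ratio: the reduced-form difference in mean outcomes divided by the first-stage difference in mean treatment. A sketch on toy data (illustrative only):

```python
def wald_iv(y, x, z):
    """IV estimate with a binary instrument z: the ratio of the
    reduced-form to the first-stage differences in means."""
    mean = lambda v: sum(v) / len(v)
    y1 = [yi for yi, zi in zip(y, z) if zi == 1]
    y0 = [yi for yi, zi in zip(y, z) if zi == 0]
    x1 = [xi for xi, zi in zip(x, z) if zi == 1]
    x0 = [xi for xi, zi in zip(x, z) if zi == 0]
    return (mean(y1) - mean(y0)) / (mean(x1) - mean(x0))
```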

##### Optional Readings

- Chapters 6 and 7 of Khandker (2010) Handbook on Impact Evaluation
- Varian, Hal R. 2014. "Big Data: New Tricks for Econometrics" Journal of Economic Perspectives, 28(2): 3-28.
- Buddelmeyer & Skoufias (2004). An Evaluation of the Performance of Regression Discontinuity Design on PROGRESA.
- Duflo (2001). Schooling and Labor Market Consequences of School Construction in Indonesia: Evidence from an Unusual Policy Experiment

## Supervised Learning

### January 26: Intro to Machine Learning

- Training and test data
- Introduction to Machine Learning
- Supervised vs. Unsupervised Learning
- Key Issues in (Supervised) Machine Learning
- Philosophical Interlude

##### Required Readings

- P. Domingos, “A Few Useful Things to Know about Machine Learning.” Communications of the ACM, 55 (10), 78-87, 2012.

### January 26 (part 2!): Design of Machine Learning Experiments

- Training and test data
- Cross-validation and bootstrapping
- Evaluation and baselines
- Generalization and overfitting
- Features and feature selection
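
K-fold cross-validation partitions the data into k folds and holds each one out in turn for evaluation. A sketch of the index bookkeeping (illustrative; real data should be shuffled before splitting):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k nearly equal contiguous folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    return folds

folds = kfold_indices(10, 3)  # fold sizes are 4, 3, 3
```

Each fold serves once as the test set while the remaining folds form the training set.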

##### Optional Readings

- Chapter 1 of Daume (in preparation). A course in machine learning
- Chapter 5 of Witten, Frank, Hall: Data Mining

### January 28: Nearest Neighbors

- Instance-based learning
- Nearest neighbors
- Curse of dimensionality
- Locally-weighted regression
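
Nearest-neighbor classification needs only a distance function and a vote; a minimal sketch using Euclidean distance and a majority vote among the k closest training points (data and names are illustrative):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs; predicts the majority
    label among the k training points nearest to `query`."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"),
         ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
label = knn_predict(train, (5.5, 5.5), k=3)
```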

##### Required Readings

- Chapter 4 (especially section 4.7) of Witten, Frank, Hall: Data Mining
- Chapter 2 of Daume (in preparation). A course in machine learning

##### Optional Readings

- Chapter 13 (sections 13.1 - 13.3) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (10th edition)
- Chapter 6 (sections 6.1 - 6.6) on Kernel Estimation in Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (10th edition)
- Chapter 6 of Provost & Fawcett: Data Science for Business

### February 2: Gradient Descent

- Cost functions
- Gradient descent
- Convexity
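
Gradient descent repeatedly steps against the gradient of the cost function. A one-parameter least-squares sketch, where J(w) = (1/2n)·Σ(w·xᵢ − yᵢ)² has gradient (1/n)·Σ(w·xᵢ − yᵢ)·xᵢ (the data and learning rate are illustrative):

```python
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # generated by y = 2x, so the optimum is w = 2

w, lr = 0.0, 0.1
for _ in range(200):
    # gradient of the mean squared error with respect to w
    grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad     # step downhill
```

Because the cost is convex in w, the iterates converge to the unique minimizer.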

##### Required Readings

- Chapter 6 of Daume (in preparation). A course in machine learning
- Reread section 4.6 of Witten, Frank, Hall: Data Mining

##### Optional Readings

- Chapter 5 of Schutt & O’Neill (2013): Doing Data Science
- Chapter 3 (sections 3.1 and 3.2), Chapter 4 (especially section 4.4), and Chapter 6 (sections 6.1-6.3) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (10th edition)
- Zumel and Mount, Chapter 7.

### February 4: Regularization and linear models

- Regularization
- Ridge and LASSO
- Logistic regression
- Support vector machines
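
Regularization adds a penalty that shrinks coefficients toward zero. In the single-coefficient case, ridge regression has a closed form that makes the shrinkage explicit. A sketch (illustrative, not a library implementation):

```python
def ridge_1d(xs, ys, lam):
    """Closed-form ridge solution for one coefficient (no intercept):
    minimizing sum (w*x - y)^2 + lam * w^2 gives
    w = sum(x*y) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
w_ols = ridge_1d(xs, ys, lam=0.0)     # lam = 0 recovers least squares
w_ridge = ridge_1d(xs, ys, lam=14.0)  # larger lam shrinks w toward zero
```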

##### Required Readings

- Chapter 9 of Daume (in preparation). A course in machine learning

##### Optional Readings

- Chapter 3 (especial 3.4) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (10th edition)

### February 9: Naive Bayes

- Probability review: Bayes rule, independence, distributions
- Generative models and Naive Bayes
- Maximum likelihood estimation and smoothing
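
Naive Bayes combines class priors with per-word likelihoods under a conditional-independence assumption. A minimal sketch with Laplace (add-one) smoothing on a toy spam example (all data and names are illustrative):

```python
from collections import Counter
from math import log

def train_nb(docs):
    """docs: list of (word_list, label). Returns log-priors and
    Laplace-smoothed per-word log-likelihoods for each label."""
    labels = Counter(label for _, label in docs)
    vocab = {w for doc, _ in docs for w in doc}
    counts = {lab: Counter() for lab in labels}
    for doc, lab in docs:
        counts[lab].update(doc)
    priors = {lab: log(n / len(docs)) for lab, n in labels.items()}
    likelihood = {lab: {w: log((counts[lab][w] + 1) /
                               (sum(counts[lab].values()) + len(vocab)))
                        for w in vocab}
                  for lab in labels}
    return priors, likelihood

def predict_nb(priors, likelihood, doc):
    """Pick the label maximizing log-prior plus summed log-likelihoods."""
    scores = {lab: priors[lab] + sum(likelihood[lab].get(w, 0.0) for w in doc)
              for lab in priors}
    return max(scores, key=scores.get)

docs = [(["free", "money"], "spam"), (["free", "offer"], "spam"),
        (["meeting", "notes"], "ham"), (["project", "notes"], "ham")]
priors, likelihood = train_nb(docs)
pred = predict_nb(priors, likelihood, ["free", "offer"])
```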

##### Required Readings

- Chapter 4 of Schutt & O’Neill (2013): Doing Data Science.
- Reread section 4.2 of Witten, Frank, Hall: Data Mining
- Michael Collins's lecture notes on Naïve Bayes (especially pp. 1-4)

##### Optional Readings

- Paul Graham (2002) “Better Bayesian Filtering”. http://www.paulgraham.com/better.html
- Kevin Murphy's example of Bayes' Rule for medical diagnosis: http://www.cs.ubc.ca/~murphyk/Bayes/bayesrule.html
- Kanich, Chris, et al. (2008) "Spamalytics: An empirical analysis of spam marketing conversion." ACM conference on Computer and communications security

### February 11: Trees and Forests

- Decision trees
- Adaboost: combining decision stumps
- Random forests and combining classifiers

##### Required Readings

- Chapter 4 of Schutt & O'Neill (2013): Doing Data Science
- Reread section 4.2 of Witten, Frank, Hall: Data Mining
- Michael Collins's lecture notes on Naive Bayes (especially pp. 1-4)

##### Optional Readings

- Paul Graham (2002) "Better Bayesian Filtering". http://www.paulgraham.com/better.html
- Kanich, Chris, et al. (2008) "Spamalytics: An empirical analysis of spam marketing conversion" ACM conference on Computer and communications security.
- Kevin Murphy's example of Bayes' Rule for medical diagnosis

### February 16: Neural Networks

- Perceptrons
- Biological origins
- Model representation
- Cost functions
- Backpropagation
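
The perceptron, the simplest neural unit, updates its weights only on misclassified examples. A sketch on a small linearly separable (AND-style) dataset (names and data are illustrative):

```python
def train_perceptron(data, epochs=20, lr=1.0):
    """data: list of (features, label) with label in {-1, +1}.
    Applies the classic perceptron update on each mistake."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified: nudge toward y
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# AND-like data: positive only when both inputs are 1
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = train_perceptron(data)
predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```

Because the data are linearly separable, the perceptron convergence theorem guarantees the loop finds a separating hyperplane.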

##### Required Readings

- Chapter 8 of Daume (in preparation). A course in machine learning

##### Optional Readings

- Chapter 11 (sections 11.1-11.3) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (10th edition)
- Schmidhuber (2014). “Deep Learning in Neural Networks: An Overview”. Technical Report
- Egmont-Petersen et al. (2002). “Image processing with neural networks--a review” Pattern recognition.

### February 23: Supervised Learning Wrap-Up

##### Required Readings

- Wu et al (2008) “Top 10 Algorithms in Data Mining”

## Unsupervised Learning

### February 25: Dimensionality Reduction

- Dimensionality Reduction
- Principal Component Analysis
- Case study: Eigenfaces
- Other methods for dimensionality reduction: SVD, NNMF, LDA
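
Principal component analysis can be computed by eigendecomposition of the covariance matrix of the centered data. A NumPy sketch (illustrative; for large matrices the SVD route mentioned above is preferred):

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigendecomposition of the covariance matrix.
    Returns the projected data and the components (one per row)."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order].T
    return Xc @ components.T, components

# Points lying near the line y = x: the first component should point
# (up to sign) along the direction (1, 1) / sqrt(2)
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]])
Z, components = pca(X, n_components=1)
```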

##### Required Readings

- Chapter 7 (section 7.4) of Witten, Frank, Hall: Data Mining
- Chapter 11 (sections 11.1 – 11.3) of Rajaraman et al: Mining of Massive Datasets

##### Optional Readings

- Watch Pedro Domingos talk about the curse of dimensionality: https://class.coursera.org/machlearning-001/lecture/215 (segment 4 of week 4)
- Chapter 14 (sections 14.2, 14.5 - 14.10) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (10th edition)
- Turk & Pentland (1991) "Eigenfaces for Recognition"

### March 1: Cluster Analysis

- Introduction to unsupervised learning
- Distance metrics
- K-Means clustering
- Hierarchical clustering
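
K-means (Lloyd's algorithm) alternates between assigning each point to its nearest center and moving each center to the mean of its cluster. A sketch on two well-separated 2-D blobs (data and names are illustrative):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means for 2-D points; returns the final centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize from the data
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[d.index(min(d))].append(p)
        # update step: move each center to its cluster mean
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = kmeans(points, k=2)  # one center lands in each blob
```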

##### Required Readings

- Chapter 13 of Daume (in preparation). A course in machine learning
- Chapter 7 [Read 7.1, 7.2, 7.3] of: Mining of Massive Datasets. Online at http://infolab.stanford.edu/~ullman/mmds/book.pdf

##### Optional Readings

- Chapter 6 of Provost & Fawcett: Data Science for Business

## Applications and Implementation

### March 3: Recommender Systems

- The Netflix challenge
- Content-based methods
- Learning features and parameters
- Nearest-neighbor collaborative filtering
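
Nearest-neighbor collaborative filtering scores an unrated item by a similarity-weighted average of other users' ratings. A toy sketch using cosine similarity, with zeros marking unrated items (a simplification; real systems handle missing ratings more carefully):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# rows = users, columns = items; predict user 0's rating for item 2
ratings = [
    [5, 4, 0],  # target user: item 2 unrated
    [5, 4, 3],  # very similar taste
    [1, 1, 5],  # very different taste
]
target = ratings[0]
sims = [cosine(target, r) for r in ratings[1:]]
# similarity-weighted average of the neighbors' ratings for item 2
pred = sum(s * r[2] for s, r in zip(sims, ratings[1:])) / sum(sims)
```

The similar user pulls the prediction toward 3; the dissimilar user's rating of 5 carries much less weight.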

##### Required Readings

- Chapter 8 of Schutt & O’Neill (2013): Doing Data Science

##### Optional Readings

- Chapter 9 of Rajaraman et al.: Mining of Massive Datasets
- Yehuda Koren (2009) “The BellKor Solution to the Netflix Grand Prize”
- Resnick et al (1994) “GroupLens: an open architecture for collaborative filtering of netnews”, CSCW ’94, pp. 175-186
- RM Bell, Y Koren (2007) “Lessons from the Netflix prize challenge”, ACM SIGKDD Explorations Newsletter

### March 8: Special Topics

- Scaling
- Map-Reduce
- The Hadoop ecosystem
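
The canonical MapReduce example is word count: the map phase emits (word, 1) pairs, the shuffle groups pairs by key, and the reduce phase sums each group. A single-process sketch of the dataflow (illustrative; a real job distributes each phase across machines):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped counts for each word."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle(pairs))
```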

##### Optional Readings

- Chapter 2 of Leskovec et al (2014) Mining of Massive Datasets (freely available online)
- Dean and Ghemawat, “MapReduce: A Flexible Data Processing Tool”, Communications of the ACM. January 2010.
- Hiemstra and Hauff, “MapReduce for Information Retrieval: Let’s Quickly Test This on 12 TB of Data”, In: M. Agosti et al. (Eds.): CLEF2010, pp. 64-69, 2010