Note: This is archived material from January 2016.

 

INFX 574: Advanced methods in Data Science

Winter 2016
University of Washington School of Information
Lectures: Tuesday and Thursday 3:30-5:20, BLD 070


Instructor: Prof. Joshua Blumenstock, Suite 370 Mary Gates Hall

Course Description

Provides a theoretical and practical introduction to modern techniques for the analysis of large-scale, heterogeneous data. Covers key concepts in inferential statistics, supervised and unsupervised machine learning, and network analysis. Students will learn functional, procedural, and statistical programming techniques for working with real-world data.

Prerequisites

INFX 573 or permission of the instructor. Students are expected to be comfortable programming in R, and to have mastery of a higher-level programming language such as Python, PHP, Java, or C++.

Course Outline

See the Detailed Syllabus below.

Grading

Course grades will be based primarily on problem sets, quizzes, and overall classroom participation. Details on each of these assignments will be provided on Canvas. Each component is weighted as follows:

  • Problem Sets: 93%
  • Participation, quizzes, and mini-assignments: 3%

Detailed Syllabus

(Re-)Introduction to Data Science

January 5: Python for Data Science

  • Crash course in Python
  • Pandas, Numpy, SciPy
  • IPython and IPython Notebook
Required Readings:
  • Chapters 3-5, and Chapter 9 of McKinney (2013): Python for Data Analysis. O’Reilly Media, Inc.
  • Install Python, IPython, and the numerical analysis libraries on your laptop, and bring it to class. See the course announcement for details.
  • Read and complete at least the "Introduction" to the following Python tutorial: http://interactivepython.org/courselib/static/pythonds/index.html
  • Watch 10-minute tour of pandas
  • Strongly recommended: Read and complete lessons 1-7 of Learn Pandas
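For a concrete preview of the pandas idioms the readings cover (vectorized column math and split-apply-combine), here is a minimal sketch; the table and numbers are invented for illustration:

```python
import pandas as pd

# Toy table of (hypothetical) survey responses.
df = pd.DataFrame({
    "section": ["A", "A", "B", "B"],
    "score":   [80, 90, 70, 100],
})

# Vectorized column arithmetic: no explicit loop over rows.
df["score_pct"] = df["score"] / 100.0

# Split-apply-combine: group rows by section, average each group.
means = df.groupby("section")["score"].mean()
```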

January 7: Introduction to data science

  • What is Data Science?
  • Nuts and bolts of the class: structure, homework, policies, learning objectives
  • Correlation and Causation
  • Counterfactuals and Control Groups
Required Readings
  • Introduction (pp. 263-269) of: Bertrand, M.; Karlan, D.; Mullainathan, S.; Shafir, E.; Zinman, J. (2010) “What's advertising content worth? Evidence from a consumer credit marketing field experiment” Quarterly Journal of Economics, 125(1), pp. 263-306
  • Anderson & Simester (2011). “A Step-By-Step Guide to Smart Business Experiments”, Harvard Business Review, pp. 99-105
  • Pages 1-19 of: E. Duflo, M. Kremer and R. Glennerster (2006). "Using Randomization in Development Economics Research: A Toolkit"

Causal Inference

January 12: Econometrics, Part I: Experimental Methods

  • A-B testing, Business Experiments, Randomized Control Trials
  • Experimental design and statistical power
Required Readings
  • Pages 19-47 of E. Duflo, M. Kremer and R. Glennerster (2006). "Using Randomization in Development Economics Research: A Toolkit"
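As a taste of statistical power, the idea can be sketched with a back-of-the-envelope simulation; the effect size, sample sizes, and trial count below are invented for illustration:

```python
import random

random.seed(0)

def simulated_power(effect, n, trials=500):
    """Fraction of simulated two-arm experiments whose difference in
    means clears a 1.96-standard-error threshold -- a crude power estimate."""
    hits = 0
    for _ in range(trials):
        control = [random.gauss(0, 1) for _ in range(n)]
        treated = [random.gauss(effect, 1) for _ in range(n)]
        diff = sum(treated) / n - sum(control) / n
        se = (2 / n) ** 0.5          # sd is 1 by construction here
        if abs(diff) > 1.96 * se:
            hits += 1
    return hits / trials

power_small = simulated_power(0.5, 20)    # small samples: low power
power_large = simulated_power(0.5, 200)   # bigger samples: high power
```

The punchline of the lecture in one comparison: the same true effect is detected far more reliably with the larger sample.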

January 14: Econometrics Part II: Regression and Impact Evaluation

  • Regression
  • Impact Evaluation
Required Readings
  • Sections 1-3 of Schultz: School subsidies for the poor
  • David Albouy: Lecture notes on Differences in Differences Estimation
Optional Readings
  • Chapters 2, 3, and 5 of Khandker (2010) Handbook on Impact Evaluation
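To make the regression mechanics concrete before the lecture, here is a minimal sketch that fits ordinary least squares by the normal equations on simulated data; the true coefficients (2 and 3) are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: y = 2 + 3x + noise.
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=200)

# OLS via the normal equations: beta = (X'X)^-1 X'y,
# with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
```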

January 19: Econometrics III: Heterogeneity and Fixed Effects

  • Interactions
  • Difference in difference
  • Fixed and Random effects models
Required Readings
  • Lecture notes on Fixed Effects models
Optional Readings
  • Chapters 6 and 7 of Khandker (2010) Handbook on Impact Evaluation
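The difference-in-differences estimator reduces to simple arithmetic on four group means; a toy example with invented numbers:

```python
# 2x2 table of group means: (group, period) -> mean outcome.
means = {
    ("treated", "pre"):  10.0,
    ("treated", "post"): 18.0,
    ("control", "pre"):   9.0,
    ("control", "post"): 12.0,
}

# Change over time in each group...
treated_change = means[("treated", "post")] - means[("treated", "pre")]
control_change = means[("control", "post")] - means[("control", "pre")]

# ...and the difference of those changes is the treatment-effect estimate:
# the control group's trend stands in for the treated group's counterfactual.
did = treated_change - control_change
```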

January 21: Econometrics IV: Non-Experimental Methods

  • Instrumental Variables
  • Regression discontinuity
Optional Readings
  • Chapters 6 and 7 of Khandker (2010) Handbook on Impact Evaluation
  • Varian, Hal R. 2014. "Big Data: New Tricks for Econometrics" Journal of Economic Perspectives, 28(2): 3-28.
  • Buddelmeyer & Skoufias (2004). An Evaluation of the Performance of Regression Discontinuity Design on PROGRESA.
  • Duflo (2001). Schooling and Labor Market Consequences of School Construction in Indonesia: Evidence from an Unusual Policy Experiment
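A simulated sketch of why instrumental variables help when an unobserved confounder biases OLS; the data-generating process below is invented, with a true causal effect of 2:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# z is a valid instrument (affects x, unrelated to the confounder u);
# u shifts both x and y, so naive OLS on x is biased.
z = rng.normal(size=n)
u = rng.normal(size=n)
x = z + u + rng.normal(size=n)
y = 2.0 * x + 3.0 * u + rng.normal(size=n)

# Simple IV (Wald) estimator: cov(z, y) / cov(z, x).
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

# Naive OLS slope for comparison, biased upward by u.
ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
```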

Supervised Learning

January 26: Intro to Machine Learning

  • Training and test data
  • Introduction to Machine Learning
  • Supervised vs. Unsupervised Learning
  • Key Issues in (Supervised) Machine Learning
  • Philosophical Interlude
Required Readings
  • P. Domingos, “A Few Useful Things to Know about Machine Learning.” Communications of the ACM, 55 (10), 78-87, 2012.

January 26 (part 2!): Design of Machine Learning Experiments

  • Training and test data
  • Cross-validation and bootstrapping
  • Evaluation and baselines
  • Generalization and overfitting
  • Features and feature selection
Optional Readings
  • Chapter 1 of Daume (in preparation). A course in machine learning
  • Chapter 5 of Witten, Frank, Hall: Data Mining
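The train/test bookkeeping behind k-fold cross-validation can be sketched in a few lines; the fold count and dataset size below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def kfold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation:
    shuffle once, cut into k folds, hold out each fold in turn."""
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

splits = list(kfold_indices(10, 5))
```

Every example appears in exactly one test fold, so each model evaluation uses data the model never trained on.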

January 28: Nearest Neighbors

  • Instance-based learning
  • Nearest neighbors
  • Curse of dimensionality
  • Locally-weighted regression
Required Readings
  • Chapter 4 (especially section 4.7) of Witten, Frank, Hall: Data Mining
  • Chapter 2 of Daume (in preparation). A course in machine learning
Optional Readings
  • Chapter 13 (sections 13.1-13.3) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)
  • Chapter 6 (sections 6.1-6.6) on Kernel Estimation in Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)
  • Chapter 6 of Provost & Fawcett: Data Science for Business
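A minimal nearest-neighbor classifier, as a preview of the instance-based approach; the 2-D points and labels are invented:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training
    points under Euclidean distance."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every example
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]              # majority vote

# Two well-separated toy clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
```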

February 2: Gradient Descent

  • Cost functions
  • Gradient descent
  • Convexity
Required Readings
  • Chapter 6 of Daume (in preparation). A course in machine learning
  • Reread section 4.6 of Witten, Frank, Hall: Data Mining
Optional Readings
  • Chapter 5 of Schutt & O'Neil (2013): Doing Data Science
  • Chapter 3 (sections 3.1-3.2), Chapter 4 (especially section 4.4), and Chapter 6 (sections 6.1-6.3) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)
  • Zumel and Mount, Chapter 7.
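Gradient descent fits in a handful of lines once the gradient is written down; here it minimizes the one-dimensional convex cost f(w) = (w - 4)^2, invented for illustration:

```python
# Gradient of f(w) = (w - 4)^2 is f'(w) = 2(w - 4).
w = 0.0          # initial guess
lr = 0.1         # learning rate (step size)
for _ in range(100):
    grad = 2 * (w - 4.0)
    w -= lr * grad          # step downhill along the negative gradient
```

Because the cost is convex, the iterates converge to the unique minimizer w = 4; with a non-convex cost the same loop can stall in a local minimum, which is exactly the issue the lecture's convexity discussion addresses.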

February 4: Regularization and linear models

  • Regularization
  • Ridge and LASSO
  • Logistic regression
  • Support vector machines
Required Readings
  • Chapter 9 of Daume (in preparation). A course in machine learning
Optional Readings
  • Chapter 3 (especially section 3.4) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)
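As a preview of regularization, a sketch of the closed-form ridge estimator; the simulated data and penalty strengths are arbitrary:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge regression: (X'X + lam*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w_small = ridge(X, y, 0.01)      # tiny penalty: close to plain OLS
w_big = ridge(X, y, 1000.0)      # heavy penalty: weights shrink toward zero
```

The L2 penalty trades a little bias for lower variance; LASSO swaps the squared penalty for an absolute-value one, which additionally drives some coefficients exactly to zero.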

February 9: Naive Bayes

  • Probability review: Bayes rule, independence, distributions
  • Generative models and Naive Bayes
  • Maximum likelihood estimation and smoothing
Required Readings
  • Chapter 4 of Schutt & O'Neil (2013): Doing Data Science.
  • Reread section 4.2 of Witten, Frank, Hall: Data Mining
  • Michael Collins's lecture notes on Naive Bayes (especially pp. 1-4)
Optional Readings
  • Paul Graham (2002) “Better Bayesian Filtering”. http://www.paulgraham.com/better.html
  • Kevin Murphy's example of Bayes' Rule for medical diagnosis: http://www.cs.ubc.ca/~murphyk/Bayes/bayesrule.html
  • Kanich, Chris, et al. (2008) "Spamalytics: An empirical analysis of spam marketing conversion." ACM conference on Computer and communications security
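The generative story behind Naive Bayes fits in a short sketch: estimate smoothed word probabilities per class, then compare log-likelihoods. The tiny "corpus" below is invented:

```python
from collections import Counter
import math

# Toy training data for a multinomial Naive Bayes spam filter.
spam = ["buy cheap pills", "cheap cheap offer"]
ham = ["meeting at noon", "lunch offer at noon"]

def train(docs):
    counts = Counter(w for d in docs for w in d.split())
    return counts, sum(counts.values())

spam_counts, spam_total = train(spam)
ham_counts, ham_total = train(ham)
vocab = set(spam_counts) | set(ham_counts)

def log_prob(msg, counts, total):
    # log P(class) + sum of log P(word | class), add-one (Laplace) smoothed
    # so unseen words get a small nonzero probability.
    lp = math.log(0.5)          # equal class priors in this toy example
    for w in msg.split():
        lp += math.log((counts[w] + 1) / (total + len(vocab)))
    return lp

def classify(msg):
    s = log_prob(msg, spam_counts, spam_total)
    h = log_prob(msg, ham_counts, ham_total)
    return "spam" if s > h else "ham"
```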

February 11: Trees and Forests

  • Decision trees
  • Adaboost: combining decision stumps
  • Random forests and combining classifiers
Required Readings
  • Chapter 4 of Schutt & O'Neil (2013): Doing Data Science
  • Reread section 4.2 of Witten, Frank, Hall: Data Mining
  • Michael Collins's lecture notes on Naive Bayes (especially pp. 1-4)
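A decision stump, the one-split weak learner that AdaBoost combines, can be fit exhaustively on a toy 1-D dataset; the points and labels below are invented:

```python
# Stump: predict 1 when x >= threshold, else 0; try every candidate
# threshold and keep the one with the fewest training errors.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 1, 1]

def best_stump(xs, ys):
    best = None
    for t in xs:
        errs = sum((x >= t) != (y == 1) for x, y in zip(xs, ys))
        if best is None or errs < best[1]:
            best = (t, errs)
    return best

threshold, errors = best_stump(xs, ys)
```

Boosting reweights the data and refits such stumps repeatedly; a random forest instead averages many full trees grown on bootstrap samples.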

February 16: Neural Networks

  • Perceptrons
  • Biological origins
  • Model representation
  • Cost functions
  • Backpropagation
Required Readings
  • Chapter 8 of Daume (in preparation). A course in machine learning
Optional Readings
  • Chapter 11 (sections 11.1-11.3) of Hastie, Tibshirani, Friedman The Elements of Statistical Learning (2nd edition)
  • Schmidhuber (2014). “Deep Learning in Neural Networks: An Overview”. Technical Report
  • Egmont-Petersen et al. (2002). “Image processing with neural networks--a review” Pattern recognition.
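The perceptron learning rule, the starting point for this lecture, fits in a dozen lines; here it learns the linearly separable AND function (a standard toy example, not from the readings):

```python
# Training data for logical AND: ((x1, x2), target).
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w = [0.0, 0.0]
b = 0.0
for _ in range(20):                      # epochs over the training set
    for (x1, x2), target in data:
        pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
        err = target - pred              # update weights only on mistakes
        w[0] += err * x1
        w[1] += err * x2
        b += err

preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else 0 for (x1, x2), _ in data]
```

Backpropagation generalizes this mistake-driven weight adjustment to multi-layer networks by propagating gradients of a cost function instead of raw errors.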

February 23: Supervised Learning Wrap-Up

Required Readings
  • Wu et al. (2008) “Top 10 Algorithms in Data Mining”

Unsupervised Learning

February 25: Dimensionality Reduction

  • Dimensionality Reduction
  • Principal Component Analysis
  • Case study: Eigenfaces
  • Other methods for dimensionality reduction: SVD, NNMF, LDA
Required Readings
  • Chapter 7 (section 7.4) of Witten, Frank, Hall: Data Mining
  • Chapter 11 (sections 11.1-11.3) of Rajaraman et al.: Mining of Massive Datasets
Optional Readings
  • Watch Pedro Domingos talk about the curse of dimensionality: https://class.coursera.org/machlearning-001/lecture/215 (segment 4 of week 4)
  • Chapter 14 (sections 14.2, 14.5-14.10) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)
  • Turk & Pentland (1991) "Eigenfaces for Recognition"
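PCA via the SVD of centered data can be previewed in a few lines; the data below are simulated so that almost all variance lies along the x-axis:

```python
import numpy as np

rng = np.random.default_rng(0)

# 500 points stretched 10x along the first axis.
X = rng.normal(size=(500, 2)) * np.array([10.0, 1.0])

# Center, then take the SVD; rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

pc1 = Vt[0]                      # first principal component (unit vector)
explained = S**2 / np.sum(S**2)  # fraction of variance per component
```

Dimensionality reduction here means keeping only the first component: projecting onto pc1 retains nearly all the variance with half the coordinates.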

March 1: Cluster Analysis

  • Introduction to unsupervised learning
  • Distance metrics
  • K-Means clustering
  • Hierarchical clustering
Required Readings
  • Chapter 13 of Daume (in preparation). A course in machine learning
  • Chapter 7 (sections 7.1-7.3) of: Mining of Massive Datasets. Online at http://infolab.stanford.edu/~ullman/mmds/book.pdf
Optional Readings
  • Chapter 6 of Provost & Fawcett: Data Science for Business
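The K-means loop (Lloyd's algorithm) alternates between two steps the lecture will unpack: assign each point to its nearest centroid, then move each centroid to the mean of its points. A minimal sketch on two invented, well-separated blobs:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: nearest-centroid assignment, re-averaging."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Distance from every point to every center, then argmin.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center; keep the old one if its cluster emptied.
        centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [9.0, 9.0], [9.1, 9.2], [9.2, 9.1]])
labels, centers = kmeans(X, 2)
```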

Applications and Implementation

March 3: Recommender Systems

  • The Netflix challenge
  • Content-based methods
  • Learning features and parameters
  • Nearest-neighbor collaborative filtering
Required Readings
  • Chapter 8 of Schutt & O'Neil (2013): Doing Data Science
Optional Readings
  • Chapter 9 of Rajaraman et al.: Mining of Massive Datasets
  • Yehuda Koren (2009) “The BellKor Solution to the Netflix Grand Prize”
  • Resnick et al (1994) “GroupLens: an open architecture for collaborative filtering of netnews”, CSCW ’94, pp. 175-186
  • RM Bell, Y Koren (2007) “Lessons from the Netflix prize challenge”, ACM SIGKDD Explorations Newsletter
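Nearest-neighbor collaborative filtering can be previewed with a toy ratings matrix (users, items, and ratings below are all invented): find the user most similar to the target by cosine similarity over shared items, and borrow their rating.

```python
import math

ratings = {
    "u1": {"item_a": 5, "item_b": 4, "item_c": 5},
    "u2": {"item_a": 1, "item_b": 2, "item_c": 1},
    "u3": {"item_a": 5, "item_b": 5},          # hasn't rated item_c yet
}

def cosine(r1, r2):
    """Cosine similarity restricted to the items both users rated."""
    shared = set(r1) & set(r2)
    num = sum(r1[i] * r2[i] for i in shared)
    den = (math.sqrt(sum(r1[i] ** 2 for i in shared))
           * math.sqrt(sum(r2[i] ** 2 for i in shared)))
    return num / den if den else 0.0

target = ratings["u3"]
neighbors = sorted(
    (u for u in ratings if u != "u3" and "item_c" in ratings[u]),
    key=lambda u: cosine(target, ratings[u]),
    reverse=True,
)
predicted = ratings[neighbors[0]]["item_c"]    # rating of the nearest neighbor
```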

March 8: Special Topics

  • Scaling
  • Map-Reduce
  • The Hadoop ecosystem
Optional Readings
  • Chapter 2 of Leskovec et al (2014) Mining of Massive Datasets (freely available online)
  • Dean and Ghemawat, “MapReduce: A Flexible Data Processing Tool”, Communications of the ACM. January 2010.
  • Hiemstra and Hauff, “MapReduce for Information Retrieval: Let’s Quickly Test This on 12 TB of Data”, In: M. Agosti et al. (Eds.): CLEF 2010, pp. 64-69, 2010
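The map/shuffle/reduce pattern the readings generalize to clusters can be previewed on a single machine with the canonical word-count example (the two "documents" are invented):

```python
from itertools import groupby
from operator import itemgetter

docs = ["the cat sat", "the cat ran"]

# Map phase: each document emits (word, 1) pairs independently.
mapped = [(w, 1) for d in docs for w in d.split()]

# Shuffle phase: bring all pairs with the same key together
# (locally, a sort by key does the job).
mapped.sort(key=itemgetter(0))

# Reduce phase: sum the values for each key.
counts = {k: sum(v for _, v in group)
          for k, group in groupby(mapped, key=itemgetter(0))}
```

Because the map and reduce steps touch each key independently, a framework like Hadoop can run them on many machines at once; the shuffle is the only phase that moves data between them.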

March 10: Wrap Up
