Note: This is archived material from June 2016.

 

INFO 371: Core Methods in Data Science

Spring 2016
University of Washington School of Information
Lectures: Monday and Wednesday 3:30-5:20, SAV 156
Labs: Monday 6:30-7:20, MGH 430


Instructor: Prof. Joshua Blumenstock, Suite 370 Mary Gates Hall

Course Description

Introduces students to modern methods in applied data science. Emphasis is given to practical applications and analysis of real-world data, through a survey of common techniques in supervised and unsupervised machine learning, and methods for experimental design and causal inference. Students will learn functional, procedural, and statistical programming techniques for working with data.

Prerequisites

INFX 370 or permission of instructor.

Grading

Course grades will be based primarily on problem sets, quizzes, and overall classroom participation. Details on each of these assignments will be provided on Canvas. Each component will be weighted as follows:

  • Problem Sets: 70%
  • Exams, quizzes, and mini-assignments: 15%
  • Lab and classroom participation: 15%

Detailed Syllabus

(Re-) Introduction to Data Science

March 28: Introduction to the course

  • Introductions
  • Nuts and bolts of the class: structure, homework, policies, learning objectives
March 28 (Lab): Crash course in Python for data science (part 1)

Causal Inference

March 30: Econometrics I: Experimental Methods

  • Correlation and Causation
  • Counterfactuals and Control Groups
  • A-B testing, Business Experiments, Randomized Control Trials
  • Experimental design and statistical power
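To make the power idea concrete, here is a minimal sketch of a sample-size calculation for a two-group experiment, assuming the statsmodels library; the effect size, significance level, and power target are illustrative choices, not values from the readings:

    # Sketch: how many subjects per arm a two-group A-B test needs to detect
    # a small effect. Assumes statsmodels; all numbers are illustrative.
    from statsmodels.stats.power import TTestIndPower

    n_per_group = TTestIndPower().solve_power(
        effect_size=0.2,   # Cohen's d for a "small" effect
        alpha=0.05,        # significance level
        power=0.8)         # desired probability of detecting the effect
    print("subjects needed per group: %.0f" % n_per_group)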
Required Readings
  • Introduction (pp. 263-269) to: Bertrand, M.; Karlan, D.; Mullainathan, S.; Shafir, E.; Zinman, J. (2010). “What's advertising content worth? Evidence from a consumer credit marketing field experiment.” Quarterly Journal of Economics, 125(1), pp. 263-306
  • Anderson & Simester (2011). “A Step-By-Step Guide to Smart Business Experiments”, Harvard Business Review, pp. 99-105
Optional Readings
  • Pages 1-47 of: Duflo, M. Kremer and R. Glennerster (2006). "Using Randomization in Development Economics Research: A Toolkit"
  • Ariely (2004). “Why Businesses Don’t Experiment”, Harvard Business Review, p. 34
  • Kohavi, R., Longbotham, R., Sommerfield, D. and Henne, R. Controlled experiments on the Web: Survey and practical guide. Data Mining and Knowledge Discovery 18 (2009), 140–181.

April 4: Crash course in Python for data science (part 2)

  • Programming paradigms
  • Working with data
  • Crash course in Python
Required Readings
  • Chapters 3-5 and Chapter 9 of McKinney (2013): Python for Data Analysis. O’Reilly Media, Inc.
  • Strongly recommended: Read and complete lessons 1-7 of Learn Pandas (https://bitbucket.org/hrojas/learn-pandas)
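As a taste of the pandas material in these readings, a minimal sketch of the basic operations (the DataFrame contents are made up for illustration):

    # Sketch: core pandas operations of the kind covered in the Learn Pandas lessons.
    # Assumes pandas; the data below is made up.
    import pandas as pd

    df = pd.DataFrame({"name": ["Ada", "Bo", "Cy"],
                       "score": [91, 78, 85],
                       "group": ["a", "b", "a"]})
    print(df.describe())                        # summary statistics
    print(df[df["score"] > 80])                 # boolean filtering
    print(df.groupby("group")["score"].mean())  # split-apply-combine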
April 4 (Lab): NO LAB TODAY

April 6: Econometrics II: Regression and Impact Evaluation

  • Impact Evaluation
  • Regression
Required Readings
  • Sections 1-3 of Schultz: School subsidies for the poor
  • Chapters 2-3 of Khandker et al. (2010), “Handbook on Impact Evaluation”
Optional Readings
  • David Albouy: Lecture notes on Differences in Differences Estimation

April 11: Econometrics III: Heterogeneity and Fixed Effects

  • Interactions
  • Difference in difference
  • Fixed and Random effects models
Required Readings
  • Lecture notes on “Fixed Effects Models”
  • Chapter 5 of Khandker et al. (2010), “Handbook on Impact Evaluation”
April 11 (Lab): Simple statistical tests
  • Perform t-tests
  • Run a basic regression
  • Regression with dummy variables
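A minimal sketch of the kind of t-test and dummy-variable regression this lab describes, assuming numpy, pandas, scipy, and statsmodels, on simulated data:

    # Sketch: a two-sample t-test and the equivalent regression with a treatment dummy.
    # Assumes numpy, pandas, scipy, and statsmodels; the data is simulated.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from scipy import stats

    rng = np.random.RandomState(0)
    df = pd.DataFrame({"treated": np.repeat([0, 1], 50)})
    df["y"] = 2.0 + 1.5 * df["treated"] + rng.normal(size=100)

    # t-test comparing treatment and control group means
    t, p = stats.ttest_ind(df.loc[df.treated == 1, "y"],
                           df.loc[df.treated == 0, "y"])
    print("t = %.2f, p = %.3f" % (t, p))

    # the same comparison as a regression on a dummy variable
    print(smf.ols("y ~ treated", data=df).fit().summary())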

April 13: Econometrics IV: Non-experimental methods

  • Instrumental Variables
  • Regression discontinuity
Required Readings
  • Chapters 6 and 7 of Khandker et al. (2010), “Handbook on Impact Evaluation”
Optional Readings
  • Chapter 10 of Stock & Watson (2010) on “Instrumental Variables”
  • Buddelmeyer & Skoufias (2004). An Evaluation of the Performance of Regression Discontinuity Design on PROGRESA.
  • Duflo (2001). Schooling and Labor Market Consequences of School Construction in Indonesia: Evidence from an Unusual Policy Experiment

Supervised Learning

April 18: Design of Machine Learning experiments

  • Supervised and unsupervised learning
  • Training and test data
  • Cross-validation and bootstrapping
  • Evaluation and baselines
  • Generalization and overfitting
  • Features and feature selection
Required Readings
  • Chapter 1 of Daume (in preparation). A course in machine learning
  • Chapter 5 of Witten, Frank, Hall: Data Mining
Optional Readings
  • Syed, A. (2011). A review of cross validation and adaptive model selection.
April 18 (Lab): Regression and prediction
  • Generate random numbers
  • Create training and test data
  • Fit a regression on training data, evaluate performance on test data
  • Compare different measures of performance
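A minimal scikit-learn sketch of the train/test workflow outlined above; the synthetic data and the choice of metrics are illustrative:

    # Sketch: fit a regression on training data, evaluate on held-out test data.
    # Assumes numpy and scikit-learn; the data is simulated.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    rng = np.random.RandomState(0)
    X = rng.uniform(size=(200, 3))
    y = X.dot(np.array([1.0, -2.0, 0.5])) + rng.normal(scale=0.3, size=200)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                        random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    pred = model.predict(X_test)
    print("test MSE: %.3f" % mean_squared_error(y_test, pred))
    print("test R^2: %.3f" % r2_score(y_test, pred))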

April 20: Nearest neighbors

  • Instance-based learning
  • Nearest neighbors
  • Curse of dimensionality
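As a concrete illustration of instance-based learning, a minimal k-nearest-neighbors sketch with scikit-learn; the dataset and k = 5 are arbitrary choices:

    # Sketch: k-nearest-neighbors classification on scikit-learn's built-in iris data.
    # Assumes scikit-learn; k=5 is an illustrative choice.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print("test accuracy: %.2f" % knn.score(X_test, y_test))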
Required Readings
  • Chapter 2 of Daume (2015) A course in machine learning
Optional Readings
  • Chapter 13 (sections 13.1 - 13.3) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)
  • Chapter 6 of Provost & Fawcett: Data Science for Business

April 25: Gradient Descent

  • Cost functions
  • Gradient descent
  • Convexity
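A minimal numpy sketch of gradient descent on a convex (squared-error) cost with a single slope parameter; the data, learning rate, and iteration count are illustrative:

    # Sketch: gradient descent minimizing mean squared error for one slope parameter.
    # Assumes numpy; the simulated data has a true slope of 3.0.
    import numpy as np

    rng = np.random.RandomState(0)
    x = rng.uniform(size=100)
    y = 3.0 * x + rng.normal(scale=0.1, size=100)

    w, lr = 0.0, 0.1
    for step in range(200):
        grad = -2.0 * np.mean((y - w * x) * x)   # derivative of the cost w.r.t. w
        w -= lr * grad
    print("estimated slope: %.3f" % w)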
Required Readings
  • Reread section 4.6 of Witten, Frank, Hall: Data Mining
  • Chapter 6 of Daume (in preparation). A course in machine learning
Optional Readings
  • Chapter 5 of Schutt & O’Neill (2013): Doing Data Science
  • Zumel and Mount, Chapter 7
April 25 (Lab): Logistic regression
  • Compare regression to LASSO
  • Explore issues of overfitting
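A minimal sketch of the regression-versus-LASSO comparison this lab describes, using scikit-learn on synthetic data where most features are irrelevant:

    # Sketch: ordinary least squares vs. LASSO when only 3 of 50 features matter.
    # Assumes numpy and scikit-learn; the data and alpha are illustrative.
    import numpy as np
    from sklearn.linear_model import Lasso, LinearRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.RandomState(0)
    X = rng.normal(size=(100, 50))
    y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=100)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    for model in (LinearRegression(), Lasso(alpha=0.1)):
        model.fit(X_train, y_train)
        print(type(model).__name__,
              "train R^2: %.2f" % model.score(X_train, y_train),
              "test R^2: %.2f" % model.score(X_test, y_test))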

April 27 & May 2: Regularization and linear models

  • Regularization
  • Ridge and Lasso
  • Logistic regression
  • Support vector machines
  • Kernel methods
Required Readings
  • Chapter 6 of Daume (in preparation). A course in machine learning
  • Chapter 9 of Daume (in preparation). A course in machine learning
Optional Readings
  • Chapter 3 (sections 3.3 and 3.4), of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)
  • Zumel and Mount, Chapter 7
May 2 (Lab): Regularization
  • TBD

May 4: Naïve Bayes

  • Probability review: Bayes rule, independence, distributions
  • Generative models and Naive Bayes
  • Maximum likelihood estimation and smoothing
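A minimal sketch of a generative Naive Bayes classifier with add-one (Laplace) smoothing, assuming scikit-learn; the toy documents and labels are made up:

    # Sketch: multinomial Naive Bayes with Laplace smoothing on toy "spam" data.
    # Assumes scikit-learn; the documents and labels are made up.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = ["free money now", "meeting at noon", "win money free", "lunch meeting today"]
    labels = [1, 0, 1, 0]                           # 1 = spam, 0 = not spam

    vec = CountVectorizer()
    X = vec.fit_transform(docs)                     # bag-of-words counts
    clf = MultinomialNB(alpha=1.0).fit(X, labels)   # alpha=1.0 is add-one smoothing
    print(clf.predict(vec.transform(["free lunch today"])))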
Required Readings
  • Chapter 4 of Schutt & O’Neill (2013): Doing Data Science
  • Reread section 4.2 of Witten, Frank, Hall: Data Mining
  • Michael Collins's lecture notes on Naïve Bayes (especially pp. 1-4)
Optional Readings

May 9: Trees and Forests (Jevin West)

  • Decision trees
  • Adaboost: combining decision stumps
  • Random forests and combining classifiers
Required Readings
  • Chapter 4 of Witten, Frank, Hall: Data Mining
Optional Readings
  • Chapter 9 (section 9.2) and Chapter 15 of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)
May 9 (Lab): TBD

May 11: Neural Networks

  • Perceptrons
  • Biological origins
  • Model representation
  • Cost functions
  • Backpropagation
Required Readings
  • Chapter 8 of Daume (in preparation). A course in machine learning
Optional Readings
  • Schmidhuber (2014). “Deep Learning in Neural Networks: An Overview”. Technical Report
  • Edwards (2015). “Growing pains for deep learning.” Communications of the ACM

May 16: Supervised Learning Wrap-up

Required Readings
  • Wu et al. (2008) “Top 10 Algorithms in Data Mining”
  • Domingos, “A Few Useful Things to Know about Machine Learning” Communications of the ACM, 55 (10), 78-87, 2012.
May 16 (Lab): Neural networks in Python
  • Perceptrons
  • Neural networks in sklearn
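A minimal sketch of the perceptron portion, assuming scikit-learn (newer scikit-learn versions also provide MLPClassifier for multi-layer networks); the digits dataset is an arbitrary choice:

    # Sketch: a linear perceptron classifier on scikit-learn's built-in digits data.
    # Assumes scikit-learn; the dataset and settings are illustrative.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import Perceptron
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = Perceptron(max_iter=1000).fit(X_train, y_train)
    print("test accuracy: %.2f" % clf.score(X_test, y_test))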

Unsupervised Learning

May 18: Dimensionality Reduction

  • Dimensionality Reduction
  • Principal Component Analysis
  • Case study: Eigenfaces
  • Other methods for dimensionality reduction: SVD, NNMF, LDA
Required Readings
  • Chapter 7 (section 7.4) of Witten, Frank, Hall: Data Mining
  • Chapter 11 (sections 11.1 – 11.3) of Rajaraman et al.: Mining of Massive Datasets
Optional Readings
  • Watch Pedro Domingos talk about the curse of dimensionality: https://class.coursera.org/machlearning-001/lecture/215 (segment 4 of week 4)
  • Chapter 14 (sections 14.2, 14.5 - 14.10) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)
  • Turk & Pentland (1991) “Eigenfaces for Recognition”

May 23: Cluster Analysis

  • Introduction to unsupervised learning
  • Distance metrics
  • K-Means clustering
  • Hierarchical clustering
Required Readings
Optional Readings
  • Chapter 6 of Provost & Fawcett: Data Science for Business
May 23 (Lab): Unsupervised Learning
  • Clustering
  • PCA
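A minimal scikit-learn sketch of the clustering and PCA tasks above; the iris dataset and the number of clusters and components are arbitrary choices:

    # Sketch: k-means clustering and a 2-D PCA projection of the built-in iris data.
    # Assumes scikit-learn; 3 clusters and 2 components are illustrative choices.
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)
    labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)  # cluster assignments
    X_2d = PCA(n_components=2).fit_transform(X)                   # project to 2 dimensions
    print(labels[:10])
    print(X_2d[:3])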

Applications and Implementation

May 25: Recommender Systems

  • The Netflix challenge
  • Content-based methods
  • Learning features and parameters
  • Nearest-neighbor collaborative filtering
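To make the nearest-neighbor collaborative filtering idea concrete, a minimal numpy sketch on a made-up user-item ratings matrix (treating unrated items as zeros is a simplification):

    # Sketch: user-based collaborative filtering with cosine similarity.
    # Rows are users, columns are items; 0 means "not yet rated". Data is made up.
    import numpy as np

    R = np.array([[5, 4, 0, 1],
                  [4, 5, 1, 0],
                  [1, 0, 5, 4],
                  [0, 1, 4, 5]], dtype=float)

    def cosine(u, v):
        return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

    target, item = 0, 2                      # predict user 0's rating of item 2
    others = [u for u in range(len(R)) if u != target]
    sims = np.array([cosine(R[target], R[u]) for u in others])
    pred = sims.dot(R[others, item]) / sims.sum()   # similarity-weighted average
    print("predicted rating: %.2f" % pred)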
Required Readings
  • Chapter 8 of Schutt & O’Neill (2013): Doing Data Science
Optional Readings
  • Chapter 9 of Leskovec et al: Mining of Massive Datasets (freely available online)
  • Yehuda Koren (2009) “The BellKor Solution to the Netflix Grand Prize"
  • Resnick et al (1994) “GroupLens: an open architecture for collaborative filtering of netnews”, CSCW ’94, pp. 175-186
  • RM Bell, Y Koren (2007) “Lessons from the Netflix prize challenge”, ACM SIGKDD Explorations Newsletter

May 30: Scaling and Map-Reduce

  • Scaling
  • Map-Reduce
  • The Hadoop ecosystem
Optional Readings
  • Chapter 2 of Leskovec et al (2014) Mining of Massive Datasets (freely available online)
  • Dean and Ghemawat, “MapReduce: A Flexible Data Processing Tool”, Communications of the ACM. January 2010.
  • Hiemstra and Hauff, “MapReduce for Information Retrieval: Let’s Quickly Test This on 12 TB of Data”, In: M. Agosti et al. (Eds.): CLEF 2010, pp. 64-69, 2010
May 30 (Lab): Map-Reduce in Python
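A minimal pure-Python sketch of the map-reduce pattern as a word count (no Hadoop; the input documents are made up):

    # Sketch: the map-shuffle-reduce pattern as a word count in plain Python.
    # A real Hadoop job would distribute the map and reduce steps across machines.
    from itertools import groupby
    from operator import itemgetter

    docs = ["the cat sat", "the cat ran", "a dog ran"]

    pairs = [(word, 1) for doc in docs for word in doc.split()]   # map: emit (word, 1)
    pairs.sort(key=itemgetter(0))                                 # shuffle: group by key
    counts = {word: sum(c for _, c in group)                      # reduce: sum counts
              for word, group in groupby(pairs, key=itemgetter(0))}
    print(counts)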

June 1: Last Day of Class: Jeopardy + mini-Final

  • Review all lecture notes and readings for the quarter