Note: This is archived material from January 2014.

INFX 598 B: Core Methods in Data Science

Winter 2014
University of Washington School of Information
Lectures: Monday and Wednesday 1:30-3:20pm, MGH 251

Instructor: Prof. Joshua Blumenstock, Suite 370 Mary Gates Hall

Course Description

Provides skills required to analyze and derive insight from large-scale, heterogeneous data. Covers key concepts of functional and imperative programming for storing, extracting, analyzing and presenting large data projects; and data analysis skills using inferential statistics, supervised and unsupervised machine learning. Students gain experience with modeling social and behavioral data.

Prerequisites

INFX 573 or permission of instructor. Students are expected to be comfortable programming in R, and to have mastery of one higher-level programming language such as Python, PHP, Java, C++, etc.

Course Outline:

(Re-)Introduction to Data Science

January 6: Introduction to the course

Probability

January 8: Introduction to Probability

January 13: Conditional Probability

January 15: Programming with Data

January 22: Random Variables and Distributions

Inferential Statistics

January 27: Learning from Data

January 29: Estimation

February 3: Testing Hypotheses and Linear Models

Supervised and Unsupervised learning

February 5: Design of Machine Learning Experiments

February 10: Linear Models Revisited

February 12: Nearest Neighbors and Perceptrons

February 19: Generative models and Naive Bayes

February 24: Trees and Forests

February 26: Neural Networks

Applications and Implementation

March 3: Feature Selection and Dimensionality Reduction

March 5: Recommender Systems

March 12: Special Topics

Grading

Course grades will be based primarily on a final group project, problem sets, and overall classroom participation. Details on each of these assignments will be provided on Canvas. Each component will be weighted as follows:

Problem Sets: 80%

Problem sets 1-3: 10% each
Problem sets 4-5: 25% each

Participation, quizzes, and mini-assignments: 20%
Extra credit: 3%
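
As a rough illustration of how the weights above combine, here is a minimal Python sketch. The component names, the 0-100 score scale, and the treatment of extra credit as up to 3 additive percentage points are assumptions made for this example, not details stated in the syllabus.

```python
# Weights as listed in the grading breakdown (fractions of the course grade).
WEIGHTS = {
    "ps1": 0.10, "ps2": 0.10, "ps3": 0.10,   # Problem sets 1-3: 10% each
    "ps4": 0.25, "ps5": 0.25,                # Problem sets 4-5: 25% each
    "participation": 0.20,                   # Participation, quizzes, mini-assignments
}
MAX_EXTRA_CREDIT = 3.0  # percentage points; additive treatment is an assumption


def course_grade(scores, extra_credit=0.0):
    """Weighted average of component scores (each 0-100), plus capped extra credit."""
    base = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    bonus = min(max(extra_credit, 0.0), MAX_EXTRA_CREDIT)
    return base + bonus


# Hypothetical scores, for illustration only.
example = {"ps1": 90, "ps2": 85, "ps3": 100, "ps4": 80, "ps5": 95, "participation": 100}
print(round(course_grade(example, extra_credit=1.5), 2))
```

The problem-set weights sum to 80% and participation adds the remaining 20%, so the extra credit sits on top of a full 100%.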

Grading Policy

All assignments are to be submitted on Canvas by 11:59pm on the due date.
Assignments turned in *up to* 24 hours after the due date will be penalized 20%.
Any assignments turned in more than 24 hours late will receive no credit.
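
The late policy above can be sketched as a small Python function. This is only an illustration: it assumes the 20% penalty is taken off the earned score multiplicatively, which the policy does not spell out, and the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta

LATE_WINDOW = timedelta(hours=24)  # grace period during which the penalty applies
LATE_PENALTY = 0.20                # assumed to be a fraction of the earned score


def late_adjusted(score, due, submitted):
    """Apply the late policy to a raw score given due and submission times."""
    if submitted <= due:
        return float(score)                 # on time: full credit
    if submitted - due <= LATE_WINDOW:
        return score * (1 - LATE_PENALTY)   # up to 24 hours late: 20% penalty
    return 0.0                              # more than 24 hours late: no credit


# Hypothetical due date, for illustration only.
due = datetime(2014, 1, 15, 23, 59)
print(late_adjusted(90, due, due + timedelta(hours=3)))
```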

Academic Integrity Policy

Discussion with instructors and classmates is encouraged, but each student must turn in individual, original work and cite sources where appropriate. Any assignment or exam that is handed in must be your own work. However, talking with one another to understand the material better is strongly encouraged. Recognizing the distinction between cheating and cooperation is very important. If you copy someone else's solution, you are cheating. If you let someone else copy your solution, you are cheating. If someone dictates a solution to you, you are cheating. Everything you hand in must be in your own words, and based on your own understanding of the solution. If someone helps you understand the problem during a high-level discussion, you are not cheating. We strongly encourage students to help one another understand the material presented in class, in the book, and general issues relevant to the assignments.

Readings and Participation

Students are also encouraged to participate in the class by engaging with the readings and the course content outside of class. There are myriad ways to engage: by posting summaries of optional readings, writing blog posts on a related topic (and posting the link on Canvas), or discussing how a news article or radio show relates to the concepts raised in class. This engagement can be done in any forum, but to ensure that the other students (and the instructor) are aware of your participation, be sure to add a link to what you've done on Canvas.

Detailed Syllabus

January 6: Introduction to the course

Readings:

Chapter 1, Provost and Fawcett (2013) Data Science for Business.

Optional Readings:

Press (2013) "A Very Short History of Data Science." Forbes
Patil. (2011) "Building Data Science Teams" O'Reilly Radar.

Probability

January 8: Introduction to Probability

Readings

Chapter 2, Provost and Fawcett (2013) Data Science for Business.
Chapter 2, Ross. (2012) A First Course in Probability.

Optional Readings

Chapter 1, DeGroot and Schervish (2002) Probability and Statistics.
Chapter 1, Wasserman. (2004) All of Statistics.
Watch The Joy of Statistics

January 13: Conditional Probability

Readings

Chapter 3, Ross. (xxxx) A First Course in Probability

Optional Readings

Chapter 2, DeGroot and Schervish (2002) Probability and Statistics
Quick Intro to Counting Methods: Permutations and Combinations

January 15: Programming with Data

Readings

Install IPython on your laptop and bring it to class: http://ipython.org/install.html (I recommend the Anaconda version, but if you want to assemble the packages yourself, make sure you have python, ipython notebook, numpy, scipy, and matplotlib)
Read and complete the "Introduction" to the following Python tutorial: http://interactivepython.org/courselib/static/pythonds/index.html
Chapters 4 and 5 of McKinney (2012): Python for Data Analysis.
Watch 10-minute tour of pandas
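
One quick way to confirm that an installation has the packages named above is to check that each one is importable. The sketch below is a suggestion rather than part of the course materials; it assumes a Python 3 interpreter and uses the packages' import names (for example `IPython` rather than `ipython`).

```python
import importlib.util

# Import names of the packages listed in the installation note.
REQUIRED = ["IPython", "numpy", "scipy", "matplotlib"]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be located by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

Run it with the same interpreter that launches the notebook, since a machine may have several Python installations.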

Due

Problem Set 1

January 20: NO CLASS - HOLIDAY - MLK Day

January 22: Random Variables and Distributions

Readings

Chapter 3, Pages 93-151 DeGroot and Schervish (2002) Probability and Statistics
Chapter 4, DeGroot and Schervish (2002) Probability and Statistics

Optional Readings

Chapter 2, Wasserman (2004) All of Statistics
Chapter 4, Ross xxxx A First Course in Probability

Inferential Statistics

January 27: Learning from Data

Readings

Section on Normal Distribution, DeGroot and Schervish (2002) Probability and Statistics
Chapter 6, DeGroot and Schervish (2002) Probability and Statistics

January 29: Estimation

Readings

Chapter 7, DeGroot and Schervish (2002) Probability and Statistics
http://www.ics.uci.edu/~smyth/courses/cs274/papers/MLtutorial.pdf

Due

Problem Set 2

February 3: Testing Hypotheses and Linear Models

Readings

Chapter 9, Sections 9.1-9.6 DeGroot and Schervish (2002) Probability and Statistics
Chapter 6 of: A Handbook of Statistical Analyses Using R (If you haven't already read it!)

Machine Learning

February 5: Design of Machine Learning Experiments

Readings

Chapter 5 of Witten, Frank, Hall: Data Mining
Chapters 3 and 4 of Provost & Fawcett: Data Science for Business (if you haven't already read them)
Chapters 4 and 5 of McKinney (2012): Python for Data Analysis [Important if you're new to python!]

Optional Readings

Chapter 2 of Friedman, Hastie, Tibshirani, The Elements of Statistical Learning (10th edition)
P. Domingos, "A Few Useful Things to Know about Machine Learning." Communications of the ACM, 55 (10), 78-87, 2012.

February 10: Linear Models Revisited

Readings

Chapter 5 of Schutt & O'Neill (2013): Doing Data Science
Reread section 4.6 of Witten, Frank, Hall: Data Mining

Optional Readings

Chapter 3 (sections 3.1 and 3.2), Chapter 4 (especially section 4.4), and Chapter 6 (sections 6.1-6.3) of Friedman, Hastie, Tibshirani, The Elements of Statistical Learning (10th edition)

February 12: Nearest Neighbors and Perceptrons

Required Readings

Chapter 4 (especially section 4.7) of Witten, Frank, Hall: Data Mining
Chapter 6 of Provost & Fawcett: Data Science for Business

Optional Readings

Chapter 13 (sections 13.1-13.3) of Friedman, Hastie, Tibshirani, The Elements of Statistical Learning (10th edition)
Chapter 6 (sections 6.1-6.6) on Kernel Regression in Friedman, Hastie, Tibshirani, The Elements of Statistical Learning (10th edition)
Chapter 3 of Schutt & O'Neill (2013): Doing Data Science
Chapter 3 of Daume (in preparation). A course in machine learning
Watch Pedro Domingos talk about the Curse of Dimensionality

Due

Problem Set 3

February 17: HOLIDAY

Due

Problem Set 4 (Optional Section)

February 19: Generative models and Naive Bayes. Guest Speaker: Johan Ugander

Required Readings

Chapter 4 of Schutt & O'Neill (2013): Doing Data Science
Reread section 4.2 of Witten, Frank, Hall: Data Mining
Michael Collins's lecture notes on Naive Bayes (especially pp. 1-4)

Optional Readings

Paul Graham (2002) "Better Bayesian Filtering". http://www.paulgraham.com/better.html
Kanich, Chris, et al. (2008) "Spamalytics: An empirical analysis of spam marketing conversion" ACM conference on Computer and communications security.
Kevin Murphy's example of Bayes' Rule for medical diagnosis

February 24: Trees and Forests. Guest Speaker: Brendan O'Connor

Required Readings

Chapter 6 (section 6.1) of Witten, Frank, Hall: Data Mining
Chapter 8 of Witten, Frank, Hall: Data Mining

Optional Readings

Chapter 9 (section 9.2) and Chapter 15 of Friedman, Hastie, Tibshirani, The Elements of Statistical Learning (10th edition)

February 26: Neural Networks

Required Readings

Chapter 7 (section 7.4) of Witten, Frank, Hall: Data Mining
Chapter 8 of Daume (in preparation). A course in machine learning

Optional Readings

Egmont-Petersen et al. (2002). "Image processing with neural networks: A review" Pattern recognition.

Applications and Implementation

March 3: Feature Selection and Dimensionality Reduction

Readings

Chapter 11 (sections 11.1 - 11.3) of Rajaraman et al: Mining of Massive Datasets
Chapter 7 (section 7.4) of Witten, Frank, Hall: Data Mining

Optional Readings

Chapter 14 (sections 14.2, 14.5 - 14.10) of Friedman, Hastie, Tibshirani, The Elements of Statistical Learning (10th edition)
Turk & Pentland (1991) "Eigenfaces for Recognition"

March 5: Recommender Systems

Readings

Chapter 8 of Schutt & O'Neill (2013): Doing Data Science

Optional Readings

Chapter 9 of Rajaraman et al: Mining of Massive Datasets
Yehuda Koren (2009) "The BellKor Solution to the Netflix Grand Prize"
Resnick et al (1994) "GroupLens: an open architecture for collaborative filtering of netnews", CSCW '94, pp. 175-186
RM Bell, Y Koren (2007) "Lessons from the Netflix prize challenge", ACM SIGKDD Explorations Newsletter

March 10: Recommender Systems, part 2. Guest Speaker Chris DuBois

Readings

March 12: Special Topics

Due