## Note: This is archived material from January 2014.

# INFX 598 B: Core Methods in Data Science

**Winter 2014**

**University of Washington School of Information**

**Lectures: Monday and Wednesday 1:30-3:20pm, MGH 251**


Instructor: Prof. Joshua Blumenstock, Suite 370 Mary Gates Hall

# Course Description

Provides the skills required to analyze and derive insight from large-scale, heterogeneous data. Covers key concepts of functional and imperative programming for storing, extracting, analyzing, and presenting large-scale data, and data analysis skills using inferential statistics and supervised and unsupervised machine learning. Students gain experience with modeling social and behavioral data.

# Prerequisites

INFX 573 or permission of instructor. Students are expected to be comfortable programming in R, and to have mastery of at least one higher-level programming language such as Python, PHP, Java, or C++.

# Course Outline:

## (Re-)Introduction to Data Science

#### January 6: Introduction to the course

## Probability

#### January 8: Introduction to Probability

#### January 13: Conditional Probability

#### January 15: Programming with Data

#### January 22: Random Variables and Distributions

## Inferential Statistics

#### January 27: Learning from Data

#### January 29: Estimation

#### February 3: Testing Hypotheses and Linear Models

## Supervised and Unsupervised Learning

#### February 5: Design of Machine Learning Experiments

#### February 10: Linear Models Revisited

#### February 12: Nearest Neighbors and Perceptrons

#### February 19: Generative Models and Naive Bayes

#### February 24: Trees and Forests

#### February 26: Neural Networks

## Applications and Implementation

#### March 3: Feature Selection and Dimensionality Reduction

#### March 5: Recommender Systems

#### March 10: Recommender Systems, part 2

#### March 12: Special Topics

# Grading

Course grades will be based primarily on a final group project, problem sets, and overall classroom participation. Details on each of these assignments will be provided on Canvas. Each component will be weighted as follows:

- Problem Sets: 80%
  - Problem sets 1-3: 10% each
  - Problem sets 4-5: 25% each
- Participation, quizzes, and mini-assignments: 20%
- Extra credit: 3%

### Grading Policy

- All assignments are to be submitted on Canvas by 11:59pm on the due date.
- Assignments turned in *up to* 24 hours after the due date will be penalized 20%.
- Any assignments turned in more than 24 hours late will receive no credit.

### Academic Integrity Policy

Discussion with instructors and classmates is encouraged, but each student must turn in individual, original work and cite sources where appropriate. Any assignment or exam that you hand in must be your own work; however, talking with one another to understand the material better is strongly encouraged. Recognizing the distinction between cheating and cooperation is important: if you copy someone else's solution, you are cheating; if you let someone else copy your solution, you are cheating; if someone dictates a solution to you, you are cheating. Everything you hand in must be in your own words and based on your own understanding of the solution. If someone helps you understand a problem during a high-level discussion, you are not cheating. We strongly encourage students to help one another understand the material presented in class, in the book, and in general issues relevant to the assignments.

## Readings and Participation

Students are also encouraged to engage with the readings and the course content outside of class. There are myriad ways to do so: posting summaries of optional readings, writing blog posts on a related topic (and posting the link on Canvas), or discussing how a news article or radio show relates to concepts raised in class. This engagement can happen in any forum, but to ensure that the other students (and the instructor) are aware of your participation, be sure to add a link to what you've done on Canvas.

# Detailed Syllabus

### January 6: Introduction to the course

##### Readings:

- Chapter 1, Provost and Fawcett (2013) Data Science for Business.

##### Optional Readings:

- Press (2013) "A Very Short History of Data Science." Forbes
- Patil. (2011) "Building Data Science Teams" O'Reilly Radar.

## Probability

### January 8: Introduction to Probability

##### Readings

- Chapter 2, Provost and Fawcett (2013) Data Science for Business.
- Chapter 2, Ross. (2012) A First Course in Probability.

##### Optional Readings

- Chapter 1, DeGroot and Schervish (2002) Probability and Statistics.
- Chapter 1, Wasserman. (2004) All of Statistics.
- Watch The Joy of Statistics

### January 13: Conditional Probability

##### Readings

- Chapter 3, Ross (2012) A First Course in Probability

##### Optional Readings

- Chapter 2, DeGroot and Schervish (2002) Probability and Statistics
- Quick Intro to Counting Methods: Permutations and Combinations
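As a quick preview of the counting methods above, here is a minimal Python sketch (a toy illustration of my own, not taken from the readings), using the standard library's `math.comb` and `math.perm`:

```python
from math import comb, perm

# Combinations: ways to choose 3 students from 10 when order is irrelevant.
n_choose = comb(10, 3)   # C(10, 3) = 120

# Permutations: ways to seat 3 of 10 students in a row when order matters.
n_arrange = perm(10, 3)  # P(10, 3) = 10 * 9 * 8 = 720

# Conditional probability by counting: given the first card drawn from a
# standard deck is an ace, P(second card is also an ace) = 3 / 51.
p_second_ace = 3 / 51
print(n_choose, n_arrange, round(p_second_ace, 3))
```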

### January 15: Programming with Data

##### Readings

- Install IPython on your laptop and bring it to class: http://ipython.org/install.html (I recommend the Anaconda distribution, but if you want to assemble the packages yourself, make sure you have python, ipython notebook, numpy, scipy, and matplotlib)
- Read and complete the "Introduction" to the following Python tutorial: http://interactivepython.org/courselib/static/pythonds/index.html
- Chapters 4 and 5 of McKinney (2012): Python for Data Analysis.
- Watch 10-minute tour of pandas
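Once IPython and pandas are installed, a minimal session like the following (a hypothetical toy data frame, not a course dataset) is a good way to confirm the setup works:

```python
import pandas as pd

# Hypothetical data: a few observations of sales by city.
df = pd.DataFrame({"city": ["Seattle", "Tacoma", "Seattle"],
                   "sales": [10, 20, 30]})

# Split-apply-combine, as shown in the pandas tour: total sales per city.
totals = df.groupby("city")["sales"].sum()
print(totals)
```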

##### Due

- Problem Set 1

### January 20: NO CLASS - HOLIDAY - MLK Day

### January 22: Random Variables and Distributions

##### Readings

- Chapter 3 (pages 93-151), DeGroot and Schervish (2002) Probability and Statistics
- Chapter 4, DeGroot and Schervish (2002) Probability and Statistics

##### Optional Readings

- Chapter 2, Wasserman (2004) All of Statistics
- Chapter 4, Ross (2012) A First Course in Probability

## Inferential Statistics

### January 27: Learning from Data

##### Readings

- Section on Normal Distribution, DeGroot and Schervish (2002) Probability and Statistics
- Chapter 6, DeGroot and Schervish (2002) Probability and Statistics

### January 29: Estimation

##### Readings

- Chapter 7, DeGroot and Schervish (2002) Probability and Statistics
- http://www.ics.uci.edu/~smyth/courses/cs274/papers/MLtutorial.pdf
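The maximum-likelihood idea covered in the tutorial above fits in a few lines (a toy example of my own, not from the readings): for i.i.d. Bernoulli data, the likelihood is maximized at the sample mean.

```python
# Toy maximum-likelihood estimation: for i.i.d. Bernoulli(p) observations x,
# the log-likelihood  sum(x) * log(p) + (n - sum(x)) * log(1 - p)
# is maximized at p_hat = sample mean of x.
data = [1, 0, 1, 1, 0, 1, 1, 1]
p_hat = sum(data) / len(data)
print(p_hat)  # 0.75
```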

##### Due

- Problem Set 2

### February 3: Testing Hypotheses and Linear Models

##### Readings

- Chapter 9 (sections 9.1-9.6), DeGroot and Schervish (2002) Probability and Statistics
- Chapter 6 of A Handbook of Statistical Analyses Using R (if you haven't already read it!)

## Supervised and Unsupervised Learning

### February 5: Design of Machine Learning Experiments

##### Readings

- Chapter 5 of Witten, Frank, Hall: Data Mining
- Chapters 3 and 4 of Provost & Fawcett: Data Science for Business (if you haven't already read them)
- Chapters 4 and 5 of McKinney (2012): Python for Data Analysis [important if you're new to Python!]

##### Optional Readings

- Chapter 2 of Friedman, Hastie, Tibshirani, The Elements of Statistical Learning (10th edition)
- P. Domingos, A Few Useful Things to Know about Machine Learning Communications of the ACM, 55 (10), 78-87, 2012.

### February 10: Linear Models Revisited

##### Readings

- Chapter 5 of Schutt & O'Neil (2013): Doing Data Science
- Reread section 4.6 of Witten, Frank, Hall: Data Mining

##### Optional Readings

- Chapter 3 (sections 3.1 and 3.2), Chapter 4 (especially section 4.4), and Chapter 6 (sections 6.1-6.3) of Friedman, Hastie, Tibshirani, The Elements of Statistical Learning (10th edition)

### February 12: Nearest Neighbors and Perceptrons

##### Required Readings

- Chapter 4 (especially section 4.7) of Witten, Frank, Hall: Data Mining
- Chapter 6 of Provost & Fawcett: Data Science for Business

##### Optional Readings

- Chapter 13 (sections 13.1-13.3) of Friedman, Hastie, Tibshirani, The Elements of Statistical Learning (10th edition)
- Chapter 6 (sections 6.1-6.6) on Kernel Regression in Friedman, Hastie, Tibshirani, The Elements of Statistical Learning (10th edition)
- Chapter 3 of Schutt & O'Neil (2013): Doing Data Science
- Chapter 3 of Daumé (in preparation): A Course in Machine Learning
- Watch Pedro Domingos talk about the Curse of Dimensionality
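A k-nearest-neighbors classifier is short enough to sketch in plain Python (a toy implementation with made-up points, assuming Euclidean distance and majority vote):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points, using squared Euclidean distance."""
    nearest = sorted(train,
                     key=lambda pt: sum((a - b) ** 2
                                        for a, b in zip(pt[0], query)))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training points with two class labels.
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B")]
print(knn_predict(train, (0.5, 0.5)))  # "A"
```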

##### Due

- Problem Set 3

### February 17: HOLIDAY

##### Due

- Problem Set 4 (Optional Section)

### February 19: Generative Models and Naive Bayes. Guest Speaker: Johan Ugander

##### Required Readings

- Chapter 4 of Schutt & O'Neil (2013): Doing Data Science
- Reread section 4.2 of Witten, Frank, Hall: Data Mining
- Michael Collins's lecture notes on Naive Bayes (especially pp. 1-4)

##### Optional Readings

- Paul Graham (2002) "Better Bayesian Filtering". http://www.paulgraham.com/better.html
- Kanich, Chris, et al. (2008) "Spamalytics: An empirical analysis of spam marketing conversion", ACM Conference on Computer and Communications Security.
- Kevin Murphy's example of Bayes' Rule for medical diagnosis
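The medical-diagnosis example is a classic application of Bayes' rule; here is a sketch with illustrative numbers (my own, not Murphy's actual figures):

```python
# Bayes' rule for a diagnostic test (numbers are illustrative only).
prior = 0.01   # P(disease) in the population
sens = 0.90    # sensitivity: P(positive | disease)
spec = 0.95    # specificity: P(negative | no disease)

# Law of total probability: P(positive test).
p_pos = sens * prior + (1 - spec) * (1 - prior)

# Bayes' rule: P(disease | positive test).
posterior = sens * prior / p_pos
print(round(posterior, 3))  # ~0.154: a positive test is far from conclusive
```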

### February 24: Trees and Forests. Guest Speaker: Brendan O'Connor

##### Required Readings

- Chapter 6 (section 6.1) of Witten, Frank, Hall: Data Mining
- Chapter 8 of Witten, Frank, Hall: Data Mining

##### Optional Readings

- Chapter 9 (section 9.2) and Chapter 15 of Friedman, Hastie, Tibshirani, The Elements of Statistical Learning (10th edition)

### February 26: Neural Networks

##### Required Readings

- Chapter 7 (section 7.4) of Witten, Frank, Hall: Data Mining
- Chapter 8 of Daumé (in preparation): A Course in Machine Learning

##### Optional Readings

- Egmont-Petersen et al. (2002). "Image processing with neural networks: A review" Pattern recognition.

## Applications and Implementation

### March 3: Feature Selection and Dimensionality Reduction

##### Readings

- Chapter 11 (sections 11.1-11.3) of Rajaraman et al.: Mining of Massive Datasets
- Chapter 7 (section 7.4) of Witten, Frank, Hall: Data Mining

##### Optional Readings

- Chapter 14 (sections 14.2, 14.5-14.10) of Friedman, Hastie, Tibshirani, The Elements of Statistical Learning (10th edition)
- Turk & Pentland (1991) "Eigenfaces for Recognition"

### March 5: Recommender Systems

##### Readings

- Chapter 8 of Schutt & O'Neil (2013): Doing Data Science

##### Optional Readings

- Chapter 9 of Rajaraman et al.: Mining of Massive Datasets
- Yehuda Koren (2009) "The BellKor Solution to the Netflix Grand Prize"
- Resnick et al. (1994) "GroupLens: An Open Architecture for Collaborative Filtering of Netnews", CSCW '94, pp. 175-186
- RM Bell, Y Koren (2007) "Lessons from the Netflix Prize Challenge", ACM SIGKDD Explorations Newsletter
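The neighborhood-based collaborative filtering behind systems like GroupLens can be sketched with cosine similarity between rating vectors (toy data of my own, not from the papers above):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Hypothetical user-item ratings (columns are the same four items).
alice = [5, 3, 0, 1]
bob   = [4, 3, 0, 1]
carol = [1, 0, 5, 4]

# Alice's tastes align with Bob's, so a user-based recommender would
# weight Bob's ratings more heavily when predicting for Alice.
print(cosine(alice, bob) > cosine(alice, carol))  # True
```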

### March 10: Recommender Systems, part 2. Guest Speaker Chris DuBois

##### Readings

- TBD

### March 12: Special Topics

##### Due

- Problem Set 5