## Note: This is archived material from September 2013.

# INFX 573: Introduction to Data Science

**Winter 2013**

**University of Washington School of Information**

**Lectures: Tuesday and Thursday 1:30-3:20, MGH 420**


Instructor: Prof. Joshua Blumenstock, Suite 370 Mary Gates Hall

This course offers students a practical, hands-on introduction to the growing field of "Data Science," and common methods for quantitative and computational analytics. As "big data" become the norm in modern business and research environments, there is a growing demand for individuals who are able to derive meaningful insight from large, unruly data. This requires a heterogeneous mix of skills, from data munging and ETL; to machine learning and econometrics; to effective visualization and communication. Through a combination of data-intensive exercises and guest lectures by experts in the field, this course provides an overview of key concepts, skills, and technologies used by data scientists. Interested students must have college-level exposure to statistics and programming.

A data scientist is often described as someone who knows more statistics than a computer scientist and more computer science than a statistician. Most assignments for the course will use the R programming language, and students are strongly encouraged to familiarize themselves with R prior to the first day of class. Please note that students who do not meet the requirements below will find it extremely difficult to complete the required problem sets.

**Programming:** Students should be able to comfortably program in a high-level programming language like Java, Python, PHP, or C/C#/C++. Note that HTML, JavaScript, and VBA are not sufficient in this context. "Comfortably" implies that students should be able to write simple programs from scratch, like a web scraper, a text parser, or a simple game of Scrabble or tic-tac-toe.
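As a rough sense of the level meant by a "simple program from scratch", here is a minimal sketch of a text parser in Python that counts word frequencies. The function name `word_counts` and the sample sentence are illustrative assumptions, not course materials, and the course itself uses R for most assignments.

```python
import re
from collections import Counter


def word_counts(text):
    """Return a Counter mapping each lowercase word in `text` to its frequency."""
    # Treat each run of letters or apostrophes as one word (a simplifying assumption).
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


if __name__ == "__main__":
    # Hypothetical sample input, used only for illustration.
    sample = "Big data, big insight: data science needs data."
    for word, count in word_counts(sample).most_common(3):
        print(word, count)
```

A student who can write and explain a program of this size without help is likely at the expected level.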

**Statistics:** Students should have had introductory coursework in both probability and statistics prior to enrolling in this course. At a minimum, students should have an operational understanding of hypothesis testing, statistical significance, and regression analysis.

# Course Outline:

## Section 1: Introduction to Data Science

#### Lecture 1: What is Data Science?

#### Lecture 2: A Crash Course in R

## Section 2: Empirical Frameworks and Experimental Design

#### Lecture 3: Developing an Empirical Framework

#### Lecture 4: Business experiments

## Section 3: Working with Data

#### Lecture 5: Exploring Data, part 1

#### Lecture 6: Exploring Data, part 2

## Section 4: Basic analytics

#### Lecture 7: Distributions, t-tests, and the importance of basic statistics

#### Lecture 8: How far can basic statistics get you?

#### Lecture 9: Regression

## Section 5: Advanced analytics

#### Lecture 10: Machine Learning I: Introduction

#### Lecture 11: Machine Learning 2: Supervised Learning

#### Lecture 12: Machine Learning 3: Unsupervised Learning

#### Lecture 13: Social Network Analysis

## Visualizing and Communicating Data

#### Lecture 14: Visualizing Quantitative Information

#### Lecture 15: Communicating Results Effectively

## Scaling to terabytes and petabytes

#### Lecture 17: Scaling: What works and what doesn’t (and what might in the future)

#### Lecture 18: Common applications and tools: MapReduce, Hadoop, Hive, and Alternatives

## Perspectives from industry and academia

#### Lecture 19: Group Project Presentations

#### Lecture 20: The future of Data Science and Big Data

# Detailed Syllabus

## Introduction to Data Science

### What is Data Science?

- [Optional] Executive summary of: McKinsey Global Institute. Big data: The next frontier for innovation, competition, and productivity.
- [Optional] Thomas Davenport (2006). “Competing on Analytics”, Harvard Business Review, Jan. 2006, Vol. 84 Issue 1, pp. 99-107.
- [Optional] “So you want to be a Data Scientist” - Nature Blogs
- [Optional] Provost & Fawcett: “Data Science and its relationship to big data and data-driven decision making”, Harvard Business Review

### A Crash Course in R

- Chapter 1 of Torgo, Data Mining with R **[read this first]**
- Chapter 1 of Spector, Data Manipulation with R **[follow along with all examples]**
- Chapter 2 of Spector, Data Manipulation with R **[follow along with all examples]**

## Empirical Frameworks and Experimental Design

### Developing an Empirical Framework

- Chapter 1 of Provost & Fawcett: Data Science for Business
- Chapter 2 of Provost & Fawcett: Data Science for Business
- [optional] Whom the Gods Would Destroy, They First Give Real-time Analytics
- [optional] Chapter 4 of Bernard: Social Research Methods
- [optional] Alamar and Mehrotra, Beyond Moneyball: The rapidly evolving world of sports analytics

##### Due: Background and interests survey

### Business Experiments, A/B testing, and RCTs

- Andrew Gelman: There are four ways to get fired from Caesars
- INTRODUCTION (pp. 263-269) to: Bertrand M.; Karlan D.; Mullainathan S.; Shafir E.; Zinman J. (2012) “What's advertising content worth? Evidence from a consumer credit marketing field experiment”, *Quarterly Journal of Economics*, 125(11) pp. 263-269
- Anderson & Simester (2011). “A Step-By-Step Guide to Smart Business Experiments”, *Harvard Business Review*, pp. 99-105
- Davenport (2009). “How to Design Smart Business Experiments”, *Harvard Business Review*, pp. 69-76.
- Ariely (2004). “Why Businesses Don’t Experiment”, *Harvard Business Review*, p. 34
- Bertrand et al. (2009). “Does Ad Content Affect Consumer Demand?”, *Alliance*, 14:3, p. 18
- Bertrand M.; Karlan D.; Mullainathan S.; Shafir E.; Zinman J. (2012) “What's advertising content worth? Evidence from a consumer credit marketing field experiment”, *Quarterly Journal of Economics*, 125(11) pp. 263-306

## Working With Data

### Exploring Data

- Chapter 3 of Zumel & Mount, Practical Data Science with R
- Getting Started with Charts in R (make sure you understand everything thoroughly)
- Complete the ggplot2 tutorial on canvas
- Review other examples here.

## Analytics and Inference

### Distributions, t-tests, and the importance of basic statistics

- Chapter 1 of Freedman, Pisani, and Purves: Statistics
- Chapter 2 of Freedman, Pisani, and Purves: Statistics
- Chapter 3 of: A Handbook of Statistical Analyses Using R
- H. Stern: “Statistics and the College Football Championship,” *The American Statistician*, 2004.
- Huff: How to Lie With Statistics
- Watch http://www.ted.com/talks/lies_damned_lies_and_statistics_about_tedtalks.html

##### Due: Problem Set 1

### How far can basic statistics get you? [Guest: Andres Monroy-Hernandez, Microsoft Research]

- Chapters 6 & 19 of Freedman, Pisani, and Purves: Statistics
- Chapters 1 & 2 of Freedman: Statistical Models: Theory and Practice
- Watch http://www.gapminder.org/videos/the-joy-of-stats/
- Chapters TBD of Torgo: Data Mining with R: Learning with Case Studies

### Regression [Guest: Desmond Murray, Cisco]

- Chapter 6 of: A Handbook of Statistical Analyses Using R

## Advanced Analytics

### Machine Learning I: Introduction

- Chapter 3 & Chapter 4 of Provost & Fawcett: Data Science for Business
- Chapter 1 of: Hastie, Tibshirani, Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition.
- Chapter 1 of: Mining of Massive Datasets.

##### Due: Problem Set 2

### Machine Learning 2: Supervised Learning

##### Required Readings

- Chapter 8 of The Signal and the Noise: “Less and Less and Less Wrong” (Bayes’ Theorem)

##### Optional Readings

- C. Haruechaiyasak: “A Tutorial on Naive Bayes Classification”, 2008.
- Watch this video [from 32:00]

### Machine Learning 3: Unsupervised Learning

- Chapter 7 [Read 7.1, 7.2, 7.3] of: Mining of Massive Datasets. Online at http://infolab.stanford.edu/~ullman/mmds/book.pdf
- Logistic regression: Tutorials on the logit and odds ratios
- [Optional] Haydn Shaughnessy, “How Semantic Clustering Helps Analyze Consumer Attitudes”
- [Optional:] Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein (1998). “Cluster analysis and display of genome-wide expression patterns”, *Proceedings of the National Academy of Sciences*. Vol. 95 pp. 14863-14868
- [Optional:] Rajkumar Venkatesan (2007). “Cluster Analysis for Segmentation”, *Darden Business Publishing*

##### Due: Problem Set 3

### Nov 21: Social Network Analysis

- N. Godbole et al.: “Large-scale sentiment analysis for news and blogs”, *International Conference on Weblogs and Social Media*, 2007.
- L. Page, et al.: “The PageRank citation ranking: Bringing order to the web”, Stanford, 1999.
- Albert R, Jeong H, Barabasi AL. (1999): “Diameter of the World Wide Web”. *Nature*, 401:130-131

## Visualizing and Communicating Data

### Visualizing Quantitative Data

- Excerpts from Edward Tufte, “Visual Display of Quantitative Information”
- Review the ggplot2 tutorial on canvas
- [Optional:] WSJ Guide to Information Visualization
- [Optional:] “How to Lie with Charts” and “How to Lie with Maps”
- [Optional:] Excerpts from Beautiful Data

## Scaling to Terabytes and Petabytes

### Storage and Organization: Databases, Scalable SQL and NoSQL

- AWS tutorial
- Read this MapReduce tutorial
- [Optional:] Read Yahoo Hadoop tutorial
- [Optional:] Cohen et al. (2009) “MAD Skills: New Analysis Practices for Big Data”
- [Optional:] Stonebraker et al. (2010). “MapReduce and Parallel DBMS’s: Friends or Foes?”
- [Optional:] Rick Cattell, “Scalable SQL and NoSQL Data Stores,” *SIGMOD Record*, December 2010 (39:4)

### Common Applications and tools: MapReduce, Hadoop, Hive, and Alternatives

- Chapter 2 (“Large-Scale File Systems and Map-Reduce”) of: Mining of Massive Datasets.
- Hiemstra and Hauff, “MapReduce for Information Retrieval: Let’s Quickly Test This on 12 TB of Data”, In: M. Agosti et al. (Eds.): *CLEF2010*, pp. 64-69, 2010
- Dean and Ghemawat, “MapReduce: A Flexible Data Processing Tool”, *Communications of the ACM*, January 2010.

### Group Presentations

- No required readings