Note: This is archived material from September 2013.

INFX 573: Introduction to Data Science

Winter 2013
University of Washington School of Information
Lectures: Tuesday and Thursday 1:30-3:20, MGH 420


Instructor: Prof. Joshua Blumenstock, Suite 370 Mary Gates Hall

Course Description

This course offers students a practical, hands-on introduction to the growing field of "Data Science," and common methods for quantitative and computational analytics. As "big data" become the norm in modern business and research environments, there is a growing demand for individuals who are able to derive meaningful insight from large, unruly data. This requires a heterogeneous mix of skills, from data munging and ETL; to machine learning and econometrics; to effective visualization and communication. Through a combination of data-intensive exercises and guest lectures by experts in the field, this course provides an overview of key concepts, skills, and technologies used by data scientists. Interested students must have college-level exposure to statistics and programming.

Prerequisites

A data scientist is often described as someone who knows more statistics than a computer scientist and more computer science than a statistician. Most assignments for the course will use the R programming language, and students are strongly encouraged to familiarize themselves with R prior to the first day of class. Please note that students who do not meet the requirements below will find it extremely difficult to complete the required problem sets.

Programming: Students should be able to comfortably program in a high-level programming language like Java, Python, PHP, or C/C#/C++. Note that HTML, JavaScript, and VBA are not sufficient in this context. "Comfortably" implies that students should be able to write simple programs from scratch, like a web scraper, a text parser, or a simple game of Scrabble or tic-tac-toe.

Statistics: Students should have had introductory coursework in both probability and statistics prior to enrolling in this course. At a minimum, students should have an operational understanding of hypothesis testing, statistical significance, and regression analysis.

Course Outline:

Section 1: Introduction to Data Science

Lecture 1: What is Data Science?

Lecture 2: A Crash Course in R

Section 2: Empirical Frameworks and Experimental Design

Lecture 3: Developing an Empirical Framework

Lecture 4: Business experiments

Section 3: Working with Data

Lecture 5: Exploring Data, part 1

Lecture 6: Exploring Data, part 2

Section 4: Basic analytics

Lecture 7: Distributions, t-tests, and the importance of basic statistics

Lecture 8: How far can basic statistics get you?

Lecture 9: Regression

Section 5: Advanced analytics

Lecture 10: Machine Learning 1: Introduction

Lecture 11: Machine Learning 2: Supervised Learning

Lecture 12: Machine Learning 3: Unsupervised Learning

Lecture 13: Social Network Analysis

Section 6: Visualizing and Communicating Data

Lecture 14: Visualizing Quantitative Information

Lecture 15: Communicating Results Effectively

Section 7: Scaling to terabytes and petabytes

Lecture 17: Scaling: What works and what doesn’t (and what might in the future)

Lecture 18: Common applications and tools: MapReduce, Hadoop, Hive, and Alternatives

Section 8: Perspectives from industry and academia

Lecture 19: Group Project Presentations

Lecture 20: The future of Data Science and Big Data

Detailed Syllabus

Introduction to Data Science

What is Data Science?

[Optional] Executive summary of: McKinsey Global Institute. Big data: The next frontier for innovation, competition, and productivity.
[Optional] Thomas Davenport (2006). “Competing on Analytics”, Harvard Business Review, Jan. 2006, Vol. 84 Issue 1, pp. 99-107.
[Optional] “So you want to be a Data Scientist” - Nature Blogs
[Optional] Provost & Fawcett: “Data Science and its relationship to big data and data-driven decision making”, Harvard Business Review

A Crash Course in R

Chapter 1 of Torgo, Data Mining with R [read this first]
Chapter 1 of Spector, Data Manipulation with R [follow along with all examples]
Chapter 2 of Spector, Data Manipulation with R [follow along with all examples]

Empirical Frameworks and Experimental Design

Developing an Empirical Framework

Chapter 1 of Provost & Fawcett: Data Science for Business
Chapter 2 of Provost & Fawcett: Data Science for Business
[Optional] Whom the Gods Would Destroy, They First Give Real-time Analytics
[Optional] Chapter 4 of Bernard: Social Research Methods
[Optional] Alamar and Mehrotra, Beyond Moneyball: The rapidly evolving world of sports analytics

Due: Background and interests survey

Business Experiments, A/B testing, and RCTs

Andrew Gelman: There are four ways to get fired from Caesars
INTRODUCTION (pp. 263-269) to: Bertrand M.; Karlan D.; Mullainathan S.; Shafir E.; Zinman J. (2010) “What's advertising content worth? Evidence from a consumer credit marketing field experiment” Quarterly Journal of Economics, 125(1)
Anderson & Simester (2011). “A Step-By-Step Guide to Smart Business Experiments”, Harvard Business Review, pp. 99-105
Davenport (2009). “How to Design Smart Business Experiments”, Harvard Business Review pp. 69-76.
Ariely (2004). “Why Businesses Don’t Experiment”, Harvard Business Review, p. 34
Bertrand et al. (2009). “Does Ad Content Affect Consumer Demand?” Alliance, 14:3, p.18
Bertrand M.; Karlan D.; Mullainathan S.; Shafir E.; Zinman J. (2010) “What's advertising content worth? Evidence from a consumer credit marketing field experiment” Quarterly Journal of Economics, 125(1) pp. 263-306

Working With Data

Exploring Data

Chapter 3 of Zumel & Mount, Practical Data Science with R
Getting Started with Charts in R [make sure you understand everything thoroughly]
Complete the ggplot2 tutorial on Canvas
Review other examples here.

Analytics and Inference

Distributions, t-tests, and the importance of basic statistics

Chapter 1 of Freedman, Pisani, and Purves: Statistics
Chapter 2 of Freedman, Pisani, and Purves: Statistics
Chapter 3 of: A Handbook of Statistical Analyses Using R
H. Stern: “Statistics and the College Football Championship,” The American Statistician, 2004.
Huff: How to Lie With Statistics
Watch http://www.ted.com/talks/lies_damned_lies_and_statistics_about_tedtalks.html

Due: Problem Set 1

How far can basic statistics get you? [Guest: Andres Monroy-Hernandez, Microsoft Research]

Chapters 6 & 19 of Freedman, Pisani, and Purves: Statistics
Chapters 1 & 2 of Freedman: Statistical Models: Theory and Practice
Watch http://www.gapminder.org/videos/the-joy-of-stats/
Chapters TBD of Torgo: Data Mining with R: Learning with Case Studies

Regression [Guest: Desmond Murray, Cisco]

Chapter 6 of: A Handbook of Statistical Analyses Using R

Advanced Analytics

Machine Learning 1: Introduction

Chapter 3 & Chapter 4 of Provost & Fawcett: Data Science for Business
Chapter 1 of: Hastie, Tibshirani, Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition.
Chapter 1 of: Mining of Massive Datasets.

Due: Problem Set 2

Machine Learning 2: Supervised Learning

Required Readings

Chapter 8 of The Signal and the Noise: “Less and Less and Less Wrong” (Bayes’ Theorem)

Optional Readings

C. Haruechaiyasak: “A Tutorial on Naive Bayes Classification”, 2008.
Watch this video [from 32:00]

Machine Learning 3: Unsupervised Learning

Chapter 7 [Read 7.1, 7.2, 7.3] of: Mining of Massive Datasets. Online at http://infolab.stanford.edu/~ullman/mmds/book.pdf
Logistic regression: Tutorials on the logit and odds ratios
[Optional] Haydn Shaughnessy, “How Semantic Clustering Helps Analyze Consumer Attitudes”
[Optional] Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein (1998). “Cluster analysis and display of genome-wide expression patterns” Proceedings of the National Academy of Sciences. Vol. 95 pp. 14863-14868
[Optional] Rajkumar Venkatesan (2007). “Cluster Analysis for Segmentation”, Darden Business Publishing

Due: Problem Set 3

Nov 21: Social Network Analysis

N. Godbole et al.: “Large-scale sentiment analysis for news and blogs”, International Conference on Weblogs and Social Media, 2007.
L. Page et al.: “The PageRank citation ranking: Bringing order to the web”, Stanford, 1999.
Albert R, Jeong H, Barabasi AL. (1999): “Diameter of the World Wide Web”. Nature, 401:130-131

Visualizing and Communicating Data

Visualizing Quantitative Data

Excerpts from Edward Tufte, “The Visual Display of Quantitative Information”
Review the ggplot2 tutorial on Canvas
[Optional] WSJ Guide to Information Visualization
[Optional] “How to Lie with Charts” and “How to Lie with Maps”
[Optional] Excerpts from Beautiful Data

Scaling to Terabytes and Petabytes

Storage and Organization: Databases, Scalable SQL and NoSQL

AWS tutorial
Read this MapReduce tutorial
[Optional] Read Yahoo Hadoop tutorial
[Optional] Cohen et al. (2009) “MAD Skills: New Analysis Practices for Big Data”
[Optional] Stonebraker et al. (2010). “MapReduce and Parallel DBMSs: Friends or Foes?”
[Optional] Rick Cattell, “Scalable SQL and NoSQL Data Stores,” SIGMOD Record, December 2010 (39:4)

Common Applications and tools: MapReduce, Hadoop, Hive, and Alternatives

Chapter 2 (“Large-Scale File Systems and Map-Reduce”) of: Mining of Massive Datasets.
Hiemstra and Hauff, “MapReduce for Information Retrieval: Let’s Quickly Test This on 12 TB of Data”, In: M. Agosti et al. (Eds.): CLEF 2010, pp. 64-69, 2010
Dean and Ghemawat, “MapReduce: A Flexible Data Processing Tool”, Communications of the ACM. January 2010.

Group Presentations

No required readings