Note: This is archived material from Autumn 2013.

IMT 589 D: Introduction to Data Science

Autumn 2013
University of Washington School of Information
Lectures: Tuesday and Thursday 8:30-10:20am, MGH 271

Instructor: Prof. Joshua Blumenstock, Suite 370 Mary Gates Hall

Course Description

This course offers students a practical, hands-on introduction to the growing field of "Data Science," and common methods for quantitative and computational analytics. As "big data" become the norm in modern business and research environments, there is a growing demand for individuals who are able to derive meaningful insight from large, unruly data. This requires a heterogeneous mix of skills, from data munging and ETL; to machine learning and econometrics; to effective visualization and communication. Through a combination of data-intensive exercises and guest lectures by experts in the field, this course provides an overview of key concepts, skills, and technologies used by data scientists. Interested students must have college-level exposure to statistics and programming.

Prerequisites

Students enrolled in the course must have completed college-level coursework in both statistics and programming. Most assignments for the course will use the R programming language, and students are highly encouraged to familiarize themselves with R prior to the first day of class. Please note, students who do not meet these requirements will find it extremely difficult to successfully complete required problem sets.

Programming: Students should be able to comfortably program in a high-level programming language like Java, Python, PHP, or C/C#/C++. Note that HTML, JavaScript, and VBA are not sufficient in this context. "Comfortably" implies that students should be able to write simple programs from scratch, like a web scraper, a text parser, or a simple game of Scrabble or tic-tac-toe.
Statistics: Students should have had introductory coursework in both probability and statistics prior to enrolling in this course. At a minimum, students should have an operational understanding of hypothesis testing, statistical significance, and regression analysis.

Course Outline:

Introduction to data science

Sep 26: What is Data Science?

Empirical Frameworks and Experimental Design

Oct 1: Developing an Empirical Framework

Oct 3: Business Experiments, A-B testing, and RCTs

Working with data

Oct 8: A Crash Course in R

Oct 10: Exploring Data

Basic analytics and inference

Oct 15: Distributions, t-tests, and the importance of basic statistics

Oct 17: How far can basic statistics get you?

Oct 22: Regression

Oct 24: Regression Continued

Machine learning and pattern recognition

Oct 29: Machine Learning 1: Introduction

Oct 31: Machine Learning 1: Introduction

Nov 5: Machine Learning 2: Supervised Learning

Nov 7: Machine Learning 2: Supervised Learning

Nov 12: Machine Learning 3: Unsupervised Learning

Visualizing and communicating data

Nov 14: Visualizing Quantitative Data

Storage and scaling

Nov 19: Storage and Organization: Databases, Scalable SQL and NoSQL

Nov 21: Social Network Analysis

Nov 26: Common Applications and tools: MapReduce, Hadoop, Hive, and Alternatives

Grading

Course grades will be based primarily on a final group project, problem sets, and overall classroom participation. Details on each of these assignments will be provided on Canvas. Each component will be weighted as follows:

Group Final Project: 40%
Problem Set 1: 12%
Problem Set 2: 12%
Problem Set 3: 20%
Participation, Quizzes, Surveys: 16%
Extra credit: 4%
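
As a rough illustration of how these weights combine, the following Python sketch computes a course total from component scores. The component names, the `course_total` function, the example scores, and the treatment of extra credit as points added on top of the 100% base are all assumptions made for this example, not an official grading formula.

```python
# Weights from the list above, in percent of the course grade.
WEIGHTS = {
    "final_project": 40,
    "problem_set_1": 12,
    "problem_set_2": 12,
    "problem_set_3": 20,
    "participation": 16,
}
EXTRA_CREDIT_MAX = 4  # assumption: extra credit is added on top of the 100% base


def course_total(scores, extra_credit=0.0):
    """Return the course total in percent.

    scores: dict mapping each component name to a fraction in [0, 1].
    extra_credit: fraction in [0, 1] of the available extra credit earned.
    """
    base = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return base + EXTRA_CREDIT_MAX * extra_credit


# Hypothetical student scores, used only to exercise the function.
example = {
    "final_project": 0.90,
    "problem_set_1": 1.00,
    "problem_set_2": 0.75,
    "problem_set_3": 0.80,
    "participation": 1.00,
}
print(round(course_total(example, extra_credit=0.5), 2))
```

The five required components sum to 100%, so under this reading the extra credit can raise a total above what the required work alone would earn.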

Grading Policy

All assignments are to be submitted on Canvas by 11:59pm on the due date.
Assignments turned in *up to* 24 hours after the due date will be penalized 20%.
Any assignments turned in more than 24 hours late will receive no credit.
If a student believes a mistake has been made in grading, the student has the option to request a regrade. However, the dispute must exceed 5 percent of the total grade for the assignment for the regrade to be processed. If processed, the entire assignment will be regraded, not just the disputed component, so there is a possibility that the net result will be negative. Regrade requests must be submitted as a .pdf attachment (i.e., not as the contents of an email) with the subject "[INFX598] Regrade Request", and must be submitted within 1 week of receiving the original grade.
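
The late and regrade rules above can be sketched in Python as follows. The function names are hypothetical, and two readings are assumed because the policy text does not settle them: the 20% late penalty is taken as a fraction of the raw score, and the regrade threshold is read as "disputed points strictly greater than 5% of the assignment's total points."

```python
def late_adjusted_score(raw_score, hours_late):
    """Apply the late policy to a raw score.

    Assumption: the 20% penalty is a fraction of the raw score.
    """
    if hours_late <= 0:
        return raw_score        # on time: no penalty
    if hours_late <= 24:
        return raw_score * 0.8  # up to 24 hours late: 20% penalty
    return 0.0                  # more than 24 hours late: no credit


def regrade_eligible(disputed_points, assignment_total):
    """Return True if the dispute exceeds 5% of the assignment's total points."""
    return disputed_points > 0.05 * assignment_total


print(late_adjusted_score(90, 5))
print(regrade_eligible(4, 100))
```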

Academic Integrity Policy

Discussion with instructors and classmates is encouraged, but each student must turn in individual, original work and cite sources where appropriate.

Readings and Participation

Required and optional readings will be announced in class and posted on the course website. The goal of these readings is to deepen your knowledge of Data Science, as the topic is so broad that not everything can be covered in class. Students who come to class unprepared detract from everyone's ability to learn in an active and engaged environment. To help foster this environment, I will periodically call on random students to solicit opinions of the readings or to summarize a core concept. Students who are clearly unprepared (or who are absent) will miss this opportunity to earn full participation credit.

Students are also encouraged to participate in the class by engaging with the readings and the course content outside of class. There are myriad ways to engage: by posting summaries of optional readings, writing blog posts on a related topic (and posting the link on Canvas), or discussing how a news article or radio show relates to the concepts raised in class. This engagement can be done in any forum, but to ensure that the other students (and the instructor) are aware of your participation, be sure to add a link to what you've done on Canvas.

Detailed Syllabus

Sep 26: What is Data Science?

Readings:

Executive summary of: McKinsey Global Institute. Big data: The next frontier for innovation, competition, and productivity.
Thomas Davenport (2006). “Competing on Analytics”, Harvard Business Review, Jan. 2006, Vol. 84 Issue 1, pp. 99-107.
“So you want to be a Data Scientist” - Nature Blogs
Provost & Fawcett: “Data Science and its relationship to big data and data-driven decision making”, Harvard Business Review

Due

Send in an example of "Data Science" from the real world, and be prepared to present it

Oct 1: Developing an Empirical Framework

Readings

Chapter 1 of Provost & Fawcett: Data Science for Business
Chapter 2 of Provost & Fawcett: Data Science for Business
Whom the Gods Would Destroy, They First Give Real-time Analytics
Chapter 4 of Bernard: Social Research Methods
Alamar and Mehrotra, “Beyond ‘Moneyball’: The rapidly evolving world of sports analytics” Online at http://www.analytics-magazine.org/special-articles/391-beyond-moneyball-the-rapidly-evolving-world-of-sports-analytics-part-i

Due

Background and interests survey

Oct 3: Business Experiments, A-B testing, and RCTs

Readings

Andrew Gelman: “There are four ways to get fired from Caesars”
INTRODUCTION (pp. 263-269) to: Bertrand M.; Karlan D.; Mullainathan S.; Shafir E.; Zinman J. (2010) “What's advertising content worth? Evidence from a consumer credit marketing field experiment” Quarterly Journal of Economics, 125(1) pp. 263-269
Anderson & Simester (2011). “A Step-By-Step Guide to Smart Business Experiments”, Harvard Business Review, pp. 99-105
Davenport (2009). “How to Design Smart Business Experiments,” Harvard Business Review pp. 69-76.
Ariely (2004). “Why Businesses Don’t Experiment”, Harvard Business Review, p. 34
Bertrand et al. (2009). “Does Ad Content Affect Consumer Demand?” Alliance, 14:3, p.18
Bertrand M.; Karlan D.; Mullainathan S.; Shafir E.; Zinman J. (2010) “What's advertising content worth? Evidence from a consumer credit marketing field experiment” Quarterly Journal of Economics, 125(1) pp. 263-306

Oct 8: A Crash Course in R

Readings

Chapter 1 of Torgo, Data Mining with R [read this first]
Chapter 1 of Spector, Data Manipulation with R [follow along with all examples]

Oct 10: Exploring Data

Readings

Chapter 3 of Zumel & Mount, Practical Data Science with R
Getting Started with Charts in R [make sure you understand everything thoroughly]
Complete the ggplot2 tutorial on Canvas
Review other examples here.

Oct 15: Distributions, t-tests, and the importance of basic statistics

Readings

Chapter 1 of Freedman, Pisani, and Purves: Statistics
Chapter 3 of: A Handbook of Statistical Analyses Using R
H. Stern: “Statistics and the College Football Championship,” The American Statistician, 2004.
Huff: How to Lie With Statistics
Watch http://www.ted.com/talks/lies_damned_lies_and_statistics_about_tedtalks.html

Due

Problem Set 1

Oct 17: How far can basic statistics get you? [Guest: Andres Monroy-Hernandez, Microsoft Research]

Readings

Chapters 6 & 19 of Freedman, Pisani, and Purves: Statistics
Chapters 1 & 2 of Freedman: Statistical Models: Theory and Practice
Watch http://www.gapminder.org/videos/the-joy-of-stats/
Chapters TBD of Torgo: Data Mining with R: Learning with Case Studies

Oct 22: Regression [Guest: Desmond Murray, Cisco]

Readings

Chapter 6 of: A Handbook of Statistical Analyses Using R

Oct 24: Regression Continued

Readings

None

Oct 29: Machine Learning 1: Introduction

Readings

Chapter 3 & Chapter 4 of Provost & Fawcett: Data Science for Business

Due

Problem Set 2

Oct 31: Machine Learning 1: Introduction

Readings

Chapter 1 of: Hastie, Tibshirani, Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition.
Chapter 1 of: Mining of Massive Datasets.

Optional Readings

Haydn Shaughnessy, “How Semantic Clustering Helps Analyze Consumer Attitudes”

Nov 5 and Nov 7: Machine Learning 2: Supervised Learning

Required Readings

Chapter 8 of The Signal and the Noise: “Less and Less and Less Wrong” (Bayes’ Theorem)

Optional Readings

C. Haruechaiyasak: “A Tutorial on Naive Bayes Classification”, 2008.
Watch http://www.youtube.com/watch?v=UzxYlbK2c7E [from 32:00]

Nov 12: Machine Learning 3: Unsupervised Learning

Required Readings

Chapter 7 [Read 7.1, 7.2, 7.3] of: Mining of Massive Datasets. Online at http://infolab.stanford.edu/~ullman/mmds/book.pdf
Logistic regression: http://www.ats.ucla.edu/stat/r/dae/logit.htm and http://www.ats.ucla.edu/stat/mult_pkg/faq/general/odds_ratio.htm

Optional Readings

Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein (1998). “Cluster analysis and display of genome-wide expression patterns” Proceedings of the National Academy of Sciences. Vol. 95 pp. 14863-14868
Rajkumar Venkatesan (2007). “Cluster Analysis for Segmentation”, Darden Business Publishing

Due

Problem Set 3

Nov 14: Visualizing Quantitative Data

Required Readings

Review the ggplot2 tutorial on Canvas

Optional Readings

WSJ Guide to Information Visualization
“How to Lie with Charts” and “How to Lie with Maps”
Excerpts from Beautiful Data
Excerpts from Edward Tufte, “Visual Display of Quantitative Information”

Nov 19: Storage and Organization: Databases, Scalable SQL and NoSQL

Required Readings

AWS tutorial
Read this MapReduce tutorial

Optional Readings

Read Yahoo Hadoop tutorial
Cohen et al. (2009) “MAD Skills: New Analysis Practices for Big Data”
Stonebraker et al (2010). “MapReduce and Parallel DBMS’s: Friends or Foes?”
Rick Cattell, “Scalable SQL and NoSQL Data Stores”, SIGMOD Record, December 2010 (39:4)

Nov 21: Social Network Analysis

Optional Readings

N. Godbole et al.: “Large-scale sentiment analysis for news and blogs”, International Conference on Weblogs and Social Media, 2007.
L. Page et al.: “The PageRank citation ranking: Bringing order to the web”, Stanford, 1999.
Albert R, Jeong H, Barabasi AL. (1999): “Diameter of the World Wide Web”. Nature, 401:130-131

Nov 26: Common Applications and tools: MapReduce, Hadoop, Hive, and Alternatives

Readings

Chapter 2 (“Large-Scale File Systems and Map-Reduce”) of: Mining of Massive Datasets.
Hiemstra and Hauff, “MapReduce for Information Retrieval: Let’s Quickly Test This on 12 TB of Data”, In: M. Agosti et al. (Eds.): CLEF2010, pp. 64-69, 2010
Dean and Ghemawat, “MapReduce: A Flexible Data Processing Tool”, Communications of the ACM. January 2010.


Nov 28: NO CLASS (Thanksgiving)

Dec 3: Group Presentations