Note: This is archived material from September 2013.
INFX 573: Introduction to Data Science
Winter 2013
University of Washington School of Information
Lectures: Tuesday and Thursday 1:30-3:20, MGH 420
Instructor: Prof. Joshua Blumenstock, Suite 370 Mary Gates Hall
This course offers students a practical, hands-on introduction to the growing field of "Data Science," and common methods for quantitative and computational analytics. As "big data" become the norm in modern business and research environments, there is a growing demand for individuals who are able to derive meaningful insight from large, unruly data. This requires a heterogeneous mix of skills, from data munging and ETL; to machine learning and econometrics; to effective visualization and communication. Through a combination of data-intensive exercises and guest lectures by experts in the field, this course provides an overview of key concepts, skills, and technologies used by data scientists. Interested students must have college-level exposure to statistics and programming.
A data scientist is often referred to as someone who knows more statistics than a computer scientist and more computer science than a statistician. Most assignments for the course will use the R programming language, and students are highly encouraged to familiarize themselves with R prior to the first day of class. Please note, students who do not meet these requirements will find it extremely difficult to successfully complete required problem sets.
Programming: Students should be able to program comfortably in a high-level programming language such as Java, Python, PHP, or C/C#/C++. Note that HTML, JavaScript, and VBA are not sufficient in this context. "Comfortably" implies that students should be able to write simple programs from scratch, such as a web scraper, a text parser, or a simple game of Scrabble or tic-tac-toe.
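As a rough gauge of the expected level, a text parser of the kind mentioned above might look like the following Python sketch. The function name `word_counts`, the tokenizing rule, and the sample sentence are illustrative assumptions, not course materials.

```python
import re
from collections import Counter


def word_counts(text):
    """Return a Counter mapping each lowercase word in `text` to its frequency."""
    # Treat runs of letters and apostrophes as words; everything else separates them.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


if __name__ == "__main__":
    # Made-up sample text, used only to exercise the function.
    sample = "The quick brown fox jumps over the lazy dog. The dog sleeps."
    for word, count in word_counts(sample).most_common(3):
        print(word, count)
```

A student who can write and explain a program like this without reference material is likely to meet the programming requirement.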
Statistics: Students should have had introductory coursework in both probability and statistics prior to enrolling in this course. At a minimum, students should have an operational understanding of hypothesis testing, statistical significance, and regression analysis.
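To illustrate what an operational understanding of hypothesis testing involves, the sketch below computes a two-sample t statistic by hand using only the Python standard library. It uses Welch's form of the statistic, which does not assume equal variances; that choice, the function name `welch_t`, and the numbers are assumptions made for illustration and are not taken from the course.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic for the difference in means of two samples."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # statistics.variance uses the n - 1 (sample) denominator.
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    # Standard error of the difference in means, without pooling the variances.
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


if __name__ == "__main__":
    # Made-up measurements for two hypothetical groups.
    control = [12.1, 11.8, 12.4, 12.0, 11.9]
    treatment = [12.6, 12.9, 12.3, 12.8, 12.7]
    print(round(welch_t(treatment, control), 2))
```

Students should be able to explain what the resulting statistic measures and how it relates to a p-value and a significance threshold.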
Course Outline:
Section 1: Introduction to Data Science
Lecture 1: What is Data Science?
Lecture 2: A Crash Course in R
Section 2: Empirical Frameworks and Experimental Design
Lecture 3: Developing an Empirical Framework
Lecture 4: Business experiments
Section 3: Working with Data
Lecture 5: Exploring Data, part 1
Lecture 6: Exploring Data, part 2
Section 4: Basic analytics
Lecture 7: Distributions, t-tests, and the importance of basic statistics
Lecture 8: How far can basic statistics get you?
Lecture 9: Regression
Section 5: Advanced analytics
Lecture 10: Machine Learning 1: Introduction
Lecture 11: Machine Learning 2: Supervised Learning
Lecture 12: Machine Learning 3: Unsupervised Learning
Lecture 13: Social Network Analysis
Section 6: Visualizing and Communicating Data
Lecture 14: Visualizing Quantitative Information
Lecture 15: Communicating Results Effectively
Section 7: Scaling to terabytes and petabytes
Lecture 17: Scaling: What works and what doesn’t (and what might in the future)
Lecture 18: Common applications and tools: MapReduce, Hadoop, Hive, and Alternatives
Section 8: Perspectives from industry and academia
Lecture 19: Group Project Presentations
Lecture 20: The future of Data Science and Big Data
Detailed Syllabus
Introduction to Data Science
What is Data Science?
- [Optional] Executive summary of: McKinsey Global Institute. Big data: The next frontier for innovation, competition, and productivity.
- [Optional] Thomas Davenport (2006). “Competing on Analytics”, Harvard Business Review, Jan. 2006, Vol. 84 Issue 1, pp. 99-107.
- [Optional] “So you want to be a Data Scientist” - Nature Blogs
- [Optional] Provost & Fawcett: “Data Science and its relationship to big data and data-driven decision making”, Harvard Business Review
A Crash Course in R
- Chapter 1 of Torgo, Data Mining with R [read this first]
- Chapter 1 of Spector, Data Manipulation with R [follow along with all examples]
- Chapter 2 of Spector, Data Manipulation with R [follow along with all examples]
Empirical Frameworks and Experimental Design
Developing an Empirical Framework
- Chapter 1 of Provost & Fawcett: Data Science for Business
- Chapter 2 of Provost & Fawcett: Data Science for Business
- [optional] Whom the Gods Would Destroy, They First Give Real-time Analytics
- [optional] Chapter 4 of Bernard: Social Research Methods
- [optional] Alamar and Mehrotra, Beyond Moneyball: The rapidly evolving world of sports analytics
Due: Background and interests survey
Business Experiments, A/B Testing, and RCTs
- Andrew Gelman: There are four ways to get fired from Caesars
- Introduction (pp. 263-269) to: Bertrand M.; Karlan D.; Mullainathan S.; Shafir E.; Zinman J. (2010) “What's advertising content worth? Evidence from a consumer credit marketing field experiment” Quarterly Journal of Economics, 125(1) pp. 263-306
- Anderson & Simester (2011). “A Step-By-Step Guide to Smart Business Experiments”, Harvard Business Review, pp. 99-105
- Davenport (2009). “How to Design Smart Business Experiments”, Harvard Business Review pp. 69-76.
- Ariely (2004). “Why Businesses Don’t Experiment”, Harvard Business Review, p. 34
- Bertrand et al. (2009). “Does Ad Content Affect Consumer Demand?” Alliance, 14:3, p.18
- Bertrand M.; Karlan D.; Mullainathan S.; Shafir E.; Zinman J. (2010) “What's advertising content worth? Evidence from a consumer credit marketing field experiment” Quarterly Journal of Economics, 125(1) pp. 263-306
Working With Data
Exploring Data
- Chapter 3 of Zumel & Mount, Practical Data Science with R
- Getting Started with Charts in R [make sure you understand everything thoroughly]
- Complete the ggplot2 tutorial on Canvas
- Review other examples here.
Analytics and Inference
Distributions, t-tests, and the importance of basic statistics
- Chapter 1 of Freedman, Pisani, and Purves: Statistics
- Chapter 2 of Freedman, Pisani, and Purves: Statistics
- Chapter 3 of: A Handbook of Statistical Analyses Using R
- H. Stern: “Statistics and the College Football Championship,” The American Statistician, 2004.
- Huff: How to Lie With Statistics
- Watch http://www.ted.com/talks/lies_damned_lies_and_statistics_about_tedtalks.html
Due: Problem Set 1
How far can basic statistics get you? [Guest: Andres Monroy-Hernandez, Microsoft Research]
- Chapters 6 & 19 of Freedman, Pisani, and Purves: Statistics
- Chapters 1 & 2 of Freedman: Statistical Models: Theory and Practice
- Watch http://www.gapminder.org/videos/the-joy-of-stats/
- Chapters TBD of Torgo: Data Mining with R: Learning with Case Studies
Regression [Guest: Desmond Murray, Cisco]
- Chapter 6 of: A Handbook of Statistical Analyses Using R
Advanced Analytics
Machine Learning 1: Introduction
- Chapter 3 & Chapter 4 of Provost & Fawcett: Data Science for Business
- Chapter 1 of: Hastie, Tibshirani, Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition.
- Chapter 1 of: Mining of Massive Datasets.
Due: Problem Set 2
Machine Learning 2: Supervised Learning
Required Readings
- Chapter 8 of The Signal and the Noise: “Less and Less and Less Wrong” (Bayes’ Theorem)
Optional Readings
- C. Haruechaiyasak: “A Tutorial on Naive Bayes Classification”, 2008.
- Watch this video [from 32:00]
Machine Learning 3: Unsupervised Learning
- Chapter 7 [Read 7.1, 7.2, 7.3] of: Mining of Massive Datasets. Online at http://infolab.stanford.edu/~ullman/mmds/book.pdf
- Logistic regression: Tutorials on the logit and odds ratios
- [Optional] Haydn Shaughnessy, “How Semantic Clustering Helps Analyze Consumer Attitudes”
- [Optional:] Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein (1998). “Cluster analysis and display of genome-wide expression patterns” Proceedings of the National Academy of Sciences. Vol. 95 pp. 14863-14868
- [Optional:] Rajkumar Venkatesan (2007). “Cluster Analysis for Segmentation”, Darden Business Publishing
Due: Problem Set 3
Nov 21: Social Network Analysis
- N. Godbole et al.: “Large-scale sentiment analysis for news and blogs”, International Conference on Weblogs and Social Media, 2007.
- L. Page et al.: “The PageRank citation ranking: Bringing order to the web”, Stanford, 1999.
- Albert R, Jeong H, Barabasi AL. (1999): “Diameter of the World Wide Web”. Nature, 401:130-131
Visualizing and Communicating Data
Visualizing Quantitative Data
- Excerpts from Edward Tufte, “The Visual Display of Quantitative Information”
- Review the ggplot2 tutorial on Canvas
- [Optional:] WSJ Guide to Information Visualization
- [Optional:] “How to Lie with Charts” and “How to Lie with Maps”
- [Optional:] Excerpts from Beautiful Data
Scaling to Terabytes and Petabytes
Storage and Organization: Databases, Scalable SQL and NoSQL
- AWS tutorial
- Read this MapReduce tutorial
- [Optional:] Read Yahoo Hadoop tutorial
- [Optional:] Cohen et al. (2009) “MAD Skills: New Analysis Practices for Big Data”
- [Optional:] Stonebraker et al. (2010). “MapReduce and Parallel DBMSs: Friends or Foes?”
- [Optional:] Rick Cattell, “Scalable SQL and NoSQL Data Stores,” SIGMOD Record, December 2010 (39:4)
Common Applications and tools: MapReduce, Hadoop, Hive, and Alternatives
- Chapter 2 (“Large-Scale File Systems and Map-Reduce”) of: Mining of Massive Datasets.
- Hiemstra and Hauff, “MapReduce for Information Retrieval: Let’s Quickly Test This on 12 TB of Data”, In: M. Agosti et al. (Eds.): CLEF2010, pp. 64-69, 2010
- Dean and Ghemawat, “MapReduce: A Flexible Data Processing Tool”, Communications of the ACM. January 2010.
Group Presentations
- No required readings