Note: This is archived material from Autumn 2013.

 

IMT 589 D: Introduction to Data Science

Autumn 2013
University of Washington School of Information
Lectures: Tuesday and Thursday 8:30-10:20am, MGH 271


Instructor: Prof. Joshua Blumenstock, Suite 370 Mary Gates Hall

Course Description

This course offers students a practical, hands-on introduction to the growing field of "Data Science" and to common methods for quantitative and computational analytics. As "big data" becomes the norm in modern business and research environments, there is a growing demand for individuals who are able to derive meaningful insight from large, unruly data. This requires a heterogeneous mix of skills, from data munging and ETL, to machine learning and econometrics, to effective visualization and communication. Through a combination of data-intensive exercises and guest lectures by experts in the field, this course provides an overview of key concepts, skills, and technologies used by data scientists. Interested students must have college-level exposure to statistics and programming.

Prerequisites

Students enrolled in the course must have completed college-level coursework in both statistics and programming. Most assignments for the course will use the R programming language, and students are strongly encouraged to familiarize themselves with R prior to the first day of class. Please note that students who do not meet these requirements will find it extremely difficult to complete the required problem sets.

  • Programming: Students should be able to program comfortably in a high-level programming language like Java, Python, PHP, or C/C#/C++. Note that HTML, JavaScript, and VBA are not sufficient in this context. "Comfortably" implies that students should be able to write simple programs from scratch, like a web scraper, a text parser, or a simple game of Scrabble or tic-tac-toe.
  • Statistics: Students should have had introductory coursework in both probability and statistics prior to enrolling in this course. At a minimum, students should have an operational understanding of hypothesis testing, statistical significance, and regression analysis.

Course Outline

Introduction to data science

Sep 26: What is Data Science?

Empirical Frameworks and Experimental Design

Oct 1: Developing an Empirical Framework

Oct 3: Business Experiments, A/B testing, and RCTs

Visualizing and communicating data

Nov 14: Visualizing Quantitative Data

Grading

Course grades will be based primarily on a final group project, problem sets, and overall classroom participation. Details on each of these assignments will be provided on Canvas. Each component will be weighted as follows:

  • Group Final Project: 40%
  • Problem Set 1: 12%
  • Problem Set 2: 12%
  • Problem Set 3: 20%
  • Participation, Quizzes, Surveys: 16%
  • Extra credit: 4%
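
As a rough illustration of how the weights above combine, the following Python sketch computes a course total from per-component scores. The component keys, the function name, the handling of missing components, and the treatment of extra credit as points added on top of the 100% base are all assumptions made for illustration; they are not part of the official policy.

```python
# Weights from the grading list, expressed as fractions of the course grade.
# Extra credit sits on top of the 100% base, so the weights sum to 1.04.
WEIGHTS = {
    "final_project": 0.40,
    "problem_set_1": 0.12,
    "problem_set_2": 0.12,
    "problem_set_3": 0.20,
    "participation": 0.16,
    "extra_credit": 0.04,
}


def course_total(scores):
    """Return the weighted course total in percent.

    `scores` maps component names to scores between 0 and 100.
    Missing components count as 0 (an assumption of this sketch).
    """
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())
```

Under these assumptions, a student who scores 90 on every component except extra credit ends up with a total of about 90%, and full marks everywhere including extra credit gives about 104%.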

Grading Policy

  • All assignments are to be submitted on Canvas by 11:59pm on the due date.
  • Assignments turned in up to 24 hours after the due date will be penalized 20%.
  • Any assignments turned in more than 24 hours late will receive no credit.
  • If a student believes a mistake has been made in grading, the student may request a regrade. However, the disputed amount must exceed 5 percent of the total grade for the assignment for the regrade to be processed. If processed, the entire assignment will be regraded, not just the disputed component, so the regrade may lower the overall grade. Regrade requests must be submitted as a .pdf attachment (i.e., not as the contents of an email) with the subject "[INFX598] Regrade Request", and must be submitted within 1 week of receiving the original grade.
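
The late-submission rules above can be sketched as a small Python function. This is only an illustration: the function name is hypothetical, and it assumes that the 20% penalty is taken off the earned score (the policy does not say whether the penalty is relative or absolute) and that a submission exactly 24 hours late still falls in the penalized window.

```python
def late_adjusted_score(raw_score, hours_late):
    """Apply the late policy to a raw assignment score.

    Assumes the 20% penalty is relative to the earned score (raw * 0.8);
    the policy text does not specify this.
    """
    if hours_late <= 0:
        return raw_score          # on time: no penalty
    if hours_late <= 24:
        return raw_score * 0.8    # up to 24 hours late: 20% penalty
    return 0.0                    # more than 24 hours late: no credit
```

Under these assumptions, a raw score of 90 submitted five hours late would count as 72, and the same work submitted 25 hours late would count as 0.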

Academic Integrity Policy

Discussion with instructors and classmates is encouraged, but each student must turn in individual, original work and cite sources where appropriate.

Readings and Participation

Required and optional readings will be announced in class and posted on the course website. The goal of these readings is to deepen your knowledge of Data Science, as the topic is so broad that not everything can be covered in class. Students who come to class unprepared detract from everyone's ability to learn in an active and engaged environment. To help foster this environment, I will periodically call on random students to solicit opinions on the readings or to summarize a core concept. Students who are clearly unprepared (or who are absent) will miss this opportunity to earn full participation credit.

Students are also encouraged to participate in the class by engaging with the readings and the course content outside of class. There are myriad ways to engage: by posting summaries of optional readings, writing blog posts on a related topic, or discussing how a news article or radio show relates to the concepts raised in class. This engagement can be done in any forum, but to ensure that the other students (and the instructor) are aware of your participation, be sure to post a link to what you've done on Canvas.

Detailed Syllabus

Sep 26: What is Data Science?

Readings
Due
  • Send in an example of "Data Science" from the real world, and be prepared to present it

Oct 1: Developing an Empirical Framework

Readings
Due
  • Background and interests survey 

Oct 3: Business Experiments, A/B testing, and RCTs

Readings

Oct 8: A Crash Course in R

Readings
  • Chapter 1 of Torgo, Data Mining with R [read this first]
  • Chapter 1 of Spector, Data Manipulation with R [follow along with all examples]
  • Data Manipulation with R [follow along with all examples]

Oct 10: Exploring Data

Readings
  • Chapter 3 of Zumel & Mount, Practical Data Science with R
  • Getting Started with Charts in R [make sure you understand everything thoroughly]
  • Complete the ggplot2 tutorial on Canvas
  • Review other examples here.

Oct 15: Distributions, t-tests, and the importance of basic statistics

Readings
Due
  • Problem Set 1

Oct 17: How far can basic statistics get you? [Guest: Andres Monroy-Hernandez, Microsoft Research]

Readings
  • Chapters 6 & 19 of Freedman, Pisani, and Purvis: Statistics
  • Chapters 1 & 2 of Freedman: Statistical Models: Theory and Practice
  • Watch http://www.gapminder.org/videos/the-joy-of-stats/
  • Chapters TBD of Torgo: Data Mining with R: Learning with Case Studies

Oct 22: Regression [Guest: Desmond Murray, Cisco]

Readings
  • Chapter 6 of: A Handbook of Statistical Analyses Using R

Oct 24: Regression Continued

Readings
  • None

Oct 29: Machine Learning 1: Introduction

Readings
  • Chapters 3 & 4 of Provost & Fawcett: Data Science for Business
Due
  • Problem Set 2

Oct 31: Machine Learning 1: Introduction

Readings
Optional Readings

Nov 5 and Nov 7: Machine Learning 2: Supervised Learning

Required Readings
  • Chapter 8 of The Signal and the Noise: “Less and Less and Less Wrong” (Bayes’ Theorem)
Optional Readings

Nov 12: Machine Learning 3: Unsupervised Learning

Required Readings
Optional Readings
  • Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein (1998). “Cluster analysis and display of genome-wide expression patterns” Proceedings of the National Academy of Sciences. Vol. 95 pp. 14863-14868
  • Rajkumar Venkatesan (2007). “Cluster Analysis for Segmentation”, Darden Business Publishing
Due
  • Problem Set 3

Nov 14: Visualizing Quantitative Data

Required Readings
  • Review the ggplot2 tutorial on Canvas
Optional Readings
  • WSJ Guide to Information Visualization
  • “How to Lie with Charts” and “How to Lie with Maps”
  • Excerpts from Beautiful Data
  • Excerpts from Edward Tufte, “Visual Display of Quantitative Information”

Nov 19: Storage and Organization: Databases, Scalable SQL and NoSQL

Required Readings
Optional Readings

Nov 21: Social Network Analysis

Optional Readings
  • N. Godbole et al.: “Large-scale sentiment analysis for news and blogs”, International Conference on Weblogs and Social Media, 2007.
  • L. Page et al.: “The PageRank citation ranking: Bringing order to the web”, Stanford, 1999.
  • Albert R, Jeong H, Barabasi AL. (1999): “Diameter of the World Wide Web”. Nature, 401:130-131 

Nov 26: Common Applications and Tools: MapReduce, Hadoop, Hive, and Alternatives

Readings

Nov 28: NO CLASS (Thanksgiving)

Readings
  • None

Dec 3: Group Presentations

Readings
  • None