Note: This is archived material from January 2014.

INFX 598 B: Core Methods in Data Science

Winter 2014
University of Washington School of Information
Lectures: Monday and Wednesday 1:30-3:20pm, MGH 251

Instructor: Prof. Joshua Blumenstock, Suite 370 Mary Gates Hall

Course Description

Provides the skills required to analyze and derive insight from large-scale, heterogeneous data. Covers key concepts of functional and imperative programming for storing, extracting, analyzing, and presenting large datasets, as well as data analysis using inferential statistics and supervised and unsupervised machine learning. Students gain experience with modeling social and behavioral data.

Prerequisites

INFX 573 or permission of instructor. Students are expected to be comfortable programming in R and to have mastery of at least one higher-level programming language such as Python, PHP, Java, or C++.

Course Outline

(Re-)Introduction to Data Science

January 6: Introduction to the course

Grading

Course grades will be based primarily on a final group project, problem sets, and overall classroom participation. Details on each of these assignments will be provided on Canvas. Each component will be weighted as follows (a worked example of the computation appears after the list):

  • Problem Sets: 80%
    • Problem sets 1-3: 10% each
    • Problem sets 4-5: 25% each
  • Participation, quizzes, and mini-assignments: 20%
  • Extra credit: 3%
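
For concreteness, here is a minimal sketch in Python of how these weights combine into a single course grade. The weights come from the list above; the 0-100 score scale and the treatment of extra credit as up to 3 points added on top are illustrative assumptions, not official policy (see Canvas for the authoritative rules).

    # Minimal sketch of the weighting above. Assumes all scores are on a
    # 0-100 scale and that extra credit adds up to 3 points on top (an
    # assumption made for illustration only).
    WEIGHTS = {
        "ps1": 0.10, "ps2": 0.10, "ps3": 0.10,  # problem sets 1-3: 10% each
        "ps4": 0.25, "ps5": 0.25,               # problem sets 4-5: 25% each
        "participation": 0.20,                  # participation, quizzes, mini-assignments
    }

    def final_grade(scores, extra_credit=0.0):
        # Weighted sum of components, plus capped extra credit.
        base = sum(weight * scores[name] for name, weight in WEIGHTS.items())
        return base + min(extra_credit, 3.0)

    # Example: straight 90s plus 1 point of extra credit -> 91.0
    print(final_grade({name: 90 for name in WEIGHTS}, extra_credit=1.0))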

Grading Policy

  • All assignments are to be submitted on Canvas by 11:59pm on the due date.
  • Assignments turned in up to 24 hours after the due date will be penalized 20%.
  • Assignments turned in more than 24 hours late will receive no credit; these rules are sketched below.
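
The late rules above amount to a simple score multiplier. Here is a minimal sketch in Python; the function name and the multiplier framing are illustrative assumptions, not official course code.

    # Late policy above: full credit on time, 20% penalty up to 24 hours
    # late, no credit afterwards. Hypothetical helper for illustration only.
    def late_multiplier(hours_late: float) -> float:
        if hours_late <= 0:
            return 1.0  # on time: full credit
        if hours_late <= 24:
            return 0.8  # up to 24 hours late: 20% penalty
        return 0.0      # more than 24 hours late: no credit

    # Example: an assignment 5 hours late keeps 80% of its score.
    print(late_multiplier(5))  # prints 0.8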

Academic Integrity Policy

Discussion with instructors and classmates is encouraged, but each student must turn in individual, original work and cite sources where appropriate. Any assignment or exam that is handed in must be your own work. However, talking with one another to understand the material better is strongly encouraged. Recognizing the distinction between cheating and cooperation is very important. If you copy someone else's solution, you are cheating. If you let someone else copy your solution, you are cheating. If someone dictates a solution to you, you are cheating. Everything you hand in must be in your own words and based on your own understanding of the solution. If someone helps you understand the problem during a high-level discussion, you are not cheating. We strongly encourage students to help one another understand the material presented in class and in the book, and to discuss general issues relevant to the assignments.

Readings and Participation

Students are also encouraged to participate in the class by engaging with the readings and the course content outside of class. There are myriad ways to engage: by posting summaries of optional readings, writing blog posts on a related topic (and posting the link on Canvas), or discussing how a news article or radio show relates to the concepts raised in class. This engagement can be done in any forum, but to ensure that the other students (and the instructor) are aware of your participation, be sure to add a link to what you've done on Canvas.

Detailed Syllabus

January 6: Introduction to the course

Readings:
  • Chapter 1, Provost and Fawcett (2013) Data Science for Business.

Probability

January 8: Introduction to Probability

Readings
  • Chapter 2, Provost and Fawcett (2013) Data Science for Business.
  • Chapter 2, Ross (2012) A First Course in Probability.
Optional Readings
  • Chapter 1, DeGroot and Schervish (2002) Probability and Statistics.
  • Chapter 1, Wasserman (2004) All of Statistics.
  • Watch The Joy of Statistics

January 13: Conditional Probability

Readings
  • Chapter 3, Ross (2012) A First Course in Probability

January 15: Programming with Data

Due
  • Problem Set 1

January 20: NO CLASS - HOLIDAY - MLK Day

January 22: Random Variables and Distributions

Readings
  • Chapter 3 (pp. 93-151), DeGroot and Schervish (2002) Probability and Statistics
  • Chapter 4, DeGroot and Schervish (2002) Probability and Statistics
Optional Readings
  • Chapter 2, Wasserman (2004) All of Statistics
  • Chapter 4, Ross (2012) A First Course in Probability

Inferential Statistics

January 27: Learning from Data

Readings
  • Section on the Normal Distribution, DeGroot and Schervish (2002) Probability and Statistics
  • Chapter 6, DeGroot and Schervish (2002) Probability and Statistics

January 29: Estimation

Due
  • Problem Set 2

February 3: Testing Hypotheses and Linear Models

Readings
  • Chapter 9 (sections 9.1-9.6), DeGroot and Schervish (2002) Probability and Statistics
  • Chapter 6 of A Handbook of Statistical Analyses Using R (if you haven't already read it!)

Machine Learning

February 5: Design of Machine Learning Experiments

Readings
  • Chapter 5 of Witten, Frank, Hall: Data Mining
  • Chapters 3 and 4 of Provost & Fawcett: Data Science for Business (if you haven't already read them)
  • Chapters 4 and 5 of McKinney (2012): Python for Data Analysis [important if you're new to Python!]
Optional Readings
  • Chapter 2 of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)
  • P. Domingos, "A Few Useful Things to Know about Machine Learning," Communications of the ACM, 55 (10), 78-87, 2012.

February 10: Linear Models Revisited

Readings
  • Chapter 5 of Schutt & O'Neil (2013): Doing Data Science
  • Reread section 4.6 of Witten, Frank, Hall: Data Mining
Optional Readings
  • Chapter 3 (sections 3.1 and 3.2), Chapter 4 (especially section 4.4), and Chapter 6 (sections 6.1-6.3) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)

February 12: Nearest Neighbors and Perceptrons

Required Readings
  • Chapter 4 (especially section 4.7) of Witten, Frank, Hall: Data Mining
  • Chapter 6 of Provost & Fawcett: Data Science for Business
Optional Readings
  • Chapter 13 (sections 13.1-13.3) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)
  • Chapter 6 (sections 6.1-6.6), on kernel regression, in Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)
  • Chapter 3 of Schutt & O'Neil (2013): Doing Data Science
  • Chapter 3 of Daumé (in preparation): A Course in Machine Learning
  • Watch Pedro Domingos talk about the Curse of Dimensionality
Due
  • Problem Set 3

February 17: NO CLASS - HOLIDAY - Presidents Day

Due
  • Problem Set 4 (Optional Section)

February 19: Generative Models and Naive Bayes. Guest Speaker: Johan Ugander

Required Readings
  • Chapter 4 of Schutt & O'Neil (2013): Doing Data Science
  • Reread section 4.2 of Witten, Frank, Hall: Data Mining
  • Michael Collins's lecture notes on Naive Bayes (especially pp. 1-4)

February 24: Trees and Forests. Guest Speaker: Brendan O'Connor

Required Readings
  • Chapter 6 (section 6.1) of Witten, Frank, Hall: Data Mining
  • Chapter 8 of Witten, Frank, Hall: Data Mining
Optional Readings
  • Chapter 9 (section 9.2) and Chapter 15 of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)

February 26: Neural Networks

Required Readings
  • Chapter 7 (section 7.4) of Witten, Frank, Hall: Data Mining
  • Chapter 8 of Daumé (in preparation): A Course in Machine Learning
Optional Readings
  • Egmont-Petersen et al. (2002) "Image processing with neural networks: a review," Pattern Recognition.

Applications and Implementation

March 3: Feature Selection and Dimensionality Reduction

Readings
  • Chapter 11 (sections 11.1-11.3) of Rajaraman et al.: Mining of Massive Datasets
  • Chapter 7 (section 7.4) of Witten, Frank, Hall: Data Mining
Optional Readings
  • Chapter 14 (sections 14.2, 14.5-14.10) of Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2nd edition)
  • Turk & Pentland (1991) "Eigenfaces for Recognition"

March 5: Recommender Systems

Readings
  • Chapter 8 of Schutt & O'Neil (2013): Doing Data Science
Optional Readings
  • Chapter 9 of Rajaraman et al.: Mining of Massive Datasets
  • Yehuda Koren (2009) "The BellKor Solution to the Netflix Grand Prize"
  • Resnick et al. (1994) "GroupLens: An open architecture for collaborative filtering of netnews," CSCW '94, pp. 175-186
  • R.M. Bell and Y. Koren (2007) "Lessons from the Netflix prize challenge," ACM SIGKDD Explorations Newsletter

March 10: Recommender Systems, part 2. Guest Speaker: Chris DuBois

Readings
  • TBD

March 12: Special Topics

Due
  • Problem Set 5