Note: This is archived material from January 2014.
INFX 598 B: Core Methods in Data Science
Winter 2014
University of Washington School of Information
Lectures: Monday and Wednesday 1:30-3:20pm, MGH 251
Course Description | Prerequisites | Course Outline | Grading | Assignments | Calendar
Instructor: Prof. Joshua Blumenstock, Suite 370 Mary Gates Hall
Provides skills required to analyze and derive insight from large-scale, heterogeneous data. Covers key concepts of functional and imperative programming for storing, extracting, analyzing, and presenting large datasets, and data analysis skills using inferential statistics and supervised and unsupervised machine learning. Students gain experience with modeling social and behavioral data.
INFX 573 or permission of instructor. Students are expected to be comfortable programming in R, and to have mastery of one higher-level programming language such as Python, PHP, Java, or C++.
Course Outline:
(Re-)Introduction to Data Science
January 6: Introduction to the course
Probability
January 8: Introduction to Probability
January 13: Conditional Probability
January 15: Programming with Data
January 22: Random Variables and Distributions
Inferential Statistics
January 27: Learning from Data
January 29: Estimation
February 3: Testing Hypotheses and Linear Models
Supervised and Unsupervised Learning
February 5: Design of Machine Learning Experiments
February 10: Linear Models Revisited
February 12: Nearest Neighbors and Perceptrons
February 19: Generative models and Naive Bayes
February 24: Trees and Forests
February 26: Neural Networks
Applications and Implementation
March 3: Feature Selection and Dimensionality Reduction
March 5: Recommender Systems
March 12: Special Topics
Course grades will be based primarily on a final group project, problem sets, and overall classroom participation. Details on each of these assignments will be provided on Canvas. Each component will be weighted as follows:
- Problem Sets: 80%
  - Problem sets 1-3: 10% each
  - Problem sets 4-5: 25% each
- Participation, quizzes, and mini-assignments: 20%
- Extra credit: 3%
Grading Policy
- All assignments are to be submitted on Canvas by 11:59pm on the due date.
- Assignments turned in *up to* 24 hours after the due date will be penalized 20%.
- Any assignments turned in more than 24 hours late will receive no credit.
Academic Integrity Policy
Discussion with instructors and classmates is encouraged, but each student must turn in individual, original work and cite sources where appropriate. Any assignment or exam that is handed in must be your own work; however, talking with one another to understand the material better is strongly encouraged. Recognizing the distinction between cheating and cooperation is very important. If you copy someone else's solution, you are cheating. If you let someone else copy your solution, you are cheating. If someone dictates a solution to you, you are cheating. Everything you hand in must be in your own words, and based on your own understanding of the solution. If someone helps you understand the problem during a high-level discussion, you are not cheating. We strongly encourage students to help one another understand the material presented in class, in the book, and general issues relevant to the assignments.
Readings and Participation
Students are also encouraged to participate by engaging with the readings and course content outside of class. There are myriad ways to engage: posting summaries of optional readings, writing blog posts on a related topic (and posting the link on Canvas), or discussing how a news article or radio show relates to concepts raised in class. This engagement can take place in any forum, but to ensure that other students (and the instructor) are aware of your participation, be sure to add a link to what you've done on Canvas.
Detailed Syllabus
January 6: Introduction to the course
Readings:
- Chapter 1, Provost and Fawcett (2013) Data Science for Business.
Optional Readings:
- Press (2013) "A Very Short History of Data Science." Forbes
- Patil (2011) "Building Data Science Teams" O'Reilly Radar.
Probability
January 8: Introduction to Probability
Readings
- Chapter 2, Provost and Fawcett (2013) Data Science for Business.
- Chapter 2, Ross (2012) A First Course in Probability.
Optional Readings
- Chapter 1, DeGroot and Schervish (2002) Probability and Statistics.
- Chapter 1, Wasserman. (2004) All of Statistics.
- Watch The Joy of Statistics
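The readings above are conceptual; as a hands-on supplement (not part of the assigned material), a few lines of Python illustrate the frequentist idea that the relative frequency of an event settles toward its probability:

```python
import random

# Frequentist intuition: the fraction of heads in many fair coin flips
# converges toward 1/2 as the number of flips grows.
random.seed(0)  # fixed seed so the run is repeatable
flips = [random.random() < 0.5 for _ in range(100_000)]
freq = sum(flips) / len(flips)
print(freq)  # close to 0.5
```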
January 13: Conditional Probability
Readings
- Chapter 3, Ross (2012) A First Course in Probability
Optional Readings
- Chapter 2, DeGroot and Schervish (2002) Probability and Statistics
- Quick Intro to Counting Methods: Permutations and Combinations
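To make the counting-based definition of conditional probability concrete, here is a short Python sketch; the two-dice example is illustrative, not taken from the readings:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

# Events: B = "the first die shows 6", A = "the sum is at least 10".
B = [o for o in outcomes if o[0] == 6]
A_and_B = [o for o in B if sum(o) >= 10]

# P(A | B) = P(A and B) / P(B), computed by counting outcomes.
p_A_given_B = Fraction(len(A_and_B), len(B))
print(p_A_given_B)  # 1/2 -- given a first 6, the second die must be 4, 5, or 6
```

Enumerating the sample space and counting is exactly the method the Ross chapter formalizes; the Fraction type keeps the answer exact.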
January 15: Programming with Data
Readings
- Install IPython on your laptop and bring it to class: http://ipython.org/install.html (I recommend the Anaconda distribution, but if you want to assemble the packages yourself, make sure you have python, ipython notebook, numpy, scipy, and matplotlib)
- Read and complete the "Introduction" to the following Python tutorial: http://interactivepython.org/courselib/static/pythonds/index.html
- Chapters 4 and 5 of McKinney (2012): Python for Data Analysis.
- Watch 10-minute tour of pandas
Due
- Problem Set 1
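If you want to test your installation before class, here is a minimal example of the split-apply-combine pattern shown in the pandas tour (the toy table below is invented):

```python
import pandas as pd

# Build a small table, then split-apply-combine with groupby.
df = pd.DataFrame({
    "student": ["ana", "bo", "cy", "dee"],
    "section": ["A", "A", "B", "B"],
    "score":   [90, 80, 70, 100],
})

# Mean score per section -- one line instead of an explicit loop.
mean_by_section = df.groupby("section")["score"].mean()
print(mean_by_section["A"], mean_by_section["B"])  # 85.0 85.0
```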
January 20: NO CLASS - HOLIDAY - MLK Day
January 22: Random Variables and Distributions
Readings
- Chapter 3, Pages 93-151 DeGroot and Schervish (2002) Probability and Statistics
- Chapter 4, DeGroot and Schervish (2002) Probability and Statistics
Optional Readings
- Chapter 2, Wasserman (2004) All of Statistics
- Chapter 4, Ross (2012) A First Course in Probability
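As a short, standard-library illustration of the lecture topic (the two-dice random variable is a standard textbook example, not from the assigned pages), you can build the distribution of a discrete random variable by enumeration and compute its expectation and variance:

```python
from fractions import Fraction
from itertools import product

# X = sum of two fair dice: build its pmf by enumerating all 36 outcomes.
pmf = {}
for a, b in product(range(1, 7), repeat=2):
    pmf[a + b] = pmf.get(a + b, 0) + Fraction(1, 36)

E = sum(x * p for x, p in pmf.items())               # E[X]
Var = sum((x - E) ** 2 * p for x, p in pmf.items())  # Var(X) = E[(X - E[X])^2]

print(E, Var)  # 7 35/6
```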
Inferential Statistics
January 27: Learning from Data
Readings
- Section on Normal Distribution, DeGroot and Schervish (2002) Probability and Statistics
- Chapter 6, DeGroot and Schervish (2002) Probability and Statistics
January 29: Estimation
Readings
- Chapter 7, DeGroot and Schervish (2002) Probability and Statistics
- http://www.ics.uci.edu/~smyth/courses/cs274/papers/MLtutorial.pdf
Due
- Problem Set 2
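The linked MLE tutorial's central idea can be sketched numerically: for Gaussian data with known variance, the sample mean maximizes the likelihood. The toy data and grid search below are mine, for illustration only:

```python
import math

data = [4.9, 5.1, 5.3, 4.7, 5.0]

def log_likelihood(mu, xs, sigma=1.0):
    """Gaussian log-likelihood of the data at mean mu."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (x - mu) ** 2 / (2 * sigma**2) for x in xs)

# Search candidate means 4.00..6.00; the maximizer is the sample mean.
grid = [i / 100 for i in range(400, 601)]
mle = max(grid, key=lambda mu: log_likelihood(mu, data))
sample_mean = sum(data) / len(data)

print(mle, sample_mean)  # both are 5.0
```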
February 3: Testing Hypotheses and Linear Models
Readings
- Chapter 9, Section 9.1-9.6 DeGroot and Schervish (2002) Probability and Statistics
- Chapter 6 of: A Handbook of Statistical Analyses Using R (If you haven't already read it!)
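As a bridge between the two readings, here is simple linear regression from its closed-form formulas (slope = cov(x, y)/var(x), intercept = mean(y) - slope * mean(x)); the toy data, chosen to lie exactly on y = 2x + 1, is invented:

```python
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# Ordinary least squares for one predictor, via the closed-form solution.
slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
        / sum((x - xbar) ** 2 for x in xs)
intercept = ybar - slope * xbar

print(slope, intercept)  # 2.0 1.0
```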
Supervised and Unsupervised Learning
February 5: Design of Machine Learning Experiments
Readings
- Chapter 5 of Witten, Frank, Hall: Data Mining
- Chapters 3 and 4 of Provost & Fawcett: Data Science for Business (if you haven't already read them)
- Chapters 4 and 5 of McKinney (2012): Python for Data Analysis [Important if you're new to python!]
Optional Readings
- Chapter 2 of Hastie, Tibshirani, Friedman: The Elements of Statistical Learning (2nd edition)
- P. Domingos (2012) "A Few Useful Things to Know about Machine Learning", Communications of the ACM, 55 (10), 78-87.
February 10: Linear Models Revisited
Readings
- Chapter 5 of Schutt & O'Neil (2013): Doing Data Science
- Reread section 4.6 of Witten, Frank, Hall: Data Mining
Optional Readings
- Chapter 3 (sections 3.1 and 3.2), Chapter 4 (especially section 4.4), and Chapter 6 (sections 6.1-6.3) of Hastie, Tibshirani, Friedman: The Elements of Statistical Learning (2nd edition)
February 12: Nearest Neighbors and Perceptrons
Required Readings
- Chapter 4 (especially section 4.7) of Witten, Frank, Hall: Data Mining
- Chapter 6 of Provost & Fawcett: Data Science for Business
Optional Readings
- Chapter 13 (sections 13.1-13.3) of Hastie, Tibshirani, Friedman: The Elements of Statistical Learning (2nd edition)
- Chapter 6 (sections 6.1-6.6) on Kernel Regression in Hastie, Tibshirani, Friedman: The Elements of Statistical Learning (2nd edition)
- Chapter 3 of Schutt & O'Neil (2013): Doing Data Science
- Chapter 3 of Daume (in preparation): A Course in Machine Learning
- Watch Pedro Domingos talk about the Curse of Dimensionality
Due
- Problem Set 3
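The nearest-neighbor idea from this lecture fits in a few lines; this 1-NN sketch with invented 2-D points is illustrative, not an assignment solution:

```python
import math

# Labeled training points in 2-D: two "blue" near the origin, two "red"
# near (1, 1). The data is a toy example.
train = [((0.0, 0.0), "blue"), ((0.1, 0.2), "blue"),
         ((1.0, 1.0), "red"),  ((0.9, 1.1), "red")]

def predict(point):
    """1-nearest-neighbor: label of the closest training point."""
    return min(train, key=lambda t: math.dist(point, t[0]))[1]

print(predict((0.2, 0.1)))  # blue
print(predict((1.2, 0.9)))  # red
```

Generalizing to k > 1 means taking a majority vote over the k closest points, which is where the bias-variance tradeoff from the readings comes in.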
February 17: HOLIDAY
Due
- Problem Set 4 (Optional Section)
February 19: Generative models and Naive Bayes. Guest Speaker: Johan Ugander
Required Readings
- Chapter 4 of Schutt & O'Neil (2013): Doing Data Science
- Reread section 4.2 of Witten, Frank, Hall: Data Mining
- Michael Collins's lecture notes on Naive Bayes (especially pp. 1-4)
Optional Readings
- Paul Graham (2002) "Better Bayesian Filtering". http://www.paulgraham.com/better.html
- Kanich, Chris, et al. (2008) "Spamalytics: An empirical analysis of spam marketing conversion" ACM Conference on Computer and Communications Security.
- Kevin Murphy's example of Bayes' Rule for medical diagnosis
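Bayes' rule for diagnosis, in the spirit of the Kevin Murphy example linked above (the specific sensitivity, specificity, and prevalence numbers below are illustrative, not necessarily his):

```python
from fractions import Fraction

# A test with 99% sensitivity and 95% specificity for a disease with
# 1% prevalence.
prior = Fraction(1, 100)   # P(disease)
sens = Fraction(99, 100)   # P(positive | disease)
spec = Fraction(95, 100)   # P(negative | no disease)

# P(positive) by the law of total probability.
p_pos = sens * prior + (1 - spec) * (1 - prior)

# P(disease | positive) by Bayes' rule.
posterior = sens * prior / p_pos
print(posterior)  # 1/6 -- even a positive test leaves P(disease) low
```

The same prior-times-likelihood structure, applied feature by feature under an independence assumption, is Naive Bayes.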
February 24: Trees and Forests. Guest Speaker: Brendan O'Connor
Required Readings
- Chapter 6 (section 6.1) of Witten, Frank, Hall: Data Mining
- Chapter 8 of Witten, Frank, Hall: Data Mining
Optional Readings
- Chapter 9 (section 9.2) and Chapter 15 of Hastie, Tibshirani, Friedman: The Elements of Statistical Learning (2nd edition)
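Decision-tree learners choose splits by information gain, the criterion discussed in the Witten et al. reading; a minimal sketch with invented labels:

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

parent = ["yes"] * 4 + ["no"] * 4      # 4/4 split: 1 bit of entropy
left, right = ["yes"] * 4, ["no"] * 4  # a perfect split on some feature

# Information gain = parent entropy minus the weighted child entropies.
gain = entropy(parent) - (len(left) / len(parent) * entropy(left)
                          + len(right) / len(parent) * entropy(right))
print(gain)  # 1.0 -- this split removes all uncertainty
```

A random forest repeats this greedy splitting on bootstrapped samples and random feature subsets, then averages the resulting trees.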
February 26: Neural Networks
Required Readings
- Chapter 7 (section 7.4) of Witten, Frank, Hall: Data Mining
- Chapter 8 of Daume (in preparation): A Course in Machine Learning
Optional Readings
- Egmont-Petersen et al. (2002). "Image processing with neural networks: A review" Pattern recognition.
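A classic warm-up for this lecture (not from the readings): a hand-wired network with one hidden layer computes XOR, which no single perceptron can represent. The weights below are chosen by hand for illustration:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def xor_net(x1, x2):
    """Forward pass: two sigmoid hidden units, one sigmoid output."""
    h1 = sigmoid(20 * x1 + 20 * x2 - 10)   # behaves like OR
    h2 = sigmoid(-20 * x1 - 20 * x2 + 30)  # behaves like NAND
    out = sigmoid(20 * h1 + 20 * h2 - 30)  # AND of the two
    return round(out)

print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# [0, 1, 1, 0]
```

Training replaces these hand-picked weights with values found by gradient descent via backpropagation.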
Applications and Implementation
March 3: Feature Selection and Dimensionality Reduction
Readings
- Chapter 11 (sections 11.1-11.3) of Rajaraman et al.: Mining of Massive Datasets
- Chapter 7 (section 7.4) of Witten, Frank, Hall: Data Mining
Optional Readings
- Chapter 14 (sections 14.2, 14.5-14.10) of Hastie, Tibshirani, Friedman: The Elements of Statistical Learning (2nd edition)
- Turk & Pentland (1991) "Eigenfaces for Recognition"
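The Eigenfaces paper is principal component analysis applied to images; here is the same computation on a tiny invented 2-D dataset, assuming numpy is installed:

```python
import numpy as np

# Toy data lying near the line y = x, so the first principal component
# should point roughly along the diagonal.
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]])
Xc = X - X.mean(axis=0)  # center each column

# SVD of the centered data: rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]  # first principal component (unit vector)

print(pc1)  # up to sign, close to [0.71, 0.71]
```

Projecting onto the first few components (faces onto "eigenfaces") is what reduces dimensionality while keeping most of the variance.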
March 5: Recommender Systems
Readings
- Chapter 8 of Schutt & O'Neil (2013): Doing Data Science
Optional Readings
- Chapter 9 of Rajaraman et al.: Mining of Massive Datasets
- Yehuda Koren (2009) "The BellKor Solution to the Netflix Grand Prize"
- Resnick et al. (1994) "GroupLens: An open architecture for collaborative filtering of netnews", CSCW '94, pp. 175-186
- RM Bell, Y Koren (2007) "Lessons from the Netflix prize challenge", ACM SIGKDD Explorations Newsletter
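The GroupLens idea (recommend based on users whose past ratings track yours) can be sketched with cosine similarity; the tiny ratings matrix below is invented:

```python
import math

# Each user's movie ratings on a 1-5 scale (toy data).
ratings = {
    "ann": {"matrix": 5, "titanic": 1, "alien": 4},
    "bob": {"matrix": 4, "titanic": 2, "alien": 5},
    "cat": {"matrix": 1, "titanic": 5, "alien": 2},
}

def cosine(u, v):
    """Cosine similarity over the items two users have both rated."""
    shared = set(u) & set(v)
    num = sum(u[i] * v[i] for i in shared)
    den = math.sqrt(sum(u[i] ** 2 for i in shared)) * \
          math.sqrt(sum(v[i] ** 2 for i in shared))
    return num / den

# To predict a movie ann hasn't seen, consult her most similar user.
nearest = max(["bob", "cat"], key=lambda o: cosine(ratings["ann"], ratings[o]))
print(nearest)  # bob -- his tastes track ann's far more closely than cat's
```

Production systems (e.g. the BellKor papers above) replace this neighbor lookup with matrix factorization, but the similarity intuition is the same.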
March 10: Recommender Systems, part 2. Guest Speaker Chris DuBois
Readings
- TBD
March 12: Special Topics
Due
- Problem Set 5