Course Syllabus

Course Description:

This course is an introduction to statistical data science, using computing tools to gather, manage, and analyze large and complex data sets. Topics include data wrangling and formatting, web scraping, data analysis, statistical modeling techniques, text mining, and language processing.

Course Outline:

There are six major topics in Introduction to Data Science:

1. Data wrangling and formatting using the tidyverse set of R libraries.
2. Exploratory data analysis, including data visualization using ggplot2.
3. Data acquisition using web-scraping and APIs.
4. Statistical modeling and inference.
5. Machine learning.
6. Text mining and language processing.

We’ll work through many of these topics simultaneously. This class is all about building skills and techniques to begin your data science journey.

Location

After the first week of classes, we'll meet in person in Olin 207 on the second floor of Olin Hall.

Office Hours

In-person Office hours: TTh 11:30 am-12 pm and 2:30-3 pm (PT) in Olin 219

Zoom Office hours: MWF 10-10:30 am (PT)

Office hours Zoom link: https://whitman.zoom.us/j/2381981909

Note: Please contact me if these hours do not work for your schedule!

Preferred method of contact: email ptukhim@whitman.edu.

If I'm free, I'll respond pretty quickly, but don't wait for me; keep working on whatever prompted you to reach out.

Textbook:

The main book for this course is Modern Data Science with R by Benjamin S. Baumer, Daniel T. Kaplan, and Nicholas J. Horton.

Additional resources include:

R for Data Science by Hadley Wickham 

R Markdown: The Definitive Guide by Yihui Xie

Foundations of Data Science by Blum, Hopcroft, and Kannan

Tidyverse Skills for Data Science by Carrie Wright, Shannon E. Ellis, Stephanie C. Hicks and Roger D. Peng

Hands-On Machine Learning with R by Bradley Boehmke & Brandon Greenwell

ModernDive: Statistical Inference via Data Science by Ismay and Kim

OpenIntro Statistics, 2nd Ed. by Diez, Barr, & Cetinkaya-Rundel

Lecture Notes are based on Data Science in a Box materials

Statistical Packages:

We will use the (free) statistical package R and the RStudio interface via RStudio Workbench. You can access RStudio Workbench on any device with internet access by entering http://math.whitman.edu:8787/ in the browser. Use your MathLab account credentials to log in to RStudio Workbench. If you forgot your password, you can reset it. If you need help with that, email Dustin Palmer at palmerdl@whitman.edu.

Course Goals

By the end of the semester, we hope

  • to develop skills in the use of R for data analysis.
  • to learn common tasks associated with data, e.g. filtering, merging, sorting, pivoting.
  • to manipulate numerical, categorical, and time-like data.
  • to learn a variety of data visualizations, e.g. scatter plots, bar graphs, histograms, maps.
  • to learn techniques of exploratory data analysis, make hypotheses, and tell an evidence-based story about new data sets.
  • to learn to read the documentation and search the web in order to apply new methods and visualizations on your own.
  • to write brief reports that tell the story of the data supported by visualizations.
  • to understand how data analysis can be used and misused in the areas of public policy, decision making, and social change.

Time Commitment

Intro to Data Science is a 3-credit class. Generally speaking, you should spend about 3 hours per week in class and 6 hours per week outside of class working on assignments, presentations, and projects. It is a good idea to schedule your time outside of class and stick to that schedule.

Canvas Modules

All of the work you are expected to complete will be organized in Canvas modules. Just follow the modules and you'll be fine. You won't miss anything. What follows is a summary of the kinds of work you'll be doing in the class.

Course Assessment 

Your grade will contain the following components.

1. Reading Responses/Class Prep/In-class Discussion (15%): Individual class periods are designed with the assumption that you have read the day's material in advance. To support this, some days will have a short pre-class "Readback" form for you to fill out. Remember that it is timestamped and due the night before class. No late Readbacks will be accepted.

2. Weekly Labs (35%): Each Thursday we'll start a lab designed to explore new techniques for working with data. Most labs are designed to take about 2-3 hours to complete, so you'll most likely need to finish them outside of class. Labs are due in Canvas the following week.

3. Mini-Projects (15% each): There will be two Mini-Projects assigned during the course. Mini-Projects will be due on October 5 and November 2.

4. Final Project (including a presentation) (20%): Your Final Project will represent a complete exploration of a large-scale data project, suitable for use in a portfolio of your work. The Final Project is in lieu of a Final Exam. More details to follow.

All assignments must be readable, and when appropriate, all work must be shown to receive credit.

Late work will receive a deduction of 5 percentage points per calendar day, and no work will be accepted more than 3 calendar days after the deadline (unless other arrangements have been made before the due date). To avoid the late penalty, pay close attention to deadlines and start assignments early so that you are not trying to complete them at the last minute.
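To make the late policy concrete, here is a minimal sketch of the arithmetic in Python. It is only an illustration, not the official grading procedure. The function name late_adjusted_score is hypothetical, and the sketch assumes scores are on a 0-100 scale, lateness is counted in whole calendar days, and work that is not accepted counts as 0 (the policy above does not state this explicitly).

```python
def late_adjusted_score(raw_score, days_late):
    """Apply the late policy to a score on a 0-100 scale (illustrative only)."""
    if days_late < 0:
        raise ValueError("days_late cannot be negative")
    if days_late > 3:
        # Assumption: work that is no longer accepted is recorded as 0.
        return 0.0
    # 5 percentage points off per calendar day late, never going below 0.
    return max(0.0, raw_score - 5 * days_late)


# A lab that earned 90 and was turned in two calendar days late.
example = late_adjusted_score(90, 2)
```

Under these assumptions, a 90 submitted two days late becomes an 80, and anything submitted four or more days late receives no credit.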

You are encouraged to work together on labs and in-class activities, but all work you submit must be your own (unless the assignment specifically states otherwise). The first act of academic dishonesty will result in a score of zero on the item in question. A subsequent offense will result in an F for the course. Students should consult the Academic Honesty Procedures if they have any questions.

Course Grade 

In this class, I regard a “B” as the default grade you get for doing what is expected.

An “A” requires going above & beyond – show intellectual curiosity, strive to understand the “big ideas,” don’t stop at the recipe. 

A “C” means you pass – but barely, with serious gaps in your knowledge that you need to address.

Any grade lower than a "C" means that you do not pass the course.

Final letter grades will be determined as follows: 

Letter Grade    Weighted Score
A+              97-100
A               93-96
A-              90-92
B+              87-89
B               83-86
B-              80-82
C+              77-79
C               73-76
C-              70-72
D+              67-69
D               63-66
D-              60-62
F               0-59

Note that Canvas will display unweighted scores.
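Because Canvas shows unweighted scores, you may want to estimate your weighted score yourself. The following is a rough sketch in Python, not the official calculation. The category names, the function names, and the example scores are all hypothetical. It assumes each category score is a percentage from 0 to 100, and that a fractional weighted score falls in the band whose lower bound it meets (the syllabus does not say how fractional scores are rounded).

```python
# Category weights in percent, taken from the Course Assessment list.
WEIGHTS = {
    "reading_responses": 15,
    "weekly_labs": 35,
    "mini_project_1": 15,
    "mini_project_2": 15,
    "final_project": 20,
}

# Lower bound of each letter-grade band, highest first.
GRADE_FLOORS = [
    (97, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]


def weighted_score(scores):
    """Combine category percentages (0-100) into one weighted score."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(score):
    """Return the letter for the first band whose lower bound is met."""
    for floor, letter in GRADE_FLOORS:
        if score >= floor:
            return letter
    return "F"


# Made-up category scores for illustration.
example_scores = {
    "reading_responses": 95,
    "weekly_labs": 88,
    "mini_project_1": 90,
    "mini_project_2": 84,
    "final_project": 92,
}
example_total = weighted_score(example_scores)
example_letter = letter_grade(example_total)
```

With these made-up scores the weighted score is 89.55, which lands in the B+ band under the assumption above.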

Important Notes:

  • Any student needing accommodations should inform the instructor. Students with disabilities who may need accommodations for this class are encouraged to notify the instructor and contact the Academic Resource Center (ARC) early in the semester so that reasonable accommodations may be implemented as soon as possible. All information will remain confidential.
  • Academic dishonesty and plagiarism will result in a failing grade on the assignment. Using someone else's ideas or phrasing and representing those ideas or phrasing as our own, either on purpose or through carelessness, is a serious offense known as plagiarism. "Ideas or phrasing" includes written or spoken material, from whole papers and paragraphs to sentences and, indeed, phrases, but it also includes statistics, lab results, artwork, etc. Please see the student handbook for policies regarding plagiarism.

Tentative course schedule