Class Project

Introduction

This project asks teams of 4 or 5 to work with a customer and their data to answer questions and make predictions that are important to the customer. The project must include sound statistical analysis, hypothesis testing, and model development.

The type of analysis used and the modeling approach applied will depend on the goals of the customer. For instance, if a customer wanted to know whether either of two front displays in a store affected the probability that a potential customer would leave without purchasing anything, hypothesis testing could be used to evaluate the null hypothesis that there is no difference between the displays. On its own, however, this test won’t predict the actual probability. Therefore, if the customer also wanted an estimate of the likelihood that a newcomer will leave without making a purchase, a model could be learned through logistic regression.
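As a rough illustration of the store-display example, the sketch below first runs a two-proportion z-test on the null hypothesis of no difference, then fits a one-feature logistic regression to estimate the probability that a visitor leaves without purchasing. The visitor counts are invented for illustration, and the function names (two_proportion_z_test, fit_logistic, predict_leave) are hypothetical helpers rather than part of any library. In a real project you would more likely use a statistics package than hand-written fitting code.

```python
import math

# Hypothetical counts (made up for illustration): visitors who saw each
# display and how many of them left without purchasing anything.
LEFT_A, SEEN_A = 120, 400   # display A
LEFT_B, SEEN_B = 90, 400    # display B


def two_proportion_z_test(left_a, n_a, left_b, n_b):
    """Two-sided z-test of H0: both displays have the same leave rate."""
    p_a, p_b = left_a / n_a, left_b / n_b
    pooled = (left_a + left_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value


def fit_logistic(groups, iterations=25):
    """Fit P(leave) = sigmoid(b0 + b1 * x) by Newton's method.

    groups is a list of (x, leavers, visitors) tuples, where x is 0 for
    display A and 1 for display B.
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = i00 = i01 = i11 = 0.0
        for x, y, n in groups:
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            w = n * p * (1 - p)
            g0 += y - n * p            # gradient w.r.t. intercept
            g1 += x * (y - n * p)      # gradient w.r.t. display coefficient
            i00 += w                   # entries of the information matrix
            i01 += x * w
            i11 += x * x * w
        det = i00 * i11 - i01 * i01
        b0 += (i11 * g0 - i01 * g1) / det
        b1 += (i00 * g1 - i01 * g0) / det
    return b0, b1


def predict_leave(b0, b1, x):
    """Estimated probability that a visitor shown display x leaves."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))


z, p_value = two_proportion_z_test(LEFT_A, SEEN_A, LEFT_B, SEEN_B)
b0, b1 = fit_logistic([(0, LEFT_A, SEEN_A), (1, LEFT_B, SEEN_B)])
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print(f"P(leave | display A) = {predict_leave(b0, b1, 0):.3f}")
print(f"P(leave | display B) = {predict_leave(b0, b1, 1):.3f}")
```

The test answers only whether the displays differ. The fitted model supplies the probability estimate the customer would need for a new visitor. With a single binary feature the fitted probabilities simply reproduce the observed leave rates, so a real model would include additional predictors.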

At a minimum, each project must attempt to apply hypothesis testing, multivariate logistic regression, multivariate linear regression, and one of the higher-complexity models that will be discussed in the coming two weeks.

Deliverables

The project requires each team to self-manage their progress through four phases.

  1. Exploratory data analysis and hypothesis investigation (~1 week)
  2. Development and follow-up investigation (~1 week)
  3. Result cleanup and extraction (~1/2 week)
  4. Communication (~1/2 week)

Each team must submit the following documentation at the culmination of the project:

  • A well-organized Jupyter notebook that documents the work from phases 1 and 2. This should be cleaned up in phase 3. It should include both code and text blocks that clearly explain the hypothesis being tested or the model being developed. It should also have text blocks that clearly interpret and justify any statistical analysis and model training/selection results.
  • A two-page report that summarizes the questions that were investigated and the predictions that were attempted on behalf of the customer, along with all results that were obtained. This document should provide enough information for readers to understand the size of any effects and their significance. It should present results in figures and tables rather than only listing values in the body text. The document should be two-column and abide by the AAAI format. It is available here. I recommend using Overleaf to prepare it, though this is not required. There’s a tutorial on Overleaf linked on the resources page.
  • A live presentation in front of the customer and the class.

Purpose

The purpose of this project is to experience the process of understanding a customer’s goals, investigating and developing with data, and communicating the results to non-technical stakeholders.

Given this goal, the project will be graded accordingly. For instance, a project that identifies an appropriate set of questions and predictions and addresses these in an appropriate and statistically sound manner with proper investigation will be graded very favorably. A project that trains an interesting model without addressing the customer’s questions or goals will not.

To get maximum credit, address the customer’s goals substantively and in a statistically sound manner, present those results in a notebook that is well organized and documented, and communicate them effectively in both written and verbal form.

Due Date

The project is due on April 27th with presentations on the 28th and 30th.