# Joe’s Pre-Analysis Checklist

This checklist will guide you from client discovery all the way through pre-deployment. Answer these questions and you’re on your way to an excellent deal!

## Problem Statement

- Summarize the business use case, and the value of the model to the business.
- What are the research questions the business is trying to answer?
- Has any research been done on this question in the past? If so, what were the hypotheses, findings, conclusions?
- What actionable hypothesis, or hypotheses, could an analysis contribute to?
- If we could make helpful predictions, would we care more about accuracy, or precision?
- Is the client presently making any prediction relevant to this problem? If so, as a benchmark, how well does that prediction perform?
- We make no guarantees about prediction accuracy (performance), and make no guesses before seeing the data or familiarizing ourselves with the problem. However, we can ask:
  - Is this a case where anything better than random is an improvement?
  - Is there any existing example of such a prediction being made by your competitors? If so, how well does it perform?

- Do we care more about prediction or causal inference?
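When the accuracy-versus-precision question comes up, it helps to show the client both metrics side by side against a naive benchmark. A minimal sketch in stdlib Python; the labels and predictions are made-up illustrative data, not from any real engagement:

```python
# Compare a candidate prediction against a naive "predict everything
# positive" baseline on both accuracy and precision.
# All data below is illustrative (assumption), not client data.

def accuracy(y_true, y_pred):
    """Share of all predictions that are correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred):
    """Share of predicted positives that are truly positive."""
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == 1]
    return sum(t == 1 for t in predicted_pos) / len(predicted_pos) if predicted_pos else 0.0

y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]      # 30% positive class
baseline = [1] * len(y_true)                  # naive benchmark
candidate = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]    # hypothetical model output

print(accuracy(y_true, baseline), precision(y_true, baseline))    # 0.3 0.3
print(accuracy(y_true, candidate), precision(y_true, candidate))  # 0.9 0.75
```

The baseline's numbers are the floor any candidate model must beat; which of the two metrics matters more is exactly the question to put to the client.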

## Problem Framing

Drawing a DAG at this stage is usually very helpful and I highly recommend it!

For each research question:

- What is the unit of analysis (or units of analysis, in the case of hierarchy)?
- What is the population we wish to make inferences about?
- Do we expect to observe, or use, any selection or sampling in the course of the data generating process?
- What is the dependent variable (DV) of interest?
- What is the independent variable (IV), or what are the IVs, of interest?
- What are other controls that might reduce omitted variables bias?
- What are other controls that might reduce sampling variability?
- What is the theoretical or functional relationship between these variables, if any?
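Once the DAG is drawn, it can be encoded as a plain adjacency map so the team can mechanically read off candidate controls. A stdlib-only sketch; the variable names (season, promotion, price, sales) are illustrative assumptions, not a client's actual problem:

```python
# Encode the problem-framing DAG as an adjacency map,
# with edges pointing cause -> effect.
# Variable names are hypothetical (assumption).
dag = {
    "season":    ["promotion", "sales"],
    "promotion": ["price", "sales"],
    "price":     ["sales"],
    "sales":     [],
}

def parents(dag, node):
    """Direct causes of `node` -- candidate controls against omitted-variable bias."""
    return sorted(cause for cause, effects in dag.items() if node in effects)

print(parents(dag, "sales"))  # ['price', 'promotion', 'season']
```

Listing the parents of the DV this way makes the "what controls reduce omitted-variable bias" question concrete rather than rhetorical.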

## Data Specification

- What are the major datasets needed? Group these by source. We will need to verify that each dataset is available from the client.
- Is there enough data to support the analysis?
  - Remember: time series analysis requires ample intertemporal variation to capture trend, common shocks, and seasonality. Is this applicable?

- Do we have measurement instruments for each DV/IV available?
- Will we need to produce any synthetic DV or IVs?
- Are there features we should engineer? How should we engineer them?
- Will “hammer and nail” features suffice from grouped-and-summarized units of analysis?
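"Hammer and nail" features from grouped-and-summarized units of analysis are often just count/sum/mean per unit. A stdlib-only sketch, assuming hypothetical `customer_id` and `amount` fields:

```python
# Group transaction rows by the unit of analysis and summarize.
# Field names and values are illustrative (assumption).
from collections import defaultdict
from statistics import mean

rows = [
    {"customer_id": "a", "amount": 10.0},
    {"customer_id": "a", "amount": 30.0},
    {"customer_id": "b", "amount": 5.0},
]

def summarize_by(rows, key, value):
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row[value])
    # count / sum / mean per unit: the classic grouped-and-summarized features
    return {
        unit: {"n": len(vals), "total": sum(vals), "mean": mean(vals)}
        for unit, vals in grouped.items()
    }

features = summarize_by(rows, "customer_id", "amount")
print(features["a"])  # {'n': 2, 'total': 40.0, 'mean': 20.0}
```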

## Data Preparation (Cleaning and Validation)

- Validation. Distributionally visualize, statistically summarize, and verbally describe each variable provided to us by the client (a great first cut is an automatically generated codebook).
  - Are there missing values? Are there non-finite values?
  - Are there mixed-type variables, or are all variables statically typed?
  - Is there sufficient (and as expected) variation in each of the variables?
  - Intertemporal (“datetime”) features: Are there any noticeable discontinuities in the distribution of counts, and aggregates, over time?
    *Note: In strict ML or highly dimensional cases, we instead look for anomalies and collinearities at scale.*
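An automatically generated codebook can answer the missing-value, mixed-type, and variation questions in one pass. A stdlib-only sketch; the toy dataset is an illustrative assumption:

```python
# A minimal auto-generated codebook: per variable, report missingness,
# the mix of types, and the number of distinct values.
# The toy dataset is illustrative (assumption).
import math

data = {
    "age":   [34, 51, None, 28, 51],
    "state": ["NY", "NY", "CA", None, "CA"],
    "score": [0.2, float("nan"), 0.9, 0.4, 0.5],
}

def is_missing(v):
    return v is None or (isinstance(v, float) and math.isnan(v))

def codebook(data):
    book = {}
    for name, values in data.items():
        present = [v for v in values if not is_missing(v)]
        book[name] = {
            "n": len(values),
            "missing": sum(is_missing(v) for v in values),  # missing / non-finite
            "types": sorted({type(v).__name__ for v in present}),  # mixed types?
            "unique": len(set(present)),  # degenerate (no-variation) columns
        }
    return book

print(codebook(data)["age"])
```

A real codebook would add distributions and summaries, but even this skeleton flags the three validation questions above mechanically.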

- Feature Engineering. Engineer and test the features we discussed in the data specification phase.
  - Are there missing values? Are there non-finite values?
  - Are there mixed-type variables, or are all variables statically typed?
  - Is there sufficient (and as expected) variation in each of the variables?
  - Intertemporal (“datetime”) features: are columns available for the entire time support? (Values should be possible for all 12 months, all 31 days, weeks, holidays, hours, minutes, etc.)
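The time-support check reduces to asking which calendar values never occur. A stdlib-only sketch for months (the same pattern applies to days, weekdays, hours, etc.); the dates are illustrative:

```python
# Check that a datetime feature covers the full time support:
# which months of the year are never observed? Dates are illustrative (assumption).
from datetime import date

observed = [date(2023, m, 1) for m in range(1, 13)] + [date(2024, 1, 15)]

def missing_months(dates):
    """Months of the year never observed -- gaps in the time support."""
    return sorted(set(range(1, 13)) - {d.month for d in dates})

print(missing_months(observed))      # [] -> full support
print(missing_months(observed[:6]))  # months 7..12 never observed
```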

## Empirical Approach

- Descriptive analysis.
  - Can we answer the research question using crosstabs and correlations?
  - Do we detect any unreported sampling or selection?
  - Are there any interesting descriptive trends or phenomena to highlight?
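A crosstab and a correlation are often enough to answer a descriptive research question before any model is fit. A stdlib-only sketch; the (group, outcome) pairs are illustrative assumptions:

```python
# A quick crosstab plus a Pearson correlation, the workhorses of
# descriptive analysis. Data is illustrative (assumption).
from collections import Counter

pairs = [("control", 0), ("control", 0), ("control", 1),
         ("treated", 1), ("treated", 1), ("treated", 0)]

crosstab = Counter(pairs)  # (group, outcome) -> count
print(crosstab[("treated", 1)])  # 2

def pearson(x, y):
    """Pearson correlation, computed from first principles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    varx = sum((a - mx) ** 2 for a in x)
    vary = sum((b - my) ** 2 for b in y)
    return cov / (varx * vary) ** 0.5

x = [0, 0, 0, 1, 1, 1]  # treated indicator
y = [0, 0, 1, 1, 1, 0]  # outcome
print(round(pearson(x, y), 3))
```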

- Model selection and optimization strategy.
  - What modeling approach should we use?
  - Will we compare two or more models to each other?
  - Will we compare results from two or more algorithms?
  - Will the modeling libraries work at scale (is scale required)?
    *Note: 80% of the time, linear and logistic regression are preferred by the client, for interpretability.*
  - Do we require sampling or oversampling?
  - Do we require standard error correction?
  - Do we require post-stratification?
  - Do we require multilevel specification?
  - What hyperparameters are most relevant?
  - Do we care about coefficient interpretation?
  - How will we optimize hyperparameters?
  - Are there idiosyncrasies or phenomena that will bias our results, given the data and the method used?
  - Which diagnostics will we perform on our results?
  - Is our computational inference approach (not our *statistical inference* approach) feasible in a reasonable time period?
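The computational-feasibility question can usually be answered before committing: time one replicate of the procedure and extrapolate to the full run. A stdlib-only sketch, assuming a bootstrap as the computational inference method; the resample-and-mean step is a trivial stand-in for a real model refit:

```python
# Estimate whether a bootstrap is feasible by timing one replicate
# and extrapolating. The "model fit" here is a trivial stand-in (assumption).
import random
import time

random.seed(0)
data = [random.gauss(0, 1) for _ in range(10_000)]

def one_replicate(data):
    sample = random.choices(data, k=len(data))  # resample with replacement
    return sum(sample) / len(sample)            # stand-in for a real refit

start = time.perf_counter()
estimate = one_replicate(data)
seconds_per_rep = time.perf_counter() - start

n_reps = 2_000
projected = n_reps * seconds_per_rep
print(f"~{projected:.1f}s projected for {n_reps} bootstrap replicates")
```

If the projected runtime is hours rather than minutes, that is the moment to renegotiate the replicate count, the model, or the hardware, not after the job has started.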