Joe’s Pre-Analysis Checklist

This checklist will guide you from client discovery all the way through pre-deployment. Answer these questions and you’re on your way to an excellent deal!

Problem Statement

  1. Summarize the business use case and the value of the model to the business.
  2. What are the research questions the business is trying to answer?
  3. Has any research been done on this question in the past? If so, what were the hypotheses, findings, conclusions?
  4. What actionable hypothesis, or hypotheses, could an analysis contribute to?
  5. If we could make helpful predictions, would we care more about accuracy, or precision? (See the sketch after this list.)
  6. Is the client presently making any prediction relevant to this problem? If so, as a benchmark, how well does that prediction perform?
  7. We make no guarantees about prediction accuracy (performance), and we make no guesses before seeing the data or familiarizing ourselves with the problem. However, we can still ask:
    • Is this a case where anything better than random is an improvement?
    • Is there any existing example of such a prediction being made by your competitors? If so, what is the performance?
  8. Do we care more about prediction or causal inference?
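
To make item 5 concrete, here’s a minimal sketch of the accuracy/precision distinction, assuming scikit-learn is available; y_true and y_pred are hypothetical toy arrays, not client data:

```python
# Toy imbalanced problem: accuracy and precision can tell different stories.
from sklearn.metrics import accuracy_score, precision_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # rare positive class
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]  # two false positives, one hit

# Accuracy: share of all predictions that are correct.
print(accuracy_score(y_true, y_pred))   # 0.7
# Precision: share of predicted positives that are truly positive.
print(precision_score(y_true, y_pred))  # ~0.33
```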

Problem Framing

Drawing a DAG (directed acyclic graph) of the hypothesized causal structure at this stage is usually very helpful, and I highly recommend it!
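
Here’s a minimal sketch of encoding a hypothesized DAG in code, assuming networkx is available; the variable names are purely illustrative:

```python
# Encode the hypothesized causal structure so it can be inspected and shared.
import networkx as nx

dag = nx.DiGraph()
dag.add_edges_from([
    ("marketing_spend", "site_traffic"),  # IV -> mediator
    ("site_traffic", "revenue"),          # mediator -> DV
    ("seasonality", "marketing_spend"),   # confounder -> IV
    ("seasonality", "revenue"),           # confounder -> DV
])

assert nx.is_directed_acyclic_graph(dag)  # sanity check: it really is a DAG
print(list(dag.predecessors("revenue")))  # direct causes of the DV
```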

For each research question:

  1. What is the unit of analysis (or units of analysis, in the case of hierarchy)?
  2. What is the population we wish to make inferences about?
  3. Do we expect to observe, or use, any selection or sampling in the course of the data generating process?
  4. What is the dependent variable (DV) of interest?
  5. What is the independent variable (IV), or what are the IVs, of interest?
  6. What are other controls that might reduce omitted-variable bias?
  7. What are other controls that might reduce sampling variability?
  8. What is the theoretical or functional relationship between these variables, if any? (See the sketch after this list.)
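
As a sketch of how the answers to items 4–8 translate into a specification, here’s a hypothetical log-log regression, assuming statsmodels and pandas; the columns and data are invented for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data standing in for the client's dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "marketing_spend": rng.uniform(1, 100, 200),
    "seasonality": rng.normal(size=200),
    "region": rng.choice(["north", "south"], 200),
})
df["revenue"] = 50 * df["marketing_spend"] ** 0.5 * np.exp(rng.normal(scale=0.1, size=200))

# DV: revenue; IV: marketing_spend; controls: seasonality, region;
# hypothesized functional form: log-log (constant elasticity).
results = smf.ols(
    "np.log(revenue) ~ np.log(marketing_spend) + seasonality + C(region)",
    data=df,
).fit()
print(results.summary())
```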

Data Specification

  1. What are the major datasets needed? Group these by source. We will need to verify that each dataset is available from the client.
  2. Is there enough data to support the analysis?
    • Remember: Time series analysis requires ample intertemporal variation to capture trend, common shocks, and seasonality. Is this applicable?
  3. Do we have measurement instruments available for each DV and IV?
  4. Will we need to produce any synthetic DVs or IVs?
  5. Are there features we should engineer? How should we engineer them?
  6. Will “hammer and nail” features, built by grouping and summarizing at the unit of analysis, suffice? (See the sketch after this list.)
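
Here’s a minimal sketch of what I mean by “hammer and nail” features, assuming pandas; the event table and column names are hypothetical:

```python
# Group raw events by the unit of analysis and summarize.
import pandas as pd

events = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "order_value": [20.0, 35.0, 5.0, 12.0, 8.0, 60.0],
})

features = events.groupby("customer_id")["order_value"].agg(
    n_orders="count",
    total_spend="sum",
    avg_order="mean",
)
print(features)
```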

Data Preparation (Cleaning and Validation)

  1. Validation. Distributionally visualize, statistically summarize, and verbally describe each variable provided to us by the client (a great first cut is an automatically generated codebook; see the sketch after this list).
    • Are there missing values? Are there non-finite values?
    • Are there mixed-type variables, or is every variable consistently typed?
    • Is there sufficient (and as expected) variation in each of the variables?
    • Intertemporal (“datetime”) features: Are there any noticeable discontinuities in the distribution of counts and aggregates over time?
    • Note: In strict ML or high-dimensional cases, we instead look for anomalies and collinearities at scale.
  2. Feature Engineering. Engineer and test the features we discussed in the data specification phase.
    • Are there missing values? Are there non-finite values?
    • Are there mixed-type variables, or are all variables consistently typed?
    • Is there sufficient (and as expected) variation in each of the variables?
    • Intertemporal (“datetime”) features: Are columns populated over the entire time support? (Values should be possible for all 12 months, all 31 days, all weeks, holidays, hours, minutes, etc.)
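
Here’s a minimal sketch of a first-cut, auto-generated codebook covering the validation questions above, assuming pandas and NumPy; the DataFrame is a toy stand-in for client data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [9.99, 12.50, np.nan, 11.00, np.inf],
    "segment": ["a", "b", "b", None, "a"],
    "ts": pd.to_datetime(
        ["2024-01-05", "2024-01-06", "2024-03-01", "2024-03-02", "2024-03-03"]
    ),
})

codebook = pd.DataFrame({
    "dtype": df.dtypes.astype(str),          # flags mixed-type columns
    "n_missing": df.isna().sum(),
    "n_non_finite": df.apply(
        lambda s: np.isinf(s).sum() if pd.api.types.is_numeric_dtype(s) else 0
    ),
    "n_unique": df.nunique(),                # flags degenerate, low-variation columns
})
print(codebook)

# Datetime discontinuities: monthly counts make gaps obvious at a glance.
print(df.set_index("ts").resample("MS").size())  # note the empty February
```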

Empirical Approach

  1. Descriptive analysis.
    • Can we answer the research question using crosstabs and correlations? (See the first sketch after this list.)
    • Do we detect any unreported sampling or selection?
    • Are there any interesting descriptive trends or phenomena to highlight?
  2. Model selection and optimization strategy.
    • What modeling approach should we use?
    • Will we compare two or more models to each other?
    • Will we compare results from two or more algorithms?
    • Will the modeling libraries work at scale (is scale required)?
    • Note: Roughly 80% of the time, the client prefers linear or logistic regression for interpretability.
    • Do we require sampling or oversampling?
    • Do we require standard error correction?
    • Do we require post-stratification?
    • Do we require multilevel specification?
    • What hyperparameters are most relevant?
    • Do we care about coefficient interpretation?
    • How will we optimize hyperparameters? (See the second sketch after this list.)
    • Are there idiosyncrasies or phenomena that will bias our results, given the data and the method used?
    • Which diagnostics will we perform on our results?
    • Is our computational approach (as distinct from our statistical inference approach) feasible in a reasonable amount of time?
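
Here’s a minimal sketch of the descriptive first pass, assuming pandas; the churn data and column names are hypothetical:

```python
# Crosstabs and correlations often answer the question before any model is fit.
import pandas as pd

df = pd.DataFrame({
    "plan": ["free", "free", "paid", "paid", "paid", "free"],
    "churned": [1, 1, 0, 0, 1, 0],
    "tenure_months": [2, 1, 24, 18, 3, 12],
})

# Crosstab: churn rate by plan.
print(pd.crosstab(df["plan"], df["churned"], normalize="index"))

# Correlation: tenure vs. churn.
print(df[["tenure_months", "churned"]].corr())
```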
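
And a minimal sketch of model comparison plus hyperparameter optimization, assuming scikit-learn; the models, grid, and toy data are illustrative, not a recommendation of either approach:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Compare an interpretable baseline against a more flexible model.
for name, model in [
    ("logistic", LogisticRegression(max_iter=1000)),
    ("boosting", GradientBoostingClassifier(random_state=0)),
]:
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(name, scores.mean().round(3))

# Hyperparameter optimization for the chosen model, e.g. a small grid.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="roc_auc",
).fit(X, y)
print(grid.best_params_)
```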