Interpretable Machine Learning using SHAP — theory and applications


SHAP is an increasingly popular method for interpretable machine learning. This article breaks down the theory behind Shapley additive explanations and illustrates it with a few practical examples.

By Khalil Zlaoui

Complex machine learning algorithms such as XGBoost have become increasingly popular for prediction problems. Traditionally, there has been a trade-off between accuracy and interpretability, and simple models such as linear regression are sometimes preferred for the sake of transparency.

But SHAP (SHapley Additive exPlanations) is one of several approaches that are changing this trend, and we are progressively moving towards models that are complex, accurate and interpretable.

This article is heavily based on Scott Lundberg’s talk, which I mention in the sources below. I will focus on the theory of SHAP, then move to some applications. Because code and tutorials are abundant, I will link a few in the sources section.

The SHAP approach is to explain the complexity of a machine learning model in small pieces: we start by explaining individual predictions, one at a time.

This is important to keep in mind: We are explaining the contributions of each feature to an individual predicted value. In a linear regression, we know the contribution of each feature X by construction (X*Beta), and those contributions sum to explain the outcome Y, leading to an intuitive interpretation of the model (See Figure 1).

Figure 1: Contributions in a linear model where the X variables predict an outcome Y, image by author
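To make the linear case concrete, here is a minimal sketch (with made-up coefficients and data) showing that the per-feature contributions X*Beta, plus the intercept, add up exactly to the prediction:

```python
import numpy as np

# Made-up coefficients and one made-up observation, purely for illustration
beta0 = 0.5                        # intercept
beta = np.array([2.0, -1.0, 0.3])  # coefficients for X1, X2, X3
x = np.array([1.0, 3.0, 2.0])      # one observation

contributions = x * beta           # per-feature contributions, x_j * beta_j
prediction = beta0 + contributions.sum()

print(contributions)               # [ 2.  -3.   0.6]
print(prediction)                  # 0.5 + 2.0 - 3.0 + 0.6 ≈ 0.1
```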

The idea of SHAP is to find attribution values Phi (one per feature) that assign credit for the prediction, just like in a linear model:

Figure 2: Contributions in a machine learning model where the X variables predict an outcome Y, image by author

So how do we get the Phis? (Note in Figure 2 that each Phi depends on both the data X and the machine learning model f.)

For illustration purposes, let’s imagine that we are using an XGBoost algorithm to predict Y, the probability of having a future hospitalization, based on two variables X1 and X2:

X1: The patient had a hospitalization in the past year
X2: The patient visited the emergency room (ER) in the past year

and that from this algorithm, we get an average predicted probability (of a future hospitalization) of 5%.
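As a concrete (and entirely made-up) version of this setup, the sketch below fits an XGBoost classifier on synthetic data with the two binary features above. The feature names, sample size and probabilities are arbitrary choices, so the fitted averages will only roughly mimic the 5% / 20% / 27% figures used in the text.

```python
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

# Made-up synthetic data: X1 and X2 are binary indicators (past hospitalization,
# past ER visit) and y is a future hospitalization.
rng = np.random.default_rng(0)
n = 10_000
X = pd.DataFrame({
    "X1_past_hosp": rng.integers(0, 2, n),
    "X2_past_er": rng.integers(0, 2, n),
})
# True outcome probability rises with each flag, plus an interaction term
p = (0.02
     + 0.10 * X["X1_past_hosp"]
     + 0.04 * X["X2_past_er"]
     + 0.05 * X["X1_past_hosp"] * X["X2_past_er"])
y = rng.binomial(1, p)

model = XGBClassifier(n_estimators=50, max_depth=2)
model.fit(X, y)

# The average predicted probability plays the role of the 5% baseline in the text
print(model.predict_proba(X)[:, 1].mean())
```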

With SHAP, we are trying to explain an individual prediction. So let’s take the example of patient A and try to explain their probability of having a hospitalization. Let’s imagine that according to our machine learning model this probability is 27%. We can ask ourselves: how did we deviate from the average model prediction (5%), to get to patient A’s prediction (27%)?

Figure 3: Predicted probability of a hospital admission for Patient A, image by author

One way to obtain the Phi's (contributions to the predicted probability) for patient A is to look at the expected value of f(X) conditional on the observed feature values. We know that patient A had both a hospitalization and an ER visit in the past year, so X1 and X2 are both equal to 1.

We start by calculating E(f(X) | X1 = 1), which in our example turns out to be 20%, and then E(f(X) | X1 = 1, X2 = 1), which is patient A's predicted probability of 27%.

Figure 4: Predicted probability of a hospital admission for patient A, broken down by feature contribution, image by author

In this example we see from Figure 4 that Phi1 (the contribution of the variable X1) is:
20% - 5% = 15%

And Phi2 (the contribution of the variable X2) is:
27% - 20% = 7%
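Assuming the toy model and data frame X from the earlier sketch (an illustrative stand-in, not the article's actual model), this sequential attribution can be reproduced directly from conditional means of the model's predictions:

```python
# Reusing `model` and `X` from the toy sketch above (an assumption for illustration)
pred = model.predict_proba(X)[:, 1]
x1 = X["X1_past_hosp"].to_numpy()
x2 = X["X2_past_er"].to_numpy()

baseline = pred.mean()                        # E(f(X)): plays the role of the 5%
e_x1 = pred[x1 == 1].mean()                   # E(f(X) | X1 = 1): plays the role of the 20%
e_x1_x2 = pred[(x1 == 1) & (x2 == 1)].mean()  # E(f(X) | X1 = 1, X2 = 1): the 27%

phi1 = e_x1 - baseline                        # credit given to X1 under this ordering
phi2 = e_x1_x2 - e_x1                         # credit given to X2 under this ordering
print(phi1, phi2, baseline + phi1 + phi2)     # the sum telescopes back to patient A's prediction
```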

But we can't just stop here. It turns out that getting the Phi's is not that simple, because the ordering matters (which variable do we condition on first?). In the presence of interactions, the contribution of the interaction to the final prediction goes to the last variable in the ordering.

For example, if there is an interaction between our variables X1 and X2 such that the final predicted probability is higher when they are both equal to 1, the contribution of the interaction will be absorbed by (put simply, added to the contribution of) the second-ordered variable X2, since it enters through the second term, E(f(X) | X1 = 1, X2 = 1) - E(f(X) | X1 = 1).

Let's visualize this by reversing the order of the conditioning. So this time, we condition on X2 first instead of X1.

Figure 5: Predicted probability of a hospital admission for Patient A, broken down by feature contribution, image by author

Now, we see from Figure 5 that Phi1 (the contribution of the variable X1) is:
27% - 10% = 17% (earlier, we found that this was 15%)

And Phi2 (the contribution of the variable X2) is:
10% - 5% = 5% (earlier, we found that this was 7%)

So by approaching this problem sequentially, we get unstable values for Phi due to potential interactions, depending on which variable we start to condition on. And so the question becomes: is there a good way to allocate contributions among a set of inputs without having to think about the ordering?

The solution comes from game theory, and specifically from the Shapley values introduced by Lloyd Shapley in the 1950s.

In this context, players (features) play a game (an observation) together to win a prize (a prediction). Some players contribute more than others to winning the game, and players also interact with each other (interactions) along the way. So how do we fairly divide the prize?

Shapley set up a handful of assumptions defining what a fair division of the prize should look like, and showed that they lead to a unique solution: the "Shapley values". In theory, Shapley values are simple to compute: each feature's value is its marginal contribution averaged over all N! possible orderings (where N is the number of features). But in practice this is computationally expensive, so the SHAP authors found faster ways to compute Shapley values for particular classes of models (e.g. trees).
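Here is a minimal sketch of that definition, using a made-up value function that encodes the toy numbers from above (5% baseline, 20% for X1 alone, 10% for X2 alone, 27% for both): each feature's Shapley value is its marginal contribution averaged over every possible ordering.

```python
import itertools

# Value function of the "game": expected prediction given a coalition of known features.
# The numbers mirror the toy example (baseline 5%, X1 alone 20%, X2 alone 10%, both 27%).
V = {frozenset(): 0.05,
     frozenset({"X1"}): 0.20,
     frozenset({"X2"}): 0.10,
     frozenset({"X1", "X2"}): 0.27}

features = ["X1", "X2"]
orderings = list(itertools.permutations(features))
phi = {f: 0.0 for f in features}

for order in orderings:
    seen = frozenset()
    for f in order:
        # Marginal contribution of f when added after the features already "seen"
        phi[f] += V[seen | {f}] - V[seen]
        seen = seen | {f}

phi = {f: value / len(orderings) for f, value in phi.items()}
print(phi)   # ≈ {'X1': 0.16, 'X2': 0.06}
```

With two features there are only 2! = 2 orderings, so the Shapley values simply average the 15%/7% and 17%/5% splits we computed by hand; with more features, the factorial growth is what makes exact computation expensive.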

Before we move on to some illustrations, let's note a major caveat:

If the features are not independent, the Phi's will not be completely accurate

One of the main assumptions of the model is feature independence, which helps with the computation of the conditional expectations. Let's see how.
We take the example of E(f(X) | X1 = 1), which can be approximated by:

(a) Fixing the feature of interest to its value (in our example, X1=1)
(b) Randomly sampling values from other features (in our example, randomly sampling values from the variable X2)
(c) Feeding the synthetic observations (X1,X2) generated from (a) and (b) into the model f(X) to get predictions
(d) Taking the average of the predictions

These steps would in theory approximate Ε(f(X)|X1=1), but clearly step (b) breaks if there is high correlation between the features: by fixing X1 in (a), and randomly sampling X2 in (b), we break the correlation between X1 and X2.
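As a sketch of steps (a) through (d), again reusing the toy model and data from earlier (an assumption), we can fix X1 = 1, sample X2 from its marginal distribution, and average the model's predictions. Under the independence assumption this approximates E(f(X) | X1 = 1); with correlated features it would not.

```python
import numpy as np
import pandas as pd

# Reusing `model` and `X` from the toy sketch above (an assumption for illustration)
rng = np.random.default_rng(1)
n_samples = 5_000

synthetic = pd.DataFrame({
    "X1_past_hosp": np.ones(n_samples, dtype=int),                    # (a) fix X1 = 1
    "X2_past_er": rng.choice(X["X2_past_er"].to_numpy(), n_samples),  # (b) sample X2 from its marginal
})
preds = model.predict_proba(synthetic)[:, 1]                          # (c) feed synthetic rows to the model
print(preds.mean())                                                   # (d) average ≈ E(f(X) | X1 = 1)
```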

Now that we have covered some theory, let’s see what we can do with SHAP in practice!

(a) SHAP gives feature contributions for each individual observation

We discussed how SHAP was primarily focused on estimating individual contributions. A force plot summarizes how each feature contributes to an individual prediction.

In the example below, researchers are using a black-box model to predict the risk of hypoxemia (low blood oxygen) in an operating room. Features in red are shown to have a positive (increased) impact on the predicted odds, while those in green are shown to have a negative (reduced) impact on the predicted odds.

Figure 6: Example of a force plot, source: Scott Lundberg’s presentation listed below

Figure 6 displays the force plot for a particular patient (remember, we are explaining one observation at a time!). We can see that their low tidal volume (the amount of air moving in or out of the lungs) contributed to an increase in the predicted risk of hypoxemia (by about 0.5).

These relationships are not necessarily causal. Feature impact can be due to correlation.
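To produce this kind of plot with the shap library, a TreeExplainer can be paired with the toy XGBoost model from earlier (an illustrative stand-in, not the hypoxemia model of Figure 6). Note that for XGBoost classifiers the SHAP values are, by default, expressed in the model's log-odds output rather than probabilities.

```python
import shap

# Reusing `model` and `X` from the toy sketch above (an assumption for illustration)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)       # one row of SHAP values per observation

i = 0                                        # explain a single observation, e.g. the first one
shap.force_plot(explainer.expected_value,    # the baseline (average model output)
                shap_values[i, :],
                X.iloc[i, :],
                matplotlib=True)             # static matplotlib rendering instead of the JS widget
```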

(b) SHAP gives global explanations and feature importance

Local explanations as described in (a) can be put together to get a global explanation. And because of the axiomatic assumptions of SHAP, it turns out global SHAP explanations can be more reliable than other measures such as the Gini index.

SHAP can give better feature selection power

In the example below, researchers are predicting mortality risk based on a collection of baseline variables. Figure 7 shows the global feature importance. A straightforward illustration here is that age is the most predictive variable for death.

Figure 7: Example of feature importance, source: Scott Lundberg’s presentation listed below

The issue with global feature importance is that prevalence gets mixed with magnitude. This means that:

Rare, high-magnitude effects will not show up in the feature importance plot.

Let's illustrate with Scott Lundberg's example. 'Blood protein' in Figure 7 is the least important feature globally. But when we look at the local explanation summary (Figure 8 below, right panel), we see that high blood protein levels are very predictive of mortality for some people.

So although this variable appears less important than, say, age for predicting mortality risk on average, for a few people it is highly predictive of it: SHAP values thus reveal heterogeneity in the population.

Figure 8: Example of a local explanation summary, source: Scott Lundberg’s presentation listed below
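Continuing with the explainer and SHAP values from the sketch above (again an assumption), the global importance bar chart of Figure 7 and the local explanation summary ("beeswarm") of Figure 8 each correspond to a single call to shap.summary_plot:

```python
# Global feature importance: mean(|SHAP value|) per feature, as in Figure 7
shap.summary_plot(shap_values, X, plot_type="bar")

# Local explanation summary: one dot per observation and feature, as in Figure 8
shap.summary_plot(shap_values, X)
```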

(c) SHAP reveals heterogeneity and interactions

We can use SHAP values to further understand the sources of heterogeneity. One way to do this is to use a SHAP partial dependence plot (Figure 9). Partial dependence plots display SHAP values against a specific feature, and color the observations according to another feature. In this example, SHAP values are plotted against systolic blood pressure, and observations are colored according to their age.

Let’s look at the local explanation summary plot again (Figure 8, right panel). We can see that higher systolic blood pressure is associated with higher mortality risk. From the partial dependence plot below in Figure 9, we can further say: high systolic blood pressure is associated with higher mortality risk, especially if it onsets at a younger age.

Figure 9: Example of a SHAP partial dependence plot, source: Scott Lundberg’s presentation listed below
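A plot like Figure 9 is a single call to shap.dependence_plot. On the mortality data it would look something like the commented line below (the 'systolic_bp' and 'age' column names are hypothetical); on the two-feature toy data from earlier we can instead color the X1 dependence by X2:

```python
# Hypothetical call for the mortality example of Figure 9 (column names are made up):
# shap.dependence_plot("systolic_bp", shap_values, X, interaction_index="age")

# On the toy data, plot SHAP values for X1 and color each point by X2
shap.dependence_plot("X1_past_hosp", shap_values, X,
                     interaction_index="X2_past_er")
```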

To conclude this article, note that in addition to the aforementioned, SHAP values have many more applications. One worth mentioning is their usefulness in monitoring machine learning models after they are deployed. For example, SHAP values can reveal that some variables are suddenly contributing to more loss in the machine learning model due to anomalies in the source data. For more insights, I would encourage anyone to watch Scott Lundberg’s presentation linked in the sources below.

Future areas of research according to the author include interpretability in presence of correlated features, and incorporating causal assumptions into the Shapley explanations.

Sources:
(1) Scott Lundberg's talk: https://youtu.be/B-c8tIgchu0
(2) SHAP documentation: https://shap.readthedocs.io/en/latest/index.html
(3) Christoph Molnar, Interpretable Machine Learning, chapter on SHAP: https://christophm.github.io/interpretable-ml-book/shap.html
(4) Interpretable Machine Learning with XGBoost: https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27
(5) A Novel Approach to Feature Importance: Shapley Additive Explanations: https://towardsdatascience.com/a-novel-approach-to-feature-importance-shapley-additive-explanations-d18af30fc21b
(6) Lundberg and Lee (2017), A Unified Approach to Interpreting Model Predictions: https://arxiv.org/abs/1705.07874
