Why black boxes need opening, what a “good explanation” even is, and the three great families of methods — feature attributions, rule lists, and counterfactuals — plus the open problems that haunt them.
Modern models make consequential decisions about people, yet we often cannot say why they decided as they did. Before we build any machinery, we should be clear about what we are even trying to fix — and why we should care.
Imagine a bank refuses your loan, a hospital algorithm decides you are low-priority, or a model flags your photo as a security risk. In each case a number comes out, a decision follows, and a human is affected. The natural question — “why?” — turns out to be surprisingly hard to answer for the most powerful models we have. This week is about that question.
Supervised learning algorithms are increasingly used in a variety of high-stakes domains, from credit scoring to medical diagnosis. That is the good news: they work well enough that we are willing to let them touch life-altering decisions.
The bad news, on the next slide: many such methods are opaque, in that humans cannot understand the reasoning behind particular predictions. A deep net or a gradient-boosted ensemble may have millions of parameters interacting non-linearly; the mapping from input to output is computable but not comprehensible. This mismatch — high stakes plus low transparency — is the black-box problem, and it is the engine that drives the whole field of XAI.
A model is a black box when you can feed it inputs and read off outputs, but you cannot follow the internal reasoning that links the two in a way a human can understand. Note this is about human comprehension, not secrecy — even a fully open-source neural network can be a black box.
Watson gives three reasons we want explanations. Memorise these — they are the cleanest possible exam target, and each comes with a concrete example on the following slides.
1. To audit — to catch the model behaving badly (e.g. discriminating).
2. To validate — to check the model is right for the right reasons, not by luck or a shortcut.
3. To discover — to learn something new about the world from a model that predicts well.
A handy gloss: audit looks for harm, validate looks for cheating, discover looks for knowledge.
Example: an algorithm used by a major private healthcare provider exhibits systematic racial bias. (This is the famous Obermeyer et al. study: the model used health-care cost as a proxy for health need. Because less money is historically spent on Black patients at the same level of illness, the algorithm systematically underestimated their need — at any given risk score, Black patients had more active chronic conditions than White patients.) Without an explanation that surfaces which feature is driving the score, this bias hides in plain sight.
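The proxy mechanism can be sketched in a few lines. This is a toy illustration, not the study's data or method: the group names, the spending figures in `SPEND_PER_CONDITION`, and the helpers `make_patients`, `risk_score` and `mean_conditions_by_group` are all hypothetical. The audit idea is simply to compare actual need across groups among patients who receive similar scores.

```python
from collections import defaultdict

# Hypothetical toy assumption: each chronic condition generates less
# spending for group "B" than for group "A" (figures are made up).
SPEND_PER_CONDITION = {"A": 1000, "B": 700}

def make_patients():
    """Build a synthetic cohort: 0 to 10 chronic conditions in each group."""
    patients = []
    for group, rate in SPEND_PER_CONDITION.items():
        for conditions in range(11):
            patients.append({"group": group,
                             "conditions": conditions,
                             "cost": conditions * rate})
    return patients

def risk_score(patient):
    """Stand-in for the model: it predicts cost and reports that as risk."""
    return patient["cost"]

def mean_conditions_by_group(patients, lo, hi):
    """Mean number of conditions per group among patients scored in [lo, hi)."""
    buckets = defaultdict(list)
    for p in patients:
        if lo <= risk_score(p) < hi:
            buckets[p["group"]].append(p["conditions"])
    return {g: sum(v) / len(v) for g, v in buckets.items()}

audit = mean_conditions_by_group(make_patients(), 2000, 5000)
print(audit)
```

In this toy cohort, patients in the same score band have more conditions on average in group "B" than in group "A", which is the signature the audit looks for.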
Example: an image classifier is trained to distinguish huskies from wolves. However, the training set contains snow in the background for all and only the wolf images. The model achieves great accuracy — by detecting snow, not wolf. It is right for entirely the wrong reason. An explanation (here, a saliency map highlighting the background) reveals the shortcut. This is the textbook example of why high test accuracy alone does not validate a model.
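A minimal sketch of the same failure, using a one-feature decision stump in place of an image classifier (a deliberate simplification, not the model from the original study). The eight-row datasets and the feature names `wolf_like_shape` and `snow_background` are hypothetical. The learner picks whichever single feature best matches the training labels, and inspecting that choice plays the role of the explanation.

```python
FEATURES = ["wolf_like_shape", "snow_background"]

def accuracy(feature, rows, labels):
    """Fraction of rows where the chosen binary feature equals the label."""
    return sum(r[feature] == y for r, y in zip(rows, labels)) / len(rows)

def fit_stump(rows, labels):
    """Return the index of the feature with the best training accuracy."""
    return max(range(len(rows[0])), key=lambda j: accuracy(j, rows, labels))

# Toy training set: label 1 = wolf, 0 = husky.
# Snow appears for all and only the wolves; the shape cue is right 6 times in 8.
train_X = [(1, 1), (1, 1), (1, 1), (0, 1), (0, 0), (0, 0), (0, 0), (1, 0)]
train_y = [1, 1, 1, 1, 0, 0, 0, 0]

# Toy deployment set: wolves photographed without snow, huskies on snow.
test_X = [(1, 0), (1, 0), (1, 0), (0, 0), (0, 1), (0, 1), (0, 1), (1, 1)]
test_y = [1, 1, 1, 1, 0, 0, 0, 0]

chosen = fit_stump(train_X, train_y)
train_acc = accuracy(chosen, train_X, train_y)
test_acc = accuracy(chosen, test_X, test_y)
shape_test_acc = accuracy(0, test_X, test_y)

print("model relies on:", FEATURES[chosen])
print("train accuracy:", train_acc, "deployment accuracy:", test_acc)
```

The stump is perfect on the training set because it keys on snow, and it collapses once snow and label come apart, while the less accurate shape cue would have transferred. Reading off which feature the model uses exposes the shortcut before deployment does.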
Example: biologists use a high-performing mortality model to generate hypotheses about biomarkers and disease mechanism. Here the model is not the suspect (audit) nor the student (validate) but the oracle: if it predicts mortality well, the features it leans on — and their interactions — may point to real biology worth investigating in the lab. The slide shows a SHAP interaction plot between white-blood-cell count and blood-urea-nitrogen, exactly the kind of structure a scientist might turn into a testable hypothesis.
“To discover” means generating hypotheses, not confirming causes. A feature attribution tells you what the model relies on; whether that reflects a real causal mechanism in the world is a separate, much harder question. Don’t write that XAI “proves” a biomarker causes mortality.
The most directly relevant motive is validation: checking the model is right for the right reasons. High accuracy alone settles nothing, which is exactly what the husky/wolf example warns about: a model can hit 99% by exploiting a spurious shortcut. A real-world parallel: pneumonia classifiers have been caught keying on hospital metadata (e.g. a portable-scanner token burned into the image at hospitals with sicker patients) rather than lung pathology. An explanation (saliency map) could reveal the model is attending to a corner tag, not the lungs. Auditing (for demographic bias) is a secondary motive.
The black-box problem is that high-stakes supervised models are opaque — humans can read their inputs and outputs but cannot understand the reasoning behind particular predictions. Open-sourcing the weights does not help, because comprehensibility is a cognitive property: a human staring at a million interacting parameters still cannot trace why this input produced that output. Transparency of code is not the same as transparency of reasoning.
Before we can build explanations, we should ask what an explanation is. Philosophy of science has wrestled with this for a century. Watson walks through four positions — each fixes the previous one’s flaw, and each fails in an instructive way.
It is tempting to skip the philosophy and get to the algorithms, but the exam can and does test this. The four big questions on slide 11 frame everything: What do good explanations look like? What are the proper units of explanation? Do all successful explanations share some underlying form? Do we have non-circular success criteria for explanations? Keep that last one in mind — “non-circular success criteria” is the thread that the whole section, and the conclusion, keeps tugging.
The classic starting point is the deductive-nomological (DN) model (Hempel, 1965). The idea: to explain is to deduce the event from facts plus laws.
The explanation for some event \(E\) consists of two components:
1. A non-empty set of observation statements \(S = \{s_1,\dots,s_n\}\); and
2. at least one law-like generalisation \(L\), such that \((S \wedge L) \models E\).
In words: the facts and the laws together logically entail the event. If you can derive \(E\), you have explained it.
“Why did E happen? Because of these facts (S), and given this law (L), E had to happen.” The explanation is a little logical proof whose conclusion is the thing you’re explaining.
We can have a perfectly good explanation that is not DN-compliant — i.e. the facts and laws do not strictly entail the event. The slide’s example:
\(s_1\): Patient A has infection \(x\)
\(s_2\): Patient A receives treatment
\(L_1\): 0% of untreated patients with infection \(x\) survive
\(L_2\): 99% of treated patients with infection \(x\) survive
\(\therefore E\): Patient A survives.
This explains A’s survival well, yet the laws are probabilistic (99%, not 100%), so \((S\wedge L)\) does not logically entail \(E\). Deduction fails, but the explanation is still good. Hence DN-compliance is not necessary for a good explanation.
Worse, we can have a DN-compliant derivation that is not a good explanation at all. The famous example:
\(s_1\): Patient A is male
\(s_2\): Patient A has been taking birth-control pills regularly
\(L_1\): All males who take birth-control pills regularly fail to get pregnant
\(\therefore E\): Patient A fails to get pregnant.
This is valid deduction from a true law — fully DN-compliant — yet absurd as an explanation. The pills are irrelevant; A fails to get pregnant because he is male. So DN-compliance is not sufficient for a good explanation.
Don’t mix these up. Unnecessary (Objection 1, infection case): a good explanation that is not DN-compliant ⇒ DN isn’t required. Insufficient (Objection 2, birth-control case): a DN-compliant derivation that is not a good explanation ⇒ DN isn’t enough. Together they say DN is neither necessary nor sufficient.
The barometer derivation (a falling barometer reading, plus the regularity that storms follow falling readings, entails a coming storm) is DN-compliant, being a valid deduction from a law-like generalisation, but it is not a good explanation: the falling barometer does not cause the storm; both are caused by a drop in atmospheric pressure. This illustrates Objection 2: DN is insufficient, exactly like the birth-control case. It shows that the DN model cannot distinguish genuine explanatory relevance from mere correlation or common cause, which motivates the move to a causal account next.
The fix for “relevance” is to make explanation causal. The interventionist model (Woodward, 2003) says \(X=x\) explains \(Y=y\) within a model \(M\) if and only if three conditions hold:
\(X=x\) explains \(Y=y\) within model \(M\) iff:
1. The generalisations described by \(M\) are accurate (or approximately so), as are the observations \(X=x\) and \(Y=y\).
2. According to \(M\), \(Y=y\) under an intervention that sets \(X\) to \(x\).
3. There exists some possible intervention that sets \(X\) to \(x'\) (with \(x \ne x'\)), with \(M\) correctly describing the value \(y'\) (with \(y \ne y'\)) that \(Y\) would assume under that intervention.
\(X\) explains \(Y\) if wiggling \(X\) wiggles \(Y\). Condition 2: setting \(X=x\) gives \(Y=y\). Condition 3 is the crucial one: there must be some other setting \(x'\) that would have produced a different \(y'\). If changing \(X\) never changes \(Y\), then \(X\) doesn’t explain \(Y\). This is exactly why “being male” survives (a hypothetical intervention setting sex to female could change the outcome) but “taking the pills” drops out: for a male patient, intervening on pill-taking changes nothing.
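Conditions 2 and 3 can be checked mechanically on a toy model. The sketch below is an illustration under stated assumptions, not Woodward's formalism: `pregnancy_model`, its three qualitative outcome levels, the `DOMAINS` table and the `explains` helper are all hypothetical, and each intervention holds the other variables at their actual values. Condition 1 (accuracy of \(M\)) is simply assumed.

```python
def pregnancy_model(sex, takes_pills):
    """Toy model M: qualitative chance of pregnancy (an illustrative assumption)."""
    if sex == "male":
        return "impossible"
    return "unlikely" if takes_pills else "possible"

# Values each variable could be set to by an intervention.
DOMAINS = {"sex": ["male", "female"], "takes_pills": [True, False]}

def explains(model, actual, variable, domains):
    """True if some intervention on `variable` alone changes the model's output."""
    y = model(**actual)                          # condition 2: Y under X = x
    for alt in domains[variable]:
        if alt == actual[variable]:
            continue                             # condition 3 needs x' != x
        intervened = {**actual, variable: alt}   # set X to x', keep the rest fixed
        if model(**intervened) != y:             # condition 3: y' != y
            return True
    return False

patient_a = {"sex": "male", "takes_pills": True}
sex_explains = explains(pregnancy_model, patient_a, "sex", DOMAINS)
pills_explain = explains(pregnancy_model, patient_a, "takes_pills", DOMAINS)
print("sex explains:", sex_explains, "| pills explain:", pills_explain)
```

For Patient A, intervening on sex changes the outcome while intervening on pill-taking does not, so only sex passes condition 3.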
Note the deep link to counterfactuals (Part V): condition 3 is a counterfactual claim. Causal explanation and counterfactual explanation are two sides of one coin.
The interventionist account is appealing but has a cost: causal explanations may be overly complex. A full causal model of a real system can involve enormous numbers of variables and interventions — far more than a human wants or can use. Causal correctness and human usability pull in opposite directions.
The pragmatist response: stop hunting for one universal form. Epistemological pragmatism (van Fraassen, 1980) reframes explanation as relative to a question asked in a context.
“The discussion of explanation went wrong at the very beginning when explanation was conceived of as a relation like description: a relation between a theory and a fact. Really, it is a three-term relation between theory, fact, and context. No wonder that no single relation between theory and fact ever managed to fit more than a few examples! Being an explanation is essentially relative for an explanation is an answer… it is evaluated vis-à-vis a question, which is a request for information. But exactly… what is requested differs from context to context.”
An explanation is an answer to a why-question, and which answer is good depends on what the asker wanted to know. “Why did the model deny the loan?” has different good answers for the applicant (what to change), the regulator (is it fair?), and the engineer (is it a bug?). Explanation is a three-term relation: theory + fact + context.
The pragmatist’s flexibility is also its weakness: if goodness is fully context-relative, it threatens to collapse into “anything goes” — with no principled, non-circular criterion to separate genuine explanations from persuasive-sounding rubbish. This is precisely the “non-circular success criteria” worry from slide 11, now unresolved.
Watson points to Tim Miller’s influential survey, “Explanation in artificial intelligence: Insights from the social sciences” (University of Melbourne) — the standard reference for how human-facing explanation actually works.
The four theories form a chain of fixes: DN (logic) → fails relevance → interventionist (causation) → too complex → pragmatism (context) → “anything goes”. If asked to “discuss theories of explanation,” narrate this progression and pair each model with its objection and its named author + year. Bare lists lose marks.
Take \(Y\) = “becomes pregnant”. For \(X\) = sex: condition 2 holds (setting sex = male gives \(Y\) = not pregnant), and condition 3 holds because there is a possible intervention setting sex = female under which the model would assign a different value (\(Y\) could be pregnant). So sex explains \(Y\). For \(X\) = pill-taking: condition 3 fails — intervening to set pill-taking = “not taking pills” leaves \(Y\) unchanged (a male still doesn’t get pregnant). Because no intervention on \(X\) changes \(Y\), pill-taking is explanatorily irrelevant. The interventionist account thus repairs exactly the “insufficiency” the DN model couldn’t handle.
DN and Woodward treat explanation as a two-term relation between a theory/model and a fact. Van Fraassen adds a third term, context: an explanation is an answer to a specific why-question, and goodness is judged relative to what that question requested. Buys: it accommodates the fact that different stakeholders need different (all legitimate) answers about the same prediction, which a single universal form can’t. Cost: the “anything goes” objection — without a principled, non-circular standard, almost any answer can be defended as “good in some context,” undermining our ability to evaluate explanations rigorously.
The black-box problem has spawned a whole research field. This part lays out the three method families and the three dichotomies that organise them, then dives deep into the crown jewel — Shapley values — the one piece of maths most likely to be tested in full.
Watson organises XAI around three families of methods and three dichotomies. Get these two trios straight and you have the map of the entire practical field.
1. Feature attributions — assign each input feature a number quantifying its contribution to a prediction (LIME, QII, SHAP). “How much did each feature matter?”
2. Rule lists — give logical if-then conditions that account for the prediction. “What conditions guarantee this output?”
3. Counterfactuals — describe the smallest change to the input that would flip the output. “What would have to be different?”
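To make the third family concrete, here is a minimal brute-force sketch of a counterfactual search. Everything in it is an assumption made for illustration: the `approve` loan rule, the applicant's figures, the integer search grid, and the use of L1 distance as the notion of “smallest”. Practical counterfactual methods use more careful distance measures and search strategies.

```python
from itertools import product

def approve(income_k, debt_k):
    """Hypothetical black box: approve when income minus twice debt reaches 30."""
    return income_k - 2 * debt_k >= 30

def nearest_counterfactual(model, x, delta_ranges):
    """Search a grid of changes to x for the smallest (L1) one that flips the output."""
    original = model(*x)
    best, best_cost = None, None
    for deltas in product(*delta_ranges):
        candidate = tuple(v + d for v, d in zip(x, deltas))
        cost = sum(abs(d) for d in deltas)
        flipped = model(*candidate) != original
        if flipped and (best_cost is None or cost < best_cost):
            best, best_cost = candidate, cost
    return best, best_cost

applicant = (40, 10)                         # income 40k, debt 10k: denied
grid = [range(-10, 11), range(-10, 11)]      # allowed change per feature, in thousands
cf, cost = nearest_counterfactual(approve, applicant, grid)
print("denied applicant:", applicant, "-> counterfactual:", cf, "cost:", cost)
```

The search returns the nearest input on the other side of the decision boundary, which reads as advice to the applicant: in this toy case, reducing debt is a cheaper route to approval than raising income.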
Intrinsic vs. post-hoc: is the model interpretable by design (a “glass box”), or do we bolt an explainer on afterwards?
Model-specific vs. model-agnostic: does the method need the model’s internals, or does it work on any model via its inputs and outputs?