KCL · Machine Learning · Week 11

Explainable AI

Why black boxes need opening, what a “good explanation” even is, and the three great families of methods — feature attributions, rule lists, and counterfactuals — plus the open problems that haunt them.

Lecturer: David Watson
where it all starts

Part I · Introduction · Why Explain?

Modern models make consequential decisions about people, yet we often cannot say why they decided as they did. Before we build any machinery, we should be clear about what we are even trying to fix — and why we should care.

Imagine a bank refuses your loan, a hospital algorithm decides you are low-priority, or a model flags your photo as a security risk. In each case a number comes out, a decision follows, and a human is affected. The natural question — “why?” — turns out to be surprisingly hard to answer for the most powerful models we have. This week is about that question.

The black-box problem slides 4–5

Supervised learning algorithms are increasingly used in a variety of high-stakes domains, from credit scoring to medical diagnosis. slide 4 That is the good news: they work well enough that we are willing to let them touch life-altering decisions.

The bad news, on the next slide: many such methods are opaque, in that humans cannot understand the reasoning behind particular predictions. slide 5 A deep net or a gradient-boosted ensemble may have millions of parameters interacting non-linearly; the mapping from input to output is computable but not comprehensible. This mismatch — high stakes plus low transparency — is the black-box problem, and it is the engine that drives the whole field of XAI.

Definition · Black box

A model is a black box when you can feed it inputs and read off outputs, but you cannot follow the internal reasoning that links the two in a way a human can understand. Note this is about human comprehension, not secrecy — even a fully open-source neural network can be a black box.

Three reasons to explain slide 6

Watson gives three reasons we want explanations. Memorise these — they are the cleanest possible exam target, and each comes with a concrete example on the following slides.

In plain English · The three motives

1. To audit — to catch the model behaving badly (e.g. discriminating).
2. To validate — to check the model is right for the right reasons, not by luck or a shortcut.
3. To discover — to learn something new about the world from a model that predicts well.

A handy gloss: audit looks for harm, validate looks for cheating, discover looks for knowledge.

Reason 1: To audit slide 7

Example: an algorithm used by a major private healthcare provider exhibits systematic racial bias. slide 7 (This is the famous Obermeyer et al. study: the model used health-care cost as a proxy for health need. Because less money is historically spent on Black patients at the same level of illness, the algorithm systematically underestimated their need — at any given risk score, Black patients had more active chronic conditions than White patients.) Without an explanation that surfaces which feature is driving the score, this bias hides in plain sight.

[Figure: active chronic conditions (y-axis) against percentile of algorithm risk score (x-axis). Series: Black, White. Marked points: referred for screen, auto-enrolled.]
At every risk-score percentile, Black patients carry more chronic conditions than White patients — the model under-rates their need. An explanation revealing “cost is the target” is what makes this auditable.
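
The proxy mechanism can be made concrete with a small sketch. Everything here is an illustrative assumption: the spend-per-condition figures are invented, not taken from the study, and the "model" is simply a perfect predictor of cost. The point is only that a score which tracks cost faithfully can still rank need unfairly.

```python
# Toy illustration of a cost proxy hiding unequal need.
# All numbers are invented for illustration; they are not from the study.

# Assumed average spend per active chronic condition, by group (hypothetical).
SPEND_PER_CONDITION = {"Black": 700.0, "White": 1000.0}


def predicted_cost(group: str, conditions: int) -> float:
    """A 'perfect' cost model: it predicts exactly what is spent."""
    return SPEND_PER_CONDITION[group] * conditions


def conditions_at_score(group: str, score: float) -> float:
    """Invert the cost model: how many conditions yield this risk score?"""
    return score / SPEND_PER_CONDITION[group]


score = 7000.0  # the same risk score for both patients
need_black = conditions_at_score("Black", score)
need_white = conditions_at_score("White", score)
print(f"At score {score:.0f}: Black = {need_black:.1f} conditions, "
      f"White = {need_white:.1f} conditions")
```

Under these assumed numbers the two patients receive the same score while one carries more conditions, which is the pattern the figure describes.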

Reason 2: To validate slide 8

Example: an image classifier is trained to distinguish huskies from wolves. However, the training set contains snow in the background for all and only the wolf images. slide 8 The model achieves great accuracy — by detecting snow, not wolf. It is right for entirely the wrong reason. An explanation (here, a saliency map highlighting the background) reveals the shortcut. This is the textbook example of why high test accuracy alone does not validate a model.

[Figure: (a) husky classified as “wolf”; (b) explanation = the snowy background.]
The explanation highlights snow, not the animal: the classifier learned a spurious correlate. Validation = checking the model is right for the right reasons.
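
A minimal sketch of the same failure on invented data. The rows, the feature names and the one-feature "stump" learner are all assumptions made for illustration, and the flip-one-feature check is a crude perturbation test rather than a real saliency-map method.

```python
# Toy shortcut learning: a one-feature "stump" picks snow over the animal cue.
# The data are invented; each row is (looks_wolfish, snowy_background, label).
TRAIN = [
    (1, 1, "wolf"), (1, 1, "wolf"), (1, 1, "wolf"), (0, 1, "wolf"),
    (0, 0, "husky"), (0, 0, "husky"), (0, 0, "husky"), (1, 0, "husky"),
]
FEATURES = ["looks_wolfish", "snowy_background"]


def stump_accuracy(feature_idx: int) -> float:
    """Training accuracy of the rule 'predict wolf iff feature == 1'."""
    hits = sum((row[feature_idx] == 1) == (row[2] == "wolf") for row in TRAIN)
    return hits / len(TRAIN)


# "Training": keep the single feature with the best training accuracy.
best = max(range(len(FEATURES)), key=stump_accuracy)


def predict(x: tuple) -> str:
    """Predict with the chosen one-feature rule."""
    return "wolf" if x[best] == 1 else "husky"


def occlusion_importance(x: tuple) -> dict:
    """Flip one feature at a time and record whether the prediction changes."""
    base = predict(x)
    changed = {}
    for i, name in enumerate(FEATURES):
        flipped = list(x)
        flipped[i] = 1 - flipped[i]
        changed[name] = predict(tuple(flipped)) != base
    return changed


husky_in_snow = (0, 1)  # not wolfish, but photographed on snow
print(FEATURES[best], stump_accuracy(best))
print(predict(husky_in_snow), occlusion_importance(husky_in_snow))
```

The stump reaches perfect training accuracy by reading the background, and the perturbation check shows that only the snow feature moves the prediction for the husky photographed on snow.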

Reason 3: To discover slide 9

Example: biologists use a high-performing mortality model to generate hypotheses about biomarkers and disease mechanism. slide 9 Here the model is not the suspect (audit) nor the student (validate) but the oracle: if it predicts mortality well, the features it leans on — and their interactions — may point to real biology worth investigating in the lab. The slide shows a SHAP interaction plot between white-blood-cell count and blood-urea-nitrogen, exactly the kind of structure a scientist might turn into a testable hypothesis.

Exam trap · Discovery ≠ causation

“To discover” means generating hypotheses, not confirming causes. A feature attribution tells you what the model relies on; whether that reflects a real causal mechanism in the world is a separate, much harder question. Don’t write that XAI “proves” a biomarker causes mortality.

Q · A team reports 99% accuracy on a pneumonia X-ray classifier and asks why they should bother with explanations. Which of the three motives is most directly relevant, and what might an explanation reveal?

The most directly relevant motive is validation, i.e. checking the model is right for the right reasons. High accuracy is exactly what the husky/wolf example warns about: a model can hit 99% by exploiting a spurious shortcut. A real-world parallel: pneumonia classifiers have been caught keying on hospital metadata (e.g. a portable-scanner token burned into the image at sicker hospitals) rather than lung pathology. An explanation (saliency map) could reveal the model is attending to a corner tag, not the lungs. Auditing (for demographic bias) is a secondary motive.

Q · Define the black-box problem in one sentence, and explain why an open-source model can still be a black box.

The black-box problem is that high-stakes supervised models are opaque: humans can read their inputs and outputs but cannot understand the reasoning behind particular predictions. Open-sourcing the weights does not help, because comprehensibility is a cognitive property: a human staring at a million interacting parameters still cannot trace why this input produced that output. Transparency of code is not the same as transparency of reasoning.

·  ·  ·
a detour through philosophy of science

Part II · Theory · What Makes a Good Explanation?

Before we can build explanations, we should ask what an explanation is. Philosophy of science has wrestled with this for a century. Watson walks through three positions: each later one responds to a flaw in its predecessor, and each fails in an instructive way.

It is tempting to skip the philosophy and get to the algorithms, but the exam can and does test this. The four big questions on slide 11 frame everything: What do good explanations look like? What are the proper units of explanation? Do all successful explanations share some underlying form? Do we have non-circular success criteria for explanations? slide 11 Keep that last one in mind — “non-circular success criteria” is the thread that the whole section, and the conclusion, keeps tugging.

Theory 1 — The Deductive-Nomological model slide 12

The classic starting point is the deductive-nomological (DN) model (Hempel, 1965). The idea: to explain is to deduce the event from facts plus laws.

Definition · DN model (Hempel 1965)

The explanation for some event \(E\) consists of two components:

1. A non-empty set of observation statements \(S = \{s_1,\dots,s_n\}\); and
2. at least one law-like generalisation \(L\), such that:

$$(S \wedge L) \rightarrow E.$$

In words: the facts and the laws together logically entail the event. If you can derive \(E\), you have explained it.

In plain English

“Why did E happen? Because of these facts (S), and given this law (L), E had to happen.” The explanation is a little logical proof whose conclusion is the thing you’re explaining.

Objection 1: the DN model is unnecessary slides 13–14

We can have a perfectly good explanation that is not DN-compliant — i.e. the facts and laws do not strictly entail the event. The slide’s example: slide 14

Example · A good non-DN explanation

\(s_1\): Patient A has infection \(x\)
\(s_2\): Patient A receives treatment
\(L_1\): 0% of untreated patients with infection \(x\) survive
\(L_2\): 99% of treated patients with infection \(x\) survive
\(\therefore E\): Patient A survives.

This explains A’s survival well — yet the laws are probabilistic (99%, not 100%), so \((S\wedge L)\) does not logically entail \(E\). Deduction fails, but the explanation is still good. Hence DN is unnecessary.

Objection 2: the DN model is insufficient slides 15–16

Worse, we can have a DN-compliant derivation that is not a good explanation at all. The famous example: slide 16

Example · A bad DN-compliant “explanation”

\(s_1\): Patient A is male
\(s_2\): Patient A has been taking birth-control pills regularly
\(L_1\): All males who take birth-control pills regularly fail to get pregnant
\(\therefore E\): Patient A fails to get pregnant.

This is valid deduction from a true law — fully DN-compliant — yet absurd as an explanation. The pills are irrelevant; A fails to get pregnant because he is male. So DN-compliance is not sufficient for a good explanation.
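
DN-compliance is just logical entailment, so it can be checked mechanically. The sketch below is a brute-force truth-table checker written for this example; the atom names and the encoding of the law as a propositional formula are simplifying assumptions. It also shows why a probabilistic law such as “99% of treated patients survive” has no place in this scheme: nothing short of an exceptionless rule makes `entails` return true.

```python
from itertools import product

ATOMS = ["male", "pills", "pregnant"]


def entails(premises, conclusion) -> bool:
    """True iff every truth assignment satisfying all premises satisfies the conclusion."""
    for values in product([False, True], repeat=len(ATOMS)):
        world = dict(zip(ATOMS, values))
        if all(p(world) for p in premises) and not conclusion(world):
            return False
    return True


# Observation statements s1, s2 from the birth-control example.
def s1(w): return w["male"]
def s2(w): return w["pills"]


# L1: all males who take the pills fail to get pregnant.
def law_pills(w): return not (w["male"] and w["pills"]) or not w["pregnant"]


# A rival law that never mentions the pills: all males fail to get pregnant.
def law_males(w): return not w["male"] or not w["pregnant"]


def event(w): return not w["pregnant"]


print(entails([s1, s2, law_pills], event))  # the slide's derivation
print(entails([s1, s2], event))             # facts alone, no law
print(entails([s1, law_males], event))      # derivation without the pills
```

Both the pill-based derivation and the pill-free one pass the entailment test, so entailment alone cannot say which premise is doing the explanatory work.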

Exam trap · “Unnecessary” vs “insufficient”

Don’t mix these up. Unnecessary (Objection 1, infection case): a good explanation that is not DN-compliant ⇒ DN isn’t required. Insufficient (Objection 2, birth-control case): a DN-compliant derivation that is not a good explanation ⇒ DN isn’t enough. Together they say DN is neither necessary nor sufficient.

Q · A barometer reading drops, then a storm arrives. From “the barometer dropped” plus “whenever the barometer drops, a storm follows” we can deduce the storm. Is this a good explanation? Which DN objection does it illustrate?

It is DN-compliant (valid deduction from a law-like generalisation) but not a good explanation: the falling barometer does not cause the storm; both are caused by a drop in atmospheric pressure. This illustrates Objection 2: DN is insufficient, exactly like the birth-control case. It shows DN models cannot distinguish genuine explanatory relevance from mere correlation/common cause, which motivates the move to a causal account next.

Theory 2 — The interventionist model slide 17

The fix for “relevance” is to make explanation causal. The interventionist model (Woodward, 2003) says \(X=x\) explains \(Y=y\) within a model \(M\) if and only if three conditions hold:

Definition · Interventionist model (Woodward 2003)

\(X=x\) explains \(Y=y\) within model \(M\) iff:

1. The generalisations described by \(M\) are accurate (or approximately so), as are the observations \(X=x\) and \(Y=y\).
2. According to \(M\), \(Y=y\) under an intervention that sets \(X\) to \(x\).
3. There exists some possible intervention that sets \(X\) to \(x'\) (with \(x \ne x'\)), with \(M\) correctly describing the value \(y'\) (with \(y \ne y'\)) that \(Y\) would assume under that intervention.

In plain English

\(X\) explains \(Y\) if wiggling \(X\) wiggles \(Y\). Condition 2: setting \(X=x\) gives \(Y=y\). Condition 3 is the crucial one: there must be some other setting \(x'\) that would have produced a different \(y'\). If changing \(X\) never changes \(Y\), then \(X\) doesn’t explain \(Y\). This is exactly why “being male” survives (a hypothetical intervention setting sex to female could change the outcome) but “taking the pills” drops out: intervening on pill-taking changes nothing.

Note the deep link to counterfactuals (Part V): condition 3 is a counterfactual claim. Causal explanation and counterfactual explanation are two sides of one coin.
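
Condition 3 can be sketched as a difference-making test on a toy model. The probabilities below are invented, and “intervention” is simplified to changing one input of a function while holding the others at their actual values, which is a much thinner notion than Woodward’s.

```python
# Toy model M: probability of pregnancy as a function of sex and pill-taking.
# The probabilities are invented for illustration only.
def p_pregnant(sex: str, pills: bool) -> float:
    if sex == "male":
        return 0.0
    return 0.01 if pills else 0.20


DOMAINS = {"sex": ["male", "female"], "pills": [True, False]}
actual = {"sex": "male", "pills": True}  # Patient A


def makes_a_difference(variable: str) -> bool:
    """Condition 3: does some alternative setting of `variable` change Y,
    holding the other variables at their actual values?"""
    y_actual = p_pregnant(**actual)
    for alt in DOMAINS[variable]:
        if alt == actual[variable]:
            continue
        intervened = {**actual, variable: alt}
        if p_pregnant(**intervened) != y_actual:
            return True
    return False


print("sex:", makes_a_difference("sex"))
print("pills:", makes_a_difference("pills"))
```

In this toy model, changing sex changes the outcome while changing pill-taking does not, matching the verdict in the text.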

Objection: causal explanations may be overly complex slide 18

The interventionist account is appealing but has a cost: causal explanations may be overly complex. slide 18 A full causal model of a real system can involve enormous numbers of variables and interventions — far more than a human wants or can use. Causal correctness and human usability pull in opposite directions.

Theory 3 — Epistemological pragmatism slide 19

The pragmatist response: stop hunting for one universal form. Epistemological pragmatism (van Fraassen, 1980) reframes explanation as relative to a question asked in a context.

Direct quote · van Fraassen (1980, p. 156)

“The discussion of explanation went wrong at the very beginning when explanation was conceived of as a relation like description: a relation between a theory and a fact. Really, it is a three-term relation between theory, fact, and context. No wonder that no single relation between theory and fact ever managed to fit more than a few examples! Being an explanation is essentially relative for an explanation is an answer… it is evaluated vis-à-vis a question, which is a request for information. But exactly… what is requested differs from context to context.”

In plain English

An explanation is an answer to a why-question, and which answer is good depends on what the asker wanted to know. “Why did the model deny the loan?” has different good answers for the applicant (what to change), the regulator (is it fair?), and the engineer (is it a bug?). Explanation is a three-term relation: theory + fact + context.

Objection: anything goes slide 20

The pragmatist’s flexibility is also its weakness: if goodness is fully context-relative, it threatens to collapse into “anything goes” slide 20 — with no principled, non-circular criterion to separate genuine explanations from persuasive-sounding rubbish. This is precisely the “non-circular success criteria” worry from slide 11, now unresolved.

Further reading slide 21

Watson points to Tim Miller’s influential survey, “Explanation in artificial intelligence: Insights from the social sciences” (University of Melbourne) — the standard reference for how human-facing explanation actually works. slide 21

Exam trap · Know the arc, not just the names

The three theories form a chain of fixes: DN (logic) → fails relevance → interventionist (causation) → too complex → pragmatism (context) → “anything goes”. If asked to “discuss theories of explanation,” narrate this progression and pair each model with its objection and its named author + year. Bare lists lose marks.

Q · Using Woodward’s three conditions, explain precisely why “being male” explains the non-pregnancy but “taking birth-control pills” does not.

Take \(Y\) = “becomes pregnant”. For \(X\) = sex: condition 2 holds (setting sex = male gives \(Y\) = not pregnant), and condition 3 holds because there is a possible intervention setting sex = female under which the model would assign a different value (\(Y\) could be pregnant). So sex explains \(Y\). For \(X\) = pill-taking: condition 3 fails, because intervening to set pill-taking = “not taking pills” leaves \(Y\) unchanged (a male still doesn’t get pregnant). Because no intervention on \(X\) changes \(Y\), pill-taking is explanatorily irrelevant. The interventionist account thus repairs exactly the “insufficiency” the DN model couldn’t handle.

Q · Why does van Fraassen call explanation a “three-term relation,” and what does this buy us over the DN and interventionist models, and at what cost?

DN and Woodward treat explanation as a two-term relation between a theory/model and a fact. Van Fraassen adds a third term, context: an explanation is an answer to a specific why-question, and goodness is judged relative to what that question requested. Buys: it accommodates the fact that different stakeholders need different (all legitimate) answers about the same prediction, which a single universal form can’t. Cost: the “anything goes” objection. Without a principled, non-circular standard, almost any answer can be defended as “good in some context,” undermining our ability to evaluate explanations rigorously.

·  ·  ·
from philosophy to algorithms

Part III · Practice · The Mechanics of XAI & Shapley Values

The black-box problem has spawned a whole research field. This part lays out the three method families and the three dichotomies that organise them, then dives deep into the crown jewel — Shapley values — the one piece of maths most likely to be tested in full.

Watson organises XAI around three families of methods slides 23–25 and three dichotomies slide 26. Get these two trios straight and you have the map of the entire practical field.

Definition · The three method families

1. Feature attributions — assign each input feature a number quantifying its contribution to a prediction (LIME, QII, SHAP). “How much did each feature matter?”
2. Rule lists — give logical if-then conditions that account for the prediction. “What conditions guarantee this output?”
3. Counterfactuals — describe the smallest change to the input that would flip the output. “What would have to be different?”
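
To make the first family concrete, here is a minimal sketch of exact Shapley values on a three-feature toy model. The model, the input and the all-zeros baseline are invented, and the value function (features outside a coalition are reset to the baseline) is one common choice assumed here; it may differ from the definition used in the lecture. Only the model’s input/output behaviour is queried.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def model(x):
    """Hypothetical black box: only its input/output behaviour is used below."""
    a, b, c = x
    return 3 * a + 2 * b + a * c


def value(coalition, x, baseline):
    """Model output when features outside `coalition` are set to the baseline."""
    mixed = [x[i] if i in coalition else baseline[i] for i in range(len(x))]
    return model(mixed)


def shapley_values(x, baseline):
    """Exact Shapley values by enumerating every coalition (exponential in n)."""
    n = len(x)
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = Fraction(0)
        for size in range(n):
            # Weight |S|! (n - |S| - 1)! / n! for coalitions S of this size.
            weight = Fraction(factorial(size) * factorial(n - size - 1), factorial(n))
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}, x, baseline) - value(s, x, baseline))
        phis.append(phi)
    return phis


x = (2, 1, 3)
baseline = (0, 0, 0)
phi = shapley_values(x, baseline)
print([int(p) for p in phi])  # [9, 2, 3]
print(sum(phi) == model(x) - model(baseline))
```

The interaction term `a * c` contributes 6 to the output and is split evenly between the two features involved, and the attributions sum to the gap between the prediction and the baseline prediction.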

Definition · The three dichotomies (slide 26)

Intrinsic vs. post-hoc: is the model interpretable by design (a “glass box”), or do we bolt an explainer on afterwards?
Model-specific vs. model-agnostic: does the method need the model’s internals, or does it work on any model via input/output […]

[…] equal weights).

  • Define minimally sufficient (abductive / prime implicant) and minimally necessary (contrastive / prime implicate) explanations with conditions (a)–(c), and cite Shih 2018 / Ignatiev 2019.
  • State the complexity results: polynomial for single trees/naive Bayes; \(D^P\)-complete for random forests (Izza & Marques-Silva 2021, SAT-solvers); MaxSAT for GBMs (Ignatiev 2022).
  • Write the counterfactual objective \(x^\ast=\arg\min_{x'\in\mathrm{CF}(x)}\text{cost}(x,x')\) and the L1/L2 distances; explain why the “least costly” counterfactual depends on the metric and constraints.
  • List counterfactual advantages and disadvantages and explain recourse and why ignoring causal dependencies gives provably suboptimal recourse.
  • Explain the Rashomon set and Rudin’s (2019) “just use a glass box” argument, plus the two objections (no guarantee; IP law).
  • Recall the adversarial results: Dombrowski 2019 (arbitrary explanations), Slack 2021 (fooling fairness audits), Slack 2022 (errors of necessity in counterfactuals); and Mayo 2018 on error-rate control.
  • Reproduce the three closing questions and connect “non-circular success criteria” back to underdetermination (Lipton 2017; Doshi-Velez & Kim 2017) and van Fraassen.
  • Be able to map every method to the right why-question: attribution = “how much,” rule list = “what conditions,” counterfactual = “what change.”
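
For the counterfactual objective in the checklist above, a minimal sketch shows how the arg min shifts with the metric. The loan rule, the integer search grid and the rejected applicant are invented for illustration, and exhaustive grid search stands in for whatever optimiser a real method would use.

```python
from itertools import product
from math import sqrt


def approve(x) -> bool:
    """Hypothetical loan rule standing in for a black-box classifier."""
    income, savings = x
    return income >= 3 or (income >= 2 and savings >= 2)


def l1(x, y):
    """L1 (Manhattan) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))


def l2(x, y):
    """L2 (Euclidean) distance."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def closest_counterfactual(x, cost, grid=range(5)):
    """arg min of cost(x, x') over grid points x' whose prediction differs from x's."""
    candidates = [c for c in product(grid, repeat=len(x)) if approve(c) != approve(x)]
    return min(candidates, key=lambda c: cost(x, c))


x = (0, 0)  # rejected applicant
print(closest_counterfactual(x, l1))  # (3, 0)
print(closest_counterfactual(x, l2))  # (2, 2)
```

Under L1 the cheapest fix changes one feature by a lot; under L2 it spreads a smaller change across both. Adding constraints (for example, freezing a feature the applicant cannot change) would shrink the candidate set and could move the answer again.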