Evidence-Based Practice & Medical Statistics — Bradford VTS

Bradford VTS · AKT Mastery Series

Evidence-Based Practice & Medical Statistics

Because "I read something online" doesn't quite cut it in the AKT.

🔥 High-yield tips for AKT · ⚡ High-impact learning in minutes · 💎 Hidden gems they forget to teach

Last updated: April 2026 · Also known as Evidence-Based Medicine (EBM)

🌐

Web Resources

A hand-picked mix of official guidance and real-world GP training resources. Because sometimes the best pearls are not hiding in the official documents.

📘 Core EBM & Statistics

CEBM (Oxford Centre for EBM): Tools, guides, and critical appraisal resources
Dr Chris Cates' EBM Website: NNT calculator and Cates plot explainer (excellent)
GPnotebook (Statistics): Concise clinical statistics reference
Bandolier: Evidence-based healthcare in plain English
Cochrane Library: Gold-standard systematic reviews

📗 AKT Revision & Exam Prep

NHS (Clinical Trials): Plain-English guide to research types
Cambridge EBM Resources: Critical appraisal tools and tutorials

📙 Further Reading & Tools

NICE (How to Use Evidence): NICE process manual for evidence review
NNT Explained (Manchester): Interactive NNT calculator and explanation
Health Knowledge (Study Designs): Public Health textbook on epidemiology

⚡

One-Minute Recall

Scanning this before clinic — or the night before an AKT paper? These are the things that score you marks.

🧮 Risk Formulas

ARR = CER − EER
NNT = 1 ÷ ARR
RRR = ARR ÷ CER
RR = EER ÷ CER
NNH = 1 ÷ ARI
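
As a quick check on the formulas above, here is a minimal Python sketch. The trial counts (20 events in 100 control patients, 15 in 100 treated) are invented for illustration, and `risk_metrics` is a hypothetical helper name rather than a function from any statistics library. It assumes the treatment lowers the event rate.

```python
from fractions import Fraction
from math import ceil


def risk_metrics(control_events, control_total, treated_events, treated_total):
    """Return CER, EER, ARR, RRR, RR and NNT for a two-arm trial.

    Assumes the treatment lowers the event rate (ARR > 0).
    """
    cer = Fraction(control_events, control_total)   # control event rate
    eer = Fraction(treated_events, treated_total)   # experimental event rate
    arr = cer - eer                                  # absolute risk reduction
    return {
        "CER": cer,
        "EER": eer,
        "ARR": arr,
        "RRR": arr / cer,        # relative risk reduction
        "RR": eer / cer,         # relative risk
        "NNT": ceil(1 / arr),    # rounded up by convention
    }


# Hypothetical trial: 20/100 events on control, 15/100 on treatment
example = risk_metrics(20, 100, 15, 100)
for name, value in example.items():
    print(f"{name}: {float(value):.2f}")
```

If the treatment raises the event rate instead, the same arithmetic with ARI = EER − CER gives the NNH.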

🔬 Diagnostic Testing

Sens = TP ÷ (TP+FN) → SnNout
Spec = TN ÷ (TN+FP) → SpPin
PPV = TP ÷ (TP+FP) ← falls with low prevalence
NPV = TN ÷ (TN+FN) ← falls with high prevalence
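
To see why PPV falls and NPV rises as a disease gets rarer, the sketch below applies the same test at three prevalences. The 90% sensitivity and specificity are made-up figures, and `predictive_values` is a hypothetical helper written for this illustration.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    tp = sensitivity * prevalence                 # true positives (as proportions)
    fn = (1 - sensitivity) * prevalence           # false negatives
    tn = specificity * (1 - prevalence)           # true negatives
    fp = (1 - specificity) * (1 - prevalence)     # false positives
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv


# Hypothetical test: 90% sensitive, 90% specific
for prevalence in (0.01, 0.10, 0.50):
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

Sensitivity and specificity stay fixed throughout; only the predictive values move with prevalence.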

📊 Study Designs

SR/Meta-analysis → highest evidence
RCT → gold standard for treatment
Cohort → RR & incidence
Case-control → OR, rare diseases
Cross-sectional → prevalence

📈 Graphs

Forest plot diamond crosses line = not significant
Funnel asymmetry = publication bias
Cates plot: NNT = 100 ÷ yellow faces
Box plot middle line = median
I² >50% = substantial heterogeneity

📉 Significance

p < 0.05 = statistically significant
CI crosses 1.0 (ratio) = not significant
CI crosses 0 (difference) = not significant
Mean = average; Median = middle value
68-95-99.7 rule for normal distribution
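
Two of the rules above are easy to check in code. This is a small sketch with hypothetical helper names: `ci_excludes_null` applies the "does the CI cross the no-effect value" test, and `proportion_within` recovers the 68-95-99.7 figures from the normal distribution. The confidence intervals in the example are invented.

```python
from math import erf, sqrt


def ci_excludes_null(lower, upper, null_value):
    """True if the confidence interval does not contain the no-effect value.

    Use null_value=1 for ratios (RR, OR) and null_value=0 for differences.
    """
    return not (lower <= null_value <= upper)


def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))


print(ci_excludes_null(0.60, 0.95, null_value=1))   # ratio CI that stays below 1
print(ci_excludes_null(-0.02, 0.08, null_value=0))  # difference CI that contains 0
for k in (1, 2, 3):
    print(f"within {k} SD: {proportion_within(k):.1%}")
```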

⚖️ Bias Types

Selection → unrepresentative sample
Recall → cases remember more
Publication → positive studies only
Lead time → screening illusion
Attrition → dropout distorts results

💡

Why This Matters in GP & the AKT

EBM and statistics aren't just theoretical. In your consulting room, every conversation about treatment options involves NNTs whether you name them or not. Every blood test has a sensitivity and specificity. Every new guideline is based on a study design that affects how much trust you should place in it.

In the AKT, evidence-based practice accounts for roughly 10% of the paper according to RCGP guidance. It is one of the few areas where a small amount of targeted revision pays dividends immediately. Many candidates lose easy marks here not because the concepts are difficult, but because they've never sat down and learned them systematically.

The statistics questions in the AKT often present a table of trial data and ask you to calculate a value, interpret a graph, or identify the best study design. They reward methodical thinking, not medical knowledge. This makes them the most "learnable" marks in the paper.

Once you know how to calculate NNT, interpret a forest plot, and understand the effect of prevalence on PPV, a whole category of AKT questions becomes straightforward rather than scary.

📖

Evidence-Based Medicine (EBM)

"When I was in training in the mid-1980s, I gave an intravenous infusion of lidocaine to every patient who came through the door after a heart attack. That was the standard. Everyone did it. It seemed to make perfect sense."

— Professor Gordon Guyatt, the physician who coined the term "Evidence-Based Medicine", describing his own training before EBM existed

He later discovered that the practice he'd been trained in — and that hundreds of thousands of doctors worldwide were performing — was not only useless, but potentially killing people. Not through negligence. Not through incompetence. But because no one had ever properly tested whether it actually worked.

That story is why Evidence-Based Medicine exists — and why it matters deeply to every patient you will ever see.

📋 What Is Evidence-Based Medicine?

"The conscientious, explicit and judicious use of current best evidence in making decisions about the care of individual patients."

— David Sackett, BMJ 1996 — the most widely cited definition in medicine

In plain English: rather than treating patients based on habit, opinion, tradition, or what your professor told you, EBM requires you to base clinical decisions on the best available research — rigorously conducted, critically appraised, and honestly interpreted.

It does not replace clinical judgement — it informs it. EBM rests on three inseparable pillars that must work together:

🔬

Best Research Evidence

Clinically relevant, high-quality research — especially from RCTs and systematic reviews

🩺

Clinical Expertise

The skill and experience of the individual clinician, built through years of practice

🤝

Patient Values & Preferences

The individual patient's circumstances, values, and what matters most to them

🕰️ How Did EBM Come About? — A Brief History

EBM didn't appear from nowhere. It was the culmination of decades of quietly revolutionary thinking in North America and the UK.

Year	Event
1938	John Paul (Yale) coins the term "clinical epidemiology" — the idea that medicine should be studied scientifically in populations, not just observed in individual patients
1967	McMaster University (Hamilton, Canada) opens its new medical school with a Department of Clinical Epidemiology and Biostatistics — radical at the time, dedicated to applying research methods to clinical decisions
1972	Archie Cochrane, a Scottish epidemiologist, publishes Effectiveness and Efficiency: Random Reflections on Health Services — a landmark text arguing that medicine must test its own treatments rigorously. His work eventually gives birth to the Cochrane Collaboration, named in his honour.
1981	David Sackett and colleagues at McMaster publish a nine-article series in the Canadian Medical Association Journal teaching clinicians how to critically appraise medical literature. This is the formal beginning of the EBM movement.
1990	Gordon Guyatt, then the young director of the internal medicine residency programme at McMaster, designs a new teaching programme and initially calls it "Scientific Medicine." Colleagues recoil: the implication that current practice isn't scientific is too direct.
1991	Guyatt renames the approach "Evidence-Based Medicine" and publishes the term in an editorial in the ACP Journal Club. The phrase sticks immediately.
1992	The landmark JAMA paper — "Evidence-Based Medicine: A New Approach to Teaching the Practice of Medicine" — introduces EBM to the world. The response, Guyatt recalls, was initially "rage." Colleagues felt they were being told they weren't good doctors.
1993	The Cochrane Collaboration is formally founded — an international network to produce and disseminate systematic reviews of healthcare evidence.
1996	David Sackett publishes the definitive three-pillar definition in the BMJ. EBM becomes mainstream.
2000s–present	EBM becomes embedded in UK training: NICE guidelines, the GMC's Good Medical Practice, and the RCGP curriculum all require it. It is the foundation of how every UK doctor is now trained and assessed.

⚠️ What Was Medicine Like Before EBM? The Problem It Solved

Before EBM, medicine ran on what Gordon Guyatt memorably called "GOBSAT" — Good Old Boys Sitting Around a Table. Clinical guidelines were written by senior experts who pooled their personal opinions, and what happened to your patient depended entirely on which doctor happened to see them.

The Pre-EBM World

Eminence-based medicine: You treated patients the way your professor did. Authority came from seniority, not evidence. A consultant who had "always done it this way" for 30 years was deferred to — even if "this way" had never been tested.
Intuition-based medicine: If a treatment seemed to make physiological sense, it was used. If suppressing abnormal heart rhythms seemed logical, you suppressed them. Whether it actually helped patients was rarely tested.
Anecdote-based medicine: "In my experience, I've found that..." was the standard of evidence. Individual cases drove practice — even when those cases were statistical outliers.
Enormous variation: The same patient presenting to two different hospitals — or even two different doctors in the same hospital — might receive completely different treatment for exactly the same condition.

Would you want a bridge built based on "I've been designing bridges this way for 30 years and none have fallen down yet"? Or based on tested structural engineering science? The answer is obvious. But for most of medicine's history, the first approach was exactly how practice was decided.

🔴 The Example That Changed Medicine — The CAST Trial

This is not a hypothetical. It is one of the most important true stories in modern medicine — and one of the strongest arguments that EBM has ever needed.

The Setup — The Logic That Seemed Unassailable

Heart attacks cause dangerous heart rhythm abnormalities (ventricular arrhythmias). Ventricular arrhythmias cause sudden death. Therefore: suppress the arrhythmias → prevent sudden death. This seemed so obviously right that from the 1970s onwards, antiarrhythmic drugs — particularly lidocaine, flecainide, and encainide — were routinely given to post-MI patients in hospitals across the world. Not occasionally. Routinely. As standard care.

The Trial — Someone Actually Tested It

In 1987, the Cardiac Arrhythmia Suppression Trial (CAST) enrolled over 1,700 post-MI patients and randomised them to antiarrhythmic drugs (flecainide or encainide) or placebo. The drugs did exactly what they were supposed to — they successfully suppressed the arrhythmias. But something unexpected happened.

The Result — What Nobody Expected

Patients on the drugs were 2.5 times more likely to die than those on placebo. The trial had to be stopped early because the harm was so clear. The drugs had been killing the very patients they were meant to protect.

NNH = 21. For every 21 patients treated with flecainide or encainide, one additional person died who would otherwise have survived.
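
The two headline figures (2.5 times the risk, NNH of 21) fit together, and a short sketch can show how. It back-calculates the death rates they imply, using ARI = 1 ÷ NNH, ARI = EER − CER and RR = EER ÷ CER. The outputs are rough illustrations derived from the rounded figures in the text, not the trial's published event counts, and `implied_event_rates` is a hypothetical helper name.

```python
def implied_event_rates(relative_risk, nnh):
    """Back-calculate control and treated event rates from RR and NNH.

    Uses ARI = 1 / NNH, ARI = EER - CER and RR = EER / CER.
    """
    ari = 1 / nnh                        # absolute risk increase
    cer = ari / (relative_risk - 1)      # control (placebo) event rate
    eer = relative_risk * cer            # event rate on the drug
    return cer, eer


cer, eer = implied_event_rates(relative_risk=2.5, nnh=21)
print(f"placebo deaths ~{cer:.1%}, drug deaths ~{eer:.1%}")
```

The point to take away: a large relative risk can sit on top of a fairly small absolute difference, which is why both numbers are worth quoting.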

The Lesson

Gordon Guyatt, then a junior doctor in training, had personally given lidocaine infusions to every post-MI patient who came through his ward. He was following best practice. He had been taught correctly. He had good intentions. And yet, without the rigorous test of an RCT, neither he nor his colleagues had any way of knowing the treatment was harmful. This experience became central to why he dedicated his career to EBM. The history of medicine, he later said, "is full of treatments that were based mostly on guess-work and intuition rather than solid evidence."

🌐 How EBM Changed Everything — Standardisation and Unity

Before EBM, what you got depended on where you happened to live, which hospital you attended, and which doctor saw you. The same patient with the same condition might receive completely different treatments in Leeds and London. Different hospitals. Different countries. Wildly different outcomes.

EBM changed this. By anchoring clinical decisions to the same body of evidence — the same trials, the same systematic reviews, the same guidelines — it gave medicine a common language and a common standard. Today, a patient presenting with an MI in Bradford and a patient presenting with an MI in Bristol should receive essentially the same evidence-based care. Not because doctors are identical, but because the treatment is driven by the evidence, not by individual preference.

In the UK, this is operationalised through NICE guidelines, the RCGP curriculum, QOF indicators, clinical audits, and MRCGP examinations — all of which require and assess evidence-based practice. When you sit the AKT, you are being tested on your ability to apply this framework.

🌍 A World Without EBM — The International Picture

The UK's commitment to EBM — through NICE, the NHS, and postgraduate training — is not universal. In many parts of the world, what you receive as a patient still depends heavily on who sees you, where you present, and how much you can pay. Understanding this helps you appreciate what EBM protects your patients from.

🔴 Important Note Before Reading

The variation described below reflects healthcare systems and structures, not the competence or dedication of individual doctors. Many brilliant, hard-working physicians practise in every country listed. The issue is the absence of the standardising infrastructure — guidelines, oversight, training frameworks — that EBM provides. Individual doctors cannot overcome systemic problems alone.

Country / Region	How Practice Varies Without Strong EBM Frameworks
🇮🇳 India (private sector)	A 2018 Lancet study found C-section rates of 40–58% in private hospitals compared to 10–14% in public facilities — often driven by financial incentives rather than clinical need. Over-investigation and polypharmacy are widely documented in the private sector. The same cancer patient may receive dramatically different treatment based on where they present and what they can afford.
🇵🇰 Pakistan	Significant variation in adherence to antibiotic guidelines — one of the highest antibiotic prescription rates in South Asia. Drug-resistant TB and antimicrobial resistance are direct consequences. Access to specialist care and standardised management pathways is highly dependent on geography and income.
🇳🇬 Nigeria / 🇬🇭 Ghana	Magnesium sulphate is the WHO-recommended, evidence-based, inexpensive treatment for eclampsia. Studies show it reduces maternal mortality significantly. Yet availability and actual use in Nigerian and Ghanaian facilities varies enormously depending on hospital resources and clinician training — meaning whether a woman with eclampsia lives or dies may depend on which facility she reaches.
🇸🇩 Sudan / 🇮🇶 Iraq	Prolonged conflict and instability have devastated healthcare infrastructure. In Iraq after 2003, public health services collapsed, guideline implementation stalled, and access to basic drugs became geography-dependent. In Sudan, conflict has disrupted vaccination programmes, maternal health services, and chronic disease management. Practice variation in such environments is not a matter of preference — it is a matter of what is available.
🇪🇬 Egypt / 🇮🇷 Iran	Both countries have medical schools producing skilled physicians and have published EBM guidelines — but implementation is inconsistent between public and private sectors, and between urban and rural areas. In Iran, international sanctions have affected drug availability, forcing adaptations that diverge from evidence-based protocols.
🇷🇴 Romania	The practice of plicul (the "envelope"), informal cash payments to doctors and nurses to secure care, is well documented in Romania. It is officially illegal but widely practised. Parliamentary enquiries and investigative journalism have confirmed that the quality of surgical care can depend on what a patient can pay privately, regardless of their official entitlement under the public health system. Brain drain has removed an estimated 14,000+ doctors since EU accession.
🇺🇸 United States	The most expensive healthcare in the world, at over $12,000 per person per year, yet outcomes are often no better than in the UK. The Dartmouth Atlas project has documented enormous geographic variation in clinical practice: the same patient in Miami may receive twice as many investigations and procedures as the same patient in Minneapolis, with no difference in outcomes. A 2019 JAMA study estimated that $760–935 billion a year, roughly a quarter of all US healthcare spending, is waste, including overtreatment, administrative complexity, and pricing failures. The opioid crisis was partly fuelled by pharmaceutical companies influencing prescribing practices outside of EBM frameworks.
🇫🇷 France	France has excellent healthcare — but antibiotic prescribing rates have historically been among the highest in Europe, driven partly by cultural expectations that a consultation should always end with a prescription. Campaigns to reduce this ("Antibiotics are not automatic") have helped, but the pattern illustrates how cultural and commercial pressures can override evidence-based guidance even in sophisticated systems.
🇮🇹 Italy	Healthcare quality in Northern Italy (Milan, Bologna) is among the best in Europe. In parts of Southern Italy, the picture is very different — longer waiting times for cancer surgery, less consistent application of screening programmes, lower adherence to guideline-based care. Geographic origin within the same country can significantly affect outcomes.
🇬🇷 Greece / 🇪🇸 Spain	Greece's austerity crisis (2010–2015) led to healthcare spending cuts of over 25%, causing documented shortages of medicines, staff reductions, and quality deterioration. Over 35,000 healthcare workers emigrated. In Spain, significant regional variation in cancer survival rates has been documented — the tumour you develop may behave differently depending on which region you happen to live in, not because the biology differs, but because the system's application of evidence-based treatment does.

💊 When What You Get Depends on What You Pay

In healthcare systems without robust EBM frameworks or universal entitlements, the relationship between payment and treatment quality is often direct and documented:

In some Indian private hospitals, a patient presenting with chest pain who can pay for private care may receive immediate catheterisation and stenting. The same patient in the public system may wait hours for an ECG.
In parts of Sub-Saharan Africa, whether a child with severe malaria receives artemisinin-based combination therapy (the evidence-based standard) or an older, less effective drug depends on which facility they reach and what their family can pay.
In countries without universal drug access, cancer chemotherapy agents may be available only to those who can pay out-of-pocket — meaning identical cancers have dramatically different outcomes based solely on income.
In Romania and some other Eastern European countries, the quality of a surgical procedure — the surgeon's diligence, the quality of anaesthesia monitoring, even the availability of post-operative nursing — has been documented to depend on informal payment, not clinical need.

✈️ The Analogy That Makes It Clear

Every time you board a commercial flight, you benefit from one of the most effective safety systems humans have ever built. Not because individual pilots are exceptionally talented — though they are. But because aviation is built around standardised, evidence-tested protocols. Every pilot, every airline, every country follows the same pre-flight checklists, the same landing procedures, the same emergency protocols. The system protects you regardless of which individual pilot you get.

Medicine without EBM is aviation without checklists. Your safety depends entirely on whether you happen to get a good pilot, whether that pilot trained recently enough, whether they are having a good day, and whether they've heard the latest thinking from someone they trust.

EBM replaces luck with systems. It replaces "in my experience" with "in 15,000 trials involving 2 million patients." It replaces the opinion of whoever happens to be most senior in the room with the accumulated evidence of humanity's collective clinical experience. Wouldn't you prefer that for your patients? Wouldn't your patients prefer it for themselves?

💡 Why This Matters for You, as a GP Trainee in the UK

Every NICE guideline you follow is the product of systematic evidence review — someone has done the work of ensuring that what you do is supported by the best available science
Every AKT question on statistics and research methods is testing your ability to critically appraise evidence — to be an active consumer of EBM, not a passive follower of instructions
When you explain an NNT to a patient, or discuss the limitations of a screening test, or refuse to prescribe an antibiotic that isn't indicated, you are practising EBM — consciously, explicitly, and judiciously
And when a drug rep sits across from you and tells you their new medication reduces cardiovascular events by 35%, your first question — "35% relative or absolute?" — is the question that EBM taught medicine to ask

Study Designs & Hierarchy of Evidence

Before you interpret any result, you need to know where it came from. Different study designs answer different questions, generate different statistics, and carry different levels of reliability.

The Evidence Pyramid

Strongest evidence at the top; weakest at the bottom

Systematic Reviews & Meta-Analyses

Randomised Controlled Trials (RCTs)

Cohort Studies

Case-Control Studies

Cross-Sectional Studies

Case Reports & Expert Opinion

⬆ Strongest evidence | ⬇ Weakest evidence

Study Design	Direction	Best For	Generates	Key Weakness
Systematic Review / Meta-Analysis	—	Best overall evidence on a question	Pooled effect size	Only as good as underlying studies; heterogeneity
RCT (Randomised Controlled Trial)	Forward	Does treatment X work?	RR, ARR, NNT	Expensive, artificial setting, ethical issues
Cohort Study	Forward (prospective) or backward (retrospective)	Does exposure cause outcome? Incidence?	Relative Risk (RR), Incidence	Attrition; expensive over time; confounding
Case-Control Study	Backward	Rare diseases; risk factors	Odds Ratio (OR)	Recall bias; cannot calculate incidence directly
Cross-Sectional Study	Single snapshot	How common is X right now? (prevalence)	Prevalence	Cannot establish causation; temporal ambiguity
Case Report / Expert Opinion	—	Hypothesis generation; rare events	Description only	Highly susceptible to bias; not generalisable

📖 Qualitative vs Quantitative Research

Quantitative Research

Answers "how many" or "how much." Uses numbers, statistics, and structured data. Examples: RCTs, cohort studies, surveys with numerical outcomes. Generates p-values, CIs, NNTs.

Qualitative Research

Answers "why" or "how." Uses words, themes, and interviews. Examples: focus groups, ethnographic studies, grounded theory. Explores patient experiences and beliefs.

In the AKT: a question about patient attitudes, experiences, or understanding of illness → qualitative. A question about rates, outcomes, or effectiveness → quantitative.

🔬 Systematic Review vs Meta-Analysis — Not the Same Thing

Systematic Review: A rigorous, structured literature search that identifies, selects, and critically appraises all relevant studies on a question. The result is a narrative summary of the evidence.

Meta-Analysis: A systematic review that goes one step further — it mathematically pools the quantitative results from multiple studies into a single combined estimate. Not all systematic reviews include a meta-analysis (e.g., if studies are too heterogeneous to combine).

Think of a systematic review as reading every book about a topic and writing a report. A meta-analysis is that report with a formula at the end that combines all the authors' results into one number, giving more weight to the larger, more precise studies.
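
To make "mathematically pools" concrete, here is a minimal sketch of one common pooling method, the fixed-effect inverse-variance weighted average. The source does not say which method any particular review uses, so treat this as an assumption for illustration. The three study estimates and standard errors are invented, and `pool_fixed_effect` is a hypothetical helper name.

```python
from math import sqrt


def pool_fixed_effect(estimates, standard_errors):
    """Inverse-variance weighted average of study estimates (fixed-effect model).

    Returns the pooled estimate and its standard error.
    """
    weights = [1 / se ** 2 for se in standard_errors]   # precise studies weigh more
    total_weight = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total_weight
    return pooled, sqrt(1 / total_weight)


# Hypothetical effect estimates (e.g. log relative risks) from three studies
estimates = [-0.30, -0.10, -0.20]
standard_errors = [0.10, 0.20, 0.40]
pooled, pooled_se = pool_fixed_effect(estimates, standard_errors)
print(f"pooled estimate {pooled:.3f} (SE {pooled_se:.3f})")
```

The pooled value sits closest to the first study because it has the smallest standard error. This is the same logic that makes the bigger squares on a forest plot pull the diamond towards them.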

🔄 The PICO Framework

PICO is the standard framework for structuring a clinical research question — used to search the literature and design studies.

Letter	Stands For	Example
P	Population / Patient	Adults with type 2 diabetes
I	Intervention	SGLT2 inhibitors
C	Comparison	Metformin alone
O	Outcome	Cardiovascular events at 5 years

The AKT may present a scenario and ask which study design best answers the PICO question. Match the outcome type to the correct design using the table above.
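
If it helps to see PICO as a structure rather than a mnemonic, the sketch below stores the four elements from the example table and phrases them as a single question. `PicoQuestion` and its `as_question` method are hypothetical names made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PicoQuestion:
    population: str
    intervention: str
    comparison: str
    outcome: str

    def as_question(self) -> str:
        """Phrase the four PICO elements as a single answerable question."""
        return (
            f"In {self.population}, what is the effect of {self.intervention} "
            f"compared with {self.comparison} on {self.outcome}?"
        )


question = PicoQuestion(
    population="adults with type 2 diabetes",
    intervention="SGLT2 inhibitors",
    comparison="metformin alone",
    outcome="cardiovascular events at 5 years",
)
print(question.as_question())
```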

🎲 RCT Design Features (commonly tested)

Feature	What It Means	Why It Matters
Randomisation	Participants allocated to groups by chance	Eliminates selection bias; balances confounders
Single blind	Participants don't know their allocation	Reduces placebo effect and participant bias
Double blind	Neither participants nor investigators know allocation	Eliminates observer bias AND participant bias
Triple blind	Participants, investigators, AND data analysts blinded	Maximum bias reduction
Intention to Treat (ITT)	Analysed in their original group regardless of adherence	Preserves randomisation; reflects real-world use
Per Protocol	Analysed only if they completed the protocol	Shows biological efficacy but overestimates real-world benefit
Crossover design	Participants receive both treatments in sequence	Each person acts as their own control; needs washout period
Allocation concealment	The person recruiting participants cannot see which group the next participant will be assigned to until after they have been enrolled	Prevents the recruiter from subconsciously (or deliberately) allocating healthier patients to the treatment group — a form of selection bias that randomisation alone does not prevent
Cluster RCT	Whole groups (e.g. GP practices, wards, schools) are randomised rather than individuals	Used when individual randomisation is impractical (e.g. testing a new consultation style). Requires larger sample sizes and statistical adjustment for clustering effect

⚠️ AKT Trap — ITT vs Per Protocol

Intention to treat gives a more conservative (lower) estimate of effectiveness — because it includes non-adherent participants. This is the preferred analysis for clinical decisions. Per protocol overestimates effectiveness but is useful for understanding biological mechanism.
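
A small worked example shows why the two analyses diverge. All counts below are invented: 100 patients per arm, with 20 of the treatment arm stopping the drug early. For simplicity the sketch assumes every control patient completed the protocol.

```python
from fractions import Fraction

# Hypothetical trial: 100 patients randomised to each arm.
# In the treatment arm, 80 took the drug as prescribed and 20 stopped early.
# Assumption: all 100 control patients completed the protocol.
control_events, control_total = 20, 100
adherent_events, adherent_total = 8, 80
non_adherent_events, non_adherent_total = 6, 20

cer = Fraction(control_events, control_total)

# Intention to treat: everyone randomised to treatment stays in the treatment arm
itt_eer = Fraction(adherent_events + non_adherent_events,
                   adherent_total + non_adherent_total)

# Per protocol: only those who completed the treatment as planned
pp_eer = Fraction(adherent_events, adherent_total)

itt_arr = cer - itt_eer
pp_arr = cer - pp_eer
print(f"ITT ARR: {float(itt_arr):.2f}  per-protocol ARR: {float(pp_arr):.2f}")
```

The per-protocol figure looks better only because the patients who stopped the drug, and did worse, have been left out.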

⚖️ Superiority, Non-Inferiority & Equivalence Trials

Not all trials ask the same question. The AKT occasionally tests whether you understand what a trial was actually designed to show — and why that matters when interpreting its results.

Trial Type	The Question Being Asked	Common Context
Superiority trial	"Is the new treatment better than the comparator?"	Most standard RCTs — testing a genuinely new drug or approach
Non-inferiority trial	"Is the new treatment no worse than the comparator by more than a pre-specified small margin?"	New drug with fewer side effects, lower cost, or easier to administer — aim is to show it's "good enough"
Equivalence trial	"Are the two treatments essentially the same?"	Biosimilar drugs; generic medicines; different routes of administration

💡 Why Non-Inferiority Trials Matter in GP

A new anticoagulant might be shown to be "non-inferior" to warfarin for stroke prevention — not better, but not meaningfully worse — while being easier to use (no INR monitoring). That's a clinically valuable finding even if the drug didn't "beat" warfarin. The AKT may ask you to interpret a non-inferiority trial result correctly.

⚠️ AKT Trap

A non-inferiority trial that shows "no significant difference" is not the same as a superiority trial that shows "no significant difference." In a superiority trial, a non-significant result means you failed to prove the new drug works better. In a non-inferiority trial, the absence of a significant difference is not enough on its own: non-inferiority is shown only when the confidence interval for the difference stays within the pre-specified margin.
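
One way to picture how a non-inferiority result is read is as a comparison between the confidence interval and the pre-specified margin. The sketch below is a simplified decision rule under stated assumptions: the difference is the new event rate minus the standard event rate (so positive means worse), and the margin and CIs are invented. `interpret_noninferiority` is a hypothetical helper, not a standard function.

```python
def interpret_noninferiority(ci_lower, ci_upper, margin):
    """Classify a trial from the CI of (new event rate - standard event rate).

    Positive values mean the new treatment has more events (is worse).
    margin is the largest acceptable excess, fixed before the trial.
    """
    if ci_upper < 0:
        return "superior"          # whole CI favours the new treatment
    if ci_upper < margin:
        return "non-inferior"      # any excess is smaller than the margin
    if ci_lower > margin:
        return "inferior"          # whole CI lies beyond the margin
    return "inconclusive"          # CI includes the margin


# Hypothetical results with a pre-specified margin of 2 percentage points
print(interpret_noninferiority(-0.010, 0.015, margin=0.02))
print(interpret_noninferiority(-0.010, 0.035, margin=0.02))
```

Both example intervals cross zero, so both would be reported as "no significant difference", yet only the first stays inside the margin.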

Research Bias, Validity & Reliability

Type of Bias	Definition	Which Study Designs?	How to Reduce It
Selection Bias	Participants are not representative of the target population	All types	Randomisation; careful sampling
Recall Bias	Cases (people with disease) remember past exposures more vividly than controls	Case-control studies	Objective data sources; standardised questioning
Publication Bias	Positive/significant studies are published more often than negative ones	Meta-analyses (detected via funnel plot)	Trial registration; grey literature search
Attrition Bias	Loss of participants to follow-up distorts results (dropouts differ from completers)	Cohort studies, RCTs	Intention-to-treat analysis; minimise dropout
Lead Time Bias	Screening gives the illusion of improved survival by detecting disease earlier	Screening studies	Use disease-specific mortality, not survival from diagnosis
Length Bias	Screening detects more slow-growing (less aggressive) disease	Screening studies	RCTs with disease-specific outcomes
Observer / Assessment Bias	Knowledge of treatment allocation affects outcome assessment	RCTs	Blinding (single, double, triple)
Hawthorne Effect	Participants change their behaviour because they know they are being observed	All types, especially observational	Control groups; blind observers
Verification Bias	Only patients with a positive test result get the gold-standard confirmatory test — so sensitivity appears falsely high and specificity falsely low	Diagnostic test studies	Ensure all patients (positive and negative) receive the gold-standard test
Confounding	A third variable is associated with both exposure and outcome, distorting the apparent relationship	Observational studies	Randomisation (RCTs); stratification; multivariate analysis

Classic confounding example: "Ice cream sales are associated with drowning." Is ice cream dangerous? No — both go up in summer. Season is the confounder. This illustrates why correlation ≠ causation and why confounders must be controlled for.

🔀 Confounding — When Two Things Look Linked But Aren't

Confounding is one of the most important concepts in research methodology — and one of the most common reasons why apparently convincing observational findings turn out to be wrong. The key idea is simple: two things can appear to be linked not because they directly affect each other, but because both are linked to a hidden third variable.

🔴 The Core Definition

A confounder (or confounding variable) is a third variable that is independently associated with both the exposure and the outcome. It creates a spurious (false) association — or masks a real one — between the exposure and outcome you are studying.

Imagine you notice that people who carry lighters are more likely to develop lung cancer. Should we ban lighters? No — the real culprit is smoking. Smoking is associated with both carrying a lighter and developing lung cancer. Smoking is the confounder. The lighter-cancer association is entirely explained by it.

Classic Examples

Apparent Link	The Confounder	Why It Explains Everything
Ice cream sales → drowning deaths	Hot weather (summer)	Hot weather causes both more ice cream eating AND more swimming → more drownings. Ice cream doesn't cause drowning.
Coffee drinking → lung cancer	Smoking	Smokers drink more coffee on average. Early studies linked coffee to cancer — until smoking was controlled for.
Carrying a lighter → lung cancer	Smoking	Smokers carry lighters. The lighter has no biological effect — smoking does.
Grey hair → heart disease	Age	Both grey hair and heart disease increase with age. Age is the confounder — not hair colour.
Shoe size → reading ability (in children)	Age	Older children have bigger feet AND read better. Age explains both.

📐 The Three Criteria for a True Confounder

A variable is a confounder if it meets all three of these:

It is associated with the exposure (the thing you're studying)
It is associated with the outcome (the result you're measuring)
It is not on the causal pathway between exposure and outcome (it's a separate third variable, not a step in between)

💡 How to Control for Confounding

Randomisation (in RCTs) — distributes known and unknown confounders equally between groups. This is the strongest protection.
Restriction — only enrol participants who are similar on the confounder (e.g. only non-smokers in the coffee study)
Matching — pair cases and controls on the confounder variable
Stratification — analyse results separately for each level of the confounder
Multivariate statistical adjustment — statistically account for multiple confounders simultaneously
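
Stratification is the easiest of these to see in numbers. The sketch below uses an invented cohord-style dataset for the coffee and lung cancer example: within smokers and within non-smokers coffee makes no difference, but pooling everyone produces an apparent link. All counts are hypothetical and `relative_risk` is a helper written for this illustration.

```python
from fractions import Fraction


def relative_risk(exposed_events, exposed_total, unexposed_events, unexposed_total):
    """Risk in the exposed group divided by risk in the unexposed group."""
    return Fraction(exposed_events, exposed_total) / Fraction(unexposed_events, unexposed_total)


# Hypothetical cohort: (cancers, people) among coffee drinkers and non-drinkers
strata = {
    "smokers":     {"coffee": (80, 800), "no_coffee": (20, 200)},
    "non-smokers": {"coffee": (2, 200),  "no_coffee": (8, 800)},
}

# Crude analysis: ignore smoking and pool everyone
coffee_events = sum(s["coffee"][0] for s in strata.values())
coffee_total = sum(s["coffee"][1] for s in strata.values())
other_events = sum(s["no_coffee"][0] for s in strata.values())
other_total = sum(s["no_coffee"][1] for s in strata.values())
crude_rr = relative_risk(coffee_events, coffee_total, other_events, other_total)
print(f"crude RR: {float(crude_rr):.2f}")

# Stratified analysis: look within each smoking group separately
stratum_rr = {
    name: relative_risk(*s["coffee"], *s["no_coffee"]) for name, s in strata.items()
}
for name, rr in stratum_rr.items():
    print(f"{name} RR: {float(rr):.2f}")
```

The crude relative risk is well above 1 purely because smokers are over-represented among the coffee drinkers in this made-up dataset.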

⚠️ Why Confounding Is Mainly a Problem in Observational Studies

In a well-conducted RCT, randomisation distributes confounders (both known and unknown) equally between groups — eliminating confounding as an explanation for differences. In cohort and case-control studies, you can only adjust for confounders you know about and have measured. Unmeasured confounders always remain a potential explanation for any observed association — which is why observational studies can never definitively prove causation.

✅ Validity & Reliability — What's the Difference?

Internal Validity

Does the study measure what it claims to measure within the study population? Are the results of this study trustworthy? Threatened by bias and confounding.

External Validity (Generalisability)

Can the results be applied to other populations or real-world settings? A highly controlled RCT in a specialist centre may not reflect what happens in primary care.

Reliability (Reproducibility)

Does the test produce consistent results when repeated under the same conditions? Measured by inter-rater reliability (kappa statistic) or test-retest reliability.

Kappa Statistic (κ)

Measures agreement between two raters beyond chance. κ = 1 (perfect agreement); κ = 0 (agreement no better than chance); κ < 0 (worse than chance).
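As a rough sketch of where κ comes from: it compares the agreement actually observed with the agreement two raters would reach by chance alone, using κ = (observed − expected) ÷ (1 − expected). That formula is the standard Cohen's kappa rather than something stated in this guide, and the counts below are invented.

```python
def cohen_kappa(both_yes, a_yes_b_no, a_no_b_yes, both_no):
    """Cohen's kappa for two raters making a yes/no call on the same cases."""
    n = both_yes + a_yes_b_no + a_no_b_yes + both_no
    observed = (both_yes + both_no) / n              # raw agreement
    a_yes = (both_yes + a_yes_b_no) / n              # how often rater A says yes
    b_yes = (both_yes + a_no_b_yes) / n              # how often rater B says yes
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)  # agreement by chance
    return (observed - expected) / (1 - expected)

# Hypothetical: two GPs grade 100 skin lesions as "refer" or "don't refer".
kappa = cohen_kappa(both_yes=40, a_yes_b_no=10, a_no_b_yes=10, both_no=40)
print(f"kappa = {kappa:.2f}")
```

Here the raters agree on 80% of cases, but half of that would be expected by chance, so κ comes out lower than the raw agreement suggests.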

Measuring Risk & Treatment Effect

This is the most heavily tested area in AKT statistics. You need to know these formulas cold and be able to apply them to trial data tables under exam conditions.

Imagine a lottery advert says you've doubled your chances of winning. Sounds amazing. But if you went from 1-in-a-million to 2-in-a-million, the relative improvement is 100% while the absolute difference is still essentially nothing. That's the difference between RRR and ARR. Drug companies love to quote RRR. You should always ask for ARR.

The Key Metrics

Absolute Risk Reduction (ARR)

ARR = CER − EER

The actual difference in event rates between groups. The most clinically honest metric.

Relative Risk Reduction (RRR)

RRR = (CER − EER) ÷ CER

Often sounds more impressive — can be misleading without knowing baseline risk. Alternative formula: RRR = 1 − RR (since RR = EER÷CER, RRR = 1 − that value).

Relative Risk (RR)

RR = EER ÷ CER

Used in cohort studies and RCTs. RR of 1 = no effect; <1 = reduced risk; >1 = increased risk.

Number Needed to Treat (NNT)

NNT = 1 ÷ ARR

How many patients to treat to prevent one bad outcome. Lower = more effective. Always round UP.

Number Needed to Harm (NNH)

NNH = 1 ÷ ARI

How many patients to treat to cause one additional adverse effect. Higher = safer treatment.

Absolute Risk Increase (ARI)

ARI = EER − CER

The additional risk conferred by a harmful exposure or treatment side effect.

📊 NNT Interpretation — What Do The Numbers Actually Mean?

⚠️ Common Confusion — Lower NNT = Better (not worse)

NNT tells you how many patients you need to treat for one to benefit. So the fewer patients you need to treat to get one benefit, the more effective the treatment. NNT = 2 means 1 in every 2 patients benefits — that's excellent. NNT = 100 means only 1 in 100 benefits — much weaker.

Think of giving out umbrellas on a rainy day. NNT = 2 → give 2 umbrellas, 1 person stays dry — very effective. NNT = 100 → give out 100 umbrellas, only 1 person avoids getting wet — pretty underwhelming.

NNT Range	Rough Interpretation	Example Context
< 10	Very effective	Antibiotics for certain infections; some acute treatments
10 – 50	Moderate effect	Many common preventative medications
> 100	Weak effect	Some population-level preventive strategies

⚠️ These Bands Are Informal — Context Is Everything

These thresholds are not from NICE or RCGP — they are rough teaching aids only. The "right" NNT is always context-dependent:

An NNT of 100 might still be worthwhile if the outcome prevented is death or serious irreversible harm
An NNT of 5 might not be acceptable if the treatment has frequent or serious side effects

One-line rule: NNT tells you how many patients you treat for one to benefit — lower = stronger effect. But always weigh it against the severity of the outcome and the burden of treatment.

🔄 Don't Confuse NNT with % Benefit

If 60 out of every 100 patients benefit because of the treatment (i.e. the ARR is 60% = 0.6), then NNT = 1 ÷ 0.6 ≈ 1.7, which rounds up to 2: an excellent result. The NNT is not the number who benefit; it is the number you treat to get one benefit.

📖 Odds Ratio & Hazard Ratio — When Are These Used?

Measure	Used In	Interpretation
Relative Risk (RR)	Cohort studies, RCTs	Directly compares risk in two groups. More intuitive than OR.
Odds Ratio (OR)	Case-control studies	Compares odds of exposure in cases vs controls. Approximates RR when disease is rare.
Hazard Ratio (HR)	Survival analysis (time-to-event)	Like RR but accounts for when events occur over time. HR <1 = reduced hazard in treatment group.

⚠️ AKT Trap — OR ≠ RR

The OR always sits further from 1 than the RR, so when the outcome is common it exaggerates the apparent size of the effect. When the outcome is rare (roughly <10% in the study population), the two are approximately equal. A case-control study generates an OR; you cannot directly calculate incidence or RR from a case-control study.
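A quick sketch of the gap between OR and RR, using invented cohort-style counts (where both measures can be calculated from the same table). The function name and the figures are illustrative only.

```python
def rr_and_or(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Return (relative risk, odds ratio) from the same 2x2 data."""
    risk_e = events_exposed / n_exposed
    risk_u = events_unexposed / n_unexposed
    rr = risk_e / risk_u
    odds_e = risk_e / (1 - risk_e)   # odds = risk / (1 - risk)
    odds_u = risk_u / (1 - risk_u)
    return rr, odds_e / odds_u

# Hypothetical rare outcome: 0.2% vs 0.1%
rare = rr_and_or(2, 1000, 1, 1000)
# Hypothetical common outcome: 60% vs 30%
common = rr_and_or(600, 1000, 300, 1000)

print(f"Rare outcome:   RR {rare[0]:.2f}, OR {rare[1]:.2f}")
print(f"Common outcome: RR {common[0]:.2f}, OR {common[1]:.2f}")
```

Both scenarios have an RR of 2. For the rare outcome the OR is almost identical; for the common outcome it is much larger.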

🧮 Worked Example — Calculating NNT from Trial Data

🎯 Scenario: Statin Trial

A 5-year RCT shows that among patients with high cardiovascular risk: 6% in the placebo group had a heart attack, compared to 4% in the statin group.

1 CER (Control Event Rate) = 6% = 0.06; EER (Experimental Event Rate) = 4% = 0.04

2 ARR = CER − EER = 0.06 − 0.04 = 0.02 (2%)

3 NNT = 1 ÷ ARR = 1 ÷ 0.02 = 50
Treat 50 patients for 5 years to prevent 1 heart attack

4 RRR = ARR ÷ CER = 0.02 ÷ 0.06 = 33%
Sounds impressive! But the absolute benefit is only 2%.

5 RR = EER ÷ CER = 0.04 ÷ 0.06 = 0.67
The statin group had 67% of the risk of the placebo group — a 33% relative reduction.
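The five steps above can be checked with a few lines of code. This is only a sketch: the function name is made up, and it assumes the treatment lowers the event rate (CER greater than EER).

```python
import math

def treatment_effect(cer, eer):
    """Risk metrics from a control event rate and an experimental event rate."""
    arr = cer - eer
    return {
        "ARR": arr,
        "RRR": arr / cer,
        "RR": eer / cer,
        # NNT is rounded UP; the inner round() guards against
        # floating-point noise such as 50.00000000000001.
        "NNT": math.ceil(round(1 / arr, 6)),
    }

statin = treatment_effect(cer=0.06, eer=0.04)
print(f"ARR {statin['ARR']:.1%}, RRR {statin['RRR']:.0%}, "
      f"RR {statin['RR']:.2f}, NNT {statin['NNT']}")
```

Try changing the baseline risk while keeping the same RR: the RRR stays put, but the ARR and NNT move a lot.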

💡 The Clinical Bottom Line

When a patient asks "Will this statin help me?", use NNT. "If 50 people like you take this tablet for 5 years, 1 heart attack will be prevented. For you personally, it's a 2% absolute benefit." That's far more honest than "It reduces your risk by 33%."

💬 Communicating Risk to Patients (AKT Favourite)

The AKT often tests how you would explain statistical information to patients. There are four main formats:

Format	Example	Best For
Natural frequency	"5 out of every 100 people"	Easiest for patients to understand
Percentage	"5% of people"	Widely used but can mislead
NNT	"Treat 20 to prevent 1 event"	Communicates absolute benefit clearly
Cates plot	Visual grid of 100 faces	Best visual aid for shared decision-making

🚨 Never Use RRR Alone When Communicating Risk

Saying "this reduces your risk by 33%" without giving the baseline risk is misleading. A 33% RRR from a baseline of 0.3% means your absolute benefit is 0.1%. Always pair RRR with baseline risk or give ARR/NNT instead.

💊 The Pharma Rep Trick — How Drug Companies Spin Statistics

Drug company representatives are trained to present trial data in the most favourable light. They use a simple but effective statistical sleight of hand:

Benefits → quoted as Relative Risk Reduction (RRR) — because it sounds bigger and more impressive
Harms → quoted as Absolute Risk Increase (ARI) — because it sounds smaller and less concerning

Worked Example — a fictional statin rep visit

A new statin reduces heart attacks from 2% to 1% over 5 years, but increases myopathy from 1% to 2%.

What the rep says	What it actually means
"This drug reduces heart attacks by 50%"	RRR = 50% — but ARR is only 1%. NNT = 100. Treat 100 people for 5 years to prevent 1 heart attack.
"The myopathy risk increases by only 1%"	ARI = 1% — but that's actually a doubling of the myopathy risk (Relative Risk Increase = 100%).

The honest version: "If 100 patients take this drug for 5 years, 1 extra person avoids a heart attack — and 1 extra person develops myopathy." Same data. Completely different impression.

The antidote: always ask — "What is the absolute difference?" When a rep quotes a relative figure, convert it yourself: ARR = CER − EER, then NNT = 1 ÷ ARR. This applies equally to benefits and harms.

Diagnostic Testing & Screening — The 2×2 Table

The 2×2 contingency table is the foundation of all diagnostic statistics. If you can build and read this table, you can answer most diagnostic AKT questions.

The 2×2 Table

	ACTUAL DISEASE STATUS
TEST RESULT	Disease Present	Disease Absent
Test Positive	True Positive (TP) Test says YES → has disease ✓	False Positive (FP) Test says YES → no disease ✗
Test Negative	False Negative (FN) Test says NO → has disease ✗	True Negative (TN) Test says NO → no disease ✓

Sensitivity

TP ÷ (TP + FN)

Proportion of people WITH disease who test positive. "How good at catching disease?"

Specificity

TN ÷ (TN + FP)

Proportion of people WITHOUT disease who test negative. "How good at correctly clearing people who are healthy?"

Positive Predictive Value (PPV)

TP ÷ (TP + FP)

Probability that a positive test truly means disease. Affected by prevalence.

Negative Predictive Value (NPV)

TN ÷ (TN + FN)

Probability that a negative test truly means no disease. Affected by prevalence.
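All four formulas come straight off the 2×2 table. A minimal sketch, with invented counts for 1,000 people tested (100 with disease, 900 without):

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from the four cells of a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),  # of those WITH disease, how many test positive
        "specificity": tn / (tn + fp),  # of those WITHOUT disease, how many test negative
        "ppv": tp / (tp + fp),          # of positive tests, how many truly have disease
        "npv": tn / (tn + fn),          # of negative tests, how many truly do not
    }

# Hypothetical counts, for illustration only.
m = diagnostic_metrics(tp=90, fp=45, fn=10, tn=855)
for name, value in m.items():
    print(f"{name}: {value:.1%}")
```

Notice that sensitivity and specificity read down the columns of the table (by true disease status), while PPV and NPV read across the rows (by test result).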

🧠 SnNout & SpPin — The Memory Aids (and Why They Work)

SnNout

Snout: A highly Sensitive test, if Negative, rules the disease Out.

If sensitivity is very high, a negative test result means the disease is very unlikely (low false-negative rate). Use a sensitive test to screen and rule out.

SpPin

Spin: A highly Specific test, if Positive, rules the disease In.

If specificity is very high, a positive test result means disease is very likely (low false-positive rate). Use a specific test to confirm diagnosis.

Think of sensitivity as a very sensitive smoke alarm: it goes off for everything, including burnt toast. It will almost never miss a real fire, so if it stays silent you can relax (SnNout). Specificity is a well-calibrated alarm that only triggers for actual fires: if it goes off, you can be confident there really is a fire (SpPin).

📊 The Effect of Prevalence on PPV & NPV — Critical AKT Topic

This is one of the most important and most tested concepts in diagnostic statistics. Sensitivity and specificity are fixed properties of the test. But PPV and NPV change dramatically depending on how common the disease is in the population you're testing.

Imagine you use the same pregnancy test in a fertility clinic (60% of women are pregnant) versus at a random GP surgery (5% of women could be pregnant). Same test — same sensitivity/specificity. But a positive result means something very different in each setting.

Scenario	Prevalence	PPV	NPV
Screening the general population for a rare disease (e.g. HIV in a low-risk population)	1%	LOW (~16%)	HIGH (~99.9%)
Testing in a high-risk specialist clinic	50%	HIGH (~95%)	HIGH (~95%)

🔴 Key Rules to Remember

As prevalence ↑ → PPV ↑, NPV ↓
As prevalence ↓ → PPV ↓, NPV ↑
In a low-prevalence population, most positive results are false positives (low PPV) — even with a highly specific test
A negative test in a high-prevalence population may still miss disease (lower NPV)
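The prevalence effect can be reproduced with a short calculation. The table above does not state the test's accuracy; the sketch below assumes 95% sensitivity and 95% specificity, which happens to give figures matching the table.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)

# Assumed test characteristics (not given in the text): 95% / 95%.
for prevalence in (0.01, 0.50):
    ppv, npv = predictive_values(0.95, 0.95, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

Same test, same sensitivity and specificity, yet at 1% prevalence roughly five in six positives are false.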

📐 Likelihood Ratios — When You Want to Go Further

Likelihood ratios (LRs) combine sensitivity and specificity into a single number that tells you how much a test result shifts the probability of disease. More advanced than PPV/NPV, but useful — and occasionally tested in AKT.

Positive Likelihood Ratio (LR+)

LR+ = Sensitivity ÷ (1 − Specificity)

How much more likely a positive test is in disease vs. no disease. LR+ >10 = strong evidence for disease.

Negative Likelihood Ratio (LR−)

LR− = (1 − Sensitivity) ÷ Specificity

How much more likely a negative test is in disease vs. no disease. LR− <0.1 = strong evidence against disease.

Likelihood Ratio (LR)	Effect on Post-Test Probability
>10	Large and often conclusive increase in probability
5–10	Moderate increase
2–5	Small increase
1	No change (test is useless)
0.1–0.5	Small to moderate decrease
<0.1	Large decrease — strong negative rule-out

Fagan's nomogram uses LRs graphically — you draw a line from pre-test probability through the LR to read off post-test probability. If you see this in the AKT, remember: a positive test shifts probability up, a negative test shifts it down.
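What Fagan's nomogram does graphically can be sketched in code: convert the pre-test probability to odds, multiply by the LR, and convert back. The odds conversion is standard arithmetic rather than something spelled out in this guide, and the test characteristics and 10% pre-test probability below are invented.

```python
def likelihood_ratios(sensitivity, specificity):
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg

def post_test_probability(pre_test_probability, lr):
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * lr          # the LR scales the odds, not the probability
    return post_odds / (1 + post_odds)

# Hypothetical test: 90% sensitive, 95% specific; pre-test probability 10%.
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.90, specificity=0.95)
after_positive = post_test_probability(0.10, lr_pos)
after_negative = post_test_probability(0.10, lr_neg)

print(f"LR+ {lr_pos:.1f}, LR- {lr_neg:.2f}")
print(f"After a positive test: {after_positive:.0%}")
print(f"After a negative test: {after_negative:.1%}")
```

A positive result moves the probability up and a negative result moves it down, exactly as the nomogram rule says.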

Population Statistics & Epidemiology

Incidence

(New cases ÷ Population at risk) × multiplier (e.g. per 1,000 or per 100,000)

Measures risk. Counts only NEW cases over a defined time period.

Point Prevalence

All existing cases ÷ Total population

Measures disease burden. Counts ALL cases (new and old) at a specific point in time (Time T). Also written as: Cases at Time T ÷ Population at Time T.

Incidence is the number of new students joining a school this year. Prevalence is the total number of students currently in the school (new + existing). A chronic disease with low incidence can still have high prevalence if people live with it for years.

📊 Standardised Mortality Ratio (SMR)

SMR Formula

(Observed deaths ÷ Expected deaths) × 100

Adjusts for age and sex when comparing mortality between populations.

SMR Value	Interpretation
= 100	Mortality same as reference population
> 100	Excess mortality (higher than expected)
< 100	Lower mortality than expected

AKT question type: "A mining community has an SMR of 145. What does this mean?" → Excess mortality — 45% more deaths than expected in a comparable general population.
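Where do "expected deaths" come from? In outline, you apply the reference population's age-specific death rates to the age structure of the group you are studying. The sketch below uses invented rates and population sizes, chosen so the answer matches the mining community example.

```python
# Hypothetical figures, invented for illustration only.
# Reference (general population) annual death rates by age band.
reference_rates = {"40-59": 0.004, "60-79": 0.020}
# Number of people in the study community in each age band.
study_population = {"40-59": 5000, "60-79": 2000}
observed_deaths = 87

# Expected deaths = what would happen if the community died at reference rates.
expected_deaths = sum(
    reference_rates[band] * study_population[band] for band in study_population
)
smr = observed_deaths / expected_deaths * 100

print(f"Expected deaths: {expected_deaths:.0f}, SMR: {smr:.0f}")
```

Because the expected figure is built from the community's own age structure, an older population does not automatically produce a high SMR.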

🎯 Screening — Lead Time & Length Bias (Key AKT Traps)

🔴 Lead Time Bias

Screening detects a disease earlier, making it appear that survival has improved — even if the patient dies at the same point in time. The "survival time from diagnosis" has simply been extended by early detection, not by actual treatment benefit. The patient doesn't live longer; they just know for longer.

🟠 Length Bias

Screening programmes are more likely to detect slow-growing, indolent disease (which has a longer "detectable preclinical phase") than aggressive disease that progresses quickly. This makes screening look more effective than it really is for severe disease.

Lead time bias: Imagine you know you're going to die at age 60. Without screening you find out aged 55; with screening you find out aged 40. Screening appeared to give you 15 extra years of survival from diagnosis — but you still die at 60.

✅ Wilson & Jungner Screening Criteria

Before a screening programme is introduced, it should satisfy the Wilson & Jungner criteria (originally published 1968, still the standard framework). The AKT tests both knowledge of these criteria and application of them to scenarios.

#	Criterion	What It Means In Practice
1	Important health problem	The condition has significant morbidity or mortality — worth the effort of screening
2	Accepted treatment available	No point detecting disease you cannot treat
3	Facilities for diagnosis & treatment exist	Infrastructure must be in place before launching
4	Recognisable latent or early stage	The disease must have a detectable pre-symptomatic phase
5	Suitable test available	Test must be acceptable to the population, safe, and reasonably accurate
6	Test acceptable to the population	People must be willing to undergo it — invasive or uncomfortable tests may deter uptake
7	Natural history adequately understood	Must know how the disease progresses if left untreated
8	Agreed policy on who to treat	Clear protocols needed — not just detection but what happens next
9	Cost-effective	Cost of finding each case must be balanced against benefit
10	Continuous process, not one-off	Screening must be ongoing — disease incidence continues

⚠️ AKT Application — Why PSA Screening Fails These Criteria

PSA screening for prostate cancer is not part of the NHS national screening programme precisely because it struggles with criteria 5 and 7: PSA is not a sufficiently accurate test (low specificity → many false positives), and the natural history of many low-grade prostate cancers means they would never cause symptoms in the patient's lifetime. This links directly to overdiagnosis (see below).

Memory aid: think of the criteria as three groups — the disease (important, understood, has a latent stage), the test (suitable, acceptable, accurate), the system (treatment exists, facilities exist, policy agreed, cost-effective, continuous).

🔍 Overdiagnosis — Finding Problems That Wouldn't Have Caused Problems

Overdiagnosis occurs when a real disease is detected — one that truly exists — but that disease would never have caused symptoms, harm, or death during the patient's lifetime if left undetected. It is not a false positive (the disease is real); it is a true positive that did not need to be found.

🔴 Why It Matters

Overdiagnosis converts well people into patients. It exposes them to the anxiety, side effects, and risks of treatment for a condition that would never have harmed them. It is one of the most important harms of screening programmes — and one the AKT tests directly.

Condition	Overdiagnosis Example
Prostate cancer	Many low-grade cancers detected by PSA would never progress or cause symptoms — men die with them, not from them
Thyroid cancer	Ultrasound finds tiny papillary thyroid cancers that are almost universally indolent — detection has soared but mortality unchanged
DCIS (breast)	Ductal carcinoma in situ detected by mammography — some would never become invasive cancer

Overdiagnosis is like finding a tiny crack in a wall that would never have caused the house to fall — but now you've spotted it, you feel compelled to fix it. The crack was real; the harm from finding it came from the treatment, not the crack.

Overdiagnosis vs Overtreatment

Overdiagnosis = finding a disease that didn't need finding. Overtreatment = treating a disease that didn't need treating (which may follow overdiagnosis, or may occur independently). They are related but distinct concepts.

🔢 Age Standardisation

When comparing disease rates or mortality between different populations (e.g. different countries, different time periods), you need to adjust for the fact that those populations may have different age distributions. Older populations will naturally have higher mortality even if their health is equally good.

Key Concept

Age standardisation is a statistical technique that removes the distorting effect of different age distributions when comparing health outcomes between populations. It produces a rate that would be observed if both populations had the same age structure (the "standard population").
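A small sketch of the idea, with invented figures: two towns have identical age-specific death rates, but one has an older population. The crude rates differ sharply; once both are weighted to the same standard population, the difference disappears. The method shown is direct standardisation, and all names and numbers are hypothetical.

```python
# Hypothetical age-specific death rates per 1,000 per year (identical in both towns).
rates_town_a = {"0-39": 1.0, "40-64": 5.0, "65+": 40.0}
rates_town_b = {"0-39": 1.0, "40-64": 5.0, "65+": 40.0}

# Each town's own age structure (share of residents in each band).
structure_a = {"0-39": 0.60, "40-64": 0.30, "65+": 0.10}  # younger town
structure_b = {"0-39": 0.40, "40-64": 0.30, "65+": 0.30}  # older town

# One agreed standard population, applied to both towns.
standard = {"0-39": 0.50, "40-64": 0.30, "65+": 0.20}

def weighted_rate(rates, weights):
    """Overall rate when age-specific rates are weighted by an age structure."""
    return sum(rates[band] * weights[band] for band in rates)

crude_a = weighted_rate(rates_town_a, structure_a)
crude_b = weighted_rate(rates_town_b, structure_b)
standardised_a = weighted_rate(rates_town_a, standard)
standardised_b = weighted_rate(rates_town_b, standard)

print(f"Crude rates:        A {crude_a:.1f}, B {crude_b:.1f} per 1,000")
print(f"Standardised rates: A {standardised_a:.1f}, B {standardised_b:.1f} per 1,000")
```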

🎗️ Cancer Statistics — AKT Must-Knows

The AKT occasionally tests specific cancer statistics. You do not need exhaustive oncology knowledge — but these headline figures come up and are worth knowing.

Cancer	Key Statistic	Why It Matters
Testicular cancer	>98% 10-year survival	One of the most treatable cancers — important to know for counselling young men
Lung cancer	Leading cause of cancer death (UK)	Despite not being the most common cancer, it kills more people than any other

⚠️ Survival Rate vs Mortality Rate — Don't Confuse These

A cancer can have a high incidence (common) but low mortality (treatable), like breast cancer. Or it can have a lower incidence but very high mortality, like pancreatic cancer. The AKT may test your ability to interpret cancer statistics correctly — don't assume the most common cancer is the deadliest.

⚖️ Health Inequalities

Health inequalities describe unfair, avoidable differences in health between different groups of people. They are a significant focus of UK public health policy and appear in the AKT in the context of epidemiology and social determinants of health.

Definition

Health inequalities are unfair, avoidable differences in health status or in the distribution of health determinants between different population groups. They are "avoidable" because they stem from social, economic, or environmental conditions that could in principle be changed — not from random chance or biological variation.

Type	Examples in the UK
Socioeconomic	Lower life expectancy in deprived areas; higher rates of cardiovascular disease, diabetes, and mental illness in poorer communities
Geographic	"North-South divide" — poorer health outcomes in parts of Northern England compared to the South
Ethnic	Higher rates of type 2 diabetes in South Asian populations; higher cardiovascular risk in Black populations
Gender	Men have lower life expectancy but women have more years of ill health (morbidity)

In the AKT, health inequalities often appear as questions about the social determinants of health or about interpreting population data that shows differences between groups. Key tool: the Marmot Review (Fair Society, Healthy Lives) and the concept of proportionate universalism — universal services delivered with more intensity to those with greater need.

Data Distribution & Statistical Significance

Measures of Central Tendency & Spread

Measure	Definition	Best Used When	Watch Out
Mean	Sum of all values ÷ number of values	Data is normally distributed	Easily skewed by outliers
Median	Middle value when sorted in order. If there is an even number of values, the median is the average of the two middle numbers.	Skewed data (e.g. income, hospital stay length)	Ignores the actual values at extremes
Mode	Most frequently occurring value	Categorical data; bimodal distributions	Can be meaningless with continuous data
Range	Maximum − Minimum	Quick sense of spread	Entirely determined by outliers
IQR (Interquartile Range)	75th percentile − 25th percentile	Paired with median for skewed data	Ignores upper and lower 25%
Standard Deviation (SD)	Average spread from the mean	Normally distributed data	Misleading if data is not normally distributed
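To see how one outlier affects each measure in the table, here is a sketch using Python's standard statistics module on an invented set of hospital stays. Note two assumptions: stdev() gives the sample SD, and quartiles can be calculated by slightly different conventions, so other software may report a marginally different IQR.

```python
import statistics

# Hypothetical lengths of stay in days (invented); one long stay skews the data.
stays = [2, 3, 3, 4, 5, 6, 40]

mean = statistics.mean(stays)
median = statistics.median(stays)
mode = statistics.mode(stays)
value_range = max(stays) - min(stays)
q1, _, q3 = statistics.quantiles(stays, n=4)  # the three quartile cut points
iqr = q3 - q1
sd = statistics.stdev(stays)                  # sample standard deviation

print(f"mean {mean}, median {median}, mode {mode}")
print(f"range {value_range}, IQR {iqr}, SD {sd:.1f}")
```

The single 40-day stay drags the mean to more than double the median and inflates the range and SD, while the median and IQR barely notice it.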

🗂️ Types of Data — Nominal, Ordinal, Interval, Ratio

Understanding what type of data you have determines which summary statistics and which statistical tests are appropriate. The AKT tests this — usually by presenting a dataset and asking which test or measure to use.

Type	Definition	Examples	Analogy
Nominal	Categories with no natural order	Blood group (A, B, AB, O); sex; eye colour; cause of death	A fruit bowl — apples and bananas are just different, neither is "more"
Ordinal	Ordered categories, but the gaps between them are not necessarily equal	NYHA heart failure class (I–IV); pain scale (mild/moderate/severe); Likert scales	Race positions — 1st, 2nd, 3rd. We know the order, but the gap between 1st and 2nd may be very different to the gap between 2nd and 3rd
Interval	Ordered with equal gaps between values, but no true zero	Temperature in °C; calendar dates; IQ scores	A thermometer — 0°C doesn't mean "no temperature." You can't say 20°C is "twice as warm" as 10°C in any absolute sense
Ratio	Ordered, equal gaps, AND a true absolute zero	Height, weight, blood pressure, age, income, drug dose	Money — £0 means you genuinely have nothing. £40 is twice as much as £20

💡 Why It Matters for the AKT

Nominal/Ordinal data → use non-parametric tests; summarise with the mode (nominal) or the median (ordinal)
Interval/Ratio data (normally distributed) → can use mean, SD, parametric tests
You cannot meaningfully calculate a mean for nominal data (e.g. "mean blood group" is nonsense) or make proportional statements with interval data (you can't say someone with an IQ of 120 is "twice as clever" as someone with 60)

🔬 Parametric vs Non-Parametric Tests

Statistical tests fall into two families depending on the assumptions they make about your data. The AKT tests which type of test is appropriate for a given scenario — you do not need to perform the calculations, just know when to use which.

The Core Distinction

Parametric tests assume the data is normally distributed (or the sample is large enough that this doesn't matter much) and that the data is at least interval-level. Non-parametric tests make no such assumptions — they work with ranks or categories and are suitable for skewed data, small samples, or ordinal/nominal data.

Parametric tests are like a recipe that only works if your oven is exactly the right temperature. Non-parametric tests are more forgiving — they work even if the oven runs a bit hot or cold.

Purpose	Parametric Test	Non-Parametric Equivalent
Compare means of 2 independent groups	Independent t-test	Mann-Whitney U test
Compare means: same group, 2 time points	Paired t-test	Wilcoxon signed-rank test
Compare means of 3 or more groups	ANOVA (Analysis of Variance)	Kruskal-Wallis test
Correlation between two continuous variables	Pearson correlation	Spearman rank correlation
Compare proportions / categorical data	—	Chi-squared test (χ²)

⚠️ AKT Decision Rules

Data is skewed or not normally distributed → use non-parametric
Small sample size (and normality uncertain) → use non-parametric
Data is ordinal (e.g. pain scores, Likert) → use non-parametric
Comparing proportions or categories → chi-squared test
Large sample, continuous, approximately normal → parametric test is fine

🔔 Normal Distribution — The 68-95-99.7 Rule

A normal distribution is a symmetrical bell-shaped curve. The mean, median, and mode are all equal and sit at the centre.

Range	% of Values Included
Mean ± 1 SD	68%
Mean ± 2 SD	95%
Mean ± 3 SD	99.7%

AKT application: Reference ranges for blood tests are usually defined as mean ± 2 SD — meaning 5% of normal healthy people will fall "outside the normal range." This is why a mildly abnormal result in an asymptomatic patient often doesn't need treatment.
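The 68-95-99.7 figures are rounded. A quick check with Python's standard statistics.NormalDist shows the underlying values, and why textbooks sometimes quote 1.96 SD rather than 2 SD for the 95% range.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

coverage = {k: share_within(k) for k in (1, 2, 3)}
z_for_95 = standard_normal.inv_cdf(0.975)  # SDs needed to capture exactly 95%

for k, share in coverage.items():
    print(f"mean ± {k} SD covers {share:.1%}")
print(f"Exactly 95% lies within ± {z_for_95:.2f} SD")
```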

⚠️ Skewed Data

With positively skewed data (e.g. income, GP waiting times, serum bilirubin in a ward), the mean is pulled to the right by a few high outliers. In this case, use the median — it better represents the typical value. The AKT loves testing this distinction.

📉 P-Values & Confidence Intervals — What They Really Mean

P-Values

The p-value is the probability of observing results at least as extreme as those seen if the null hypothesis is true (i.e. if there is no real effect). It is not the probability that the null hypothesis is correct.

p-value	Interpretation
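p < 0.05	Statistically significant: if there were no real effect, a result at least this extreme would be expected less than 5% of the time
p > 0.05	Not statistically significant: the data are compatible with chance alone (this does not prove there is no effect)
p = 0.01	1% chance of getting a result at least this extreme if there is no real effect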

Confidence Intervals (CIs)

A 95% CI means: if you repeated the study 100 times and calculated a CI each time, about 95 of those intervals would contain the true population value.

A CI is like a weather forecast saying "expect between 14°C and 22°C tomorrow." It doesn't mean the temperature will definitely be in that range — but you're 95% confident it will be.

🔴 The Most Tested Rule — Does the CI Cross the Magic Number?

For a ratio (RR, OR, HR): if 95% CI includes 1.0 → not statistically significant
For a difference (mean difference, ARR): if 95% CI includes 0 → not statistically significant
If the CI is entirely above 1 (for ratios) → significantly increased risk
If the CI is entirely below 1 (for ratios) → significantly reduced risk
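The "does it cross 1?" rule can be sketched with a relative risk and its 95% CI. The calculation below uses one common approximation (working on the log scale); the trial sizes are invented, and the function names are illustrative. Both hypothetical trials have the same event rates (4% vs 6%), only the sample size differs.

```python
import math

def relative_risk_ci(events_treated, n_treated, events_control, n_control, z=1.96):
    """Relative risk with an approximate 95% CI calculated on the log scale."""
    rr = (events_treated / n_treated) / (events_control / n_control)
    se_log_rr = math.sqrt(
        1 / events_treated - 1 / n_treated + 1 / events_control - 1 / n_control
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

def is_significant(lower, upper, null_value=1.0):
    """Significant only if the interval excludes the value meaning 'no effect'."""
    return not (lower <= null_value <= upper)

small = relative_risk_ci(4, 100, 6, 100)          # hypothetical small trial
large = relative_risk_ci(400, 10000, 600, 10000)  # hypothetical large trial

for label, (rr, lo, hi) in (("small", small), ("large", large)):
    verdict = "significant" if is_significant(lo, hi) else "not significant"
    print(f"{label} trial: RR {rr:.2f} (95% CI {lo:.2f} to {hi:.2f}) -> {verdict}")
```

The point estimate is identical in both trials. Only the larger trial has a CI narrow enough to sit entirely below 1.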

❌ Type I & Type II Errors — Crying Wolf vs Missing the Wolf

Type I Error (α — Alpha)

Rejecting the null hypothesis when it is actually true. In other words: concluding that a treatment works when it doesn't. The false positive rate. Conventionally acceptable at 5% (α = 0.05, p < 0.05).

"Crying wolf" — sounding the alarm when there's no wolf.

Type II Error (β — Beta)

Failing to reject the null hypothesis when it is actually false. In other words: missing a true treatment effect. The false negative rate. Conventionally acceptable at 20% (β = 0.2, power = 80%).

"Missing the wolf" — saying no wolf when there is one.

Statistical Power

Power = 1 − β. It is the probability of correctly detecting a true effect. Higher power = less likely to miss a real effect. Power increases with larger sample sizes. By convention, trials are designed to have at least 80% power.

⚡ Statistical Significance ≠ Clinical Importance

This is one of the most important and most tested nuances in AKT statistics — and one that many trainees miss.

🔴 The Core Rule

A result can be statistically significant (p < 0.05) while being clinically meaningless. Statistical significance only tells you the result is unlikely to be due to chance — it says nothing about whether the effect is large enough to matter in practice.

A trial of 50,000 patients finds that a new antihypertensive reduces systolic BP by 1 mmHg. p = 0.001. Highly significant — but would you change your prescribing for a 1 mmHg drop? No. The p-value is tiny because the sample is enormous, not because the effect is important.

Scenario	Statistically Significant?	Clinically Important?
BP drops 1 mmHg, n=50,000, p=0.001	Yes	No
BP drops 15 mmHg, n=30, p=0.08	No	Probably yes
BP drops 12 mmHg, n=500, p=0.02	Yes	Yes

💡 What To Use Instead

Always look at the effect size (ARR, NNT, mean difference) alongside p-values and CIs. A narrow confidence interval that excludes zero or one is more meaningful than a p-value alone — it tells you both the direction and the precision of the effect.

📏 Confidence Interval Width — Precision at a Glance

Narrow CI → precise estimate → high confidence the true value is close to the point estimate (usually from a large sample)
Wide CI → uncertain estimate → true value could be anywhere in a broad range (usually from a small or heterogeneous sample)

Example: RR = 1.5 (95% CI 1.4–1.6) → precise, convincing. RR = 1.5 (95% CI 0.6–3.8) → wide, uncertain — and crossing 1.0 so not even significant.

📈 Regression to the Mean — The Hidden Confounder of Clinical Practice

Regression to the mean is the statistical tendency for an extreme measurement to be closer to the average on a second measurement — regardless of any intervention. It is one of the most under-recognised sources of misleading conclusions in medicine.

🔴 Why It Matters Clinically

Patients are often investigated or treated precisely when their symptoms or measurements are at their worst. Natural variation means those measurements are likely to improve anyway on re-testing — not necessarily because of your intervention. Without a control group, it is impossible to distinguish regression to the mean from genuine treatment effect.

Scenario	What Looks Like Treatment Effect	What May Actually Be Happening
Patient has very high BP on one reading → started on medication → BP lower at next visit	Drug is working	First reading may have been an outlier; BP would have been lower anyway on repeat measurement
Pupil scores very poorly on a test → gets extra tuition → scores better next time	Tuition helped	The poor score may have been unrepresentative; natural performance tends toward their average
Patient with severe flare of eczema starts a new cream → flare improves	Cream is effective	Severe flares naturally improve over time regardless of treatment

Imagine you only record a golfer's score on the day they play their worst ever round. Next time out they will almost certainly score better, with or without coaching. The golfer didn't change; you simply caught them at an extreme. That's regression to the mean.

💡 How to Protect Against It

Take multiple baseline measurements and use the average before starting treatment
Use a control group — regression to the mean affects both groups equally, so any difference between groups is more likely to be a real treatment effect
This is one of the strongest arguments for RCTs over uncontrolled before-and-after studies
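Regression to the mean can be demonstrated with a small simulation in which nobody is treated at all. Everything here is invented for illustration: the population average, the amount of visit-to-visit variation, and the 160 mmHg cut-off.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch gives the same result each run

TRUE_MEAN_BP = 140      # hypothetical average systolic BP
BETWEEN_PERSON_SD = 10  # how much people's usual BP differs
MEASUREMENT_SD = 10     # visit-to-visit variation around each person's usual BP

usual_bp = [random.gauss(TRUE_MEAN_BP, BETWEEN_PERSON_SD) for _ in range(10_000)]
first = [bp + random.gauss(0, MEASUREMENT_SD) for bp in usual_bp]
second = [bp + random.gauss(0, MEASUREMENT_SD) for bp in usual_bp]

# No treatment given: just pick out those whose FIRST reading was very high.
high = [i for i, reading in enumerate(first) if reading >= 160]

mean_first = statistics.mean(first[i] for i in high)
mean_second = statistics.mean(second[i] for i in high)

print(f"{len(high)} people had a first reading of 160 or more")
print(f"Their mean first reading:  {mean_first:.1f}")
print(f"Their mean second reading: {mean_second:.1f}")
```

The selected group's second reading is clearly lower on average, even though nothing was done to them. An uncontrolled before-and-after study would credit that drop to whatever treatment was started in between.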

Statistical Graphs — What to Look For

The AKT regularly presents graphs and asks you to interpret them. Learn to spot the key feature in each graph type — do not try to read everything. One targeted observation is all that's needed.

Graph Type	Primary Use	The One Thing to Look For
Forest Plot	Meta-analysis results	Does the diamond cross the vertical line of no effect?
Funnel Plot	Publication bias detection / GP outlier monitoring	Is the plot asymmetrical? (Gap = missing unpublished studies)
Cates Plot	Communicating NNT visually to patients	Count the coloured "benefit" faces; NNT = 100 ÷ those faces
L'Abbé Plot	Exploring heterogeneity in meta-analyses	Dot on the diagonal line = zero treatment effect
Box-and-Whisker Plot	Data distribution and spread	Middle line = median (not mean); dots beyond whiskers = outliers
Fagan's Nomogram	Pre-test → post-test probability	Draw a line from pre-test probability through LR to read post-test probability
Stem-and-Leaf Plot	Distribution — preserves original values	Like a histogram but shows individual data points
Kaplan-Meier Curve	Survival analysis — time to event	Steps drop when events occur; curves that diverge early and stay apart suggest sustained treatment benefit
Histogram	Distribution of continuous data	Shape of curve: symmetrical = normal distribution; skewed = use median not mean
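
The arithmetic that Fagan's nomogram performs graphically can be sketched in a few lines. This is a minimal illustration, assuming the standard odds form of Bayes' theorem (post-test odds = pre-test odds × likelihood ratio). The function name and the example figures (pre-test probability 20%, positive likelihood ratio 6) are hypothetical.

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Convert a pre-test probability to a post-test probability via odds."""
    if not 0 < pre_test_prob < 1:
        raise ValueError("pre_test_prob must be strictly between 0 and 1")
    pre_test_odds = pre_test_prob / (1 - pre_test_prob)  # probability -> odds
    post_test_odds = pre_test_odds * likelihood_ratio    # apply the likelihood ratio
    return post_test_odds / (1 + post_test_odds)         # odds -> probability


# Hypothetical example: pre-test probability 20%, positive likelihood ratio 6
print(f"{post_test_probability(0.20, 6):.0%}")  # prints 60%
```

A likelihood ratio of 1 leaves the probability unchanged, which matches a horizontal line drawn across the nomogram.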

🌲 Forest Plots — In Detail

How to Read a Forest Plot

Each square = one study. The size of the square = the study's weight (usually driven by sample size)
Horizontal lines = 95% confidence interval for that study
The diamond at the bottom = the pooled overall estimate. Its width = the pooled 95% CI
Vertical line at 1.0 = "line of no effect" (for ratios). If a CI line or the diamond crosses this line, that result is not statistically significant

Heterogeneity — The I² Statistic

I² is the percentage of the variation between study results that is due to true heterogeneity (real differences between studies) rather than chance.
I² < 25% = low heterogeneity ✓
I² = 25–50% = moderate heterogeneity
I² > 50% = substantial heterogeneity — pooled result is less reliable ⚠️

Think of the diamond as the jury's final verdict. If the diamond crosses the vertical "line of no effect," the jury is hung — no definitive conclusion. The I² statistic tells you whether the jurors even agreed on the same case.
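
The two checks described above (does the pooled CI cross the line of no effect, and how large is I²?) can be written as a short sketch. The function names and the example confidence intervals are hypothetical. The I² bands simply reuse the thresholds quoted in this section.

```python
def crosses_no_effect(ci_lower, ci_upper, null_value=1.0):
    """True if the confidence interval includes the line of no effect.

    null_value is 1.0 for ratios (RR, OR, HR) and 0 for differences.
    """
    return ci_lower <= null_value <= ci_upper


def heterogeneity_band(i_squared_percent):
    """Band I² using the thresholds quoted in this section."""
    if i_squared_percent < 25:
        return "low"
    if i_squared_percent <= 50:
        return "moderate"
    return "substantial"


def read_forest_plot(ci_lower, ci_upper, i_squared_percent, null_value=1.0):
    """Summarise the diamond: significance plus heterogeneity band."""
    return {
        "statistically_significant": not crosses_no_effect(ci_lower, ci_upper, null_value),
        "heterogeneity": heterogeneity_band(i_squared_percent),
    }


# Hypothetical pooled results (risk ratios with 95% CI)
print(read_forest_plot(0.62, 0.91, 18))  # diamond clear of 1.0, low I²
print(read_forest_plot(0.85, 1.10, 72))  # diamond crosses 1.0, substantial I²
```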

📯 Funnel Plots — Publication Bias & GP Outlier Monitoring

Funnel plots appear in two entirely different contexts in the AKT — make sure you recognise both.

In Meta-Analysis (Publication Bias)

Plots effect size (x-axis) against study precision or size (y-axis). A symmetric inverted funnel = no bias. An asymmetric funnel with a gap at the bottom-left = missing small negative studies → publication bias.

In Practice Performance Monitoring

Compares GP practices on metrics (e.g. referral rates, mortality). Data points outside the funnel lines are statistical outliers warranting investigation — not necessarily proof of poor performance.

😊 Cates Plots — How to Extract NNT Visually

A Cates plot (sometimes called a "smiley face plot") uses a grid of 100 faces to help patients visualise absolute benefit and harm.

Each of the 100 faces represents one person per 100 treated
Yellow/green faces = people who benefited (events prevented)
Red faces = people who experienced harm
Grey/plain faces = no effect either way

NNT from Cates Plot

NNT = 100 ÷ number of benefit faces

Example: 4 yellow faces → NNT = 100 ÷ 4 = 25
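
As a quick check of the counting rule, here is a minimal sketch. The function name is hypothetical, and rounding up when the division is not exact is an assumption carried over from this page's general "always round NNT up" rule.

```python
def nnt_from_cates(benefit_faces, total_faces=100):
    """NNT from a Cates plot: total faces divided by benefit faces, rounded up."""
    if benefit_faces <= 0:
        raise ValueError("No benefit faces, so NNT is not defined")
    return -(-total_faces // benefit_faces)  # ceiling division on integers


print(nnt_from_cates(4))  # 25
print(nnt_from_cates(5))  # 20
print(nnt_from_cates(3))  # 34 (33.3 rounded up)
```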

📉 Kaplan-Meier Survival Curves — How to Read Them

Kaplan-Meier (KM) curves show the probability of surviving (or remaining event-free) over time. They appear in AKT questions about cancer trials, cardiovascular studies, and any research tracking time to an event.

Feature	What It Means
The Y-axis	Probability of survival (or event-free survival) — starts at 1.0 (100%) and falls over time
The X-axis	Time (days, months, years)
Each downward step	An event has occurred (e.g. death, relapse). Larger steps = more events at that point, or fewer patients still at risk
Tick marks on the line	Censored patients — lost to follow-up or study ended. Not events.
Two curves diverging early and staying apart	Suggests sustained treatment benefit throughout follow-up
Curves crossing	Suggests hazard is not proportional over time — complicates interpretation
Median survival	The time point where the survival curve crosses 50% — the point where half the patients have had the event

Link to Hazard Ratio

The log-rank test compares two KM curves statistically. The hazard ratio (HR) summarises the overall difference between curves. HR < 1 = treatment group has fewer events over time. If the curves overlap substantially, the HR will be close to 1 (no benefit).

Think of a KM curve as a staircase going down. Each step down is someone having the event. A treatment that works keeps people on the higher steps for longer — the treated group's staircase descends more slowly.
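
To see where the steps come from, here is a minimal sketch of the product-limit (Kaplan-Meier) calculation. The follow-up times and event flags are made-up illustrative data, and the function names are hypothetical. At each event time the running survival probability is multiplied by the proportion of at-risk patients who did not have the event, and censored patients simply leave the at-risk pool without causing a step.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each time where an event occurred.

    times  : follow-up time for each patient
    events : 1 if the event happened at that time, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            # A step down: multiply by the proportion surviving this time point
            survival *= (at_risk - n_events) / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving  # events and censored patients both leave the risk set
    return curve


def median_survival(curve):
    """First time at which survival is 50% or lower (None if never reached)."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Illustrative data only: follow-up in months, 1 = event, 0 = censored
times = [2, 3, 3, 5, 6, 8, 9, 12]
events = [1, 1, 0, 1, 0, 1, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"month {t}: survival {s:.3f}")
print("median survival:", median_survival(curve), "months")
```

With these made-up data the curve first reaches 50% or below at month 8, which is the median survival read off the plot.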

📊 Histograms — Distribution at a Glance

A histogram displays the distribution of a continuous variable (e.g. age, blood pressure, BMI) by grouping values into intervals (bins) and showing how many observations fall in each. Unlike a bar chart, the bars touch — because the data is continuous.

Shape	Meaning	Use Mean or Median?
Symmetrical bell shape	Normal distribution — mean, median, mode all equal	Either — but conventionally mean ± SD
Right (positive) skew	Long tail to the right — a few very high values pulling mean up	Median (more representative)
Left (negative) skew	Long tail to the left — a few very low values pulling mean down	Median (more representative)
Bimodal (two peaks)	Two distinct subgroups in the data	Report both peaks separately

AKT tip: if a histogram shows positive skew, the correct measure of central tendency to quote is the median. If it shows a normal distribution, the mean is appropriate.
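
A tiny example shows why the median is preferred for skewed data. The length-of-stay figures below are hypothetical, chosen so that one long admission creates a right (positive) skew.

```python
import statistics


def summarise_centre(values):
    """Return the mean and median so the two can be compared."""
    return {"mean": statistics.mean(values), "median": statistics.median(values)}


# Hypothetical hospital stays in days: one long admission skews the data to the right
length_of_stay_days = [1, 2, 2, 3, 3, 4, 30]
summary = summarise_centre(length_of_stay_days)
print(f"mean = {summary['mean']:.1f} days, median = {summary['median']} days")
# prints: mean = 6.4 days, median = 3 days
```

The single 30-day stay more than doubles the mean, while the median still describes the typical patient.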

Quality Improvement & Clinical Audit

Tool	Purpose	Key Features
Clinical Audit	Measure practice against explicit standards; identify and close gaps	Requires a defined standard. Uses PDSA cycle. Involves closing the loop (re-audit). Not research — no hypothesis, no new knowledge generated.
Significant Event Analysis (SEA)	Systematic multidisciplinary review of a single significant event	Blame-free. Focuses on system learning, not individual fault. Shared within the team. Documents learning and action.
Root Cause Analysis (RCA)	In-depth investigation of serious incidents	More structured than SEA. Uses "5 Whys" technique. Identifies contributory and root causes. Often for never events or serious harm.
QOF Exception Reporting	Appropriate exclusion of patients from QOF indicators	Clinically appropriate for: maximal tolerated therapy, informed patient dissent, extreme frailty, or clinical contraindication.

🔄 Clinical Audit vs Research — Key Differences

Feature	Clinical Audit	Research
Purpose	Improve care by comparing with standards	Generate new knowledge
Hypothesis	None — compares against existing standard	Always has a hypothesis
Ethics approval	Usually not required	Usually required
Consent	Usually not required	Usually required
Randomisation	Never	May include RCTs
End result	Service improvement; action plan	New evidence; publication

AKT trap: "A GP wants to investigate whether a new antibiotic achieves better cure rates than standard treatment in his practice." → This is research (generates new knowledge, has a hypothesis, needs ethics). "A GP measures whether his practice's asthma review rate meets the NICE standard of 80%." → This is audit.

🔁 PDSA Cycle

The Plan-Do-Study-Act (PDSA) cycle is the continuous improvement framework used in clinical audit and quality improvement.

Stage	What Happens
Plan	Define the question, set the standard, plan data collection
Do	Implement the change or measure current practice
Study	Analyse results; compare against the standard; identify gaps
Act	Implement improvements; plan re-audit to close the loop

Clinical Calculations (AKT Formula Bank)

These formulas crop up in AKT calculation questions. Learn each one and know the clinical cut-offs that trigger action.

Body Mass Index (BMI)

Weight (kg) ÷ Height² (m²)

Underweight <18.5 | Normal 18.5–24.9 | Overweight 25–29.9 | Obese ≥30 | Morbidly obese ≥40

Ankle-Brachial Pressure Index (ABPI)

Highest ankle systolic ÷ Highest brachial systolic

Normal: 1.0–1.3 | PAD: <0.9 | Severe PAD: <0.5 | Non-compressible (calcified): >1.3

Alcohol Units

(Volume in ml × ABV%) ÷ 1000

e.g. 750ml × 12% ÷ 1000 = 9 units. Low risk: ≤14 units/week for both sexes. No safe level in pregnancy.

Non-HDL Cholesterol

Total Cholesterol − HDL Cholesterol

Target: <2.5 mmol/L in those on statin therapy (NICE guidance). Includes all atherogenic lipoproteins.

Paediatric Drug Dose Volume

(Desired dose ÷ Available strength) × Volume

e.g. Child needs 250mg, suspension is 125mg/5ml: (250 ÷ 125) × 5ml = 10ml

Percentage Weight Loss (Neonates)

[(Birth wt − Current wt) ÷ Birth wt] × 100

>10% loss in first 5 days requires review. Normal to lose up to 10%; should regain by day 10–14.

Cockcroft-Gault (CrCl)

[(140 − Age) × Weight (kg) × K] ÷ Serum creatinine (µmol/L)

K = 1.23 (men), 1.04 (women). Estimates creatinine clearance in ml/min; used for drug dosing in renal impairment.
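
Three of the formulas above are sketched below as plain functions, for checking your own arithmetic rather than for clinical use. The function names and the example patients are hypothetical. The BMI bands follow the cut-offs listed above, with values between 24.9 and 25 treated as Normal (an assumption, since the listed bands leave that gap).

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2


def bmi_category(value):
    """Band a BMI value using the cut-offs listed in this section."""
    if value < 18.5:
        return "Underweight"
    if value < 25:
        return "Normal"
    if value < 30:
        return "Overweight"
    if value < 40:
        return "Obese"
    return "Morbidly obese"


def cockcroft_gault(age_years, weight_kg, creatinine_umol_l, sex):
    """Estimated creatinine clearance (ml/min); K = 1.23 for men, 1.04 for women."""
    k = {"male": 1.23, "female": 1.04}[sex]
    return (140 - age_years) * weight_kg * k / creatinine_umol_l


def neonatal_weight_loss_percent(birth_weight_g, current_weight_g):
    """Percentage of birth weight lost."""
    return (birth_weight_g - current_weight_g) / birth_weight_g * 100


# Hypothetical examples
print(f"BMI {bmi(70, 1.75):.1f} -> {bmi_category(bmi(70, 1.75))}")
print(f"CrCl {cockcroft_gault(80, 60, 100, 'female'):.1f} ml/min")
print(f"Weight loss {neonatal_weight_loss_percent(3500, 3150):.1f}%")
```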

🧮 ABPI Worked Example

🎯 Scenario: Mrs T, 72, with leg pain on walking

1 Ankle pressure (highest of DP/PT): 90 mmHg | Brachial pressure (highest arm): 130 mmHg

2 ABPI = 90 ÷ 130 = 0.69

3 ABPI 0.69 = Peripheral Arterial Disease confirmed (cut-off <0.9)

4 Compression bandaging is CONTRAINDICATED — refer to vascular surgery
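
The same worked example can be reproduced in code. This is a sketch using the cut-offs from the formula bank above. The function names are hypothetical, and values from 0.9 to 1.3 are simply reported as not meeting the PAD cut-off.

```python
def abpi(highest_ankle_systolic, highest_brachial_systolic):
    """Ankle-brachial pressure index from the highest systolic readings (mmHg)."""
    return highest_ankle_systolic / highest_brachial_systolic


def interpret_abpi(value):
    """Apply the cut-offs quoted in the formula bank."""
    if value > 1.3:
        return "Non-compressible (calcified) vessels"
    if value < 0.5:
        return "Severe PAD"
    if value < 0.9:
        return "PAD"
    return "No PAD by the <0.9 cut-off"


# Mrs T from the scenario above: ankle 90 mmHg, brachial 130 mmHg
value = abpi(90, 130)
print(f"ABPI = {value:.2f} -> {interpret_abpi(value)}")  # ABPI = 0.69 -> PAD
```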

🍷 Alcohol Unit Worked Examples

Drink	Volume (ml)	ABV%	Calculation	Units
Large glass wine	250ml	13%	250 × 13 ÷ 1000	3.25
Bottle of wine	750ml	12%	750 × 12 ÷ 1000	9.0
Pint lager (strong)	568ml	5%	568 × 5 ÷ 1000	2.84
Single spirit measure	25ml	40%	25 × 40 ÷ 1000	1.0

Low-risk guidance: ≤14 units/week (men and women), spread across 3+ days, with 2–3 alcohol-free days per week. This is what you'd discuss in a consultation about alcohol reduction.
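
The table can be reproduced with a one-line function, and the same function makes a weekly total easy to check against the 14-unit guidance. The function names and the "one strong pint every evening" scenario are hypothetical.

```python
def alcohol_units(volume_ml, abv_percent):
    """UK alcohol units: (volume in ml x ABV%) / 1000."""
    return volume_ml * abv_percent / 1000


def exceeds_low_risk(weekly_units, limit=14):
    """True if a weekly total is above the low-risk guidance of 14 units."""
    return weekly_units > limit


drinks = {
    "Large glass wine": (250, 13),
    "Bottle of wine": (750, 12),
    "Pint lager (strong)": (568, 5),
    "Single spirit measure": (25, 40),
}
for name, (volume_ml, abv) in drinks.items():
    print(f"{name}: {alcohol_units(volume_ml, abv):.2f} units")

# Hypothetical week: one pint of strong lager every evening
weekly_total = 7 * alcohol_units(568, 5)
print(f"Weekly total: {weekly_total:.2f} units, above guidance: {exceeds_low_risk(weekly_total)}")
```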

🧮

Worked Examples

🎯 Example 1: Calculating All Risk Metrics from a Trial Table

An RCT randomises patients to receive either Drug X or placebo. After 3 years: 80 out of 500 placebo patients had a stroke; 40 out of 500 Drug X patients had a stroke.

1 CER = 80/500 = 0.16 (16%) | EER = 40/500 = 0.08 (8%)

2 ARR = CER − EER = 0.16 − 0.08 = 0.08 (8%)

3 NNT = 1 ÷ ARR = 1 ÷ 0.08 = 12.5 → round up to 13

4 RRR = ARR ÷ CER = 0.08 ÷ 0.16 = 0.5 = 50%

5 RR = EER ÷ CER = 0.08 ÷ 0.16 = 0.5 (Drug X halves the risk of stroke)
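
The five steps can be bundled into one function for checking practice questions. This is a sketch with a hypothetical function name. It uses exact fractions internally, an implementation choice made here so that rounding the NNT up is not thrown off by floating-point error.

```python
from fractions import Fraction
import math


def risk_metrics(control_events, control_total, treated_events, treated_total):
    """Compute CER, EER, ARR, RRR, RR and NNT from raw trial counts."""
    cer = Fraction(control_events, control_total)  # control event rate
    eer = Fraction(treated_events, treated_total)  # experimental event rate
    arr = cer - eer                                # absolute risk reduction
    if arr <= 0:
        raise ValueError("No absolute risk reduction, so NNT is not defined")
    return {
        "CER": float(cer),
        "EER": float(eer),
        "ARR": float(arr),
        "RRR": float(arr / cer),   # relative risk reduction
        "RR": float(eer / cer),    # relative risk
        "NNT": math.ceil(1 / arr), # always round up
    }


# Drug X trial from the example: 80/500 strokes on placebo, 40/500 on Drug X
print(risk_metrics(80, 500, 40, 500))
```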

🎯 Example 2: Constructing a 2×2 Table and Calculating Sensitivity & Specificity

A new test is applied to 200 patients, 100 of whom have the disease. The test correctly identifies 85 of the 100 with disease, and correctly identifies 80 of the 100 without disease.

1 TP = 85 | FN = 15 | TN = 80 | FP = 20

2 Sensitivity = TP ÷ (TP+FN) = 85 ÷ (85+15) = 85/100 = 85%

3 Specificity = TN ÷ (TN+FP) = 80 ÷ (80+20) = 80/100 = 80%

4 PPV = TP ÷ (TP+FP) = 85 ÷ (85+20) = 85/105 = 81%

5 NPV = TN ÷ (TN+FN) = 80 ÷ (80+15) = 80/95 = 84%
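
The same 2×2 calculation as a short sketch, with a hypothetical function name, so that the four results can be checked in one go.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, PPV and NPV from the four cells of a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),  # of those with disease, proportion testing positive
        "specificity": tn / (tn + fp),  # of those without disease, proportion testing negative
        "ppv": tp / (tp + fp),          # of positive tests, proportion with disease
        "npv": tn / (tn + fn),          # of negative tests, proportion without disease
    }


# Cells from the example above
for name, value in diagnostic_metrics(tp=85, fn=15, tn=80, fp=20).items():
    print(f"{name}: {value:.0%}")
```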

🎯 Example 3: Paediatric Dose Calculation

A child requires 300mg of amoxicillin. The available suspension is 250mg/5ml. What volume should be given?

1 Formula: (Desired dose ÷ Available strength) × Volume

2 (300 ÷ 250) × 5 = 1.2 × 5 = 6ml
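
For checking practice calculations, the dose formula can be written as a single function. This is an arithmetic sketch with a hypothetical function name, not a prescribing tool.

```python
def dose_volume_ml(desired_dose_mg, strength_mg, strength_volume_ml):
    """Volume to give: (desired dose / available strength) x volume.

    strength_mg per strength_volume_ml describes the suspension, e.g. 250 mg per 5 ml.
    """
    if strength_mg <= 0:
        raise ValueError("strength_mg must be positive")
    return desired_dose_mg * strength_volume_ml / strength_mg


print(dose_volume_ml(300, 250, 5))  # 6.0 ml (amoxicillin example above)
print(dose_volume_ml(250, 125, 5))  # 10.0 ml (formula bank example)
```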

📋

Formulas Cheat Sheet & Memory Aids

Risk & Treatment Formulas

ARR = CER − EER
RRR = ARR ÷ CER
RR = EER ÷ CER
NNT = 1 ÷ ARR
NNH = 1 ÷ ARI

Diagnostic Testing

Sensitivity = TP ÷ (TP+FN)
Specificity = TN ÷ (TN+FP)
PPV = TP ÷ (TP+FP)
NPV = TN ÷ (TN+FN)

Population & Clinical Formulas

SMR = (Obs ÷ Exp) × 100
BMI = Wt (kg) ÷ Ht (m)²
ABPI = Highest ankle SBP ÷ Highest brachial SBP
Alcohol units = (Vol in ml × ABV%) ÷ 1000

🧠 Master Memory Aids

SnNout: Sensitive test — Negative rules OUT disease (screening)
SpPin: Specific test — Positive rules IN disease (confirmation)
CI crosses 1.0 (ratio) or 0 (difference) → not significant
Forest plot diamond crosses line → not significant
Funnel asymmetry → publication bias (or GP outlier if performance context)
Box plot middle line → MEDIAN (not mean)
68-95-99.7 → ±1SD, ±2SD, ±3SD in normal distribution
I² >50% → substantial heterogeneity
OR from case-control | RR from cohort | Prevalence from cross-sectional
Audit ≠ Research (audit measures vs standard; research generates new knowledge)

📊 Study Design Mnemonic: "For Cases, OR — For Cohorts, RR"

Case-control → OR (Odds Ratio) — looking BACK at exposures
Cohort → RR (Relative Risk) — going FORWARD from exposure
Cross-sectional → Prevalence — SNAPSHOT in time
RCT / SR → Gold standard for treatment questions

🎓

Trainer & Teaching Pearls

Common Trainee Blind Spots on This Topic

Trainees often learn NNT as a formula without understanding what it actually means clinically — ensure they can explain it in a sentence a patient would understand
The distinction between sensitivity/specificity (test properties) and PPV/NPV (influenced by prevalence) is poorly understood — the pregnancy test analogy works well
Many trainees know what a forest plot looks like but cannot explain what to look at — focus on the diamond and whether it crosses the line of no effect
Lead time bias is frequently confused with length bias — use diagrams or timelines to illustrate the distinction
Trainees routinely confuse clinical audit with research — use the RCGP's own examples from assessments

Tutorial Ideas & Discussion Starters

"Here's the summary table from a drug trial — can you tell me the NNT and whether you'd prescribe it for this patient?"
"Look at this forest plot. Is the intervention effective? How confident are you in that answer?"
"Your patient has a positive FIT test. The PPV in a low-risk population is about 3%. How do you explain this to him?"
"We're seeing a lot of high PSA results. What's the issue with using PSA as a screening test?"
"A colleague wants to compare our sepsis outcomes with the Trust's. Is that audit or research? Does it need ethics?"
"How would you explain a 1-in-50 chance of a side effect to a patient who asks 'Is it safe?'"

Reflective Questions for Tutorials

How do you currently explain risk to patients? Do you use absolute or relative figures?
Can you recall a recent clinical guideline that cited a significant NNT — what was it, and how did it influence your practice?
Have you ever ordered a test and not known its sensitivity/specificity? How did you interpret the result?
Why might a drug company choose to present their trial results as RRR rather than ARR?

🔥 AKT High-Yield Tips

These are the patterns that repeatedly appear in AKT papers. Memorise these and you will score marks.

🎯 NNT Calculation

Always convert percentages to decimals first. ARR = 5% → NNT = 1 ÷ 0.05 = 20. Always round up to the nearest whole number. A lower NNT = more effective treatment.

🎯 CI Crosses the Magic Number

For ratios (RR, OR, HR): CI crosses 1.0 = not significant. For differences: CI crosses 0 = not significant. This comes up in nearly every forest plot question.

🎯 Study Design Matching

Rare disease → case-control → OR. New exposure going forward → cohort → RR. Prevalence snapshot → cross-sectional. Best evidence for treatment → RCT or SR/meta-analysis.

🎯 Sensitivity vs Specificity — Which to Use When

Screening test → want high sensitivity (don't want to miss cases → SnNout). Confirmatory test → want high specificity (don't want false positives → SpPin).

🎯 PPV Drops in Low-Prevalence Populations

Even a highly specific test (99%) gives poor PPV in a low-prevalence setting. Most positive results in population screening are false positives. This is why we don't screen everyone for everything.
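
The effect of prevalence on PPV follows directly from Bayes' theorem and is easy to demonstrate. In the sketch below the 99% specificity comes from the tip above. The 90% sensitivity and the two prevalence figures (0.1% and 20%) are illustrative assumptions.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from test characteristics and prevalence."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


# Same test (assumed sensitivity 90%, specificity 99%) in two settings
for label, prevalence in [("population screening", 0.001), ("high-risk clinic", 0.20)]:
    print(f"{label}: prevalence {prevalence:.1%}, PPV {ppv(0.90, 0.99, prevalence):.1%}")
```

With these assumed figures, fewer than one in ten positives in the screening setting is a true positive, while the same test in the high-risk clinic gives a PPV above 90%.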

🎯 Forest Plot Diamond

If the diamond (pooled estimate) crosses the vertical line of no effect → overall result is NOT statistically significant. If I² > 50% → substantial heterogeneity → pooled result is less reliable.

🎯 Funnel Plot Asymmetry

A gap in the bottom-left of a funnel plot = publication bias — small negative studies were not published. This inflates the apparent effect of a treatment in the meta-analysis.

🎯 Box-and-Whisker — Always Median, Not Mean

The line inside the box is the median. The box = IQR (middle 50%). Dots or circles beyond the whiskers = outliers.

🎯 RRR Sounds More Impressive Than ARR

Pharmaceutical companies love quoting RRR because it sounds bigger. A 50% RRR sounds amazing — until you know the baseline risk was only 2% (→ ARR = 1%, NNT = 100). Always ask: what was the baseline risk?
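
A short sketch shows how the same 50% RRR translates into very different absolute benefits. The 2% baseline is the figure from the tip above. The 20% baseline is an added illustrative assumption, and the function name is hypothetical.

```python
from fractions import Fraction
import math


def arr_and_nnt(baseline_risk_percent, rrr_percent):
    """Absolute risk reduction and NNT for a given baseline risk and RRR."""
    baseline = Fraction(baseline_risk_percent) / 100
    arr = baseline * Fraction(rrr_percent) / 100  # ARR = baseline risk x RRR
    return float(arr), math.ceil(1 / arr)         # NNT always rounded up


for baseline in (2, 20):
    arr, nnt = arr_and_nnt(baseline, 50)
    print(f"Baseline risk {baseline}%: RRR 50%, ARR {arr:.0%}, NNT {nnt}")
```

The relative figure is identical in both rows, but the number of patients who must be treated to prevent one event differs tenfold.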

🎯 Median — Use for Skewed Data

Skewed distributions (income, hospital stay, serum bilirubin) → use median not mean. The mean is pulled by outliers; the median is not.

🎯 Audit vs Research

Audit: measures against an existing standard; no ethics needed; no hypothesis. Research: generates new knowledge; needs ethics approval; has a hypothesis. A key distinction the AKT tests repeatedly.

🎯 Intention to Treat (ITT)

ITT analysis includes all randomised participants in the group they were allocated to, regardless of adherence. This preserves the balance created by randomisation and gives a conservative estimate of effectiveness that is more realistic for clinical practice. Per-protocol analysis tends to overestimate the effect.

🎯 ABPI Cut-Off

ABPI < 0.9 = peripheral arterial disease. ABPI > 1.3 = non-compressible (calcified) vessels — unreliable result. Compression bandaging is contraindicated if ABPI < 0.8 (check with your compression guidelines).

🎯 Cates Plot NNT

Count the benefit faces (usually yellow or green). NNT = 100 ÷ (number of benefit faces). 5 yellow faces → NNT = 20.

⚠️

Common Mistakes & Trainee Traps

These are the errors that appear repeatedly across AKT marking schemes. Every one of these is a real mark lost by real candidates.

Forgetting to convert percentages to decimals before calculating NNT (e.g., ARR = 5% → must use 0.05, not 5, to get NNT = 20, not 0.2)
Rounding NNT down rather than up (NNT = 12.5 → answer is 13, not 12)
Confusing RRR with ARR and quoting the more impressive-sounding relative figure as the clinical benefit
Saying a result is "significant" when the CI just touches 1.0 — it must not include 1.0 to be significant
Stating the box-and-whisker plot middle line is the mean — it is always the median
Confusing sensitivity with PPV — sensitivity is a fixed property of the test; PPV depends on prevalence
Thinking a highly specific test in a low-prevalence population will give a reliable positive result — it won't (low PPV)
Confusing an OR with an RR — ORs cannot be directly used as RRs except when disease is rare
Saying a case-control study generates RR — it generates OR, because you start with cases and controls, not an exposed cohort
Confusing clinical audit with research — claiming an audit needs ethical approval
Misinterpreting lead time bias as meaning a screening programme genuinely improves survival
Forgetting that I² >50% in a forest plot raises concerns about the validity of the pooled result
Using the mean to describe skewed data (e.g. income, hospital stay length) — use the median

🏁 Final Take-Home Points

NNT = 1 ÷ ARR. Always convert percentages to decimals. Always round up. Lower NNT = better treatment.
ARR is clinically honest. RRR sounds impressive but can mislead. Always pair RRR with baseline risk.
Sensitivity and specificity are fixed properties of a test. PPV and NPV change with disease prevalence.
SnNout: sensitive tests rule OUT when negative. SpPin: specific tests rule IN when positive.
Forest plot diamond crosses the line of no effect → result not statistically significant. I² >50% → heterogeneity concerns.
Funnel plot asymmetry → publication bias. Points outside funnel limits in performance monitoring → outlier practices.
CI for a ratio that includes 1.0 → not significant. CI for a difference that includes 0 → not significant.
Case-control → OR. Cohort → RR. Cross-sectional → prevalence. RCT/SR → gold standard for treatment.
Skewed data → use median, not mean. Box plot middle line = median. Dots beyond whiskers = outliers.
Clinical audit measures against standards — no ethics needed, no hypothesis. Research generates new knowledge — ethics required.

Statistics questions in the AKT are among the most reliably learnable marks in the paper. A few hours with this page and a handful of practice questions will repay the effort many times over.