Forest plots are very useful and easy to interpret. Each individual study should be annotated, and is shown as a square, or ‘blob’, with a horizontal line representing the 95% CI. The blob in the middle is the reported result of the study, and its relative size represents the weighting that this individual study has in the overall analysis. The vertical line down the middle is the ‘line of no effect’, which for a ratio is 1. A study is statistically significant if its CI line does not cross the vertical ‘line of no effect’; if the horizontal line does cross it, the study is non-significant. The overall result pooled by meta-analysis is represented by a diamond, the width of which represents the CI. If the diamond does not cross the line of no effect, this is a positive (statistically significant) result. If it does, the overall result is non-significant. See also meta-analysis.
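The reading rule above can be sketched in a few lines of Python. This is only an illustration: the study names and confidence intervals are invented, and the check assumes a ratio measure where the line of no effect is 1.

```python
# Hypothetical data: name -> (lower, upper) bounds of the 95% CI for a ratio.
NO_EFFECT = 1.0  # line of no effect for a ratio measure


def crosses_no_effect(ci_lower, ci_upper, no_effect=NO_EFFECT):
    """Return True if the confidence interval includes the value of no effect."""
    return ci_lower <= no_effect <= ci_upper


studies = {
    "Study A": (0.60, 0.95),
    "Study B": (0.70, 1.20),
    "Pooled result (diamond)": (0.68, 0.92),
}

for name, (lower, upper) in studies.items():
    verdict = "non-significant" if crosses_no_effect(lower, upper) else "significant"
    print(f"{name}: 95% CI {lower:.2f} to {upper:.2f} -> {verdict}")
```

Here the invented Study B crosses 1 and so is non-significant, while Study A and the pooled diamond do not.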
A method of graphing results in a meta-analysis to see if the results have been affected by publication bias.
A way of expressing the relative risk of an adverse event i.e. if an adverse event was twice as likely to happen with a particular intervention, it would have a HR of 2.
This is important when appraising systematic reviews. Consider how heterogeneous the results are both clinically and statistically. Clinically you have to use your judgement to decide whether the studies are similar enough to warrant combining them (e.g. is a systematic review of sinusitis that includes groups of patients with just facial pain and those with a CT scan confirmed diagnosis valid?). Statistically consider if the individual results contradict each other. This can be assessed by ‘eyeballing’ the Forest plot. Experts use formal statistical tests such as the Cochran Q (chi-squared) test to do this.
The percentage of a population who will develop a disease during a specified interval e.g. a study finding that the incidence of chlamydia amongst new college students was 2% per annum means that 2% of the students contracted chlamydia during the year.
A simple graph very commonly used in clinical trials to represent the time to an event (e.g. MI), or survival rates of participants as they move through time. Time is on the horizontal axis and the percentage reaching the primary end point is on the vertical axis; curves in different colours represent the experimental and control groups. See below for an example comparing medical therapy to coronary intervention in stable angina.
This is important in prognosis studies. Detecting disease early does not necessarily improve mortality. If patients are not all recruited at a similar stage of the disease, differences in outcome over time may just reflect earlier diagnosis e.g. a study that shows apparent improved survival in breast cancer: have the patients really lived longer, or actually died at the same time but just been aware of the diagnosis for longer? For example, imagine two women develop breast cancer aged 50. Woman A is diagnosed by screening at age 52. Woman B is not screened, and the cancer manifests clinically at age 56. Both women die aged 60. Screening appears to improve survival with breast cancer from 4 years to 8 years, but in fact has just diagnosed it earlier.
Frequently used in guidelines to grade the strength of recommendations. Variable, but a general guide is:
Grade | Level of evidence | Type of evidence |
A | 1a to 1c | Syst review of RCTs, or RCT with narrow CI |
B | 2a to 3b | Syst review of cohort or case control studies, or good quality cohort or CC studies |
C | 4 | Case series, poor quality cohort and case control studies |
D | 5 | Expert opinion, or based on physiology or ‘first principles’ |
A measure used in studies looking at diagnosis. It tells us how useful a test, symptom or sign is for establishing a diagnosis and is considered the most useful overall measure of its efficacy. Example: in a study looking at a new near-patient test for streptococcal throat, the LR (for a positive result) is the ratio of the probability of a positive result amongst patients who really do have strep throat to the probability of a positive result in patients who do not have strep throat.
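As a rough sketch of the ratio described above, the likelihood ratio for a positive result can be written as sensitivity / (1 − specificity). The sensitivity, specificity and pre-test probability below are invented for illustration, and the conversion to a post-test probability via odds is a standard step that is assumed here rather than taken from this glossary.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR for a positive result, LR for a negative result)."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert probability to odds, apply the LR, then convert back."""
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


# Hypothetical near-patient strep test: sensitivity 90%, specificity 80%.
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.90, specificity=0.80)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.3f}")

# Hypothetical patient with a 30% chance of strep throat before testing.
print(f"Post-test probability after a positive test: "
      f"{post_test_probability(0.30, lr_pos):.2f}")
```

With these made-up figures a positive result is 4.5 times more likely in someone with strep throat than in someone without it.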
Meta-analysis
A statistical technique used to integrate the quantitative results of pooled studies, for example in a systematic review. For mathematical reasons, meta-analyses tend to express their results in terms of odds ratios. See systematic review.
See positive predictive value, sensitivity and specificity below.
This is a case control study which is ‘nested’ within a defined cohort. Cases of a disease which occur within this cohort are identified and compared to matched controls within the same cohort who do not develop the disease. Although a less robust level of evidence than a cohort study, they are cheaper and easier to do than a full cohort study and can help answer useful questions about factors which contribute to the development of a condition. For example, an excellent recent nested case control study looked at death in patients with epilepsy (BJGP2011;61;341). The cohort was patients diagnosed with epilepsy; cases were those who died, and for each of these two controls with epilepsy were selected who were still alive. Cases and controls were matched for age and sex, and the researchers then identified factors which are associated with death in epilepsy.
Network meta-analysis
This is a form of meta-analysis in which indirect comparisons are made between multiple different treatments. Networks of RCTs are analysed together, permitting inferences about the comparative effectiveness of different treatments which may not have been directly compared with each other. Put simply, if drug A was compared to B in one trial, and drug B to C in another, then using this technique you could comment on the comparative effectiveness of A vs C even though they have not been directly compared with each other. This makes the technique potentially very useful when looking at conditions with multiple treatment options, and it is increasingly used to compare the effectiveness of different treatments for the same condition e.g. a recent study comparing the effectiveness of different asthma treatments (BMJ2014;348;g3009). The strength of this approach is that effects estimated from direct comparisons within individual trials are combined with indirect, non-randomised comparisons between different trials, but at the cost of potentially increasing heterogeneity and confounding factors. The methodology is complex (see BMJ2013;346:f2914) and is beyond our scope, but we need to have an idea of what it is and what it’s for!
These are trials specifically designed to see if the new drug is ‘no worse than’ or at least ‘as good as’ the standard treatment. Non-inferiority trials require smaller sample sizes, are cheaper and quicker to do and less likely to produce disappointing results for new drugs so are increasingly used in Pharma sponsored research!
A clinically useful measure of the absolute benefit or harm of an intervention, expressed in terms of the number of patients who have to be treated for one of them to benefit or be harmed. Calculated as 1/ARR. Example: the ARR of a stroke with warfarin is 2% (= 2/100 = 0.02), so the NNT is 1/0.02 = 50. Another example: Drug A reduces the risk of a MI from 10% to 5%; what is the NNT? The ARR is 5% (0.05), so the NNT is 1/0.05 = 20. Chris Cates’ site NNT On-line has a fantastic Visual Rx calculator which will calculate a NNT for you based on odds ratios or relative risk, and then show you a nice ‘smiley face’ Cates plot.
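The two worked examples above can be reproduced with a short Python sketch. The function names are hypothetical, and rounding the result up to the next whole patient is a common convention that is assumed here, not something stated in this glossary.

```python
import math


def nnt_from_arr(absolute_risk_reduction):
    """NNT = 1 / ARR, rounded up to a whole patient (assumed convention)."""
    if absolute_risk_reduction <= 0:
        raise ValueError("ARR must be positive to calculate an NNT")
    # round() first so floating-point noise cannot push an exact answer up by one
    return math.ceil(round(1 / absolute_risk_reduction, 9))


def nnt_from_event_rates(control_event_rate, experimental_event_rate):
    """ARR is the control event rate minus the experimental event rate."""
    return nnt_from_arr(control_event_rate - experimental_event_rate)


print(nnt_from_arr(0.02))                # warfarin example: ARR of 2%
print(nnt_from_event_rates(0.10, 0.05))  # Drug A: MI risk falls from 10% to 5%
```

These calls return 50 and 20, matching the hand calculations in the text.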
Observational study
What it says on the tin… In an observational study, the investigators have not assigned treatments or interventions to particular groups. They observe and analyse outcomes of predetermined treatments or interventions. This can be based on data that is retrospective (case control study), prospective (cohort study) or current (cross sectional). See each of these for more information. Observational studies can yield incredibly useful data and you often hear the fact that they are based on ‘real life’ data as a positive. However, they can never eliminate bias and confounding variables. Observational studies can show associations but not prove causation, and thus are a ‘lower’ form of evidence compared to experimental studies such as randomized controlled trials.
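The odds ratio (OR) is another way of expressing a relative effect: it is the odds of an event (either beneficial or harmful) in the experimental group divided by the odds of the same event in the control group, where the odds are the chance of the event happening versus it not happening. For statistical reasons ORs are favoured in meta-analysis and case-control studies. They can be calculated from the CER and EER e.g. consider a trial of a new drug. 10 out of 200 patients died in the control arm. 5 out of 200 died in the treatment arm.
- What are the odds of death in the treatment arm? 5 died and 195 survived, so the odds are 5/195 = 0.026 (for comparison, the EER is 5/200 i.e. 2.5% which is 0.025)
- What are the odds of death with control? 10 died and 190 survived, so the odds are 10/190 = 0.053 (for comparison, the CER is 10/200 i.e. 5% which is 0.05)
- What is the OR of the study? This is the ratio of the odds, i.e. 0.026/0.053 = 0.49. Because death was uncommon in both arms, this is very close to the relative risk (EER/CER = 0.025/0.05 = 0.5) i.e. the treatment roughly halves the risk of death compared to control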
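A minimal Python sketch of the trial above (10/200 deaths with control, 5/200 with treatment) shows how the odds ratio and the relative risk relate. Strictly, the odds in each arm are deaths divided by survivors rather than deaths divided by all patients, so the OR comes out just under the relative risk of 0.5. The function names are hypothetical.

```python
def odds(events, total):
    """Odds = number with the event / number without the event."""
    return events / (total - events)


def odds_ratio(exp_events, exp_total, ctrl_events, ctrl_total):
    """Odds in the experimental arm divided by odds in the control arm."""
    return odds(exp_events, exp_total) / odds(ctrl_events, ctrl_total)


def relative_risk(exp_events, exp_total, ctrl_events, ctrl_total):
    """EER / CER, where each rate is events divided by all patients in the arm."""
    return (exp_events / exp_total) / (ctrl_events / ctrl_total)


print(f"OR = {odds_ratio(5, 200, 10, 200):.2f}")
print(f"RR = {relative_risk(5, 200, 10, 200):.2f}")
```

When the event is rare the two measures are almost identical; they drift apart as the event becomes more common.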
Odds ratios can be converted into an absolute measure such as NNT, provided all the baseline data are known, but this is complicated and you will not be asked to do it in the AKT! You can convert OR to NNT using the ‘visual treatment calculator’ on-line at www.nntonline.com or at www.cebm.net
P value
A measure of how likely it is that a result at least as extreme as the one observed would have occurred by chance alone if there were really no effect e.g. p = 0.05 means that there is a 5% chance of seeing such a result by chance when the intervention actually does nothing. For entirely arbitrary reasons p<0.05 means statistically significant and p<0.01 means highly significant. The lower the value, the less likely it is that the result is due to chance alone. It is always sobering to remember that if the makers of a new bottled water called HEADPAINGONE did a RCT comparing their product with tap water to cure headaches, and they repeated this RCT 20 times, they would more likely than not get at least one statistically significant result by chance alone.
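The bottled water example can be checked with a quick calculation. This sketch assumes the 20 trials are independent, that the product truly has no effect, and that each trial is judged at the p<0.05 threshold; the function name is hypothetical.

```python
def chance_of_any_false_positive(n_trials, alpha=0.05):
    """Probability that at least one of n independent null trials is 'significant'."""
    return 1 - (1 - alpha) ** n_trials


for n in (1, 5, 20):
    print(f"{n} trial(s): {chance_of_any_false_positive(n):.2f}")
```

Under these assumptions the chance of at least one spuriously significant result across 20 trials is about 64%.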
PICO is the commonly used acronym for forming clinical questions when practising Evidence Based Medicine, and searching for an answer to a specific clinical problem.
P = patient or problem
I = intervention
C = comparison intervention
O = outcome
e.g. in a 50 year old man with diabetes and no heart disease (patient), does a statin (intervention) compared to placebo (comparison) reduce CV events or mortality (outcome)?
Positive predictive value (see also sensitivity and specificity)
When examining the sensitivity and specificity of a diagnostic test (see below for these), the PPV is the percentage of patients who test positive for condition X who really do have condition X, and the negative predictive value NPV is the percentage who test negative who really do not have it. Importantly these are dependent on the background prevalence of the disorder in the population. If a disease is rare, the PPV of a positive test will be lower (but sensitivity and specificity remain constant). So often with tests the PPV is higher in a secondary care or sieved population than it is in primary care. So, taking a recent example, when testing symptomatic patients in hospital for coronavirus during the 2020 covid-19 pandemic a positive test will have a much higher PPV than a positive test from an asymptomatic ‘screened’ population in the community. The likelihood ratio takes this into account and gives the most accurate information on test accuracy.
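The dependence of PPV on prevalence can be illustrated with a short sketch. The sensitivity, specificity and the two prevalence figures below are invented purely for illustration and are not taken from any real coronavirus test.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied to a population with the given prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Same hypothetical test (sensitivity 90%, specificity 95%) in two settings.
for setting, prevalence in [("screened community", 0.01), ("symptomatic hospital", 0.30)]:
    ppv, npv = predictive_values(0.90, 0.95, prevalence)
    print(f"{setting}: prevalence {prevalence:.0%}, PPV {ppv:.0%}, NPV {npv:.0%}")
```

The test itself is unchanged between the two settings, yet the PPV rises sharply as the condition becomes more common.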
What it says on the tin. This is a RCT which is based in patients’ usual care settings, without rigorous exclusion criteria and often using non-blinded treatments which can be flexibly given or used according to clinical need. The aim is to produce a research setting which more closely mimics ‘real life’ care and ‘real’ patient populations. Such a trial is also likely to be easier and cheaper to do. The downside of course is that because it is less controlled it will be more subject to bias.
The probability of a disease in a population at any one point in time. Example, the prevalence of diabetes in the population is 2% simply means that 2% of the population at the time of the study have diabetes. See incidence.
An important source of bias. Negative trials are just as valid as positive ones, but are less likely to be published. Systematic reviews should search for all data, including unpublished, to try and eliminate this. Also occurs in a SR if the search for studies is incomplete.
Qualitative studies are inherently different from quantitative studies which are based on data. Qualitative studies use observation and interview to shed light on beliefs, thoughts and motivations. They can help ‘fill the gaps’ in knowledge that numbers cannot answer. For example, why do parents worry so much about fever in children? Why do men want a PSA test? Although by definition they are qualitative rather than quantitative, and tend to be small, they should still be designed in a rigorous and systematic way.
Obviously, the gold standard study type for answering questions about treatment, and the best way of minimizing bias and confounding variables. An ‘experimental’ study that can be used to test hypotheses that may have been generated by ‘observational’ studies, such as cohort studies. But very large numbers are needed for RCTs to find a difference in rare events, so they are prone to type 1 and 2 errors. They are also very expensive, so are often funded by manufacturers with strong vested interests in the outcome.
The relative risk, or risk ratio, is the ratio of the risk of an event in the experimental group compared to the control group i.e. RR = EER/CER. The RRR is the proportional reduction seen in an event rate between the experimental and control groups. For example if a drug reduces your risk of an MI from 6% to 3%, it halves your risk of a MI i.e. the RRR is 50%. But note that the ARR is only 3%.
Relative risks and odds ratios are used in meta-analyses as they are more stable across trials of different duration and in individuals with different baseline risks. They remain constant across a range of absolute risks. Crucially, if not understood, they can create an illusion of a much more dramatic effect. Saying that this drug reduces your risk of a MI by 50% sounds great; but if your absolute risk was only 6%, this is the same as reducing it by 3 percentage points (from 6% to 3%). So, saying this drug reduces your risk by 50% or by 3% are both true statements but sound very different, so guess which one drug companies tend to prefer!
However relative risks are still useful to know as you can then apply them yourself to an individual’s absolute risk. So, if you know an intervention reduces the relative risk by a third, you can easily calculate and inform your two patients with a 30% and a 9% risk of heart disease that the intervention will reduce their individual risk by about a third i.e. to 20% and 6% respectively.
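The arithmetic in this entry can be sketched as follows, using the MI example (risk falling from 6% to 3%) and the two patients with baseline risks of 30% and 9%. The function names are hypothetical.

```python
def risk_measures(control_event_rate, experimental_event_rate):
    """Return the relative risk, relative risk reduction and absolute risk reduction."""
    rr = experimental_event_rate / control_event_rate
    return {
        "RR": rr,
        "RRR": 1 - rr,
        "ARR": control_event_rate - experimental_event_rate,
    }


def apply_relative_reduction(baseline_risk, rrr):
    """Risk remaining after applying a relative risk reduction to a baseline risk."""
    return baseline_risk * (1 - rrr)


measures = risk_measures(control_event_rate=0.06, experimental_event_rate=0.03)
print({name: round(value, 2) for name, value in measures.items()})

# An intervention that cuts relative risk by a third, applied to two patients.
for baseline in (0.30, 0.09):
    new_risk = apply_relative_reduction(baseline, 1 / 3)
    print(f"Baseline {baseline:.0%} -> {new_risk:.0%} after treatment")
```

The same relative reduction gives a 10 percentage point absolute benefit for the higher-risk patient but only 3 points for the lower-risk one.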
Relative risk increase (RRI) similarly describes increases in rates of bad events in a trial or the development of a disease in a cohort study.
Risk difference. See Absolute Risk
Sensitivity and specificity (and positive and negative predictive values)
In diagnostic studies, sensitivity is the probability of a positive test amongst patients with the disease. A very sensitive test will have few false negatives and be good at picking up disease. Specificity is the probability of a negative test among patients without the disease. A very specific test will have few false positives, so a positive result is good at ruling a disease in. SnNOUT means if a test is very Sensitive (Sn) a Negative test rules the diagnosis out. SpPIN means if a test is highly Specific (Sp) a Positive result rules the diagnosis in.
Imagine a new test for whooping cough. The results are presented as:
Test result | Disease present | Disease absent |
Positive | a | b |
Negative | c | d |
The sensitivity of the test is a/(a+c); the specificity of the test is d/(b+d).
The positive predictive value (PPV) of the test is a/(a+b); the negative predictive value (NPV) of the test is d/(c+d).
The PPV is the percentage of patients who test positive for Bordetella who really do have it, and the NPV is the percentage who test negative who really do not have it. Importantly these are dependent on the background prevalence of the disorder in the population. If a disease is rare, the PPV will be lower (but sensitivity and specificity remain constant). So often with tests the PPV is higher in a secondary care or sieved population than it is in primary care. The likelihood ratio takes this into account and gives the most accurate information on test accuracy.
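The four measures can be computed directly from the cells of the 2×2 table. In this sketch a is the number of true positives, b false positives, c false negatives and d true negatives, matching the layout above; the counts themselves are invented for illustration.

```python
def diagnostic_measures(a, b, c, d):
    """a = true positives, b = false positives, c = false negatives, d = true negatives."""
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "PPV": a / (a + b),
        "NPV": d / (c + d),
    }


# Hypothetical whooping cough test results in 1000 patients.
results = diagnostic_measures(a=90, b=30, c=10, d=870)
for name, value in results.items():
    print(f"{name}: {value:.1%}")
```

With these made-up counts the test is 90% sensitive, yet only three quarters of positive results are true positives.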
A paper in which the authors have systematically searched for, appraised and summarized all the known literature (including non-published studies) on a topic. Quantitative reviews then use the statistical method of meta-analysis to synthesize the pooled results. This creates statistically powerful studies with narrow CI. Compared to individual studies this improves the precision of the result, increases confidence that the result has not occurred by chance and reduces the chance of type 1 and 2 error (see below). They are critically dependent on the quality of the studies that are included (rubbish in = rubbish out) and can sometimes be blunt tools for looking at some of the very specific problems seen in primary care.
A Type 1 error occurs if a result is statistically significant, but this is a chance finding and in fact there is no real difference. A Type 2 error occurs if the study finds no significant difference when in fact there is a real treatment difference. Small studies with wide CI are prone to these errors.
If you see an unexpected positive result (e.g. a small trial shows willow bark extract is effective for back pain) think: could this be a type 1 error? After all, at the conventional p<0.05 threshold an RCT of a treatment with no real effect still has a 1 in 20 chance of a ‘positive’ result, and a lot of RCTs are published…
If a trial shows a non-significant result, when perhaps you might not have expected it, think: could this be a type 2 error? Is the study under-powered to show a positive result? Systematic reviews, which increase study power and narrow the CI, are therefore very useful at reducing Type 1 and 2 error.
Dr Simon Curtis
Medical Director, NB Medical Education
‘Hot Topics’ education, courses and CPD for primary care healthcare professionals
June 2020