How Good is that Test? II 

Bandolier 26 reported on quality standards that should be met by reports of diagnostic procedures, and how few of those standards were met by reports in our top medical journals. Users of tests will want to know not only that tests work, but how well they work, just like NNTs for treatments.
This issue of Bandolier investigates diagnostic test qualities a little further. The problem of spectrum bias means focusing on sensitivity and specificity of tests. These, however, are not the most user-friendly of measures, so Bandolier, ever seeking simplicity, has invented a new measure: the NND, or number-needed-to-diagnose. Comments are invited.

Spectrum bias

An unrecognised (but probably very real) problem is that of spectrum bias [1]. This is the phenomenon of the sensitivity and/or specificity of a test varying with different populations tested: populations which might vary, for example, in sex ratios, age, or severity of disease.

Spectrum bias at its simplest means that the sensitivity and specificity of the test have to be known in a range of different patient populations. This was tested in the paper by looking at men and women tested for urinary tract infections with urine dipsticks [1].
Actually, this is very good, showing that using the urine dipstick test where there were some clinical indications of UTI picked up the infection nearly every time. Note, though, that this only addresses those patients with the disease, not those without it. The authors examined a number of other tests, and found examples of spectrum bias with tumour markers (varying with severity of disease), exercise ECG for coronary ischaemia (varying with age, sex and severity) and various other physical tests.

Problem

The problem is handling tables of sensitivity and specificity: two sets of numbers that can go up or down independently in different populations. It is just too much for simple or busy brains. It is hard enough remembering just how sensitivity and specificity are defined. If the evidence is too complicated to be used, then we have a problem.

Simplify

Is it possible to simplify these measures? A whole raft of calculations can be done knowing the true and false positive and negative rates, none of which condenses the information down to a single useful figure. Using positive and negative predictive values (as one example) still means carrying too much baggage.

Given Bandolier's predilection for the number-needed-to-treat, we wondered whether it was possible to generate an analogous "number-needed-to-diagnose". The argument goes something like this (and forgive a little jargon). For any chosen clinical endpoint the NNT is the reciprocal of the fractional improvement in a treated group minus the fractional improvement in an untreated group:

NNT = 1/(fraction improved with active - fraction improved with control)

For a diagnostic test the analogous calculation of an NND would be the reciprocal of the fraction of positive tests in the group with the disease minus the fraction of positive tests in the group without the disease. The first term, the fraction of positive tests in the group with the disease, is the sensitivity (true positives divided by true positives plus false negatives).
Specificity is defined as the proportion of people without the disease who have a negative test. So the second term, the fraction of positive tests in the group without the disease, is 1 - specificity.

Number-needed-to-diagnose

The number-needed-to-diagnose is therefore:

NND = 1/[sensitivity - (1 - specificity)]

How does this work in practice?

Take Helicobacter pylori infections as an example. Serology tests for the presence of anti-H pylori immunoglobulins and urea breath tests have sensitivities and specificities each of about 95%. So the NND calculation using fractions would be:

NND = 1/[0.95 - (1 - 0.95)] = 1/0.9 = 1.1

Using examples from the paper on spectrum bias gives a series of results with NND values up to about 4. Thus using CEA as a diagnostic screening test for colon cancer in patients with the disease would yield an NND of 4.4 in early cancers, but as low as 1.6 in late cancers: a clear case of spectrum bias. Similar differences exist for other examples.

Interesting is the effect of NND calculations on the authors' own data on urine testing. Because sensitivity goes down but specificity increases in patients with few symptoms of UTI, the NND of 2.9 remains the same whether the clinical suspicion is high or low. Their best result was the overall NND of 1.8, because of a combination of relatively high sensitivity and specificity. Perhaps this emphasises the need to consider sensitivity and specificity combined in a single term.

Choosing which test

There are occasions where different tests can be used to make the same diagnosis. NNDs may help to choose between them when faced with an array of sensitivity and specificity figures.

NNDs calculated for diagnostic tests
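The arithmetic above can be sketched in a few lines of code. This is a minimal illustration, not part of the original paper; the function name `nnd` and the worked numbers beyond the H pylori example are ours.

```python
def nnd(sensitivity: float, specificity: float) -> float:
    """Number-needed-to-diagnose: 1 / [sensitivity - (1 - specificity)].

    The denominator is the true positive rate minus the false positive
    rate, so a perfect test (both equal to 1) gives an NND of 1.
    """
    return 1.0 / (sensitivity - (1.0 - specificity))

# H pylori serology or urea breath test: sensitivity and specificity
# each about 95%, as in the worked example above.
print(round(nnd(0.95, 0.95), 1))  # 1.1
```

Note that as the test worsens and the true and false positive rates converge, the denominator shrinks and the NND climbs rapidly, mirroring the behaviour of the NNT for weak treatments.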
The Table shows three tests of smoking status from a Northern Ireland study [2] measured against self-reporting. They are all good, but urine nicotine metabolite or breath carbon monoxide are much better than serum thiocyanate. Even small improvements are important if considering routine or screening use of such tests.

Implications
References:
