It is inevitable that clinical trials will fail. No matter how good the hypothesis, how promising the compound, sometimes our understanding of the biology was just too patchy. Humans are, after all, complex creatures.
But for every “acceptable failure” when the hypothesis being tested was just plain wrong, there is another failure that was entirely avoidable. Tiny, almost invisible, errors in the design of many clinical trials conducted today lead to costly failures. Worse still, perfectly good therapeutic candidates are dumped without their promise ever being properly tested.
Spotting these design errors is tricky. Some are so subtle that even highly experienced and successful drug developers can inadvertently incorporate them into their trial designs. And the natural herd instinct in all of us is compounded by regulatory authorities who prefer established trial designs to innovation. But spotting them is imperative if we are to return the productivity of pharmaceutical R&D to the levels seen in past decades.
In previous posts, DrugBaron has already highlighted two of the three biggest causes of avoidable failures: an inadequate approach to power calculations, and unexpectedly large placebo responses. But there is a third cause that is at least as important: the increasing popularity of composite end-points.
It’s a sign of the times. Advances in technology allow us to measure more things more cheaply than ever before, on ever smaller samples, ever more accurately. In short, we are awash with data.
And the presumption is that all this data on the subjects treated with an experimental drug must somehow reveal more about the safety and efficacy of the drug than simpler measures from a bygone age. This presumption is driven by the growing prominence of so-called “big data” in the tech space. Mining large quantities of data emerging from the close monitoring of very complex processes, such as silicon chip manufacturing plants, logistics networks and most recently the internet, has driven highly disruptive changes in all these areas.
So it’s natural to assume the same will happen in healthcare, and more specifically in clinical trials. And it will. But it hasn’t happened yet, most likely because the gulf between biology and medicine on the one hand and mathematics and multivariate statistics on the other is much wider than for other technology disciplines.
“The superficially straight-forward task of combining individual biomarkers is not a task for the uninitiated. Injudicious combination will reduce rather than increase your power” Total Scientific Biomarker Blog
Outside the life sciences, many tech companies are heavily populated with “geeks” – an unfairly negative classification for computer scientists, engineers and mathematicians. Anyone, in fact, with a stronger grip on numbers than human emotions. These people grew up describing very complex systems with numbers, and have developed ever more powerful mathematical tools to deal with the problems they face. And the biggest hurdle for these “big data” problems is extracting information from vast tracts of data.
Data is not information. We collect data by measuring things. Information, on the other hand, is useful to guide future decisions. It allows you to predict the future behaviour of the complex system you observed when you collected the data. And closing this loop requires processing of the data to yield predictive information.
Even without a background in data processing, it seems obvious that irrespective of the processing step required to derive useful information from the data, the more and better the data the better the resulting information. And hence the current trend to capture more and more data about our clinical trial subjects.
But there is another difference between our clinical trials and applications of “big data” outside of life sciences. n. The number of observations. A typical proof-of-concept study might have tens of subjects, a decent phase II study might have hundreds of subjects and the largest studies have a few thousand. Contrast that with the dataset collected by Google from the search queries it receives. Or the number of silicon wafers being processed in a single day at a semiconductor fabrication plant. When a tech company talks of “big data” it really means big. Not only (and maybe not even) lots of data about each individual event, but also lots and lots of “replicate” events.
When such data is arranged in a table, we call it “tall and narrow” because there are more replicate observations (rows) than separate measurements (columns). If you take lots of measurements on each subject in a clinical trial, however, what you end up with is a “short and fat” dataset with more columns than rows.
All this seems a bit esoteric, but when you look a little deeper this is precisely why the more data you collect, the greater the risk of clinical trial failure becomes, rather than the smaller. This is a trend that will drive down pharmaceutical R&D productivity over time; so far from being esoteric, it is of central importance to everyone in the life science industry.
To gain some insight into why this might be we need to recall some maths taught to us as thirteen year olds: simultaneous equations. Given two variables, x and y, and two equations that link them, you can apply simple algebra to obtain the values of x and y. Critically, to solve the problem you need (at least) the same number of equations as variables. Give me three variables and two equations and it is mathematically impossible to return unique values for the three variables.
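As a concrete reminder, here is a small worked example. The numbers are invented purely for illustration. With two equations and two unknowns there is exactly one answer; add a third unknown without a third equation and a whole family of answers fits equally well.

```latex
\begin{aligned}
x + y &= 5 \\
x - y &= 1
\end{aligned}
\quad\Longrightarrow\quad x = 3,\; y = 2

\begin{aligned}
x + y + z &= 6 \\
x - y &= 1
\end{aligned}
\quad\Longrightarrow\quad y = x - 1,\; z = 7 - 2x \quad \text{for any value of } x
```

In the second system every choice of x gives a valid solution, so no amount of algebra will pin down a single one.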
Processing data to yield information is exactly analogous to the simultaneous equations problem. So as long as I have plenty more observations (rows) than variables (columns) – in other words, I have a “tall and narrow” dataset – then it is (relatively) trivial to extract information from the data.
But with “short and fat” datasets, such as those that typically emerge from clinical trials today, the same is not true. Extracting any kind of useful information from such a dataset requires a level of mathematical sophistication that eludes most statisticians working in the life science space, let alone chief medical officers and investor directors in life science companies. It is no wonder, then, that this is a leading cause of “avoidable failures”.
Of course, that’s not the end of the story. That merely describes the problem, so what about the solution? There are basically three types of solution of increasing complexity.
Most straightforwardly, you could restrict the number of measurements that you take. This was the old-fashioned approach from the days when making more measurements was in any case limited by technical factors (such as availability of assays or sample volumes) or by cost. This simplifies the statistics: with a single primary end-point it is easy to determine whether there has been a meaningful change in response to study drug.
The cost of this simplicity, though, is the difficulty of picking such ‘simple’ end-points. For many applications, particularly in early stage trials, there is no single biomarker that is sufficiently powerful to predict future efficacy against “hard” clinical end-points. Instead, there are dozens if not hundreds of candidate biomarkers each weakly associated with the regulatory end-points of interest. Choosing one is difficult if not impossible.
The next option, and perhaps the most common solution to this problem, is to start combining the weak biomarkers into profiles or composite scores. These scores might include functional measures (such as pain scales) or molecular measures (such as cytokine panels), or increasingly both types at once, such as the DAS28 score for rheumatoid arthritis, which combines phenotypic measures (the number of tender and swollen joints) with the inflammatory marker CRP. The wonderfully named SNOT-22 score for nasal symptoms in allergy is another example.
Judging by its popularity, you might assume such combinations were universally a good idea. According to this logic, combining several weak biomarkers should yield a single stronger one. After all, with more data incorporated, surely the composite must be more powerful than the separate component biomarkers?
Absolutely not. Most combinations of biomarkers are weaker than the single strongest component biomarker. Why is that? Because the signals contained in each separate biomarker are usually correlated with each other, but the large amount of noise is entirely random. So each added marker contributes a full shot of noise, but a heavily diluted dose of signal.
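The dilution argument can be made concrete with a back-of-envelope calculation. The sketch below is illustrative only and rests on stated assumptions: each marker is shifted by the drug by a known standardised amount (the hypothetical values in `effects`), each carries independent noise with unit variance, and the composite is a plain unweighted sum. Under those assumptions the signal of the sum is the sum of the effects, while the noise standard deviation grows as the square root of the number of markers.

```python
import math


def composite_effect_size(effects):
    """Standardised effect size of an unweighted sum of markers.

    Assumes every marker has independent noise with unit variance, so the
    signal of the sum is sum(effects) and its noise SD is sqrt(k).
    """
    k = len(effects)
    return sum(effects) / math.sqrt(k)


# Hypothetical standardised treatment effects: one decent marker, three weak ones.
effects = [0.8, 0.2, 0.2, 0.2]

best_single = max(effects)
composite = composite_effect_size(effects)

print(f"best single marker:   {best_single:.2f}")
print(f"unweighted composite: {composite:.2f}")
```

With these made-up numbers the composite comes out at 0.70 against 0.80 for the best single marker: three extra measurements, and a weaker end-point. The exact figures depend entirely on the assumed effect sizes, which were chosen here to show the effect.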
Given a handful of weak biomarkers, there are limitless ways of combining them, and any arbitrary combination will almost certainly be less powerful than the single best component marker. It is important to note that ‘arbitrary’ in this sense simply means a combination selected without testing whether that combination was in fact more powerful at predicting outcome than other combinations or the separate components. Many ‘arbitrary’ combinations have excellent theoretical underpinning – but that doesn’t make them any more likely to be useful.
In a recent post on their Biomarker Blog, Total Scientific used their heart disease dataset, MaGiCAD, to powerfully illustrate this point. A commonly used scoring algorithm for predicting heart attacks was shown to be no more powerful than its best single component.
The third, and most complex, approach, then, is to avoid arbitrary combinations of biomarkers and use mathematics to select the best possible combination. For “tall and narrow” datasets, that is relatively easy. But for the “short and fat” datasets emerging from clinical trials it is considerably more challenging.
Again, thinking about simultaneous equations helps illustrate the approach. With three variables and two equations you cannot find a single solution. But you can place limits on the values each variable can take. And the limits can be substantial – in other words it is usually possible to rule out the majority of possible variable combinations, with any one of a minority still satisfying both equations.
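A toy illustration of “placing limits” follows, with invented numbers and one added assumption: that all three variables must be non-negative, as a concentration or a count would be. The two equations alone only tie the variables together; it is the extra constraint that turns the family of solutions into a bounded range.

```latex
\begin{aligned}
x + y + z &= 6 \\
x - y &= 1
\end{aligned}
\quad\Longrightarrow\quad y = x - 1,\; z = 7 - 2x

x, y, z \ge 0
\quad\Longrightarrow\quad
1 \le x \le 3.5,\qquad 0 \le y \le 2.5,\qquad 0 \le z \le 5
```

No single solution emerges, but most conceivable combinations of x, y and z have been ruled out.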
We can do the same trick with “short and fat” datasets. Using multivariate statistical approaches, such as projection (PCA or PLS), random forests or genetic algorithms, we can select combinations of the variables that are most likely to predict the outcome. In the Total Scientific dataset, the best combination selected in this way was much more powerful than the arbitrary composites routinely used in the literature.
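To see why a chosen combination can beat an arbitrary one, consider the simplest possible case. The sketch below is not PCA, PLS, a random forest or a genetic algorithm; it is just a weighted sum. It assumes independent unit-variance noise on every marker and hypothetical effect sizes that are taken as known, which is exactly what real trial data do not give you. The function name and the numbers are invented for illustration.

```python
import math


def weighted_effect_size(effects, weights):
    """Effect size of sum(w_i * marker_i), assuming independent unit-variance noise."""
    signal = sum(w * d for w, d in zip(weights, effects))
    noise_sd = math.sqrt(sum(w * w for w in weights))
    return signal / noise_sd


# Hypothetical standardised treatment effects for four markers.
effects = [0.8, 0.2, 0.2, 0.2]

# Arbitrary composite: every marker gets the same weight.
equal = weighted_effect_size(effects, [1, 1, 1, 1])

# Tuned composite: weight each marker in proportion to its own effect,
# which maximises the ratio under the stated assumptions (Cauchy-Schwarz).
tuned = weighted_effect_size(effects, effects)

print(f"best single marker: {max(effects):.3f}")
print(f"equal weights:      {equal:.3f}")
print(f"tuned weights:      {tuned:.3f}")
```

Under these assumptions the tuned composite beats the best single marker, while the equal-weight composite falls short of it. The catch is that the weights here were handed to us; in practice they have to be estimated from the same limited data.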
But even then there are pitfalls. When the dataset is “short and fat” it’s impossible to know precisely which combination is best; you can only identify the ‘kind of’ combinations that are likely to be best (just as with two equations and three variables you can only limit the possible solutions, not provide a single final solution). The one that looks the best in your particular dataset may only be good for that dataset and be very weak in the next trial that is run. Statisticians call that “overfitting the model”.
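A small simulation shows how easily this happens. It is a sketch under deliberately extreme assumptions: 20 subjects, 200 candidate markers, and no real signal at all, since every marker and the outcome are independent random numbers. The function names, sample sizes and seed are all invented for illustration. The marker that looks best in the first “trial” is then re-tested in a series of fresh trials.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_trial(rng, n_subjects, n_markers):
    """Pure noise: every marker and the outcome are independent standard normals."""
    outcome = [rng.gauss(0, 1) for _ in range(n_subjects)]
    markers = [[rng.gauss(0, 1) for _ in range(n_subjects)]
               for _ in range(n_markers)]
    return markers, outcome


rng = random.Random(0)          # fixed seed so the run is repeatable
n_subjects, n_markers = 20, 200  # a "short and fat" dataset

# First trial: pick the marker with the strongest apparent correlation.
markers, outcome = simulate_trial(rng, n_subjects, n_markers)
correlations = [pearson(m, outcome) for m in markers]
best = max(range(n_markers), key=lambda j: abs(correlations[j]))
in_sample = abs(correlations[best])

# Follow-up trials: re-test the same marker on fresh subjects, 50 times.
replications = []
for _ in range(50):
    new_markers, new_outcome = simulate_trial(rng, n_subjects, n_markers)
    replications.append(abs(pearson(new_markers[best], new_outcome)))
mean_replication = sum(replications) / len(replications)

print(f"best |r| in first trial:        {in_sample:.2f}")
print(f"mean |r| in follow-up trials:   {mean_replication:.2f}")
```

Nothing in this dataset predicts anything, yet the winning marker looks respectable in the trial that selected it and fades on re-testing. With real data the signal is not zero, but the same selection effect inflates whatever is there.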
With so much to go wrong when you start combining biomarkers, it is hardly surprising that it causes “avoidable failures” in clinical trials. And given the underlying mathematical complexity, there is little wonder that these ‘mistakes’ go almost entirely unnoticed.
But forewarned is forearmed. The perils of composite end-points may be even more subtle than the misuse of power calculations or the danger of an unexpectedly large placebo response, yet as the difficulty and cost of measuring more and more variables decline further they are set to become the single biggest cause of ‘avoidable failures’.
A composite score may be well established in the literature, used clinically, even accepted by regulators. But that doesn’t mean that one of the component markers might not be more powerful than the composite. Nor does it mean that it is the best (or even a good) algorithm for combining those measures. Even if the combination was validated in one patient group, in other groups different individual biomarkers may be responsible for different components of the signal and the noise, so what worked for them may not work for you.
In short, if you see a clinical trial using a composite end-point it is time to be cautious. “Big data” may be changing the landscape all around us, and in decades to come surely it will do so for clinical trials (and healthcare generally), but for now the level of quantitative skills in our industry remains too low to be confident that data is effectively being processed into information. Until the situation improves, composite end-points, leading to ‘avoidable failures’, are more likely to be your problem than your solution.