Athletes who test positive for a prohibited substance in both their A and B samples rarely, if ever, win their appeals cases, which often devolve into lawyers grasping at straws and drawing on desperate, irrelevant arguments in an attempt to get the case overturned.
But in a case heard ‘round the ski world, Estonian cross-country legend Andrus Veerpalu has done what seemed impossible: despite an “adverse analytical finding” (AAF) of recombinant human growth hormone (rhGH) in both his A and B samples, the Court of Arbitration for Sport (CAS) overturned his ban on a technicality.
And what a technicality it was. On Tuesday morning, CAS announced in a press release that the International Ski Federation (FIS), the governing body for skiing, which initially handled the case and was the respondent in Veerpalu’s appeal, “failed to meet the applicable standard of proof with respect to the procedure followed to set the aspects of the decision limits.”
In short, the test itself was valid, but the way that anti-doping administrators applied it was not. And it clearly pained CAS to have to rule in favor of Veerpalu.
“There were many factors in this case which tend to indicate that Andrus Veerpalu did himself administer exogenous hGH,” the press release noted, but continued on to say that there was insufficient statistical proof to ban him.
Has this ever happened before?
Gary Wadler, a New York doctor who used to work on the committee to select the World Anti-Doping Agency (WADA) Prohibited List, wasn’t familiar with the case, but he was certainly surprised.
“I have not heard of that, no,” he told FasterSkier of a successful appeal on the basis of math.
And the doctor who created the test, which compares ratios of differently structured molecules of hGH, and who had served as a witness in the CAS hearings, was no less flummoxed.
“I could not understand from the information I have what the criticism was,” Dr. Martin Bidlingmaier told FasterSkier in an interview from his clinical offices at the University Clinic of Munich. “I know the findings in this case, and in my opinion it looked quite clear.”
No matter who you ask, as long as they are not affiliated with Veerpalu, you’ll hear that the case was in the bag; Veerpalu was doping. So how could the case have gone wrong, with a test that has been peer-reviewed and can easily detect an unnatural ratio of the different forms of hGH?
The CAS ruling blamed not FIS, but WADA, which sets the procedures and standards that other anti-doping bodies must follow. WADA’s Senior Manager for Media Relations Terence O’Rorke has been unavailable for comment since the ruling announcement on Tuesday, and requests to the media e-mail account, as directed by his secretary, went unanswered. WADA Science Director Olivier Rabin was briefly reached by FasterSkier, but said he was on his way to a meeting and asked that an appointment be made through his secretary, who never answered the phone.
With no official word from WADA – which has not posted anything about the outcome of the case on its website – it’s tough to know why they chose to set the decision limits in the way that they did. But both the CAS decision and WADA’s guidelines for the application of the hGH test provide a few insights.
Isoforms and Ratios
Bidlingmaier has been studying growth hormones since the mid-1990s in a range of applications: diabetes. Exercise. Pregnancy. When it came to doping, he and his colleagues tackled what seemed like an unsolvable problem: how to detect recombinant hGH, which is in many ways identical to the natural kind. Making matters worse, hGH expression fluctuates in the body over both short and long time scales, making it impossible to employ any absolute cutoffs.
Bidlingmaier and an international team of researchers published a paper in the Journal of Clinical Endocrinology and Metabolism examining how the expression of hGH changed in the human body when an athlete began to administer the exogenous, or recombinant, variety.
In a normal, non-doping human, about 70 percent of hGH (the naturally occurring variety that is secreted by the pituitary gland and referred to as pithGH) is expressed in forms that are 191 amino acids long, known as 22-kDa isoforms. Much of the rest is in a form that has shorter, 176 amino acid chains, known as 20-kDa isoforms. There are also other forms of various shapes and structures. rhGH exists only as the 22-kDa isoform.
After subjects injected rhGH and exercised, serum concentrations of the 20-kDa and rarer isoforms dropped off to practically nothing. The high level of rhGH interferes with the body’s signaling and feedback mechanisms and tells it not to produce any more natural hGH. As a result, the proportion of 22-kDa isoforms, now supplied by the recombinant form, rises relative to the others.
In 1999, Bidlingmaier and three collaborators published a short report in leading medical journal The Lancet reporting that they had developed a test that compared the ratio of pithGH to rhGH. Over the next few years, more tests were developed.
The one used on Veerpalu’s samples was described in a 2009 article in Clinical Chemistry. As in previous tests, the researchers used mice that they raised either normally or on only rhGH. They then isolated antibodies that would bind specifically to one isoform of the hormones or the other. The test itself relies on two of these antibodies, which are labeled with acridinium ester, a chemical which luminesces.
A sample, then, is put in a vial with both of these antibodies. The amount of luminescence tells how strongly each is expressed, and the ratio is then calculated. If there are too few of the non-22-kDa isoforms, it is assumed that rhGH must be suppressing their production. The team verified this by establishing baseline values for the ratios in several groups of healthy subjects, then testing samples from 10 male and 10 female athletes who had been administered rhGH.
For peer review in a medical journal, this was enough to prove that the test worked – and CAS agreed, saying that the method was robust and reliable. Bidlingmaier’s job was done; it was WADA’s turn to run with his work.
Setting the Limits
Just because a test works doesn’t mean that it can immediately be put to use. Among other things, WADA had to decide what constituted a negative or a positive test. Even using a ratio, there had to be a line somewhere, and deciding where to set it would have a huge impact on future doping cases. The test, which consisted of two kits each containing one antibody of each type, was put to the test.
According to WADA’s own “Guidelines for hGH Isoform Differential Immunoassays” handbook, published in July 2010, as well as the CAS ruling, there were three trials to calibrate the test and set detection limits. The first was an “Initial Study” using samples from the 2009 IAAF Track and Field World Championships. Using these samples, WADA set a detection limit based on a lognormal distribution of the ratios in each sample.
Over the next year, WADA collected over 700 new samples from nine of its accredited laboratories around the world. In a point that is central to Veerpalu’s case, the organization decided to combine this data with the samples from the “Initial Study” to form a “Validation Study.” They also combined male and female samples, and African and Caucasian samples, all of which had been kept separate in the Initial Study.
At this point, WADA said in the CAS hearing, they decided that sex and race did not significantly impact a finding, even though in the Initial Study they had found that ratios were typically higher in the African samples.
The researchers had 801 male samples for the first kit and 142 for the second kit. They also decided that instead of using a lognormal distribution model to set a decision limit, they would use a gamma model. The original lognormal model, they said, was not acceptable.
What’s the difference? In some ways, not much; both cater to “right-skewed” datasets, with many observations close to zero and a long tail of less common observations at higher values. But the mathematical equations used to describe the distributions are different, and their shapes are not exactly the same.
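How much can the choice of curve matter? The following is a minimal Python sketch, not WADA's actual calculation. It gives a lognormal and a gamma distribution the same mean and standard deviation, then asks each where its 99.99th percentile falls (the specificity figure cited in the CAS ruling). The mean of 1.0 and standard deviation of 0.4 are invented for illustration, and every function name is hypothetical.

```python
import math
from statistics import NormalDist

def lognormal_quantile(mean, sd, p):
    """Quantile of a lognormal curve matched to the given mean and sd."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    mu = math.log(mean) - sigma2 / 2.0
    return math.exp(mu + math.sqrt(sigma2) * NormalDist().inv_cdf(p))

def gamma_pdf(x, shape, scale):
    """Gamma density; returning 0 at x <= 0 assumes shape > 1."""
    if x <= 0.0:
        return 0.0
    log_pdf = ((shape - 1.0) * math.log(x) - x / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)

def gamma_cdf(x, shape, scale, steps=2000):
    """Integrate the density from 0 to x with Simpson's rule (steps is even)."""
    if x <= 0.0:
        return 0.0
    h = x / steps
    total = gamma_pdf(0.0, shape, scale) + gamma_pdf(x, shape, scale)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * gamma_pdf(i * h, shape, scale)
    return total * h / 3.0

def gamma_quantile(mean, sd, p):
    """Quantile of a gamma curve matched to the given mean and sd, by bisection."""
    shape = (mean / sd) ** 2
    scale = sd ** 2 / mean
    lo, hi = 0.0, mean + 50.0 * sd
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if gamma_cdf(mid, shape, scale) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical inputs: NOT taken from WADA's data.
MEAN, SD, P = 1.0, 0.4, 0.9999
lognormal_limit = lognormal_quantile(MEAN, SD, P)
gamma_limit = gamma_quantile(MEAN, SD, P)
print(f"lognormal limit: {lognormal_limit:.2f}")
print(f"gamma limit:     {gamma_limit:.2f}")
```

With these made-up inputs the lognormal curve, which has the heavier right tail, puts its limit higher than the gamma curve does. Two models that describe the bulk of the data about equally well can still draw the line in different places.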
However, the conclusions were not much different than in the Initial Study, and the decision limits that they established were put into the July 2010 Guidelines. For kit 1, the limits of the ratio are 1.81 for males and 1.46 for females; for kit 2, the limits are 1.68 and 1.55.
(As reported in the original FIS ruling, Veerpalu’s A-sample values were 2.62 for kit 1 and 3.07 for kit 2; the B-sample values were 2.73 and 2.00. In other words, they were not on the border of the decision limit that WADA had identified, but rather quite extreme.)
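For readers who want the arithmetic, here is a short Python sketch that lines up the reported values against the male decision limits. The numbers are the ones quoted above from the guidelines and the FIS ruling; the variable and function names are hypothetical, and the sketch says nothing about how laboratories formally combine the two kits.

```python
# Male decision limits from WADA's July 2010 guidelines, as quoted in the article.
MALE_LIMITS = {"kit 1": 1.81, "kit 2": 1.68}

# Veerpalu's ratios as reported in the original FIS ruling.
SAMPLES = {
    "A": {"kit 1": 2.62, "kit 2": 3.07},
    "B": {"kit 1": 2.73, "kit 2": 2.00},
}

def margins(samples, limits):
    """Return {(sample, kit): ratio divided by that kit's decision limit}."""
    return {
        (sample, kit): ratio / limits[kit]
        for sample, kits in samples.items()
        for kit, ratio in kits.items()
    }

for (sample, kit), margin in margins(SAMPLES, MALE_LIMITS).items():
    print(f"{sample}-sample, {kit}: {margin:.2f} times the limit")
```

All four values sit above their limits; the closest call, the B-sample on kit 2, is still about 1.19 times the cutoff.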
A year later WADA conducted a second verification study, using over 1,000 samples and the gamma distribution. Their results showed that the decision limits previously set were conservative, had a relatively low risk of producing false positives, and would therefore not be unfair to clean athletes.
In a key point, WADA decided to exclude any samples with very high values from either Validation Study. In some cases, these were samples that they knew had rhGH, but in others they were simply outliers that did not fit well within the distribution.
This was one of Veerpalu’s team’s main arguments: that by excluding samples with high ratios, even though they may have come from clean athletes, WADA set itself up to have a tight decision limit and constrained the amount of variation that could be considered clean.
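The logic of that argument can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reconstruction of WADA's procedure: the "clean" ratios are simulated from an invented lognormal curve, the cutoff is a simple empirical 99th percentile rather than a fitted model, and the function names are hypothetical.

```python
import random
from statistics import quantiles

def percentile_99(values):
    """Empirical 99th percentile: the last of the 99 cut points."""
    return quantiles(values, n=100)[-1]

def drop_highest(values, fraction):
    """Discard the highest `fraction` of the values, keeping the rest sorted."""
    keep = len(values) - round(len(values) * fraction)
    return sorted(values)[:keep]

rng = random.Random(2013)  # fixed seed so the sketch is repeatable

# Hypothetical ratios from 1,000 clean athletes (right-skewed, invented parameters).
clean_ratios = [rng.lognormvariate(-0.5, 0.5) for _ in range(1000)]

limit_all = percentile_99(clean_ratios)
limit_trimmed = percentile_99(drop_highest(clean_ratios, 0.02))

# Share of the clean athletes who would exceed each cutoff.
flagged_all = sum(r > limit_all for r in clean_ratios) / len(clean_ratios)
flagged_trimmed = sum(r > limit_trimmed for r in clean_ratios) / len(clean_ratios)

print(f"limit from all samples:      {limit_all:.2f} (flags {flagged_all:.1%})")
print(f"limit after dropping top 2%: {limit_trimmed:.2f} (flags {flagged_trimmed:.1%})")
```

Dropping the highest values always pulls the cutoff down, so more of the clean population lands above it. Whether the samples WADA excluded really were clean is exactly what the two sides disputed.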
Furthermore, the team claimed that the distribution of ratios from the samples can’t be modeled parametrically at all – why else would there be these troublesome outliers? They pointed to the rejection of the lognormal distribution in the Validation Study as evidence that no distributions truly fit the data. And WADA itself admitted: the gamma distribution fit the data well at high values, but not at low values. It wasn’t a perfect match.
The lawyers brought up, too, the question of why WADA lumped all the ethnicities together, when there were clearly – at first, at least – differences among them. It was likely an issue of sample size, but even after combining all the samples Veerpalu’s team argued that the 142 samples for kit 2 were not enough to fit a solid distribution curve. And finally, why did they not use any known positive samples to test their decision points? Why use only samples that they knew were not from dopers?
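The sample-size complaint is easy to illustrate with a simulation. The Python sketch below is hypothetical throughout: it invents a lognormal "truth," repeatedly draws 142 or 801 clean ratios from it (the two sample counts mentioned above), fits a lognormal curve to each draw, and records where the 99.99th percentile lands. A lognormal fit stands in here for whichever model is used; the function names and parameters are assumptions.

```python
import math
import random
from statistics import NormalDist

Z_9999 = NormalDist().inv_cdf(0.9999)  # normal quantile for 99.99 percent

def mean_and_sd(values):
    """Sample mean and sample standard deviation."""
    n = len(values)
    m = math.fsum(values) / n
    var = math.fsum((v - m) ** 2 for v in values) / (n - 1)
    return m, math.sqrt(var)

def fitted_limit(ratios):
    """Fit a lognormal by the mean and sd of the logs; return its 99.99th percentile."""
    mu, sigma = mean_and_sd([math.log(r) for r in ratios])
    return math.exp(mu + Z_9999 * sigma)

def limit_spread(sample_size, repeats, rng):
    """Standard deviation of the fitted limit across repeated simulated studies."""
    limits = []
    for _ in range(repeats):
        # Invented population: lognormal with mu=0.0, sigma=0.4.
        sample = [rng.lognormvariate(0.0, 0.4) for _ in range(sample_size)]
        limits.append(fitted_limit(sample))
    return mean_and_sd(limits)[1]

rng = random.Random(142)  # fixed seed so the sketch is repeatable
spread_142 = limit_spread(142, 200, rng)
spread_801 = limit_spread(801, 200, rng)
print(f"spread of limit with 142 samples: {spread_142:.3f}")
print(f"spread of limit with 801 samples: {spread_801:.3f}")
```

The limit estimated from 142 samples wobbles noticeably more from one simulated study to the next than the one estimated from 801, which is the heart of the objection: an extreme percentile pinned down by a small dataset is a shaky place to draw a line.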
In response to all of this, WADA acquitted itself poorly. Always reluctant to explain how it develops tests due to a fear of dopers listening in and “beating the system,” the organization answered only what it was asked, and sometimes barely that. At one point the CAS ruling notes that “The Respondent has not provided evidence or further information on the other distributions considered.”
Basically, instead of explaining what they did, WADA said: don’t worry. We did it right. The group’s scientists even backtracked and changed their explanations of the distribution models at the very last hearing.
CAS was unconvinced:
“Despite the Respondent’s ample opportunities to convince the Panel on the correctness of the decision limits including in the post-Hearing brief as well as in response to the two subsequent rounds of Panel Questions, the Panel cannot exclude to its comfortable satisfaction that the decision limits are overinclusive and could lead to an excessive amount of false positive results (beyond the claimed specificity of 99.99%). Although the Panel has found that the Test itself is undoubtedly reliable (as explained in Section 5 (‘The Reliability of the Test’)), the Panel finds that the following factors prevent it from concluding that the decision limits are equally reliable: (1) The inappropriate exclusion of certain sample data from the dataset; (2) the small sample sizes; and (3) the data provided on the distribution models used.”
Fallout and Failure
What is somewhat astounding about this case is that providing more detailed explanations of the statistics used to set the decision limits would likely not have hurt WADA. The organization’s continual refusal to share the chemical makeup of its kits or the techniques used in detecting banned substances is understandable, albeit frustrating at times.
WADA publishes the numbers that it uses as limits for these ratios; why not talk about the statistics? Knowing which type of data distribution was used to set a limit is unlikely to help a doper in any way. What it could do, however, is instill more confidence in WADA itself, from athletes across the spectrum.
After all, even if everyone is convinced that Veerpalu is guilty, anti-doping efforts have to be science, not a witch hunt. To ban an athlete you need evidence, not a hunch.
“The Panel accepts that a dynamic approach to testing may be desirable and acceptable, particularly in the field of anti-doping,” CAS wrote. “However, the Panel must bear in mind the seriousness of the allegations made against the Appellant when assessing whether it is satisfied that the decision limits with regard to hGH have been correctly determined.”
Lost in all of this is FIS itself, which should have had a slam-dunk case against a prominent athlete. When the CAS decision came out, FIS did not publicize any comments – instead they posted a brief, few-sentence announcement on their website with links to the CAS press release and ruling. Short of blaming WADA, they nonetheless passively pointed a finger at the larger body which gives them its rules and guidelines.
“The Panel found that the Decision Limits of the Test for the substance as published in the WADA Regulations were unreliable and therefore his appeal against the decision of the FIS Doping Panel to sanction him is upheld,” the statement read.
As FasterSkier reported previously, FIS Secretary General Sarah Lewis said that “there is nothing else to add.”
As for Bidlingmaier, who developed a test that he thought would catch people just like Veerpalu, he’s still marveling at how WADA failed to do so.
“All I know is from the media and the website of CAS,” he told FasterSkier. “As far as I understood, the decision made very clear that the test itself is scientifically sound and there were no doubts with respect to all the labwork. The discussion arose only around the statistical model used around the decisions they make, between positive and negative… I’m a physician and a biochemist, I’m not a statistician, and it’s their world, their arguments. I have no idea if it’s valid or not.”
Chelsea Little is FasterSkier's Editor-At-Large. A former racer at Ford Sayre, Dartmouth College and the Craftsbury Green Racing Project, she is a PhD candidate in aquatic ecology in the @Altermatt_lab at Eawag, the Swiss Federal Institute of Aquatic Science and Technology in Zurich, Switzerland. You can follow her on Twitter @ChelskiLittle.