How WADA Dropped the Ball on the Veerpalu Doping Case

Chelsea Little | March 27, 2013
Andrus Veerpalu competing in Muonio, Finland, in the fall of 2010, as he started what would be his last season in the sport.

Athletes who test positive for a prohibited substance in both their A and B samples rarely, if ever, win their appeals cases, which often devolve into lawyers grasping at straws and drawing on desperate, irrelevant arguments in an attempt to get the case overturned.

But in a case heard ‘round the ski world, Estonian cross-country legend Andrus Veerpalu has done what seemed impossible: despite an “adverse analytical finding” (AAF) of recombinant human growth hormone (rhGH) in both his A and B samples, the Court of Arbitration for Sport (CAS) overturned his ban on a technicality.

And what a technicality it was. On Tuesday morning, CAS announced in a press release that the International Ski Federation (FIS), the governing body for skiing, which initially handled the case and was the respondent in Veerpalu’s appeal, “failed to meet the applicable standard of proof with respect to the procedure followed to set the aspects of the decision limits.”

In short, the test itself was valid, but the way that anti-doping administrators applied it was not. And it clearly pained CAS to have to rule in favor of Veerpalu.

“There were many factors in this case which tend to indicate that Andrus Veerpalu did himself administer exogenous hGH,” the press release noted, but continued on to say that there was insufficient statistical proof to ban him.

Has this ever happened before?

CAS is sure that Veerpalu administered some form of hGH, perhaps like Genotropin, shown here; they just can’t prove it. Photo: Hunted230 via Wikimedia Commons.

Gary Wadler, a New York doctor who used to work on the committee to select the World Anti-Doping Agency (WADA) Prohibited List, wasn’t familiar with the case, but he was certainly surprised.

“I have not heard of that, no,” he told FasterSkier of a successful appeal on the basis of math.

And the doctor who created the test, which compares the ratios of differently-structured molecules of hGH, and who served as a witness in the CAS hearings, was no less flummoxed.

“I could not understand from the information I have what the criticism was,” Dr. Martin Bidlingmaier told FasterSkier in an interview from his clinical offices at the University Clinic of Munich. “I know the findings in this case, and in my opinion it looked quite clear.”

No matter who you ask, as long as they are not affiliated with Veerpalu, you’ll hear the same conclusion: Veerpalu was doping. So how could the case have gone wrong, with a test that has been peer-reviewed and can easily detect an unnatural ratio of the different forms of hGH?

The CAS ruling blamed not FIS, but WADA, which sets the procedures and standards that other anti-doping bodies must follow. WADA’s Senior Manager for Media Relations Terence O’Rorke has been unavailable for comment since the ruling announcement on Tuesday, and requests to the media e-mail account, as directed by his secretary, went unanswered. WADA Science Director Olivier Rabin was briefly reached by FasterSkier, but said he was on his way to a meeting and to make an appointment with his secretary, who never answered the phone.

With no official word from WADA – which has not posted anything about the outcome of the case on its website – it’s tough to know why they chose to set the decision limits in the way that they did. But both the CAS decision and WADA’s guidelines for the application of the hGH test provide a few insights.

Isomers and Ratios

Bidlingmaier has been studying growth hormone since the mid-1990s in a range of contexts: diabetes, exercise, pregnancy. When it came to doping, he and his colleagues tackled what seemed like an unsolvable problem: how to detect recombinant hGH, which is in many ways identical to the natural kind. Making matters worse, hGH expression fluctuates in the body over both short and long time scales, making it impossible to employ any absolute cutoffs.

The structure of the most common isomer of hGH. Image: WikiMedia Commons

Bidlingmaier and an international team of researchers published a paper in the Journal of Clinical Endocrinology and Metabolism examining how the expression of hGH changed in the human body when an athlete began to administer the exogenous, or recombinant, variety.

In a normal, non-doping human, about 70 percent of hGH — the naturally-occurring variety that is secreted by the pituitary gland and referred to as pithGH — is expressed in forms that are 191 amino acids long, known as 22-kDa isomers. Much of the rest is in a form with shorter, 176-amino-acid chains, known as 20-kDa isomers. There are also other forms of various shapes and structures. rhGH exists only as 22-kDa isomers.

After subjects injected rhGH and exercised, serum concentrations of the 20-kDa and rarer isoforms dropped off to practically nothing. The high level of rhGH interferes with the body’s signaling and feedback mechanisms and tells it not to produce any more natural hGH. As a result, the proportion of 22-kDa isomers, now dominated by the recombinant form, rises.

In 1999, Bidlingmaier and three collaborators published a short paper in the leading medical journal The Lancet reporting that they had developed a test that compared the ratio of pithGH to rhGH. Over the next few years, more tests were developed.

The one used on Veerpalu’s samples was described in a 2009 article in Clinical Chemistry. As in previous tests, the researchers used mice that they raised either normally or on only rhGH. They then isolated antibodies that would bind specifically to one isoform of the hormones or the other. The test itself relies on two of these antibodies, which are labeled with acridinium ester, a chemical which luminesces.

A sample, then, is put in a vial with both of these antibodies. The amount of luminescence tells how strongly each isoform is expressed, and the ratio is then calculated. If there are too few of the non-22-kDa isomers, it is assumed that rhGH must be suppressing their production. The team verified this by establishing baseline values for the ratios in several groups of healthy subjects, then testing samples from 10 male and 10 female athletes who had been administered rhGH.
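
To make the arithmetic concrete, here is a minimal sketch, in Python, of the ratio logic the test relies on. The signal values and variable names are hypothetical illustrations, not the actual kit software; only the 1.81 decision limit (kit 1, male samples, discussed below) comes from WADA’s published guidelines.

```python
# A minimal sketch (not WADA's kit software) of the isoform-ratio logic.
# The readings and names are hypothetical; only the 1.81 limit
# (kit 1, males) comes from WADA's July 2010 guidelines.

def isoform_ratio(rec_signal: float, pit_signal: float) -> float:
    """Ratio of recombinant-specific to pituitary-specific luminescence."""
    if pit_signal <= 0:
        raise ValueError("pituitary-specific signal must be positive")
    return rec_signal / pit_signal

# Hypothetical luminescence readings from the two labeled antibodies:
rec_signal = 5.2   # antibody binding the 22-kDa (recombinant-type) isoform
pit_signal = 2.0   # antibody binding the pituitary-derived isoforms

ratio = isoform_ratio(rec_signal, pit_signal)
DECISION_LIMIT = 1.81  # kit 1, male samples, July 2010 guidelines

print(f"ratio = {ratio:.2f}; adverse finding: {ratio > DECISION_LIMIT}")
```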

For peer review in a medical journal, this was enough to prove that the test worked – and CAS agreed, saying that the method was robust and reliable. Bidlingmaier’s job was done; it was WADA’s turn to run with his work.

Setting the Limits

Just because a test works doesn’t mean that it can immediately be put to use. Among other things, WADA had to decide what constituted a negative or a positive result. Even using a ratio, there had to be a line somewhere, and deciding where to set it would have a huge impact on future doping cases. The test, which consists of two kits, each containing one antibody of each type, now had to be calibrated.

According to WADA’s own “Guidelines for hGH Isoform Differential Immunoassays” handbook, published in July 2010, as well as the CAS ruling, there were three trials to calibrate the test and set detection limits. The first was an “Initial Study” using samples from the 2009 IAAF Track and Field World Championships. Using these samples, WADA set a detection limit based on a lognormal distribution of the ratios in each sample.

Examples of the gamma (a) and lognormal (b) distributions. Note the differences in shape on both the left and right sides of the distributions.

Over the next year, WADA collected over 700 new samples from nine of its accredited laboratories around the world. In a point that is central to Veerpalu’s case, the organization decided to combine this data with the samples from the “Initial Study” to form a “Validation Study.” They also combined male and female samples, and African and Caucasian samples, all of which had been kept separate in the Initial Study.

At this point, WADA said in the CAS hearing, they decided that sex and race did not significantly impact a finding, even though in the Initial Study they had found that ratios were typically higher in the African samples.

The researchers had 801 male samples for the first kit and 142 for the second kit. They also decided that instead of using a lognormal distribution model to set a decision limit, they would use a gamma model. The original lognormal model, they said, was not acceptable.

What’s the difference? In some ways, not much; both cater to “right-skewed” datasets, with many observations close to zero and a long tail of less common observations at higher values. But the mathematical equations used to describe the distributions are different, and their shapes are not exactly the same.
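
To see how the choice of model can move a decision limit, here is a rough sketch using synthetic data. The scipy fitting routines stand in for whatever procedure WADA actually used, and none of the numbers correspond to real reference samples; only the idea of taking an extreme upper quantile as the limit follows the approach described above.

```python
# A sketch of why the choice of distribution matters for a decision limit.
# The synthetic data below stand in for the (unavailable) WADA reference
# samples; only the procedure, not the numbers, is meaningful.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical clean-athlete ratios: right-skewed, mostly well below 1.
ratios = rng.lognormal(mean=-0.5, sigma=0.35, size=801)

# Fit both candidate models to the same data.
ln_shape, ln_loc, ln_scale = stats.lognorm.fit(ratios, floc=0)
g_shape, g_loc, g_scale = stats.gamma.fit(ratios, floc=0)

# A decision limit aiming at 99.99% specificity is the 99.99th percentile
# of the fitted clean-population model.
q = 0.9999
limit_lognorm = stats.lognorm.ppf(q, ln_shape, ln_loc, ln_scale)
limit_gamma = stats.gamma.ppf(q, g_shape, g_loc, g_scale)

print(f"lognormal limit: {limit_lognorm:.2f}")
print(f"gamma limit:     {limit_gamma:.2f}")
# The two fits agree in the bulk of the data but can diverge noticeably
# in the extreme upper tail -- exactly where the decision limit lives.
```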

However, the conclusions were not much different from those of the Initial Study, and the decision limits that were established went into the July 2010 Guidelines. For kit 1, the limits of the ratio are 1.81 for males and 1.46 for females; for kit 2, the limits are 1.68 and 1.55.

(As reported in the original FIS ruling, Veerpalu’s A-sample values were 2.62 for kit 1 and 3.07 for kit 2; the B-sample values were 2.73 and 2.00. In other words, they were not borderline, but well above the decision limits WADA had identified.)

A year later WADA conducted a second verification study, using over 1,000 samples and the gamma distribution. Their results showed that the decision limits previously set were conservative, had a relatively low risk of producing false positives, and would therefore not be unfair to clean athletes.
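
Conceptually, a verification study of this kind applies the already-fixed limits to fresh, presumed-clean samples and counts how many would be flagged. A toy version, with synthetic data standing in for the real samples; only the 1.81 limit is taken from the published guidelines:

```python
# A sketch of what a verification study checks: apply the already-fixed
# decision limit to fresh, presumed-clean samples and count how often it
# would flag them. The data here are synthetic.
import numpy as np

rng = np.random.default_rng(3)
new_samples = rng.lognormal(mean=-0.5, sigma=0.35, size=1000)  # hypothetical clean ratios

DECISION_LIMIT = 1.81  # kit 1, males, from the July 2010 guidelines
false_positives = int(np.sum(new_samples > DECISION_LIMIT))
print(f"false positives: {false_positives} / {new_samples.size} "
      f"({100 * false_positives / new_samples.size:.2f}%)")
```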

Inappropriate Science?

In a key point, WADA decided to exclude any samples with very high values from either Validation Study. In some cases, these were samples that they knew had rhGH, but in others they were simply outliers that did not fit well within the distribution.

This was one of Veerpalu’s team’s main arguments: that by excluding samples with high ratios, even though they may have come from clean athletes, WADA set itself up to have a tight decision limit and constrained the amount of variation that could be considered clean.
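
A toy simulation makes the objection concrete: trimming the top of a clean dataset before fitting shrinks the fitted tail, which in turn pulls down any quantile-based decision limit. Everything below is synthetic and illustrative only; it is not a reconstruction of WADA’s actual exclusion procedure.

```python
# A toy simulation of the defense's argument: discarding the highest
# clean-sample ratios before fitting shrinks the fitted tail and pulls
# the decision limit down. All numbers here are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
clean_ratios = rng.gamma(shape=4.0, scale=0.15, size=800)  # synthetic clean data

def gamma_limit(data, q=0.9999):
    shape, loc, scale = stats.gamma.fit(data, floc=0)
    return stats.gamma.ppf(q, shape, loc, scale)

full_limit = gamma_limit(clean_ratios)
# Exclude the top 1% as "outliers," as the defense argued WADA did.
trimmed = np.sort(clean_ratios)[: int(0.99 * len(clean_ratios))]
trimmed_limit = gamma_limit(trimmed)

print(f"limit from full data: {full_limit:.2f}")
print(f"limit after trimming: {trimmed_limit:.2f}")  # systematically lower
```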

Furthermore, the team claimed that the distribution of ratios from the samples cannot be modeled parametrically at all – why else would there be these troublesome outliers? They pointed to the rejection of the lognormal distribution in the Validation Study as evidence that no distribution truly fits the data. And WADA itself admitted that the gamma distribution fit the data well at high values, but not at low values. It wasn’t a perfect match.

The lawyers brought up, too, the question of why WADA lumped all the ethnicities together, when there were clearly – at first, at least – differences among them. It was likely an issue of sample size, but even after combining all the samples Veerpalu’s team argued that the 142 samples for kit 2 were not enough to fit a solid distribution curve. And finally, why did they not use any known positive samples to test their decision points? Why use only samples that they knew were not from dopers?
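
The sample-size complaint can also be illustrated with a quick bootstrap: with only on the order of 142 observations, an estimate of a 99.99th-percentile limit swings widely from resample to resample. Again, the data and parameters below are hypothetical; this is a sketch of the statistical point, not of anything WADA computed.

```python
# A sketch of the small-sample objection: with ~142 samples, an estimate
# of a 99.99th-percentile decision limit is very noisy. Data and
# parameters are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.lognormal(mean=-0.5, sigma=0.35, size=142)  # stand-in for kit 2 data

def lognorm_limit(data, q=0.9999):
    s, loc, scale = stats.lognorm.fit(data, floc=0)
    return stats.lognorm.ppf(q, s, loc, scale)

# Bootstrap: refit on resampled datasets and see how much the limit varies.
limits = [lognorm_limit(rng.choice(sample, size=sample.size, replace=True))
          for _ in range(1000)]
lo, hi = np.percentile(limits, [2.5, 97.5])
print(f"95% bootstrap interval for the decision limit: [{lo:.2f}, {hi:.2f}]")
```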

In response to all of this, WADA acquitted itself poorly. Always reluctant to explain how it develops tests, for fear of dopers listening in and “beating the system,” the organization answered only what it was asked, and sometimes barely that. At one point the CAS ruling notes that “The Respondent has not provided evidence or further information on the other distributions considered.”

Basically, instead of explaining what they did, WADA said: don’t worry. We did it right. The group’s scientists even backtracked and changed their explanations of the distribution models at the very last hearing.

CAS was unconvinced:

“Despite the Respondent’s ample opportunities to convince the Panel on the correctness of the decision limits including in the post-Hearing brief as well as in response to the two subsequent rounds of Panel Questions, the Panel cannot exclude to its comfortable satisfaction that the decision limits are overinclusive and could lead to an excessive amount of false positive results (beyond the claimed specificity of 99.99%). Although the Panel has found that the Test itself is undoubtedly reliable (as explained in Section 5 (‘The Reliability of the Test’)), the Panel finds that the following factors prevent it from concluding that the decision limits are equally reliable: (1) The inappropriate exclusion of certain sample data from the dataset; (2) the small sample sizes; and (3) the data provided on the distribution models used.”

Fallout and Failure

What is somewhat astounding about this case is that providing more detailed explanations of the statistics used to set the decision limits would likely not have hurt WADA. The organization’s continual refusal to share the chemical makeup of its kits or the techniques used in detecting banned substances is understandable, albeit frustrating at times.

WADA publishes the numbers that it uses as limits for these ratios; why not talk about the statistics? Knowing which type of data distribution was used to set a limit is unlikely to help a doper in any way. What it could do, however, is instill more confidence in WADA itself among athletes across the spectrum.

After all, even if everyone is convinced that Veerpalu is guilty, anti-doping efforts have to be science, not a witch hunt. To ban an athlete you need evidence, not a hunch.

“The Panel accepts that a dynamic approach to testing may be desirable and acceptable, particularly in the field of anti-doping,” CAS wrote. “However, the Panel must bear in mind the seriousness of the allegations made against the Appellant when assessing whether it is satisfied that the decision limits with regard to hGH have been correctly determined.”

Lost in all of this is FIS itself, which should have had a slam-dunk case against a prominent athlete. When the CAS decision came out, FIS did not publicize any comments – instead, it posted a brief, few-sentence announcement on its website with links to the CAS press release and ruling. Without explicitly blaming WADA, it nonetheless passively pointed a finger at the larger body whose rules and guidelines it follows.

“The Panel found that the Decision Limits of the Test for the substance as published in the WADA Regulations were unreliable and therefore his appeal against the decision of the FIS Doping Panel to sanction him is upheld,” the statement read.

As FasterSkier reported previously, FIS Secretary General Sarah Lewis said that “there is nothing else to add.”

As for Bidlingmaier, who developed a test that he thought would catch people just like Veerpalu, he’s still marveling at how WADA failed to do so.

“All I know is from the media and the website of CAS,” he told FasterSkier. “As far as I understood, the decision made very clear that the test itself is scientifically sound and there were no doubts with respect to all the labwork. The discussion arose only around the statistical model used around the decisions they make, between positive and negative… I’m a physician and a biochemist, I’m not a statistician, and it’s their world, their arguments. I have no idea if it’s valid or not.”

Chelsea Little

Chelsea Little is FasterSkier's Editor-At-Large. A former racer at Ford Sayre, Dartmouth College and the Craftsbury Green Racing Project, she is a PhD candidate in aquatic ecology in the @Altermatt_lab at Eawag, the Swiss Federal Institute of Aquatic Science and Technology in Zurich, Switzerland. You can follow her on twitter @ChelskiLittle.


8 comments

  • saakal

    March 27, 2013 at 3:05 pm

    I have no idea if it’s valid or not – then how can you say that the test actually works?

  • brainscauseminds

    March 27, 2013 at 8:57 pm

    I think the Veerpalu case is a perfect example of how statistics are misused in practice, as WADA did in this case. The main problem with the test was the lack of true positive examples and the removal of unknown “extreme” values as outliers in order to fit the data, which would have been impossible otherwise. However, in the world of statistics, this is not the correct thing to do and therefore renders the test invalid, as the assumptions are wrong. Actually, it would be the correct thing if we were sure that all humans are very similar with respect to hGH levels. But this would be yet another assumption with no solid proof. Also, it would require the distribution to fit really well, which was not the case with the Validation studies.

    Therefore I do not agree with the claim that “Veerpalu probably doped, but won due to procedural flaws,” as actually we have no idea where these decision limits are. BUT, maybe we will know in the future, if WADA fixes the statistics by collecting more data and reassesses the Veerpalu case. I am fairly certain that at least the B-test would come out negative and therefore prove that Veerpalu *did not* dope.

    WADA probably rushed with collecting the data and doing the statistics, as they needed to come out with something new, and I think they hoped to cut costs and fix the statistics later, as the full study might not have been financially feasible. Hopefully they do not take shortcuts that backfire in the future.

    I tried to read the whole documentation of the case, and as an Estonian I had an elevated interest in the details. But I am no statistician, so if I made any errors, please correct me! But a good and detailed post!
    Cheers!

  • Chelsea Little

    March 28, 2013 at 2:06 am

    Hi Brainscauseminds,

    One piece of information I neglected to include in the original article was Veerpalu’s test values themselves, which led CAS to draw the conclusion that he had almost certainly doped. I have added them to the article, and here:

    “For kit 1, the limits of the ratio are 1.81 for males and 1.46 for females; for kit 2, the limits are 1.68 and 1.55. (As reported in the original FIS ruling, Veerpalu’s A-sample values were 2.62 for kit 1 and 3.07 for kit 2; the B-sample was 2.73 and 2.00, in other words, not on the border of this decision point that WADA had identified but rather quite extreme.)”

  • luida

    March 28, 2013 at 9:18 am

    According to information from Veerpalu’s defense team, the account presented here of the development of the test is not correct. It seems that WADA had used just a normal distribution model in setting the limits. As there cannot exist negative values of the ratio, a normal distribution model (which would also allow negative values) cannot be used. After Veerpalu’s team brought that up at CAS, WADA changed their statement and claimed that they used a lognormal distribution model. Recalculation by Veerpalu’s team showed that the limits acquired by a lognormal model would be bigger than Veerpalu’s test results. After this, WADA changed their model again and claimed that they actually used a gamma distribution model (which again does not fit the data very well). So the story seems to be quite complicated and confusing (mainly from the side of WADA).

    “Furthermore, Kõks said that in the researchers’ correspondence with WADA, the organization repeatedly changed its responses regarding its test’s methodology. Initially, the Tartu researchers said, WADA claimed to use normal distribution, a mathematical model. When the Tartu researchers found that normal distribution could not be used for such analysis, WADA said they actually used lognormal distribution. When that was found inapplicable, WADA said it used gamma distribution.”

    http://news.err.ee/sports/7b580679-cce8-4995-964e-7c99f4ad09a5

    In light of this information, the limits published so far are according to the normal distribution model. It would be interesting to know what the limits calculated by a lognormal or gamma distribution model are, and how far Veerpalu’s test results are from those limits.

  • brainscauseminds

    March 28, 2013 at 9:50 am

    Another possibility would have been to use density estimation to model the distribution, but I guess they did not have enough data for that. Further, they still would have had trouble estimating the correct cut-off, where the ratio of “guilty” to “non-guilty” would have been large enough, because just assuming outliers to be true positive “dopers” is not sufficient in my opinion.

    But I hope we can make stronger claims about this case in the future. We just need more data about known true positives and true negatives.

  • highstream

    March 28, 2013 at 12:29 pm

    I would think this decision and the explanation throws HGH testing into disarray for now. More important, it takes some of the sheen off of WADA’s reputation, as it should.

    Then there’s the relationship between HGH and performance, which as far as I know remains a matter of speculation unconfirmed by human studies.

  • Tim Kelley

    March 29, 2013 at 1:25 pm

    I work with process statistical analysis. If I “exclude[d] samples with very high values”, like WADA did, in coming up with baselines for decision making and if I had no good explanation for doing this, as with WADA … I’d be out of a job. Throwing out numbers just to make your results look good is deceptive and unprofessional. It’s called “cooking the numbers”. I’m not arguing that Veerpalu is innocent. But it seems that WADA needs smarter people working for them if they are attempting to craft policies that affect many peoples’ lives.

  • rational

    March 30, 2013 at 4:17 pm

    Let’s look at some excerpts:
    “hGH expression fluctuates in the body over both short and long time scales, making it impossible to employ any absolute cutoffs.”

    If that were true and it were impossible to set any lower bound above zero on hGH expression, then the denominator of the ratio would approach zero or could even be zero. If the denominator of the ratio approaches zero (and is larger than the numerator) or is zero, then the value of the ratio either reaches infinity or is indeterminate. If both the denominator and numerator are zero or close to zero, then we have a similar problem.

    “After injecting rhGH and exercising, serum concentrations of the 20-kDa and rarer isoforms dropped off to practically nothing.”

    Again, it seems like the denominator of the ratio either approaches zero or could possibly be zero. It would be impossible to set any cutoff value for such a ratio. Even if we disregard the distribution statistics, this is elementary school stuff. WADA really needs to show that the denominator neither approaches zero nor equals zero. Or use a different kind of test altogether.
    And I agree with Kelley that the way WADA excluded outliers looks highly dubious.

    Disclaimer: I have not read through the case documents, nor the Bidlingmaier papers.
    Another disclaimer: I am an Estonian, not part of the Veerpalu team.

