Judging bias on the Grand Prix post CoP revision (numbers!) | Page 2 | Golden Skate


ribbit

On the Ice
Joined
Nov 9, 2014
Interesting and important topic :agree:
However, it seems to me you missed the elephant in the room. While the GOE/PCS judges are certainly important, they can only influence the score so much - there is a built-in averaging mechanism in the system for a reason. But there are judges who are much, much more important than the ones in your database: the tech callers. Their multiple calls, fake calls, or ignored mistakes, combined with their nationality, would be far more interesting and important data, considering the huge influence they have on the score - a call practically forces all the regular judges to lower their scores significantly, and vice versa.
So without that data I can't see this analysis as being as complete as it should be. Moreover, the tech caller data should be done first, IMO - they are that important. Unfortunately, questionable calls (and questionable non-calls) are numerous. It often happens that the same callers at the same event are very strict with one skater and very lax with another - literally arranging podium placements as they see fit. This situation is very concerning and should be addressed ASAP.

But tech calls (levels, downgrades, underrotations, !, e, etc.) don't lend themselves to this kind of statistical analysis. Classifying calls as "fake" or "questionable" or "ignoring mistakes" is inherently subjective, for the same reasons that those calls themselves are necessarily subjective. There are many well-known reasons why it's difficult for audiences to judge tech callers' decisions: tech callers see much less footage from much more limited camera angles than we do; tech callers are more limited in the replays available to them than we are; the crucial moment may be missing from a frame-by-frame replay, etc. And tech calls don't generate data that can be readily translated into a spreadsheet. (I don't mean to downplay the enormous amount of work that Shanshani put into these analyses, only to point out that s/he is working with pre-existing numbers that don't have to be reassessed before s/he can import them into a spreadsheet.)

To do the same for tech calls, you'd first have to recruit a panel of well-qualified posters (to avoid simply ending up with a second subjective viewpoint) to review every single tech call in every single program and judge whether that call was correct. You'd have to choose a standard for each poster to apply in judging each call (beyond a reasonable doubt? preponderance of the evidence?) and set the rules for classifying a call as incorrect and determining the correct call (majority rule? consensus? unanimous vote?) Only after this extensive review would you have the raw data to feed into a spreadsheet. I agree with you that a biased tech caller can have an enormous effect on an event, but objectively quantifying and statistically analyzing what actual tech callers are actually doing is probably too much to ask of any individual.
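For what it's worth, the decision rules mentioned above (majority rule, consensus, unanimous vote) could at least be pinned down precisely before any such review started. A minimal sketch in Python, purely illustrative - the verdict labels and rule names are made up for the example:

```python
from collections import Counter

def adjudicate(verdicts, rule="majority"):
    """Combine reviewer verdicts on a single tech call.

    verdicts: list of strings, e.g. ["correct", "incorrect", "correct"]
    rule: "majority" -> most common verdict wins (ties are "undecided");
          "unanimous" -> all reviewers must agree, else "undecided".
    """
    counts = Counter(verdicts)
    if rule == "unanimous":
        return verdicts[0] if len(counts) == 1 else "undecided"
    (top, n), *rest = counts.most_common()
    if rest and rest[0][1] == n:  # tie for the top verdict
        return "undecided"
    return top
```

Whatever rule is chosen, the point is that it has to be fixed in advance - otherwise the reviewed calls are just a second layer of subjective judgment.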
 

Shanshani

On the Ice
Joined
Mar 21, 2018
Ribbit is correct, there are significant difficulties with analyzing tech calls as they outline, since it would inevitably involve the analyst making judgment calls on specific cases, whereas the idea behind what I'm doing is to apply the same formula to everyone independent of my own judgment about the correctness or incorrectness of scores and see what the formulas produce. However, I would be interested in seeing an attempt to comprehensively analyze tech calls, even if imperfect. :)

I would like to note though that there is some cross pollination between tech panels and judging panels. So we may be able to get indirect data on tech callers by looking at their judging records. If someone exhibits bias in their judging record, then there is a good chance they are not an unbiased tech caller as well.
 

4everchan

Record Breaker
Joined
Mar 7, 2015
Country
Martinique
this is a lot of work. However, for now, it doesn't mean much, as the sample is too small...

I will give just one example... Cze judge, has -11.... judging ONCE ONE Cze skater... could be tough love... could be that she doesn't like Michal... could be that she has seen him skate SO MANY times that she recognizes a lot of his problems... could be anything.. but one thing for me that it doesn't mean is that she judges her own country's skaters less favorably ... not with a sample of ONE skater, in ONE event.

Maths are great. Love them... and I am sure that the OP is well aware of the sample size... so if OP is willing to keep going, I'd like to see this when most judges have judged several events and several skaters from their respective country... etc... though, like all of you, I opened my eyes very wide when I saw some countries being mostly in red...

Also, as someone mentioned earlier: it will be interesting to see how Japanese judges mark their skaters if they mess up ;) so far they have had an easy ride here on the GP, as many have done so well and all the other judges agreed...

Finally, to me, what matters is the power of the individual judges on the global result. IF a judge by their OWN scoring, and I say ONE Judge, has managed to change the end result, then I am not happy.

Again, thanks for all the work.
 

Shanshani

On the Ice
Joined
Mar 21, 2018
this is a lot of work. However, for now, it doesn't mean much, as the sample is too small...

I will give just one example... Cze judge, has -11.... judging ONCE ONE Cze skater... could be tough love... could be that she doesn't like Michal... could be that she has seen him skate SO MANY times that she recognizes a lot of his problems... could be anything.. but one thing for me that it doesn't mean is that she judges her own country's skaters less favorably ... not with a sample of ONE skater, in ONE event.

Maths are great. Love them... and I am sure that the OP is well aware of the sample size... so if OP is willing to keep going, I'd like to see this when most judges have judged several events and several skaters from their respective country... etc... though, like all of you, I opened my eyes very wide when I saw some countries being mostly in red...

Also, as someone mentioned earlier: it will be interesting to see how Japanese judges mark their skaters if they mess up ;) so far they have had an easy ride here on the GP, as many have done so well and all the other judges agreed...

Finally, to me, what matters is the power of the individual judges on the global result. IF a judge by their OWN scoring, and I say ONE Judge, has managed to change the end result, then I am not happy.

Again, thanks for all the work.

Yes, sample sizes for many judges are very small, which is why I said that I would give the benefit of the doubt to judges who have only scored their own skaters once or twice (otherwise we must conclude that a certain Czech judge really dislikes Czech skaters, or at least Michal Brezina :laugh:). I do plan on adding more data (including hopefully Senior B data over the next few months), so hopefully that will start to improve. However, even now there are a few judges, mostly from bigger feds, who do have decent data set sizes (for instance, Laurie Johnson USA--yikes, at least in Men and Pairs, and Elena Fomina RUS and Olga Kozhemyakina RUS--also yikes).
 

Ice Dance

Record Breaker
Joined
Jan 26, 2014
I can't say I have sorted out your system yet, though you've obviously done a lot of work so you have my applause. However, I did decide to take a look at the Skating Scores results for the GPs in dance.

Here's what struck me the most. During the FD at Skate Canada, the U.S. dance judge put Hubbell & Donohue 3rd while the Russian judge put them 1st. The Russian judge put Sinitsina & Katsalapov 2nd and the U.S. judge put Gilles & Poirier 1st and Sinitsina & Katsalapov 2nd. Otherwise, the U.S. and Russian judge agree about almost all the ordinals for all their top teams. Stepanova & Bukin, Hubbell & Donohue, Sinitsina & Katsalapov. It is the lower and less established teams where the judges diverge. And sometimes in very intriguing ways. The U.S. judge, for example, gives Hawayek & Baker lower marks in the RD in France than the Russian judge. Instead said U.S. judge places the Parsons higher. (Take a look at the Parsons' technical levels, and I'd say the same judge is actually right on target). It only happens in the RD, not the free. At Cup of Russia in the FD, the Russian judge places Evdokimova & Bazin above the two teams that finish ahead of them. Again, if you look at the levels, you'll see that E&B actually best the teams above them on those levels. (The Russian judges' placement in the RD is not as well validated).

What I'm saying here is that these judges aren't pushing the top teams as one might assume. It's the younger teams--teams that are less established and maybe don't have the reputation yet to bring the whole panel with them--that are getting more acknowledgment from their home country judges than from the rest of the panel. And in a number of these cases, these young teams--McNamara & Carpenter, the Parsons, Evdokimova & Bazin--got the difficulty calls ahead of the teams the majority of the panel placed above them. The panel in general maybe isn't seeing those technical strengths yet, while their home country judges have had more time to get to know these athletes' strengths. (The tech callers in these events were not from either the U.S. or Russia.)

Anyway, I wouldn't have been surprised at all to see the U.S. and Russian dance judges playing chess. (They're fully capable of doing it, as are the dance judges from any country.) But I see very little evidence of that happening when I look at the ordinals from this GP. At NHK, for example, the Russian and U.S. judges both agree that Zagorski & Guerreiro were stronger than Hawayek & Baker in the RD, and both agree that Hawayek & Baker were stronger in the FD. And that was with a possible GPF berth on the line and between two teams that split results last season. Still, both judges acknowledged when their own top team made a mistake. They didn't agree about the Parsons or Skoptcova & Aleshin. But they didn't try to push the teams with the most on the line past an obvious error.

Here is the Skating Scores link for anyone who wants to compare GP rankings from judges for this season:
http://skatingscores.com/
 

Samkrut

Medalist
Record Breaker
Joined
Mar 26, 2014
My final remark on the topic. I shall reiterate that in this particular area the most crucial and at the same time delicate thing is to go from numbers to conclusions.

Confusing correlation with causation is a common pitfall. For example, there might be a correlation between a person's risky behavior and the likelihood that she will spend a vacation in Las Vegas. But the reverse causation - if a person goes on vacation to Las Vegas, she is likely to take risks in life - does not hold. I, for example, have visited Las Vegas 7 or 8 times in my life. But I do not like risk. And the reason for me to go there was never gambling but concerts, shows, and fine dining.

Here it is the same: correlations and standard deviations cannot necessarily be explained by bias. I gave the definition of bias in the previous post.

There was a very valid remark that the significance of tech controllers can easily eclipse that of the judges. The tech controllers "killed" Sotskova in France. There were frame-by-frame comparisons of her vs. Kihira's combo landings during the SP, and based on those, at least for me, it was plainly obvious that Rika's were in no way better. But she got big GOE while Maria got a UR.

Finally, the only cases which matter are those when judge's bias affected a significant outcome: podium placements, number of country spots, etc. Those should be studied and discussed. The rest is just a pastime.
 

Miller

Final Flight
Joined
Dec 29, 2016
Earlier on in the thread I said that I would provide an analysis of the Pairs competition at the Olympics and how the result might have been affected by the scoring of the Chinese and German judges – remember that Savchenko and Massot won by 0.43 points, 235.90 to 235.47 and that the Chinese judge was banned afterwards.

This is an attempt to show how much each individual judge potentially affected the result both from a marking own skaters perspective, and a marking down one as well. (N.B. By chance it also addresses Samkrut's post just above that says things should only be discussed if judges' bias affected a significant outcome e.g. winners, podium placements etc).

At the end I will then discuss 2 further comparisons: Virtue and Moir vs Papadakis and Cizeron at the Olympics, and Mako Yamashita vs Elizaveta Tuktamysheva at Skate Canada. I'm sure you'll find the results interesting.

What I did for the Pairs at the Olympics was recalculate the figures for Sui/Han and Savchenko/Massot as if their own judges hadn't judged them at all. Hence in the SP, where there was both a Chinese judge and a German one, you would be left with 8 judges' scores, from which you would then remove the highest and lowest GOEs and PCS marks and recalculate the figures accordingly (factoring differently where appropriate). Then in the LP, where there was only the Chinese judge, you would recalculate for Sui and Han as if there were 8 judges, but for Savchenko and Massot there would still be 9.
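The recalculation described above is mechanical once the scores are transcribed. A minimal sketch (my own illustration, not Miller's actual spreadsheet; the ISU's exact rounding rules are glossed over):

```python
def trimmed_mean(scores):
    """Panel average in the ISU style: drop the single highest and
    single lowest score, then average the rest."""
    if len(scores) < 3:
        raise ValueError("need at least 3 scores to trim")
    s = sorted(scores)
    return sum(s[1:-1]) / (len(s) - 2)

def panel_average_without(scores_by_judge, excluded=()):
    """Recompute the trimmed mean as if the excluded judges had not
    scored at all. scores_by_judge maps judge id -> score."""
    kept = [v for judge, v in scores_by_judge.items() if judge not in excluded]
    return trimmed_mean(kept)
```

Each panel GOE average would then be converted to points (under the 2018-19 rules, one GOE step is worth roughly 10% of the element's base value) and the PCS averages multiplied by the segment factor.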

Then I did a separate calculation to estimate how much marking down contributed. Of course this would be very difficult to stop entirely, because the judges have to come from somewhere, and it would be virtually impossible to find a panel of experienced judges none of whose skaters were involved.

However, I believe that if judges weren't allowed to mark their own skaters this would be greatly reduced, as the judge would be sat there doing nothing while their skater skated, and would be reminded that this was only happening because the ISU was concerned about biased judging - this becomes a talking point further down, when the results of the Virtue and Moir example are discussed.

Anyway, what I did was recalculate the figures for Sui and Han and Savchenko and Massot as if the Chinese and German judges hadn't judged the competition at all. That way you can see the overall effect, and knocking off the own-skater marking gives you the marking-down effect - in this case it would be as if 7 judges judged the SP, the same as a Challenger event, and 8 judged the LP, i.e. halfway in between a Challenger and a GP.

These are the results (I can provide a fuller breakdown of individual GOEs and PCSs if anyone wants them).

Sui/Han (no Chinese judge marking)
SP – 81.82 (BV 32.80, GOE 11.39, PCS 37.63) vs 82.39 actual, i.e. the scoring of the Chinese judge resulted in Sui/Han scoring 0.57 marks more than they would have if that judge hadn't scored the competition.

LP – 152.32 (BV 61.50, GOE 14.29, PCS 76.53) vs 153.08 actual, i.e. +0.76, and overall 1.33 marks more as a result of own-skater judging.

Savchenko/Massot (no German judge judging) (SP only)

SP – 76.46 (BV 29.80, GOE 9.18, PCS 37.48) vs 76.59 actual i.e. +0.13

i.e. the effect of own skater judging was to increase Sui/Han’s score by 1.33 and Savchenko/Massot’s by 0.13. Hence if Sui/Han had won the Olympic Gold by 1.20 points or less you could definitely say it was down to own skater judging, and in this case National Bias – the judge was banned after all.

At this point I would say that the ISU almost certainly realised how close they came to another Olympic judging disaster - they may not have realised exactly how close, but they certainly knew, hence the ban.

Marking down.

Sui/Han – no German or Chinese Judge marking.
SP – 82.06 (BV 32.80, GOE 11.38, PCS 37.88) vs 82.39 actual, i.e. 0.33 less. As we've seen above, the Chinese judge resulted in Sui/Han scoring 0.57 more than they would have otherwise, so this means the German judge clawed back 0.24 points with their judging (if you look at the judges' tallies on SkatingScores you'll find that the German judge gave Sui/Han the lowest of the scores specifically for them).

LP – 152.32 vs 153.08 actual, exactly as above (no German judge to take account of).

Savchenko/Massot
SP – 76.62 (BV 29.80, GOE 9.38, PCS 37.44) vs 76.59 actual, i.e. the effect of the Chinese judge marking down was to lower S/M's SP by 0.16 marks, i.e. German judge +0.13, Chinese judge -0.16, overall effect -0.03.

LP – 160.39 (BV 61.90, GOE 20.59, PCS 77.60) vs 159.31 i.e. the effect of the Chinese judge was to lower S/M’s score by 1.08 marks.

Overall, if neither judge had judged the competition the final score would have been Savchenko/Massot 237.01 vs 235.90 actual with 0.13 of the difference due to marking up, and 1.24 due to marking down, and for Sui/Han 234.38 vs 235.47 actual with 1.33 marks of the difference due to marking up, and 0.24 marking down.

Overall, the Chinese judge was responsible for the gap between Sui/Han and Savchenko/Massot being 2.57 marks closer than it would otherwise have been, while the German judge (who only judged the SP) clawed back 0.37 marks of the difference: 0.13 from marking up and 0.24 from marking down. Without either judge the final gap would have been 2.63 marks vs the actual 0.43, with the net 2.20 difference caused by own-country marking and marking down - 2.57 from the Chinese judge, offset by 0.37 from the German judge.
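The decomposition in the last two paragraphs can be sanity-checked with a few lines of arithmetic, using only the numbers quoted in the post:

```python
# Actual gap: Savchenko/Massot 235.90 over Sui/Han 235.47
actual_gap = 235.90 - 235.47                 # 0.43

# Per-judge effects from the recalculations above
chinese_up   = 0.57 + 0.76   # marking up Sui/Han (SP + LP) = 1.33
chinese_down = 0.16 + 1.08   # marking down Savchenko/Massot = 1.24
german_up    = 0.13          # marking up Savchenko/Massot (SP only)
german_down  = 0.24          # marking down Sui/Han (SP only)

chinese_effect = chinese_up + chinese_down   # 2.57, narrows the gap
german_effect  = german_up + german_down     # 0.37, widens the gap
net = chinese_effect - german_effect         # 2.20

gap_without_judges = actual_gap + net        # 2.63, matching 237.01 - 234.38
```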

At this point you can argue that even if judges weren't allowed to mark their own countries' skaters, they wouldn't necessarily stop marking down entirely. But as you can see, it's possible for an individual judge in extremis to affect a result by as much as 2.57 points, so really any margin of victory of, say, less than 3 points should at least be given the once-over.

OK, onto the other examples.

I won't give the full details of Virtue/Moir vs Papadakis/Cizeron, but suffice it to say that their winning margin of 0.79 points (206.07 to 205.28) was reduced to 0.04 points if you removed the Canadian judge (who judged both segments) and the French judge (who only judged the SD) from marking their own skaters, and that if you removed them from the equation entirely, Papadakis and Cizeron would have won, 206.32 to 205.86.

However, at this point you have to consider what would happen with marking down if judges couldn't mark their own countries' skaters. There's also the business of Gabriella's costume malfunction, and P/C's mistake on the twizzles in the SD that was seemingly ignored by the judges, possibly because of the costume malfunction. Hence, make your own mind up on this one. Any 'marking down' is relative in this case, too - the Canadian judge gave all 14 elements performed by P/C +2 or +3, so it's not as if they were really 'marking them down' - though I would note that in all 3 cases of the Canadian judge judging P/C and the French judge judging V/M, they gave them the lowest scores specifically for those skaters, i.e. 9th out of the 9 judges.

Finally, onto Mako Yamashita vs Elizaveta Tuktamysheva. In this case you can definitively say the result was affected by own-country marking; the only question is whether it was national bias, or simply fair judging where the result happened to flip because it was so close in the first place (Elizaveta won by 0.26 points, 203.32 to 203.06). IMO it was national bias (unless someone can prove otherwise), and it all boils down to PCS (our good friend PCS...).

These are the figures as if the Japanese and Russian judges hadn't marked their own skaters.

Elizaveta Tuktamysheva
SP 73.71 (BV 35.39, GOE 6.79, PCS 31.53) vs 74.22 actual (BV 35.39, GOE 7.00, PCS 31.83)

LP 128.57 (BV 60.35, GOE 4.82, PCS 64.40 – 1.00 Deduction) vs 129.10 actual (BV 60.35, GOE 4.91, PCS 64.84 – 1.00 Deduction).

Overall Total 202.28 vs 203.32 actual.

Mako Yamashita
SP 66.71 (BV 30.28, GOE 5.23, PCS 31.20) vs 66.30 actual (BV 30.28, GOE 5.13, PCS 30.89)

LP 137.31 (BV61.12, GOE 10.59, PCS 65.60) vs 136.76 actual (BV 61.12, GOE 10.55, PCS 65.09).

Overall Total 204.02 vs 203.06 actual, with Mako the winner, 204.02 to 202.28, if judges aren't allowed to judge their own countries' skaters.

Re marking down, i.e. stripping the Russian and Japanese judges out of the equation entirely: I haven't recalculated the scores, but suffice it to say that if you look at Shanshani's details for Skate Canada in the original post, you'll find that the Japanese judge absolutely loved Elizaveta and marked Mako down considerably (their prerogative, of course). Hence if you recalculated the figures as if neither the Russian nor the Japanese judge had marked the contest at all, Mako's winning margin would have increased.

BTW, there was no evidence of the Russian judge marking Mako down: yes, her scores were a bit below the average, but nothing out of the ordinary, and indeed her PCS scores increased from the SP to the LP, so you can say there was definitely nothing going on there.

It all really boils down to the PCS. If you look at the figures above, you'll see that Liza's PCS goes down by 0.74 marks if you strip the Russian judge out of the equation, and if you look at the protocols (linked below), you'll see that the Russian judge's PCS marks stick out like a sore thumb: every single component score is either the highest or joint-highest, and as above, the 0.74 difference more than makes up for the initial 0.26.

N.B. The Russian judge’s figures for GOEs were OK, higher than the average certainly, but also nothing out of the ordinary.

Hence, overall: if someone can explain away Liza's PCS scores (or even come up with alternative PCS scores that are justifiable - I could always re-mark), then you could say the result came from fair, non-biased judging. If not, then you would have to say the result was a result of national bias. Plus, of course, you can definitely say that Mako would have won no matter what if judges weren't allowed to judge their own countries' skaters.

Skate Canada protocols -
SP (the Russian Judge is judge 3 in both the SP and LP) - http://www.isuresults.com/results/season1819/gpcan2018/gpcan2018_Ladies_SP_Scores.pdf

LP - http://www.isuresults.com/results/season1819/gpcan2018/gpcan2018_Ladies_FS_Scores.pdf
 
Joined
Jun 21, 2003
But tech calls (levels, downgrades, underrotations, !, e, etc.) don't lend themselves to this kind of statistical analysis. Classifying calls as "fake" or "questionable" or "ignoring mistakes" is inherently subjective, for the same reasons that those calls themselves are necessarily subjective.

I think it is also worth remarking that even for judges' scores, statistical analyses accept the underlying assumption that the majority of judges is "right" and the odd-ball judge is "wrong," either because of national bias or incompetence. So we have to be careful before we frame our conclusions in accusatory language.

That said, all this information is very cool. The ISU did us a big favor by eliminating anonymous judging.
 

balabam

🥕🥕🌵🌵😈😀
Record Breaker
Joined
Nov 23, 2015
Country
Slovakia
Earlier on in the thread I said that I would provide an analysis of the Pairs competition at the Olympics and how the result might have been affected by the scoring of the Chinese and German judges – remember that Savchenko and Massot won by 0.43 points, 235.90 to 235.47 and that the Chinese judge was banned afterwards.
...

This is really great and powerful analysis, mr. Miller. :agree: More of them, please :pray:
 

Miller

Final Flight
Joined
Dec 29, 2016
This is really great and powerful analysis, mr. Miller. :agree: More of them, please :pray:

Thank you, I'll do my best!

Seriously, I'm wondering if someone can help me. So far I've been doing the protocols by hand (didn't realise it would become a cottage industry...).

It doesn't take too long to do one, maybe about 20 minutes, but with an SP and an LP for each of 2 skaters, each gone through twice (once without own-skater marking and once with 2 judges eliminated entirely), the time soon adds up.

Hence I could do with some way of automating the process. Presumably there's already software out there where you can plug in a PDF protocol and get the figures you currently get, i.e. with 9 judges judging.

What I could really do with is a means of eliminating a judge or two (or selecting 8 or 7 of them) and letting the software recalculate on the same basis as now, i.e. dropping the highest and lowest GOEs etc. Such software would greatly speed up the process, and also help anybody else who wanted to see what the results would be with a judge or two removed, enabling far more competitions to be looked at.

Is anyone in a position to help? (I'm not at all tech savvy when it comes to spreadsheets)
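In the absence of ready-made software, here is one shape such a tool could take - a hypothetical sketch, assuming the per-element scores have already been typed in from the protocol PDFs (extracting them from the PDFs automatically is a separate problem), and glossing over the ISU's exact rounding rules:

```python
def segment_score(elements, pcs, pcs_factor, excluded=(), deduction=0.0):
    """Recompute a segment score with some judges removed.

    elements: list of (base_value, {judge: goe}) pairs
    pcs:      list of {judge: component_score} dicts, one per component
    Panel averages drop the single highest and lowest remaining score.
    """
    def trimmed(scores_by_judge):
        kept = sorted(v for j, v in scores_by_judge.items() if j not in excluded)
        if len(kept) < 3:
            raise ValueError("need at least 3 remaining judges")
        return sum(kept[1:-1]) / (len(kept) - 2)

    # TES: base value plus GOE points (one GOE step ~ 10% of base value)
    tes = sum(bv + round(trimmed(goes) * 0.10 * bv, 2) for bv, goes in elements)
    # PCS: trimmed component averages times the segment factor
    components = sum(round(trimmed(c), 2) for c in pcs) * pcs_factor
    return round(tes + components - deduction, 2)
```

Running it once per judge (or pair of judges) to exclude would reproduce the kind of leave-one-out comparison done by hand above.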
 

Shanshani

On the Ice
Joined
Mar 21, 2018
I'll get to replying to some more comments when I have more time/feel more recovered from seeing the GPF live, but in the meanwhile here's the spreadsheet for the final.
 

Miller

Final Flight
Joined
Dec 29, 2016
Further to the above, it looks as if the result of the Junior Ice Dance at the JGPF - specifically the final podium positions - could have been affected by judges being able to mark their own countries' skaters.

I haven't gone through the protocols, but as a proxy: the Russian judge scored it Khudaiberdieva/Nazarov 170.99 (vs 164.54 actual) over the Canadian couple Lajoie/Lagha 162.11 (vs 164.51 actual), a differential of 8.88 points compared with 0.03 actual.

In contrast the Canadian judge scored it 167.02 Lajoie/Lagha vs 161.40 Khudaiberdieva/Nazarov, a differential of 5.62 points, so not as bad but still the classic pattern of higher marks for own skaters, lower ones for direct competitors.

If one were to strip the Russian and Canadian judges out of the equation, either entirely or by not letting them mark their own skaters, then it's quite likely the result would have been different, bearing in mind the actual 0.03 difference. Of course, at such a margin it's always going to be a coin flip, but I would say this is further evidence for judges not being able to mark their own countries' skaters, even if it's only to remove the element of uncertainty.

In addition, I've also had a look at 2 more examples from GPs earlier in the year, and certainly in one of them, Stanislava Konstantinova vs Kaori Sakamoto for 2nd/3rd place at GP Helsinki, the result would have been different if judges were not allowed to mark their own countries' skaters.

In this example (Stanislava got the silver by 0.15 marks), the Russian judge scored it Konstantinova 205.09 vs 193.58 Sakamoto, while the Japanese judge scored it 197.62 Sakamoto vs 196.45 Konstantinova – the actual figures were 197.57 Konstantinova vs 197.42 Sakamoto. Hence it's virtually certain the result would have been different if judges weren't allowed to judge their own countries' skaters. EDIT - I've recalculated the figures, and the result would have been Kaori 197.30 vs 196.59 Stanislava if judges weren't allowed to mark their own countries' skaters, so the result would definitely have been different.

Finally for completeness, I also looked at Della Monica/Guarise vs Pavliuchenko/Khodykin, at Helsinki also. In this case it’s unclear if the result would have changed. The Italian judge had it Della Monica/Guarise 192.60 vs 188.32 Pavliuchenko/Khodykin, while the Russian judge had it 192.40 Pavliuchenko/Khodykin vs 188.85 Della Monica/Guarise, whereas the actual scores were 185.77 Della Monica/Guarise vs 185.61 Pavliuchenko/Khodykin.

Overall, the Italian judge had a slightly bigger differential (I got it be 0.41 marks once you took account of the actual 0.16 difference) compared with Russian judge. Hence it really would be in the fine detail if the result were to change.

Overall, though, it's not great that you've already had 3 podiums this year potentially affected by judges being able to mark their own countries' skaters. IMO it's time for a change, even if it's only to remove the element of uncertainty and present a better picture of governance to the outside world.
 

Shanshani

On the Ice
Joined
Mar 21, 2018
I think it is also worth remarking that even for judges' scores, statistical analyses accept the underlying assumption that the majority of judges is "right" and the odd-ball judge is "wrong," either because of national bias or incompetence. So we have to be careful before we frame our conclusions in accusatory language.

That said, all this information is very cool. The ISU did us a big favor by eliminating anonymous judging.

Yes, I agree in certain cases the oddball judge might be "right" and the rest of the panel "wrong", and there's no way to correct for that without substituting in my own judgment. I think that's important to keep in mind when looking at judges from small feds or who have scored few skaters from their own country. For instance, if small fed skaters are often underscored relative to how they "should" be scored, then seeing national bias from a small fed judge may not be a big problem, since it would actually be leveling the playing field somewhat.

However, when it comes to judges from big feds who have scored many skaters, I think that this is less of a concern. Unless you want to argue, for instance, that the majority of Russian skaters are underscored by the rest of the panel consistently across many competitions, seeing someone like Olga Kozhemyakina (prolific Russian judge) consistently put up bias numbers in the high single/low double digits should be a red flag. In such cases, what's more likely--the rest of the panel is underscoring Russians or Olga Kozhemyakina is biased?

Confusing correlation with causation is a common pitfall. For example, there might be a correlation between a person's risky behavior and the likelihood that she will spend a vacation in Las Vegas. But the reverse causation - if a person goes on vacation to Las Vegas, she is likely to take risks in life - does not hold. I, for example, have visited Las Vegas 7 or 8 times in my life. But I do not like risk. And the reason for me to go there was never gambling but concerts, shows, and fine dining.

I'm aware of the distinction between correlation and causation, thanks. And yes, it is possible that the bias numbers are explained through some other cause besides the judge being biased. I think this is, again, more of a concern when it comes to small fed judges who are only scoring one or two of their own skaters, and less of a concern when it comes to big fed judges who score many skaters. If a small fed judge is only scoring one skater, then it's possible that they overscore that skater because they like that skater's style/program/traits rather than because of national bias. But again, if a judge is scoring many different skaters across many different competitions (who will inevitably have many different program styles/personal qualities), this becomes less likely, since even skaters from the same federation are often quite different from each other.

Correlation may not necessarily be causation, but that doesn't mean it can't be causation either. Can you come up with a competing explanation for why, for instance, Olga Kozhemyakina (I pick on this judge a lot, but she's a good example because she judges a lot of competitions), consistently overscores Russian skaters and underscores non-Russian skaters other than nationalistic bias? Just on the Grand Prix, she has scored Sergei Voronov, Dmitri Aliev, Zabiiako/Enbert, Evgenia Medvedeva, Maria Sotskova, Stanislava Konstantinova, Boikova/Kozlovskii, Tarasova/Morozov, and Pavliuchenko/Khodykin, all of them to varying degrees above other judges, while she averages 4 points below other judges when it comes to how she scores other skaters. What do these skaters have in common that could explain her overscoring, other than them skating for the Russian federation?
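The comparison described here - how much a judge deviates from the rest of the panel for own-country skaters versus everyone else - can be written down in a few lines. This is my own sketch of the general shape, not Shanshani's actual formula:

```python
def bias_summary(records):
    """Average deviation of one judge from the rest of the panel,
    split by whether the skater shares the judge's federation.

    records: list of (skater_country, judge_country, judge_total,
                      panel_mean_without_this_judge) tuples.
    Returns (own-country average deviation, other-skater average deviation).
    """
    own, other = [], []
    for skater_c, judge_c, judge_total, panel_mean in records:
        deviation = judge_total - panel_mean
        (own if skater_c == judge_c else other).append(deviation)
    avg = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return avg(own), avg(other)
```

A judge who sits consistently above the panel for compatriots and below it for everyone else - the pattern described above - would show up as a large positive first number and a negative second one.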

Finally, the only cases which matter are those when judge's bias affected a significant outcome: podium placements, number of country spots, etc. Those should be studied and discussed. The rest is just a pastime.

Miller in this thread appears to be providing just the analysis you're asking for. Thanks to him/her for his/her efforts! However, I still disagree with this sentiment. Even if a judge's bias didn't affect the results of a particular competition, it could very well affect the results of future competitions. Therefore, it's important that we discuss and keep account of judges' biases. Maybe, say, US judge Laurie Johnson's bias does not affect the competition today, but tomorrow her bias takes a Russian skater off the podium. If no one is keeping track, then judges have no incentive to behave in an unbiased way.
 

moonvine

All Hail Queen Gracie
Record Breaker
Joined
Mar 14, 2007
Country
United-States
The math eludes me as well, but all I have to say is that bias is inherent to judged sports. Judges are human and humans are prone to bias. Someone like Hanyu or Simone Biles is going to be given the benefit of the doubt, while someone like Keegan Messing or Mariah Bell isn't. Also, once you get a reputation for underrotation, all of your jumps are going to be scrutinized, whereas someone who has a reputation for landing clean jumps may not be. Judges are human. It makes it kind of a relief to watch track or swimming or something that's timed sometimes (at least for me).
 

Shanshani

On the Ice
Joined
Mar 21, 2018
The math eludes me as well, but all I have to say is that bias is inherent to judged sports. Judges are human and humans are prone to bias. Someone like Hanyu or Simone Biles is going to be given the benefit of the doubt, while someone like Keegan Messing or Mariah Bell isn't. Also, once you get a reputation for underrotation, all of your jumps are going to be scrutinized, whereas someone who has a reputation for landing clean jumps may not be. Judges are human. It makes it kind of a relief to watch track or swimming or something that's timed sometimes (at least for me).

Sure, which is why I set the cutoff for the red flag at 6.5 points instead of, say, 2 points. But the fact that there are plenty of judges who don't exhibit much nationalistic bias in their marking should make it clear that it's hardly inevitable. (Plus, we are talking about nationalistic judging here, not reputational judging, which I honestly have no idea how to assess objectively, unfortunately.)

I’d be an oddball judge for certain!!

But the question is whether you'd consistently be an oddball judge in the positive direction with skaters of your own federation. Those are the only oddball judges that get flagged ;)
 

Miller

Final Flight
Joined
Dec 29, 2016
Even if a judge's bias didn't affect the results of a particular competition, it could very well affect the results of future competitions. Therefore, it's important that we discuss and keep account of judges' biases.

To this I would add the risk to figure skating posed by outside parties.

For example at the Olympics there was a Buzzfeed News article where the researcher involved comprehensively took apart figure skating judging and specifically the national bias side.

Luckily the ISU was able to ignore it because it was on an online site that most people wouldn't have heard of, plus there was so much else going on. However, allied to this and the Chinese judging case, and the stuff that's out there on sites like SkatingScores, the ISU is taking a hell of a risk of being exposed sometime in the future.

However at the end of the day it’s their choice, they must know the risks involved. I just hope figure skating doesn’t suffer if and when you get the next article or huge judging scandal – that’s why I was so keen to highlight how close the ISU came to disaster at the Olympics with the Chinese judging case.

On lighter matters though, I'd like to give a huge shout-out to Shizuko Ugaki, the Japanese judge in the Mako Yamashita/Elizaveta Tuktamysheva case. They caused me to have to go through all sorts of contortions to prove the Russian judge alone was enough to have changed the result: they scored the contest Mako 196.82 vs 211.47 Elizaveta, compared with the actual 203.06 to 203.32. I cannot say how rare this is.

At the Olympics/following World Championships I looked at 221 cases of own country judging, and there was nothing anywhere near this in terms of marking your own skater lower, and a direct rival a lot higher, so a huge shout-out to Shizuko for judging as they did.

Similarly, Doug Williams of the USA at the GPF: he had Nathan Chen only 0.80 points ahead of Shoma Uno, despite the actual result being a win for Nathan by 7.32 points, and he even had Shoma ahead of Nathan in the Free Skate, so a huge shout-out to Doug as well. Perhaps there is hope after all.

Overall though I have to say things aren't great, and as we get to the business end of the season normal service seems to be resuming more and more. Early in the season there seemed to be marking up, but not so much in the way of marking down; as things get more important, you're starting to get more and more of the latter as well. It would be really interesting to see what the judges database would look like if you only considered the marking of skaters who finished in, say, the top 6. That would really give some indication of how much bias is out there, and if there were a difference between skaters in the top 6 and those outside, you'd have something close to proof positive that something was going on.
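The top-6 comparison suggested above is easy to express computationally. A minimal sketch, with invented placements and per-performance deviations rather than real protocol data: filter each judge's deviations by the skater's final placement and compare the two groups.

```python
# Hypothetical rows: (skater, final_placement, judge_deviation_from_panel).
# Numbers are invented purely to illustrate the filtering step.
results = [
    ("A", 1, 3.0), ("B", 2, 2.5), ("C", 5, 1.0),
    ("D", 7, 0.5), ("E", 9, -0.5), ("F", 12, 0.2),
]

def mean_dev(rows, keep):
    """Mean deviation from the panel over placements selected by `keep`."""
    devs = [d for _, place, d in rows if keep(place)]
    return sum(devs) / len(devs)

top6 = mean_dev(results, lambda p: p <= 6)  # deviations among podium contenders
rest = mean_dev(results, lambda p: p > 6)   # deviations among the rest
print(top6 - rest)
```

A persistent positive gap between the two groups, across many judges and events, would be evidence that biased marking concentrates on the skaters for whom placements actually matter.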

In fact I have to wonder what any non-biased judges out there make of it all. They’re there at the competitions, plus must know what’s going on, look at protocols etc. etc. (anything unusual would strike them straightaway), plus there’s even a round table post-competition debrief AFAIK where they could say something.

However, does anyone say anything, or are they just scared to do so, or ignored? Alternatively, do they just accept it as part of figure skating judging? As we've seen, it's really difficult to prove bias in one event, whereas over multiple events/skaters it's much easier (but that's where the ISU has to be on the ball). Or do they really think the scores are within the margin of error? Or is it simply that most of them don't have a leg to stand on anyway when it comes to calling out other judges, and all you're left with is a competition to see who can get away with the most before the ISU finally steps in (as with the Chinese judges)?
 

Lester

Piper and Paul are made of magic dust and unicorns
Final Flight
Joined
Dec 7, 2014
Conclusion: Israel hates Piper and Paul :D
 