
Judging math details

Joined
Jun 21, 2003
Because 'crummy choreography' is a subjective statement. 'Great choreography' is a subjective statement. Hitting Composition's bullet points is (or can be) an objective statement, since Composition is defined through those bullet points.

I think that the vision of the IJS is to quantify those aspects of skating that can be objectively quantified and to provide guidance for judging those aspects that require subjective assessment.

There is a subjective element to the question, "Did the skater hit the bullet point for Purpose (idea, concept, vision, mood), or the bullet point for Phrase and Form (movement and parts of the program to match the musical phrasing)?" Furthermore, it is not a yes or no question. One skater will do a better or worse job than another of "matching the musical phrasing," in the view of a particular judge.

How do we know that this is true, that judges are making subjective evaluations? We know this because one judge gives 8.0 for composition while for the exact same performance of the exact same program, another equally well-qualified judge gives 8.75.
 

Baron Vladimir

Record Breaker
Joined
Dec 18, 2014
I think that the vision of the IJS is to quantify those aspects of skating that can be objectively quantified and to provide guidance for judging those aspects that require subjective assessment.

There is a subjective element to the question, "Did the skater hit the bullet point for Purpose (idea, concept, vision, mood), or the bullet point for Phrase and Form (movement and parts of the program to match the musical phrasing)?" Furthermore, it is not a yes or no question. One skater will do a better or worse job than another of "matching the musical phrasing," in the view of a particular judge.

How do we know that this is true, that judges are making subjective evaluations? We know this because one judge gives 8.0 for composition while for the exact same performance of the exact same program, another equally well-qualified judge gives 8.75.

I agree that some decisions are not purely objective, but it is still more objective than saying that someone deserves 8.25 because they have greater choreography... Different judges may give different scores because:
- they couldn't see every detail of a programme in 3 minutes' time
- it's easier to notice things that are closer to your own private world
- it's easier to like something you have already seen many times
- you hear the audience reaction, and you may hear/read information about an athlete's personal life
- you want your nation to do better, and because of that you really think it is better
- you have a bad day
I want to say that the things which can lead to mistakes and influence a judge's decision may not be conscious, and may be the product of an unclear state of mind. I think it's not fair to constantly accuse judges of doing things badly or with some ulterior purpose, when most of those things are proven to be normal human mistakes in the cognitive process of judging.
E: and I think athletes are aware of that, so...
And to add: because of the possibility of too much subjectivity and possible mistakes, we have 9 judges. And all are judging by the same (objective/written) rules, which will make their types of mistakes similar. So individual mistakes will likely be annulled...
 
Joined
Jun 21, 2003
I was wondering whether it is possible to say, from a statistical point of view, that a skater has actually won or had, say, a better component score than another skater. For example, when I was at Uni I did a stats course where they talked about 95% probability, and I took this as meaning the point at which a statistician would say this was 'a certainty'...

First, statisticians are never certain of anything. They only talk about probabilities. So already there is a problem, because what do we mean by a "probability"? In the context of figure skating scores, to say that "the probability is 95% that skater A really gave a better performance than skater B," we mean that if the two skaters duplicated that exact performance 100 times before different panels of judges chosen at random, then skater A would come out the winner 95 times out of 100.

OK, so we are already on shaky ground in the real world. But I suppose we could imagine 100 different panels of judges watching a video of the program. In any case, we don't have to actually do it, we just have to make sure we understand what we mean by "probability."

E.g. if a skater were to score a component score of 8.27 on a trimmed 7 out of 9 basis, at what point would a statistician say that it was 95% certain that the score was higher than that of another skater...

It depends on the variability of the individual numbers that averaged out to 8.27. The greater the variance the less reliable the conclusion. I will answer a slightly different question where the principles are less murky than the one you asked (for example, the trimming is an extra complication).

Suppose nine judges give these scores for Composition:

8.0, 8.25, 8.25, 8.5, 8.5, 8.5, 8.75, 8.75, 9.0

The average is 8.5 and that is the number that you see in the right-hand column of the protocol. Now, are these 9 numbers all spread out or all clustered together? That is what is measured by the standard deviation. In this case we have 3 numbers that are right on the money (8.5), we have 4 that are off by .25, and we have 2 that are off by .5. So, on the average, I will say that the typical number is off by no more than s = 0.3 points.

To make a "95% confidence interval for the true mean" we have the formula

+/- 2s/sqrt(n)

This is about 0.2 in this example. Adding and subtracting this from our sample mean of 8.5, we get the interval from 8.3 to 8.7. If we did the experiment over and over (with the exact same skate every time, but with different judges) we would expect that the skater's Composition score would land in this interval 95% of the time. Statisticians, always hedging their bets, will say, "I'm not completely sure, but I am 95% sure that the true and honest-to-God value of this program is somewhere in the interval 8.5 +/- 0.2."
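
To make the arithmetic concrete, here is a minimal Python sketch of that calculation. The nine scores are the hypothetical ones above, and the rough 2s/sqrt(n) rule is used in place of an exact t-interval:

```python
from statistics import mean, stdev
from math import sqrt

# The nine hypothetical Composition scores from the example above.
scores = [8.0, 8.25, 8.25, 8.5, 8.5, 8.5, 8.75, 8.75, 9.0]

n = len(scores)
m = mean(scores)    # 8.5, the panel average shown on the protocol
s = stdev(scores)   # sample standard deviation, about 0.31

half_width = 2 * s / sqrt(n)   # the rough 2s/sqrt(n) rule, about 0.2

print(f"mean = {m:.2f}, s = {s:.2f}")
print(f"rough 95% interval: {m - half_width:.2f} to {m + half_width:.2f}")
```

Running it gives 8.30 to 8.70, the same interval as above.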

So that gives us a standard way to approach the problem of whether we beat the other skaters by our actual performance, or whether we won only because we faced this particular panel of judges, rather than a different one randomly chosen.

Also, is it possible to do so on a final score? Say a lady skater wins a competition with 200 points, with 19 GOE elements and 10 PCS scores in there. At what point would a statistician say it was 95% probable she'd actually won - a 2 point advantage, 3?

It's very complicated to try to do it right. We have already made a bunch of assumptions that are hard to verify. We are basically trying to predict the behavior of (imaginary) human judges, whose marks may not fit into a nice statistical model, even if they are following the rules as best they can. Just for the single component Composition all we could say is, we're pretty sure that the actual performance deserved, after averaging, somewhere between 8.3 and 8.7. We could narrow this interval by increasing the size of the judging panel (but not by enough to make it worthwhile to do so), and by the trimming process.
 

Miller

Final Flight
Joined
Dec 29, 2016
First, statisticians are never certain of anything. They only talk about probabilities. So already there is a problem, because what do we mean by a "probability"? In the context of figure skating scores, to say that "the probability is 95% that skater A really gave a better performance than skater B," we mean that if the two skaters duplicated that exact performance 100 times before different panels of judges chosen at random, then skater A would come out the winner 95 times out of 100.

OK, so we are already on shaky ground in the real world. But I suppose we could imagine 100 different panels of judges watching a video of the program. In any case, we don't have to actually do it, we just have to make sure we understand what we mean by "probability."



It depends on the variability of the individual numbers that averaged out to 8.27. The greater the variance the less reliable the conclusion. I will answer a slightly different question where the principles are less murky than the one you asked (for example, the trimming is an extra complication).

Suppose nine judges give these scores for Composition:

8.0, 8.25, 8.25, 8.5, 8.5, 8.5, 8.75, 8.75, 9.0

The average is 8.5 and that is the number that you see in the right-hand column of the protocol. Now, are these 9 numbers all spread out or all clustered together? That is what is measured by the standard deviation. In this case we have 3 numbers that are right on the money (8.5), we have 4 that are off by .25, and we have 2 that are off by .5. So, on the average, I will say that the typical number is off by no more than s = 0.3 points.

To make a "95% confidence interval for the true mean" we have the formula

+/- 2s/sqrt(n)

This is about 0.2 in this example. Adding and subtracting this from our sample mean of 8.5, we get the interval from 8.3 to 8.7. If we did the experiment over and over (with the exact same skate every time, but with different judges) we would expect that the skater's Composition score would land in this interval 95% of the time. Statisticians, always hedging their bets, will say, "I'm not completely sure, but I am 95% sure that the true and honest-to-God value of this program is somewhere in the interval 8.5 +/- 0.2."

So that gives us a standard way to approach the problem of whether we beat the other skaters by our actual performance, or whether we won only because we faced this particular panel of judges, rather than a different one randomly chosen.



It's very complicated to try to do it right. We have already made a bunch of assumptions that are hard to verify. We are basically trying to predict the behavior of (imaginary) human judges, whose marks may not fit into a nice statistical model, even if they are following the rules as best they can. Just for the single component Composition all we could say is, we're pretty sure that the actual performance deserved, after averaging, somewhere between 8.3 and 8.7. We could narrow this interval by increasing the size of the judging panel (but not by enough to make it worthwhile to do so), and by the trimming process.

Thank you for taking the time and trouble to respond. Without wishing to make a big deal of it or take it any further, would this mean that if 2 skaters gave the exact same performance before the same set of judges, you could say with 95% probability that each scored between 8.3 and 8.7, clustered on the mean (I still remember the standard deviation graph)? However over 29 different GOE or PCS elements any differences would tend to average down to zero, but obviously it wouldn't be exactly zero (I wasn't really expecting an exact answer to the 200 point question, but you never know).

Incidentally, I just had a look at Nathan Chen's PCS from today's GPF (the only one I looked at). The spread of his individual components is uncannily like the ones you gave - 2 had a spread of 1.00, 1 was 1.25, and 2 were 0.75, i.e. an average spread of 0.95, plus the clustering was very, very similar also, in fact slightly more clustered. As each of his components was virtually the same average of 9.00 (Skating Skills exactly), I guess you could say with 95% probability (or is it less than that because it's based on just the 1 sample?) that his Skating Skills were between 8.80 and 9.20, ditto the others, but over the 5 the difference in average PCS would be less than the + or - 0.2, plus there's still all the other GOEs and PCS in the LP to take account of.
 
Joined
Jun 21, 2003
However over 29 different GOE or PCS elements any differences would tend to average down to zero, but obviously it wouldn't be exactly zero...

Not exactly. Although it is possible that one number is too high and the next too low, it is also possible that both (of two) numbers are too high and the errors compound instead of cancel. This is actually an interesting question in statistics, related to the "random walk" problem. A drunken person starts out at a light pole and staggers around, each time taking one step in a random direction, sometimes tending to get farther from the pole, sometimes coming back closer. How far away will he be, on the average, after n steps? Answer: SQRT(n). This is essentially how to combine the standard errors from each of the 5 components (under some simplifying assumptions).

If each component SS, TR, etc., has a standard error of .10 and there are 5 of them, then a good estimate for the standard error of the total PCS is about .10 x SQRT(5). This is about .224. This means, for a men's short program, a 95% confidence interval would be about +/- .45. About Nathan Chen's SP score for total PCS we can say, "If this exact performance were scored over and over by many, many, many panels of 9 judges selected at random, 95% of the time we would expect that Nathan would get somewhere in the range 8.55 to 9.45." (In terms of actual points, 42.75 points to 47.25 -- almost a five-point range, and twice that for the LP because of factoring.)
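
A small sketch of that combination, assuming (as above) a standard error of 0.10 per component; the quadrature step is the whole point, and the comment at the end just restates the point range from the paragraph above:

```python
from math import sqrt

# Assumed per-component standard error of the panel average, as in the post above.
se_component = 0.10
n_components = 5   # SS, TR, PE, CO, IN

# Random-walk rule: independent errors combine in quadrature, i.e. sqrt(n) times se.
se_combined = se_component * sqrt(n_components)   # about 0.224
half_width = 2 * se_combined                      # rough 95% half-width, about 0.45

print(f"combined standard error = {se_combined:.3f}")
print(f"rough 95% half-width    = {half_width:.2f}")
# The post applies this half-width around a component average of 9.0,
# giving roughly 8.55 to 9.45, or 42.75 to 47.25 in actual SP points.
```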

I would be a little bit nervous about throwing the GOEs into the same basket. But real statisticians (I am not one) are willing to tackle anything. ;)
 

gkelly

Record Breaker
Joined
Jul 26, 2003
Honestly, I don't think she learned anything at all from these marks. Transitions are always the lowest, apparently by some kind of hidden judges' convention.

After watching Alexandra Trusova's SP at the JGP Final, I thought one could make a case for Transitions to be her highest component.

It wasn't (although one judge had it tied for highest with Composition). But it was slightly higher than Skating Skills, which was the lowest mark here. I.e., more judges went up than down here, and a couple had the two components tied.

So some judges will break the convention in really obvious cases.
 

Miller

Final Flight
Joined
Dec 29, 2016
After watching Alexandra Trusova's SP at the JGP Final, I thought one could make a case for Transitions to be her highest component.

It wasn't (although one judge had it tied for highest with Composition). But it was slightly higher than Skating Skills, which was the lowest mark here. I.e., more judges went up than down here, and a couple had the two components tied.

So some judges will break the convention in really obvious cases.

Of the 40 couples/skaters at CoC, Javier Fernandez and Chock and Bates had TRs a bit higher than SS, so this also ties in with what you're saying. Sui/Han and Papadakis/Cizeron were also tied.
 

Miller

Final Flight
Joined
Dec 29, 2016
Thanks again. The spread in Nathan's potential PCS is really quite surprising.

Re GOEs/overall score, I wonder if it's possible to come up with some sort of standard error (deviation?), that includes GOEs as well. There are 29 elements across both programs, so SQRT would be about 5.4; plus GOEs on average are 0.6 per GOE (half the elements triples, choreo seq, maybe a step seq; the other half spins, 2As, maybe the other step seq), and the potential plus/minus range for each would be +/- 1.8. Get an average error, plus one for PCS SP and LP, double the above, and multiply by 5.4; that would hopefully give you the 95% range for the full competition.

GOEs typically tend to be within 2 GOEs of each other, sometimes 3, i.e. 1.2/1.8 marks, except where they are all the same, e.g. a fall or a really well executed item; that would typically be about the same points-equivalent range as 1 point of PCS (being very rough here). It's not impossible that the final answer might be something like confidence interval = 4.50 (difference between 42.75 and 47.25) * 5.4 / 2.25 = 10.8, i.e. +/- 5.4.

Of course at this point I've just realised Nathan is not a lady! However if it were a lady scoring 200 points with PCS of 9, then the confidence interval would be 4.50 * 0.8, i.e. the above +/- 5.4 would become 4.32, giving a 'rough' range of 195.7 to 204.3 for 95% probability. Off to watch the Ladies short at the GPF!
 
Joined
Jun 21, 2003
Thanks again. The spread in Nathan's potential PCS is really quite surprising.

Re GOEs/overall score, I wonder if it's possible to come up with some sort of standard error (deviation?), that includes GOEs as well...

I think we have to be cautious. Statisticians deal with numbers -- it doesn't matter how these numbers are generated or what they represent. In this sense, sure, give me any collection of numbers whatsoever and we can start calculating. But two main problems are jumping out at me, before we conclude that Nathan's total SP score is 103 plus or minus 50 points! ;)

First, for GOEs we have to take seriously the point that Baron Vladimir is stressing. For GOEs, judges do not just give out scores representing their general impression of whether a jump was average, good, excellent, in the 75th percentile of all jumps the judge has ever seen, better or worse than someone else's jump, etc. Rather, the judge is supposed to go down the list of bullet points and check off, "good height and distance" (yes or no), "good flow out of the landing" (yes or no), "unusual or interesting entrance" (yes or no). Similarly, for negative GOEs, "two-footed landing" (yes or no), "excessive telegraphing" (yes or no). At the end, if the skater checked enough positive boxes and no negative boxes, then that skater deserves a +2.
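
To show the checklist flavour of that, here is a toy sketch. The bullet names are the ones quoted above, but the thresholds mapping checked bullets to a GOE are invented for illustration; they are not the ISU's actual table:

```python
# Toy illustration of "checklist" GOE judging, as described above.
positive_bullets = {
    "good height and distance": True,
    "good flow out of the landing": True,
    "unusual or interesting entrance": False,
}
negative_bullets = {
    "two-footed landing": False,
    "excessive telegraphing": False,
}

checked = sum(positive_bullets.values())        # number of positive boxes ticked
any_negative = any(negative_bullets.values())

# Hypothetical mapping from ticked boxes to a GOE grade (NOT the real ISU table).
if any_negative:
    goe = -1
elif checked >= 3:
    goe = 2
elif checked >= 1:
    goe = 1
else:
    goe = 0

print(f"GOE for this element: {goe:+d}")
```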

For PCS I think that the guidelines are less "yes or no" and more, "this skater's choreography was really quite excellent and she deserves to be in the mid-eights."

The other problem is that I did not really address your (Miller's) actual question; instead I substituted an easier one that I happened to know the answer to. :) The actual question is not "How big an interval do we need to be 95% confident that the true value of Nathan's performance lies somewhere in the interval?" The question is rather, "Did Nathan really beat Shoma, or is the contest 'too close to call' using standard statistical analysis?" The difference between these two questions is captured by this language: the first question requires an "independent sample test," while the latter calls for a "paired data test."

What this means is that the analysis given above does not take into account the possibility that a particular judge might be tougher than another and might give lower than average scores to all the skaters, whether they skated well or not. If there are only two skaters in the contest, this can be addressed in a straightforward way, but with many skaters it starts to get murky. There is a cool way to handle this called a two-way analysis of variance (ANOVA), but I am afraid to try it because the conclusion might turn out to be that none of these numbers is worth squat and we have wasted our time. ;)
 

moriel

Record Breaker
Joined
Mar 18, 2015
Probably something more complex could be done here, since now we can actually track a lot of stuff about the judges.
For example, it is reasonable to expect that a judge that tends to give lower scores to skaters will do it in all competitions (and since there are 8 other judges, he is very likely to be below average everywhere).

But for a paired test, we could go by the difference: calculate the differences between the individual scores first, and then average that.
For example, if judge A gave skater 1 a 9 in TR, and gave skater 2 a 9.25 in TR, we get a 0.25 difference there favouring skater 2. And so on, for all the PC scores.
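
A tiny sketch of that paired-difference idea; the per-judge scores here are invented, purely to show the mechanics:

```python
from statistics import mean, stdev
from math import sqrt

# Invented per-judge Transitions scores for two skaters, same nine-judge panel.
skater_1 = [8.75, 9.00, 8.75, 9.00, 9.25, 8.75, 9.00, 9.00, 8.75]
skater_2 = [9.00, 9.25, 8.75, 9.25, 9.25, 9.00, 9.00, 9.25, 9.00]

# Paired test: take the difference judge by judge, then average the differences.
diffs = [b - a for a, b in zip(skater_1, skater_2)]
d_mean = mean(diffs)
d_se = stdev(diffs) / sqrt(len(diffs))

print(f"mean difference = {d_mean:+.3f} (skater 2 minus skater 1)")
print(f"rough 95% interval: {d_mean - 2 * d_se:+.3f} to {d_mean + 2 * d_se:+.3f}")
# If the whole interval sits above zero, the panel favoured skater 2 on this
# component by more than judge-to-judge noise alone would explain.
```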

For multiple skaters, we could do some sort of regression, with judge being one of the variables.

But the thing is, it is VERY likely that very few of those numbers are worth anything, and we will come up with a result such as "in fact, 3 of those 6 skaters should have gotten medals"
 

Miller

Final Flight
Joined
Dec 29, 2016
The other problem is that I did not really address your (Miller's) actual question; instead I substituted an easier one that I happened to know the answer to. :) The actual question is not "How big an interval do we need to be 95% confident that the true value of Nathan's performance lies somewhere in the interval?" The question is rather, "Did Nathan really beat Shoma, or is the contest 'too close to call' using standard statistical analysis?" The difference between these two questions is captured by this language: the first question requires an "independent sample test," while the latter calls for a "paired data test."

What this means is that the analysis given above does not take into account the possibility that a particular judge might be tougher than another and might give lower than average scores to all the skaters, whether they skated well or not. If there are only two skaters in the contest, this can be addressed in a straightforward way, but with many skaters it starts to get murky. There is a cool way to handle this called a two-way analysis of variance (ANOVA), but I am afraid to try it because the conclusion might turn out to be that none of these numbers is worth squat and we have wasted our time. ;)

Of course this was the next question I was going to ask, LOL, but given that it was only 0.5 points out of 280 I would imagine it was the latter, plus of course there's always the random effect of things like deductions to take account of. According to the Eurosport commentators the only reason Shoma lost was that he had a new music edit that was longer than the 2 mins 50 seconds allowed for the SP and so he lost a point as a time violation. It would take a pretty good statistical system to take account of that!

Re Moriel's point, maybe there is someone out there gathering the data who will analyse it one day. It would be interesting from an intellectual point of view certainly.
 
Joined
Jun 21, 2003
But for a paired test, we could go by the difference: calculate the differences between the individual scores first, and then average that...

I thought of a crazy idea. We could match each skater against each of the others in this way. We would end up with a bunch of statements like, the probability that skater A really outperformed skater B is more than 95%, but we cannot say that the probability that skater C really outperformed skater D is more than 95%. Then we would have some sort of spreadsheet like in 6.0 ordinal judging, OBO, with only definite slays counting.

The best of both worlds. :)
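
Something like this, perhaps: a rough sketch in which the skaters and per-judge totals are placeholders, and a "definite win" just means the paired comparison clears the rough two-standard-error bar:

```python
from itertools import combinations
from statistics import mean, stdev
from math import sqrt

# Placeholder per-judge totals for three skaters (invented numbers).
panels = {
    "A": [92.1, 93.0, 91.8, 92.5, 93.2, 92.0, 92.8, 91.9, 92.6],
    "B": [90.0, 91.2, 90.5, 90.8, 91.0, 90.2, 90.9, 90.4, 90.7],
    "C": [92.4, 92.7, 92.1, 92.2, 93.0, 92.3, 92.5, 92.2, 92.3],
}

def definite_win(x, y):
    """True if x beats y by more than twice the paired standard error."""
    diffs = [a - b for a, b in zip(x, y)]
    return mean(diffs) - 2 * stdev(diffs) / sqrt(len(diffs)) > 0

# OBO-style tally: count only the head-to-head results that pass the test.
wins = {name: 0 for name in panels}
for p, q in combinations(panels, 2):
    if definite_win(panels[p], panels[q]):
        wins[p] += 1
    elif definite_win(panels[q], panels[p]):
        wins[q] += 1

print(wins)  # here A and C each clearly beat B, but A vs C is too close to call
```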

For multiple skaters, we could do some sort of regression, with judge being one of the variables.

But the thing is, it is VERY likely that very few of those numbers are worth anything, and we will come up with a result such as "in fact, 3 of those 6 skaters should have gotten [gold] medals"

IMHO the conclusion (three skaters all won -- a sports decision) does not really follow from the premise (most of the numbers do not pass statistical muster -- a statistical decision). No, we cannot say that Nathan's performance would beat Shoma's performance 95% of the time -- but this time, with this panel of judges -- it did.
 
Joined
Jun 21, 2003
Re Moriel's point, maybe there is someone out there gathering the data who will analyse it one day. It would be interesting from an intellectual point of view certainly.

Publish or perish! I think we can be sure that many professors and researchers, and especially their graduate students, will jump on these questions for their next scholarly paper. :yes: There was a lull in such activity during the time of anonymous judging when it was hard to tease out the relevant data (not that people didn't try. :) ).

For that matter, the ISU can hire statisticians just like anyone else can, and I have no doubt that they are constantly running statistical tests and analyses on the huge amount of data generated in a figure skating season.

IMHO, though, the most relevant observation is that, yes, certainly it is "interesting from an intellectual point of view." :)
But I think that the ISU (and the IOC) instead take a "sports point of view" and are guided by these principles:

1. A sports contest should have a clear winner.

2. Whoever scores the most points, wins.

Is this an appropriate model for the sport of figure skating? Well, a few of us fought the good fight, but in the end that ship has sailed. The IJS is what we've got.
 

moriel

Record Breaker
Joined
Mar 18, 2015
I personally would find it interesting to poke around in the data. My main issue is really collecting it, as I am rather lazy.

@Mathman, also, with enough data, we could probably have a collection of several competitions judged by a specific judge and so on, which would even add up.

As for the paired tests for everybody, I think that would require some Bonferroni correction, which would kill the significance and result in some pretty wide intervals.
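
For what it's worth, a quick sketch of how much a Bonferroni correction widens things, assuming every pair among 6 skaters gets its own paired test and using a normal approximation for the critical value:

```python
from statistics import NormalDist
from math import comb

n_skaters = 6
m = comb(n_skaters, 2)   # 15 pairwise comparisons
alpha = 0.05

# Bonferroni: each individual test must be run at alpha / m.
alpha_each = alpha / m   # about 0.0033

z_plain = NormalDist().inv_cdf(1 - alpha / 2)       # about 1.96
z_bonf = NormalDist().inv_cdf(1 - alpha_each / 2)   # about 2.94

print(f"per-test alpha: {alpha_each:.4f}")
print(f"each interval's half-width grows by a factor of {z_bonf / z_plain:.2f}")
```

With 6 skaters that factor is about 1.5, so the intervals do get noticeably wider.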
 