Judging math details | Page 2 | Golden Skate


GGFan

Record Breaker
Joined
Nov 9, 2013
This is low-key very interesting to me because so many consequences flow from these arbitrary decisions (from a margin-of-error point of view). At least back in the day you could carry your grudge against one or two judges. Now there are so many inputs that pretend to precision. Is a step sequence really worth 4.40 or 4.45, and if a feature is seen by one judge but not another, did a GOE difference change the outcome?? Then the judges need to judge each component, etc. The potential for error just keeps piling up.

I'm being somewhat facetious but yes within a certain margin of error it would be nice to have a tiebreaker to make it clear that this is just a coin flip.
 

gkelly

Record Breaker
Joined
Jul 26, 2003
This is low-key very interesting to me because so many consequences flow from these arbitrary decisions (from a margin-of-error point of view). At least back in the day you could carry your grudge against one or two judges.

And a big part of the fun of watching skating is grudges against judges?
(Anyone care to write a poem --maybe a limerick -- using that rhyme?)

Now there are so many inputs that pretend to precision. Is a step sequence really worth 4.40 or 4.45, and if a feature is seen by one judge but not another, did a GOE difference change the outcome??

Your overall point is valid, but just to be pedantic...

Judges don't look for "features" used to determine levels and base values. That's the job of the technical panel, and the three tech panel members work together to identify the features and come up with the level of a step sequence. Most features are either/or yes/no decisions, so it usually is meaningful to talk about what feature an element "really" deserved.

Judges each independently determine the grade of execution based on recommended positive bullet points and recommended reductions. Many of these are defined in terms of "good" or "poor" qualities, which means that by definition they are subjective. So, e.g., if some judges award 0 GOE, some give +1 worth 0.7, and some give +2 worth 1.4, and the averaged GOE on a level 4 step sequence worth 3.9 -- after dropping high and low, averaging the rest -- works out to 0.55 for a total of 4.45, what would it take for the GOE to work out to 0.40 instead? One more judge giving 0 instead of +1?
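As a sketch, the trimmed-mean arithmetic described above (drop the single highest and lowest GOE, average the rest, add to the base value) looks like this -- the panel distribution is made up, and the point values (+1 worth 0.7, +2 worth 1.4, level 4 step sequence base 3.9) are taken from the hypothetical in this post:

```python
def averaged_goe(goe_points):
    """Trimmed mean: drop the single highest and lowest GOE point values,
    then average the remaining judges' values."""
    vals = sorted(goe_points)
    trimmed = vals[1:-1]  # drop one high, one low
    return sum(trimmed) / len(trimmed)

# Hypothetical 9-judge panel: two judges at 0, four at +1 (0.7), three at +2 (1.4)
panel = [0.0, 0.0, 0.7, 0.7, 0.7, 0.7, 1.4, 1.4, 1.4]
base_value = 3.9  # level 4 step sequence, per the post
score = base_value + averaged_goe(panel)
```

With this particular made-up panel the trimmed average is 0.8, for a total of 4.7; shifting one middle judge down a GOE step moves the total by 0.1.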

It doesn't make sense to say that 4.40 (3.9 from the tech panel's determination of level 4 and 0.50 from the judging panel's average GOE) would be more correct than 4.45. If 0 and +1 are both valid GOEs for that element, judges who gave either score are both correct; there's no position of greater knowledge from which anyone can say with authority that exactly one of those independent judges should have scored that element one GOE step lower. The "correct" answer is whatever that panel came up with -- there is no correct GOE divorced from the evaluations of that particular panel.

With a different panel, or any other tiny difference in what the skaters did and what the judges saw and how they reflected that in numbers, the score might indeed have been different by hundredths or tenths of a point. But neither score is more correct than another. Make a different tiny change and you might end up with 4.50 instead of 4.45 or 4.40.

The GOEs are approximations, but there is no pre-existing truth that they approximate.

That was also true of 6.0 rankings and the various algorithms for determining results from the ordinals.

If the same number of judges agreed that X was better than Y, but one more judge preferred Z to X, that could flip the standings of X and Y regardless of where Z finished. If the ordinals were that mixed, does that make one result more correct than another?

Same as if you used different algorithms. E.g., what if the first step were "A skater with more 1st place ordinals than any other skater automatically wins" rather than "A skater with a majority of 1st place ordinals wins"?

Then 1 1 1 1 4 4 4 4 4 would beat 2 2 2 2 3 3 1 1 1, which was never the case under either the majority system or OBO. And definitely there were a few close decisions under majority that would have been different under OBO and vice versa.
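A minimal sketch of those two hypothetical first-step rules, using the ordinal lists from this example (simplified to a head-to-head between two skaters; the real 6.0 calculations involved further tie-breaks):

```python
def winner_most_firsts(ordinals_a, ordinals_b):
    """Hypothetical rule: the skater with more 1st-place ordinals wins outright."""
    fa, fb = ordinals_a.count(1), ordinals_b.count(1)
    if fa != fb:
        return 'A' if fa > fb else 'B'
    return None  # this rule alone doesn't decide

def majority_place(ordinals):
    """Lowest placement at which this skater holds a majority of ordinals --
    the core idea of the 6.0 'majority of ordinals' system (lower is better)."""
    need = len(ordinals) // 2 + 1
    for place in range(1, max(ordinals) + 1):
        if sum(1 for o in ordinals if o <= place) >= need:
            return place

# The two ordinal lists from the post:
a = [1, 1, 1, 1, 4, 4, 4, 4, 4]
b = [2, 2, 2, 2, 3, 3, 1, 1, 1]
print(winner_most_firsts(a, b))              # A: 4 firsts vs 3
print(majority_place(a), majority_place(b))  # A has a majority only at place 4, B at place 2
```

So the "most firsts" rule picks A, while the majority calculation picks B -- the same panel, different algorithm, different result.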

The result was always what that particular judging panel did under the particular scoring rules in effect at that competition. Slightly different rules could result in different results, and neither result would be more correct than the other from any absolute omniscient position.

I'm being somewhat facetious but yes within a certain margin of error it would be nice to have a tiebreaker to make it clear that this is just a coin flip.

What kind of tiebreaker?

It seems to me what might make sense is to keep all decimal places to the end of the event and then to round only the Total Segment Score for that competition phase. (Yes, others have pointed out earlier in this thread that that means the totals shown to two decimal places won't always match the individual data points shown as contributing to those totals.)

Then round to one decimal place to determine the final results of that phase, and if rounded scores are identical, allow the tie. Or allow the TES or PCS to break the tie in short or long program, respectively, which is already the case for ties at the two-decimal-place level.

And of course for overall results, the freeskate score is the tiebreaker.
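That proposal could be sketched like this, for two skaters in one segment (hypothetical scores; `phase_result` is just an illustrative name, and here TES stands in for whichever tiebreaker applies to the segment):

```python
def phase_result(tss_a, tss_b, tes_a, tes_b):
    """Proposed tiebreak sketch: compare Total Segment Scores rounded to one
    decimal place; if those are equal, fall back to TES (as in the existing
    short-program tiebreak rule), else declare a tie."""
    ra, rb = round(tss_a, 1), round(tss_b, 1)
    if ra != rb:
        return 'A' if ra > rb else 'B'
    if tes_a != tes_b:
        return 'A' if tes_a > tes_b else 'B'
    return 'tie'

# 71.48 and 71.52 both round to 71.5, so the TES comparison decides
print(phase_result(71.48, 71.52, 38.2, 37.9))  # A
```

Under today's two-decimal comparison skater B would win this segment outright; under the one-decimal proposal it goes to the tiebreaker instead.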
 
Joined
Jun 21, 2003
To tell the truth, I have never been able to work up any righteous indignation over the details of the IJS. The reason why is that, mathematically speaking, we have gone off base as soon as we start adding and averaging. If one judge gives 8.25 and another gives 8.75, what meaning can be attached to the number that you get when you add these two scores together and divide by 2?

This problem was faced head-on by ordinal judging. You would never say that 1st place + 2nd place = 3rd place, or that the skater received "one-and-a-half'th place" over all. This would be an abuse of mathematical concepts.
 
Joined
Jun 21, 2003
The 95% confidence interval would be a really progressive method and a good approach from the mathematics point of view. All sciences use this 95% interval, so why not sports?

Honestly, I think this will never happen, even though it would make a whole world of difference. But that is only because we did not all grow up with those statistics and have not gotten used to them so far.
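For what it's worth, a 95% confidence interval for one panel's mean component score might be sketched like this (hypothetical panel scores; this uses the normal approximation -- a proper t-interval would be somewhat wider for a panel this small):

```python
import statistics as st

def ci95(scores):
    """Approximate 95% confidence interval for the panel's mean score,
    using the normal approximation (1.96 standard errors either side)."""
    n = len(scores)
    mean = st.mean(scores)
    sem = st.stdev(scores) / n ** 0.5  # standard error of the mean
    margin = 1.96 * sem
    return mean - margin, mean + margin

# Hypothetical 7-judge panel of component scores
panel = [8.25, 8.50, 8.00, 8.75, 8.25, 8.50, 8.25]
lo, hi = ci95(panel)
```

Two skaters whose intervals overlap could then be called statistically tied, which is roughly the proposal being quoted here.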

That would be the beauty of it. You could use figure skating to teach kids some mathematics!

Just like skating teaches kids about classical music. Watching a figure skating competition on TV might be the only chance someone has of hearing a Rachmaninov piano concerto (other than a Bugs Bunny cartoon).
 

GGFan

Record Breaker
Joined
Nov 9, 2013
What kind of tiebreaker?

It seems to me what might make sense is to keep all decimal places to the end of the event and then to round only the Total Segment Score for that competition phase. (Yes, others have pointed out earlier in this thread that that means the totals shown to two decimal places won't always match the individual data points shown as contributing to those totals.)

Then round to one decimal place to determine the final results of that phase, and if rounded scores are identical, allow the tie. Or allow the TES or PCS to break the tie in short or long program, respectively, which is already the case for ties at the two-decimal-place level.

And of course for overall results, the freeskate score is the tiebreaker.

I feel bad that you analyzed my example because it wasn't meant to be accurate at all. I always appreciate your explanations of the system however!

I haven't thought very much about the tiebreakers and some of my ideas might be way too radical. But we could inject 6.0 back in and use ordinals as a tiebreaker :biggrin: Hating judges is not only fun, it's cathartic and a good place for fans to direct their ire at the subjective nature of the sport. I think the current system is not as clever in that sense. It aims to quell the ire by providing the satisfaction of numbers, but that doesn't change the subjective nature of the sport. Now you still have ire but no evil American or East German judge to be pissed at.
 

gkelly

Record Breaker
Joined
Jul 26, 2003
:yes: This is the whole thing, right here. The GOEs are approximations, but what is it that they approximate?

How does the math work for evaluating results in which a population scores subjective perceptions of any continuous variables on a visual analog scale vs. discrete integer scores "on a scale of 1 to 10" (or 0 to 6, or -3 to +3) that are then averaged to achieve a consensus score for the population as a whole?

If a market researcher determines that the focus group rated product P or candidate Q as 4.45 on the scale, on average, how does that relate to the "true" value of that product or candidate?

If subjects in a clinical trial rate the severity of drug side effects as 4.45 on average, how does that relate to the "true" severity?

These are also approximations of the same nature.
 

QuadThrow

Medalist
Joined
Oct 1, 2014
That would be the beauty of it. You could use figure skating to teach kids some mathematics!

Just like skating teaches kids about classical music. Watching a figure skating competition on TV might be the only chance someone has of hearing a Rachmaninov piano concerto (other than a Bugs Bunny cartoon).

This is exactly my plan as a math teacher.

There are so many things our kids should do at least once. But here in Europe children usually get to know soccer and that's it. And after school they do not know what to do with their lives, because their experiences have not been colourful enough. The culture gets lost more and more.
 
Joined
Jun 21, 2003
Now you still have ire but no evil American or East German judge to be pissed at.

I admit that for me yelling at the referee in any sport is part of the deal, just don't throw beer bottles. (In figure skating we can get even with the judges by throwing flowers and stuffed animals onto the ice for our favorite, so there!) To me, all these numbers and calculations to the second decimal place suppress and bully-down our emotions, because who are you going to be mad at, the computer?

How does the math work for evaluating results in which a population scores subjective perceptions of any continuous variables on a visual analog scale vs. discrete integer scores "on a scale of 1 to 10" (or 0 to 6, or -3 to +3) that are then averaged to achieve a consensus score for the population as a whole?

It's the same. There are tiny adjustments that attempt to compensate for approximating an integer-valued distribution by a theoretical continuous one, but it is not really worth mentioning.

If a market researcher determines that the focus group rated product P or candidate Q as 4.45 on the scale, on average, how does that relate to the "true" value of that product or candidate?

If subjects in a clinical trial rate the severity of drug side effects as 4.45 on average, how does that relate to the "true" severity?

In these cases the "true" value is the average of the evaluations that would be given by the entire population, if we had the resources to determine this. This "true mean" usually exists only theoretically, but it is real enough for applications of this type. The people in the focus group constitute a sample. If the sample is random, then the analysis is straightforward. If the researcher makes an effort to choose a focus group that mirrors the characteristics of the population (how many men, how many women), then one hopes that the results are more reliable. But this depends on human intervention to determine which characteristics of the population are important (and is why professional poll-takers get paid).

By the way, in studies like this a conscientious researcher concludes: "In this study the severity of drug side-effects was rated at 4.45." This is 100% true. It's when you try to say: "Most likely you will have the same experience" that you have to start hedging and face up to the question of what "most likely" means.

These are also approximations of the same nature.

Yes, they are.

The population would be the set of all well-qualified figure skating judges (better still, the fictitious population of all possible figure skating judges who have ever lived or might possibly live in the future). With a few mild and reasonable assumptions we can take the standard deviation of the sample to estimate (in unit-free numbers) how far away each judge's score is from this theoretical "true mean."

This is how studies of this sort go. As applied to figure skating, I am not sure what it is you learn when you learn something. At best you can say, judge #4 scored this skater significantly below average on transitions. This seems like a lot of work to conclude something that is obvious at a glance.
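A sketch of that "unit-free distance" calculation, with made-up transitions scores in which one judge sits well below the rest of the panel:

```python
import statistics as st

def judge_z_scores(scores):
    """Distance of each judge's score from the panel mean, expressed in
    sample standard deviations -- the unit-free comparison described above."""
    mean = st.mean(scores)
    sd = st.stdev(scores)
    return [(s - mean) / sd for s in scores]

# Hypothetical transitions scores from a 9-judge panel;
# judge #4 (index 3) is noticeably below everyone else
transitions = [8.25, 8.50, 8.25, 6.75, 8.00, 8.25, 8.50, 8.25, 8.00]
z = judge_z_scores(transitions)
```

Judge #4 comes out around 2.5 standard deviations below the panel mean -- which, as noted, mostly confirms what was already obvious at a glance.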

Edit: If you rank a product on a scale from one to ten, all this is cool. But if you have several products and are asked to rank them best, second best, etc., then all this is out the window. Now the objects of study are the lists of ordinals. These are not numbers ("I like this one best" is not a number); they cannot be "averaged" in any natural numerical way. Hence OBO, majority of ordinals, etc. As you mentioned above, the choice of which method to use is not a mathematical one.
 

noidont

Final Flight
Joined
Mar 27, 2010
Well, with the quarter of a point increments aren't judges already rounding up in their heads?
 

QuadThrow

Medalist
Joined
Oct 1, 2014
Well, with the quarter of a point increments aren't judges already rounding up in their heads?

No, definitely not. That would mean the judges know the exact difference between a 7.75 and a 7.50. But no one does. You do not think: "Oh, that was a 7.543, so let's give a 7.50."

It is the other way around. You think: "That was quite OK, it is in the 7s. Let's give a 7.50."

That means the system is too sensitive: judges mark skaters more precisely than a human is probably able to.
 

moriel

Record Breaker
Joined
Mar 18, 2015
No, definitely not. That would mean the judges know the exact difference between a 7.75 and a 7.50. But no one does. You do not think: "Oh, that was a 7.543, so let's give a 7.50."

It is the other way around. You think: "That was quite OK, it is in the 7s. Let's give a 7.50."

That means the system is too sensitive: judges mark skaters more precisely than a human is probably able to.

Yep, exactly.
That's another thing that probably should go. One cannot define the difference between 7.75 and 7.50, but it's probably doable to explain the difference between 7 and 8.
So round to the nearest integer instead of the 0.25 gaps.

And they could take the median instead of the mean (imho it's better because it is less sensitive to possible outliers, and they wouldn't even need to remove 2 scores, since the median would automatically take care of that)
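A quick illustration of why the median is less sensitive to an outlier judge than the mean (made-up scores, one judge far below the rest):

```python
import statistics as st

# Eight judges clustered around 8.0-8.5, one outlier at 5.00
scores = [8.25, 8.50, 8.25, 8.00, 8.25, 8.50, 8.25, 8.00, 5.00]

print(st.mean(scores))    # dragged down below 8 by the single 5.00
print(st.median(scores))  # 8.25 -- the outlier has no effect at all
```

The mean lands around 7.89, while the median stays at 8.25; this is the sense in which the median "automatically takes care of" extreme scores without needing a trimming rule.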
 

Baron Vladimir

Record Breaker
Joined
Dec 18, 2014
I think that the ISU feels that a sports contest should always have a winner, come hell or high water. In the 6.0 era the judges were forbidden from giving the exact same score to two skaters. I don't think that the ISU would ever admit that their numbers are fuzzy.

Well, that's the point of every sport competition, to determine the winner, isn't it. But the current system is more objective than 6.0 in the sense that it gives more information about why and how someone wins. Like in any other sport, the winner is not always the one who is subjectively observed as better, but the one who meets the criteria better, and thus scores more points. And sometimes, when the differences in scores are as small as in the 2014 ice dance example, just the one who had more luck with the numbers. So in those cases where the sum of the numbers is applied, the margin between the numbers tells you when the winner is obvious, and in other cases when he/she/they just had more luck or applied some rule of the game better (when a winner is determined only because there has to be one under the rules of the sport). Editing: now the problem is which margin between scores tells you that, and I think the 2014 case is the best example of a winner being determined just because there should be one.
 

QuadThrow

Medalist
Joined
Oct 1, 2014
Well, that's the point of every sport competition, to determine the winner, isn't it. But the current system is more objective than 6.0 in the sense that it gives more information about why and how someone wins. Like in any other sport, the winner is not always the one who is subjectively observed as better, but the one who meets the criteria better, and thus scores more points. And sometimes, when the differences in scores are as small as in the 2014 ice dance example, just the one who had more luck with the numbers. So in those cases where the sum of the numbers is applied, the margin between the numbers tells you when the winner is obvious, and in other cases when he/she/they just had more luck or applied some rule of the game better (when a winner is determined only because there has to be one under the rules of the sport).

I think we should not ultimately be pleased with a system which allows somebody to win because of luck with the numbers.

Especially if we are able to create a better system.
 

Baron Vladimir

Record Breaker
Joined
Dec 18, 2014
I think we should not ultimately be pleased with a system which allows somebody to win because of luck with the numbers.

Especially if we are able to create a better system.

I think that is a philosophical problem of any sport, and of maths itself. There will always be room to say something like that when the goal is to determine a winner. The system can also give its own 'definition of luck' -- the factors which can affect the scores -- by using one method of scoring... but ice condition, noise in the arena, time of day of the competition, the usual human mistakes in observation/judging, etc. could also be defined as deciding factors when the margin between scores allows for that... if viewers and skaters themselves don't already consider those as factors :biggrin:
 

noidont

Final Flight
Joined
Mar 27, 2010
I don't know. I think it's quite likely that a judge would give Higuchi an 8.5 and Miyahara an 8.75 and later find Mihara in the middle, for example. They would absolutely have to round up in this scenario. At 2014 Worlds this absolutely could have played into it.

Also I disagree that IJS is more objective... IJS is very subjective. 6.0 is subjective in the sense that a panel of judges has interpretative authority. In IJS, the one(s) who write the COP have the authority. Judges, and to a lesser degree technical specialists and the mathematical algorithm, are just "functions" in this grand narrative. Take a current scoresheet and apply the changes of BV and grades of GOE they want to make after the Olympics and you would probably get a brand new podium. It only seems more "objective" because no one can put a name and face to the author and editor of IJS...
 

Baron Vladimir

Record Breaker
Joined
Dec 18, 2014
I don't know. I think it's quite likely that a judge would give Higuchi an 8.5 and Miyahara an 8.75 and later find Mihara in the middle, for example. They would absolutely have to round up in this scenario. At 2014 Worlds this absolutely could have played into it.

Also I disagree that IJS is more objective... IJS is very subjective. 6.0 is subjective in the sense that a panel of judges has interpretative authority. In IJS, the one(s) who write the COP have the authority. Judges, and to a lesser degree technical specialists and the mathematical algorithm, are just "functions" in this grand narrative. Take a current scoresheet and apply the changes of BV and grades of GOE they want to make after the Olympics and you would probably get a brand new podium. It only seems more "objective" because no one can put a name and face to the author and editor of IJS...

That's true. With current rules Evgeni would probably win over Evan, and Asada would be very close to YuNa in Vancouver (if somebody calculated their scores by the current rules of BVs and GOEs, I would like to see that!). But the point is: in any sport the rules change, depending on what the sports organization would like to see in its sport. And at any one exact point in time they are the same for all competitors, whose job is to learn to play by those rules. Everybody playing by the same rules, and being judged in one exact competition by those same rules, is how objectivity in sport is achieved.
 

Miller

Final Flight
Joined
Dec 29, 2016
Effect of different judges' PCS or GOE scores.

If every judge gave 0.25 higher or lower across the board for PCS it would make a difference of plus or minus 2 points in the ladies LP, i.e. 0.4 per component and 0.06 per counting judge. Has this ever made a difference?

Effect of an individual judge's GOE: 0.07 for a 0.5-per-GOE element, e.g. a spin; 0.10 for a 0.7 one, e.g. it is impossible to get a level 4 step sequence that scores 4.45 with a 9-judge panel -- it has to be 4.40 or 4.50 (4.46 is possible with a 7-judge panel, at 0.14 per GOE difference); and 0.14 for a 1.0 one, e.g. a quad. Yet again, has this ever made a difference? It doesn't take much at all.
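Those achievable-score claims can be checked by brute force. A sketch, assuming each counted judge gives an integer GOE from -3 to +3 (as under the pre-2018 system), with the +1 value of 0.7 and base value 3.9 used in this thread:

```python
def achievable_scores(base, goe_step, panel_size):
    """All element totals reachable when each counted judge (panel minus the
    dropped high and low) gives an integer GOE worth a multiple of goe_step,
    and the counted GOEs are averaged. Enumerates every possible sum."""
    counted = panel_size - 2
    totals = set()
    for s in range(-3 * counted, 3 * counted + 1):  # sum of counted integer GOEs
        totals.add(round(base + s * goe_step / counted, 2))
    return totals

# Level 4 step sequence, base value 3.9, +1 GOE worth 0.7
nine = achievable_scores(3.9, 0.7, 9)   # 7 counted judges: steps of 0.10
seven = achievable_scores(3.9, 0.7, 7)  # 5 counted judges: steps of 0.14
print(4.45 in nine)   # False -- a 9-judge panel can only land on multiples of 0.1
print(4.46 in seven)  # True -- reachable with a 7-judge panel
```

This reproduces the claim above: with 9 judges the total must be 4.40 or 4.50 (etc.), never 4.45, while 4.46 is reachable with 7 judges.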
 

gkelly

Record Breaker
Joined
Jul 26, 2003
Yep, exactly.
That's another thing that probably should go. One cannot define the difference between 7.75 and 7.50, but it's probably doable to explain the difference between 7 and 8.

7 is officially defined as "Good" and 8 as "Very Good."

Say you have watched tens of thousands of skating programs at various levels over the years. Maybe hundreds of thousands for a very active long-time judge.

You have a pretty good idea of what you mean, and what your peers mean, when you say a performance was Good, vs. Very Good.

But the quality of these performances is a continuous variable, for each component -- some a little better and some a little worse.

For any given performance, you might say to yourself "The Interpretation in that performance was on the high side of all performances I would consider Good, but not quite up to the threshold I would consider Very Good. OK, was it closer to my benchmark for Good or my benchmark for Very Good?" If closer to Good, you could award 7.25. If closer to Very Good but not quite there, you could give 7.75. If right in the middle, then 7.5.

There will surely be times when you have to decide between 7.5 or 7.75 because on an analog scale you would have marked that performance somewhere around 2/3 of the way between Good and Very Good.

And sometimes you might remember what score you gave a previous skater in the same event and decide that this performance deserves the same or higher or lower and calibrate from there. If you've already given 7.5 to one and 7.75 to another and believe this latest performance should fit somewhere in between on that component, under the current scoring rules you wouldn't have the option of scoring it in between, but you might be able to reflect the difference in a different component that reflects some of the same strengths.

So round to the nearest integer instead of the 0.25 gaps.

Why? What would be the value of giving judges less control over reflecting the differences they perceive over different performances?

Using such a roughly calibrated scale would allow judges to separate skaters into groups of good or very good, but it wouldn't allow them to distinguish all the very good skaters from each other.


Remember that the 10.0 scale covers all of skating, where 0 represents not doing anything at all, or, officially, "Extremely Poor" and 1 is "Very Poor."

In any given competition, the range of scores will likely be much narrower. For example, at the Grand Prix Final we would expect all the skaters who qualified to be at least Very Good in most areas. (Of course, they might have a bad performance on the day and only deserve scores in the Good or Above Average ranges that day. But Average or Above Average would not be defined by what we typically see on the Grand Prix.)

At a novice competition, the best performances might be only in that Average range, with many qualifying as Fair or Weak on most components. So there wouldn't be a lot of room to distinguish 12 or 18 novice skaters from each other if the only scores you're using are 3, 4, and 5 with no decimal places.

A larger competition that includes skaters with a wider variety of skill levels -- e.g., one where entry is based primarily on age, such as a JGP event -- would likely have some outliers and would use a wider range of scores. But because the field is large, there will still be multiple skaters clustered in the same general skill range.

That's one reason why it is also useful to give separate scores for multiple components, even if they do end up in the same general range for all components.

If judges were scoring every bullet point of the components separately rather than grouping them into 5 general components, then it would make sense to give integer scores for each one. Suppose judges scored Transitions by giving separate integer scores for Continuity of movements from one element to another, Variety, Difficulty, and Quality. With four criteria, if you average an individual judge's scores for those four, you end up with increments of 0.25. E.g., if a judge thinks the skater had Good (7) quality and continuity, but only Average (5) difficulty and Fair (4) variety, the average would come out to 5.75 for Transitions as a whole for that performance from that judge. Depending on how each judge's thought processes work, maybe that's very similar to exactly how some judges are arriving at scores.
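That hypothetical four-criteria average works out like this (the criterion names and marks are just the example from the paragraph above):

```python
# One judge's integer marks for four hypothetical Transitions criteria:
# Good (7) quality and continuity, Average (5) difficulty, Fair (4) variety
criteria = {'continuity': 7, 'variety': 4, 'difficulty': 5, 'quality': 7}

# Averaging four integers always lands on a multiple of 0.25
component = sum(criteria.values()) / len(criteria)
print(component)  # (7 + 4 + 5 + 7) / 4 = 5.75
```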

And they could take median instead of mean (imho its better because less sensible to possible outliers, and they wouldnt even need to remove 2 scores, since median would automatically take care of that)

If you used only integers for only 5 components and took medians instead of means, you would end up with a lot of absolute ties on PCS. Is the goal to devalue the effect that PCS has on final results and leave it to TES to drive results?

I think we should not ultimately be pleased with a system which allows somebody to win because of luck with the numbers.

Especially if we are able to create a better system.

What would such a better system be?

Do you mean keep the current division of labor for tech panels and judges, but maybe adjust the Scale of Values yet again in a direction you prefer, and also change the way the numbers are awarded by judges and crunched by computers/accounting? I.e., a tweak to the current basic concept, similar to how factored placements or OBO were tweaks to the 6.0 system as previously used?

I expect the next major change from the current scoring system to something completely different will be to use instruments rather than human perception to measure the aspects that can be objectively measured.

But since so much of what makes one performance "better" than another is qualitative and often subjective, I think we will always have to rely on human judgment to capture those distinctions. And if those opinions are somehow to be combined with objective measurements, they will need to somehow be turned into numbers of some sort.
 

moriel

Record Breaker
Joined
Mar 18, 2015
Imho, actually better control. Because in the case of 7 vs 8, it is a more objective difference. It is easier to evaluate. It is easier to distinguish.
When you give people too many gradations, it doesn't make it easier; it makes it harder.
 