This is spurred by a couple of blog posts at Phil Birnbaum's site ("Noll-Scully doesn't measure anything real") and TangoTiger's site ("Trap of Noll-Scully"). And I also hope it helps with some Twitter discussion that didn't work out there (@BMMillsy, @guymolyneux, @dataandme).

Here's what I understand about the Noll-Scully RSD measure plus the best that I can make out about the issues raised above.

Thought process. The standard deviation of final winning percent is useful in relation to competitive balance (which also follows from the underlying distribution of talent in the league). Suppose one version of a perfectly balanced league, given by Pr(win)=0.5 for all teams and all games. Start this version of a completely competitively balanced league at G_1 games. Change it to G_2 games. What happens?
Let ISD be the standard deviation of winning percent in this version of a completely balanced league.
Fort and Quirk (Journal of Economic Literature, 1992) show that ISD=0.5/sqrt(G) for the binomial without ties. Thus, moving from G_1 to G_2, ISD(G_1) will be different than ISD(G_2) because ISD depends on G. This helps to make clear that, *in general*, the standard deviation of winning percent depends on G as well. Let ASD be the actual standard deviation of winning percent observed in a league.
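
To make the dependence on G concrete, here is a minimal Python sketch of the ISD formula. The function name `ideal_sd` and the 82-game case are illustrative assumptions; 16 and 162 games are the NFL and MLB season lengths.

```python
import math

def ideal_sd(games: int) -> float:
    """SD of winning percent when every game is a fair coin flip (no ties)."""
    return 0.5 / math.sqrt(games)

# Illustrative season lengths: 16 (NFL), 82 (assumed for comparison), 162 (MLB)
isd_by_length = {g: ideal_sd(g) for g in (16, 82, 162)}
for g, isd in isd_by_length.items():
    print(f"G={g:3d}  ISD={isd:.4f}")
```

The same perfectly balanced league shows a smaller ISD the longer the season, which is the sense in which any standard deviation of winning percent carries G along with it.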
Here is what I get from the discussion/Twitter noted above. Suppose we have a statistic Z that measures the outcome of applying the talent distribution in league play. In a league with season length G_1, we get Z(G_1). If the league changes to season length G_2, we get Z(G_2). Define a *successful* Z to have the following characteristic: Z(G_1) = Z(G_2) because the underlying talent distribution is the same in either case, just applied in leagues of different season length.
So, how to reconcile Z and ASD? Even in a league of equal playing strength, so that Pr(win)=0.5, ISD changes with G and so will ASD generally.

Let’s consider three alternatives (there may be more).

Alternative 1: Z* = ASD ± dASD/dG. Now it will be the case that Z*(G_1) = Z*(G_2) because the impact of just changing season length will be netted out by calculation of the impact of G on ASD and addition or subtraction. This is sort of like an “inflation adjustment”. The distribution of talent didn’t change and our Z* provides the comforting result that it is the same for either G_1 or G_2. Of course, this requires knowing dASD/dG.

Alternative 2: RSD = ASD/ISD. It is immediately clear that there is no way that RSD can be a successful Z. It will always change with G and it contains no adjustment that would make it stay the same regardless of G.

Alternative 3: Just dump the standard deviation as useful because you don’t like either of the above. There are other measures of final season competitive balance as a reflection of the distribution of talent. But don’t propose a game-level or playoff access or dynasty alternative since we’re talking about final season competitive balance. There are other measures of those other aspects of balance as well.

So the point is well made that RSD cannot be a candidate for Z. But it was never intended to be such (I know Noll quite well and knew Scully well prior to his death). It really is meant to be a distance comparison measure: 5 steps from my door is farther than 2 steps from your door, so I am farther from my door than you are from yours.

Perhaps there is just a semantic misunderstanding when the literature using RSD states that it "controls" for G? Surely the Z* measure does this forcefully. But RSD does it relatively, so maybe a better way of saying it is that RSD "recognizes" G in its relative comparison.
Some concluding comments...

So far, I haven’t seen anybody take a crack at calculating dASD/dG. I wonder if the related critics Owen & King (Economic Inquiry, 2015) are actually just simulating dASD/dG, in which case they are an ally to those seeking Z*. One could just use their simulation results rather than trying to determine the derivative, dASD/dG. Or perhaps the Pythagorean discussion in the blog posts referenced at the top of this post handles this problem already? If not, then there is a ways to go still with Z* development.
But I’m still not so sure that taking the "inflation adjustment approach" is any more informative than what is done with RSD. Z* distills the dASD/dG problem to an absolute level. RSD just puts the comparison at a relative level.
It does seem to me that the Z* devotees are not really critiquing RSD as a normalization. They would just prefer to take the direct Z* approach rather than taking the relativist approach.
And I chose the word “prefer” carefully. While it is easy to see that Z* is *different* than RSD, I still don’t see how Z* is *superior* to RSD. And it is not enough to just say so. In any event, if Z* is shown to be better at a later date, future work will be the better for it.
In the meantime, I have competitive balance to compare, within a league where season length changes and across leagues with different season lengths. And I haven't yet been dissuaded on RSD as one useful measure.

Hi, Dr. Fort,

My argument is that there *is* a "successful" Z statistic:

Z = square root of (ASD squared - ISD squared).

In this measure, how is Z(G1)=Z(G2), that is, dZ/dG = 0? Both ASD and ISD change with G. Do you show this at your blog? Or can you email me a proof?

This comment has been removed by the author.

Let me try to make the previous "proof" clearer.

ISD is defined as the expected SD of an idealized league where every team is .500 talent.

I say: *Redefine* ISD as the expected SD of all the teams' deviations from their expected record based on their talent.

For all teams being .500 talent, the definition is identical, since SD(team record) = SD(team record - constant .500).

For teams being different from .500, you can still use the same formula you're using (sqrt of .25/G). It will be close enough, because even for a .600 team, the actual value is (sqrt of .24/G), and for a .700 team, the actual value is (sqrt of .21/G), which is still close.
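
The size of that approximation can be checked with a short Python sketch. The function name `luck_sd` and the 162-game season are illustrative assumptions; the formula is the usual binomial SD of a winning percentage.

```python
import math

def luck_sd(p, games):
    """Binomial SD of winning percent for a team that wins each game with probability p."""
    return math.sqrt(p * (1 - p) / games)

GAMES = 162  # illustrative season length
for p in (0.5, 0.6, 0.7):
    ratio = luck_sd(p, GAMES) / luck_sd(0.5, GAMES)
    print(f"talent {p:.3f}: per-game variance {p * (1 - p):.2f}, "
          f"SD relative to a .500 team {ratio:.3f}")
```

The ratio does not depend on the season length chosen, so the .500 formula overstates the luck SD by about 2% for a .600 team and about 8% for a .700 team.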

If you accept that approximation, and the redefinition of ISD to make it work for non-.500 leagues, then

ASD squared = ISD squared + TSD squared

Where TSD is the (estimated) SD of team talent, and its expected value does not depend on G.

I get what the measure is, but I still don't see how it satisfies the Z* idea (not mine, but the one that others keep stressing) that Z(G1) = Z(G2). Maybe you can just show this by demonstration if not by proof?

Also, I apologize since it appears one of your replies is not here! I never get comments at my blog so I'm not proficient at managing them. I'm trying to figure out how I goofed up, but given this reply of yours made it, I think the flow of the discussion is intact. Again, my apologies.

Sure, will write something up.

Part I (had to split to get the comment to work):

Let me try to demonstrate it by starting the other way.

We agree that for a league of .500 teams, ISD = sqrt(0.25/G). That is, the expected observed SD of that league is sqrt(0.25/G). (I say "expected," but that might not be precisely true in the mathematical sense -- the expected value of the SD might not be that. Maybe the expected value of the variance is the square of that. So take "expected" in a colloquial sense, like how the SD is the "typical deviation".)

Now, suppose the league is half .600 teams, and half .400 teams (that's their expected record, like 60% and 40% biased coins). What's the expected observed SD now?

Well, first, the SD of the team expectations themselves is .100. That's the SD of "half the teams at .400 and half the teams at .600". (For the earlier case, where every team was at .500, the SD of team expectations was zero, since the SD of a bunch of identical ".500"s is zero.)

I'm going to call this, the SD of team expectations, "SD(talent)".

Now, let's talk about the SD of *deviations* from .600/.400. Because, of course, not every .600 team will finish at exactly .600, just like in your example, not every team finished at exactly its expectation of .500.

The SD of the difference from the expected record -- that is, the SD of the deviations from .600 or .400, respectively -- is sqrt(0.24/G). That's by the same binomial approximation to normal as in the .500 case. Well, in the .500 case, it was sqrt(0.25/G). But they're close.

I'm going to call this "SD(luck)".

Now, a team's record is exactly its "talent" plus its "luck", the way we defined talent and luck here. That is, if one of the teams with .600 talent (97-65, say) goes .620 (about 100-62), its "talent" is .600, its "luck" is +.020, and its record is .620. That's by definition.

So

actual = talent + luck

So, by properties of variance,

Var(actual) = var(talent) + var(luck) + 2 cov(talent, luck)

Since talent and luck are independent, the covariance term is zero, so

Var(actual) = var(talent) + var(luck)

Which means

ASD squared = SD(talent) squared + SD(luck) squared

Which means, if we treat SD(luck) as the same as "ISD of the .500 case" -- which it almost is, except that it's sqrt(.24/G) instead of sqrt(.25/G) -- we get

ASD squared = SD(talent) squared + ".500 case ISD" squared

So SD(talent) squared = ASD squared - ".500 case ISD" squared

By definition, SD(talent) doesn't depend on G. ASD and ".500 case ISD" *do* depend on G, but they must "cancel each other out" in the subtraction, for this identity to work.

Which is why I say, the statistic Z, where

Z = square root of (ASD squared - ".500 case ISD" squared)

is a "successful Z" that doesn't depend on G. Furthermore, it's exactly what we want to know, because we defined it as the SD of team expectations, the quantity of interest!
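
As a numerical check of that cancellation, here is a minimal Python sketch using the half-.600, half-.400 league from above. It plugs in expected variances rather than simulated seasons, and the names (`expected_stats`, `TALENT_SD`, `LUCK_VAR_PER_GAME`) are illustrative assumptions.

```python
import math

TALENT_SD = 0.100              # half the league at .600 talent, half at .400
LUCK_VAR_PER_GAME = 0.6 * 0.4  # binomial variance of one game for a .600 (or .400) team

def expected_stats(games):
    """Expected ASD, the .500-case ISD, Noll-Scully RSD, and Z for a given season length."""
    asd = math.sqrt(TALENT_SD ** 2 + LUCK_VAR_PER_GAME / games)
    isd = 0.5 / math.sqrt(games)
    rsd = asd / isd
    z = math.sqrt(asd ** 2 - isd ** 2)
    return asd, isd, rsd, z

for games in (16, 162):
    asd, isd, rsd, z = expected_stats(games)
    print(f"G={games:3d}  ASD={asd:.4f}  ISD={isd:.4f}  RSD={rsd:.3f}  Z={z:.4f}")
```

Under these assumptions Z stays close to .100 at both season lengths (slightly below it, because the .500-case ISD overstates luck a little), while RSD moves from roughly 1.3 at 16 games to roughly 2.7 at 162.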

Does that make sense?

------

Part II:

As for a demonstration: I would predict that if you take a random sample of X games for each team, out of a season of 162 games, you will find that:

1. For Noll-Scully, the lower the X, the lower the Noll-Scully

2. For the Z above, Z will not depend on X at all. It'll jump around for random reasons, but will "average" about the same for all values of X. (Somewhere between .05 and .06, probably).

I'd predict that if you calculated N-S and Z *right now*, after 20 games or whatever it is in the MLB season right now, N-S will look very low, and Z will look "normal". But Z does have a large SD for a 20-game sample, so you never know. But I'd easily bet a substantial amount that today's Z divided by last year's Z is greater than today's N-S divided by last year's N-S.

And if you gave me 2:1 odds, I'd bet that today's Z is less than last year's Z. (the correct odds are probably ... 1.2:1, or something?) But if you wanted me to bet that today's N-S is greater than last year's N-S, I probably wouldn't even take 25:1.
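
A rough way to try this demonstration is a small Monte Carlo sketch in Python. It is a simplification, not the exact experiment described: instead of sampling X games from a real 162-game season, each team plays X independent games against a .500 opponent, and the 30 talent levels (evenly spaced from .400 to .600), the seed, and the trial count are all assumptions. The variance identity above implies RSD squared is roughly 1 + 4·X·Var(talent), so the output also shows which way Noll-Scully moves with X.

```python
import math
import random
import statistics

def simulate(games, talents, trials, rng):
    """Average Noll-Scully RSD and the talent-SD estimate Z over simulated seasons."""
    isd = 0.5 / math.sqrt(games)
    rsd_sum = 0.0
    var_gap_sum = 0.0
    for _ in range(trials):
        # Each team's winning percent over `games` independent games
        records = [sum(rng.random() < p for _ in range(games)) / games
                   for p in talents]
        asd_sq = statistics.variance(records)  # sample variance of winning percent
        rsd_sum += math.sqrt(asd_sq) / isd
        var_gap_sum += asd_sq - isd ** 2
    # Average the variance gap first, then take the root, so that seasons
    # with ASD below ISD do not break the square root.
    mean_gap = var_gap_sum / trials
    return rsd_sum / trials, math.sqrt(max(mean_gap, 0.0))

rng = random.Random(2016)                            # fixed seed for repeatability
talents = [0.40 + 0.20 * i / 29 for i in range(30)]  # hypothetical league
results = {x: simulate(x, talents, 300, rng) for x in (20, 81, 162)}

print(f"SD(talent) = {statistics.stdev(talents):.4f}")
for x, (rsd, z) in results.items():
    print(f"X={x:3d}  mean RSD={rsd:.3f}  Z={z:.4f}")
```

In this setup Z should hover near the SD of the assumed talents for every X, while the average RSD changes substantially between 20 and 162 games.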

Rod, I’m afraid you’ve started your rumination at what should be the conclusion: the “idealized” SD. Let’s instead start at the beginning: We want to compare competitiveness in two leagues of different season lengths G (or the same league at different times, with intervening change in G). We could simply compare ASDs and say one league is more competitive. But instead we sometimes use RSD. The only conceivable reason to do this is because G influences the ASD -- if it didn’t, we’d just use ASD and be done! So, the purpose of RSD must be to allow a fair comparison of leagues correcting for the influence of different season lengths.

To control for season length, we should adjust ASD in a way consistent with the actual impact of G. If a given increase in G reduced both ASD and ISD proportionately, then your RSD metric would be an elegant solution. Unfortunately, increasing G does *not* reduce ASD and ISD proportionately. And thus, dividing ASD by ISD cannot provide an estimate of competitive balance independent of G.

Nor does RSD tell us the “distance” from ideal balance, because the size of the “footsteps” in each league’s RSD are different, depending on G. And since ISD does not change proportionally to the change in ASD as G varies, the differing size of the footsteps makes RSD an apples-to-oranges comparison.

The RSD solution -- “hey, let’s just divide by the ISD” -- never had any statistically valid justification, which is why none has ever been offered. It was an intuition, but one that turns out on further inspection to take us nowhere. It would never be created today. The only reason to use it is because many have used it in the past. But that is really no reason at all.

Hopefully, you are Guy Molyneux? I hope so since you were part of the Twitter discussion and I'm glad you found your way here.

I disagree on your point that a useful RSD requires proportional changes in ASD and ISD, but that is moot if I can handle your very important second issue.

I think I now see what the problem is with the "step size" issue; a game in MLB is worth .0062 winning percent points and a game in the NFL is worth .0625 winning percent points and so on.
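
The two step sizes are just one game expressed as a share of the season, as this short Python sketch shows. The function name `win_pct_per_game` is an illustrative assumption.

```python
def win_pct_per_game(games):
    """Winning percent points that a single game is worth in a season of `games` games."""
    return 1 / games

mlb_step = win_pct_per_game(162)
nfl_step = win_pct_per_game(16)
print(f"MLB: {mlb_step:.4f}  NFL: {nfl_step:.4f}  ratio: {nfl_step / mlb_step:.3f}")
```

One NFL game moves winning percent about ten times as far as one MLB game does.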

But that suggests that the "distance" can be normalized relative to a particular league so that comparisons across leagues are again viable.

I did so and you can see my try here:

https://umich.box.com/s/i44dxbz578suyavnfq1m16i6lakulsgu

The results are extremely important for RSD. Both the magnitude of relative RSD and the ranking given by relative RSD suggest big issues for the past use of RSD.

Rod: Yes, I am the same "Guy."

As a non-UMich person, I cannot access your link. But in any case, the reason that RSD does not provide a common "footstep" size for comparison is that the "units" are each league's respective ISD, which of course varies by G. And if your answer to that is "that's OK, we want to account for the difference in G," then my reply is your metric should reflect the actual impact of G on the ASD. But RSD does not, as you acknowledge. And if ISD has no relation to the actual effect of G on ASD, then RSD is just ASD divided by an arbitrary and shifting denominator.

Here is an analogy: We know that the dimensions of a particular ballpark add 5 HR for a player on that team. A 10 HR hitter will hit 15, and a 40 HR hitter will hit 45. To adjust for that in comparing these players to the rest of the league, we should subtract 5 HR from their total. But instead, we say the average player here hits 20 HR (but only 15 HR elsewhere), so we will subtract 25% from each player's HR total when comparing to the league. That is what RSD does: it takes an additive relationship (fewer games adds variance) and pretends it is multiplicative. And that gives you the wrong answer.
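
The arithmetic of the analogy can be laid out in a few lines of Python. The numbers are the ones in the analogy; the function and constant names are illustrative assumptions.

```python
PARK_BONUS_HR = 5       # home runs the park adds to every hitter
PARK_AVG_HR = 20        # average hitter's total in this park
LEAGUE_AVG_HR = 15      # average hitter's total elsewhere

def additive_adjust(observed_hr):
    """Remove the park effect the way it actually operates: subtract 5 HR."""
    return observed_hr - PARK_BONUS_HR

def proportional_adjust(observed_hr):
    """Scale by the ratio of averages, i.e. subtract 25% from each total."""
    return observed_hr * LEAGUE_AVG_HR / PARK_AVG_HR

for true_hr in (10, 40):
    observed = true_hr + PARK_BONUS_HR
    print(f"true {true_hr:2d}  observed {observed:2d}  "
          f"additive {additive_adjust(observed):5.2f}  "
          f"proportional {proportional_adjust(observed):5.2f}")
```

The additive adjustment recovers 10 and 40, while the proportional one returns 11.25 and 33.75, overstating the weak hitter and understating the strong one.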

And note that RSD is not just "less good" than other competitive balance metrics. It is actively misleading. Berri, for example, uses it to show that the NFL is much more competitive than the NBA, despite similar ASD, once you adjust for schedule length. That is simply a false claim. So the use of RSD is literally reducing sports economists' understanding of these issues.

Sorry about that. I had the link access set incorrectly. I think you can get it now. And I think the spreadsheet there does go to your points. Look forward to your response.

This comment has been removed by the author.

I'm not sure I follow your exercise. In any case, we may be at an impasse. Our objection is that RSD does not effectively control for length of schedule. I believe that you agree, but say that was not its true purpose. So that is really our disagreement.

Obviously, I can't speak to *your* purpose in using the metric. But I would make two points: 1) it is in fact frequently used by sports economists, explicitly, for exactly that purpose (Vrooman 1995, Berri et al.). You obviously know that body of work very well. If this is all a huge misunderstanding about the true purpose of RSD, why have you never pointed out this error to your colleagues? And 2) using RSD leads to false conclusions about competitive balance (e.g. that the NFL is highly competitive), undermining the value of work that relies on the metric. I'm not a sports economist, but if I were I think that would trouble me.

But at least we can agree that RSD does not control for schedule length, which is something. And perhaps you are (or will be) persuaded that Phil has correctly calculated "Z" (which he has). So this has been a productive discussion, even if we cannot reach consensus.

Thanks, Guy.
