What is Regression Toward the Mean?

Read an article about baseball analysis or listen to a sabermetrically slanted podcast and you’re bound to come across the term “regression toward the mean” or a remark that “player X’s BABIP is bound to regress”.  But what exactly does this mean?  For God’s sake, the last statistics class I took was my sophomore year in college.  Even the least technical baseball analysts throw this phrase around like it’s common knowledge.

A Practical Example

Generally speaking, the average BABIP (batting average on balls in play) in Major League Baseball is in the neighborhood of .300.  There are a variety of factors that can push a player’s BABIP higher or lower than that mark, but disregard them for this simple example.  For this discussion, let’s assume that just as the odds of a flipped coin landing heads are .500, the odds that any ball batted into play will result in a hit are .300.

So in this example world, if we took a group of 100 fantasy baseball hitters and let them play out an entire season, we would expect the BABIP for each individual, and for the group, to be .300.  But just as flipping a coin 100 times won’t always result in 50 heads and 50 tails, some players will finish with a BABIP well above .300 and others well below it.  Those with BABIPs over .300 will have benefited from good luck, while those under .300 will have experienced bad luck.

Now assume the players are split into two groups: one with the 50 highest BABIPs in our fake world, and the other with the 50 lowest.

Because every batted ball has a three-in-ten chance of being a hit (.300), even for the group of the 50 highest BABIPs we would still expect their batting average on balls in play to be .300 in the second season.  Likewise for the hitters with the lowest BABIPs.  Even though they had a low BABIP in the first year of our experiment, we would still expect a BABIP of .300 in the second year.
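
If you would rather see this than take my word for it, here is a minimal Python sketch of the experiment. It only illustrates the simple world described above. The 450 balls in play per hitter per season, the random seed, and the function names are assumptions picked for the example, not real data.

```python
import random

TRUE_BABIP = 0.300     # every hitter shares this rate in the simple world
BALLS_IN_PLAY = 450    # assumed balls in play per hitter per season
NUM_HITTERS = 100

def simulate_season(rng, n_hitters=NUM_HITTERS, bip=BALLS_IN_PLAY, rate=TRUE_BABIP):
    """Return one simulated BABIP per hitter (each ball in play is a weighted coin flip)."""
    return [sum(rng.random() < rate for _ in range(bip)) / bip
            for _ in range(n_hitters)]

def mean(values):
    return sum(values) / len(values)

rng = random.Random(42)            # arbitrary seed so the run is repeatable
season_one = simulate_season(rng)
season_two = simulate_season(rng)  # same hitters, brand new luck

# Rank hitters by season-one BABIP, then split into the top 50 and bottom 50.
order = sorted(range(NUM_HITTERS), key=lambda i: season_one[i], reverse=True)
top, bottom = order[:50], order[50:]

top_y1 = mean([season_one[i] for i in top])
bottom_y1 = mean([season_one[i] for i in bottom])
top_y2 = mean([season_two[i] for i in top])
bottom_y2 = mean([season_two[i] for i in bottom])

print(f"Top 50:    season one {top_y1:.3f}, season two {top_y2:.3f}")
print(f"Bottom 50: season one {bottom_y1:.3f}, season two {bottom_y2:.3f}")
```

The top group's season-one BABIP should come out comfortably above the bottom group's, while both groups should land close to .300 in season two.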

That’s What Regression Toward The Mean Is

Despite an above average performance in the past, you would still expect the player to have a .300 BABIP in the second year.  You expect their BABIP TO REGRESS TOWARD THE MEAN of .300.

The term regression applies to both those that outperformed in the past and those that underperformed.  A player with a BABIP of .250 in the first year of the experiment would be expected to “regress” toward the mean of .300.

Don’t Make a Huge Mistake

A common mistake is to assume that someone who has been lucky in the past will be “punished” or experience bad luck in the future.  THIS IS NOT TRUE.  If you flipped a coin 10 times and it landed on heads all 10 of those flips, you would still expect five heads on your next 10 flips.  You would not expect zero heads or 10 tails.

If a player gets off to a “hot” or “lucky” start, you can expect them to “cool off” (or regress toward the mean).  But it would be a mistake to believe they will become “cold” or “unlucky”.  You should expect them to move toward their “average” or “expected” level.
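
The coin example is easy to check with a quick simulation. The sketch below is a rough illustration, not a proof: it plays out 200,000 sets of 20 fair flips, keeps only the sets that opened with 10 straight heads, and averages the heads in the following 10 flips. The trial count and seed are arbitrary choices.

```python
import random

rng = random.Random(7)       # arbitrary seed so the run is repeatable
TRIALS = 200_000
ALL_HEADS = 0b1111111111     # ten 1-bits, meaning ten heads in a row

followups = []
for _ in range(TRIALS):
    flips = rng.getrandbits(20)    # 20 fair flips, one per bit (1 = heads)
    first_ten = flips >> 10        # the first ten flips
    next_ten = flips & ALL_HEADS   # the ten flips that follow
    if first_ten == ALL_HEADS:     # keep only the "lucky" starts
        followups.append(bin(next_ten).count("1"))

average_heads = sum(followups) / len(followups)
print(f"{len(followups)} lucky starts, {average_heads:.2f} heads on average in the next 10 flips")
```

The average should sit close to five. The coin has no memory of the hot start, so it neither keeps the streak going nor pays it back.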

Shall We Play A Game?

At the time of writing, Carlos Gomez’s BABIP is .421.  Assuming our simple world where every player is expected to have a BABIP of .300, what “should” Gomez’s BABIP be at the end of the season?

A.  Something greater than .300

B.  .300

C.  Something below .300

The correct answer is….  A!  Let’s take a look.

To this point Gomez has a .421 BABIP based on 45 hits on balls in play and 107 total balls batted into play (45 / 107 = .42056).

Gomez has played in 39 games.  So we’re roughly 25% into the season.  Going through a very rough calculation, we would then assume through the next 120 games Gomez will put the ball into play 321 times (107 through roughly 40 games * 3 = 321 balls put into play for 120 games).

And if we assume a .300 BABIP on those 321 balls put into play, that calculates out to about 96 hits on balls in play (.300 * 321 = 96.3).

For the season we have:

45 + 96 = 141 hits on balls in play

107 + 321 = 428 total balls put into play

141 / 428 = .329 BABIP for the season

Those 45 hits on balls in play are already “in the bag”. They cannot be taken away. So the end result should be a BABIP over .300 for the season.
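
The same arithmetic fits in a couple of small Python functions, which makes it easy to plug in another player's numbers. This is a sketch of the rough calculation above and nothing more. The function names are my own, and the 321 remaining balls in play is the same back-of-the-envelope estimate used in the text.

```python
def projected_babip(hits, balls_in_play, remaining_balls_in_play, expected_rate=0.300):
    """Blend the hits already banked with an expected rate for the rest of the season."""
    future_hits = expected_rate * remaining_balls_in_play
    return (hits + future_hits) / (balls_in_play + remaining_balls_in_play)

def rate_needed(hits, balls_in_play, remaining_balls_in_play, target=0.300):
    """BABIP required on the remaining balls in play to finish the season at the target."""
    total = balls_in_play + remaining_balls_in_play
    return (target * total - hits) / remaining_balls_in_play

current = 45 / 107
season = projected_babip(45, 107, 321)
needed = rate_needed(45, 107, 321)

print(f"Current BABIP:             {current:.3f}")   # 0.421
print(f"Projected season BABIP:    {season:.3f}")    # 0.330
print(f"Rest-of-season for .300:   {needed:.3f}")    # 0.260
```

Carrying the fractional 96.3 expected hits gives .330, and rounding to 96 whole hits gives .329. Either way the answer is above .300. The last figure shows why answers B and C are wrong: to finish at exactly .300, Gomez would need to hit about .260 on balls in play the rest of the way, which means being unlucky rather than merely average.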

Apply This Elsewhere, With Caution

Regression toward the mean can be applied to other statistics.  You must be careful to apply it only to statistics that are not significantly affected by skill.  For instance, pitchers with good fastballs, “stuff”, deception, and control are simply going to strike out more batters than pitchers with poor control and limited ability.  It would be a mistake to expect a skilled pitcher’s strikeout rate to regress toward the league average.

Statistics like pitcher home run per fly ball, line drive percentage, and left-on-base percentage tend to fall in predictable ranges.  Extreme deviations from average are likely to regress.

Realize That We Don’t Live In A Simple World

It’s very important to realize that we don’t live in a simple world, especially when it comes to baseball.  To some extent, all statistics in baseball can be influenced by the player’s skill level.

While nobody has been able to consistently flip heads on a coin 60% of the time, certain players have been able to consistently achieve BABIPs higher than .300.  We know faster players can achieve higher BABIPs.  But slow players have done this too.  Some pitchers can consistently control home runs per fly ball.  Some hitters consistently hit line drives.
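
To get a feel for what skill does to the picture, here is a hedged variation on the simple-world experiment in which each hitter has his own "true" BABIP. Everything numeric here is an assumption for illustration: the talent spread (a standard deviation of .015 around .300), the 450 balls in play per season, and the 1,000 hitters (far more than a real league, used only to steady the averages).

```python
import random

rng = random.Random(11)    # arbitrary seed so the run is repeatable
NUM_HITTERS = 1000         # large group so the averages are stable
BALLS_IN_PLAY = 450        # assumed balls in play per hitter per season

# Hypothetical talent levels: each hitter's own true BABIP, centered on .300.
talents = [rng.gauss(0.300, 0.015) for _ in range(NUM_HITTERS)]

def simulate_babip(rng, talent, bip=BALLS_IN_PLAY):
    """One season of balls in play for a hitter with the given true BABIP."""
    return sum(rng.random() < talent for _ in range(bip)) / bip

def mean(values):
    return sum(values) / len(values)

year1 = [simulate_babip(rng, t) for t in talents]
year2 = [simulate_babip(rng, t) for t in talents]

# Split on year-one results, exactly as in the simple-world experiment.
order = sorted(range(NUM_HITTERS), key=lambda i: year1[i], reverse=True)
half = NUM_HITTERS // 2
top, bottom = order[:half], order[half:]

top_y1, top_y2 = mean([year1[i] for i in top]), mean([year2[i] for i in top])
bottom_y1, bottom_y2 = mean([year1[i] for i in bottom]), mean([year2[i] for i in bottom])

print(f"Top half:    year one {top_y1:.3f}, year two {top_y2:.3f}")
print(f"Bottom half: year one {bottom_y1:.3f}, year two {bottom_y2:.3f}")
```

With skill in the mix, the top half should still fall back in year two, but not all the way to .300, and it should stay ahead of the bottom half. Each group regresses toward its own players' true levels, not toward the league average.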

The Take Away

It’s important to understand the concept of regression and to know the common pitfalls in applying it.  At the very least you can use this knowledge to identify “experts” who need to revisit their college statistics textbook.

Make smart choices.