# Module 6

## Statistics

### Sample Size

• Descriptive Statistics – discussed in Module 1; characterize data through dispersion and central tendency (baseball analytics)
• Inferential Statistics – use a sample of the population to draw inferences about the whole population (projections)
• Law of Large Numbers – as sample size goes up, the sample estimate gets closer to the true population value
• Calculating sample size goes deeper into statistics than this course covers. The factors are the z score, the standard deviation (SD) of the population, and the margin of error
• If z score goes up, sample size goes up
• If SD goes up, sample size goes up
• If margin of error goes up, sample size goes down
• Russell Carleton (pizzacutter) ran a study in 2007 to determine how long baseball stats take to normalize (stay consistent with each other)
• Plotted batting average in odd-numbered at bats vs. batting average in even-numbered at bats
• Wanted R^2 of at least 0.50
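
The three sample-size relationships listed above (z score, SD, margin of error) all follow from the standard formula for estimating a mean, n = (z × SD / E)^2, where E is the margin of error. A minimal Python sketch, assuming that formula is the one intended; the function name `required_sample_size` and the example numbers are made up for illustration:

```python
import math


def required_sample_size(z: float, sd: float, margin_of_error: float) -> int:
    """Smallest whole n satisfying z * sd / sqrt(n) <= margin_of_error.

    Assumes the textbook formula n = (z * sd / E)^2 for estimating a mean.
    """
    if z <= 0 or sd < 0 or margin_of_error <= 0:
        raise ValueError("z and margin_of_error must be positive, sd non-negative")
    # Round up: a fractional observation is not possible.
    return math.ceil((z * sd / margin_of_error) ** 2)


# Hypothetical inputs: 95% confidence (z = 1.96), SD = 10, margin of error = 2.
baseline = required_sample_size(1.96, 10, 2)        # 97
higher_z = required_sample_size(2.576, 10, 2)       # larger than baseline
higher_sd = required_sample_size(1.96, 20, 2)       # larger than baseline
wider_margin = required_sample_size(1.96, 10, 4)    # smaller than baseline
```

Because the margin of error sits in the denominator and the whole ratio is squared, halving the margin of error roughly quadruples the required sample.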

## Sabermetrics

### WAR

• Wins Above Replacement
• Statistical framework for total offense and defense contributions that reflects performance and playing time
• Variations
• WARP (Baseball Prospectus)
• fWAR (FanGraphs)
• bWAR / rWAR (Baseball Reference)
• oWAR (Open source WAR)
• Components:
• Replacement level
• Runs to Wins
• Runs Estimators
• Morris Greenberg has studied the three main variations of WAR
• Cannot predict any WAR measure just by knowing another alternative WAR calculation
• The three variations do not have a relationship to each other
• The three methods probably assign different win values to hitter production
• Increased playing time leads to greater variance between the three measures (it amplifies the differences)
• Hitters who play 2B, SS, or CF show more variable results than other positions, which suggests the systems differ in their fielding calculations at these positions
• There is also more variation for pitchers at the extremes (the top and the bottom pitchers). This is probably due to how the systems attribute outcomes to pitchers versus the defense.
• The basic framework for calculating WAR uses the Pythagorean calculation, extended to become: W/L = (R/RA)^2
• Instead of squaring the ratio, you can try to determine the best exponent to use (in place of 2)
• For 1901 to the present, the best-fit exponent is 1.863
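
The Pythagorean relationship above, W/L = (R/RA)^x, is equivalent to a winning percentage of R^x / (R^x + RA^x), and taking logs gives log(W/L) = x × log(R/RA), so an exponent can be estimated as a regression slope through the origin. A minimal Python sketch under those assumptions; the function names and the team records are hypothetical, and the notes do not say which fitting method produced 1.863:

```python
import math


def pythagorean_win_pct(runs_scored: float, runs_allowed: float,
                        exponent: float = 2.0) -> float:
    """Expected winning percentage: R^x / (R^x + RA^x)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


def fit_exponent(teams) -> float:
    """Least-squares slope through the origin of log(W/L) on log(R/RA).

    `teams` is an iterable of (wins, losses, runs, runs_allowed) tuples.
    This is one simple way to fit the exponent, not necessarily the
    method behind the 1.863 figure quoted in the notes.
    """
    numerator = 0.0
    denominator = 0.0
    for wins, losses, runs, runs_allowed in teams:
        x = math.log(runs / runs_allowed)
        y = math.log(wins / losses)
        numerator += x * y
        denominator += x * x
    return numerator / denominator


# Made-up team seasons (wins, losses, runs, runs allowed), for illustration only.
sample_teams = [
    (95, 67, 800, 680),
    (70, 92, 650, 760),
    (84, 78, 720, 700),
]
fitted = fit_exponent(sample_teams)
expected_pct = pythagorean_win_pct(800, 700, exponent=1.863)
```

With real data, `teams` would hold one tuple per team season for the years of interest.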

## Technology

• Anything over 100 is above average, below 100 is below average
• Can use this same formula for any batting statistic
• This SQL query on the Retrosheet game log can help you calculate it for teams (this one computes an HR park factor)
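
As a language-neutral illustration of an index where 100 is average, one common way to express a home-run park factor is 100 × (HR per game in the team's home games) / (HR per game in its road games), counting home runs by both teams. A minimal Python sketch assuming that formulation, which may differ from the exact formula used in the course; the function name and the totals are hypothetical:

```python
def park_factor(home_total: float, home_games: int,
                road_total: float, road_games: int) -> float:
    """Index a per-game rate at home against the same rate on the road.

    Totals should count both teams (the club and its opponents).
    Assumed formulation: 100 * (home rate / road rate).
    """
    if home_games <= 0 or road_games <= 0 or road_total <= 0:
        raise ValueError("games and road_total must be positive")
    home_rate = home_total / home_games
    road_rate = road_total / road_games
    return 100.0 * home_rate / road_rate


# Made-up season totals: 200 HR in 81 home games, 160 HR in 81 road games.
hr_factor = park_factor(200, 81, 160, 81)  # above 100, so a HR-friendly park
```

Passing totals for a different batting statistic (doubles, hits, runs) gives the corresponding factor with the same calculation.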

### RStudio

• lm() – fitting linear models
• function(arglist) expr
return(value)
• Build functions to automate calculations you expect to repeat often

### SQL

• CASE
WHEN (expression) THEN (result)
ELSE (result)
END

## History

### George Lindsey

• Canadian, University of Toronto, WWII, PhD from Cambridge, Nuclear Scientist
• First work, in 1959, looked at batting average and whether it could be used to predict future performance.
• Concluded there were too many variables at play (differing pitchers, parks, situations, leagues, etc.).
• Studied batter-pitcher handedness matchups and found a statistically significant advantage for opposite-handed hitters
• First published statistical study of L/R batting splits
• Also the first published look at run-scoring expectancy in bases-loaded situations
• Was trying to figure out whether you should try for a force out at home or for a double play in bases-loaded situations
• 2nd paper, "Progress of the Score During a Baseball Game"
• Is the probability of scoring in any half inning the same? Is run scoring by inning homogeneous?
• The 1st (high), 2nd (low), and 3rd (high) innings differ from the mean
• Likely due to structure of batting order
• Does the history of scoring affect subsequent innings? Is scoring independent?
• Large run scoring does not seem to be independent