## Statistics

### Sample Size

- Descriptive Statistics – discussed in Module 1; characterize data in terms of dispersion and central tendency (baseball analytics)
- Inferential Statistics – use a sample of the population and draw inferences about the whole population (projections)
- Law of Large Numbers – as sample size goes up, the sample estimate gets closer to the true population value
- Calculating sample size is more of a statistics topic than a focus of this course. The factors are the z-score, the standard deviation (SD) of the population, and the margin of error
- If the z-score goes up, sample size goes up
- If SD goes up, sample size goes up
- If margin of error goes up, sample size goes down
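
The three relationships above can be checked with a short sketch. It assumes the standard sample size formula for estimating a mean, n = (z * SD / margin of error)^2, which the notes do not state explicitly; the function name and the input values are hypothetical.

```python
import math


def required_sample_size(z: float, sd: float, margin_of_error: float) -> int:
    """Smallest whole n with n >= (z * sd / margin_of_error)^2.

    Assumes the textbook formula for estimating a population mean.
    """
    if margin_of_error <= 0:
        raise ValueError("margin_of_error must be positive")
    return math.ceil((z * sd / margin_of_error) ** 2)


# Hypothetical inputs, chosen only to show the direction of each effect.
baseline = required_sample_size(z=1.96, sd=0.10, margin_of_error=0.02)
higher_z = required_sample_size(z=2.58, sd=0.10, margin_of_error=0.02)
higher_sd = required_sample_size(z=1.96, sd=0.20, margin_of_error=0.02)
wider_margin = required_sample_size(z=1.96, sd=0.10, margin_of_error=0.04)

print(baseline, higher_z, higher_sd, wider_margin)
```

Because the margin of error sits in the denominator and the whole ratio is squared, doubling the margin cuts the required sample to roughly a quarter.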

- Russell Carleton (pizzacutter) ran a study in 2007 to determine how long baseball stats take to normalize (stay consistent with each other)
- Plotted batting average in odd at-bats vs. batting average in even at-bats
- Wanted an R^2 of at least 0.50 between the two halves
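
A minimal sketch of the odd/even split described above, assuming each player's at-bats are stored as a sequence of 1 (hit) and 0 (out). The simulated players and their talent spread are hypothetical and are not Carleton's data or his exact method.

```python
import math
import random


def batting_average(at_bats):
    """Hits divided by at-bats for a sequence of 1 (hit) / 0 (out)."""
    return sum(at_bats) / len(at_bats)


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one side has no variance")
    return sxy / math.sqrt(sxx * syy)


def split_half_r_squared(players):
    """R^2 between batting average in odd and even at-bats across players."""
    odd_avgs = [batting_average(ab[0::2]) for ab in players]   # 1st, 3rd, 5th, ...
    even_avgs = [batting_average(ab[1::2]) for ab in players]  # 2nd, 4th, 6th, ...
    return pearson_r(odd_avgs, even_avgs) ** 2


def simulate_players(n_players, n_at_bats, seed=0):
    """Hypothetical hitters whose true averages are spread around .260."""
    rng = random.Random(seed)
    players = []
    for _ in range(n_players):
        true_avg = rng.gauss(0.260, 0.025)
        players.append([1 if rng.random() < true_avg else 0 for _ in range(n_at_bats)])
    return players


# Watch how the split-half R^2 changes as each simulated player gets more at-bats.
for n_at_bats in (50, 200, 800, 3200):
    r2 = split_half_r_squared(simulate_players(300, n_at_bats))
    print(f"{n_at_bats:>5} at-bats: R^2 = {r2:.2f}")
```

The sample size at which the R^2 first clears the chosen threshold is the point at which the stat would be treated as normalized.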

## Sabermetrics

### WAR

- Wins Above Replacement
- Statistical framework for total offense and defense contributions that reflects performance and playing time
- Variations
- WARP (Baseball Prospectus)
- fWAR (FanGraphs)
- bWAR / rWAR (Baseball Reference)
- openWAR (open-source WAR)

- Components:
- Replacement level
- Runs to Wins
- Runs Estimators

- Morris Greenberg has studied the three main variations of WAR
- One WAR measure cannot be predicted just from knowing another WAR calculation
- The three variations do not have a predictable relationship to one another
- The three methods probably assign different win values to hitter production
- Increased playing time leads to greater variance among the three measures (amplifies the differences)
- Hitters who play 2B, SS, or CF show more variable outcomes than those at other positions, suggesting the systems' fielding calculations differ most at these positions
- There is also more variation for pitchers at the extremes (the top and the bottom pitchers), probably due to how the systems attribute outcomes to pitchers versus the defense.

- The basic framework for calculating WAR uses the Pythagorean calculation, rewritten as: W/L = (R/RA)^2
- Instead of squaring, you can try to determine the best-fit exponent to use (instead of 2)
- For 1901 to the present, the best-fit exponent is 1.863
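
One way to look for a best-fit exponent is sketched here. Taking logs of W/L = (R/RA)^x gives log(W/L) = x * log(R/RA), so x can be estimated by least squares through the origin. The notes do not say how the 1.863 figure was fitted, so this method is an assumption, and the team seasons are hypothetical.

```python
import math


def pythagorean_win_pct(runs, runs_allowed, exponent=2.0):
    """Expected winning percentage; equivalent to W/L = (R/RA)^exponent."""
    rs = runs ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


def fit_exponent(team_seasons):
    """Least-squares slope through the origin of log(W/L) on log(R/RA).

    team_seasons: iterable of (wins, losses, runs, runs_allowed).
    """
    numerator = 0.0
    denominator = 0.0
    for wins, losses, runs, runs_allowed in team_seasons:
        y = math.log(wins / losses)
        x = math.log(runs / runs_allowed)
        numerator += x * y
        denominator += x * x
    return numerator / denominator


# Hypothetical team seasons: (wins, losses, runs scored, runs allowed).
seasons = [(95, 67, 800, 650), (81, 81, 700, 702), (60, 102, 600, 780)]
print(f"fitted exponent: {fit_exponent(seasons):.3f}")
print(f"win pct with the 1.863 exponent: {pythagorean_win_pct(800, 650, 1.863):.3f}")
```

Running the same fit over real team seasons is how an exponent other than 2 would be compared against the squared version.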

## Technology

- Park Factor = 100 * ((homeRS+homeRA)/homeG) / ((roadRS+roadRA)/roadG)
- Anything over 100 is above average; anything below 100 is below average
- The same formula can be used for any batting statistic
- A SQL query on the Retrosheet game logs can calculate it for teams (the course example computes an HR park factor)
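
The Park Factor formula above translates directly into a small function. The parameter names and the team totals below are hypothetical; to reuse the formula for another statistic, pass that statistic's for/against totals instead of runs.

```python
def park_factor(home_for, home_against, home_games,
                road_for, road_against, road_games):
    """100 * (per-game total at home) / (per-game total on the road)."""
    home_rate = (home_for + home_against) / home_games
    road_rate = (road_for + road_against) / road_games
    return 100 * home_rate / road_rate


# Hypothetical team: 810 total runs in 81 home games, 729 in 81 road games.
print(f"{park_factor(400, 410, 81, 360, 369, 81):.1f}")
```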

### RStudio

- lm() – fitting linear models
- function(arglist) expr, return(value) – build functions to automate recalculations you commonly repeat

### SQL

- CASE WHEN (expression) THEN (result) ELSE (result) END
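
A minimal sketch of CASE in use, run through Python's built-in sqlite3 module so it is self-contained. The `games` table, its column names, and the team codes are hypothetical stand-ins, not the Retrosheet game log schema, and this is not the course's query. It uses CASE to split a team's home runs into home and road totals for an HR park factor.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE games ("
    "home_team TEXT, visiting_team TEXT, home_hr INTEGER, visiting_hr INTEGER)"
)
# Hypothetical games: (home team, visiting team, HR by home team, HR by visitors).
conn.executemany(
    "INSERT INTO games VALUES (?, ?, ?, ?)",
    [
        ("AAA", "BBB", 3, 2),
        ("AAA", "CCC", 2, 1),
        ("BBB", "AAA", 1, 1),
        ("CCC", "AAA", 0, 2),
        ("BBB", "CCC", 1, 0),
    ],
)

# CASE picks which games count toward each total, depending on where :team played.
QUERY = """
SELECT
    SUM(CASE WHEN home_team = :team THEN home_hr + visiting_hr ELSE 0 END) AS home_hr_total,
    SUM(CASE WHEN home_team = :team THEN 1 ELSE 0 END) AS home_games,
    SUM(CASE WHEN visiting_team = :team THEN home_hr + visiting_hr ELSE 0 END) AS road_hr_total,
    SUM(CASE WHEN visiting_team = :team THEN 1 ELSE 0 END) AS road_games
FROM games
WHERE home_team = :team OR visiting_team = :team
"""


def hr_park_factor(connection, team):
    """HR park factor: 100 * (HR per home game) / (HR per road game)."""
    home_hr, home_g, road_hr, road_g = connection.execute(QUERY, {"team": team}).fetchone()
    return 100 * (home_hr / home_g) / (road_hr / road_g)


print(hr_park_factor(conn, "AAA"))
```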

## History

### George Lindsey

- Canadian, University of Toronto, WWII, PhD from Cambridge, Nuclear Scientist
- First work, in 1959, looked at batting average and whether it could be used to predict future performance.
- Concluded there were too many variables at play (differing pitchers, parks, situations, leagues, etc.).
- Studied batter-pitcher handedness matchups and found a statistically significant advantage for opposite-handed hitters
- First published statistical study of L/R batting splits
- Also the first published look at run scoring expectancy in bases-loaded situations
- Was trying to figure out whether you should try for a force out at home or for a double play in bases-loaded situations

- 2nd paper, Progress of the Score During a Baseball Game
- Is the probability of scoring in any half inning the same? Is run scoring by inning homogeneous?
- The 1st (high), 2nd (low), and 3rd (high) innings differ from the mean
- Likely due to structure of batting order

- Does the history of scoring affect subsequent innings? Is scoring independent?
- Large run scoring does not seem to be independent
- Lead changes follow a model of independence, meaning there is no tendency to overcome leads

- Studied home field advantage (54%)

- 3rd paper, An Investigation of Strategies in Baseball
- Addresses whether to steal a base, sacrifice bunt with fewer than 2 outs, issue an IBB when 1st base is open, etc.
- The most important runs (in terms of affecting win probability) are those that tie the score or put you 1 run ahead
- Analyzed all base states and the probability of scoring 0, 1, 2 runs. Also looked at the expected runs for each situation.
- He uses these expectations to determine optimal strategy (if you bunt in this situation, what does it do to your expected runs?).
- If you are in a specific situation and then you hit a triple, what happens? What happens to your expected runs in the new situation? How much does it increase from before?
- The beginning of linear weights
- A value added approach of measuring a player
- Concludes the IBB is not useful except when leading by one run with runners on 2nd and 3rd (loading the bases actually increases the likelihood of scoring 0 runs, but the overall expected runs go up too)
- Concludes you should always play at double-play depth with a runner on 1st, except when a runner on 3rd represents the tying or go-ahead run
- Concludes the sacrifice play does not improve scoring chances.
- Concludes the value of a stolen base depends on the individual's success rate
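
The value-added idea behind linear weights can be sketched as a lookup in a run expectancy table: the value of an event is the expected runs after it, minus the expected runs before it, plus any runs that scored on the play. The expectancy numbers below are illustrative placeholders, not Lindsey's published values, and the state encoding is a hypothetical convention.

```python
# ILLUSTRATIVE placeholder values keyed by (runners, outs); NOT Lindsey's numbers.
# Runners are encoded as a 3-character string for 1st, 2nd, 3rd ("-" means empty).
RUN_EXPECTANCY = {
    ("---", 0): 0.50,
    ("1--", 0): 0.85,
    ("-2-", 1): 0.70,
    ("--3", 0): 1.30,
}


def run_value(before, after, runs_scored=0):
    """Change in expected runs caused by an event, plus runs scored on the play."""
    return RUN_EXPECTANCY[after] - RUN_EXPECTANCY[before] + runs_scored


# A triple with the bases empty and nobody out.
triple_value = run_value(("---", 0), ("--3", 0))

# A successful sacrifice bunt: runner moves from 1st to 2nd at the cost of an out.
bunt_value = run_value(("1--", 0), ("-2-", 1))

print(f"triple: {triple_value:+.2f} runs, sacrifice bunt: {bunt_value:+.2f} runs")
```

With a real table covering every base-out state, averaging these values over all occurrences of an event gives that event's linear weight.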