STATISTICS
Replacement Level
- Normal distribution bell curve
- In a normal distribution the mean, median, and mode are the same and sit at the center of the curve
- You can try to apply the normal distribution to baseball talent
- But we usually measure outcomes, not talent
- Be clear about whether you are describing talent or outcomes
- Bill James was the first to write about “replacement level” instead of just average (1984 Abstract, 1985 Abstract)
- Earlier writers argued for using average as the comparison point
- James also argues that baseball talent is not normally distributed; it is skewed.
- Keith Woolner has a great explanation of replacement level in the 2002 Baseball Prospectus Annual
- Some people talk about replacement level as “the bench player” while others talk about the “zero cost option” (free agents, 26th man, etc.). Make sure you understand which definition is being used.
- The best way to use replacement level is to set it separately for each position
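
A minimal R sketch of the per-position idea in the last bullet. The baselines, player names, and run totals are all invented for illustration, and "runs" is only a stand-in for whatever value measure is being used; this is not James's or Woolner's actual method.

```r
# Hypothetical replacement-level baselines by position (invented numbers).
replacement_level <- c(C = 55, SS = 60, "1B" = 75)

# Three hypothetical players with identical raw production.
players <- data.frame(
  name     = c("Player A", "Player B", "Player C"),
  position = c("C", "SS", "1B"),
  runs     = c(70, 70, 70),
  stringsAsFactors = FALSE
)

# Look up each player's baseline by position, then subtract it.
players$baseline <- unname(replacement_level[players$position])
players$runs_above_replacement <- players$runs - players$baseline

print(players)
```

With the same raw total, the catcher rates highest and the first baseman lowest, because each is compared against a different baseline.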
TECHNOLOGY
R and RStudio Overview
- Graphic interface, presenting graphics, exploring data, fitting statistical models
- Interpreted language, not compiled
- SQL works for data
- Console is on the left of the window and is where the coding takes place
- Environment tab is on the top right and displays your variables while moving through the code
- History tab is also on the top right. The “To Source” button sends selected code to a text editor window that can be used to save code for later use.
- Help tab is on the bottom right and is detailed and very good
- Set variables as you would in traditional programming (e.g. a = 2 + 4 gives a = 6); the conventional R assignment operator is <-, but = also works
- You can highlight code and then use the “Run” command in the Source window to run only the highlighted code
- Before you start doing work you need to set a working directory (“Set Working Directory”, or setwd() in the Console)
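
A short sketch of assignment and of setting a working directory from the Console. Here tempdir() is only a stand-in for a real project folder, which these notes do not name.

```r
# Assignment: `<-` is the conventional operator; `=` also works at the top level.
a <- 2 + 4
a                      # prints [1] 6

# Working directory: check it, change it, and change it back.
old_dir <- getwd()     # remember where we started
setwd(tempdir())       # stand-in for your own project folder
getwd()                # now reports the temporary folder
setwd(old_dir)         # restore the original directory
```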
R Tips and Tricks
- CTRL + L clears the text shown in the Console (it does not remove your variables)
- Up and Down arrow keys allow you to cycle through the different commands you have already typed into the Console, an easy way to rerun a command
R Variable Types
- Can hover over variables in the “Environment” tab to see what type of variable you have (string, number, etc.)
- Numeric
- String (called “character” in R)
- Logical (Boolean, TRUE/FALSE)
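
A quick way to check these types in the Console, as a sketch. The variable names and values are made up for illustration.

```r
avg       <- 0.312       # numeric
team      <- "Dodgers"   # character (string)
is_rookie <- FALSE       # logical

class(avg)        # "numeric"
class(team)       # "character"
class(is_rookie)  # "logical"
```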
R Data Frames
- Similar to a spreadsheet or database
- Multiple columns, each column can be of different data type
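
A small sketch of a data frame whose columns hold different types. The column names and values are hypothetical.

```r
batting <- data.frame(
  player   = c("Player A", "Player B", "Player C"),  # character column
  hits     = c(150, 120, 98),                        # numeric column
  all_star = c(TRUE, FALSE, FALSE),                  # logical column
  stringsAsFactors = FALSE
)

str(batting)            # compact overview: one line per column with its type
sapply(batting, class)  # type of each column
```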
R Console Commands
- summary(dataset_name) – returns min, 1st quartile, median, mean, 3rd quartile, and max for each numeric field in the dataset
- View(dataset_name) – loads the data set into a table view (note the capital V)
- mode(variable_name) – returns the data type (e.g. "character", "numeric", "logical")
- plot(dataset$fieldname_for_x_axis, dataset$fieldname_for_y_axis, xlab = "x axis label", ylab = "y axis label", pch = plot_symbol, col = "color of plot data points") – scatter plot of one field on the x axis and one field on the y axis; pch sets the data point symbol (a number code or a single character, e.g. diamond, circle, etc.)
- sqrt(dataset_name$fieldname) – square root
- head(dataset_name) – first six records in the dataset (by default)
- tail(dataset_name) – last six records in the dataset (by default)
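
The commands above can be tried together on a small data frame. This is only a sketch: the data frame and its numbers are invented, and View() is left out because it needs an interactive session such as RStudio.

```r
# Hypothetical totals for eight players (invented numbers).
batting <- data.frame(
  at_bats = c(400, 450, 500, 520, 550, 580, 600, 610),
  hits    = c(100, 121, 144, 130, 169, 150, 196, 160)
)

summary(batting)        # min, quartiles, median, mean, max for each column
head(batting)           # first six rows by default
tail(batting, n = 3)    # last three rows; n defaults to 6
mode(batting$hits)      # "numeric"
sqrt(batting$hits)      # square root of every value in the column

# Scatter plot of hits against at bats.
plot(batting$at_bats, batting$hits,
     xlab = "At bats", ylab = "Hits",
     pch = 18,          # 18 = filled diamond
     col = "blue")
```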
HISTORY
Allan Roth
- First full-time statistician employed by an MLB club
- Suggested tracking all kinds of split information (day/night, left/right, counts, batted ball location, etc.)
- A huge data collection driver
- In 1950 Branch Rickey went to PIT, but Roth stayed with the Dodgers.
- The 1954 LIFE article from Rickey and Roth was groundbreaking
- First time run differential was used to analyze success
- They modelled offense and defense using the formulas they built
- O – D = G
- Offense – Defense = Games Behind
- Offense = OBP + ISO + “Clutch”
- Defense = OPP BA + WALK/HBP + “Pitching Clutch” – Strike Outs – Fielding
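
A rough R sketch of the structure as written in these notes. The notes record no weights or exact definitions for the terms, so the plain sums below are an assumption, and the function names and every input value are hypothetical rather than taken from the article.

```r
# Offense and defense as plain sums of the terms listed in the notes.
offense <- function(obp, iso, clutch) {
  obp + iso + clutch
}

defense <- function(opp_ba, walk_hbp, pitching_clutch, strikeouts, fielding) {
  opp_ba + walk_hbp + pitching_clutch - strikeouts - fielding
}

# Hypothetical inputs, invented for illustration only.
o <- offense(obp = 0.340, iso = 0.150, clutch = 0.010)
d <- defense(opp_ba = 0.250, walk_hbp = 0.080, pitching_clutch = 0.010,
             strikeouts = 0.020, fielding = 0.010)

g <- o - d   # O - D = G
g
```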