Module 4


Replacement Level

  • Normal distribution (bell curve)
    • At the center of a normal distribution, the mean, median, and mode are the same
  • You can try to apply the normal distribution to baseball talent
    • But we usually measure outcomes, not talent
    • Be clear about what you’re doing
  • Bill James was the first to write about “replacement level” instead of just average (1984 Abstract, 1985 Abstract)
  • Earlier writers argued for using average as the comparison point
  • James also argues that baseball talent is not normally distributed; it is skewed.
  • Keith Woolner has a great explanation of replacement level in the 2002 Baseball Prospectus Annual
  • Some people talk about replacement level as “the bench player” while others talk about the “zero cost option” (free agents, 26th man, etc.).  Make sure you understand which definition is being used
  • The best way to use replacement level is to set it separately for each position
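
To make the difference between the two comparison points concrete, here is a minimal sketch in R. All of the numbers are invented for illustration, and the `runs`, `replacement`, and `average` columns are hypothetical names, not figures from James or Woolner. It only shows the mechanics of comparing each player to a baseline for his own position.

```r
# Hypothetical data: a run total for each player (made-up values).
players <- data.frame(
  name     = c("Player A", "Player B", "Player C"),
  position = c("SS", "1B", "SS"),
  runs     = c(70, 85, 55),
  stringsAsFactors = FALSE
)

# Hypothetical baselines, set separately for each position.
baselines <- data.frame(
  position    = c("SS", "1B"),
  replacement = c(45, 65),
  average     = c(60, 80),
  stringsAsFactors = FALSE
)

# Attach each player's positional baselines, then compare against both.
compared <- merge(players, baselines, by = "position")
compared$above_replacement <- compared$runs - compared$replacement
compared$above_average     <- compared$runs - compared$average

print(compared[, c("name", "position", "above_replacement", "above_average")])
```

In this made-up data, Player C is below average for a shortstop but still above replacement level, which is the kind of value a comparison to average alone would miss.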


R and RStudio Overview

  • Graphic interface, presenting graphics, exploring data, fitting statistical models
  • Interpreted language, not compiled
  • SQL works for data
  • Console is on the left of the window and is where the coding takes place
  • Environment tab is on the top right and displays your variables as you move through the code
  • History tab is also on the top right.  The “To Source” button sends selected code to a text editor window that can be used to save code for later use.
  • Help is on the bottom right and is very good and detailed
  • Set variables as you would in traditional programming (e.g. a = 2 + 4 sets a to 6)
  • You can highlight code and then use the “Run” command in the Source window; only the highlighted code gets run
  • Before you start doing work you need to “Set a Working Directory”
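
As a small sketch of the last few points, the lines below assign a variable and check the working directory. The path in the commented-out `setwd()` call is a placeholder, not a real folder.

```r
a <- 2 + 4   # assignment; R also accepts a = 2 + 4 at the top level
a            # typing a name on its own prints the value

getwd()      # shows the current working directory

# setwd("path/to/your/project")  # placeholder path; replace with a real folder
```

The `<-` arrow is the more common assignment style in R code, but `=` as used in the notes above gives the same result here.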

R Tips and Tricks

  • CTRL + L clears all information in the Console
  • Up and Down arrow keys allow you to cycle through the different commands you have already typed into the Console, an easy way to rerun a command

R Variable Types

  • Can hover over variables in the “Environment” tab to see what type of variable you have (string, number, etc.)
    • Numeric
    • String (called “character” in R)
    • Logical (Boolean, True/False)
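
The three types can also be checked from the Console. This is a minimal sketch with made-up variable names and values:

```r
batting_avg <- 0.312      # numeric
team        <- "Dodgers"  # string (R reports this as "character")
is_all_star <- TRUE       # logical

class(batting_avg)
class(team)
class(is_all_star)
```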

R Data Frames


  • Similar to a spreadsheet or database
  • Multiple columns, each column can be of different data type
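
A short sketch of building a data frame by hand, using invented players and values, shows how each column can hold a different type:

```r
# Hypothetical roster: one character, one numeric, and one logical column.
roster <- data.frame(
  player  = c("Player A", "Player B", "Player C"),
  hits    = c(150, 132, 98),
  starter = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

str(roster)   # lists each column with its type
roster$hits   # the $ operator pulls out a single column
```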

R Console Commands

  • summary(dataset_name) – returns min, max, median, mean, 1st quartile, 3rd quartile for each numeric field in the dataset
  • View(dataset_name) – loads the data set into a table view (note the capital V)
  • mode(variable_name) – returns the data type ("character", "numeric", "logical")
  • plot(dataset$fieldname_for_x_axis, dataset$fieldname_for_y_axis, xlab = "x axis label", ylab = "y axis label", pch = "plot data point type, e.g. diamond, circle, etc.", col = "color of plot data points") – scatter plot of one field on the x axis and one field on the y axis
  • sqrt(dataset_name$fieldname) – square root
  • head(dataset_name) – top six records in the dataset
  • tail(dataset_name) – bottom six records in the dataset
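
The commands above can be tried together on a small data frame. The `batting` data set below is invented for illustration, and `View()` is left out because it needs the RStudio table viewer.

```r
# Hypothetical data set with eight players.
batting <- data.frame(
  player  = paste("Player", 1:8),
  at_bats = c(400, 450, 500, 520, 380, 610, 575, 300),
  hits    = c(100, 120, 150, 140,  95, 190, 160,  70),
  stringsAsFactors = FALSE
)

summary(batting)        # min, quartiles, median, mean, max for numeric fields
head(batting)           # first six records
tail(batting, 3)        # last three records (the default is six)
mode(batting$hits)      # storage type of one field
sqrt(batting$at_bats)   # square root of every value in the field

# Scatter plot; pch = 18 draws filled diamonds.
plot(batting$at_bats, batting$hits,
     xlab = "At bats", ylab = "Hits",
     pch = 18, col = "blue")
```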


Allan Roth

  • First full-time statistician employed by an MLB club
  • Suggested tracking all kinds of split information (day/night, left/right, counts, batted ball location, etc.)
  • A major driver of data collection
  • In 1950 Branch Rickey went to PIT, but Roth stayed with Dodgers.
  • The 1954 LIFE article from Rickey and Roth was groundbreaking
    • First time run differential was used to analyze success
    • They modelled offense and defense using the formulas they built
  • O – D = G
    • Offense – Defense = Games Behind
    • Offense = OBP + ISO + “Clutch”
    • Defense = OPP BA + WALK/HBP + “Pitching Clutch” – Strike Outs – Fielding
