- Bill James coined the word “sabermetrics” in 1980, defining it as “the mathematical and statistical analysis of baseball records”
- The name derives from SABR, the Society for American Baseball Research
- The major dictionaries define the word slightly differently: some refer only to analysis of data, some to analysis of players, and some specifically to teams.
- Use of the word has increased dramatically since the early 2000s
- Baseball is now the most written about major American sport (in books)
- The definition started out rather narrow, but it has since grown to include “the pursuit of baseball knowledge and the activity of baseball research”
- Trying to gain a better understanding of the game, objectivity, skepticism. The “science” of baseball.
- Definition for this course: “the study of the game of baseball through observation and experiment (when applicable)” or “the scientific and objective analysis of baseball”.
- Descriptive Statistics – collecting data and then drawing conclusions about the data, only drawing conclusions about the exact data that has been collected
- Inferential Statistics – drawing conclusions about the population of data from only a sample of data (projecting player performance)
- Baseball statistics vs. the study of statistics (like the class you had in college)
- Study of Statistics
- Measures of central tendency (mean, median, mode)
- Measures of dispersion (standard deviation)
- Exploring data by using visual tools like graphs, scatterplots, correlation diagrams
- Regression to the mean
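As a small illustration of the descriptive measures listed above, the sketch below uses Python's standard `statistics` module on a short list of season home run totals. The numbers are invented for illustration and do not belong to any real player.

```python
import statistics

# Hypothetical season home run totals (invented for illustration).
home_runs = [10, 20, 20, 30, 40]

mean_hr = statistics.mean(home_runs)      # arithmetic average
median_hr = statistics.median(home_runs)  # middle value once sorted
mode_hr = statistics.mode(home_runs)      # most frequent value
spread_hr = statistics.pstdev(home_runs)  # population standard deviation

print(mean_hr, median_hr, mode_hr, round(spread_hr, 2))
```

`pstdev` treats the list as the entire population, which fits the descriptive case of drawing conclusions only about the data collected. `statistics.stdev` is the sample version, more appropriate when the list is a sample used to infer something about a larger population.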
- Sabermetrics is one specific application of “data science” and “data analysis”
- The phenomenon of “big data” applies to sabermetrics as well. Think PITCHf/x, new player tracking tools that are coming out, etc.
- “Big Data” does not necessarily produce better outcomes. The key is in applying the data, searching for meaning, studying the data, and drawing conclusions.
- OCCAM framework for big data (by Kaiser Fung, NYU)
- Observational (not really captured with a goal in mind, just observing a lot of things happening with no idea how to use the data)
- Lacking Controls (again, we’re collecting information with no real purpose in mind, so the controls in capturing the data may not be ideal for our ultimate usage of the data)
- Seemingly Complete (we really don’t have “all” the necessary data, often because the data collectors don’t even know what we are testing)
Computer Science & Technology
- Databases are run by Database Management Systems that define the structure of the database and control addition/deletion of data and its retrieval
- We use the Lahman database
- SQL is a programming language and a “standard for querying data” stored in relational databases
- Widely used in sports analytics
- We will specifically be using MySQL in this course. It is a database management system with its own “dialect” of SQL, and it has become one of the most widely used.
Lahman Baseball Database
- Hosted on the web by Sean Lahman since the mid-1990s
- Many tables in the database with relationships between them
- For example, the “Batting” table holds information about player batting seasons (Barry Bonds, 2005 season), but these also relate to Teams, Awards, Player Salaries tables, etc.
- Each table is defined by the list of fields (columns) in the table
- Each table has records (rows) that represent the data
- Each table has a set of “keys”, or unique values, that distinguish each record from every other.
- Database Model or “schema” is made up of all the tables, relationships, definitions, and fields. Records are not part of the schema.
- The “stint” field in the Batting table increases sequentially when a player plays for a second or third team in a given season.
- Documentation of the Lahman database is here
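To make fields, records, keys, and the “stint” field concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module rather than MySQL, so it runs without a database server. The table is a heavily trimmed, hypothetical version of Batting. The rows are invented placeholders, not real Lahman data, and the composite key shown is an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Schema: the table definition (fields and key), not the records themselves.
conn.execute("""
    CREATE TABLE Batting (
        playerID TEXT,
        yearID   INTEGER,
        stint    INTEGER,
        teamID   TEXT,
        HR       INTEGER,
        PRIMARY KEY (playerID, yearID, stint)
    )
""")

# Records: one row per player, season, and stint (placeholder values).
rows = [
    ("player01", 2013, 1, "SEA", 5),
    ("player01", 2013, 2, "NYA", 3),  # same season, second team -> stint 2
]
conn.executemany("INSERT INTO Batting VALUES (?, ?, ?, ?, ?)", rows)

# The key keeps records distinct: inserting a repeated key is rejected.
try:
    conn.execute("INSERT INTO Batting VALUES ('player01', 2013, 1, 'SEA', 9)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

record_count = conn.execute("SELECT COUNT(*) FROM Batting").fetchone()[0]
print(duplicate_rejected, record_count)
```

Without the stint column in the key, the two rows for the same player and season could not be told apart, which is the job the notes assign to keys.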
- All queries should be in “SELECT>FROM>WHERE” format. You’re selecting data from a table and giving conditions about what data specifically you want to pull into your query.
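- Table names are case-sensitive (in MySQL this depends on the server's operating system and settings, so match the documented capitalization exactly)
- Use single quotation marks around character strings in your queries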
- SELECT * means to select all fields
- WHERE is the filter you are putting on the query (e.g. “WHERE teamID = ‘SEA’;” returns only SEA players and filters out all others)
- To filter further, combine conditions with the “AND” operator (e.g. “WHERE teamID = ‘SEA’ AND yearID = 2013”)
- The “SHOW TABLES” command lists all tables in the current database
- “DESCRIBE tablename” then lists all fields in the specified table
- Syntax for selecting multiple fields – SELECT field1, field2, field3
- You can do math with fields by adding a new field to your query using this syntax: SELECT field1, field2, field3, field2 + field3 as field4
- Can also do SELECT *, field2 / field3 as field4
- To sort the results, add ORDER BY at the end: SELECT>FROM>WHERE>ORDER BY (fieldname ASC for ascending or DESC for descending)
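The query patterns above can be tried end to end with a small sketch. It uses Python's `sqlite3` module as a stand-in for MySQL (an assumption made so the example is self-contained) and a tiny hypothetical Batting table whose rows are invented. H and AB are assumed here to stand for hits and at-bats, and `battingAvg` is an illustrative name for the computed field.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Batting (playerID TEXT, yearID INTEGER, teamID TEXT, "
    "AB INTEGER, H INTEGER)"
)
conn.executemany(
    "INSERT INTO Batting VALUES (?, ?, ?, ?, ?)",
    [
        ("player01", 2013, "SEA", 400, 100),
        ("player02", 2013, "SEA", 500, 150),
        ("player03", 2012, "SEA", 300, 90),
        ("player04", 2013, "BOS", 450, 135),
    ],
)

# SELECT > FROM > WHERE > ORDER BY, with AND narrowing the filter and
# a computed field (H divided by AB) named with AS.
query = """
    SELECT playerID, H, AB, 1.0 * H / AB AS battingAvg
    FROM Batting
    WHERE teamID = 'SEA' AND yearID = 2013
    ORDER BY battingAvg DESC
"""
results = conn.execute(query).fetchall()
for row in results:
    print(row)
```

The `1.0 *` is there only because SQLite truncates when dividing two integers. MySQL's `/` operator already returns a decimal result, so the plain `field2 / field3` form from the notes works there.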
Henry Chadwick
- “Father of Baseball”
- Started out as a cricket writer
- Was a scorekeeper and recorded rules
- Elected to the Hall of Fame as a writer, inventor of the box score, author of the first rule book, and chairman of the rules committee
- First sabermetrician. We might not have statistics in the way we do now without him and his efforts to begin recording data and documenting the events that happened on the field.
Dave Cameron, FanGraphs
- Sabermetrics is more about ideas, understanding of the game, and pursuit of an even better understanding
- Part of the science is being able to take complicated data and translate it into English for others to be able to understand and digest
- Just start writing. Publish every day. Build a portfolio. Create things, even if they’re wrong and they break down. This advances you.
Jonah Keri, Grantland
- The slowness of adoption in the Major Leagues was initially just a survival instinct. “Leave me alone with your new ideas and technologies”.
- But that won’t be long-lived when teams can actually gain a demonstrable edge by applying these ideas
- We’re seemingly still in the early stages, especially when you think about the boon of data we’re hopefully going to get from the new MLBAM tracking information about batted balls, defensive positioning, etc.