Lahman Baseball Database

Source

An overview of the database is provided here:

The front page of Sean Lahman’s website has links for the csv files:

Description of the Data

The data set contains many tables about baseball performance statistics, back to 1871.

We build this data into a sqlite database, as described below.

Transformations to the original data source

Then we create a new sqlite database:

sqlite3 lahman.db

Inside sqlite3, we do the following sql import statements, with many thanks to WebucatorTraining for this script, which we modified:

Remove all of the INSERT INTO lines.

Remove the header and everything down through the DROP TABLE lines.

Remove the tables created by WEBUCATOR too.

Remove all of the occurences of ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;

Add a semicolon at the end of each create statement.

Remove all of the lines:

ID INT NOT NULL AUTO_INCREMENT, /* ADDED BY WEBUCATOR */

PRIMARY KEY, FOREIGN KEY, UNIQUE KEY

Each 2B and 3B needs to be in single quotes

remove these two columns from homegames: spanfirst_date date DEFAULT NULL, spanlast_date date DEFAULT NULL

cat Batting.csv | cut -d, -f1-23 >BattingNEW.csv cat Parks.csv | cut -d, -f2-7 >ParksNEW.csv cat People.csv | cut -d, -f2-25 >PeopleNEW.csv

need to remove extra fields from these lines:

brklyncuny,Brooklyn College, The City University of New York,Brooklyn,NY,USA
california,University of California, Berkeley,Berkeley,CA,USA
calpoly,California Polytechnic State University, San Luis Obispo,San Luis Obispo,CA,USA
calpolypom,California Polytechnic State University, Pomona,Pomona,CA,USA
cwpost,Long Island University, C W Post Campus,Brookville,NY,USA
fairldick,Fairleigh Dickinson University, Metropolitan Campus,Teaneck,NJ,USA
flmanat,State College of Florida, Manatee-Sarasota,Bradenton,FL,USA
flmdkjc,Miami-Dade College, Kendall Campus,Miami,FL,USA
flmdnjc,Miami-Dade College, North Campus,Miami,FL,USA
flmdwjc,Miami-Dade College, Wolfson Campus,Miami,FL,USA
huntercuny,Hunter College, The City University of New York,New York,NY,USA
iaseccs,Southeastern Iowa Community College, Keokuk Campus,Keokuk,IA,USA
longisland,Long Island University, Brooklyn Campus,Brooklyn,NY,USA
mankatost,Minnesota State University, Mankato,Mankato,MN,USA
mdccbcc,Community College of Baltimore County, Catonsville Campus,Catonsville,MD,USA
mdccbce,Community College of Baltimore County, Essex Campus,Essex,MD,USA
mdmontg,Montgomery College, Rockville Campus,Rockville,MD,USA
meramec,St. Louis Community College, Meramec Campus,Kirkwood,MO,USA
mnmoorh,Minnesota State University, Moorhead,Moorhead,MN,USA
mnmorris,University of Minnesota, Morris,Morris,MN,USA
mofores,St. Louis Community College, Forest Park Campus,St. Louis,MO,USA
paalleb,Community College of Allegheny County, Boyce Campus,Monroeville,PA,USA
paalles,Community College of Allegheny County, South Campus,West Mifflin,PA,USA
psubehrend,Penn State Erie, the Behrend College of Pennsylvania State University,Erie,PA,USA
rutgers,Rutgers, the State University of New Jersey,New Brunswick,NJ,USA
sunybuff,University at Buffalo, State University of New York,Buffalo,NY,USA
txsjjcn,San Jacinto College, North Campus,Houston,TX,USA
ucdavis,University of California, Davis,Davis,CA,USA
ucirvine,University of California, Irvine,Irvine,CA,USA
ucla,University of California, Los Angeles,Los Angeles,CA,USA
ucriver,University of California, Riverside,Riverside,CA,USA
ucsb,University of California, Santa Barbara,Santa Barbara,CA,USA
ucsd,University of California, San Diego,La Jolla,CA,USA
umbc,University of Maryland, Baltimore County,Baltimore,MD,USA

The People table needs to look like this:

CREATE TABLE people (
  playerID varchar(9) NOT NULL,
  birthYear int(11) DEFAULT NULL,
  birthMonth int(11) DEFAULT NULL,
  birthDay int(11) DEFAULT NULL,
  birthCity varchar(255) DEFAULT NULL,
  birthCountry varchar(255) DEFAULT NULL,
  birthState varchar(255) DEFAULT NULL,
  deathYear int(11) DEFAULT NULL,
  deathMonth int(11) DEFAULT NULL,
  deathDay int(11) DEFAULT NULL,
  deathCountry varchar(255) DEFAULT NULL,
  deathState varchar(255) DEFAULT NULL,
  deathCity varchar(255) DEFAULT NULL,
  nameFirst varchar(255) DEFAULT NULL,
  nameLast varchar(255) DEFAULT NULL,
  nameGiven varchar(255) DEFAULT NULL,
  weight int(11) DEFAULT NULL,
  height int(11) DEFAULT NULL,
  bats varchar(255) DEFAULT NULL,
  throws varchar(255) DEFAULT NULL,
  debut varchar(255) DEFAULT NULL,
  bbrefID varchar(255) DEFAULT NULL,
  finalGame varchar(255) DEFAULT NULL,
  retroID varchar(255) DEFAULT NULL
);

When we are SURE that the csv files are ready to import, we need to run:

sed -i '1d' *.csv

so that we remove the first line of each file

.read mdw-updated-lahman-mysql.sql
.separator ,
.import AwardsManagers.csv AwardsManagers
.import AwardsPlayers.csv AwardsPlayers
.import AwardsShareManagers.csv AwardsShareManagers
.import AwardsSharePlayers.csv AwardsSharePlayers
.import CollegePlaying.csv CollegePlaying
.import HallOfFame.csv HallOfFame
.import Salaries.csv Salaries
.import SchoolsNEW.csv Schools
.import AllstarFull.csv AllstarFull
.import Appearances.csv Appearances
.import BattingNEW.csv Batting
.import BattingPost.csv BattingPost
.import Fielding.csv Fielding
.import FieldingOF.csv FieldingOF
.import FieldingOFsplit.csv FieldingOFsplit
.import FieldingPost.csv FieldingPost
.import HomeGames.csv HomeGames
.import Managers.csv Managers
.import ManagersHalf.csv ManagersHalf
.import ParksNEW.csv Parks
.import PeopleNEW.csv People
.import Pitching.csv Pitching
.import PitchingPost.csv PitchingPost
.import SeriesPost.csv SeriesPost
.import Teams.csv Teams
.import TeamsFranchises.csv TeamsFranchises
.import TeamsHalf.csv TeamsHalf

then type Control-D to quit sqlite3

back in the bash shell, type:

rm *.csv
rm mdw-updated-lahman-mysql.sql