Lahman Baseball Database
Source
An overview of the database is provided here:
The front page of Sean Lahman’s website has links for the csv files:
Description of the Data
The data set contains many tables about baseball performance statistics, back to 1871.
We build this data into a sqlite database, as described below.
Transformations to the original data source
Then we create a new sqlite database:
sqlite3 lahman.db
Inside sqlite3, we do the following sql import statements, with many thanks to WebucatorTraining for this script, which we modified:
Remove all of the INSERT INTO lines.
Remove the header and everything down through the DROP TABLE lines.
Remove the tables created by WEBUCATOR too.
Remove all of the occurences of ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;
Add a semicolon at the end of each create statement.
Remove all of the lines:
ID INT NOT NULL AUTO_INCREMENT, /* ADDED BY WEBUCATOR */
PRIMARY KEY, FOREIGN KEY, UNIQUE KEY
Each 2B and 3B needs to be in single quotes
remove these two columns from homegames: spanfirst_date date DEFAULT NULL, spanlast_date date DEFAULT NULL
cat Batting.csv | cut -d, -f1-23 >BattingNEW.csv cat Parks.csv | cut -d, -f2-7 >ParksNEW.csv cat People.csv | cut -d, -f2-25 >PeopleNEW.csv
need to remove extra fields from these lines:
brklyncuny,Brooklyn College, The City University of New York,Brooklyn,NY,USA california,University of California, Berkeley,Berkeley,CA,USA calpoly,California Polytechnic State University, San Luis Obispo,San Luis Obispo,CA,USA calpolypom,California Polytechnic State University, Pomona,Pomona,CA,USA cwpost,Long Island University, C W Post Campus,Brookville,NY,USA fairldick,Fairleigh Dickinson University, Metropolitan Campus,Teaneck,NJ,USA flmanat,State College of Florida, Manatee-Sarasota,Bradenton,FL,USA flmdkjc,Miami-Dade College, Kendall Campus,Miami,FL,USA flmdnjc,Miami-Dade College, North Campus,Miami,FL,USA flmdwjc,Miami-Dade College, Wolfson Campus,Miami,FL,USA huntercuny,Hunter College, The City University of New York,New York,NY,USA iaseccs,Southeastern Iowa Community College, Keokuk Campus,Keokuk,IA,USA longisland,Long Island University, Brooklyn Campus,Brooklyn,NY,USA mankatost,Minnesota State University, Mankato,Mankato,MN,USA mdccbcc,Community College of Baltimore County, Catonsville Campus,Catonsville,MD,USA mdccbce,Community College of Baltimore County, Essex Campus,Essex,MD,USA mdmontg,Montgomery College, Rockville Campus,Rockville,MD,USA meramec,St. Louis Community College, Meramec Campus,Kirkwood,MO,USA mnmoorh,Minnesota State University, Moorhead,Moorhead,MN,USA mnmorris,University of Minnesota, Morris,Morris,MN,USA mofores,St. Louis Community College, Forest Park Campus,St. Louis,MO,USA paalleb,Community College of Allegheny County, Boyce Campus,Monroeville,PA,USA paalles,Community College of Allegheny County, South Campus,West Mifflin,PA,USA psubehrend,Penn State Erie, the Behrend College of Pennsylvania State University,Erie,PA,USA rutgers,Rutgers, the State University of New Jersey,New Brunswick,NJ,USA sunybuff,University at Buffalo, State University of New York,Buffalo,NY,USA txsjjcn,San Jacinto College, North Campus,Houston,TX,USA ucdavis,University of California, Davis,Davis,CA,USA ucirvine,University of California, Irvine,Irvine,CA,USA ucla,University of California, Los Angeles,Los Angeles,CA,USA ucriver,University of California, Riverside,Riverside,CA,USA ucsb,University of California, Santa Barbara,Santa Barbara,CA,USA ucsd,University of California, San Diego,La Jolla,CA,USA umbc,University of Maryland, Baltimore County,Baltimore,MD,USA
The People table needs to look like this:
CREATE TABLE people ( playerID varchar(9) NOT NULL, birthYear int(11) DEFAULT NULL, birthMonth int(11) DEFAULT NULL, birthDay int(11) DEFAULT NULL, birthCity varchar(255) DEFAULT NULL, birthCountry varchar(255) DEFAULT NULL, birthState varchar(255) DEFAULT NULL, deathYear int(11) DEFAULT NULL, deathMonth int(11) DEFAULT NULL, deathDay int(11) DEFAULT NULL, deathCountry varchar(255) DEFAULT NULL, deathState varchar(255) DEFAULT NULL, deathCity varchar(255) DEFAULT NULL, nameFirst varchar(255) DEFAULT NULL, nameLast varchar(255) DEFAULT NULL, nameGiven varchar(255) DEFAULT NULL, weight int(11) DEFAULT NULL, height int(11) DEFAULT NULL, bats varchar(255) DEFAULT NULL, throws varchar(255) DEFAULT NULL, debut varchar(255) DEFAULT NULL, bbrefID varchar(255) DEFAULT NULL, finalGame varchar(255) DEFAULT NULL, retroID varchar(255) DEFAULT NULL );
When we are SURE that the csv files are ready to import, we need to run:
sed -i '1d' *.csv
so that we remove the first line of each file
.read mdw-updated-lahman-mysql.sql
.separator ,
.import AwardsManagers.csv AwardsManagers
.import AwardsPlayers.csv AwardsPlayers
.import AwardsShareManagers.csv AwardsShareManagers
.import AwardsSharePlayers.csv AwardsSharePlayers
.import CollegePlaying.csv CollegePlaying
.import HallOfFame.csv HallOfFame
.import Salaries.csv Salaries
.import SchoolsNEW.csv Schools
.import AllstarFull.csv AllstarFull
.import Appearances.csv Appearances
.import BattingNEW.csv Batting
.import BattingPost.csv BattingPost
.import Fielding.csv Fielding
.import FieldingOF.csv FieldingOF
.import FieldingOFsplit.csv FieldingOFsplit
.import FieldingPost.csv FieldingPost
.import HomeGames.csv HomeGames
.import Managers.csv Managers
.import ManagersHalf.csv ManagersHalf
.import ParksNEW.csv Parks
.import PeopleNEW.csv People
.import Pitching.csv Pitching
.import PitchingPost.csv PitchingPost
.import SeriesPost.csv SeriesPost
.import Teams.csv Teams
.import TeamsFranchises.csv TeamsFranchises
.import TeamsHalf.csv TeamsHalf
then type Control-D to quit sqlite3
back in the bash shell, type:
rm *.csv
rm mdw-updated-lahman-mysql.sql