Kenny Shirley
NYC Sports Analytics Meetup, August 19, 2014

Built for machines; safe, secure, and habitable by humans.

Joint work with:
Carlos Scheidegger (University of Arizona)
 
and
Carson Sievert (Iowa State University)


Jeff Cirillo, 3B, 1994 - 2007: .296 BA, 112 HR, 32 WAR, 2 All-Star Teams, 2013 HOF ballot

Jeromy Burnitz, OF, 1993 - 2006: .253 BA, 315 HR, 17.4 WAR, 1 All-Star Team, 2012 HOF ballot

Dan Plesac, P, 1986 - 2003: 65 - 71 Win-Loss, 3.64 ERA, 17.2 WAR, 3 All-Star Teams, 2009 HOF ballot
##               Name Year  WAR Votes NumBallots
## 1     Jacque Jones 2014 11.5     1        571
## 2      David Segui 2010  7.8     1        539
## 3   Shawon Dunston 2008  9.1     1        543
## 4       Walt Weiss 2006 14.6     1        520
## 5      Randy Myers 2004 14.2     1        506
## 6    Cecil Fielder 2004 14.7     1        506
## 7       Mark Davis 2003  6.8     1        496
## 8     Jim Deshaies 2001 10.2     1        515
## 9  Steve Bedrosian 2001 13.2     1        515
## 10      Ray Knight 1994 10.9     1        456##   Year              Name Pos NumBallots Votes Percentage
## 1 1936           Ty Cobb  OF        226   222     98.20%
## 2 1936      Honus Wagner  SS        226   215     95.10%
## 3 1936         Babe Ruth  OF        226   215     95.10%
## 4 1936 Christy Mathewson   P        226   205     90.70%
## 5 1936    Walter Johnson   P        226   189     83.60%Does it really take 15 years to decide?
The so-called 'morals' clause, rule 5 out of 9:
Voting: Voting shall be based upon the player's record, playing ability, integrity, sportsmanship, character, and contributions to the team(s) on which the player played.
(from http://baseballhall.org/hall-famers/rules-election/BBWAA)
The voters don't actively cover baseball!
Q: Does that mean some Hall of Fame voters don’t even cover baseball any more?
A: Yes. The BBWAA trusts that its voters take their responsibility seriously, and even those honorary members who are no longer covering baseball do their due diligence to produce a thoughtful ballot.
(from http://bbwaa.com/voting-faq/)
Stupid things are called out as stupid, and people try to figure out how to correct them. It's encouraging. It's kinda nice. And then there is Baseball Hall of Fame voting.

We were really interested in the trajectories of voting percentages of players who had appeared on the ballot multiple times.
dat <- read.csv(file="HOFregression_updated.csv", as.is=TRUE)
par(mfrow=c(1, 2))
sel <- dat[, "Name"] == "Alan Trammell"
plot(dat[sel, "Year"], dat[sel, "p"], ylim=c(0, 1), las=1, pch=19, xlab="Year", 
     ylab="Voting Proportion")
lines(dat[sel, "Year"], dat[sel, "p"])
title(main="Alan Trammell")
abline(h = 0.05, col=2, lwd=2)
abline(h = 0.75, col=3, lwd=2)
sel <- dat[, "Name"] == "Bert Blyleven"
plot(dat[sel, "Year"], dat[sel, "p"], ylim=c(0, 1), las=1, pch=19, xlab="Year", 
     ylab="Voting Proportion")
lines(dat[sel, "Year"], dat[sel, "p"])
title(main="Bert Blyleven")
abline(h = 0.05, col=2, lwd=2)
abline(h = 0.75, col=3, lwd=2)We were really interested in the trajectories of voting percentages of players who had appeared on the ballot multiple times.

How did these guys end up with such different voting trajectories?
[1] For Batting Average, 2.5 points for each season over .300, 5.0 for over .350, 15 for over .400. Seasons are not double-counted. I require 100 games in a season to qualify for this bonus. [2] For hits, 5 points for each season of 200 or more hits. [3] 3 points for each season of 100 RBI's and 3 points for each season of 100 runs. [4] 10 points for 50 home runs, 4 points for 40 HR, and 2 points for 30 HR. [5] 2 points for 45 doubles and 1 point for 35 doubles. [6] 8 points for each MVP award and 3 for each AllStar Game, and 1 point for a Rookie of the Year award. [7] 2 points for a gold glove at C, SS, or 2B, and 1 point for any other gold glove. [8] 6 points if they were the regular SS or C on a WS winning team, 5 points for 2B or CF, 3 for 3B, 2 for LF or RF, and 1 for 1B. I don't have the OF distribution, so I give 3 points for OF (requires at least 82 games as the position). ... [19] ...
# Group 1: batters
var.names[[1]] <- c("Yrs", "G", "AB", "R", "H", "HR", "RBI", "SB", "BB",
                    "BA", "OBP", "SLG",
                    "posC", "pos1B", "pos2B", "pos3B", "posSS", "posLF", "posCF", "posRF")
# Group 2: pitchers
var.names[[2]] <- c("Yrs", "W", "L", "G", "GS", "SV", "IP", "H", "HR", "BB", "SO",
                    "ERA", "WHIP")
# Group 3: returning players
# Just use the previous year's voting percentage as the sole predictor
var.names[[3]] <- c("prev1")
# Fit the model using weak priors:
fit <- bayesglm(data[sel, "p"] ~ X.scale, weights=data[sel, "NumBallots"], 
                family=binomial(link = "logit"), 
                prior.mean=0, prior.scale=2.5)Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -4.95399    0.05830 -84.972  < 2e-16 ***
X.scaleYrs    0.50693    0.05896   8.597  < 2e-16 ***
X.scaleG      1.02455    0.19500   5.254 1.49e-07 ***
X.scaleAB    -3.63447    0.48990  -7.419 1.18e-13 ***
X.scaleR      2.27816    0.14228  16.012  < 2e-16 ***
X.scaleH      3.09098    0.51523   5.999 1.98e-09 ***
X.scaleHR     1.02895    0.11611   8.862  < 2e-16 ***
X.scaleRBI   -0.96718    0.11939  -8.101 5.44e-16 ***
X.scaleSB     0.05451    0.02301   2.370   0.0178 *  
X.scaleBB     0.11784    0.10958   1.075   0.2822    
X.scaleBA     0.36248    0.14991   2.418   0.0156 *  
X.scaleOBP   -0.87497    0.12853  -6.807 9.93e-12 ***
X.scaleSLG    0.66728    0.12253   5.446 5.15e-08 ***
X.scaleposC   1.23696    0.08342  14.828  < 2e-16 ***
X.scalepos1B  0.62907    0.08655   7.268 3.65e-13 ***
X.scalepos2B  0.69809    0.07841   8.903  < 2e-16 ***
X.scalepos3B  0.54610    0.07735   7.060 1.66e-12 ***
X.scaleposSS  0.98036    0.07683  12.759  < 2e-16 ***
X.scaleposLF  0.40763    0.08836   4.613 3.97e-06 ***
X.scaleposCF -0.01915    0.08636  -0.222   0.8245    
X.scaleposRF  0.49648    0.08293   5.987 2.14e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
             Name Prediction Actual
1     Dave Parker       39.4   17.5
2    Dwight Evans       51.8    5.9
3     Ken Griffey       12.7    4.7
4 Garry Templeton        5.3    0.4
5   Terry Kennedy        0.5    0.2
6      Terry Puhl        0.2    0.2
for (year in 1997:2014) {
  for (group in c("batters", "pitchers", "returning")) {
    train <- Year < year & Group == group
    test <- Year == year & Group == group
    historical.fit <- glm(y[train] ~ data[train, ])
    predict <- predict(historical.fit, newdata=data[test, ])
  }
}| Group | Baseline | ||
|---|---|---|---|
| First-ballot Batters (n = 151) | 18.4% | ||
| First-ballot Pitchers (n = 85) | 9.7% | ||
| Returning Players (n = 262) | 5.7% | ||
| Overall | 11.7% | 
              Name Previous Predicted
1     Craig Biggio     68.2      77.0
2      Jack Morris     67.7      76.5
3     Jeff Bagwell     59.6      67.5
4      Mike Piazza     57.8      65.3
5       Tim Raines     52.2      58.0
6        Lee Smith     47.8      52.0
7   Curt Schilling     38.8      39.7
8     Frank Thomas      0.0      39.2
9        Jeff Kent      0.0      38.7
10   Roger Clemens     37.6      38.1
11     Greg Maddux      0.0      37.5
12     Barry Bonds     36.2      36.3
13  Edgar Martinez     35.9      35.9
14   Alan Trammell     33.6      33.0
15   Luis Gonzalez      0.0      23.0
16    Larry Walker     21.6      20.3
17    Fred McGriff     20.7      19.5
18    Mark McGwire     16.9      16.4
19    Mike Mussina      0.0      16.3
20     Tom Glavine      0.0      15.1
21   Don Mattingly     13.2      13.8
22      Sammy Sosa     12.5      13.3
23 Rafael Palmeiro      8.8      11.1
24     Moises Alou      0.0      10.5
25      Ray Durham      0.0       7.1
26 Armando Benitez      0.0       3.3
27      Sean Casey      0.0       0.9
28      Eric Gagne      0.0       0.8
29   Richie Sexson      0.0       0.6
30    Paul Lo Duca      0.0       0.5
31       J.T. Snow      0.0       0.4
32    Kenny Rogers      0.0       0.4
33      Hideo Nomo      0.0       0.2
34    Jacque Jones      0.0       0.1
35      Todd Jones      0.0       0.1
36     Mike Timlin      0.0       0.1

Year Name Actual Predicted Residual 1 2002 Ozzie Smith 91.7 9.3 82.4 2 2001 Kirby Puckett 82.1 2.8 79.3 3 2005 Wade Boggs 91.9 57.1 34.8 4 2004 Paul Molitor 85.2 56.0 29.2 5 2010 Edgar Martinez 36.2 7.6 28.6
1 2011 Rafael Palmeiro 11.0 93.8 -82.8 2 2013 Barry Bonds 36.2 99.5 -63.3 3 2013 Roger Clemens 37.6 99.5 -61.9 4 2010 Fred McGriff 21.5 73.9 -52.4 5 2013 Julio Franco 1.1 52.3 -51.3
Don't google image search "Barry Bonds before vs. after" unless you have half an hour to kill...

Don't google image search "Barry Bonds before vs. after" unless you have half an hour to kill...

Don't google image search "Barry Bonds before vs. after" unless you have half an hour to kill...

People have really put a lot of work into this sort of comparison
Related searches: McGwire, Clemens, Sosa.
Coefficients:
                   Estimate Std. Error  z value Pr(>|z|)    
(Intercept)        -5.06179    0.04106 -123.281  < 2e-16 ***
X.scaleYrs          0.45714    0.03482   13.127  < 2e-16 ***
X.scaleG            0.04859    0.14207    0.342 0.732324    
X.scaleAB           1.19308    0.35354    3.375 0.000739 ***
X.scaleR            0.77132    0.08730    8.835  < 2e-16 ***
X.scaleH           -0.45612    0.33860   -1.347 0.177963    
X.scaleHR           0.23513    0.07811    3.010 0.002611 ** 
X.scaleRBI         -0.25777    0.07589   -3.397 0.000682 ***
X.scaleSB           0.05965    0.01926    3.098 0.001950 ** 
X.scaleBB           0.19772    0.07434    2.660 0.007821 ** 
X.scaleBA           0.70078    0.09828    7.130 1.00e-12 ***
X.scaleOBP         -0.34334    0.09172   -3.743 0.000182 ***
X.scaleSLG          0.44604    0.08467    5.268 1.38e-07 ***
X.scaleposC         0.15346    0.02416    6.351 2.14e-10 ***
X.scalepos1B        0.12147    0.02200    5.523 3.34e-08 ***
X.scalepos2B       -0.11253    0.02412   -4.665 3.09e-06 ***
X.scalepos3B       -0.05741    0.02361   -2.431 0.015055 *  
X.scaleposSS        0.10689    0.02301    4.646 3.38e-06 ***
X.scaleposLF        0.03260    0.02365    1.379 0.168038    
X.scaleposCF       -0.20443    0.02516   -8.127 4.41e-16 ***
X.scaleposRF       -0.17331    0.02406   -7.203 5.87e-13 ***
X.scaledrugs       -0.91577    0.02574  -35.583  < 2e-16 ***
X.scaleAllStarpy    1.12873    0.01691   66.752  < 2e-16 ***
X.scalegold.gloves  0.20908    0.01136   18.411  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Nice -- the z-scores for the three new predictors are highly significant, and have signs that we expected.
| Group | Baseline | Awards + Drugs | |
|---|---|---|---|
| First-ballot Batters (n = 151) | 18.4% | 15.2% | |
| First-ballot Pitchers (n = 85) | 9.7% | 8.6% | |
| Returning Players (n = 262) | 5.7% | 5.7% | |
| Overall | 11.7% | 10.0% | 
              Name Previous Predicted
1      Greg Maddux      0.0      96.0
2     Craig Biggio     68.2      77.0
3      Jack Morris     67.7      76.5
4     Jeff Bagwell     59.6      67.5
5      Mike Piazza     57.8      65.3
6       Tim Raines     52.2      58.0
7     Mike Mussina      0.0      57.7
8        Lee Smith     47.8      52.0
9     Frank Thomas      0.0      51.0
10     Tom Glavine      0.0      50.5
11  Curt Schilling     38.8      39.7
12   Roger Clemens     37.6      38.1
13     Barry Bonds     36.2      36.3
14  Edgar Martinez     35.9      35.9
15   Alan Trammell     33.6      33.0
16   Luis Gonzalez      0.0      20.9
17    Larry Walker     21.6      20.3
18    Fred McGriff     20.7      19.5
19    Mark McGwire     16.9      16.4
20   Don Mattingly     13.2      13.8
21      Sammy Sosa     12.5      13.3
22     Moises Alou      0.0      13.1
23 Rafael Palmeiro      8.8      11.1
24       Jeff Kent      0.0      11.0
 Whew. At least now the results are plausible. Maddux is a lock, and Glavine and Thomas are above 50% (just barely).
 Whew. At least now the results are plausible. Maddux is a lock, and Glavine and Thomas are above 50% (just barely).
Year Name Actual Predicted Residual 1 2001 Kirby Puckett 82.1 12.0 70.1 2 1999 Robin Yount 77.5 8.6 68.9 3 1999 George Brett 98.2 54.3 43.9 4 2004 Paul Molitor 85.2 50.3 34.9 5 2005 Wade Boggs 91.9 60.4 31.4
Year Name Actual Predicted Residual 1 2013 Barry Bonds 36.2 99.3 -63.1 2 2013 Roger Clemens 37.6 97.9 -60.3 3 2008 Tim Raines 24.3 81.7 -57.4 4 2007 Jose Canseco 1.1 38.6 -37.5 5 2007 Mark McGwire 23.5 58.7 -35.2
Year Name Actual Predicted Residual 1 2001 Kirby Puckett 82.1 12.0 70.1 2 1999 Robin Yount 77.5 8.6 68.9 3 1999 George Brett 98.2 54.3 43.9 4 2004 Paul Molitor 85.2 50.3 34.9 5 2005 Wade Boggs 91.9 60.4 31.4
Year Name Actual Predicted Residual 1 2013 Barry Bonds 36.2 99.3 -63.1 2 2013 Roger Clemens 37.6 97.9 -60.3 3 2008 Tim Raines 24.3 81.7 -57.4 4 2007 Jose Canseco 1.1 38.6 -37.5 5 2007 Mark McGwire 23.5 58.7 -35.2
Coefficients:
                      Estimate Std. Error  z value Pr(>|z|)    
(Intercept)              -1.104016   0.004748 -232.535  < 2e-16 ***
X.scaleprev1              1.456224   0.018842   77.288  < 2e-16 ***
X.scaleprev1.squared     -0.419837   0.017857  -23.510  < 2e-16 ***
X.scaletop3              -0.189887   0.004369  -43.460  < 2e-16 ***
X.scalereturn            -0.016023   0.004591   -3.490 0.000482 ***
X.scaleballot2ndyear     -0.071682   0.009175   -7.813 5.60e-15 ***
X.scaleballotfinal        0.026895   0.004286    6.276 3.48e-10 ***
X.scaleballot2nd.x.prev1  0.091116   0.008260   11.031  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
| Group | Baseline | Awards + Drugs | Milestones, One-team, and 'Returning' | 
|---|---|---|---|
| First-ballot Batters (n = 151) | 18.4% | 15.2% | 13.5% | 
| First-ballot Pitchers (n = 85) | 9.7% | 8.6% | 9.6% | 
| Returning Players (n = 262) | 5.7% | 5.7% | 4.6% | 
| Overall | 11.7% | 10.0% | 9.1% | 
Well, it's not great that the 1st-ballot pitchers RMSE went up (adding milestones, or rookie of the year, or something made the model perform worse for them), but we'll keep the effect in.
          Name Previous Predicted
1      Greg Maddux      0.0      97.4
2     Craig Biggio     68.2      73.7
3      Jack Morris     67.7      70.2
4     Frank Thomas      0.0      64.2
5      Mike Piazza     57.8      63.7
6     Jeff Bagwell     59.6      60.4
7      Tom Glavine      0.0      53.5
8       Tim Raines     52.2      52.7
9        Lee Smith     47.8      47.7
10    Mike Mussina      0.0      42.4
11  Curt Schilling     38.8      38.5
12   Roger Clemens     37.6      36.8
13     Barry Bonds     36.2      34.8
14  Edgar Martinez     35.9      33.1
15   Alan Trammell     33.6      30.4
16    Larry Walker     21.6      17.5
17    Fred McGriff     20.7      16.7
18    Mark McGwire     16.9      13.5
19   Don Mattingly     13.2      10.8
20   Luis Gonzalez      0.0       9.6
21      Sammy Sosa     12.5       9.4
22     Moises Alou      0.0       8.7
23 Rafael Palmeiro      8.8       8.2
24       Jeff Kent      0.0       5.1

                Name Previous Predicted
1      Randy Johnson      0.0      99.8
2     Pedro Martinez      0.0      93.5
3        John Smoltz      0.0      72.4
4       Craig Biggio     74.8      69.2
5        Mike Piazza     62.2      59.2
6       Jeff Bagwell     54.3      51.3
7         Tim Raines     46.1      41.9
8      Roger Clemens     35.4      29.3
9        Barry Bonds     34.7      28.5
10         Lee Smith     29.9      23.4
11    Curt Schilling     29.2      22.6
12    Edgar Martinez     25.2      18.7
13     Alan Trammell     20.8      14.9
14      Mike Mussina     20.3      13.8
15         Jeff Kent     15.2       9.9
16      Fred McGriff     11.7       8.7
17      Mark McGwire     11.0       8.3
18      Larry Walker     10.2       7.8
19     Don Mattingly      8.2       7.8
20 Nomar Garciaparra      0.0       7.8
21    Gary Sheffield      0.0       7.6
22        Sammy Sosa      7.2       6.4
23     Troy Percival      0.0       5.3
24    Carlos Delgado      0.0       1.9
                Name Previous Predicted
1      Randy Johnson      0.0      99.8
2     Pedro Martinez      0.0      93.5
3        John Smoltz      0.0      72.4
4       Craig Biggio     74.8      69.2
5        Mike Piazza     62.2      59.2
6       Jeff Bagwell     54.3      51.3
7         Tim Raines     46.1      41.9
8      Roger Clemens     35.4      29.3
9        Barry Bonds     34.7      28.5
10         Lee Smith     29.9      23.4
11    Curt Schilling     29.2      22.6
12    Edgar Martinez     25.2      18.7
13     Alan Trammell     20.8      14.9
14      Mike Mussina     20.3      13.8
15         Jeff Kent     15.2       9.9
16      Fred McGriff     11.7       8.7
17      Mark McGwire     11.0       8.3
18      Larry Walker     10.2       7.8
19     Don Mattingly      8.2       7.8
20 Nomar Garciaparra      0.0       7.8
21    Gary Sheffield      0.0       7.6
22        Sammy Sosa      7.2       6.4
23     Troy Percival      0.0       5.3
24    Carlos Delgado      0.0       1.9

                Name Previous Predicted
1      Randy Johnson      0.0      99.8
2     Pedro Martinez      0.0      93.5
3        John Smoltz      0.0      72.4
4       Craig Biggio     74.8      69.2
5        Mike Piazza     62.2      59.2
6       Jeff Bagwell     54.3      51.3
7         Tim Raines     46.1      41.9
8      Roger Clemens     35.4      29.3
9        Barry Bonds     34.7      28.5
10         Lee Smith     29.9      23.4
11    Curt Schilling     29.2      22.6
12    Edgar Martinez     25.2      18.7
13     Alan Trammell     20.8      14.9
14      Mike Mussina     20.3      13.8
15         Jeff Kent     15.2       9.9
16      Fred McGriff     11.7       8.7
17      Mark McGwire     11.0       8.3
18      Larry Walker     10.2       7.8
19     Don Mattingly      8.2       7.8
20 Nomar Garciaparra      0.0       7.8
21    Gary Sheffield      0.0       7.6
22        Sammy Sosa      7.2       6.4
23     Troy Percival      0.0       5.3
24    Carlos Delgado      0.0       1.9

                Name Previous Predicted
1      Randy Johnson      0.0      99.8
2     Pedro Martinez      0.0      93.5
3        John Smoltz      0.0      72.4
4       Craig Biggio     74.8      69.2
5        Mike Piazza     62.2      59.2
6       Jeff Bagwell     54.3      51.3
7         Tim Raines     46.1      41.9
8      Roger Clemens     35.4      29.3
9        Barry Bonds     34.7      28.5
10         Lee Smith     29.9      23.4
11    Curt Schilling     29.2      22.6
12    Edgar Martinez     25.2      18.7
13     Alan Trammell     20.8      14.9
14      Mike Mussina     20.3      13.8
15         Jeff Kent     15.2       9.9
16      Fred McGriff     11.7       8.7
17      Mark McGwire     11.0       8.3
18      Larry Walker     10.2       7.8
19     Don Mattingly      8.2       7.8
20 Nomar Garciaparra      0.0       7.8
21    Gary Sheffield      0.0       7.6
22        Sammy Sosa      7.2       6.4
23     Troy Percival      0.0       5.3
24    Carlos Delgado      0.0       1.9
 ???
 ???
                Name Previous Predicted
1      Randy Johnson      0.0      99.8
2     Pedro Martinez      0.0      93.5
3        John Smoltz      0.0      72.4
4       Craig Biggio     74.8      69.2
5        Mike Piazza     62.2      59.2
6       Jeff Bagwell     54.3      51.3
7         Tim Raines     46.1      41.9
8      Roger Clemens     35.4      29.3
9        Barry Bonds     34.7      28.5
10         Lee Smith     29.9      23.4
11    Curt Schilling     29.2      22.6
12    Edgar Martinez     25.2      18.7
13     Alan Trammell     20.8      14.9
14      Mike Mussina     20.3      13.8
15         Jeff Kent     15.2       9.9
16      Fred McGriff     11.7       8.7
17      Mark McGwire     11.0       8.3
18      Larry Walker     10.2       7.8
19     Don Mattingly      8.2       7.8
20 Nomar Garciaparra      0.0       7.8
21    Gary Sheffield      0.0       7.6
22        Sammy Sosa      7.2       6.4
23     Troy Percival      0.0       5.3
24    Carlos Delgado      0.0       1.9