here be dragons: 15 years of statistical strangeness


Upload: sports-reference

Post on 19-Jul-2015


TRANSCRIPT

Page 1: Here Be Dragons: 15 Years of Statistical Strangeness

Here be Dragons 15 Years of Statistical Strangeness

Sean Forman President, Sports Reference LLC

Philly PA @sean_forman

HELLO! Great to be here and welcome to the conference.

I’m Sean Forman

And my talk today is going to be a bit different than the talks you’ll hear later. I’m still going to talk about stats, but I’m going to talk about where the stats come from, how we compile what happens on the field into our database and some of the quirks and difficulties we’ve come across in building our sites.

Page 2: Here Be Dragons: 15 Years of Statistical Strangeness

On old maps, when cartographers had no idea what was at the edge of a map they were drawing, they would often write, “Here be dragons”.

In 2000 when I started the first site, there were A LOT of things that I didn’t know. The map of what I knew was very small.

The current map of the sites is vast, we have millions of pages and billions of cells of data in our databases. Most all of that was uncharted territory when I started.

In the first half of my talk today I’m going to talk about some of the “dragons” I’ve discovered while exploring the edges of the sports statistics world.

And in the 2nd half I’m going to talk about some of the fun things we’ve discovered and some of the fun things we’ve done on the sites.

Page 3: Here Be Dragons: 15 Years of Statistical Strangeness

Some History

We now have 7 sites, but started with Baseball-Reference.com in April of 2000.

Page 4: Here Be Dragons: 15 Years of Statistical Strangeness

In 2007, I quit my job as a math professor to do the site full time.

Page 5: Here Be Dragons: 15 Years of Statistical Strangeness

And now 2009, when we started our current look.

The NL East is about the exact opposite of what we expect this year.

Page 6: Here Be Dragons: 15 Years of Statistical Strangeness

Coming Summer 2015

And now in 2014, we have seven sites, half a million visitors a day and 2m page views every day.

And an 8th site is on the way this summer.

Page 7: Here Be Dragons: 15 Years of Statistical Strangeness

History is Murky

Most of my examples will be about baseball, and when we talk about baseball we are talking about an old sport.

If you want to do a site about MLS, your job is pretty easy. MLS is 20 years old, and the records have always been maintained on computers. NBA a bit harder, NFL and NHL harder still.

MLB as a professional organization is nearly 50 years older than either the NFL or the NHL.

Page 8: Here Be Dragons: 15 Years of Statistical Strangeness

What’s a Major League?

• National League (1876-now)

• American League (1901-now)

Even the question of what is a major league becomes challenging to determine and requires judgment calls. We all know the AL and NL.

But prior to 1900 it’s not that easy. After the Civil War in the 1860s, baseball was booming, so dozens of leagues were starting. Some had big money backing, so they could poach National League players and many had teams west of the Mississippi where there was great interest in baseball, but no National League teams.

Page 9: Here Be Dragons: 15 Years of Statistical Strangeness

What’s a Major League?

• National League (1876-now)

• American League (1901-now)

• American Association (1882-1891)

• Union Association (1884)

• Players League (1890)

• Federal League (1914-1915)

• National Association (1871-1875), not MLB recognized

So somebody has to decide what is and isn’t a Major League and essentially it fell to researchers 30-40 years after the fact to determine what is and isn’t a major league.

The UA is a joke: fewer than 50% of its players played in any other major league, and only one HOFer, a 20-year-old Tommy McCarthy, appeared in a UA game.

The 1875 National Association, on the other hand, had 55% of its players go into the NL and had 6 HOFers playing in the league.

Page 10: Here Be Dragons: 15 Years of Statistical Strangeness

So why does this cause problems? As an example, Cap Anson, besides being a horrible racist, was a pretty good ballplayer and retired in 1897 with the most hits all-time.

Cap got his start at age 19 in the 1871 National Association. We consider the NA a major league, but MLB doesn’t, so Cap’s career hit totals can be a bit confusing: MLB lists 3,011 hits, while BR lists 3,435 (3,012 of them outside the NA).

You’re probably thinking: 1871-1897, why is this an issue, and why is this a problem?

So, as with most problems in my life, I blame Derek Jeter.

Page 11: Here Be Dragons: 15 Years of Statistical Strangeness

At the start of the 2014 season, MLB listed Jeter and his 3,316 hits in 9th place well ahead of Cap Anson, but we listed him in 10th place behind Cap Anson.

One other detail in this story is that we don’t agree on Honus Wagner’s hit total either, but for completely different reasons.

Page 12: Here Be Dragons: 15 Years of Statistical Strangeness

Orioles => Yankees

This is not a statement that the Orioles are greater than or equal to the Yankees, but rather a question of what a franchise is, and whether a particular team is part of the lineage of a franchise.

In 1902, the Orioles went bankrupt in the middle of the season and were taken over by the league. The AL had ulterior motives as they wanted a team in NYC. So at the end of the year they sold the franchise to NY owners who then went on to create the Yankees.

The question then becomes, do we count the 1901-02 Orioles as part of the Yankees’ history? We count the Expos as part of the Nationals’ history. We decided it was a new franchise, as only 5 Orioles played for the Yankees in 1903. It was a new team. But the decisions you make have consequences in the records. Our blog article about this decision drew 130 responses.

Page 13: Here Be Dragons: 15 Years of Statistical Strangeness

RBI’S

• Introduced in 1920 - "The summary (ed: meaning the box score) shall contain: The number of runs batted in by each batsman."

So we’ve had to discuss what’s a league, what’s a franchise, and now we even have to discuss what’s a stat?

Stats didn’t come to the league fully formed. They evolved over time. It took 50 years for the RBI to become an official MLB stat.

In 1920, when they added the rule, here was their definition. You don’t need to be a mathematician to see that this definition has problems. Double plays? Errors? Walks? Bunts? Ground outs?

We have evidence that the St. Louis Browns scorer didn’t understand how RBIs were calculated. In 1922, for the first half of the year, the RBI totals for the Browns’ games had dozens of errors: home runs with no RBIs, more RBIs than runs scored. Detroit home games were kept very well. Well, after the Tigers visited the Browns, the Browns scorer became much, much better at calculating RBIs, so we think the Tigers scorer taught him how to count them.
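The two symptoms mentioned here (home runs with no RBIs, more RBIs than runs scored) are the kind of thing a computer can flag. A minimal sketch in Python, assuming a hypothetical box-score layout; the field names `runs`, `batters`, `hr` and `rbi` are illustrative and not taken from any real schema:

```python
def rbi_red_flags(team_box):
    """Return a list of human-readable problems found in one team's box score.

    `team_box` is a hypothetical dict:
    {"runs": int, "batters": [{"name": str, "hr": int, "rbi": int}, ...]}
    """
    flags = []
    total_rbi = sum(b["rbi"] for b in team_box["batters"])
    # A team cannot drive in more runs than it scored.
    if total_rbi > team_box["runs"]:
        flags.append(f"team RBI {total_rbi} exceeds runs {team_box['runs']}")
    for b in team_box["batters"]:
        # Every home run drives in at least the batter himself.
        if b["rbi"] < b["hr"]:
            flags.append(f"{b['name']}: {b['hr']} HR but only {b['rbi']} RBI")
    return flags


# Made-up example data, for illustration only.
example = {
    "runs": 3,
    "batters": [
        {"name": "Batter A", "hr": 1, "rbi": 0},
        {"name": "Batter B", "hr": 0, "rbi": 4},
    ],
}
print(rbi_red_flags(example))
```

A check like this only says that something is wrong in a game, not which number is wrong; someone still has to go back to the newspaper accounts.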

Page 14: Here Be Dragons: 15 Years of Statistical Strangeness

RBI’S

• Introduced in 1920 - "The summary (ed: meaning the box score) shall contain: The number of runs batted in by each batsman."

• 1931 - “Runs Batted In are runs scored on safe hits (including home runs), sacrifice hits, outfield put-outs, infield put-outs, and when the run is forced over by reason of the batsman becoming a base-runner. With less than two outs, if an error is made on a play on which a runner from third would ordinarily score, credit the batsman with a Run Batted In.”

Through the 20’s the rule continued to be very vague and in 1931 the league finally fixed the issue with the rule almost as we have today.

So if they didn’t collect them until 1920 how do we know Babe Ruth’s pre-1920 RBIs?

Page 15: Here Be Dragons: 15 Years of Statistical Strangeness

So Where do RBIs come from?

• 1920-Present: RBI tracked by official scorers and tabulated by the official statistician (please note that official does not mean that these numbers are without error)

• 1907-1919: Unofficial compilations by Ernie Lanigan (1930-50s)

• 1876-1890: Unofficial compilations by John Tattersall (1960s)

• 1891-1919: Unofficial compilations by David Neft (1960s)

• 1871-1930, Modern researchers recreating accounts of every run in a season.

A lot of hard work by a few people.

Microfilm accounts of old ballgames.

I enjoy looking at old newspapers, if only because of the advertisements.

Page 16: Here Be Dragons: 15 Years of Statistical Strangeness
Page 17: Here Be Dragons: 15 Years of Statistical Strangeness

Retroactive Debuts

Even defining when a player starts his MLB career can have issues.

So if a game is suspended and played later, the league considers the game to have been played on the first date for purposes of consecutive hit streaks or games played streaks.

This graphic shows at least the first two games of Barry Bonds’ career. He also got a hit in the resumption. Which date is his debut? In which game did he get his first hit?

Last year there were a half dozen players with this issue. Note that it’s possible for a player to play for both teams in a single game.

Page 18: Here Be Dragons: 15 Years of Statistical Strangeness

Discrepancies

Now I want to talk about actual errors in the collection of the data.

Page 19: Here Be Dragons: 15 Years of Statistical Strangeness

Ellis Valentine OF Assists

So the issue here is that the RF numbers come from play-by-play accounts of every game that season and the OF numbers are the numbers that the league published in its annual totals.

Which is correct? Probably 25, but it’s impossible to know without carefully reconciling all 25 potential assists.

And it turns out we’ve located tens of thousands of possible discrepancies like this.
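Finding candidates like this is a straightforward comparison of two sources; deciding which source is right is the hard part. A minimal sketch, assuming both sources have been loaded into dicts keyed by a hypothetical `(player, season, stat)` tuple. The numbers are placeholders, not the real totals from the slide:

```python
def find_discrepancies(official, from_pbp):
    """Compare two {(player, season, stat): value} dicts and return the mismatches."""
    mismatches = []
    # Only compare keys that both sources actually cover.
    for key in sorted(official.keys() & from_pbp.keys()):
        if official[key] != from_pbp[key]:
            mismatches.append((key, official[key], from_pbp[key]))
    return mismatches


# Placeholder values, for illustration only.
official = {("player_x", 1950, "OF_A"): 24, ("player_y", 1950, "OF_A"): 10}
from_pbp = {("player_x", 1950, "OF_A"): 25, ("player_y", 1950, "OF_A"): 10}
print(find_discrepancies(official, from_pbp))
```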

Page 20: Here Be Dragons: 15 Years of Statistical Strangeness

retrosheet.org

• Recreating a play-by-play account of every game in major league history.

• By end of 2015, there will be 210,877 MLB games in history.

• Retrosheet will have play-by-play for 135,311 games (mostly hand entered), 65%

• Box Scores for 39,000 more, 83% of all games.
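The coverage percentages follow directly from the game counts on this slide. A quick sketch of the arithmetic; the variable names are mine, and 39,000 is the slide's rounded figure:

```python
total_games = 210_877      # MLB games through the end of 2015 (from the slide)
pbp_games = 135_311        # games with a play-by-play account
box_only_games = 39_000    # additional games with box scores only (rounded)

pbp_share = pbp_games / total_games
any_share = (pbp_games + box_only_games) / total_games

print(f"play-by-play: {pbp_share:.1%}")
print(f"play-by-play or box score: {any_share:.1%}")
```

With these inputs the second share rounds to the 83% on the slide, while the first comes out a little under the quoted 65%.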

So I’m mentioning pbp accounts a lot. Where do these come from? Well, there is a volunteer group that has been working for 26 years to accumulate play-by-play accounts of every game in major league history. EVERY GAME, which now counts about 211k.

This play-by-play can be tested and verified by computer and against multiple sources, so errors can be caught quickly and easily. For example, Pitcher Hits Allowed = Batter Hits.

That is not possible for the “official totals” stored by the league.
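The Pitcher Hits Allowed = Batter Hits identity is easy to express as a check on a single game. A minimal sketch, assuming a hypothetical box-score layout with per-player lists of hits; the key names `batter_hits` and `hits_allowed` are illustrative:

```python
def hits_balance_ok(box):
    """True if each side's batter hits equal the hits allowed by the opposing pitchers."""
    home, away = box["home"], box["away"]
    return (sum(home["batter_hits"]) == sum(away["hits_allowed"])
            and sum(away["batter_hits"]) == sum(home["hits_allowed"]))


# Made-up box scores, for illustration only.
good_box = {
    "home": {"batter_hits": [2, 1, 0, 3], "hits_allowed": [4, 1]},
    "away": {"batter_hits": [1, 1, 3], "hits_allowed": [5, 1]},
}
bad_box = {
    "home": {"batter_hits": [2, 1, 0, 3], "hits_allowed": [4, 1]},
    "away": {"batter_hits": [1, 1, 3], "hits_allowed": [5, 0]},
}
print(hits_balance_ok(good_box), hits_balance_ok(bad_box))
```

The same pattern applies to any stat that is recorded from two sides of the same event.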

Page 21: Here Be Dragons: 15 Years of Statistical Strangeness

August 5, 1927

This

Page 22: Here Be Dragons: 15 Years of Statistical Strangeness

or this

Page 23: Here Be Dragons: 15 Years of Statistical Strangeness

is converted into this: a machine-readable language that can be parsed.

Page 24: Here Be Dragons: 15 Years of Statistical Strangeness

And we turn it into this: a row from our play-by-play database of 11m plays, with over 200 columns of data on each play.
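To give a flavor of the first parsing step, here is a minimal sketch. It assumes the comma-separated `play` record layout used in Retrosheet event files (inning, home/visitor flag, batter id, count, pitch sequence, event). The output field names are my own, and decoding the event string into 200-plus columns is a far bigger job than this:

```python
def parse_play(line):
    """Split one Retrosheet-style `play` record into named fields."""
    parts = line.strip().split(",", 6)
    if len(parts) != 7 or parts[0] != "play":
        raise ValueError(f"not a play record: {line!r}")
    _, inning, home_flag, batter_id, count, pitches, event = parts
    # The count is two digits (balls, strikes); anything else means it was not recorded.
    known_count = len(count) == 2 and count.isdigit()
    return {
        "inning": int(inning),
        "batting_team": "home" if home_flag == "1" else "visitor",
        "batter_id": batter_id,
        "balls": int(count[0]) if known_count else None,
        "strikes": int(count[1]) if known_count else None,
        "pitches": pitches,
        "event": event,  # still encoded, e.g. a single with runner advances
    }


row = parse_play("play,5,1,ramir001,00,,S8.3-H;1-2")
print(row)
```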

So what did the official records look like?

Page 25: Here Be Dragons: 15 Years of Statistical Strangeness

This was state of the art in 1923 and for a long time after.

Page 26: Here Be Dragons: 15 Years of Statistical Strangeness

Pete Behan

Notice he’s credited with 34 games played by the league. Now I’ve counted this 5 times and get 35 games every time.

There are thousands of such clerical errors in the historical record, and we will never know the accurate values for all such seasons.

Ty Cobb had 4190 hits plus or minus 5

Page 27: Here Be Dragons: 15 Years of Statistical Strangeness

Paul Erdős

Paul Erdős (pronounced “air-dish”), born 1913 in Hungary, was probably the most prolific mathematician of the 20th century.

Page 28: Here Be Dragons: 15 Years of Statistical Strangeness
Page 29: Here Be Dragons: 15 Years of Statistical Strangeness

Plays

The next kind of plays I’ll describe are what I’d call facepalm plays, or in some cases decisions.

These are decisions or plays that either don’t make sense or in some way break our systems.

Page 30: Here Be Dragons: 15 Years of Statistical Strangeness

2008 Bengie Molina “Home Run”

In 2008 video replay for home runs was introduced. Sept 27th Bengie Molina hits a ball off the top of the fence, stops at first base. Emmanuel Burriss runs out to PR for him.

Bruce Bochy then goes out and asks for a video review of the play. It’s ruled a home run. Burriss then trots home to score the run.

Molina gets a HR, 2 RBI and no run scored on the play. Unfortunately we have no means of handling a mid-play change in runners.

Page 31: Here Be Dragons: 15 Years of Statistical Strangeness

Segura goes backwards

Anyone here from MLB or the NFL? Please leave if you are.

April 19, 2013

Our software was completely unable to deal with this event, even though it has parsed over 11m different plays in MLB history, so our play-by-play decoder now has special code to handle this one particular play.

Page 32: Here Be Dragons: 15 Years of Statistical Strangeness

NFL End of Game

Sometimes understanding what happens in a play is easy, but translating that into a clear definition in a database can be very hard.

Page 33: Here Be Dragons: 15 Years of Statistical Strangeness

• Charlie Whitehurst pass complete short middle to Dexter McCluster for 6 yards, lateral to Nate Washington for -10 yards, lateral to Charlie Whitehurst for 20 yards, lateral to Delanie Walker for 33 yards (tackle by Dawan Landry)

• Whitehurst, the quarterback, gets a completion on the play for 49 yards. He also gets 20 receiving yards, but no reception as that is given to McCluster.
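The crediting rules in these two bullets can be sketched in a few lines. This assumes, as the bullets suggest, that the passer gets the yardage of the whole play, only the original receiver gets the reception, and every ball carrier gets his own leg as receiving yards. The stat names are illustrative, not real column names:

```python
from collections import defaultdict

# (player, yards) for each leg of the play, in order, as described on the slide.
legs = [
    ("Dexter McCluster", 6),      # caught the forward pass
    ("Nate Washington", -10),     # took a lateral
    ("Charlie Whitehurst", 20),   # took a lateral
    ("Delanie Walker", 33),       # took a lateral
]


def credit_play(passer, legs):
    """Return per-player stat credits for a completed pass followed by laterals."""
    stats = defaultdict(lambda: defaultdict(int))
    stats[passer]["pass_completions"] += 1
    stats[passer]["pass_yards"] += sum(yards for _, yards in legs)
    for i, (player, yards) in enumerate(legs):
        if i == 0:
            stats[player]["receptions"] += 1  # only the original receiver gets the catch
        stats[player]["receiving_yards"] += yards
    return {player: dict(s) for player, s in stats.items()}


credits = credit_play("Charlie Whitehurst", legs)
print(credits)
```

The legs sum to 6 - 10 + 20 + 33 = 49, which matches the completion yardage in the bullet above.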

Page 34: Here Be Dragons: 15 Years of Statistical Strangeness

So recording this one play takes two tables and 13 rows in the tables.

Page 35: Here Be Dragons: 15 Years of Statistical Strangeness

Brought up during the 2006 ALDS after Mark Ellis broke his finger. Played in 2 ALCS games, but never appeared in a regular season game.

Ernie Banks played in 2528 regular season games and not one postseason game.

Page 36: Here Be Dragons: 15 Years of Statistical Strangeness

2014 split data.

3 different times an ump lost track of the count: 1 time the batter walked on ball 5, and 2 times they struck out on a 4-2 count.
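Splits by count have to cope with counts that should not exist. A minimal sketch of one way to bucket them, assuming the count is recorded as it stood before the final pitch; the function names and the "miscount" bucket are hypothetical:

```python
def is_legal_count(balls, strikes):
    """True if (balls, strikes) is a count a batter can legally still be facing."""
    return 0 <= balls <= 3 and 0 <= strikes <= 2


def split_bucket(balls, strikes):
    """Bucket name for count-based splits, with a catch-all for umpire miscounts."""
    if is_legal_count(balls, strikes):
        return f"{balls}-{strikes}"
    return "miscount"


print(split_bucket(3, 2))  # a normal full count
print(split_bucket(4, 2))  # the strikeout "on a 4-2 count"
```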

Page 37: Here Be Dragons: 15 Years of Statistical Strangeness

Courtesy Fielders

9/1/17 (Tigers at Indians) - (COURTESY FIELDER) Tris Speaker was on 3b in the bottom of the first with Joe Evans at bat. Tris tried to steal home but Evans hit away and lined a ball into Speaker's face. Detroit manager Hughie Jennings, as a courtesy, allowed Tris to sit out in the second inning while his face was sewn up. Elmer Smith played cf in the top of the second and Speaker went back in for the third.

Page 38: Here Be Dragons: 15 Years of Statistical Strangeness

This particular set of all-stars broke our database table, since we had apparently erroneously assumed a player could only be named to one A-S team.
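This kind of breakage usually comes from an assumption baked into a table's key. A minimal sketch using Python's built-in `sqlite3`; the table and column names are hypothetical and are not the actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: the key assumes one all-star team per player per season.
conn.execute(
    "CREATE TABLE allstar_v1 (player_id TEXT, season INTEGER, team TEXT, "
    "PRIMARY KEY (player_id, season))"
)
conn.execute("INSERT INTO allstar_v1 VALUES ('player01', 2000, 'Team A')")
try:
    conn.execute("INSERT INTO allstar_v1 VALUES ('player01', 2000, 'Team B')")
    second_insert_ok = True
except sqlite3.IntegrityError:
    second_insert_ok = False  # the key rejects the second selection

# Widening the key to include the team lets both selections coexist.
conn.execute(
    "CREATE TABLE allstar_v2 (player_id TEXT, season INTEGER, team TEXT, "
    "PRIMARY KEY (player_id, season, team))"
)
conn.executemany(
    "INSERT INTO allstar_v2 VALUES (?, ?, ?)",
    [("player01", 2000, "Team A"), ("player01", 2000, "Team B")],
)
rows_v2 = conn.execute("SELECT COUNT(*) FROM allstar_v2").fetchone()[0]
print(second_insert_ok, rows_v2)
```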

Page 39: Here Be Dragons: 15 Years of Statistical Strangeness

Duplicate Names

We spend a lot of time cleaning up data sets, assigning our id system to the players and to the teams. Generally it can be automated by just searching on player last name or player last and first name, but there are cases where it fails.

Page 40: Here Be Dragons: 15 Years of Statistical Strangeness

Steve Ontiverii

Page 41: Here Be Dragons: 15 Years of Statistical Strangeness

There are also two C.J. Wilsons currently in the NFL.
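The lookup-by-name approach and its failure mode can be sketched as follows. This is a minimal illustration with made-up ids; the point is that a name with more than one candidate should be kicked out for a human to resolve rather than guessed:

```python
from collections import defaultdict


def build_name_index(players):
    """Map a lowercased "first last" name to every player id that carries it."""
    index = defaultdict(list)
    for player_id, first, last in players:
        index[f"{first} {last}".lower()].append(player_id)
    return index


def match_player(index, first, last):
    """Return (player_id, status); ambiguous or unknown names need manual review."""
    candidates = index.get(f"{first} {last}".lower(), [])
    if len(candidates) == 1:
        return candidates[0], "matched"
    if not candidates:
        return None, "unknown"
    return None, "ambiguous"


# The ids below are made up for illustration.
players = [
    ("ontiv-a", "Steve", "Ontiveros"),
    ("ontiv-b", "Steve", "Ontiveros"),
    ("forma-x", "Sean", "Forman"),
]
index = build_name_index(players)
print(match_player(index, "Sean", "Forman"))
print(match_player(index, "Steve", "Ontiveros"))
```

Extra fields such as birth date, team, or position can break many ties, but brothers and twins can share most of those too.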

Page 42: Here Be Dragons: 15 Years of Statistical Strangeness

Markieff & Marcus Morris Goran & Zoran Dragic

Brothers can be difficult to handle because they have the same last name, they go to the same high school, they are born in the same city, and maybe attend the same college. The Suns had two sets of brothers this year: Goran and his little brother Zoran, and identical twins Marcus and Markieff Morris.

Page 43: Here Be Dragons: 15 Years of Statistical Strangeness

Antonio Bastardo & nanny

We allow users to post news articles to our site, so we try to be careful that nothing “bad” comes on the site, but as you can see we can be overly aggressive at times.
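A common way a filter becomes overly aggressive is plain substring matching. A minimal sketch contrasting it with whole-word matching; the one-word blocklist is a stand-in, since the real filter's word list and logic are not shown in the talk:

```python
import re

BLOCKLIST = ["bastard"]  # stand-in list, for illustration only


def naive_flag(text):
    """Substring match: flags any text that merely contains a blocked string."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKLIST)


def word_flag(text):
    """Whole-word match: only flags a blocked string standing on its own."""
    return any(
        re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE)
        for word in BLOCKLIST
    )


headline = "Antonio Bastardo strikes out the side"
print(naive_flag(headline), word_flag(headline))
```

Whole-word matching clears the player's name, at the cost of missing deliberately disguised spellings.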

Page 44: Here Be Dragons: 15 Years of Statistical Strangeness

Goofy Stuff

Page 45: Here Be Dragons: 15 Years of Statistical Strangeness

Worst Cup of Coffee

• Ron Wright played one major league game

• In 2nd, struck out with runners on 1st and 2nd

• In 4th, grounded into a triple play with runners on 1st and 3rd.

• In 6th, grounded into double play with runners on first and second.

A player who plays just one game in the major leagues for his career is said to have just gotten a cup of coffee in the majors.

Page 46: Here Be Dragons: 15 Years of Statistical Strangeness

2nd Worst Cup of Coffee

• Larry Yount, Debut: Starting Pitcher on Sept. 15th, 1971 for the Astros vs. the Braves

Brother of Robin Yount.

Injured his shoulder during warmups, so he has 1 game played, no pitches thrown and no batters faced. He never appeared in the major leagues again.

We share a lot of weird and interesting stuff in the office.

Page 47: Here Be Dragons: 15 Years of Statistical Strangeness

MLB Gameday

circa 1919

Page 48: Here Be Dragons: 15 Years of Statistical Strangeness

Iranian Leader Honorary Captain of CFB Team

GW student body chose the Shah of Iran as the Honorary Captain prior to the game against Georgetown

Page 49: Here Be Dragons: 15 Years of Statistical Strangeness

Drew Cannon (sabrematician)

Page 50: Here Be Dragons: 15 Years of Statistical Strangeness

Now this easter egg is in my personal HOF

Page 51: Here Be Dragons: 15 Years of Statistical Strangeness
Page 52: Here Be Dragons: 15 Years of Statistical Strangeness

Too Many Seans

Page 53: Here Be Dragons: 15 Years of Statistical Strangeness

Me

Page 54: Here Be Dragons: 15 Years of Statistical Strangeness
Page 55: Here Be Dragons: 15 Years of Statistical Strangeness

Sean Holtz

Page 56: Here Be Dragons: 15 Years of Statistical Strangeness

Sean Smith

Page 57: Here Be Dragons: 15 Years of Statistical Strangeness

Sean Wrona

Page 58: Here Be Dragons: 15 Years of Statistical Strangeness

Shane Holmes

Page 59: Here Be Dragons: 15 Years of Statistical Strangeness

Sean Burrill

Page 60: Here Be Dragons: 15 Years of Statistical Strangeness

And then I came across this guy who runs a Sports Analytics conference in England.

Page 61: Here Be Dragons: 15 Years of Statistical Strangeness

Old Ballclub Photos

Looking at pictures of old baseball teams never gets old.

This is the Fort Wayne Kekiongas, who played in the first game in MLB history in 1871.

Page 62: Here Be Dragons: 15 Years of Statistical Strangeness

The More Things Change…

Everyone complains about modern players not understanding the fundamentals.

“I have always maintained the baseball of 20 years ago was as speedy as the article played nowadays. In my time we had five or six real stars to the team instead of the one or two as is the case today. Our men were more versatile and better able to stand the grind than the present crop of players.” Cap Anson, 1918

Page 63: Here Be Dragons: 15 Years of Statistical Strangeness
Page 64: Here Be Dragons: 15 Years of Statistical Strangeness

Casually Racist Nicknames

This goes along with nicknaming everyone Irish “Red,” everyone German “Dutch,” and everyone American Indian “Chief.”

My favorite part of this is that poor Olaf Henriksen is the only Danish player ever to play in the majors.

Page 65: Here Be Dragons: 15 Years of Statistical Strangeness

From the 1980s: computer-using celebrities tell us what they want for Christmas.

Davey Johnson: “I’d like a program that could tell me the results of every batter-pitcher confrontation in the National League.”

Page 66: Here Be Dragons: 15 Years of Statistical Strangeness

E-mails we get

We get a lot of goofy emails and we respond to nearly every one, but there are some that are just so out there that I have to share them.

Some users think that whatever Google ranks first for a search is the official site for that search term. We rank first for “NFL Officials.”

Page 67: Here Be Dragons: 15 Years of Statistical Strangeness

We Aren’t NFL Officials

“I’m a Giants fan and hate the Cowboys, but that was the worst call I’ve ever seen. Those Refs should be **BLANKED** for that call in a crucial game.”

“Really Bad Officiating!! Those refs need to retire. Even the announcers were shocked at how off the calls were. I think I’ll watch the cooking channel next week instead of football. Less Heartburn.”

Page 68: Here Be Dragons: 15 Years of Statistical Strangeness
Page 69: Here Be Dragons: 15 Years of Statistical Strangeness

E-mail Pro Tip

“Is there a stat that shows how many times the 1-4-3 DP was executed?”

Page 70: Here Be Dragons: 15 Years of Statistical Strangeness

E-mail Pro Tip

Now we have someone hired to read all of these and respond. This looks like a normal e-mail until you see who it’s from. You’d think that a CAL RIPKEN e-mail would be a big deal in the office. He never mentioned it.

Two years later I’m invited to a HOF event because we contributed stats and Cal Ripken is there and I’m introduced and he mentions, “Oh yeah I e-mailed you guys awhile back.” Which was news to me.

After returning to Philly, I mentioned to Neil that I should always be alerted if a HOFer contacts the site.

I also met Joe Morgan at that event, and he had never heard of the site.

Page 71: Here Be Dragons: 15 Years of Statistical Strangeness

Sports Reference Pranks

We run a really boring stats site. Row after row after row of numbers.

But we still manage to show a personality in certain ways.

Page 72: Here Be Dragons: 15 Years of Statistical Strangeness

Bynum shot chart

In 2014, Andrew Bynum was kicked out of practice and released by the Cavaliers for ruining a practice by shooting from half court whenever he could.

On April 1st that year we launched practice shot charts.

Page 73: Here Be Dragons: 15 Years of Statistical Strangeness

We put together a tool this April 1st to help players with their hairstyling needs.

Page 74: Here Be Dragons: 15 Years of Statistical Strangeness

Of course we had to have a Will Ferrell Page.

Page 75: Here Be Dragons: 15 Years of Statistical Strangeness

What you should read and learn.

So now, in all seriousness, you are here to learn stuff. So what do I think you should read to learn more about sabermetrics?

Page 76: Here Be Dragons: 15 Years of Statistical Strangeness

Hidden Game of Baseball

• RE-RELEASED this year

• Sabermetrics 101

Page 77: Here Be Dragons: 15 Years of Statistical Strangeness

The Book by Tom Tango and MGL

• Sabermetrics 301

Page 78: Here Be Dragons: 15 Years of Statistical Strangeness

Bill James Abstracts

These are available on eBay and at least the last seven aren’t too expensive. The first five are VERY expensive, even as reprints.

And dozens of blogs and writers. Of course, FanGraphs and Baseball Prospectus and a dozen other analytics blogs.

Page 79: Here Be Dragons: 15 Years of Statistical Strangeness

FanGraphs, Baseball Prospectus, Hardball Times, Beyond the Boxscore

Page 80: Here Be Dragons: 15 Years of Statistical Strangeness

Basketball: Zach Lowe (@zachlowe_nba), Neil Paine (@neil_paine), Ken Pomeroy (@kenpomeroy)

Page 81: Here Be Dragons: 15 Years of Statistical Strangeness

NFL: Chris Brown (@smartfootball), Brian Burke (@adv_nfl_stats), Chase Stuart (@fbgchase)

Page 82: Here Be Dragons: 15 Years of Statistical Strangeness

Soccer: Michael Caley (@MC_of_A)

Page 83: Here Be Dragons: 15 Years of Statistical Strangeness

And much much more

Page 84: Here Be Dragons: 15 Years of Statistical Strangeness

SR Crew

• Mike Kania (@zempf), David Corby, Hans Van Slooten (@cantpitch), Mike Lynch (@sportinfo247), Adam Wodon (@CHN_AdamWodon), Jay Virshbo (@jvirshbo)

Page 85: Here Be Dragons: 15 Years of Statistical Strangeness

Thank You! Questions?

@sean_forman see my twitter acct for slides location