Parsing NBA Substitutions in Play-by-Play Data
Posted on Tue 29 December 2020 in Data Science
Parsing substitutions in basketball play-by-play data is a problem that has eluded me for a while. It matters a great deal for lineup-dependent statistics such as plus/minus and for reconstructing rotations. Below is the approach I came up with to parse substitutions and work out which lineups were on the court together and for how long. The best way I found was to use Python classes to store lineup and player data and flip an on/off-court flag as players are subbed in and out. I am sure there are more elegant ways to handle this data, but I am not a computer scientist!
First we need the box score to get the player rosters for the game we want to parse. In this example I'm going to use Blazers/Nuggets Game 7 from 2019, not that I'm biased at all in choosing one of the best wins for my Blazers in my lifetime.
import requests
import json
import pandas as pd
import re
import numpy as np
import os
import datetime as dt
box_url = 'https://stats.nba.com/stats/boxscoretraditionalv2?EndPeriod=10&EndRange=28800&GameID=0041800237&RangeType=0&Season=2018-19&SeasonType=Playoffs&StartPeriod=1&StartRange=0'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)', 'x-nba-stats-origin': 'stats', 'x-nba-stats-token': 'true', 'Host':'stats.nba.com', 'Referer':'https://stats.nba.com/game/0021900306/'}
r= requests.get(box_url, headers=headers, timeout = 5)
data = json.loads(r.text)
box = pd.DataFrame.from_dict(data['resultSets'][0]['rowSet'])
col_names = data['resultSets'][0]['headers']
box.columns = col_names
box.columns = box.columns.str.lower()
box
| | game_id | team_id | team_abbreviation | team_city | player_id | player_name | start_position | comment | min | fgm | fga | fg_pct | fg3m | fg3a | fg3_pct | ftm | fta | ft_pct | oreb | dreb | reb | ast | stl | blk | to | pf | pts | plus_minus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0041800237 | 1610612757 | POR | Portland | 203090 | Maurice Harkless | F | | 16:47 | 3.0 | 5.0 | 0.600 | 0.0 | 1.0 | 0.000 | 0.0 | 1.0 | 0.000 | 3.0 | 2.0 | 5.0 | 3.0 | 1.0 | 1.0 | 0.0 | 5.0 | 6.0 | -8.0 |
| 1 | 0041800237 | 1610612757 | POR | Portland | 202329 | Al-Farouq Aminu | F | | 7:08 | 1.0 | 4.0 | 0.250 | 0.0 | 2.0 | 0.000 | 1.0 | 2.0 | 0.500 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 3.0 | -7.0 |
| 2 | 0041800237 | 1610612757 | POR | Portland | 202683 | Enes Kanter | C | | 39:39 | 6.0 | 13.0 | 0.462 | 0.0 | 1.0 | 0.000 | 0.0 | 0.0 | 0.000 | 4.0 | 8.0 | 12.0 | 1.0 | 0.0 | 0.0 | 1.0 | 3.0 | 12.0 | 1.0 |
| 3 | 0041800237 | 1610612757 | POR | Portland | 203468 | CJ McCollum | G | | 45:17 | 17.0 | 29.0 | 0.586 | 1.0 | 3.0 | 0.333 | 2.0 | 2.0 | 1.000 | 1.0 | 8.0 | 9.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 37.0 | 6.0 |
| 4 | 0041800237 | 1610612757 | POR | Portland | 203081 | Damian Lillard | G | | 45:25 | 3.0 | 17.0 | 0.176 | 2.0 | 9.0 | 0.222 | 5.0 | 6.0 | 0.833 | 0.0 | 10.0 | 10.0 | 8.0 | 3.0 | 0.0 | 1.0 | 3.0 | 13.0 | 8.0 |
| 5 | 0041800237 | 1610612757 | POR | Portland | 1628380 | Zach Collins | | | 23:17 | 2.0 | 6.0 | 0.333 | 1.0 | 3.0 | 0.333 | 2.0 | 2.0 | 1.000 | 2.0 | 4.0 | 6.0 | 1.0 | 0.0 | 4.0 | 1.0 | 5.0 | 7.0 | 5.0 |
| 6 | 0041800237 | 1610612757 | POR | Portland | 203918 | Rodney Hood | | | 20:11 | 2.0 | 6.0 | 0.333 | 0.0 | 3.0 | 0.000 | 2.0 | 2.0 | 1.000 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 6.0 | -2.0 |
| 7 | 0041800237 | 1610612757 | POR | Portland | 203552 | Seth Curry | | | 16:20 | 0.0 | 2.0 | 0.000 | 0.0 | 2.0 | 0.000 | 0.0 | 0.0 | 0.000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 0.0 | 7.0 |
| 8 | 0041800237 | 1610612757 | POR | Portland | 202323 | Evan Turner | | | 19:12 | 3.0 | 7.0 | 0.429 | 0.0 | 0.0 | 0.000 | 8.0 | 9.0 | 0.889 | 2.0 | 5.0 | 7.0 | 2.0 | 0.0 | 1.0 | 0.0 | 4.0 | 14.0 | 1.0 |
| 9 | 0041800237 | 1610612757 | POR | Portland | 203086 | Meyers Leonard | | | 6:44 | 1.0 | 4.0 | 0.250 | 0.0 | 2.0 | 0.000 | 0.0 | 0.0 | 0.000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 2.0 | 9.0 |
| 10 | 0041800237 | 1610612757 | POR | Portland | 1627746 | Skal Labissiere | | DNP - Coach's Decision | None | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 11 | 0041800237 | 1610612757 | POR | Portland | 1627774 | Jake Layman | | DNP - Coach's Decision | None | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 12 | 0041800237 | 1610612757 | POR | Portland | 1629014 | Anfernee Simons | | DNP - Coach's Decision | None | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 13 | 0041800237 | 1610612743 | DEN | Denver | 1628470 | Torrey Craig | F | | 33:01 | 2.0 | 5.0 | 0.400 | 0.0 | 2.0 | 0.000 | 4.0 | 5.0 | 0.800 | 4.0 | 4.0 | 8.0 | 2.0 | 0.0 | 0.0 | 1.0 | 2.0 | 8.0 | 6.0 |
| 14 | 0041800237 | 1610612743 | DEN | Denver | 200794 | Paul Millsap | F | | 31:55 | 3.0 | 13.0 | 0.231 | 0.0 | 2.0 | 0.000 | 4.0 | 6.0 | 0.667 | 1.0 | 6.0 | 7.0 | 1.0 | 0.0 | 3.0 | 0.0 | 6.0 | 10.0 | 3.0 |
| 15 | 0041800237 | 1610612743 | DEN | Denver | 203999 | Nikola Jokic | C | | 41:53 | 11.0 | 26.0 | 0.423 | 2.0 | 6.0 | 0.333 | 5.0 | 7.0 | 0.714 | 4.0 | 9.0 | 13.0 | 2.0 | 0.0 | 4.0 | 2.0 | 3.0 | 29.0 | -1.0 |
| 16 | 0041800237 | 1610612743 | DEN | Denver | 203914 | Gary Harris | G | | 39:10 | 7.0 | 11.0 | 0.636 | 0.0 | 1.0 | 0.000 | 1.0 | 2.0 | 0.500 | 0.0 | 6.0 | 6.0 | 3.0 | 0.0 | 0.0 | 1.0 | 3.0 | 15.0 | -7.0 |
| 17 | 0041800237 | 1610612743 | DEN | Denver | 1627750 | Jamal Murray | G | | 37:53 | 4.0 | 18.0 | 0.222 | 0.0 | 4.0 | 0.000 | 9.0 | 9.0 | 1.000 | 2.0 | 4.0 | 6.0 | 5.0 | 0.0 | 0.0 | 1.0 | 1.0 | 17.0 | -2.0 |
| 18 | 0041800237 | 1610612743 | DEN | Denver | 203486 | Mason Plumlee | | | 18:48 | 1.0 | 3.0 | 0.333 | 0.0 | 0.0 | 0.000 | 2.0 | 5.0 | 0.400 | 1.0 | 5.0 | 6.0 | 0.0 | 0.0 | 2.0 | 0.0 | 3.0 | 4.0 | -7.0 |
| 19 | 0041800237 | 1610612743 | DEN | Denver | 203115 | Will Barton | | | 19:58 | 4.0 | 9.0 | 0.444 | 0.0 | 2.0 | 0.000 | 0.0 | 0.0 | 0.000 | 1.0 | 2.0 | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 | 8.0 | -9.0 |
| 20 | 0041800237 | 1610612743 | DEN | Denver | 1627736 | Malik Beasley | | | 7:15 | 0.0 | 1.0 | 0.000 | 0.0 | 1.0 | 0.000 | 0.0 | 0.0 | 0.000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -1.0 |
| 21 | 0041800237 | 1610612743 | DEN | Denver | 1628420 | Monte Morris | | | 10:07 | 1.0 | 3.0 | 0.333 | 0.0 | 1.0 | 0.000 | 3.0 | 5.0 | 0.600 | 0.0 | 2.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 5.0 | -2.0 |
| 22 | 0041800237 | 1610612743 | DEN | Denver | 1627823 | Juancho Hernangomez | | DNP - Coach's Decision | None | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 23 | 0041800237 | 1610612743 | DEN | Denver | 1626168 | Trey Lyles | | DNP - Coach's Decision | None | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 24 | 0041800237 | 1610612743 | DEN | Denver | 202738 | Isaiah Thomas | | DNP - Coach's Decision | None | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25 | 0041800237 | 1610612743 | DEN | Denver | 1629020 | Jarred Vanderbilt | | DNP - Coach's Decision | None | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Filtering to get the starters
starters = box[box['start_position']!= '']
starters = starters[['team_id','team_abbreviation','player_id','player_name','start_position']]
starters
| | team_id | team_abbreviation | player_id | player_name | start_position |
|---|---|---|---|---|---|
| 0 | 1610612757 | POR | 203090 | Maurice Harkless | F |
| 1 | 1610612757 | POR | 202329 | Al-Farouq Aminu | F |
| 2 | 1610612757 | POR | 202683 | Enes Kanter | C |
| 3 | 1610612757 | POR | 203468 | CJ McCollum | G |
| 4 | 1610612757 | POR | 203081 | Damian Lillard | G |
| 13 | 1610612743 | DEN | 1628470 | Torrey Craig | F |
| 14 | 1610612743 | DEN | 200794 | Paul Millsap | F |
| 15 | 1610612743 | DEN | 203999 | Nikola Jokic | C |
| 16 | 1610612743 | DEN | 203914 | Gary Harris | G |
| 17 | 1610612743 | DEN | 1627750 | Jamal Murray | G |
Now pulling play by play data and some helper stuff to convert time strings to integers:
pbp_url = 'https://stats.nba.com/stats/playbyplayv2?EndPeriod=10&EndRange=55800&GameID=0041800237&RangeType=2&Season=2018-19&SeasonType=Playoffs&StartPeriod=1&StartRange=0'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)', 'x-nba-stats-origin': 'stats', 'x-nba-stats-token': 'true', 'Host':'stats.nba.com', 'Referer':'https://stats.nba.com/game/0021900306/'}
r= requests.get(pbp_url, headers=headers, timeout = 5)
data = json.loads(r.text)
pbp = pd.DataFrame.from_dict(data['resultSets'][0]['rowSet'])
col_names = data['resultSets'][0]['headers']
pbp.columns = col_names
pbp.columns = pbp.columns.str.lower()
pbp_times = pbp['pctimestring'].str.split(':', n=2, expand=True)
pbp_times[0] = pbp_times[0].astype(str).astype(int)
pbp_times[1] = pbp_times[1].astype(str).astype(int)
pbp['timeinseconds'] = (pbp_times[0]*60) + pbp_times[1]
pbp['play_elapsed_time'] = pbp['timeinseconds'].shift(1) - pbp['timeinseconds']
pbp['play_elapsed_time'] = pbp['play_elapsed_time'].fillna(0)
pbp['play_elapsed_time'] = np.where(pbp['period'] != pbp['period'].shift(1), 0, pbp['play_elapsed_time'])
pbp['total_elapsed_time'] = pbp.groupby(['game_id'])['play_elapsed_time'].cumsum()
pbp['max_time'] = pbp.groupby('game_id')['play_elapsed_time'].transform('sum')
pbp['time_remaining'] = pbp['max_time'] - pbp['total_elapsed_time']
pbp['scoremargin'] = np.where(pbp['scoremargin']=='TIE',0,pbp['scoremargin'])
pbp['scoremargin'] = pbp['scoremargin'].fillna(0).astype(int)
pbp.head()
| | game_id | eventnum | eventmsgtype | eventmsgactiontype | period | wctimestring | pctimestring | homedescription | neutraldescription | visitordescription | score | scoremargin | person1type | player1_id | player1_name | player1_team_id | player1_team_city | player1_team_nickname | player1_team_abbreviation | person2type | player2_id | player2_name | player2_team_id | player2_team_city | player2_team_nickname | player2_team_abbreviation | person3type | player3_id | player3_name | player3_team_id | player3_team_city | player3_team_nickname | player3_team_abbreviation | video_available_flag | timeinseconds | play_elapsed_time | total_elapsed_time | max_time | time_remaining |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0041800237 | 2 | 12 | 0 | 1 | 3:41 PM | 12:00 | None | None | None | None | 0 | 0 | 0 | None | NaN | None | None | None | 0 | 0 | None | NaN | None | None | None | 0 | 0 | None | NaN | None | None | None | 0 | 720 | 0.0 | 0.0 | 2880.0 | 2880.0 |
| 1 | 0041800237 | 4 | 10 | 0 | 1 | 3:41 PM | 12:00 | Jump Ball Millsap vs. Kanter: Tip to Harkless | None | None | None | 0 | 4 | 200794 | Paul Millsap | 1.610613e+09 | Denver | Nuggets | DEN | 5 | 202683 | Enes Kanter | 1.610613e+09 | Portland | Trail Blazers | POR | 5 | 203090 | Maurice Harkless | 1.610613e+09 | Portland | Trail Blazers | POR | 1 | 720 | 0.0 | 0.0 | 2880.0 | 2880.0 |
| 2 | 0041800237 | 7 | 6 | 26 | 1 | 3:41 PM | 11:45 | None | None | Aminu Offensive Charge Foul (P1.T1) (J.Goble) | None | 0 | 5 | 202329 | Al-Farouq Aminu | 1.610613e+09 | Portland | Trail Blazers | POR | 4 | 200794 | Paul Millsap | 1.610613e+09 | Denver | Nuggets | DEN | 1 | 0 | None | NaN | None | None | None | 1 | 705 | 15.0 | 15.0 | 2880.0 | 2865.0 |
| 3 | 0041800237 | 9 | 5 | 37 | 1 | 3:41 PM | 11:45 | None | None | Aminu Offensive Foul Turnover (P1.T1) | None | 0 | 5 | 202329 | Al-Farouq Aminu | 1.610613e+09 | Portland | Trail Blazers | POR | 0 | 0 | None | NaN | None | None | None | 1 | 0 | None | NaN | None | None | None | 1 | 705 | 0.0 | 15.0 | 2880.0 | 2865.0 |
| 4 | 0041800237 | 10 | 1 | 6 | 1 | 3:42 PM | 11:28 | Harris 2' Driving Layup (2 PTS) | None | None | 0 - 2 | 2 | 4 | 203914 | Gary Harris | 1.610613e+09 | Denver | Nuggets | DEN | 0 | 0 | None | NaN | None | None | None | 0 | 0 | None | NaN | None | None | None | 1 | 688 | 17.0 | 32.0 | 2880.0 | 2848.0 |
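The clock handling above can be checked in isolation. Here is a minimal sketch, assuming `pctimestring` values are always `MM:SS` strings counting down within a period. The helper names `clock_to_seconds` and `elapsed_between` are hypothetical and not part of the code in this post.

```python
def clock_to_seconds(clock):
    """Convert a 'MM:SS' period clock string to seconds left in the period."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + int(seconds)


def elapsed_between(prev_clock, curr_clock):
    """Seconds that ran off the clock between two events in the same period."""
    return clock_to_seconds(prev_clock) - clock_to_seconds(curr_clock)


# Clock values taken from the first rows of the play-by-play table above.
print(clock_to_seconds("12:00"))          # 720
print(elapsed_between("12:00", "11:45"))  # 15
print(elapsed_between("11:45", "11:28"))  # 17
```

These match the `timeinseconds` and `play_elapsed_time` columns in the first five rows of the table.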
Now to set up the classes that keep track of who is on and off the court. I'm creating a class for each player, team and lineup (the lineup class is called LineupStats), plus a Game class that walks through the play by play. The Team class also has helper methods that pull the roster and starters from the box score we pulled earlier:
class Player():
    def __init__(self, playerid, teamid, name):
        self.playerid = playerid
        self.teamid = teamid
        self.name = name
        self.oncourt = 0      # 1 while the player is on the floor
        self.court_time = 0   # seconds played so far

    def to_dict(self):
        return {
            'court_time' : self.court_time,
            'playerid' : self.playerid,
            'teamid' : self.teamid
        }
class Team():
    def __init__(self, teamid, gameid):
        self.court_time = 0
        self.roster = []    # every Player listed in the box score
        self.lineup = []    # the Players currently on the court
        self.teamid = teamid
        self.gameid = gameid
        self.starters = []
        self.lineups = []   # one dict per completed lineup stint

    def getRoster(self, box):
        for index,row in box.iterrows():
            if row['team_id'] == self.teamid:
                x = Player(playerid = row['player_id'],teamid = row['team_id'], name = row['player_name'])
                self.roster.append(x)

    def getStarters(self, box):
        for index,row in box.iterrows():
            if row['team_id'] == self.teamid and row['start_position'] != '':
                for p in self.roster:
                    if p.playerid == row['player_id']:
                        self.lineup.append(p)
                        self.starters.append(p)
                        p.oncourt = 1

    def initLineup(self):
        if self.lineup:
            self.lu = LineupStats(self.lineup, self.gameid, self.teamid)

    def Sub(self, sub_in, sub_out, event, time):
        # Save the outgoing lineup's stint before changing who is on the court
        self.resetLineup(event, time)
        # Loop over a copy so removing a player does not disturb the iteration
        for x in self.lineup[:]:
            if x.playerid == sub_out:
                x.oncourt = 0
                self.lineup.remove(x)
        for x in self.roster:
            if x.playerid == sub_in:
                x.oncourt = 1
                self.lineup.append(x)

    def quarterSubs(self, lineup):
        for x in self.lineup[:]:
            if x.playerid not in lineup:
                self.lineup.remove(x)
                x.oncourt = 0
        for l in lineup:
            for x in self.roster:
                if x.playerid == l and x.oncourt == 0:
                    self.lineup.append(x)
                    x.oncourt = 1

    def resetLineup(self, event, time):
        self.lineups.append(self.lu.to_dict(time))
        self.lu.pts = 0
        self.lu.drbd = 0
        self.lu.orbd = 0
        self.lu.stl = 0
        self.lu.blk = 0
        self.lu.ast = 0
        self.lu.fgm = 0
        self.lu.fga = 0
        self.lu.ftm = 0
        self.lu.fta = 0
        self.lu.pf = 0
        self.lu.tov = 0
        self.lu.lu_time = 0
        self.lu.diff = 0
        self.lu.fg3a = 0
        self.lu.fg3m = 0
        self.lu.poss = 0
        self.lu.event_start = event
        self.lu.time_on = time

    def to_dict(self):
        return {
            'teamid' : self.teamid,
            'gameid' : self.gameid,
            'court_time' : self.court_time,
            'starters' : [int(x.playerid) for x in self.starters]
        }
class LineupStats():
    def __init__(self, lineup, gameid,teamid):
        self.lineup = lineup
        self.pts = 0
        self.drbd = 0
        self.orbd = 0
        self.stl = 0
        self.blk = 0
        self.ast = 0
        self.fgm = 0
        self.fga = 0
        self.ftm = 0
        self.fta = 0
        self.pf = 0
        self.tov = 0
        self.lu_time = 0
        self.diff = 0
        self.fg3a = 0
        self.fg3m = 0
        self.event_start = 0
        self.time_on = 0
        self.time_off = 0
        self.event_end = 0
        self.gameid = gameid
        self.teamid = teamid
        self.poss = 0

    def to_dict(self, time_end):
        return {
            'lineup' : [int(x.playerid) for x in self.lineup],
            'pts' : self.pts,
            'drbd' : self.drbd,
            'stl' : self.stl,
            'blk' : self.blk,
            'ast' : self.ast,
            'fgm' : self.fgm,
            'fga' : self.fga,
            'ftm' : self.ftm,
            'fta' : self.fta,
            'orbd' : self.orbd,
            'pf' : self.pf,
            'tov' : self.tov,
            'fg3a' : self.fg3a,
            'fg3m' : self.fg3m,
            'lu_time' : self.lu_time,
            'diff' : self.diff ,
            'event_start' : self.event_start,
            'time_on' : self.time_on,
            'time_off' : time_end,
            'gameid' : self.gameid,
            'teamid' : self.teamid,
            'poss' : self.poss
        }
class Game():
    def __init__(self, hteam, ateam, gameid, pbp, box):
        self.hteam = hteam
        self.ateam = ateam
        self.time_elapsed = 0
        self.event = 1
        self.pbp = pbp
        self.box = box
        self.poss = 0

    def initRosters(self):
        self.hteam.getRoster(self.box)
        self.ateam.getRoster(self.box)

    def initStarters(self):
        self.hteam.getStarters(self.box)
        self.ateam.getStarters(self.box)

    def addCourtTime(self, time):
        for x in self.hteam.lineup:
            x.court_time += time
        for x in self.ateam.lineup:
            x.court_time += time

    def getQuarterStarters(self, quarter):
        # Short time windows just after the start of quarters 2, 3 and 4
        if quarter == 2:
            start_range = 7201
            end_range = 7493
        elif quarter == 3:
            start_range = 14410
            end_range = 14640
        elif quarter == 4:
            start_range = 21621
            end_range = 21913
        starters_url = 'https://stats.nba.com/stats/boxscoretraditionalv2?EndPeriod=14&GameID=0041800237&RangeType=2&Season=2018-19&SeasonType=Playoffs&StartPeriod=1&StartRange=' + str(start_range) + '&EndRange=' + str(end_range)
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)', 'x-nba-stats-origin': 'stats', 'x-nba-stats-token': 'true', 'Host':'stats.nba.com', 'Referer':'https://stats.nba.com/game/0021900306/'}
        r= requests.get(starters_url, headers=headers, timeout = 5)
        data = json.loads(r.text)
        starters = pd.DataFrame.from_dict(data['resultSets'][0]['rowSet'])
        col_names = data['resultSets'][0]['headers']
        starters.columns = col_names
        starters.columns = starters.columns.str.lower()
        hteam_starters = starters[starters['team_id']==self.hteam.teamid]
        ateam_starters = starters[starters['team_id']==self.ateam.teamid]
        hteam_starters = list(hteam_starters['player_id'])
        ateam_starters = list(ateam_starters['player_id'])
        self.hteam.quarterSubs(hteam_starters)
        self.ateam.quarterSubs(ateam_starters)

    def parseGame(self):
        self.initRosters()
        self.initStarters()
        self.hteam.initLineup()
        self.ateam.initLineup()
        prev_row_period = None  # period of the previous row, set at the end of each loop
        for index, row in self.pbp.iterrows():
            assert len(self.hteam.lineup)==5, 'home lineup not equal to 5'
            assert len(self.ateam.lineup)==5, 'away lineup not equal to 5'
            if row['pctimestring'] == '12:00' and row['period'] != 1 and row['period'] != prev_row_period:
                self.getQuarterStarters(int(row['period']))
            self.addCourtTime(row['play_elapsed_time'])
            if row['eventmsgtype'] == 1:
                if row['player1_team_id'] == self.hteam.teamid:
                    self.hteam.lu.diff += row['scoremargin']
                    self.ateam.lu.diff -= row['scoremargin']
                    self.hteam.lu.pts += row['scoremargin']
                else:
                    self.ateam.lu.diff += row['scoremargin']
                    self.hteam.lu.diff -= row['scoremargin']
                    self.ateam.lu.pts += row['scoremargin']
            if row['eventmsgtype'] == 8:
                # Use self rather than the global `game` object
                if row['player1_team_id'] == self.hteam.teamid:
                    self.hteam.Sub(sub_in=row['player2_id'], sub_out=row['player1_id'], event=row['eventnum'], time=row['time_remaining'])
                else:
                    self.ateam.Sub(sub_in=row['player2_id'], sub_out=row['player1_id'], event=row['eventnum'], time=row['time_remaining'])
            prev_row_period = row['period']
por = Team(gameid='0041800237', teamid=1610612757)
por.getRoster(box)
for x in por.roster:
    print(x.playerid, x.name)
203090 Maurice Harkless
202329 Al-Farouq Aminu
202683 Enes Kanter
203468 CJ McCollum
203081 Damian Lillard
1628380 Zach Collins
203918 Rodney Hood
203552 Seth Curry
202323 Evan Turner
203086 Meyers Leonard
1627746 Skal Labissiere
1627774 Jake Layman
1629014 Anfernee Simons
por.getStarters(box)
for x in por.starters:
    print(x.playerid, x.name)
for x in por.lineup:
    print(x.playerid, x.name)
203090 Maurice Harkless
202329 Al-Farouq Aminu
202683 Enes Kanter
203468 CJ McCollum
203081 Damian Lillard
203090 Maurice Harkless
202329 Al-Farouq Aminu
202683 Enes Kanter
203468 CJ McCollum
203081 Damian Lillard
Notice above that the starters for Portland are the same as the lineup for Portland, because so far we've only pulled in the rosters and starters from the box score. Trivial, but it's important to note where that data comes from before getting into the play by play. Now for the real meat and potatoes of how I'm parsing substitutions. Here's the function that runs everything within the Game class:
def parseGame(self):
    self.initRosters()
    self.initStarters()
    self.hteam.initLineup()
    self.ateam.initLineup()
    prev_row_period = None  # period of the previous row, set at the end of each loop
    for index, row in self.pbp.iterrows():
        assert len(self.hteam.lineup)==5, 'home lineup not equal to 5'
        assert len(self.ateam.lineup)==5, 'away lineup not equal to 5'
        if row['pctimestring'] == '12:00' and row['period'] != 1 and row['period'] != prev_row_period:
            self.getQuarterStarters(int(row['period']))
        self.addCourtTime(row['play_elapsed_time'])
        if row['eventmsgtype'] == 1:
            if row['player1_team_id'] == self.hteam.teamid:
                self.hteam.lu.diff += row['scoremargin']
                self.ateam.lu.diff -= row['scoremargin']
                self.hteam.lu.pts += row['scoremargin']
            else:
                self.ateam.lu.diff += row['scoremargin']
                self.hteam.lu.diff -= row['scoremargin']
                self.ateam.lu.pts += row['scoremargin']
        if row['eventmsgtype'] == 8:
            if row['player1_team_id'] == self.hteam.teamid:
                self.hteam.Sub(sub_in=row['player2_id'], sub_out=row['player1_id'], event=row['eventnum'], time=row['time_remaining'])
            else:
                self.ateam.Sub(sub_in=row['player2_id'], sub_out=row['player1_id'], event=row['eventnum'], time=row['time_remaining'])
        prev_row_period = row['period']
The first four lines just initialize the game. Then I loop through each row of the play by play.
if row['pctimestring'] == '12:00' and row['period'] != 1 and row['period'] != prev_row_period:
    game.getQuarterStarters(int(row['period']))
The NBA's PBP has a separate game event for the end and start of each period, so if the clock at the current row reads '12:00' and the period has just changed, I use the getQuarterStarters helper function to get the starters of that quarter from the NBA's box score query feature. Note that the game ID and the time ranges are hard-coded for this game and only cover quarters 2 through 4.
def getQuarterStarters(self, quarter):
    # Short time windows just after the start of quarters 2, 3 and 4
    if quarter == 2:
        start_range = 7201
        end_range = 7493
    elif quarter == 3:
        start_range = 14410
        end_range = 14640
    elif quarter == 4:
        start_range = 21621
        end_range = 21913
    starters_url = 'https://stats.nba.com/stats/boxscoretraditionalv2?EndPeriod=14&GameID=0041800237&RangeType=2&Season=2018-19&SeasonType=Playoffs&StartPeriod=1&StartRange=' + str(start_range) + '&EndRange=' + str(end_range)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)', 'x-nba-stats-origin': 'stats', 'x-nba-stats-token': 'true', 'Host':'stats.nba.com', 'Referer':'https://stats.nba.com/game/0021900306/'}
    r= requests.get(starters_url, headers=headers, timeout = 5)
    data = json.loads(r.text)
    starters = pd.DataFrame.from_dict(data['resultSets'][0]['rowSet'])
    col_names = data['resultSets'][0]['headers']
    starters.columns = col_names
    starters.columns = starters.columns.str.lower()
    hteam_starters = starters[starters['team_id']==self.hteam.teamid]
    ateam_starters = starters[starters['team_id']==self.ateam.teamid]
    hteam_starters = list(hteam_starters['player_id'])
    ateam_starters = list(ateam_starters['player_id'])
    self.hteam.quarterSubs(hteam_starters)
    self.ateam.quarterSubs(ateam_starters)
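The hard-coded ranges look like tenths of a second measured from tip-off (a 12-minute quarter is 7,200 tenths, so 7201 falls just after the start of the second quarter). Assuming that reading is right, and assuming 5-minute overtime periods, a window for any period could be computed rather than typed in. This is only a sketch: `period_start_range` is a hypothetical helper, and the 30-second default window is my own choice, not the values used above.

```python
QUARTER_TENTHS = 12 * 60 * 10   # 7200 tenths of a second per regulation quarter
OVERTIME_TENTHS = 5 * 60 * 10   # 3000 tenths per overtime period (assumed)


def period_start_range(period, window_tenths=300):
    """Return an assumed (start_range, end_range) window just after a period begins."""
    if period < 1:
        raise ValueError("period must be 1 or greater")
    if period <= 4:
        start = (period - 1) * QUARTER_TENTHS
    else:
        start = 4 * QUARTER_TENTHS + (period - 5) * OVERTIME_TENTHS
    # Begin one tenth after the period boundary, like the 7201 used for Q2.
    return start + 1, start + window_tenths


for period in (2, 3, 4, 5):
    print(period, period_start_range(period))
# 2 (7201, 7500)
# 3 (14401, 14700)
# 4 (21601, 21900)
# 5 (28801, 29100)
```

A wider window makes it more likely that every player on the floor shows up in the range query, but it also raises the chance of catching an early substitution, so the result may need a sanity check that each team has exactly five players.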
If the starters of the next quarter are different from the lineup that ended the previous quarter, the quarterSubs function from the Team class swaps in the correct players:
def quarterSubs(self, lineup):
    for x in self.lineup[:]:
        if x.playerid not in lineup:
            self.lineup.remove(x)
            x.oncourt = 0
    for l in lineup:
        for x in self.roster:
            if x.playerid == l and x.oncourt == 0:
                self.lineup.append(x)
                x.oncourt = 1
Since no game time elapses between the end of one quarter and the start of the next, I don't need to change any lineup or player statistics.
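The between-quarter reconciliation boils down to two set differences: who has to leave and who has to come on. A minimal sketch on plain player ids follows. `reconcile_lineup` is a hypothetical helper, and the second-quarter lineup below is made up for illustration rather than taken from the game.

```python
def reconcile_lineup(current, period_starters):
    """Return (players_out, players_in) needed to turn `current` into `period_starters`."""
    current_set = set(current)
    starters_set = set(period_starters)
    players_out = sorted(current_set - starters_set)
    players_in = sorted(starters_set - current_set)
    return players_out, players_in


# Portland player ids from the box score; the Q2 lineup is illustrative only.
end_of_q1 = [203090, 202329, 202683, 203468, 203081]
start_of_q2 = [203090, 1628380, 203918, 203468, 203552]

out_ids, in_ids = reconcile_lineup(end_of_q1, start_of_q2)
print(out_ids)  # [202329, 202683, 203081]
print(in_ids)   # [203552, 203918, 1628380]
```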
game.addCourtTime(row['play_elapsed_time'])
Just adding playing time for each player from the previous game event to the current.
The play-by-play from the NBA's API has an 'eventmsgtype' column that has a different key for each event on the court. For our purposes, 1 = made basket and 8 = substitution. Now we can check whether there was a change in the score so we can update the scoring margin for each lineup. One caveat: 'scoremargin' in the raw feed is the running home-minus-visitor margin, not the points scored on the play, so the diff and pts values accumulated here should be read as a rough first pass rather than true per-lineup totals:
if row['eventmsgtype'] == 1:
    if row['player1_team_id'] == game.hteam.teamid:
        game.hteam.lu.diff += row['scoremargin']
        game.ateam.lu.diff -= row['scoremargin']
        game.hteam.lu.pts += row['scoremargin']
    else:
        game.ateam.lu.diff += row['scoremargin']
        game.hteam.lu.diff -= row['scoremargin']
        game.ateam.lu.pts += row['scoremargin']
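Because the feed reports the running margin (home minus visitor) rather than the points on each play, a per-play change has to be derived by differencing consecutive scoring rows. The following is a minimal sketch under that assumption, on a plain list instead of the DataFrame. `margin_changes` is a hypothetical helper and the sequence of margins is invented for the example.

```python
def margin_changes(running_margins):
    """Turn running home-minus-visitor margins (one per scoring play) into per-play changes.

    A positive change means the home team scored that many points;
    a negative change means the visiting team did.
    """
    changes = []
    previous = 0  # the game starts tied
    for margin in running_margins:
        changes.append(margin - previous)
        previous = margin
    return changes


# Hypothetical scoring sequence: home 2, visitor 3, home 2, home 1 (free throw).
running = [2, -1, 1, 2]
print(margin_changes(running))  # [2, -3, 2, 1]
```

Only rows where the score actually changed should be fed in, since the cleaning step earlier fills non-scoring rows with a margin of 0. Free throws are logged under a different event type than made baskets, so they would need to be included as well for the totals to add up.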
Finally, parsing the actual substitutions.
if row['eventmsgtype'] == 8:
    if row['player1_team_id'] == game.hteam.teamid:
        game.hteam.Sub(sub_in=row['player2_id'], sub_out=row['player1_id'], event=row['eventnum'], time=row['time_remaining'])
    else:
        game.ateam.Sub(sub_in=row['player2_id'], sub_out=row['player1_id'], event=row['eventnum'], time=row['time_remaining'])
def Sub(self, sub_in, sub_out, event, time):
    # Save the outgoing lineup's stint before changing who is on the court
    self.resetLineup(event, time)
    # Loop over a copy so removing a player does not disturb the iteration
    for x in self.lineup[:]:
        if x.playerid == sub_out:
            x.oncourt = 0
            self.lineup.remove(x)
    for x in self.roster:
        if x.playerid == sub_in:
            x.oncourt = 1
            self.lineup.append(x)
Each substitution in the Game class calls the Sub function from our Team class. In the PBP data, each substitution row has columns for the players involved: player1_id is the player leaving the court and player2_id is the player coming on. We first reset the team's lineup stats, because a substitution marks the end of that specific lineup's time on the court. We then search the Player objects in our lineup for the player getting subbed out, use Python's list remove method to take that player out, and append the incoming Player object from the roster to the lineup.
Here's how this looks in python if we sub in Zach Collins for Enes Kanter:
por = Team(gameid='0041800237', teamid=1610612757)
den = Team(gameid='0041800237', teamid=1610612743)
game = Game(den, por, '0041800237',pbp,box)
game.initRosters()
game.initStarters()
game.hteam.initLineup()
game.ateam.initLineup()
for x in game.ateam.lineup:
    print(x.playerid, x.name)
203090 Maurice Harkless
202329 Al-Farouq Aminu
202683 Enes Kanter
203468 CJ McCollum
203081 Damian Lillard
game.ateam.Sub(sub_in=1628380, sub_out=202683, event=1, time=200)
for x in game.ateam.lineup:
    print(x.playerid, x.name)
203090 Maurice Harkless
202329 Al-Farouq Aminu
203468 CJ McCollum
203081 Damian Lillard
1628380 Zach Collins
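The same swap can be expressed on a plain list of ids, which is handy for testing the logic without building Team and Player objects. This is a sketch, not the method used in the classes above: `swap_player` is a hypothetical function that builds a new list instead of removing items in place, and it raises an error on an inconsistent substitution instead of silently ignoring it.

```python
def swap_player(lineup, sub_out, sub_in):
    """Return a new lineup with `sub_out` removed and `sub_in` appended."""
    if sub_out not in lineup:
        raise ValueError(f"player {sub_out} is not on the court")
    if sub_in in lineup:
        raise ValueError(f"player {sub_in} is already on the court")
    new_lineup = [pid for pid in lineup if pid != sub_out]
    new_lineup.append(sub_in)
    return new_lineup


# Portland starters, then Zach Collins (1628380) in for Enes Kanter (202683).
starters = [203090, 202329, 202683, 203468, 203081]
after_sub = swap_player(starters, sub_out=202683, sub_in=1628380)
print(after_sub)  # [203090, 202329, 203468, 203081, 1628380]
```

The resulting order matches the lineup printed above, with the incoming player at the end of the list.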
por = Team(gameid='0041800237', teamid=1610612757)
den = Team(gameid='0041800237', teamid=1610612743)
game = Game(den, por, '0041800237',pbp,box)
game.parseGame()
for x in game.ateam.roster:
    print(x.name, x.court_time)
Maurice Harkless 1008.0
Al-Farouq Aminu 428.0
Enes Kanter 2379.0
CJ McCollum 2717.0
Damian Lillard 2725.0
Zach Collins 1397.0
Rodney Hood 1211.0
Seth Curry 979.0
Evan Turner 1152.0
Meyers Leonard 404.0
Skal Labissiere 0
Jake Layman 0
Anfernee Simons 0
A quick check of the box score shows that Damian Lillard played 45 minutes and 25 seconds. Our calculated court time in the Player class shows 2725 seconds, which is exactly 45 minutes and 25 seconds!
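To make that comparison less manual, the seconds can be converted back to the box score's `MM:SS` format. A small sketch, with `seconds_to_clock` as a hypothetical helper name:

```python
def seconds_to_clock(total_seconds):
    """Format a number of seconds as 'M:SS', the style used in the box score 'min' column."""
    minutes, seconds = divmod(int(total_seconds), 60)
    return f"{minutes}:{seconds:02d}"


# Court times printed above, compared against the box score by eye.
print(seconds_to_clock(2725.0))  # 45:25  (Lillard, box score 45:25)
print(seconds_to_clock(2379.0))  # 39:39  (Kanter, box score 39:39)
print(seconds_to_clock(428.0))   # 7:08   (Aminu, box score 7:08)
print(seconds_to_clock(1008.0))  # 16:48  (Harkless, box score 16:47)
```

Most players match exactly. Harkless comes out one second off, which may come down to how the feed rounds clock times.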
Hopefully you find this methodology useful. I've tinkered with this problem on-and-off for a while now and most of the solutions I tried (e.g. doing this in Pandas) just weren't very robust and had a lot of problems.