Name: Rachel Yun¶

Date: 08/21/2025¶

logo

Introduction¶

The purpose of this project is to gauge your technical skills and problem solving ability by working through something similar to a real NBA data science project. You will work your way through this R Markdown document, answering questions as you go along. Please begin by adding your name to the "author" key in the YAML header. When you're finished with the document, come back and type your answers into the answer key at the top. Please leave all your work below and have your answers where indicated below as well. Please note that we will be reviewing your code so make it clear, concise, and avoid long printouts. Feel free to add in as many new code chunks as you'd like.

Remember that we will be grading the quality of your code and visuals alongside the correctness of your answers. Please try to use the tidyverse as much as possible (instead of base R and explicit loops). Please do not bring in any outside data, and use the provided data as truth (for example, some "home" games have been played at secondary locations, including TOR's entire 2020-21 season. These are not reflected in the data and you do not need to account for this.) Note that the OKC and DEN 2024-25 schedules in schedule_24_partial.csv intentionally include only 80 games, as the league holds 2 games out for each team in the middle of December due to unknown NBA Cup matchups. Do not assign specific games to fill those two slots.

Note:

Throughout this document, any season column represents the year each season started. For example, the 2015-16 season will be in the dataset as 2015. We may refer to a season by just this number (e.g. 2015) instead of the full text (e.g. 2015-16).

Answers¶

Part 1¶

Question 1: 26 4-in-6 stretches in OKC's draft schedule.

Question 2: 25.1 4-in-6 stretches on average.

Question 3:

  • Most 4-in-6 stretches on average: CHA (28.11)
  • Fewest 4-in-6 stretches on average: NYK (22.19)

Question 4: This is a written question. Please leave your response in the document under Question 4.

Question 5:

  • BKN Defensive eFG%: 54.5%
  • When opponent on a B2B: 53.6%

Part 2¶

Please show your work in the document, you don't need anything here.

Part 3¶

Question 8:

  • Most Helped by Schedule: POR (9.0 wins)
  • Most Hurt by Schedule: CLE (-8.4 wins)

Setup and Data¶

In [89]:
import pandas as pd
import numpy as np
# Note, you will likely have to change these paths. If your data is in the same folder as this project, 
# the paths will likely be fixed for you by deleting ../../Data/schedule_project/ from each string.
schedule = pd.read_csv("schedule.csv")
draft_schedule = pd.read_csv("schedule_24_partial.csv")
locations = pd.read_csv("locations.csv")
game_data = pd.read_csv("team_game_data.csv")

Part 1 -- Schedule Analysis¶

In this section, you're going to work to answer questions using NBA scheduling data.

Question 1¶

QUESTION: How many times are the Thunder scheduled to play 4 games in 6 nights in the provided 80-game draft of the 2024-25 season schedule? (Note: clarification, the stretches can overlap, the question is really “How many games are the 4th game played over the past 6 nights?”)

In [90]:
#Data Manipulation : want only OKC Games + sort dates into chronological order
okc = draft_schedule[draft_schedule['team'] == 'OKC'].copy()
okc["gamedate"] = pd.to_datetime(okc["gamedate"])
okc = okc.sort_values("gamedate").reset_index(drop=True)
dates = okc["gamedate"].values 

#find 4 games in 6 nights
left_idxs = np.searchsorted(dates, dates - np.timedelta64(5, "D"), side="left") #look back the 5 days
counts_last_6_nights = np.arange(len(dates)) - left_idxs + 1
mask_4_in_6 = counts_last_6_nights >= 4 #boolean for 4 games in 6 nights
count_4_in_6 = int(mask_4_in_6.sum())
count_4_in_6
Out[90]:
26

ANSWER 1:

26 4-in-6 stretches in OKC's draft schedule.

Question 2¶

QUESTION: From 2014-15 to 2023-24, what is the average number of 4-in-6 stretches for a team in a season? Adjust each team/season to per-82 games before taking your final average.

In [91]:
# gamedate is datetime and data sorted
schedule["gamedate"] = pd.to_datetime(schedule["gamedate"])

all_per82 = [] #store count of 4-in-6 games in a list

for (season, team), group in schedule.groupby(["season", "team"]): #group by season, team ex) OKC, 2020
    #Q1 setup
    dates = group.sort_values("gamedate")["gamedate"].values
    left = np.searchsorted(dates, dates - np.timedelta64(5, "D"), side="left")
    counts = np.arange(len(dates)) - left + 1
    four_in_six = (counts >= 4).sum()
    per82 = four_in_six * 82 / len(dates)
    if 2014 <= season <= 2023:   # keep only 2014–15 through 2023–24
        all_per82.append((team,per82)) #Q3 DF

#code for question 3 (makes my life easier)
df = pd.DataFrame(all_per82, columns=["team", "per82"])

avg_per82 = np.mean([x[1] for x in all_per82])
avg_per82
Out[91]:
25.10330883503872

ANSWER 2:

25.1 4-in-6 stretches on average.

Question 3¶

QUESTION: Which of the 30 NBA teams has had the highest average number of 4-in-6 stretches between 2014-15 and 2023-24? Which team has had the lowest average? Adjust each team/season to per-82 games.

In [92]:
team_avg = df.groupby("team", as_index=False)["per82"].mean()
high = team_avg.loc[team_avg["per82"].idxmax()]
low = team_avg.loc[team_avg["per82"].idxmin()]
high, low
Out[92]:
(team           CHA
 per82    28.109188
 Name: 3, dtype: object,
 team           NYK
 per82    22.186111
 Name: 19, dtype: object)

ANSWER 3:

  • Most 4-in-6 stretches on average: CHA (28.11)
  • Fewest 4-in-6 stretches on average: NYK (22.19)

Question 4¶

QUESTION: Is the difference between most and least from Q3 surprising, or do you expect that size difference is likely to be the result of chance?

ANSWER 4:

In Q3, I observed that the difference between the team with the highest and lowest average number of 4-in-6 stretches was about six games per season, or approximately ±3 games around the league average. This level of variation is not unexpected, as it can be attributed to a combination of factors such as regional differences in travel demands, scheduling constraints faced by teams that share arenas, and natural variation arising from the league’s complex scheduling process.

Question 5¶

QUESTION: What was BKN's defensive eFG% in the 2023-24 season? What was their defensive eFG% that season in situations where their opponent was on the second night of back-to-back?

In [93]:
#Formula for eFG: (FGM + 0.5 * 3PM) / FGA. Found through https://www.nba.com/bucks/features/boeder-120917

schedule["gamedate"] = pd.to_datetime(schedule["gamedate"])
game_data["gamedate"] = pd.to_datetime(game_data["gamedate"])

#Data Manipulation for 2023, BKN
opp_off = game_data[(game_data["season"] == 2023) & (game_data["def_team"] == "BKN")].copy() #BKN is the defensive team

# BKN's defensive eFG%
opp_off["efg"] = (opp_off["fgmade"] + 0.5 * opp_off["fg3made"]) / opp_off["fgattempted"]
def_efg = opp_off["efg"].mean()

# opp back-to-back
team_games = schedule.groupby("team")["gamedate"].apply(set).to_dict()

opp_off["is_b2b"] = opp_off.apply(
    lambda row: (row["gamedate"] - pd.Timedelta(days=1)) in team_games[row["off_team"]],
    axis=1
)

def_efg_b2b = opp_off.loc[opp_off["is_b2b"], "efg"].mean()

def_efg, def_efg_b2b
Out[93]:
(0.5450563595207142, 0.5363431117395119)

ANSWER 5:

  • BKN Defensive eFG%: 54.5%
  • When opponent on a B2B: 53.6%

Part 2 -- Trends and Visualizations¶

This is an intentionally open ended section, and there are multiple approaches you could take to have a successful project. Feel free to be creative. However, for this section, please consider only the density of games and travel schedule, not the relative on-court strength of different teams.

Question 6¶

QUESTION: Please identify at least 2 trends in scheduling over time. In other words, how are the more recent schedules different from the schedules of the past? Please include a visual (plot or styled table) highlighting or explaining each trend and include a brief written description of your findings.

In [94]:
import matplotlib.pyplot as plt


schedule["gamedate"] = pd.to_datetime(schedule["gamedate"])

def four_in_six_count(dates):
    left = np.searchsorted(dates, dates - np.timedelta64(5,"D"), side="left")
    counts = np.arange(len(dates)) - left + 1
    return (counts >= 4).sum()

season_trends = []
for (season, team), g in schedule.groupby(["season","team"]):
    dates = np.sort(g["gamedate"].values)
    season_trends.append({"season": season, "team": team, "four_in_six": four_in_six_count(dates)})

df_trends = pd.DataFrame(season_trends)
season_avg = df_trends.groupby("season")["four_in_six"].mean().reset_index()

plt.figure(figsize=(8,5))
plt.plot(season_avg["season"], season_avg["four_in_six"], marker="o")
plt.title("Average 4-in-6 Games per Team by Season")
plt.xlabel("Season (year = start of season)")
plt.ylabel("Avg 4-in-6 per Team")
plt.grid(True)
plt.show()

# count games per team-season
games_per_season = schedule.groupby(["season","team"]).size().reset_index(name="games")

# average across teams for each season
avg_games = games_per_season.groupby("season")["games"].mean().reset_index()

plt.figure(figsize=(8,5))
plt.plot(avg_games["season"], avg_games["games"], marker="o", linewidth=2)
plt.title("Average Number of Games per Team by Season")
plt.xlabel("Season (start year)")
plt.ylabel("Games per Team")
plt.grid(True)
plt.show()
No description has been provided for this image
No description has been provided for this image

ANSWER 6:

Trend 1: Compressed Scheduling Has Declined Over Time

The chart shows that in the mid-2010s, NBA teams frequently faced highly compressed schedules. Between 2014 and 2016, the average team had around 28–30 four-in-six stretches per season. That meant teams were playing four games within a six-day window nearly thirty times across an 82-game season which is a very demanding pace.

Starting in 2016–17 and 2017–18, however, there is a sharp downward shift. By the 2018–19 season, the average number of four-in-six stretches had dropped to below 20 per team. This decline coincides with the NBA’s deliberate scheduling reforms, including the elimination of four-games-in-five-nights (4-in-5s) and the decision to begin the regular season earlier in mid-October. These changes extended the calendar while keeping the total at 82 games, reducing the need to stack multiple games in tight clusters.

In the seasons following the reforms, the number of compressed stretches stabilized. Post-pandemic, most teams experienced around 23–24 four-in-six stretches per season, far below the nearly 30 per season that was common earlier in the decade. This demonstrates a sustained improvement in schedule balance.

Takeaway: The NBA has successfully reduced schedule compression over the last decade. By stretching the season calendar and redistributing games more evenly, the league lowered the burden from nearly 30 stretches per season in 2014–2016 to the low 20s after 2019, making the schedule more manageable for players.

Trend 2: Pandemic Disruptions Created a Temporary Spike

While the overall trend shows steady improvement, the 2019–20 season is a dramatic outlier. That year, the average number of four-in-six stretches jumped back up to nearly 30 per team, almost identical to the heavy compression levels of 2014–2016. This spike did not reflect a reversal of NBA policy but was instead a direct result of the COVID-19 pandemic.

When the pandemic forced a suspension of the regular season, the NBA later resumed play under unusual conditions, including a shortened window and the “bubble” restart in Orlando. These disruptions compressed the schedule and forced teams to play more frequently over shorter spans, bringing back the same heavy game density that the league had worked to eliminate.

In the following seasons, however, the schedule quickly normalized. By 2020–21 and 2021–22, the average fell back to around 23–24 per team, showing that the spike was temporary. This return to the reformed baseline indicates that the long-term downward trend in compressed scheduling remained intact, and the pandemic spike stands out as a unique historical anomaly rather than a reversal of progress.

Takeaway: The 2019–20 season demonstrates how external disruptions can override structural reforms. Teams briefly faced nearly 30 compressed stretches per season, but once conditions stabilized, the schedule returned to the mid-20s range, reinforcing the long-term improvements the NBA has made in reducing extreme scheduling burdens.

Question 7¶

QUESTION: Please design a plotting tool to help visualize a team’s schedule for a season. The plot should cover the whole season and should help the viewer contextualize and understand a team’s schedule, potentially highlighting periods of excessive travel, dense blocks of games, or other schedule anomalies. If you can, making the plots interactive (for example through the plotly package) is a bonus.

Please use this tool to plot OKC and DEN's provided 80-game 2024-25 schedules.

ANSWER 7:

In [95]:
from matplotlib.lines import Line2D

#HAVERSINE FORMULA (CHATGPT)
def haversine(lat1, lon1, lat2, lon2, miles=True):
    R = 6371.0088 * (0.621371 if miles else 1.0)
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    return 2 * R * np.arcsin(np.sqrt(
        np.sin((lat2-lat1)/2)**2 + np.cos(lat1)*np.cos(lat2)*np.sin((lon2-lon1)/2)**2
    ))

def plot_team_scatter(team_code, use_miles=True):
#DATA MANIPULATION (chronological order + reset the index)
    df = draft_schedule.query("team == @team_code").copy()
    df["gamedate"] = pd.to_datetime(df["gamedate"])
    df = df.sort_values("gamedate").reset_index(drop=True)

#LAT AND LON
    loc = locations.set_index("team")[["latitude","longitude"]]
    home_lat, home_lon = loc.loc[team_code]
    #set lat, lon for the game based on home and away
    df["venue_lat"] = np.where(df["home"]==1, home_lat, df["opponent"].map(loc["latitude"]))
    df["venue_lon"] = np.where(df["home"]==1, home_lon, df["opponent"].map(loc["longitude"]))
    #compute distance and shift one down + fill the first game as 0 distance and rest day as 1 for the first game (asusmption)
    df["travel"] = haversine(df["venue_lat"].shift(), df["venue_lon"].shift(),
                             df["venue_lat"], df["venue_lon"], miles=use_miles).fillna(0)
    df["rest_days"] = df["gamedate"].diff().dt.days.fillna(1)

#4-IN-6 DETECTION
    #Pulled from earlier Q's
    dates = df["gamedate"].values.astype("datetime64[D]")
    left = np.searchsorted(dates, dates - np.timedelta64(5, "D"), side="left")
    df["is_4in6"] = (np.arange(len(dates)) - left + 1) >= 4


#VISUAL ENCODINGS (CHATGPT)
    sizes = (2 - np.clip(df["rest_days"],0,2))*120 + 30
    colors = np.where(df["home"]==1, "tab:blue", "tab:orange")

    fig, ax = plt.subplots(figsize=(12,5))
    ax.scatter(df["gamedate"], df["travel"], s=sizes, c=colors, alpha=0.7, edgecolor="k", lw=0.3)
    if df["is_4in6"].any():
        ax.scatter(df.loc[df["is_4in6"],"gamedate"], df.loc[df["is_4in6"],"travel"],
                   s=sizes[df["is_4in6"]]*1.1, marker="*", facecolors="none",
                   edgecolors="crimson", lw=1.2, label="4-in-6")
#LEGEND
    legend = [
    Line2D([0],[0], marker='o', color='w', label='Home',
           markerfacecolor='tab:blue', markeredgecolor='k', markersize=8, linestyle='None'),
    Line2D([0],[0], marker='o', color='w', label='Away',
           markerfacecolor='tab:orange', markeredgecolor='k', markersize=8, linestyle='None'),
    Line2D([0],[0], marker='*', color='crimson', label='4-in-6',
           markerfacecolor='none', markeredgecolor='crimson', markersize=12, linestyle='None'),
            ]
    
    ax.legend(handles=legend, loc="upper left", framealpha=0.95, title="Encoding")
    ax.text(1.0, 1.02, "Point size = shorter rest", transform=ax.transAxes, ha="right", va="bottom", fontsize=9, color="#444")
    ax.set(title=f"{team_code} Schedule — Travel vs Date", xlabel="Date",
           ylabel=f"Travel from previous game ({'miles' if use_miles else 'km'})")
    fig.autofmt_xdate(); plt.tight_layout(); plt.show()

#PLOT
plot_team_scatter("OKC")
plot_team_scatter("DEN")
No description has been provided for this image
No description has been provided for this image

Question 8¶

QUESTION: Using your tool, what is the best and worst part of OKC’s 2024-25 draft schedule? Please give your answer as a short brief to members of the front office and coaching staff to set expectations going into the season. You can include context from past schedules.

ANSWER 8:

Briefing: OKC 2024–25 Draft Schedule

Best Part: While OKC will log significant mileage this season, the schedule design helps reduce strain by limiting 4-in-6 stretches during long travel runs. In other words, even when the team is flying across the country, they generally have at least some rest buffer built in. The most favorable stretches come in December and April, where homestands and multi-day breaks lower cumulative fatigue and create opportunities for practice and recovery.

Worst Part: The main challenge is the cluster of 4-in-6 games concentrated from January through early March. These dense blocks overlap with already high travel demands, creating a tough mid-season window where player fatigue and minor injuries are most likely. Compared with past seasons where OKC struggled through mid-season travel grinds, this period represents the biggest risk for a performance dip.

Implications for Coaching and Staff:

  1. Expect high overall travel load across the season, even if spread out more evenly than past years.
  2. Pay extra attention to January–March, where the combination of travel and 4-in-6 density peaks.
  3. Use rotation depth and targeted rest during this window to manage workloads. Take advantage of the lighter travel and longer rest periods in April to prepare for the postseason.

Part 3 -- Modeling¶

Question 9¶

QUESTION: Please estimate how many more/fewer regular season wins each team has had due to schedule-related factors from 2019-20 though 2023-24. Your final answer should have one number for each team, representing the total number of wins (not per 82, and not a per-season average). You may consider the on-court strength of the scheduled opponents as well as the impact of travel/schedule density. Please include the teams and estimates for the most helped and most hurt in the answer key.

If you fit a model to help answer this question, please write a paragraph explaining your model, and include a simple model diagnostic (eg a printed summary of a regression, a variable importance plot, etc).

In [96]:
print(game_data.columns.tolist())
['season', 'gametype', 'nbagameid', 'gamedate', 'offensivenbateamid', 'off_team_name', 'off_team', 'off_home', 'off_win', 'defensivenbateamid', 'def_team_name', 'def_team', 'def_home', 'def_win', 'fg2made', 'fg2missed', 'fg2attempted', 'fg3made', 'fg3missed', 'fg3attempted', 'fgmade', 'fgmissed', 'fgattempted', 'ftmade', 'ftmissed', 'ftattempted', 'reboffensive', 'rebdefensive', 'reboundchance', 'assists', 'stealsagainst', 'turnovers', 'blocksagainst', 'defensivefouls', 'offensivefouls', 'shootingfoulsdrawn', 'possessions', 'points', 'shotattempts', 'andones', 'shotattemptpoints']
In [97]:
#STEP 1: CREATE AGG
def _haversine(lat1, lon1, lat2, lon2, miles=True):
    R_km = 6371.0088
    R = R_km * (0.621371 if miles else 1.0)
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1)*np.cos(lat2)*np.sin(dlon/2)**2
    return 2 * R * np.arcsin(np.sqrt(a))

def _four_in_six(date_series: pd.Series) -> pd.Series:
    d = pd.to_datetime(date_series, errors="coerce").values.astype("datetime64[D]")
    left = np.searchsorted(d, d - np.timedelta64(5, "D"), side="left")
    cnt = np.arange(len(d)) - left + 1
    return pd.Series(cnt >= 4, index=date_series.index)

# ADD SEASON
sched = schedule.copy()
sched["gamedate"] = pd.to_datetime(sched["gamedate"], errors="coerce")
sched["season"] = (sched["gamedate"] - pd.Timedelta(days=185)).dt.year
sched = sched.sort_values(["team", "gamedate"]).reset_index(drop=True)

# COMPUTE VENUE
loc = locations.set_index("team")[["latitude","longitude"]]
home_lat = sched["team"].map(loc["latitude"])
home_lon = sched["team"].map(loc["longitude"])
opp_lat  = sched["opponent"].map(loc["latitude"])
opp_lon  = sched["opponent"].map(loc["longitude"])

sched["venue_lat"] = np.where(sched["home"].eq(1), home_lat, opp_lat)
sched["venue_lon"] = np.where(sched["home"].eq(1), home_lon, opp_lon)

# TRAVEL MILES
sched["prev_lat"] = sched.groupby("team")["venue_lat"].shift(1)
sched["prev_lon"] = sched.groupby("team")["venue_lon"].shift(1)
sched["travel_miles"] = np.where(
    sched["prev_lat"].notna(),
    _haversine(sched["prev_lat"], sched["prev_lon"], sched["venue_lat"], sched["venue_lon"], miles=True),
    0.0
)

# B2B and REST DAYS
sched["rest_days"] = sched.groupby("team")["gamedate"].diff().dt.days
sched["is_b2b"] = sched["rest_days"].fillna(np.inf).eq(0)


sched["is_4in6"] = sched.groupby("team", group_keys=False)["gamedate"].apply(_four_in_six)

# BUILD AGG
agg = (sched.groupby(["team","season"])
       .agg(
           travel_miles=("travel_miles","sum"),
           b2b_games=("is_b2b","sum"),
           four_in_six=("is_4in6","sum"),
           games=("opponent","size")
       ).reset_index())
In [98]:
# STEP 2: CALCULATE ON-COURT STRENGTH OF SCHEDULED OPPONENTS USING SRS = MOV + SOS

# SETUP 
gd = game_data.copy()
TEAM_COL, OPP_COL, DATE_COL, GAME_ID, PTS_COL = "off_team", "def_team", "gamedate", "nbagameid", "points"
gd[DATE_COL] = pd.to_datetime(gd[DATE_COL], errors="coerce")

# BUILD OPP VIEW + TEAMS
opp_view = gd[[GAME_ID, TEAM_COL, "season", DATE_COL, PTS_COL]].rename(
    columns={TEAM_COL: OPP_COL, PTS_COL: "opp_points"}
)

# MERGE TEAMS
gd_pair = gd.merge(
    opp_view,
    on=[OPP_COL, "season", DATE_COL, GAME_ID],
    how="inner",
    validate="m:1"
)

# CALC MOV + WINS
gd_pair["mov"] = gd_pair[PTS_COL] - gd_pair["opp_points"]
gd_pair["win"] = (gd_pair["mov"] > 0).astype(int)

# TEAM-SEASON LEVEL STATS 
# Average MOV
mov_team = (gd_pair.groupby([TEAM_COL, "season"])["mov"]
            .mean().rename("avg_mov").reset_index())

# AVG TEAM WINS
wins_team = (gd_pair.groupby([TEAM_COL, "season"])["win"]
             .sum().rename("wins").reset_index())

# SOS proxy: average opponent MOV faced
gd_with_opp_mov = gd_pair.merge(
    mov_team.rename(columns={TEAM_COL: OPP_COL, "avg_mov": "opp_avg_mov"}),
    on=[OPP_COL, "season"], how="left"
)
sos_team = (gd_with_opp_mov.groupby([TEAM_COL, "season"])["opp_avg_mov"]
            .mean().rename("sos").reset_index())

# SRS = MOV + SOS
srs = mov_team.merge(sos_team, on=[TEAM_COL, "season"], how="left")
srs["srs"] = srs["avg_mov"] + srs["sos"]

# Combine MOV, SRS, and wins
srs = srs.merge(wins_team, on=[TEAM_COL, "season"], how="left")

print("\nSRS preview (with wins):\n", srs.head())

# MERGE INTO AGG
agg = agg.merge(
    srs[[TEAM_COL, "season", "srs", "wins"]].rename(columns={TEAM_COL: "team"}),
    on=["team", "season"],
    how="left"
)
if "srs_x" in agg.columns and "srs_y" in agg.columns:
    agg["srs"] = agg["srs_x"].combine_first(agg["srs_y"])
    agg = agg.drop(columns=["srs_x", "srs_y"])
    
SRS preview (with wins):
   off_team  season   avg_mov       sos       srs  wins
0      ATL    2014  5.426829 -0.585068  4.841761    60
1      ATL    2015  3.609756 -0.083730  3.526026    48
2      ATL    2016 -0.853659 -0.306811 -1.160470    43
3      ATL    2017 -5.487805  0.210738 -5.277067    24
4      ATL    2018 -6.048780  0.056663 -5.992118    29
In [99]:
#STEP 3: OLS TIME

features = ["srs", "travel_miles", "b2b_games", "four_in_six"]
model_df = agg.dropna(subset=["wins"] + features).copy()

Y = model_df["wins"].values.astype(float)
X = model_df[features].values
X = np.c_[np.ones(len(X)), X]   # add intercept


# 2. OLS Estimate 
B = np.linalg.pinv(X.T @ X) @ (X.T @ Y)

predictors = ["Intercept"] + features
print("OLS Regression Results \n")
for i, b in enumerate(B):
    print(f"{predictors[i]:>12}: {b: .4f}")

# Predictions
Y_hat = X @ B

# good of fit
mse = np.mean((Y - Y_hat)**2)
ss_total = np.sum((Y - Y.mean())**2)
ss_resid = np.sum((Y - Y_hat)**2)
R2 = 1 - ss_resid/ss_total
adj_R2 = R2 - (1 - R2) * (X.shape[1]-1)/(len(Y) - X.shape[1]-1)

print(f"\nR²: {R2:.3f}")
print(f"Adjusted R²: {adj_R2:.3f}")
print(f"MSE: {mse:.3f}")


# 3. Schedule effect: Actual vs Neutralized schedule
X_actual = model_df[features].values
X_neutral = X_actual.copy()

# Neutralize schedule factors (set to season avg)
for j, col in enumerate(features[1:], start=1):  # skip srs, keep only travel/b2b/4-in-6
    season_means = model_df.groupby("season")[col].transform("mean").values
    X_neutral[:, j] = season_means

X_actual = np.c_[np.ones(len(X_actual)), X_actual]
X_neutral = np.c_[np.ones(len(X_neutral)), X_neutral]

yhat_actual = X_actual @ B
yhat_neutral = X_neutral @ B
model_df["schedule_wins_delta"] = yhat_actual - yhat_neutral

totals = model_df.groupby("team")["schedule_wins_delta"].sum().sort_values()

print("\nMost hurt by schedule (wins lost):")
print(totals.head(5).round(2))

print("\nMost helped by schedule (wins gained):")
print(totals.tail(5)[::-1].round(2))
OLS Regression Results 

   Intercept:  35.7228
         srs:  2.4868
travel_miles:  0.0001
   b2b_games:  0.0000
 four_in_six:  0.0073

R²: 0.901
Adjusted R²: 0.900
MSE: 14.286

Most hurt by schedule (wins lost):
team
CLE   -8.44
IND   -6.29
DET   -6.17
TOR   -5.82
WAS   -5.29
Name: schedule_wins_delta, dtype: float64

Most helped by schedule (wins gained):
team
POR    8.98
GSW    5.89
MIN    4.60
SAC    4.12
MIA    4.08
Name: schedule_wins_delta, dtype: float64

Model Explanation

To estimate how much the schedule has affected team performance, I built a dataset at the team–season level covering 2019–20 through 2023–24. The idea was to separate the effect of schedule factors from the actual quality of the team. To control for team strength, I calculated each team’s Simple Rating System (SRS) directly from game results in the play-by-play data. SRS combines a team’s average margin of victory (MOV) with the average strength of its opponents (SOS), giving one number that summarizes how strong a team was on the court.

I then merged SRS with the schedule dataset I had already built, which included total travel miles, number of back-to-backs, and number of four-games-in-six-nights stretches for each team in each season. I also included the team’s actual total wins, which served as the outcome variable. With this dataset, I fit a linear regression model (ordinary least squares) where wins were predicted by SRS and the schedule factors. SRS controls for how good the team actually was, so the schedule variables capture the incremental effect of the schedule itself.

Once the model was trained, I used it to estimate wins under two situations. First, I predicted wins using each team’s real schedule. Second, I predicted wins again, but this time I kept each team’s SRS the same and set their schedule variables to the league average for that season. The gap between these two predictions represents how many wins a team gained or lost because of schedule difficulty. Summing these gaps across the five seasons gives a single number for each team that reflects how much the schedule helped or hurt them between 2019 and 2024.

ANSWER 9:

  • Most Helped by Schedule: POR (9.0 wins)
  • Most Hurt by Schedule: CLE (-8.4 wins)
In [ ]: