We will be exploring the number of “Quad 1” games remaining for every high-major team. If you are new to the NET and/or unsure about what defines a “Quad 1” game, please refer to the article.
Getting the data
First up, let’s load the necessary libraries.
library(tidyverse)
library(cbbdata)
library(cbbplotR)
library(glue)
library(hrbrthemes)
Per usual, this tutorial will make use of my {cbbdata} package. If you have not done so already, you will need to install it and register for an API key (entirely free). Please read on how to do so here.
Season schedule
Our first step is pulling the 2024 season schedule and filtering to games played on or after today's date (January 19th at the time of writing), which leaves everything through the end of the regular season. We also throw out all games against non-Division 1 competition by keeping only the rows where type does not equal “nond1”.
schedule <- cbd_torvik_season_schedule(year = 2024) %>%
filter(date >= Sys.Date() & type != 'nond1')
After, we need to adjust our data so that we get a “schedule” for each team. Only two factors affect quadrant boundaries, opponent NET and game location, so our team schedule only needs three columns: team, opponent, and game location. There are a myriad of possible ways to do this, but one of the shortest is by using mutate and setting the .keep argument to “none.”
Functionally, this is fairly similar to the superseded transmute function, and if interested, you can read more about the slight discrepancies here. By setting .keep to “none,” our returned data only retains the columns that are specified in our mutate call (i.e., team, opp, and location). Again, we don’t care about the other stuff, so this is perfect.
plot_data <- schedule %>%
mutate(
team = home, opp = away, location = if_else(neutral, 'N', 'H'),
.keep = 'none'
)
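If you want to see what .keep = “none” does without hitting the API, here is a minimal sketch on a made-up two-game schedule. The teams, dates, and the toy_schedule name are hypothetical; only the column names mirror the ones used above.

```r
library(dplyr)

# Hypothetical schedule: one row per game, plus a column we do not need.
toy_schedule <- tibble(
  home    = c("Duke", "Kansas"),
  away    = c("North Carolina", "Baylor"),
  neutral = c(FALSE, TRUE),
  date    = as.Date(c("2024-02-03", "2024-02-10"))
)

# With .keep = "none", only the columns created inside mutate() survive.
kept <- toy_schedule %>%
  mutate(
    team = home, opp = away, location = if_else(neutral, "N", "H"),
    .keep = "none"
  )

names(kept)
```

The date, home, away, and neutral columns are dropped, which is the behavior we want for the team schedule.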
If you are following along in your own session — which I do encourage if you aren’t fully comfortable in R — you will notice that the number of rows in plot_data matches the number found in the schedule frame…but this isn’t good! cbd_torvik_season_schedule returns one row per game, but if we want to calculate the number of Q1 games remaining for every high-major team, we naturally need two rows per game (one for each team).
Perhaps the easiest and most intuitive way to solve this is by simply doing the reverse and binding the resulting rows — Occam's razor and all.
You can think of the code below as a short nested function. The first operation performed will be the inner-most chain: We take our schedule data and do the same mutate as above while swapping the home and away logic (team is now the away side, opp is the home side, and non-neutral games are marked 'A' instead of 'H'). Then, that new data is passed to bind_rows and is combined with the existing plot_data object.
plot_data <- plot_data %>%
bind_rows(
schedule %>%
mutate(
team = away, opp = home,
location = if_else(neutral, 'N', 'A'),
.keep = 'none'
)
)
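A quick sanity check of the “two rows per game” idea, again as a sketch on hypothetical data (the teams and the toy_* names are made up for illustration): after binding the home and away views, the row count should double and every team should appear once per game it plays.

```r
library(dplyr)

# Hypothetical schedule: one row per game.
toy_schedule <- tibble(
  home    = c("Duke", "Kansas"),
  away    = c("North Carolina", "Baylor"),
  neutral = c(FALSE, TRUE)
)

# Home team's point of view.
home_rows <- toy_schedule %>%
  mutate(team = home, opp = away,
         location = if_else(neutral, "N", "H"), .keep = "none")

# Away team's point of view: swap team/opp and use 'A' for true road games.
away_rows <- toy_schedule %>%
  mutate(team = away, opp = home,
         location = if_else(neutral, "N", "A"), .keep = "none")

toy_team_schedule <- bind_rows(home_rows, away_rows)

# Twice as many rows as games.
nrow(toy_team_schedule) == 2 * nrow(toy_schedule)
```

Note that neutral-site games stay 'N' for both teams; only the non-neutral label flips between 'H' and 'A'.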
Adding conference, NET, and quadrant boundaries
Our plot_data object now includes two rows per game, with the proper team/opp and game location assignment, so we can move on to adding quadrant boundaries.
Shipped with {cbbdata} v0.2 is a function called cbd_add_net_quad, which takes data with a similar structure — must have columns representing opponent and game location — and adds columns for opponent NET and quadrant definition.
If you are unsure about which {cbbdata} version you have, you can run this line of code to check and update if needed. This might require an R session restart (if so, don’t forget to reload your libraries).
if(!packageVersion('cbbdata') %in% c('0.2.0', '0.3.0')) {
pak::pak('andreweatherman/cbbdata')
}
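If you would rather test for “at least version 0.2.0” than list specific versions, base R version objects can be compared directly. This is only a sketch of an alternative check: the version strings below are hypothetical so the example runs without {cbbdata} installed, and in a real session you would swap installed for packageVersion('cbbdata').

```r
# Hypothetical version numbers for illustration only.
installed <- package_version("0.1.0")
required  <- package_version("0.2.0")

# Version objects compare component by component, not as plain strings.
needs_update <- installed < required
needs_update
```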
We also need to add conference information and team NET for sorting purposes, so let’s do that here as well. We can grab conference information from the ratings endpoint and NET rankings with the current resume function.
plot_data <- plot_data %>%
cbd_add_net_quad() %>%
left_join(cbd_torvik_ratings(year = 2024) %>% select(team, conf),
by = 'team') %>%
left_join(cbd_torvik_current_resume() %>%
select(team, team_net = net), by = 'team')
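For intuition on what the quad column encodes, here is a rough sketch of how quadrants follow from opponent NET and game location. The cutoffs are the NCAA's published quadrant ranges as I understand them (Q1: home 1-30, neutral 1-50, away 1-75; Q2: 31-75, 51-100, 76-135; Q3: 76-160, 101-200, 136-240). This is not the {cbbdata} implementation, and the opponents and NET values are made up.

```r
library(dplyr)

# Hypothetical opponents with made-up NET rankings.
toy_games <- tibble(
  opp      = c("Team A", "Team B", "Team C", "Team D"),
  location = c("H", "N", "A", "H"),
  net      = c(25, 60, 70, 200)
)

# case_when() takes the first matching branch, so each later branch
# only sees games that missed the stricter cutoffs above it.
toy_quads <- toy_games %>%
  mutate(quad = case_when(
    (location == "H" & net <= 30)  | (location == "N" & net <= 50)  |
      (location == "A" & net <= 75)  ~ "Quadrant 1",
    (location == "H" & net <= 75)  | (location == "N" & net <= 100) |
      (location == "A" & net <= 135) ~ "Quadrant 2",
    (location == "H" & net <= 160) | (location == "N" & net <= 200) |
      (location == "A" & net <= 240) ~ "Quadrant 3",
    TRUE ~ "Quadrant 4"
  ))

toy_quads
```

The same NET 60-70 opponent can land in different quadrants depending on where the game is played, which is why game location has to travel with the schedule.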
Getting our data ready for plotting
Before we start plotting, we need to sort our data, calculate per-conference medians, and clean up some names.
I hope that nothing here is too confusing. Because we want to plot with more “readable” conference names, let’s create a named vector that we’re going to use for filtering and relabeling.
Then we count the number of remaining quad 1 games with sum(quad == “Quadrant 1”) and leave a column showing team NET (since the latter will not change across each team, we can simply take the first observation of team_net with first).
For plotting purposes, we want our data to be ordered in a specific way, which we handle in the third line. We want to order by number of Q1 games remaining, and if a tie exists, we then sort by NET ranking.
Our last line is a basic mutate that achieves a few purposes: calculates median conference NET, relabels each conference using the named vector and places that median value inside the label (so that it will show in our facet), and finally creates a team label that we will place inside each bar. If the team is the last row in their conference (most Q1 games), we include a slightly more verbose label for easier interpretation. (We further restrict the verbose label to only appear if that team has eight or more Q1 games, else it will run over the bar.)
What is fct_inorder?
When dealing with categorical variables, such as teams, you might notice that {ggplot2} reorders your data to be shown alphabetically. Sometimes that is fine, but in other cases, you will want a particular plotting order.
To address this, you’ll often want to convert those categorical variables into factors. If you already have a defined data order, like we do below, the easiest way is to simply use the fct_inorder function, which creates factors in the order in which they first appear.
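Here is a tiny illustration of the difference, using three team names in an arbitrary, made-up order:

```r
library(forcats)

# Already sorted the way we want them plotted (hypothetical order).
teams <- c("Duke", "Clemson", "Virginia")

# factor() sorts levels alphabetically...
levels(factor(teams))

# ...while fct_inorder() keeps the order of first appearance.
levels(fct_inorder(teams))
```

The first call puts Clemson ahead of Duke; the second keeps Duke first, matching the order of the data.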
conf_relabel <- c('ACC' = 'ACC', 'BE' = 'Big East', 'B10' = 'Big Ten',
'B12' = 'Big 12', 'SEC' = 'SEC', 'P12' = 'Pac-12')
plot_data <- plot_data %>%
filter(conf %in% names(conf_relabel)) %>%
summarize(Q1 = sum(quad == 'Quadrant 1'),
median_net_left = median(net),
team_net = first(team_net),
.by = c(team, conf)) %>%
arrange(Q1, desc(team_net)) %>%
mutate(avg_conf_net = mean(team_net),
median_conf_net = median(team_net),
conf = conf_relabel[conf],
conf = glue("{conf} (Med. NET {round(median(median_net_left), 0)})"),
team = fct_inorder(team),
label = if_else(row_number() == n() & Q1 >= 8,
glue('{Q1} Q1s left (NET {team_net})'),
glue('{Q1} ({team_net})')),
.by = conf
)
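To make the counting and sorting steps concrete, here is a sketch on a hypothetical three-team conference (team names, quadrants, and NET values are all invented). It covers only the summarize and arrange steps, not the labeling.

```r
library(dplyr)

# Hypothetical remaining games: one row per team-game.
toy <- tibble(
  team     = c("A", "A", "A", "B", "B", "B", "C", "C"),
  conf     = "ACC",
  quad     = c("Quadrant 1", "Quadrant 1", "Quadrant 2",
               "Quadrant 1", "Quadrant 3", "Quadrant 4",
               "Quadrant 1", "Quadrant 2"),
  team_net = c(12, 12, 12, 40, 40, 40, 20, 20)
)

toy_summary <- toy %>%
  summarize(
    Q1 = sum(quad == "Quadrant 1"),  # TRUE counts as 1, so this is a count
    team_net = first(team_net),      # constant within a team
    .by = c(team, conf)
  ) %>%
  # Fewest Q1 games first; ties broken so the better (lower) NET comes last.
  arrange(Q1, desc(team_net))

toy_summary
```

Teams B and C both have one Q1 game left, so the tie is broken by NET: B (NET 40) comes before C (NET 20), and A, with two Q1 games, comes last. Since the last row ends up at the top of a horizontal bar chart, the team with the most Q1 games sits at the top of its facet.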
Plotting
Now that we have our data analyzed and reshaped, let’s throw it over to {ggplot2} for plotting.
Our code itself is fairly straightforward. We are only using two geom_X functions (col and text) and some colors that I thought looked cool. We are borrowing a general theme from {hrbrthemes} and doing a few extra things to it.
Why are we converting Q1 to a factor in `fill`?
Our Q1 variable is numeric; it’s simply the count of remaining Q1 games for each team. When you set fill or color to a numeric variable in {ggplot2}, the scale becomes continuous by default. This is perfect if your variable is a floating-point number (a double in R, e.g. 5.3 or 19.7), but when working with integers, it usually makes more sense to have a discrete color scale (i.e. a color per distinct integer).
The easiest way to accomplish this without adjusting your data is to simply convert those values to a factor inside your aes call. A factor will treat values as categorical ones (variables that have a fixed and known set of possible values).
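A minimal sketch of the two behaviors on made-up data (the toy data frame and plot names are hypothetical):

```r
library(ggplot2)

# Hypothetical counts of remaining Q1 games.
toy <- data.frame(team = c("A", "B", "C"), Q1 = c(3, 5, 5))

# Numeric fill: ggplot2 uses a continuous gradient.
continuous_plot <- ggplot(toy, aes(Q1, team)) +
  geom_col(aes(fill = Q1))

# Factor fill: one discrete color per distinct count.
discrete_plot <- ggplot(toy, aes(Q1, team)) +
  geom_col(aes(fill = factor(Q1)))

# The factor has one level per distinct value, which is what
# scale_fill_manual() assigns its colors to.
levels(factor(toy$Q1))
```

Print each plot to compare the legends: the first shows a color bar, the second shows one key per count.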
Facets? What in the world is fct_reorder? free_y?
The key part of this plot is building “small multiples,” or in R-speak, facets. Well, I guess the technical R-speak is, “a matrix of panels defined by row and column faceting variables.” Basically, that’s a complicated way of saying, “Hey, let’s just break this plot into smaller ones based on some group.”
If you run the code prior to facet_wrap, you’ll see that we just get one long bar plot with a bunch of team names on the side. This isn’t very helpful. What facets allow us to do is to break our plot into smaller ones based on conferences. Now try running the same code but include the facet_wrap line. You’ll see the exact same plot but in a 2x3 matrix — with each entry representing a different conference. Neat!
fct_reorder allows us to arrange our grid based on some variable. In this case, we want to arrange our small multiples in order of lowest median_conf_net to highest.
Try removing scales = “free_y” and see what happens. You should notice that our bars, well, aren’t quite aligned correctly. That’s because when you build facets, scales are fixed by default — meaning that each facet has the same x- and y-axis scale. This should be evaluated on a case-by-case basis, and with this plot, we definitely do not want our y-axis to be fixed. Instead, we want to plot only the teams observed in that facet — and we can do this by setting scales = “free_y”.
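To see what fct_reorder does to the facet order, here is a small sketch with made-up conference medians (the numbers are hypothetical, not real NET data):

```r
library(forcats)

# Hypothetical data: two teams per conference, with the conference
# median repeated on every row, as in plot_data.
conf            <- c("SEC", "SEC", "ACC", "ACC", "Big 12", "Big 12")
median_conf_net <- c(60, 60, 75, 75, 45, 45)

# Default factor order is alphabetical.
levels(factor(conf))

# fct_reorder() sorts levels by a summary (the median by default)
# of the second argument, lowest first.
levels(fct_reorder(conf, median_conf_net))
```

facet_wrap lays out panels in level order, so the reordered factor puts the conference with the lowest (best) median NET in the first panel.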
plot <- plot_data %>%
ggplot(aes(Q1, team)) +
geom_col(color = 'white', aes(fill = factor(Q1))) +
geom_text(aes(label = label, x = Q1 - 0.25,
color = ifelse(Q1 >= 4, 'grey20', 'white')),
family = 'Roboto Condensed', fontface = 'bold',
hjust = 1, size = 3.5) +
scale_fill_manual(values = c('#2082E4', '#1C8FE7', '#169CE8',
'#0FB4EC', '#0AC1ED', '#06CDEF',
'#00DAF0', '#06DCD5', '#14DDAC')) +
scale_color_identity() +
facet_wrap(~ fct_reorder(conf, median_conf_net), scales = 'free_y') +
theme_modern_rc() +
theme(strip.text = element_text(color = 'white', face = 'bold'),
axis.text.y = element_cbb_teams(logo_type = 'dark'),
axis.title.x = element_text(vjust = -1.5),
plot.title.position = 'plot',
plot.caption.position = 'plot',
plot.subtitle = element_text(vjust = 2.7),
plot.caption = element_text(hjust = 0),
legend.position = 'none',
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
labs(title = 'Number of Q1 games remaining for high-majors',
subtitle = 'NET rankings are current morning of Jan. 19, 2024. Conferences are sorted by median team NET.',
y = NULL,
x = 'Q1 games remaining',
caption = 'Data by cbbdata + cbbplotR\nViz. + Analysis by @andreweatherman')
Viewing the plot
One thing that seems especially pertinent to mention: If you are building plots with more than ~20 logos, use ggpreview. RStudio can be very slow to render plots with many logos, and it might even terminate your session.
To address this, simply store your plot as a variable and pass it to the ggpreview function in {cbbplotR}. The function saves a temporary image of your plot, for which you can set a width and height, and displays it in the Viewer pane. I then recommend opening the plot in a new browser window.
ggpreview(plot, w = 8, h = 9.5)
Saving the plot
To save the plot, we’ll use ggsave.
ggsave('q1_remaining_0119.png', plot = plot, width = 8, height = 9.5, dpi = 350)
Full Code
Loading libraries
library(tidyverse)
library(cbbdata)
library(cbbplotR)
library(glue)
library(hrbrthemes)
Getting the data
conf_relabel <- c('ACC' = 'ACC', 'BE' = 'Big East', 'B10' = 'Big Ten',
'B12' = 'Big 12', 'SEC' = 'SEC', 'P12' = 'Pac-12')
schedule <- cbd_torvik_season_schedule(year = 2024) %>%
filter(date >= Sys.Date() & type != 'nond1')
plot_data <- schedule %>%
mutate(
team = home, opp = away, location = if_else(neutral, "N", "H"),
.keep = "none"
) %>%
bind_rows(
schedule %>%
mutate(
team = away, opp = home,
location = if_else(neutral, "N", "A"),
.keep = "none"
)
) %>%
cbd_add_net_quad() %>%
left_join(cbd_torvik_ratings(year = 2024) %>% select(team, conf),
by = "team"
) %>%
left_join(cbd_torvik_current_resume() %>%
select(team, team_net = net), by = "team") %>%
filter(conf %in% names(conf_relabel)) %>%
summarize(
Q1 = sum(quad == "Quadrant 1"),
median_net_left = median(net),
team_net = first(team_net),
.by = c(team, conf)
) %>%
arrange(Q1, desc(team_net)) %>%
mutate(
avg_conf_net = mean(team_net),
median_conf_net = median(team_net),
conf = conf_relabel[conf],
conf = glue("{conf} (Med. NET {round(median(median_net_left), 0)})"),
team = fct_inorder(team),
label = if_else(row_number() == n() & Q1 >= 8,
glue("{Q1} Q1s left (NET {team_net})"),
glue("{Q1} ({team_net})")
),
.by = conf
)
Plotting
plot <- plot_data %>%
ggplot(aes(Q1, team)) +
geom_col(color = 'white', aes(fill = factor(Q1))) +
geom_text(aes(label = label, x = Q1 - 0.25,
color = ifelse(Q1 >= 4, 'grey20', 'white')),
family = 'Roboto Condensed', fontface = 'bold',
hjust = 1, size = 3.5) +
scale_fill_manual(values = c('#2082E4', '#1C8FE7', '#169CE8',
'#0FB4EC', '#0AC1ED', '#06CDEF',
'#00DAF0', '#06DCD5', '#14DDAC')) +
scale_color_identity() +
facet_wrap(~ fct_reorder(conf, median_conf_net), scales = 'free_y') +
theme_modern_rc() +
theme(strip.text = element_text(color = 'white', face = 'bold'),
axis.text.y = element_cbb_teams(logo_type = 'dark'),
axis.title.x = element_text(vjust = -1.5),
plot.title.position = 'plot',
plot.caption.position = 'plot',
plot.subtitle = element_text(vjust = 2.7),
plot.caption = element_text(hjust = 0),
legend.position = 'none',
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
labs(title = 'Number of Q1 games remaining for high-majors',
subtitle = 'NET rankings are current morning of Jan. 19, 2024. Conferences are sorted by median team NET.',
y = NULL,
x = 'Q1 games remaining',
caption = 'Data by cbbdata + cbbplotR\nViz. + Analysis by @andreweatherman')
Viewing and saving
# view
ggpreview(plot, w = 8, h = 9.5)
# save
ggsave('q1_remaining_0119.png', plot = plot, width = 8, height = 9.5, dpi = 350)