Computer Ratings for the Upcoming Football Seasons
The algorithm for rating football teams and predictions for the opening weekend of the college football season
In previous seasons, I've used a collection of similarly designed rating systems to rank football and baseball teams, and to predict the outcomes of upcoming games. However, because I used different data sources depending on the sport and league, the actual code for each league was slightly different and built around the data source. That's inefficient, and I'd prefer to have a single unified rating system that works with many different types of data. With that in mind, let's build a better rating system. As usual, the code will be released on GitHub under an open source license. I've been tweaking and improving the code right up until this morning, and I'll need to do a bit of work commenting and organizing it better before it goes on GitHub.
As a reminder, there's quite a bit of time and effort needed to develop and test this software. Sometimes there are subtle issues that require many rounds of testing to diagnose and resolve. I don't use AI assistance to write my articles, either. If you'd like to support continued open development of software like this, please consider subscribing, sharing my articles on social media, forwarding my emails to people who might be interested, and making a financial contribution.
Forward- and Backward-Looking Ratings
Before building a rating system, I need to decide what the purpose of the ratings is. Do I want to rank the quality of teams and predict future performance, or do I want to rank which teams have accomplished the most in the games they've played to date? In many cases, the ratings using the two approaches will be similar, but there can be big differences.
One approach is a backward-looking rating system, assessing what a team has accomplished during the games it's played. We typically use this to determine things like qualifying for postseason play. The most basic form of this is the standings, the wins and losses, and that works pretty well in professional leagues where teams generally play schedules of similar difficulty. There are often larger differences in schedule strength in college sports, so we use more complex approaches to weigh a team's wins and losses while accounting for the difficulty of its schedule.
My ratings here are different, a forward-looking approach to measure the quality of teams, with the goal of predicting their future performance. This may sound similar to a backward-looking rating in that we still have to look at what teams have done in prior games. The difference is that forward-looking ratings tend to rely heavily on the score of a game instead of just wins and losses. There's a lot of luck in sports, and it doesn't necessarily even out over the course of a season. A team that loses a lot of close games might actually be a good team that's just been a bit unlucky in its prior games. But a team that's won a lot of close games might not be able to replicate that performance in the future.
Aside from the simple objectivity of looking at which teams have the most wins and fewest losses, there's a lot of subjectivity in deciding which teams have accomplished the most. For example, do we want to give greater weight to wins over opponents with good records or to winning on the road? Which teams are most deserving is ultimately a subjective question with room for differing opinions, and there's no inherently best answer. In contrast, there is an inherently best answer to which teams will perform best in future games. The answer is to play the games and observe which teams actually play best, and test which rating systems work best.
I'll revisit the topic of backward-looking ratings a few weeks into the college football season, when we have enough data to debate which teams are most deserving of being in the college football playoff. Starting in October, after the college football games conclude on Saturday evenings, I'll post a column late in the evening called "The Linked Letters After Dark" with a first look at how that week's games have affected the ratings and the playoff picture. But for now, let's look at the forward-looking ratings.
How the Ratings Work
The basic concept for my ratings is to begin by setting each team's rating and the overall home advantage to zero. My system iterates through all the past games that I want to consider and predicts their outcomes. The predicted margin is the home team's rating, plus the home advantage, minus the away team's rating. The predictions are compared to the actual outcomes of the games. If a team consistently outperforms its predictions, it's a sign that the team is probably underrated, and its rating should be nudged upward. If the team underperforms its predictions, the team is probably overrated, and its rating should be lowered a bit. And if home teams consistently outperform their predictions, it's a sign that the home field advantage should be raised.
The system keeps track of the overall error, which measures how well or how poorly the ratings predict the games for all teams. The goal is to find the combination of team ratings and home field advantage that minimizes this overall error. Once the ratings are adjusted, new predictions are made and evaluated. This process keeps repeating until it becomes very difficult to reduce the overall error. At that point, the ratings should be a good predictor of the quality and future performance of the teams.
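Here's a simplified sketch of that loop in Python. It's not my actual code (which I'll post on GitHub once it's cleaned up), and it uses a plain squared error to keep the sketch short; the error measure I actually use is described below.

```python
import random

def fit_ratings(games, n_rounds=20000, step=0.05):
    """Simplified sketch of the rating loop (not my actual code).

    games: list of (home, away, actual_margin, neutral_site) tuples.
    Returns a dict of team ratings and the overall home advantage.
    """
    teams = {g[0] for g in games} | {g[1] for g in games}
    ratings = {t: 0.0 for t in teams}
    home_adv = 0.0

    def total_error():
        err = 0.0
        for home, away, margin, neutral in games:
            hfa = 0.0 if neutral else home_adv
            predicted = ratings[home] + hfa - ratings[away]
            err += (predicted - margin) ** 2  # squared error, for simplicity
        return err

    best = total_error()
    candidates = list(teams) + ["home_adv"]
    for _ in range(n_rounds):
        # Nudge one rating (or the home advantage) by a small random
        # amount and keep the change only if the overall error drops.
        pick = random.choice(candidates)
        delta = random.uniform(-step, step)
        if pick == "home_adv":
            home_adv += delta
        else:
            ratings[pick] += delta
        err = total_error()
        if err < best:
            best = err
        elif pick == "home_adv":
            home_adv -= delta
        else:
            ratings[pick] -= delta

    return ratings, home_adv
```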
My rating system is really an optimization problem to minimize the overall error. When my system adjusts the ratings up or down a bit, there's a small random component because I've found this is helpful in optimizing the ratings. But it also means that if I choose different random numbers, I'll get slightly different ratings for the teams. Because of this, I actually run the rating system many times, each with different random numbers, and I get different ratings each time. The final home advantage and team ratings are the median of all the rating attempts.
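Continuing the sketch above, the ensemble simply reruns that fit with different random seeds and takes the median of each team's rating and of the home advantage (again illustrative, not my production code):

```python
import statistics

def ensemble_ratings(games, n_runs=50):
    """Run fit_ratings many times and take the median of each team's rating."""
    runs, home_advs = [], []
    for seed in range(n_runs):
        random.seed(seed)  # different random numbers for each attempt
        ratings, home_adv = fit_ratings(games)
        runs.append(ratings)
        home_advs.append(home_adv)
    final = {team: statistics.median(r[team] for r in runs) for team in runs[0]}
    return final, statistics.median(home_advs)
```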
Typically, rating systems try to avoid rewarding teams for running up the score too much. Winning a game by three touchdowns is definitely more impressive than winning by one touchdown. And a five touchdown victory is more decisive than just winning by three touchdowns. But does it really make a big difference if a team wins by seven touchdowns instead of five? At some point, a blowout is a blowout, and there are diminishing returns to winning by larger margins. I assume the margin of victory for all games is normally distributed with a mean of zero, and I calculate the standard deviation. Using that, I calculate the cumulative probability for both the predicted and actual margins of victory. A larger difference in cumulative probabilities means the prediction for the game was worse. I prefer this approach because it doesn't excessively reward or punish teams for blowouts, but I also don't need to pick arbitrary thresholds like weighting victories less if the margin is above three touchdowns or five touchdowns.
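Concretely, the comparison looks something like this, sketched with scipy's normal CDF (simplified from what I actually run). Because the CDF flattens out in the tails, the difference between winning by five touchdowns and seven touchdowns barely moves the error.

```python
import numpy as np
from scipy.stats import norm

def margin_error(predicted_margin, actual_margin, all_game_margins):
    """Compare predicted and actual margins on a cumulative-probability scale."""
    # Margins of victory are assumed normally distributed with mean zero.
    sigma = np.std(all_game_margins)
    predicted_p = norm.cdf(predicted_margin, loc=0.0, scale=sigma)
    actual_p = norm.cdf(actual_margin, loc=0.0, scale=sigma)
    # A larger gap in cumulative probability means a worse prediction.
    return abs(predicted_p - actual_p)
```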
Offense and Defense Ratings
Instead of using a single number to represent a team's quality, my system actually uses separate ratings for offense and defense. These are not efficiency ratings like some other systems. A positive offense rating means that a team tends to score more points than normal. And a positive defense rating means that a team allows fewer points than normal. These are used to predict the score of a game instead of the margin of victory or defeat.
Offense and defense ratings for each team start at zero. For each game, predictions are made for the home and away teams' scores. The home team's score is predicted as the home team's offense rating, plus half of the home advantage (or zero for a neutral site), minus the away team's defense rating, plus the average score for all teams and games. The away team's score is the away team's offense rating, minus half of the home advantage (or zero for a neutral site), minus the home team's defense rating, plus the average score for all teams and games. A negative score isn't realistic, so if the predicted scores are negative, they're just set to zero. The predicted margin of victory is then home team's predicted score minus the away team's predicted score. This is actually what my system uses to predict the outcome of games.
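Written out as a simplified sketch, the score prediction looks like this (the dictionary layout for the offense and defense ratings is just for illustration):

```python
def predict_scores(home, away, ratings, home_adv, mean_score, neutral=False):
    """Predict a game's score from offense/defense ratings (simplified sketch).

    ratings[team] is assumed to be a dict with "off" and "def" entries.
    mean_score is the average score across all teams and games.
    """
    hfa = 0.0 if neutral else home_adv / 2.0  # half the home advantage
    home_pts = ratings[home]["off"] + hfa - ratings[away]["def"] + mean_score
    away_pts = ratings[away]["off"] - hfa - ratings[home]["def"] + mean_score
    # Negative scores aren't realistic, so clamp them at zero.
    home_pts = max(home_pts, 0.0)
    away_pts = max(away_pts, 0.0)
    predicted_margin = home_pts - away_pts
    return home_pts, away_pts, predicted_margin
```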
Weighting Games
College teams tend to play a lot of games within their conference and relatively few games against non-conference opponents. A typical schedule for an FBS team is eight or nine games inside their conference, two or three against non-conference FBS teams, and maybe a single game against an FCS team. There are relatively few games between teams in different divisions, like FBS and FCS teams, or FCS teams against lower divisions. These games are often blowouts, so it might seem like there's not a lot that can be learned from the outcome of these games. I disagree.
In my testing, when I give these games too little weight, it causes the top teams from lower divisions to be ranked alongside the best FBS teams. These teams are often dominant in their own division, and there definitely are a few dominant Division III teams, but there's also a large disparity in the skill of players across divisions. In reality, a game between one of the top FBS teams and one of the best Division III teams would almost certainly not be competitive. These blowouts are actually useful for distinguishing the quality of teams in higher divisions from those in lower divisions. However, because there are relatively few of these games, I've found they actually need to be weighted more heavily to make the ratings more accurate. I use graph theory techniques to determine how much extra weight to give these games.
In graph theory, each team can be treated as a node. Each game is a connection between two nodes, and that connection is referred to as an edge. There are lots of edges between conference teams, but relatively few edges between teams in different conferences, and even fewer edges across divisions. I use a metric called edge betweenness centrality to measure how dense or sparse the connections are between nodes. Non-conference games and especially games between teams in different divisions have a larger betweenness centrality and are weighted more. This does seem to be helpful in preventing the top teams in lower divisions from being ranked alongside the top FBS teams.
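Here's the rough idea using networkx. Exactly how I map the centrality values onto game weights is still something I'm tuning, so the scaling below (relative to an average game) is just illustrative:

```python
import networkx as nx

def game_weights(games):
    """Weight each game by the betweenness centrality of its edge (sketch).

    games: list of (home, away, ...) tuples.  Games that bridge sparsely
    connected parts of the schedule graph (non-conference and especially
    cross-division matchups) end up with larger weights.
    """
    G = nx.Graph()
    for game in games:
        G.add_edge(game[0], game[1])
    centrality = nx.edge_betweenness_centrality(G)
    mean_c = sum(centrality.values()) / len(centrality)
    weights = []
    for game in games:
        c = centrality.get((game[0], game[1]),
                           centrality.get((game[1], game[0]), mean_c))
        weights.append(c / mean_c)  # weight relative to an average game
    return weights
```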
Probability of a Win, Loss, or Tie
When the overall ratings have been calculated, my system uses these ratings to predict the scoring margin for every game. The prediction error is the predicted margin minus the actual margin. These errors are computed for every game, along with their standard deviation. Assuming that the mean of the errors is zero and using the standard deviation of the errors, my system fits a t-distribution to the data. If the optimal fit has more than 50 degrees of freedom, my system assumes the error is very close to a normal distribution. If not, a t-distribution is probably a better fit.
When predicting future games, the margin is estimated as the home team's rating, plus the home advantage, minus the away team's rating. That predicted margin is also the center of the distribution, and my system uses the same standard deviation and distribution type that was just calculated. If there are no ties, the probability the away team wins is the cumulative probability of the part of the distribution that is less than zero, and the home team's probability of winning is one minus that.
If ties are a possibility, that probability is a small sliver of the distribution on either side of zero. Using current NFL rules, the probability of a tie is around 0.38%. My system goes through all the past games and estimates how wide that sliver of the distribution would need to be to match the overall probability of a tie. For NFL games, this ends up being roughly plus or minus 0.07 points. If teams are more evenly matched, the probability of a tie is going to be a bit larger than 0.38% because it's more likely that the game will be close. If it's a lopsided matchup, the probability of a tie will be less than 0.38% because the game is more likely to be a blowout. There are no ties in college football, so this is just ignored.
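Putting the last few paragraphs together, the probability calculation looks roughly like this in scipy (a simplified sketch; the 0.07-point tie window in the default argument is the NFL value mentioned above, and it should be set to zero for college football):

```python
import numpy as np
from scipy import stats

def fit_error_distribution(errors):
    """Fit a t-distribution to the prediction errors, with the mean fixed at zero.

    Returns (df, scale); df of None means "treat the errors as normal."
    """
    df, _, scale = stats.t.fit(errors, floc=0.0)
    if df > 50:
        return None, float(np.std(errors))
    return df, scale

def win_loss_tie(predicted_margin, df, scale, tie_window=0.07):
    """Win/loss/tie probabilities for one game (simplified sketch)."""
    # Center the fitted error distribution on the predicted margin.
    if df is None:
        dist = stats.norm(loc=predicted_margin, scale=scale)
    else:
        dist = stats.t(df, loc=predicted_margin, scale=scale)
    p_away = dist.cdf(-tie_window)          # margin below the tie band
    p_tie = dist.cdf(tie_window) - p_away   # the sliver around zero
    p_home = 1.0 - p_away - p_tie
    return p_home, p_away, p_tie
```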
Early Season Predictions
Unlike some other systems, I don't try to adjust the ratings based on how much production each team has returning. Instead, I just use ratings from the prior season as a starting point. For NFL games, preseason games during the current season do have some predictive power, so I include them. For college football games, it's just the previous season's ratings at the start of the season. As games start to be played in the new season, I start lowering the weight of games from the preseason and the prior season. For college football, my plan is to completely phase out games from the prior season around the sixth week of the season. For NFL games, I'll decrease the prior season's weight more slowly, phasing preseason and prior season games out completely after the tenth week of the season. This is a somewhat arbitrary decision, and I might use a different approach for other sports or in future seasons.
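To illustrate the phase-out, here's a simple linear version. The exact shape of the curve I use is still a somewhat arbitrary choice and may change; this just shows the idea.

```python
def prior_season_weight(current_week, phase_out_week):
    """Weight applied to preseason and prior-season games (illustrative only).

    Starts at 1.0 before the new season begins and falls to 0.0 by
    phase_out_week (around week 6 for college football, week 10 for the
    NFL).  The linear shape here is just one possible choice.
    """
    if current_week <= 0:
        return 1.0
    if current_week >= phase_out_week:
        return 0.0
    return 1.0 - current_week / phase_out_week
```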
An Ensemble of Ratings
As I wrote earlier, my system actually calculates many sets of ratings, each being unique and slightly different. This could be called an ensemble of ratings. How large does the ensemble need to be to get accurate ratings? I'm not entirely sure.
Even running the ratings a single time does a very good job of distinguishing good and bad teams. For the NFL, the best teams are rated about 20 points higher than the worst. For an ensemble of 100 ratings, a quick look at one team shows the highest rating is about a point above the lowest. If there are a few teams with similar ratings, the difference between the highest and lowest ratings might be enough to move a team up or down a couple of spots. However, because I use the median rating, increasing the ensemble size from 10 to 30, or 30 to 50, or 50 to 100, doesn't really shift the ratings up or down more than a small fraction of a point.
One thing I have noticed is that if I calculate the standard deviation of the ensemble of ratings for each team, the highest standard deviations are about twice as large as the lowest. This is the case even if I have an ensemble of 100 ratings, so I'm not convinced this is pure chance. I'm not certain, but if there's a larger standard deviation for a team's rating, it probably means there's more uncertainty about how good that team really is. This may be useful information, but I'm not sure yet.
Back to the question: how large of an ensemble of ratings do I really need? Larger is better, but I also need to get results in a timely manner. For the NFL ratings, there are a few hundred games, so they don't take especially long. I'll probably use an ensemble of 50 or 100 rating attempts, and I suspect that's more than enough. For college football, there are roughly 10 times as many games as in the NFL data, so it requires a lot more computing time. I'll probably limit this to 10 or 15 attempts, sacrificing some accuracy to make sure I get the results in a timely manner.
Preseason Ratings
As a reminder, these ratings are also the final 2024 ratings, because I don’t adjust them for the upcoming season based on returning players. Delaware and Missouri State are moving up to the FBS for 2025, so they’re included in these ratings.
Overall Ratings
Home advantage: 3.12 points
Mean score: 26.36 points
Rank Rating Team Offense Defense
1 78.10 Ohio State 37.33 40.69
2 75.04 Notre Dame 37.76 37.20
3 73.42 Ole Miss 36.71 36.56
4 73.02 Texas 34.95 38.33
5 71.72 Indiana 38.45 33.15
6 71.14 Alabama 35.82 35.35
7 70.04 Tennessee 34.45 35.53
8 69.66 Penn State 32.91 36.70
9 69.29 Georgia 34.61 34.70
10 67.98 Oregon 36.10 31.86
11 66.34 South Carolina 31.61 34.73
12 65.04 Miami 40.49 24.55
13 64.18 Louisville 35.32 28.82
14 63.44 Arizona State 32.56 30.85
15 63.26 SMU 34.80 28.71
16 62.48 LSU 32.55 29.91
17 62.43 BYU 30.26 32.23
18 62.27 Florida 30.11 32.15
19 62.25 USC 32.65 29.51
20 62.04 Colorado 32.07 30.01
Rank Rating Team Offense Defense
21 61.92 Clemson 33.46 28.19
22 61.34 Texas A&M 30.76 30.46
23 60.73 Minnesota 26.14 34.51
24 60.65 Iowa 29.17 31.48
25 60.02 Iowa State 29.44 30.56
26 59.66 Kansas State 30.14 29.30
27 58.98 Baylor 33.68 25.35
28 58.94 Michigan 24.86 34.25
29 58.83 Kansas 30.44 28.47
30 58.22 Missouri 26.68 31.47
31 57.92 Virginia Tech 27.26 30.75
32 57.65 Oklahoma 26.21 31.49
33 57.50 Boise State 31.42 26.07
34 57.24 Tulane 29.87 27.48
35 56.75 Vanderbilt 27.57 28.84
36 56.57 TCU 30.13 26.36
37 55.99 Arkansas 28.73 27.23
38 55.77 Illinois 27.77 28.01
39 55.68 Auburn 26.05 29.88
40 55.09 UNLV 29.23 25.84
Rank Rating Team Offense Defense
41 54.92 UCF 30.85 23.97
42 54.85 Nebraska 22.29 32.69
43 54.30 Georgia Tech 26.97 27.33
44 53.62 Utah 21.16 32.42
45 53.50 Washington 23.48 29.96
46 52.99 Texas Tech 34.13 18.79
47 52.89 Wisconsin 23.24 29.71
48 52.59 Boston College 26.27 26.41
49 52.57 Syracuse 30.69 21.78
50 52.51 Kentucky 21.23 31.20
51 52.42 Army 22.59 29.82
52 52.37 Cincinnati 24.03 28.17
53 51.81 Rutgers 28.64 23.13
54 51.08 Navy 25.62 25.17
55 50.70 Pittsburgh 28.16 22.76
56 50.40 UCLA 21.47 28.88
57 50.11 Marshall 25.46 24.46
58 49.08 California 20.52 28.54
59 48.88 Texas State 27.98 20.87
60 48.76 Memphis 26.78 22.05
Rank Rating Team Offense Defense
61 48.74 Duke 22.71 25.89
62 48.34 West Virginia 26.91 21.34
63 47.92 James Madison 23.13 24.79
64 47.67 North Carolina 27.21 20.23
65 47.58 Mississippi State 27.64 20.04
66 46.73 Ohio 20.25 26.44
67 46.64 Maryland 25.16 21.45
68 45.96 Washington State 28.06 17.87
69 45.67 Virginia 20.81 24.88
70 45.32 Houston 15.16 30.22
71 45.19 Oklahoma State 26.23 18.87
72 44.55 Old Dominion 23.20 21.47
73 44.49 South Alabama 24.50 20.21
74 44.21 UConn 22.08 22.09
75 43.99 Miami (OH) 17.69 26.30
76 43.99 NC State 24.99 19.24
77 43.92 Northwestern 18.81 25.10
78 43.76 Georgia Southern 22.50 21.08
79 43.71 Louisiana 22.30 21.49
80 43.62 Michigan State 17.86 25.76
Rank Rating Team Offense Defense
81 43.09 Jacksonville State 26.99 16.10
82 42.54 Northern Illinois 14.73 28.16
83 41.88 Toledo 20.78 21.20
84 41.85 Bowling Green 18.97 22.86
85 41.47 Arizona 20.43 20.96
86 40.53 South Florida 23.15 17.35
87 40.48 Stanford 20.92 19.49
88 40.25 Fresno State 19.38 20.68
89 39.60 UTSA 23.68 15.57
90 39.60 Wake Forest 21.50 18.10
91 39.55 Florida State 14.56 25.20
92 39.37 East Carolina 21.62 17.81
93 39.20 North Texas 25.53 13.52
94 39.01 San José State 20.86 18.06
95 38.63 Sam Houston 15.17 23.47
96 37.10 App State 21.03 16.08
97 36.92 Nevada 17.79 19.24
98 36.75 Coastal Carolina 21.14 15.54
99 36.59 Western Kentucky 16.86 19.67
100 36.53 Rice 15.77 20.85
Rank Rating Team Offense Defense
101 35.99 Buffalo 19.02 16.89
102 35.91 Colorado State 16.46 19.32
103 35.83 Troy 18.00 17.89
104 35.44 Oregon State 17.46 17.86
105 34.53 UL Monroe 15.08 19.27
106 34.26 Florida International 15.73 18.36
107 34.10 Western Michigan 20.77 13.18
108 33.95 Liberty 16.07 17.95
109 33.83 Utah State 25.60 8.11
110 33.78 Air Force 11.64 22.18
111 33.49 Georgia State 18.50 15.05
112 33.46 Wyoming 12.10 21.05
113 33.45 New Mexico 26.30 7.16
114 32.47 Arkansas State 18.31 14.42
115 32.05 Louisiana Tech 10.46 21.61
116 31.73 Charlotte 17.56 14.20
117 31.50 Eastern Michigan 16.18 15.14
118 31.49 Hawai'i 11.94 19.68
119 30.99 UAB 18.84 12.17
120 30.44 Purdue 17.92 12.55
Rank Rating Team Offense Defense
121 30.30 Missouri State 19.89 10.29
122 29.57 Florida Atlantic 15.86 13.83
123 29.53 Central Michigan 15.29 14.16
124 29.18 Akron 14.70 14.50
125 28.88 San Diego State 12.66 16.22
126 27.97 Delaware 16.57 11.43
127 26.23 Ball State 18.95 7.15
128 25.97 Massachusetts 16.77 9.04
129 25.35 Temple 11.67 13.68
130 22.77 UTEP 12.43 10.37
131 20.32 New Mexico State 12.47 7.85
132 20.14 Middle Tennessee 10.33 9.73
133 19.90 Kennesaw State 7.41 12.53
134 18.82 Southern Miss 9.12 9.71
135 15.19 Tulsa 13.47 1.61
136 10.06 Kent State 7.80 2.25
Weekend Game Predictions
There are five games this weekend, so here are predictions for those five games using the computer ratings.
After the team name, there are two numbers. The first is the team’s expected margin of victory (positive) or defeat (negative). The second number is the team’s probability of winning the game.
The estimated score is based on the offense and defense ratings and may not exactly match up with the predicted margin of victory, though it’s very close. The main idea is to give a rough approximation of what score might be expected based on the quality of the two teams, and whether it will be a high or low scoring game.
I try to estimate the quality of the game based on the ratings of the two teams and the likelihood of a competitive game. This is maximized when two highly-rated teams play and are evenly matched. Finally, I show the probability of a blowout, a close game, a high scoring game, and a low scoring game, with the thresholds based on previous games used to train the rating system.
The games are listed in order of their expected quality, not the start time.
1. Iowa State (0.36, 51.17%) vs. Kansas State (-0.36, 48.83%)
Estimated score: 26.50 - 25.94, Total: 52.44
Quality: 97.48%, Team quality: 96.25%, Competitiveness: 99.98%
Blowout probability (margin >= 34.0 pts): 0.54%
Close game probability (margin <= 8.5 pts): 51.29%
High scoring probability (total >= 79.1 pts): 23.53%
Low scoring probability (total <= 26.4 pts): 24.00%
2. Sam Houston (-1.08, 46.48%) at Western Kentucky (1.08, 53.52%)
Estimated score: 20.30 - 21.31, Total: 41.61
Quality: 90.97%, Team quality: 86.86%, Competitiveness: 99.80%
Blowout probability (margin >= 34.0 pts): 0.56%
Close game probability (margin <= 8.5 pts): 51.14%
High scoring probability (total >= 79.1 pts): 15.51%
Low scoring probability (total <= 26.4 pts): 33.98%
3. Stanford (5.86, 68.42%) at Hawai'i (-5.86, 31.58%)
Estimated score: 26.03 - 20.38, Total: 46.41
Quality: 88.47%, Team quality: 85.69%, Competitiveness: 94.31%
Blowout probability (margin >= 34.0 pts): 1.12%
Close game probability (margin <= 8.5 pts): 46.54%
High scoring probability (total >= 79.1 pts): 18.82%
Low scoring probability (total <= 26.4 pts): 29.36%
4. Fresno State (-21.71, 3.79%) at Kansas (21.71, 96.21%)
Estimated score: 15.70 - 37.69, Total: 53.40
Quality: 70.65%, Team quality: 91.85%, Competitiveness: 41.81%
Blowout probability (margin >= 34.0 pts): 15.73%
Close game probability (margin <= 8.5 pts): 13.33%
High scoring probability (total >= 79.1 pts): 24.34%
Low scoring probability (total <= 26.4 pts): 23.21%
5. Idaho State (-38.98, 0.07%) at UNLV (38.98, 99.93%)
Estimated score: 16.11 - 55.13, Total: 71.25
Quality: 30.68%, Team quality: 81.77%, Competitiveness: 4.32%
Blowout probability (margin >= 34.0 pts): 65.81%
Close game probability (margin <= 8.5 pts): 0.63%
High scoring probability (total >= 79.1 pts): 41.60%
Low scoring probability (total <= 26.4 pts): 11.21%
After this weekend’s games, I’ll update the ratings and predict the remaining week 1 games. I probably won’t keep posting new articles with ratings every weekend during the season, and I’ll instead add a static page that will always have updated ratings. But I will post an article with predictions for the remainder of week 1, and I’ll do the same for the first week of NFL action.
This system is still a work in progress, and I intend to verify its accuracy later in the season, when the ratings aren't based heavily on last year's games. As I mentioned earlier, I'll post my code for both college football and NFL data on GitHub once I get it cleaned up and properly commented.
Bring on some football!
These ratings and predictions are based on data from collegefootballdata.com.