How to create a betting model using expected goals data

Preface

This post was only ever intended to be a brief summary but has ultimately turned into a story of my journey with the betting model interwoven. It’s ended up at over 6,000 words and is an open article of my inspirations, data sources and formulas used. The purpose is to cover everything needed for a like-minded individual to start on their own journey, so here goes…

Introduction to me

With the lockdown resulting in a temporary shutout of many sports, it has provided me with an opportunity for reflection, development and knowledge sharing – the aim of this blog post.

Search Twitter nowadays and there is an abundance of football analysts, some successfully making a living out of it and some who do it for the love of it. I lack a scouting or coaching background and have a perception that the industry is difficult to crack, so a year or two ago I started on a different path to see if I could use expected goals data to create a betting model.

First and foremost I’m a sports fan. I’ve always been interested in numbers and as I have become older this has naturally progressed into a fascination with data. With an abundance of resources available nowadays I’ve spent many hours manually collating and creating spreadsheets looking for interesting trends and patterns, generally to help make predictions, sometimes with a financial investment attached.

I like to think I have a rough grasp of odds offered by bookmakers, be it a good price or a bad price. I’ve never been one to do an accumulator with several odds-on favourites at home, inevitably it would be let down by at least one and didn’t feel like great value, but I had no way of telling.

Moneyball and Mayhew

For a sports fan who loves data, the 2011 film Moneyball about the world of Baseball and advanced analytics would have seemed like a natural fit but I’ll be honest and say I don’t think it even registered on my radar. It wasn’t until 2017 that I can remember first watching the film, with one particular scene lodging firmly in my memory.

For those who haven’t seen the film, the scene revolves around the Oakland Athletics general manager Billy Beane, played by Brad Pitt. The problem facing them is summarised in one simple quote:

“There are rich teams, and there are poor teams. Then there’s 50 feet of crap. And then there’s us”

Billy Beane

With a limited budget available, Beane brings in assistant general manager Paul DePodesta to help build a roster of players using new sophisticated analytical metrics to identify undervalued talent, often against the advice of the experienced traditional scouts. If you haven’t watched it and you think you may be like-minded then watch the film; it’s highly recommended and is the inspiration for my journey.

I hadn’t the first idea about Baseball but assumed that something must be transferable to football. This was when I stumbled across a book by James Tippett called The Football Code: The Science of Predicting the Beautiful Game. It was a great introduction to a concept that was new to me, expected goals, and its use in the real world at SmartOdds and Brentford through owner Matthew Benham.

For those not familiar with expected goals (or xG for short) it is a metric to monitor the quality of a goalscoring chance. A value between 0 and 1 is assigned based on the probability the chance will result in a goal. A 1 in 20 long range shot will have a probability of 5% (an xG of 0.05) whereas a penalty roughly has a 3 in 4 expectancy and therefore an xG of 0.75.

Inspired by the idea of a new advanced metric I searched and read any related article I could find. It was at this point my searches led me to Ben Mayhew, who through his Twitter profile @experimental361, was creating data visualisations using expected goals data.

His website, www.experimental361.com, is full of great content and my interest was piqued further by the match xG timelines. A couple of recent examples from matches just before the lockdown in March show how useful they are for providing a quick snapshot of how a match unfolded.

Firstly, a match between Preston North End and Queens Park Rangers could be summarised as: Preston scoring with their first real good chance with little happening in the subsequent 40 minutes. QPR equalised with their best chance of the match but were somewhat fortunate to win the game from thereafter with Preston creating the better chances.

QPR scored 3 goals but were expected to score 1.1 whereas Preston scored 1 goal but were expected to score 1.6. These are the type of games where a 1-1 draw or 2-1 home win would have felt a fairer representation to me based on the chances but also how the game panned out.

The match between Stoke City and Hull City is more straightforward in that Stoke were totally dominant from the start but are arguably flattered by the scoreline: the hosts scored 5 goals but were expected to score 2.7.

Intrigued to find out more and keen to have a play around with some new data, I wondered if it was possible to create expected goals myself and so I reached out to Ben for support.

Finding Data

I can’t remember whether I was expecting to receive a reply from Ben, and although he understandably didn’t share all his secrets, it was enough information to point me in the right direction and provide motivation to dive right in.

I wasn’t looking to invest financially in any data so it was important that the data was free, consistent and easily accessible. Those who follow the live text commentary on the BBC Sport or Sky Sports websites will notice that it is typically uniform in nature. With the text tending to be in a set format, it is perfect for extracting the information required.

Each line of text describes an event that has occurred in the match with a couple of examples below from my preferred source, the SportingLife website.

A goal is typically structured as:

A non-scoring attempt is structured as:

By recognising the set structure of the various types of events this can be manipulated through use of formulas in Excel or any other preferred coding language into something more useful:

| Minute | Event | Attempt Player | Team | Attempt Type | Shot Location | Shot Placement | Assist Player | Assist Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 19 | Goal | Daniel Johnson | Preston North End | Left footed shot | Penalty | Bottom right corner | | |
| 25 | Attempt Missed | Jordan Hugill | Queens Park Rangers | Header | Centre of the Box | Misses to the left | Ryan Manning | Corner |
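For anyone who would rather do this step in code than in Excel, here is a minimal Python sketch of the pattern matching for the second row of the table. The commentary sentence is an assumed wording written to fit the fields above, not text copied from SportingLife, and the pattern and field names are hypothetical, so treat it as a starting point that will need adjusting to the real text.

```python
import re

# Assumed commentary wording, written to match the fields in the table above.
LINE = (
    "25' Attempt missed. Jordan Hugill (Queens Park Rangers) header from "
    "the centre of the box misses to the left. Assisted by Ryan Manning "
    "following a corner."
)

# Hypothetical pattern for a non-scoring attempt; goals would need their own.
ATTEMPT = re.compile(
    r"(?P<minute>\d+)' (?P<event>Attempt \w+)\. "
    r"(?P<player>[^(]+) \((?P<team>[^)]+)\) "
    r"(?P<attempt_type>header|(?:left|right) footed shot) "
    r"from (?P<location>.+?) "
    r"(?P<placement>(?:misses|is saved|is blocked).*?)\."
    r"(?: Assisted by (?P<assist_player>.+?)"
    r"(?: following a (?P<assist_type>[\w ]+))?\.)?$"
)


def parse_attempt(line):
    """Return the named fields of one commentary line, or None if no match."""
    match = ATTEMPT.match(line)
    return match.groupdict() if match else None


event = parse_attempt(LINE)
print(event)
```

Lines that do not fit the pattern return None, which makes it easy to list the wordings that still need a rule of their own.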

This is obviously just two incidents within one match but collating this for numerous events across numerous leagues across numerous seasons quickly builds a database containing thousands of records.

Not keen to copy the information manually, I gathered there must be a smarter way. A quick Google search identified free software called R, which could extract the data needed from the websites in bulk. As a beginner, I found Stack Overflow a great tool to help me pull the data I needed. The time invested was definitely worthwhile and has saved a lot of time in the long term.

The Expected Goals Model

If you are still reading at this point, thanks! The next section and lynchpin for the whole article is creating the expected goals model. The first step is to have a large database of events, the more the better ideally. As shown in the table previously you can extract numerous data items such as the attempt type, shot location, shot placement and assist type. For my model I use just two pieces of information:

– Attempt Type – namely was the attempt a shot or a header

– Shot Location – where on the pitch was the attempt taken from

Now it’s important at this stage to highlight that the quality of an expected goals model is dependent on the quality of the data used. My model is at the simpler end as it’s using free, basic descriptive data. Every other model will use different input data in a different way to calculate an expected goals value for an attempt. This is why xG values for the same match differ between providers.

The data I use is suitable for my needs as it is free, easily attainable and allows me to be in control of the calculation. For those looking to find data, FBref, WhoScored and Infogol are three data suppliers who have more detailed data than mine and are a good source of information.

Anyway, back to the data I use. Linking the attempt type and shot location provides various combinations describing the attempt. For each of these you will be able to calculate the number of times that attempt combination occurred and how often it resulted in a goal. This is the basis of the expected goals formula:

xG value = The number of goals scored / The number of attempts taken

From my database of around 150,000 attempts and just under 17,000 goals I have the following percentages for each type of attempt and the corresponding expected goals value:

| Attempt Type | Attempts | Goals | Goal % | xG Value |
| --- | --- | --- | --- | --- |
| Penalty | 2869 | 2158 | 75.2% | 0.752 |
| Shot from Very Close Range | 3170 | 1738 | 54.8% | 0.548 |
| Header from Very Close Range | 2536 | 885 | 34.9% | 0.349 |
| Shot from Side of 6 Yard Box | 3098 | 684 | 22.1% | 0.221 |
| Shot from Centre of Box | 29592 | 5149 | 17.4% | 0.174 |
| Free Kick | 2712 | 395 | 14.6% | 0.146 |
| Header from Side of 6 Yard Box | 2624 | 362 | 13.8% | 0.138 |
| Header from Centre of Box | 18270 | 1566 | 8.6% | 0.086 |
| Shot from Difficult Angle | 2576 | 208 | 8.1% | 0.081 |
| Shot from Side of Box | 21876 | 1527 | 7.0% | 0.070 |
| Shot from Long Range | 1930 | 96 | 5.0% | 0.050 |
| Shot from Outside of Box | 54510 | 1948 | 3.6% | 0.036 |
| Header from Side of Box | 1058 | 30 | 2.8% | 0.028 |
| Header from Outside of Box | 79 | 2 | 2.5% | 0.025 |
| Header from Difficult Angle | 282 | 3 | 1.1% | 0.011 |

It’s at this point that the realisation of how infrequently long range goals are scored may deter a few from shouting “Shooooot” the next time a player has the ball around 25 yards out.

Once each event has been assigned an expected goals value then the possibilities are endless. You can calculate the expected goals for both teams in any given match, the expected goals a player should have scored over a season or the data I use for my betting model: the expected goals scored and conceded by each team over a rolling season-long period.

There’s no wrong way to measure team strength. I’ve chosen a season-long period as I feel it gives a truer reflection of a team’s ability. Shorter 6/10 game periods are useful context and reflect the latest information more quickly but can be biased by the strength of the fixtures in that period.
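As a minimal sketch of the formula in Python, the snippet below rebuilds three rows of the xG table from their attempt and goal counts and then adds up the xG for a made-up list of attempts in one match. The counts come from the table above; the match attempts and the variable names are invented for illustration.

```python
# (attempts, goals) for three rows of the table above.
counts = {
    "Penalty": (2869, 2158),
    "Shot from Centre of Box": (29592, 5149),
    "Shot from Outside of Box": (54510, 1948),
}


def xg_value(attempts, goals):
    """xG value = number of goals scored / number of attempts taken."""
    return goals / attempts


xg_table = {name: round(xg_value(a, g), 3) for name, (a, g) in counts.items()}

# Hypothetical attempts by one team in one match.
match_attempts = [
    "Penalty",
    "Shot from Outside of Box",
    "Shot from Centre of Box",
]

# A team's expected goals for the match is the sum of its attempts' xG values.
match_xg = sum(xg_table[attempt] for attempt in match_attempts)
print(xg_table, round(match_xg, 3))
```

The same summing idea works per player or per team over any run of matches; only the list of attempts changes.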

Expected Goals in the Real World

Enough of the theory. Here’s an example of the expected goals data I have calculated, in this instance the final 2018-19 Championship table. An important finding to note is that the spread in expected goals scored (xGF) and expected goals conceded (xGA) is a lot narrower than the spread in actual goals scored and conceded.

The ability of the teams within the league is closer than people think. In most cases the teams at the top are good but overperforming somewhat. Think of it as those teams who seem to win lots of games by a single goal when it probably should have been a draw.

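| Rank | Team | GF | GA | GD | Pts | xGF | xGA | xGD | xPts | xPts Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Norwich City | 93 | 57 | 36 | 94 | 80 | 59 | 21 | 75 | 3 |
| 2 | Sheffield United | 78 | 41 | 37 | 89 | 76 | 46 | 30 | 80 | 2 |
| 3 | Leeds United | 73 | 50 | 23 | 83 | 81 | 43 | 38 | 83 | 1 |
| 4 | West Bromwich Albion | 87 | 62 | 25 | 80 | 75 | 64 | 11 | 68 | 6 |
| 5 | Aston Villa | 82 | 61 | 21 | 76 | 76 | 63 | 13 | 69 | 5 |
| 6 | Derby County | 69 | 54 | 15 | 74 | 59 | 65 | -6 | 61 | 16 |
| 7 | Middlesbrough | 49 | 41 | 8 | 73 | 67 | 63 | 5 | 66 | 9 |
| 8 | Bristol City | 59 | 53 | 6 | 70 | 64 | 62 | 2 | 65 | 11 |
| 9 | Nottingham Forest | 61 | 54 | 7 | 66 | 61 | 68 | -7 | 61 | 17 |
| 10 | Swansea City | 65 | 62 | 3 | 65 | 71 | 60 | 11 | 67 | 8 |
| 11 | Brentford | 73 | 59 | 14 | 64 | 66 | 54 | 13 | 69 | 4 |
| 12 | Sheffield Wednesday | 60 | 62 | -2 | 64 | 58 | 66 | -8 | 59 | 18 |
| 13 | Hull City | 66 | 68 | -2 | 62 | 60 | 69 | -9 | 57 | 20 |
| 14 | Birmingham City | 64 | 58 | 6 | 61 | 60 | 59 | 1 | 64 | 12 |
| 15 | Preston North End | 67 | 67 | 0 | 61 | 61 | 68 | -6 | 58 | 19 |
| 16 | Blackburn Rovers | 64 | 69 | -5 | 60 | 66 | 68 | -2 | 62 | 13 |
| 17 | Stoke City | 45 | 52 | -7 | 55 | 54 | 58 | -3 | 62 | 15 |
| 18 | Wigan Athletic | 51 | 64 | -13 | 52 | 69 | 70 | -1 | 62 | 14 |
| 19 | Queens Park Rangers | 53 | 71 | -18 | 51 | 66 | 62 | 3 | 66 | 10 |
| 20 | Reading | 49 | 66 | -17 | 47 | 50 | 85 | -35 | 45 | 23 |
| 21 | Millwall | 48 | 64 | -16 | 44 | 69 | 60 | 9 | 67 | 7 |
| 22 | Rotherham United | 52 | 83 | -31 | 40 | 64 | 79 | -15 | 55 | 21 |
| 23 | Bolton Wanderers | 29 | 78 | -49 | 32 | 44 | 70 | -26 | 48 | 22 |
| 24 | Ipswich Town | 36 | 77 | -41 | 31 | 45 | 82 | -37 | 45 | 24 |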

Summarising the table above can help identify the following based on the underlying expected goals numbers:

– Sheffield United were the strongest team promoted.

– Leeds United were unlucky not to be promoted and were the strongest team remaining in the league.

– Derby County overachieved to reach the playoffs and appear to be of mid-table quality.

– Brentford, Swansea City and almost relegated Millwall were superior to their finishing positions and were of a playoff pushing quality.

– Hull City and Reading were the weakest two teams to remain in the league.

– Rotherham United were the strongest team to be relegated.

Fast forward to this season and it’s striking how many of those have come to fruition. From my experience I have found expected goals to be a much better indicator of future performance than actual goals. This is the main reason why I place so much value in the use of this particular metric.

Expected Points

For those interested in expected points, labelled as xPts in the table, this is an additional metric using expected goals. My method is to look at the difference in expected goals of the two teams for a particular match.

Using one of the matches highlighted earlier, Preston North End 1 (1.6) – Queens Park Rangers 3 (1.1), gives an xG difference of +0.5 for Preston and -0.5 for QPR. The next step is to look at how often a team actually wins, draws or loses with this difference and multiply each frequency by the points earned for that outcome.

This is a simplistic approach as it just looks at the total xG not the number and quality of the individual chances which would impact the xPts. There are calculators available online to plug in the attempts to provide the probability but in absence of doing this in bulk I have devised this methodology.

For example if a team with +0.5 xG difference wins half of the matches, draws 30% of the time and loses the remainder this could be calculated as:

Expected points for a team with +0.5 xG difference

= (Team wins 50% of the time * 3 points for a win) + (Team draws 30% of the time * 1 point for a draw) + (Team loses 20% of the time * 0 points for a loss)

= (50% * 3) + (30% * 1) + (20% * 0)

= 1.8

Expected points for Preston North End would be 1.8.

Conversely, this would mean the team with a -0.5 xG difference would lose half of the matches, draw 30% of the time and win the remainder. This would be calculated as:

Expected points for a team with -0.5 xG difference

= (Team wins 20% of the time * 3 points for a win) + (Team draws 30% of the time * 1 point for a draw) + (Team loses 50% of the time * 0 points for a loss)

= (20% * 3) + (30% * 1) + (50% * 0)

= 0.9

Expected points for Queens Park Rangers would be 0.9.

The combined expected points won’t add up to 3 points as while a win distributes a total of 3 points, drawn games only distribute a total of 2 points.
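The arithmetic above fits in a few lines of Python. This is a minimal sketch using the illustrative 50% / 30% / 20% split from the example; the function name is made up.

```python
def expected_points(p_win, p_draw, p_loss):
    """Weight each outcome's probability by the points it earns (3, 1, 0)."""
    if abs(p_win + p_draw + p_loss - 1.0) > 1e-9:
        raise ValueError("probabilities should sum to 1")
    return 3 * p_win + 1 * p_draw + 0 * p_loss


preston = expected_points(0.50, 0.30, 0.20)  # the +0.5 xG difference side
qpr = expected_points(0.20, 0.30, 0.50)      # the -0.5 xG difference side

# The two sides share 3 points for a result but only 2 for a draw,
# so the combined total falls short of 3 by the draw probability.
print(round(preston, 2), round(qpr, 2), round(preston + qpr, 2))
```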

Due to the nature of the calculations it is best to group together similar values to ensure each banding has significant volume and also helps create a smooth curve to ensure the xPts increases as the xG difference increases.

The table below shows the values I use and shows Preston’s xPts to be 1.77 and QPR’s xPts to be 0.95 for the match in question.

| xG Difference | xPts Value |
| --- | --- |
| >3.20 | 2.78 |
| >2.70 to 3.20 | 2.62 |
| >2.10 to 2.70 | 2.45 |
| >1.50 to 2.10 | 2.28 |
| >1.00 to 1.50 | 2.11 |
| >0.75 to 1.00 | 1.94 |
| >0.45 to 0.75 | 1.77 |
| >0.30 to 0.45 | 1.60 |
| >0.00 to 0.30 | 1.43 |
| >-0.30 to 0.00 | 1.27 |
| >-0.45 to -0.30 | 1.11 |
| >-0.75 to -0.45 | 0.95 |
| >-1.00 to -0.75 | 0.80 |
| >-1.50 to -1.00 | 0.66 |
| >-2.10 to -1.50 | 0.52 |
| >-2.70 to -2.10 | 0.39 |
| >-3.20 to -2.70 | 0.26 |
| <= -3.20 | 0.15 |
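The banded lookup can be sketched in Python with the standard library’s `bisect` module. The band edges and xPts values are copied from the table above; the function and list names are made up, and each band is treated as “greater than the lower edge, up to and including the upper edge”, which is how I read the table.

```python
from bisect import bisect_left

# Upper edge of every band except the open-ended top one.
UPPER_EDGES = [-3.20, -2.70, -2.10, -1.50, -1.00, -0.75, -0.45, -0.30, 0.00,
               0.30, 0.45, 0.75, 1.00, 1.50, 2.10, 2.70, 3.20]

# xPts value for each band, lowest band first (one more value than edges).
XPTS = [0.15, 0.26, 0.39, 0.52, 0.66, 0.80, 0.95, 1.11, 1.27,
        1.43, 1.60, 1.77, 1.94, 2.11, 2.28, 2.45, 2.62, 2.78]


def xpts_from_xg_difference(xg_difference):
    """Look up the expected points for a single match's xG difference."""
    # bisect_left counts the edges strictly below the value, which is
    # exactly the index of the band the value falls into.
    return XPTS[bisect_left(UPPER_EDGES, xg_difference)]


print(xpts_from_xg_difference(0.5), xpts_from_xg_difference(-0.5))
```

Summing the looked-up value for every match a team plays gives the season xPts shown in the league table.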

Calculating Score Probabilities

Now that the expected goals data are summarised for each team, they can be used to predict the outcome of a future match. This is done by calculating the average projected goals for each team and spreading that average across the possible scores using a Poisson distribution. A Poisson distribution is used as its shape closely follows the distribution of goals scored in football matches. For those looking for a little more detail, the article below on the Pinnacle website is helpful.

https://www.pinnacle.com/en/betting-articles/Soccer/how-to-calculate-poisson-distribution/MD62MLXUMKMXZ6A8

To demonstrate the calculation for a match I will use my version of the 2018-19 Championship table shown earlier in the article to project the outcome of a fictional match between Preston North End and QPR assumed to be on the first day of the 2019-20 Championship season.

The first step is to calculate the average expected goals scored and expected goals conceded for the Championship. Across the whole season there were 1542 expected goals according to my model (1473 actual goals scored): 857 expected for the home team (836 actual) and 686 expected for the away team (637 actual), with the 24 Championship teams each playing 23 times at home (552 home teams in total) and 23 times away from home (552 away teams in total). These numbers can be used to calculate a number of averages.

Average expected goals scored by the home team = 857 / 552 = 1.55 goals per match

Average expected goals conceded by the home team = 686 / 552 = 1.24 goals per match

Average expected goals scored by the away team = 686 / 552 = 1.24 goals per match

Average expected goals conceded by the away team = 857 / 552 = 1.55 goals per match

To calculate the values for specific teams we need the expected goals data split by home and away performance. My data for the 2018-19 Championship table is shown below with values at 2 decimal places.

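| Team | HxGF | HxGA | HxPts | AxGF | AxGA | AxPts |
| --- | --- | --- | --- | --- | --- | --- |
| Aston Villa | 44.45 | 30.81 | 38.67 | 31.44 | 32.13 | 30.59 |
| Birmingham City | 32.74 | 25.01 | 35.85 | 27.66 | 34.07 | 28.10 |
| Blackburn Rovers | 37.15 | 28.33 | 35.78 | 29.16 | 39.74 | 26.57 |
| Bolton Wanderers | 21.44 | 31.75 | 25.11 | 22.56 | 37.87 | 22.83 |
| Brentford | 40.53 | 22.16 | 41.16 | 25.65 | 31.34 | 28.31 |
| Bristol City | 32.46 | 25.26 | 35.70 | 31.15 | 36.48 | 28.86 |
| Derby County | 34.68 | 24.22 | 37.61 | 23.89 | 40.54 | 23.23 |
| Hull City | 32.55 | 29.50 | 32.76 | 27.17 | 39.68 | 23.84 |
| Ipswich Town | 25.63 | 39.12 | 24.82 | 19.77 | 43.07 | 19.72 |
| Leeds United | 46.04 | 21.80 | 44.62 | 34.80 | 21.48 | 38.27 |
| Middlesbrough | 39.28 | 29.37 | 37.20 | 27.90 | 33.30 | 28.91 |
| Millwall | 35.98 | 27.05 | 36.95 | 32.90 | 33.17 | 30.18 |
| Norwich City | 42.03 | 23.78 | 42.02 | 38.10 | 35.20 | 32.98 |
| Nottingham Forest | 33.61 | 29.22 | 34.56 | 27.46 | 38.73 | 26.18 |
| Preston North End | 36.79 | 32.23 | 33.22 | 24.40 | 35.37 | 24.55 |
| Queens Park Rangers | 38.05 | 27.92 | 37.06 | 27.68 | 34.43 | 28.88 |
| Reading | 24.23 | 36.76 | 25.42 | 25.36 | 48.00 | 19.66 |
| Rotherham United | 37.41 | 40.45 | 29.76 | 26.56 | 38.18 | 24.99 |
| Sheffield United | 37.53 | 20.73 | 40.78 | 38.83 | 25.37 | 38.79 |
| Sheffield Wednesday | 33.36 | 29.46 | 33.73 | 24.99 | 37.03 | 24.78 |
| Stoke City | 29.39 | 24.00 | 34.75 | 25.06 | 33.92 | 27.28 |
| Swansea City | 41.49 | 27.18 | 38.34 | 29.16 | 32.83 | 28.64 |
| West Bromwich Albion | 44.73 | 29.42 | 39.59 | 30.69 | 34.93 | 28.48 |
| Wigan Athletic | 35.19 | 30.19 | 33.71 | 33.38 | 39.86 | 28.32 |
| League Total | 857 | 686 | | 686 | 857 | |
| League Average | 35.70 | 28.57 | | 28.57 | 35.70 | |
| Match Average | 1.55 | 1.24 | | 1.24 | 1.55 | |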

In the hypothetical example of Preston North End v QPR we will need to calculate the average expected goals specific to both teams by assessing their attacking strength, the opponent’s defending strength and league average performance using the following formulas:

Preston’s average expected goals at home to QPR

= Preston’s home attacking strength x QPR’s away defending strength x average home goals scored

Preston’s home attacking strength

= Preston’s HxGF / League Average HxGF

= 36.79 / 35.70

= 1.031

Anything over 1 implies better than the league average, or in this case Preston are expected to score 3.1% more at home than an average Championship team

QPR’s away defending strength

= QPR’s AxGA / League Average AxGA

= 34.43 / 35.70

= 0.964

Anything under 1 implies better than the league average, or in this case QPR are expected to concede 3.6% fewer away than an average Championship team

Preston’s average expected goals at home to QPR

= Preston’s home attacking strength x QPR’s away defending strength x average home goals scored

= 1.031 x 0.964 x 1.55

= 1.543

QPR’s average expected goals away to Preston

= QPR’s away attacking strength x Preston’s home defending strength x average away goals scored

QPR’s away attacking strength

= QPR’s AxGF / League Average AxGF

= 27.68 / 28.57

= 0.969

Anything over 1 implies better than the league average, or in this case QPR are expected to score 3.1% fewer away than an average Championship team

Preston’s home defending strength

= Preston’s HxGA / League Average HxGA

= 32.23 / 28.57

= 1.128

Anything under 1 implies better than the league average, or in this case Preston are expected to concede 12.8% more at home than an average Championship team

QPR’s average expected goals away to Preston

= QPR’s away attacking strength x Preston’s home defending strength x average away goals scored

= 0.969 x 1.128 x 1.24

= 1.355

To conclude we would expect the average scoreline to be:

Preston North End 1.543 – Queens Park Rangers 1.355
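The strength calculation above can be sketched in Python as follows. The inputs are the season totals and per-match averages from the home and away table; the function name and argument names are made up. Depending on how much the intermediate figures are rounded, the result can differ from the figures quoted above in the third decimal place.

```python
def projected_goals(attack_xgf, defence_xga, league_avg_total, match_avg):
    """Attacking strength x opponent defending strength x league match average.

    attack_xgf       -- the attacking team's season xGF (home or away split)
    defence_xga      -- the opponent's season xGA for the matching split
    league_avg_total -- league average season total for that split
    match_avg        -- league average goals per match for that split
    """
    attack_strength = attack_xgf / league_avg_total
    defence_strength = defence_xga / league_avg_total
    return attack_strength * defence_strength * match_avg


# Preston at home: their HxGF against QPR's AxGA, using home scoring averages.
preston_home = projected_goals(36.79, 34.43, 35.70, 1.55)

# QPR away: their AxGF against Preston's HxGA, using away scoring averages.
qpr_away = projected_goals(27.68, 32.23, 28.57, 1.24)

print(round(preston_home, 3), round(qpr_away, 3))
```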

Obviously teams do not score a decimal amount of goals therefore we need to distribute this average using Excel’s Poisson formula. The formula is structured in the form of

= POISSON(x, mean, cumulative)

You can calculate the probability the team scores a specific amount of goals by replacing the x with the number of goals, replacing the mean with the average expected goals calculated above and setting cumulative to false.

For example, the probability of Preston scoring 0 goals at home to QPR can be calculated as:

=POISSON(0, 1.543, FALSE)

=0.214

=21.4%

Repeating this for both teams up to 5 goals will produce the following values

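| Team | 0 | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- | --- |
| Preston North End | 21.4% | 33.0% | 25.4% | 13.1% | 5.0% | 1.5% |
| Queens Park Rangers | 25.8% | 34.9% | 23.7% | 10.7% | 3.6% | 1.0% |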
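Outside Excel, the same probability can be computed from the Poisson formula directly. The sketch below is a Python equivalent of POISSON(x, mean, FALSE) using only the standard library; the function name is made up.

```python
from math import exp, factorial


def poisson_pmf(goals, mean):
    """Probability of exactly `goals` goals when the average is `mean`."""
    return mean ** goals * exp(-mean) / factorial(goals)


preston_mean = 1.543  # Preston's projected goals from the calculation above

# Probability of Preston scoring 0, 1, 2, 3, 4 and 5 goals.
preston_distribution = [poisson_pmf(k, preston_mean) for k in range(6)]
print([round(p, 3) for p in preston_distribution])
```

The six values stop slightly short of 100% because scores of 6 or more are possible, just very unlikely.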

Calculating Match Probabilities

My model assumes that goals are scored independently of each other and therefore the probability of a specific scoreline can be calculated by multiplying the two teams’ probabilities together. A 0-0 scoreline would have a probability of 5.5% (21.4% x 25.8%) whereas a 1-1 will have a probability of 11.5% (33.0% x 34.9%).

Calculating the probability of a Preston win is simply a case of adding together all of the favourable scorelines (1-0, 2-0, 2-1, 3-0, 3-1, 3-2 etc.) which comes out as 41.8%. The probability of any draw is 24.7% and a QPR win is 33.5%.
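A minimal Python sketch of this step is below. It multiplies the two teams’ Poisson probabilities for every scoreline and adds each product to the home win, draw or away win total. The two averages are rebuilt from the rounded strengths shown earlier, and summing up to 10 goals per team is my assumption for the sketch rather than a rule of the model.

```python
from math import exp, factorial


def poisson_pmf(goals, mean):
    """Probability of exactly `goals` goals when the average is `mean`."""
    return mean ** goals * exp(-mean) / factorial(goals)


def match_probabilities(home_mean, away_mean, max_goals=10):
    """Return (home win, draw, away win), assuming independent scoring."""
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_mean) * poisson_pmf(a, away_mean)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win


home_mean = 1.031 * 0.964 * 1.55  # Preston at home
away_mean = 0.969 * 1.128 * 1.24  # QPR away

home, draw, away = match_probabilities(home_mean, away_mean)
print(f"{home:.1%} {draw:.1%} {away:.1%}")
```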

Armed with probabilities for the theoretical match, the next step is to compare these with the bookmakers odds to highlight if there are any differences and by how much. Bookmakers odds are traditionally either shown as fractions or decimals and some hypothetical odds for the match would be shown as the following.

Fractional Odds

| Preston North End | Draw | Queens Park Rangers |
| --- | --- | --- |
| 11/10 | 5/2 | 11/4 |

Decimal Odds

| Preston North End | Draw | Queens Park Rangers |
| --- | --- | --- |
| 2.1 | 3.5 | 3.75 |

To be able to assess where the model differs from the bookmakers odds it is important to convert the odds back to probabilities. This can be done using the following formulas for either of the odds to calculate the probabilities.

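| Match Outcome | Fractional Odds | Formula | Decimal Odds | Formula | Probability |
| --- | --- | --- | --- | --- | --- |
| Preston North End | 11/10 | =10/(11+10) | 2.1 | =1/2.1 | 47.6% |
| Draw | 5/2 | =2/(5+2) | 3.5 | =1/3.5 | 28.6% |
| Queens Park Rangers | 11/4 | =4/(11+4) | 3.75 | =1/3.75 | 26.7% |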

You may have noted that the bookmakers’ probabilities total more than 100%, 102.9% in this case. The excess over 100% is called the overround, and the total is always above 100% to ensure the bookmakers make an overall profit on the match, assuming they are able to obtain a fair split of bets across the various outcomes according to the probabilities.
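The two conversion formulas and the 102.9% total can be checked with a short Python sketch. The odds are the hypothetical ones from the tables above; the function names are made up.

```python
def implied_from_fractional(odds):
    """Convert fractional odds such as '11/10' to an implied probability."""
    numerator, denominator = (int(part) for part in odds.split("/"))
    return denominator / (numerator + denominator)


def implied_from_decimal(odds):
    """Convert decimal odds such as 2.1 to an implied probability."""
    return 1 / odds


fractional_odds = {
    "Preston North End": "11/10",
    "Draw": "5/2",
    "Queens Park Rangers": "11/4",
}

implied = {
    outcome: implied_from_fractional(odds)
    for outcome, odds in fractional_odds.items()
}

# The implied probabilities add up to more than 1; the excess is the margin.
book_total = sum(implied.values())
print({k: round(v, 3) for k, v in implied.items()}, round(book_total, 3))
```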

With the odds shown in probabilities this can then be compared to the probabilities from the expected goals model to see where there are any discrepancies.

| Match Outcome | Modelled Probability | Bookmakers Probability | Difference |
| --- | --- | --- | --- |
| Preston North End | 41.8% | 47.6% | -5.8% |
| Draw | 24.7% | 28.6% | -3.9% |
| Queens Park Rangers | 33.5% | 26.7% | +6.8% |

Both the model and the bookmakers believe the most likely outcome is a Preston North End win but the bookmakers believe it is more likely to occur than the model does. A Queens Park Rangers win is the only outcome the model estimates to be more likely than the bookmakers do, and although the likelihood is lower than a Preston win, this would be the value selection to make in this scenario.

To highlight why it is the selection, think of the scenario as a roll of a fair dice where you could bet on the following outcomes: 1, 2 or 3; 4 or 5; and 6. We know each individual number is equally likely to appear, so betting on the outcome should be based solely on the odds offered.

| Outcome | Modelled Probability | Bookmakers Odds (and Probability) | Difference |
| --- | --- | --- | --- |
| 1, 2 or 3 | 50.0% | 4/5 (55.6%) | -5.6% |
| 4 or 5 | 33.3% | 9/4 (30.8%) | +2.6% |
| 6 | 16.7% | 4/1 (20.0%) | -3.3% |

We all know that a 1, 2 or 3 is the most likely outcome but the odds available represent poor value and so over time we should expect to lose money betting on this outcome. It is important to recognise we are not expected to win every bet but ensure we are betting on outcomes that are more likely to occur than the bookmakers odds suggest, as highlighted by a positive difference value.
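One way to make “lose money over time” concrete is to work out the average profit per 1pt staked at the odds on offer. The sketch below does that for the three dice bets alongside the difference in probabilities. The expected profit calculation is an addition for illustration and is not part of my selection process, which uses the difference only; the function names are made up.

```python
def probability_difference(true_prob, numerator, denominator):
    """Modelled probability minus the probability implied by fractional odds."""
    return true_prob - denominator / (numerator + denominator)


def expected_profit_per_point(true_prob, numerator, denominator):
    """Average profit for a 1pt stake at fractional odds numerator/denominator."""
    winnings = numerator / denominator  # profit when the bet wins
    return true_prob * winnings - (1 - true_prob) * 1


dice_bets = {
    "1, 2 or 3": (3 / 6, 4, 5),
    "4 or 5": (2 / 6, 9, 4),
    "6": (1 / 6, 4, 1),
}

for outcome, (prob, num, den) in dice_bets.items():
    diff = probability_difference(prob, num, den)
    profit = expected_profit_per_point(prob, num, den)
    print(f"{outcome}: difference {diff:+.1%}, expected profit {profit:+.3f}pt")
```

Only the bet with a positive difference has a positive expected profit, which is why the sign of the difference is what matters.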

Staking Strategy

Once we have identified the bets to place, a Queens Park Rangers win in the hypothetical example, the final step is to place the bets.

The last inspiration was to read a book by Joe Peta called “Trading Bases: How a Wall Street Trader Made a Fortune Betting on Baseball”. The title is a perfect synopsis of the book but one of the useful sections highlights how a bigger stake should be placed on bets with a bigger difference between the modelled probability and bookmakers probability, the margin. The logic makes perfect sense in that the bigger the perceived error in the bookmakers odds the bigger the stake should be to capitalise on it.

My staking plan roughly follows the approach he adopted in the book and is detailed in the table below.

| Margin between Modelled Probability and Bookmaker Probability | % of Bank Staked | Stake for a 100pt Bank |
| --- | --- | --- |
| >15% | 2.0% | 2pt |
| >13-15% | 1.5% | 1.5pt |
| >11-13% | 1.0% | 1pt |
| >9-11% | 0.5% | 0.5pt |
| >6-9% | 0.4% | 0.4pt |
| >3-6% | 0.2% | 0.2pt |

The hypothetical bet on Queens Park Rangers with a +6.8% margin means the selection would have been a 0.4pt win. For the dice roll, the margin of +2.6% would not have met the minimum threshold I use of 3%.
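The staking table translates into a small lookup. This is a minimal Python sketch with the bands copied from the table above; the function name is made up, and margins are passed in percentage points.

```python
# (lower bound of margin in percentage points, % of bank staked), widest first.
STAKING_BANDS = [
    (15, 2.0),
    (13, 1.5),
    (11, 1.0),
    (9, 0.5),
    (6, 0.4),
    (3, 0.2),
]


def stake_points(margin, bank=100.0):
    """Return the stake in points for a margin, or 0 if it is 3% or less."""
    for lower_bound, percent_of_bank in STAKING_BANDS:
        if margin > lower_bound:
            return bank * percent_of_bank / 100
    return 0.0


print(stake_points(6.8))  # the QPR selection
print(stake_points(2.6))  # the dice example, below the minimum threshold
```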

Paper Trading

This now brings the story up to the start of the 2019/20 season where I thought it would be a good idea to paper trade the selections (i.e. record the outcome of the selections identified but with no bets placed) to assess the volume of bets selected, outcome of the bets and the time needed to follow the model.

The first stumbling block was how to assess promoted/relegated teams in terms of their expected goals quality. Obviously using the expected goals from the Championship table for Rotherham United, Bolton Wanderers and Ipswich Town, the three relegated teams, in the League One fixtures would have underestimated their actual quality as their values were achieved against a higher calibre of opposition.

The only sensible and suitable solution I could see was to use cup fixtures between different leagues to estimate the adjustments required for promoted and relegated teams. Obviously teams are not always at full strength but the numbers provided gave an appropriate outcome.

Essentially this means that relegated teams had their xGF increased and xGA reduced to estimate what this performance would have equated to if playing in the division below. The reverse of this is done for promoted teams.

Promoted Teams

| Previous Season League | Next Season League | Home xGF Adjustment | Home xGA Adjustment | Away xGF Adjustment | Away xGA Adjustment |
| --- | --- | --- | --- | --- | --- |
| Championship | Premier League | x 0.704 | x 1.426 | x 0.713 | x 1.467 |
| League One | Championship | x 0.764 | x 1.389 | x 0.761 | x 1.433 |
| League Two | League One | x 0.815 | x 1.296 | x 0.815 | x 1.259 |

Relegated Teams

| Previous Season League | Next Season League | Home xGF Adjustment | Home xGA Adjustment | Away xGF Adjustment | Away xGA Adjustment |
| --- | --- | --- | --- | --- | --- |
| Premier League | Championship | x 1.411 | x 0.702 | x 1.403 | x 0.677 |
| Championship | League One | x 1.296 | x 0.705 | x 1.287 | x 0.691 |
| League One | League Two | x 1.236 | x 0.779 | x 1.239 | x 0.800 |

The numbers show that there is a bigger difference between leagues the higher the pyramid you go. An example of how this is applied for one of the relegated teams, Rotherham United, is shown below:

| Scenario | HxGF | HxGA | AxGF | AxGA | Performance |
| --- | --- | --- | --- | --- | --- |
| 2018-19 Championship Performance | 37.41 | 40.45 | 26.56 | 38.18 | 21st in Championship |
| 2018-19 Championship Performance Adjusted to League One Standard | 48.48 (37.41 x 1.296) | 28.52 (40.45 x 0.705) | 34.18 (26.56 x 1.287) | 26.38 (38.18 x 0.691) | 3rd in League One (or 1st with Luton and Barnsley promoted) |

This adjustment calculated Rotherham United to be the strongest team in League One for the following season aided by the fact the two teams of a higher standard, Luton Town and Barnsley, were both promoted to the Championship. Ipswich Town and Bolton Wanderers were equivalent to mid table teams in League One.
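Applying the adjustment is a straight multiplication of each figure by its factor. A minimal Python sketch for Rotherham United, using the Championship to League One factors from the relegated teams table; the dictionary and function names are made up.

```python
# Championship -> League One factors from the relegated teams table.
RELEGATED_TO_LEAGUE_ONE = {
    "HxGF": 1.296,
    "HxGA": 0.705,
    "AxGF": 1.287,
    "AxGA": 0.691,
}

# Rotherham United's 2018-19 Championship figures.
rotherham = {"HxGF": 37.41, "HxGA": 40.45, "AxGF": 26.56, "AxGA": 38.18}


def adjust_for_new_league(team_xg, factors):
    """Scale each home/away xG figure by the matching league factor."""
    return {key: round(value * factors[key], 2) for key, value in team_xg.items()}


rotherham_league_one = adjust_for_new_league(rotherham, RELEGATED_TO_LEAGUE_ONE)
print(rotherham_league_one)
```

The adjusted figures then feed into the strength calculation in place of the raw Championship numbers.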

It’s also important to highlight that I use a rolling season’s data for the calculation of each team’s strength, so it is a case of replacing the oldest game in the 46 game period with the new one each game week to ensure the data is always up to date.

The additional adjustments caused a slight delay, meaning paper trading didn’t actually begin until October 2019, and here are my results to date…
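In code, the rolling window can be sketched with a fixed-length `deque`, which drops the oldest match automatically when a new one is added. This is a simplified Python illustration: the class name is made up, the match values are invented, and it ignores the home and away split that the model uses.

```python
from collections import deque


class RollingXG:
    """Keep a team's xG for and against over its most recent matches."""

    def __init__(self, window=46):
        # A deque with maxlen discards the oldest entry once it is full.
        self.xg_for = deque(maxlen=window)
        self.xg_against = deque(maxlen=window)

    def add_match(self, xg_for, xg_against):
        self.xg_for.append(xg_for)
        self.xg_against.append(xg_against)

    def totals(self):
        """Return (xGF, xGA) summed over the current window."""
        return sum(self.xg_for), sum(self.xg_against)


# A short window of 3 matches makes the behaviour easy to see.
team = RollingXG(window=3)
for xgf, xga in [(1.0, 2.0), (1.5, 0.5), (2.0, 1.0), (0.5, 1.5)]:
    team.add_match(xgf, xga)

# The first match (1.0, 2.0) has dropped out of the window.
print(team.totals())
```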

Is it Successful?

The first factor I found is that the model throws up a lot of selections. Across the top four leagues in England the model was highlighting a selection for 60% of the matches. So a typical full weekend schedule of 10 Premier League matches, 12 Championship matches, 11 League One matches (due to no Bury) and 12 League Two matches would highlight around 25-30 teams to bet on. A lot more than I was expecting.

Secondly, the model didn’t select odds on selections very often. This inevitably meant the model fancied outsiders which made sense as I had often read that favourites are typically underpriced due to their popularity in the Saturday accumulators. This meant the strike rate would be lower than expected and that I would need a constant supply of odds against selections to win to ensure it remained profitable.

Five and a half months in and the model is indeed showing a profit. From a starting bank of 100 points it would now stand at 120.64 points at the point of lockdown. One detail that has surprised me is the consistency of the results.

All individual months have shown a profit bar one and the amount has been roughly the same aided somewhat due to a fairly uniform win percentage.

Bet History by Month

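| Date | Total Games | Bets | Bets % | Wins | Wins % | Stake | Return | Profit | ROI | Bank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Oct-19 | 195 | 133 | 68% | 46 | 35% | 62.50 | 67.58 | 5.08 | 8% | 105.08 |
| Nov-19 | 166 | 111 | 67% | 36 | 32% | 62.10 | 61.56 | -0.54 | -1% | 104.54 |
| Dec-19 | 243 | 141 | 58% | 46 | 33% | 80.90 | 84.43 | 3.53 | 4% | 108.08 |
| Jan-20 | 216 | 126 | 58% | 40 | 32% | 60.60 | 65.92 | 5.32 | 9% | 113.39 |
| Feb-20 | 265 | 160 | 60% | 49 | 31% | 79.20 | 84.29 | 5.09 | 6% | 118.49 |
| Mar-20 | 69 | 43 | 62% | 11 | 26% | 23.70 | 25.86 | 2.16 | 9% | 120.64 |
| Total | 1199 | 714 | 60% | 228 | 32% | 369.00 | 389.64 | 20.64 | 6% | |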

Now this still feels like a small sample and is only really half a season so I’m not sure if this is down to luck, expected goals data not fully factored into bookmaker odds yet or a combination of the two. I’m not entirely sure at what point I will know if this is not luck (perhaps someone reading will be able to help) but I know nobody likes to follow a losing model for too long at too much of an expense.

To provide further context of the results to date:

Bet History by League

– League Two has the highest strike rate but is the only league not to make a profit.

– The Premier League has the lowest strike rate and minimal profit. The league is probably the most wagered on in the world, particularly by large syndicates, and therefore the odds should be the most accurate and the toughest to profit from.

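| League | Total Games | Bets Made | Bets Made % | Wins | Wins % | Stake | Return | Profit | ROI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Premier League | 239 | 155 | 65% | 42 | 27% | 84.00 | 85.28 | 1.27 | 2% |
| Championship | 360 | 227 | 63% | 68 | 30% | 128.70 | 141.56 | 12.86 | 10% |
| League One | 292 | 160 | 55% | 52 | 33% | 73.70 | 86.26 | 12.56 | 17% |
| League Two | 312 | 174 | 56% | 67 | 39% | 83.20 | 76.92 | -6.28 | -8% |
| Total | 1203 | 716 | 60% | 229 | 32% | 369.60 | 390.01 | 20.41 | 6% |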

Bet History by Result

– Away wins are by far the most profitable outcome for the model. This reinforces the initial idea, with outsiders tending to be the away team and home teams often over bet and providing poor value.

– The model very rarely selects a draw

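| Result | Bets Made | Wins | Wins % | Stake | Return | Profit | ROI |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Home | 389 | 143 | 37% | 216.10 | 192.62 | -23.48 | -11% |
| Draw | 12 | 2 | 17% | 3.30 | 3.95 | 0.65 | 20% |
| Away | 315 | 84 | 27% | 150.20 | 193.44 | 43.24 | 29% |
| Total | 716 | 229 | 32% | 369.60 | 390.01 | 20.41 | 6% |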

Bet History by Model Percentage

– There is a rough correlation between the modelled percentage and the win percentage, but the win percentage is surprisingly low for the most likely outcomes.

– The 20%-50% modelled probability section is the most successful for profits.

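| Model Percentage | Bets Made | Wins | Wins % | Stake | Return | Profit | ROI |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 70%+ | 9 | 4 | 44% | 6.20 | 4.65 | -1.55 | -25% |
| 60-70% | 39 | 23 | 59% | 35.70 | 41.31 | 5.61 | 16% |
| 50-60% | 120 | 50 | 42% | 81.20 | 66.56 | -14.64 | -18% |
| 40-50% | 192 | 75 | 39% | 101.40 | 110.50 | 9.10 | 9% |
| 30-40% | 177 | 47 | 27% | 83.60 | 80.32 | -3.28 | -4% |
| 20-30% | 141 | 25 | 18% | 51.10 | 74.76 | 23.66 | 46% |
| 10-20% | 38 | 5 | 13% | 10.40 | 11.90 | 1.50 | 14% |
| 0-10% | 0 | 0 | 0% | 0.00 | 0.00 | 0.00 | |
| Total | 716 | 229 | 32% | 369.60 | 390.01 | 20.41 | 6% |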

Bet History by Odds

– Odds on selections are rarely highlighted but do turn a slight profit

– The model is profitable for any selection priced at 6/4 or higher with strong crossover from the Away win population.

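| Odds | Bets Made | Wins | Wins % | Stake | Return | Profit | ROI |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Odds On | 53 | 31 | 58% | 20.20 | 22.12 | 1.92 | 10% |
| Evens – <6/4 | 142 | 56 | 39% | 85.00 | 69.56 | -15.45 | -18% |
| 6/4 – <2/1 | 128 | 55 | 43% | 61.80 | 75.03 | 13.23 | 21% |
| 2/1 – <3/1 | 158 | 50 | 32% | 88.40 | 104.14 | 15.74 | 18% |
| 3/1+ | 235 | 37 | 16% | 114.20 | 119.17 | 4.97 | 4% |
| Total | 716 | 229 | 32% | 369.60 | 390.01 | 20.41 | 6% |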

Bet History by Stake

– Probably the most important one to consider. Disappointingly the selections with the biggest margin, and therefore biggest stake, have a low strike rate and return a loss.

– Beyond that, it shows the benefit of the staking strategy. The second most confident bucket provides a large profit, with the least confident bucket the only other one providing a loss.

– Interesting to highlight that a flat staking plan would have shown a loss.

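| % of Bank Staked | Bets Made | Wins | Wins % | Stake | Return | Profit | ROI |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2.0% | 39 | 11 | 28% | 78.00 | 67.70 | -10.30 | -13% |
| 1.5% | 35 | 16 | 46% | 52.50 | 83.45 | 30.95 | 59% |
| 1.0% | 49 | 15 | 31% | 49.00 | 49.95 | 0.95 | 2% |
| 0.5% | 105 | 39 | 37% | 52.50 | 56.02 | 3.52 | 7% |
| 0.4% | 200 | 62 | 31% | 80.00 | 80.86 | 0.86 | 1% |
| 0.2% | 288 | 86 | 30% | 57.60 | 52.04 | -5.56 | -10% |
| Total | 716 | 229 | 32% | 369.60 | 390.01 | 20.41 | 6% |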

What’s Next?

The next step was to finish paper trading for this season and then to start investing financially in the model for the 2021/22 football season. Unfortunately the Coronavirus has played havoc with that, and with matches being played behind closed doors it is unknown how home advantage will be affected. The model is built on data from football played under normal circumstances, so betting on games with no fans seems unsuitable. It will be interesting to see how home advantage is impacted for the remaining games this season to help direct the plan for next season.

That’s everything. Over 6,000 words and numerous tables/formulas. It’s an article a younger me would have loved to read at the start of my journey. I hope someone has managed to get to the end and found it enjoyable, helpful or mildly interesting.
