key: cord-0199999-egls5o4n authors: Ekstrom, Claus Thorn; Jensen, Andreas Kryger title: Having a Ball: evaluating scoring streaks and game excitement using in-match trend estimation date: 2020-12-22 journal: nan DOI: nan sha: 32eb0cb0858d2bf22ff02ef2193a71a9d3af4393 doc_id: 199999 cord_uid: egls5o4n

Many popular sports involve matches between two teams or players where each team has the possibility of scoring points throughout the match. While the overall match winner and result are interesting, they convey little information about the underlying scoring trends throughout the match. Modeling approaches that accommodate a finer granularity of the score difference throughout the match are needed to evaluate in-game strategies and to discuss scoring streaks, team strengths, and other aspects of the game. We propose a latent Gaussian process to model the score difference between two teams and introduce the Trend Direction Index as an easily interpretable probabilistic measure of the current trend in the match as well as a measure of post-game trend evaluation. In addition, we propose the Excitement Trend Index, the expected number of monotonicity changes in the running score difference, as a measure of overall game excitement. Our proposed methodology is applied to all 1143 matches from the 2019-2020 National Basketball Association (NBA) season. We show how the trends can be interpreted in individual games and how the excitement score can be used to cluster teams according to how exciting they are to watch.

Sports analytics receives increasing attention in statistics, not just for match prediction or betting but also for game evaluation, in-game and post-game coaching purposes, and for setting strategies and tactics in future matches.
Many popular sports such as football (soccer), basketball, boxing, table tennis, volleyball, American football, and handball involve matches between two teams or players where each team has the possibility of scoring points throughout the match. Several research papers seek to predict the end match result (e.g., Karlis and Ntzoufras (2003); Groll et al. (2019); Gu and Saaty (2019); Cattelan, Varin, and Firth (2013)) in order to infer the match winner and potentially the winner of a tournament (Baboota and Kaur (2018)). While the overall match result is highly interesting, it conveys very little information about the individual development and trends throughout the match. Modeling approaches that allow finer granularity of the running score difference throughout the match are therefore needed. The trend in the score difference between the two teams is a proxy for their underlying strengths. In particular, sustained periods of time where the score difference increases suggest that one team outperforms the other, whereas periods where the teams are constantly catching up to each other suggest that the teams' strengths in those periods are similar. Modeling the local trend of the score difference will therefore reflect several aspects of the game, in particular the team strengths, game dynamics, and momentum as they develop through the match.

Figure 1: The running score difference shows that the Lakers pulled ahead until the third quarter, where Miami Heat started to keep up the scoring pace before overtaking the Lakers and reducing the lead.

In this paper, we will consider the score difference between two teams as a latent Gaussian process and use the Trend Direction Index (TDI) from Jensen and Ekstrøm (2020b) as a measure to evaluate the local probability of the monotonicity of the latent process at a given time point during a match.
The Trend Direction Index uses a Bayesian framework to provide a direct answer to questions such as "What is the probability that the latent process is increasing (i.e., that one team is doing better than another) at a given time point?". This allows real-time evaluation of the score difference trend at the current time point in-game and provides post-game inference about the "hot" periods of a match where one team outperformed the other. Furthermore, we present the Excitement Trend Index (ETI) as an objective measure of spectator excitement in a given match. The ETI is defined as the expected number of times that the score difference changes monotonicity during a match. If the score difference changes monotonicity often, this reflects a game where both teams frequently score, whereas a game with a low ETI represents a one-sided match where one team is doing consistently better than the other over sustained periods of time.

Other authors have considered using continuous processes to model the score difference of matches. Gabel and Redner (2012) show that NBA basketball score differences are well described by a continuous-time anti-persistent random walk, which suggests that a latent Gaussian process might be viable. Chen, Dawson, and Müller (2020) consider a functional data model for dynamic behavior of cross-sectional ranks over time. While this approach can disentangle the individual and population effects on the ranks of the individual teams over time, its setup is not geared towards analyzing single matches. We use an idea similar to Chen and Fan (2018), but they do not have the same underlying Gaussian intensity process that enables us to make various Bayesian probabilistic statements throughout and after the game.

The paper is structured as follows.
In the next section we introduce trend modeling of score differences through a latent Gaussian process and define the Trend Direction Index and the Excitement Trend Index, which capture the local trends in monotonicity and game excitement, respectively. In Section 4 we apply our proposed methodology to analyze the final match of the playoffs and to evaluate the game excitement distribution of the season by considering the ETIs from all 1143 matches from the 2019-2020 NBA season. We show how this distribution can be used to assess relative match excitement and how the ETI can be used to classify teams according to their average level of match excitement. We conclude with a discussion in Section 5. Materials to reproduce this manuscript and its analyses can be found at Jensen and Ekstrøm (2020a).

Our model is based on the observed score differences $D_m(t)$ in a given match indexed by $m$ and time $t$. For each match we observe the random variables $D_m = (t_{mi}, D_{mi})_{0 < i \le J_m}$, where $D_m(t) > 0$ means that the away team is leading at time $t$. We assume that the observed data from a given match are noisy realizations of a latent smooth, random function defined in continuous time and evaluated at the random time points where scorings occur. Let $d_m$ be the latent function from which the realizations $D_m$ are generated. Our objective is to infer $d_m$ and its time dynamics from $D_m$. To this end we propose a model where $d_m$ is a Gaussian process defined on a compact subset $I_m$ of the real line corresponding to the duration of the $m$'th game, and where the observed data, conditional on the scoring times and the values of the latent process at these times, are independently normally distributed random variables with a match-specific variance. This model can be stated hierarchically as

$$
\begin{aligned}
\Theta_m \mid \Psi_m &\sim H(\Psi_m) \\
d_m(t) \mid \Theta_m &\sim \mathcal{GP}\left(\mu_{\beta_m}(t), C_{\theta_m}(s, t)\right) \\
D_m(t_{mi}) \mid t_m, d_m(t_{mi}), \Theta_m &\overset{\text{ind}}{\sim} N\left(d_m(t_{mi}), \sigma_m^2\right)
\end{aligned}
\tag{1}
$$

where $\Theta_m = (\beta_m, \theta_m, \sigma_m^2)$ is a vector of hyper-parameters governing the dynamics of the latent Gaussian process with a prior distribution $H$ indexed by parameters $\Psi_m$, and $t_m = (t_{m1}, \ldots, t_{mJ_m})$ is the vector of time points where scorings occur in the match. The functions $\mu_{\beta_m}$ on $I_m$ and $C_{\theta_m}$ on $I_m \times I_m$ are the prior mean and covariance functions of the latent Gaussian process, and $\sigma_m^2$ is the variance characterizing the magnitude of the deviations between the observed score differences and the values of the latent process.

A Gaussian process is characterized by the joint normality of the distributions that result from evaluating the process at any finite set of time points (Rasmussen and Williams (2006)). Specifically, for any finite set $t^* \subset I_m$ it follows that the vector $d_m(t^*) \mid \Theta_m$ is distributed as $N(\mu_{\beta_m}(t^*), C_{\theta_m}(t^*, t^*))$, where $\mu_{\beta_m}(t^*)$ is the vector generated by evaluating the prior mean function $\mu_{\beta_m}(t)$ at $t^*$ and $C_{\theta_m}(t^*, t^*)$ is the covariance matrix generated by evaluating the prior covariance function $C_{\theta_m}(s, t)$ at $t^* \times t^*$. Using the properties of multivariate normal distributions, the posterior distribution $d_m(t^*) \mid D_m, \Theta_m$ is also multivariate normal. This facilitates Bayesian estimation of the distribution of the latent process governing the score difference given the observed data from each match.

In addition to obtaining inference for the latent process we may also estimate its time dynamics. This follows since a Gaussian process along with its time derivatives (provided they exist) is distributed as a multivariate Gaussian process (Cramer and Leadbetter (1967)). We may therefore augment the hierarchical model in Equation (1) with an additional latent structure of the first and second derivatives of $d_m$ with respect to time as

$$
\begin{pmatrix} d_m(s) \\ d_m'(t) \\ d_m''(u) \end{pmatrix} \Bigm| \Theta_m \sim \mathcal{GP}\left(
\begin{pmatrix} \mu_{\beta_m}(s) \\ \mu_{\beta_m}'(t) \\ \mu_{\beta_m}''(u) \end{pmatrix},
\begin{pmatrix}
C_{\theta_m}(s, s') & \partial_2 C_{\theta_m}(s, t') & \partial_2^2 C_{\theta_m}(s, u') \\
\partial_1 C_{\theta_m}(t, s') & \partial_1 \partial_2 C_{\theta_m}(t, t') & \partial_1 \partial_2^2 C_{\theta_m}(t, u') \\
\partial_1^2 C_{\theta_m}(u, s') & \partial_1^2 \partial_2 C_{\theta_m}(u, t') & \partial_1^2 \partial_2^2 C_{\theta_m}(u, u')
\end{pmatrix}
\right)
\tag{2}
$$

where $'$ and $''$ denote the first and second time derivatives and $\partial_j^k$ is the $k$'th order partial derivative with respect to the $j$'th variable. Combining the models in Equations (1) and (2) we obtain explicit expressions for the posterior distributions $d_m' \mid D_m, \Theta_m$ and $d_m'' \mid D_m, \Theta_m$.
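As an illustration of how the posterior of the first derivative follows from the joint normality described above, the following is a minimal sketch and not the authors' implementation. It assumes a constant prior mean `beta` and a squared exponential covariance with hypothetical parameter names `alpha` (magnitude) and `rho` (length scale), and it treats the hyper-parameters as fixed; all function names are introduced here for illustration only.

```python
import numpy as np


def se_kernel(s, t, alpha, rho):
    """Squared exponential covariance C(s, t) = alpha^2 exp(-(s - t)^2 / (2 rho^2))."""
    diff = s[:, None] - t[None, :]
    return alpha**2 * np.exp(-diff**2 / (2.0 * rho**2))


def d1_se_kernel(s, t, alpha, rho):
    """Cov(d'(s), d(t)): partial derivative of C with respect to its first argument."""
    diff = s[:, None] - t[None, :]
    return -diff / rho**2 * se_kernel(s, t, alpha, rho)


def d1d2_se_kernel(s, t, alpha, rho):
    """Cov(d'(s), d'(t)): mixed partial derivative of C in both arguments."""
    diff = s[:, None] - t[None, :]
    return (1.0 / rho**2 - diff**2 / rho**4) * se_kernel(s, t, alpha, rho)


def posterior_derivative(t_obs, y, t_new, beta, alpha, rho, sigma):
    """Posterior mean and covariance of d'(t_new) given noisy observations y at t_obs."""
    # Covariance of the observations: latent covariance plus independent noise.
    K = se_kernel(t_obs, t_obs, alpha, rho) + sigma**2 * np.eye(len(t_obs))
    L = np.linalg.cholesky(K)
    # Solve K w = (y - beta) using the Cholesky factor.
    w = np.linalg.solve(L.T, np.linalg.solve(L, y - beta))
    # Cross-covariance between d'(t_new) and the observations.
    Kd = d1_se_kernel(t_new, t_obs, alpha, rho)
    # The derivative of a constant prior mean is zero, so only the data term remains.
    mean = Kd @ w
    V = np.linalg.solve(L, Kd.T)
    cov = d1d2_se_kernel(t_new, t_new, alpha, rho) - V.T @ V
    return mean, cov
```

The same pattern extends to the second derivative by differentiating the kernel once more in each argument. In a fully Bayesian analysis these quantities would be evaluated for each posterior draw of the hyper-parameters rather than at a single fixed value.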
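Specifically, the posterior joint distribution of the latent processes is the following multivariate Gaussian process

$$
\begin{pmatrix} d_m(s) \\ d_m'(t) \\ d_m''(u) \end{pmatrix} \Bigm| D_m, \Theta_m \sim \mathcal{GP}\left(
\begin{pmatrix} \mu_{d_m}(s) \\ \mu_{d_m'}(t) \\ \mu_{d_m''}(u) \end{pmatrix},
\begin{pmatrix}
\Sigma_{d_m}(s, s') & \Sigma_{d_m d_m'}(s, t') & \Sigma_{d_m d_m''}(s, u') \\
\Sigma_{d_m' d_m}(t, s') & \Sigma_{d_m'}(t, t') & \Sigma_{d_m' d_m''}(t, u') \\
\Sigma_{d_m'' d_m}(u, s') & \Sigma_{d_m'' d_m'}(u, t') & \Sigma_{d_m''}(u, u')
\end{pmatrix}
\right)
\tag{3}
$$

where explicit expressions for the posterior mean and covariance functions are given in the online Supplementary Material. Consequently, we can sample from this posterior joint distribution at any finite number of time points, as this corresponds to sampling from a high-dimensional normal distribution. We use the posterior samples of the first and second time derivatives of the latent process to characterize the dynamical properties of each match through the Trend Direction Index and the Excitement Trend Index.

We define the Trend Direction Index (TDI) of a particular match $m$ as the local posterior probability that $d_m$ is an increasing function at a time point $t \in I_m$. Under our model this is equal to

$$
\mathrm{TDI}_m(t \mid \Theta_m) = P\left(d_m'(t) > 0 \mid D_m, \Theta_m\right) = \frac{1}{2} + \frac{1}{2} \operatorname{erf}\left(\frac{\mu_{d_m'}(t)}{\sqrt{2}\, \Sigma_{d_m'}(t, t)^{1/2}}\right)
\tag{4}
$$

where $\operatorname{erf}(x) = 2\pi^{-1/2} \int_0^x \exp(-u^2)\, du$ is the error function and $\mu_{d_m'}$ and $\Sigma_{d_m'}$ are the posterior mean and covariance functions of the time derivative defined in Equation (3). The TDI quantifies the probability that one team is currently increasing the difference in scores or, equivalently, that the team is changing the trend in its favor. A TDI equal to 50% means that the game is in a stagnant state. We note that the TDI is symmetric with respect to the reference team in the definition of the score difference: if the reference team is switched, then the TDI changes to $1 - \mathrm{TDI}$.

For each match we assign its Excitement Trend Index, $\mathrm{ETI}_m$, as a global measure of game excitement. The index is defined as the expected number of changes in monotonicity of the posterior distribution of $d_m$, which is equivalent to the expected number of zero-crossings of the posterior distribution of $d_m'$. We hence define

$$
\mathrm{ETI}_m \mid \Theta_m = \int_{I_m} d\mathrm{ETI}_m(t \mid \Theta_m)\, dt, \qquad
d\mathrm{ETI}_m(t \mid \Theta_m) = \lambda_m(t)\, \phi\left(\frac{\mu_{d_m'}(t)}{\Sigma_{d_m'}(t, t)^{1/2}}\right) \left(2\phi(\zeta_m(t)) + \zeta_m(t) \operatorname{erf}\left(\frac{\zeta_m(t)}{\sqrt{2}}\right)\right)
\tag{5}
$$

where $\phi$ is the standard normal density function, and $\lambda_m$, $\omega_m$ and $\zeta_m$ are defined as

$$
\begin{aligned}
\lambda_m(t) &= \frac{\Sigma_{d_m''}(t, t)^{1/2}}{\Sigma_{d_m'}(t, t)^{1/2}} \left(1 - \omega_m(t)^2\right)^{1/2} \\
\omega_m(t) &= \frac{\Sigma_{d_m' d_m''}(t, t)}{\Sigma_{d_m'}(t, t)^{1/2}\, \Sigma_{d_m''}(t, t)^{1/2}} \\
\zeta_m(t) &= \frac{\mu_{d_m'}(t)\, \Sigma_{d_m''}(t, t)^{1/2}\, \omega_m(t)\, \Sigma_{d_m'}(t, t)^{-1/2} - \mu_{d_m''}(t)}{\Sigma_{d_m''}(t, t)^{1/2} \left(1 - \omega_m(t)^2\right)^{1/2}}
\end{aligned}
$$

The derivation of the expression for $d\mathrm{ETI}_m$ can be found in the online Supplementary Material to Jensen and Ekstrøm (2020b).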
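To make the two indices concrete, the following sketch evaluates them at a single time point from the posterior moments of the first and second derivatives. It is an illustration under stated assumptions, not the authors' code: the function and argument names are hypothetical, and the local excitement rate is written as the standard Kac-Rice expression for the expected rate of zero-crossings of a Gaussian process (here the trend) given the joint normal distribution of that process and its derivative.

```python
import math


def std_normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def trend_direction_index(mu_d1, var_d1):
    """Posterior probability that the trend d'(t) is positive at a time point.

    mu_d1 and var_d1 are the posterior mean and variance of d'(t).
    """
    return 0.5 + 0.5 * math.erf(mu_d1 / math.sqrt(2.0 * var_d1))


def local_eti(mu_d1, mu_d2, var_d1, var_d2, cov_d1_d2):
    """Expected rate of zero-crossings of d'(t) at a time point (Kac-Rice formula).

    mu_d2 and var_d2 are the posterior mean and variance of d''(t), and
    cov_d1_d2 is the posterior covariance between d'(t) and d''(t).
    """
    s1, s2 = math.sqrt(var_d1), math.sqrt(var_d2)
    omega = cov_d1_d2 / (s1 * s2)  # posterior correlation of d'(t) and d''(t)
    lam = s2 / s1 * math.sqrt(1.0 - omega**2)
    zeta = (mu_d1 * s2 * omega / s1 - mu_d2) / (s2 * math.sqrt(1.0 - omega**2))
    expected_abs = 2.0 * std_normal_pdf(zeta) + zeta * math.erf(zeta / math.sqrt(2.0))
    return lam * std_normal_pdf(mu_d1 / s1) * expected_abs
```

Switching the reference team flips the signs of both posterior means while leaving the covariances unchanged, so in this sketch `trend_direction_index` maps to one minus its value and `local_eti` is unchanged.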
While no closed-form expression for $\mathrm{ETI}_m \mid \Theta_m$ seems to exist, the integration can be performed numerically. We note that the ETI is also invariant with respect to the choice of reference team in the definition of the score differences, as it is defined as the expected number of both up- and down-crossings at zero of the posterior trend. Both TDI and ETI as defined in Equations (4) and (5) are conditional on the hyper-parameters $\Theta_m$.

A completion of the model in Equation (1) requires a specification of the prior mean and covariance functions for the latent process. The choice of these is application specific and can be based on prior knowledge of the game dynamics. We refer to the discussion in Jensen and Ekstrøm (2020b) for more information on such choices. In our application we used a constant prior mean, $\mu_{\beta_m}(t) = \beta_m$, and the squared exponential covariance function given by

$$
C_{\theta_m}(s, t) = \alpha_m^2 \exp\left(-\frac{(s - t)^2}{2\rho_m^2}\right), \qquad \theta_m = (\alpha_m, \rho_m)
$$

These assumptions ensure well-defined and infinitely differentiable sample paths of $d_m$. For the hyper-parameters $\Theta_m$ we used independent, heavy-tailed distributions with a moderate variance centered at the marginal maximum likelihood estimates. In the prior specification, $T_{df}$ denotes a location-scale T distribution with $df$ degrees of freedom, $T^+_{df}$ denotes the same distribution truncated to the positive real line, and the superscript $\mathrm{ML}$ denotes the marginal maximum likelihood estimate of the corresponding hyper-parameter. By the properties of the model in Equation (1), the marginal log-likelihood function has the following closed-form expression

$$
\log L(\Theta_m \mid D_m, t_m) = -\frac{1}{2} \log\left|C_{\theta_m}(t_m, t_m) + \sigma_m^2 I\right| - \frac{1}{2} \left(D_m(t_m) - \mu_{\beta_m}(t_m)\right)^\top \left[C_{\theta_m}(t_m, t_m) + \sigma_m^2 I\right]^{-1} \left(D_m(t_m) - \mu_{\beta_m}(t_m)\right) - \frac{J_m}{2} \log(2\pi)
$$

and the marginal maximum likelihood estimates are $\Theta_m^{\mathrm{ML}} = \arg\max_{\Theta_m} \log L(\Theta_m \mid D_m, t_m)$. $\mathrm{ETI}_m \mid \Theta_m$ can then be calculated from the posterior samples and reported as, e.g., the mean or median along with $(1 - \alpha)100\%$ credible intervals.

To illustrate the applicability of our proposed methodology we apply it to data from all regular games from the 2019-2020 NBA basketball season. The data were obtained from Sports Reference LLC (2020) and are provided in Jensen and Ekstrøm (2020a).
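The numerical integration and the posterior summaries described above can be sketched as follows. This is a minimal illustration that assumes the local index has already been evaluated on a time grid for each posterior draw; the array layout (one row per draw) and the function names are assumptions made here, and the trapezoidal rule is written out explicitly.

```python
import numpy as np


def eti_from_local(grid, d_eti_draws):
    """Integrate the local index over the time grid with the trapezoidal rule.

    grid: 1-D array of increasing time points.
    d_eti_draws: array of shape (n_draws, len(grid)), one row per posterior draw.
    Returns one ETI value per posterior draw.
    """
    widths = np.diff(grid)
    heights = 0.5 * (d_eti_draws[:, 1:] + d_eti_draws[:, :-1])
    return np.sum(heights * widths, axis=1)


def summarize(draws, alpha=0.05):
    """Posterior median with an equal-tailed (1 - alpha) credible interval."""
    lower, upper = np.quantile(draws, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(np.median(draws)), float(lower), float(upper)
```

For example, a local index that is constant at 0.25 crossings per minute over a 48-minute match integrates to an ETI of 12 for every draw.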
Many points are scored during a basketball match, so it is easy to see the development of the score difference in a single match. The 2019-2020 NBA season was suspended mid-March due to COVID-19 but was resumed in July 2020. There were a total of 1059 regular season matches. The subsequent playoffs comprised 84 matches including the final, for a grand total of 1143 matches. For ease of comparison we only consider the first 48 regular minutes of each match. Any part of a match that goes into overtime is disregarded, and we hence let $I_m = [0, 48]$ minutes.

We wish to use the proposed trend analysis for three purposes: 1) to show how the TDI can be used for real-time and post-game evaluation of the trends in a match, 2) to evaluate the ETI for all 1143 matches in the 2019-2020 season to provide background and reference information about the matches, and 3) to summarize the ETI at the team level in order to identify groups of teams more or less likely to give an exciting game.

For each match we used the prior specification from Section 3. We ran four independent chains for 50,000 iterations each, with half of the iterations used for warm-up, and evaluated the posterior distributions of $\mathrm{TDI}_m(t \mid \Theta_m)$ and $d\mathrm{ETI}_m(t \mid \Theta_m)$ on an equidistant grid of 241 time points in $[0, 48]$. The posterior distribution of $\mathrm{ETI}_m \mid \Theta_m$ was calculated by numerical integration using the trapezoidal method.

For our first purpose of analyzing trends in individual matches we consider the final match of the 2019-2020 season. The raw data for the running score difference between LA Lakers and Miami Heat were shown in Figure 1. Figure 2 shows the results from the post-game analysis. Evaluating the game trends from the TDI in Figure 2 shows that LA Lakers had control of most of the match, since the posterior probability of a positive trend was high throughout most of the match.
Only towards the end of the third quarter did Miami Heat gain the upper hand and have a period where they fought back. In the 4th quarter, from around 38 minutes to 43 minutes, the mean TDI increases to over 80%, but the probability interval is very wide, reflecting that it is difficult to say whether the trend is increasing or whether it might just as well be random fluctuations in scores. The same holds for the first half of the 3rd quarter. Teams wishing to evaluate the match should primarily concentrate on periods where the latent trend and its probability interval are close to 50% or where the trend is disadvantageous for the team. The spikes observed in the local ETI (lower right plot of Figure 2) indicate the time points where the underlying trend is changing sign, with higher values of dETI representing steeper changes.

To evaluate the overall distribution of the ETIs we fitted our model to each of the 1143 matches during the season and estimated the ETI for each. The results are summarized in the left panel of Figure 3.

We wished to examine whether there was a calendar time effect on game excitement as the season progressed, in order to investigate whether games became more exciting as the teams fought to stay in the competition to enter the final playoffs or whether we could detect fatigue over the season. The right panel of Figure 3 shows the median posterior ETIs as a function of the calendar time at which the matches were played. Besides illustrating the gap from the COVID-19 hiatus, the figure shows that the excitement indices are relatively evenly distributed throughout the season.

When matches are ranked from lowest to highest median posterior ETI, we can extract the individual analyses for matches representing the full span of the ETI range. Figure 4 shows the analysis results of our proposed method for the matches with the minimum, 1st, 2nd, and 3rd quartile, and maximum median posterior ETI.
It is clear from the observed running score differences, posterior trends, and the TDIs that these five matches represent substantially different game experiences.

Figure 4: Excitement Trend Indices corresponding to the 0%, 25%, 50%, 75%, and 100% percentiles of the distribution of all games in the season. Gray regions depict 50% and 95% point-wise credible intervals and 95% posterior prediction intervals.

The first row of Figure 4, for Dallas Mavericks vs the Golden State Warriors, shows a very one-sided match leading to the minimum ETI during the season. The third row of Figure 4, for the LA Clippers vs Charlotte Hornets match, indicates that while the Hornets did lead throughout most of the first quarter, the Clippers reversed the game towards the end of the first quarter and kept the lead through most of the 2nd quarter. After that the Hornets started to keep the lead and the Clippers never managed to make a proper comeback. The game trend was rather flat in the last two quarters of the game, since the two teams more or less kept pace with each other except for the effort shown midway through quarter four. In contrast, the match between New Orleans Pelicans and Utah Jazz (5th row in Figure 4) showed trends that changed direction frequently and a TDI with alternating periods of scoring bursts, making it a very exciting and unpredictable game.

The small fluctuation of the team averages in Table 1 suggests that the teams are comparable in terms of excitement when averaging across all their games during the season, and that the major source of variation in excitement during the season (as seen in Figure 3) is governed by the specific matches. Although the team averages in Table 1 show limited variability, it is of interest to estimate a number of subgroups among the teams that exhibited a similar degree of excitement on average during the season, effectively clustering the teams. This would enable fans, promoters, and sponsors to infer which teams were more likely to partake in an exciting game.
The problem is mathematically equivalent to a linear regression model with the median posterior ETIs as the outcome and an explanatory categorical variable that ranges over the set of all partitions of the 30 teams. As the objective we seek the smallest number of subgroups that best explains the observed outcome, comparing all possible splits of the ranked teams for a given number of subgroups. This then defines subgroups of teams. As the optimization criterion for the subgroup identification we used the root mean squared error of prediction based on leave-one-out cross-validation, denoted $\mathrm{RMSEP}^{C=c}_{\text{LOO-CV}}$, where $c$ is the number of subgroups. Our sequential optimization procedure showed that the optimization criterion stabilized at four subgroups: $\mathrm{RMSEP}^{C=2}_{\text{LOO-CV}} = 4.493$, $\mathrm{RMSEP}^{C=3}_{\text{LOO-CV}} = 4.49$, $\mathrm{RMSEP}^{C=4}_{\text{LOO-CV}} = 4.489$, and subsequently for $C = 5, \ldots, 8$ it remained at the same value. The labels of these groups are shown in the rightmost column of Table 1. The result is thus an identification of three change-points in the ranking of the teams according to median posterior ETI averaged across the season. The most noticeable result is that the Charlotte Hornets constitute a singleton, since that team has a substantially lower ETI than the team with the second lowest ETI. To maximize the probability of seeing an exciting game it would thus have been wise to avoid matches in which the Hornets were playing.

Figure 5 shows the estimated linear association between the seasonal standard deviations and averages of the median posterior ETIs for each team. The teams follow a straight line through the origin fairly well, suggesting that the coefficient of variation is fairly constant across the teams. This also suggests that teams with large seasonal average excitement are also more likely to be part of really spectacular matches.
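The subgroup search over splits of the ranked teams can be sketched as below. This is an illustrative reading of the procedure, not the authors' implementation: it assumes that each team contributes several match-level ETI values, that a subgroup is a contiguous block of teams in the ranking by team average, and that the regression prediction for a held-out match is the mean of the remaining values in its subgroup. All names are hypothetical.

```python
from itertools import combinations

import numpy as np


def loo_rmsep(groups):
    """Leave-one-out RMSE of prediction for a group-mean model.

    groups: list of 1-D arrays, each holding the pooled values of one subgroup
    (every subgroup needs at least two values).
    """
    squared_errors = []
    for y in groups:
        # Prediction for each value is the mean of the other values in its group.
        loo_prediction = (y.sum() - y) / (len(y) - 1)
        squared_errors.append((y - loo_prediction) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(squared_errors))))


def best_contiguous_split(values_by_team, n_groups):
    """Search all splits of the ranked teams into n_groups contiguous blocks.

    Returns the best score and the block boundaries as indices into the ranking.
    """
    order = np.argsort([np.mean(v) for v in values_by_team])
    ranked = [np.asarray(values_by_team[i], dtype=float) for i in order]
    n_teams = len(ranked)
    best_score, best_bounds = np.inf, None
    for cuts in combinations(range(1, n_teams), n_groups - 1):
        bounds = (0, *cuts, n_teams)
        groups = [np.concatenate(ranked[a:b]) for a, b in zip(bounds[:-1], bounds[1:])]
        score = loo_rmsep(groups)
        if score < best_score:
            best_score, best_bounds = score, bounds
    return best_score, best_bounds
```

Running `best_contiguous_split` for an increasing number of subgroups and stopping when the score no longer improves mirrors the sequential procedure described in the text. With 30 teams the exhaustive search stays small for a handful of subgroups, for example 3654 candidate splits for four subgroups.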
Teams consistently contributing highly exciting matches during the season should therefore be represented in the bottom right part of the plot. Assessing which teams were generally most exciting to watch during the season is thus a bias-variance trade-off decision. While the New Orleans Pelicans were most exciting on average, the Washington Wizards were second-most exciting on average but also had a notably lower seasonal standard deviation. This means that although the latter team was less exciting on average during the season, their level of excitement was more stable throughout. On the other hand, the San Antonio Spurs had the third largest average excitement but simultaneously a much larger standard deviation, making the excitement of their individual matches much less reliable throughout the season.

We have introduced the Trend Direction Index as a measure to estimate and evaluate the trends in the running score difference in sports. The Trend Direction Index is based on a latent Gaussian process model and enables us to make Bayesian in-game and post-game evaluations of the underlying trends in scoring patterns and to attach easily interpretable probability statements to the results. In addition, we have presented the Excitement Trend Index as the expected number of monotonicity changes in the underlying trend, and we showed how it can be used to gauge how exciting a match will be: if one team is consistently outperforming the other, then the match quickly becomes one-sided and less exciting. Both indices have intuitive interpretations that are easily conveyed to non-statisticians, coaches, players, and commentators.

We have shown how the proposed method can be used to analyze single matches in order to identify periods throughout the game where the momentum of the game changes. The model utilizing the latent trend enables a highly detailed modeling approach where game development can be followed from minute to minute.
This will facilitate and improve post-game coaching and influence future game tactics. Our analysis of all matches in the 2019-2020 NBA season showed that different values of the ETI captured games with vastly different features and that it could be used as a tool to discriminate between the teams. In this case, the latent Gaussian process benefits greatly from the large number of scorings that is typical in basketball, since this provides many frequent observations within each match for the model to utilize. In contrast, sports with few scorings such as soccer may prove a more difficult task simply because there are very few changes in the running score throughout a match.

There are a couple of future research ideas that could extend our current approach of using the latent Gaussian trend to infer measures of game excitement. One idea is to define a weighted version of the Excitement Trend Index, $\mathrm{wETI}_m$, so that changes in monotonicity of the score differences are i) weighted higher towards the end of the game and ii) weighted lower if one team is already far ahead of the other team as measured by the absolute value of the posterior mean $\mu_{d_m}$. This motivates a modification of the definition of the ETI in Equation (5) to the following weighted form

$$
\mathrm{wETI}_m \mid \Theta_m = \int_{I_m} w\left(t, |\mu_{d_m}(t)|\right) d\mathrm{ETI}_m(t \mid \Theta_m)\, dt
$$

where $w : I_m \times \mathbb{R}_{\geq 0} \to \mathbb{R}_{\geq 0}$ is a weight function that is increasing in its first variable and decreasing in its second variable. Such weight functions could be constructed as a product of two kernel functions defined on their individual domains and with bandwidths based on studies of psychological perception.

Another approach for quantifying excitement would be to define it at the team level instead of at the match level. In that case one could define team-specific Excitement Trend Indices nested within a match by looking separately at the up- and down-crossings of $d_m'$ at zero.
This would result in two excitement indices for each match, $(\mathrm{ETI}_{am}, \mathrm{ETI}_{bm})$ for teams $a$ and $b$, which would reflect how exciting each team was in match $m$ with respect to changing the sign of the score differences in its favor.

In conclusion, we have provided an analytical framework for analyzing the trend in the running score difference in sports matches. The latent Gaussian process model requires very few assumptions, which makes the modeling approach very flexible and applicable to a multitude of sports.

Predictive Analysis and Modelling Football Results Using Machine Learning Approach for English Premier League
Stan: A Probabilistic Programming Language
Dynamic Bradley-Terry Modelling of Sports Tournaments
A Functional Data Approach to Model Score Difference Process in Professional Basketball Games
Rank Dynamics for Functional Data
Stationary and Related Stochastic Processes - Sample Function Properties and Their Applications
Evaluating One-Shot Tournament Predictions
Random Walk Picture of Basketball Scoring
A hybrid random forest to predict soccer matches in international tournaments
Predicting the Outcome of a Tennis Tournament: Based on Both Data and Judgments
Quantifying the Trendiness of Trends
Analysis of Sports Data by Using Bivariate Poisson Models
Gaussian Processes in Machine Learning
mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models
Basketball Reference