
BSC ARTIFICIAL INTELLIGENCE

BACHELOR THESIS

Pattern Recognition in Dynamic Sport Matches

by AREND VAN DORMALEN

10615199

June 24, 2016

Credits: 18 EC

University of Amsterdam
Faculty of Science
Science Park 904
1098XH Amsterdam

Supervisor: dhr. P.J.J.P VERSTEEG, MSc

University of Amsterdam
Faculty of Science
Science Park 904
1098XH Amsterdam

Abstract

This project aims to find a flexible method that can detect patterns in highly dynamic sport matches, such as soccer. For this purpose, data on player positions and ball events in a set of matches played by professional soccer club Villarreal C.F. has been made available. A set of patterns has been identified by the coach and the data has been labeled by a rule-based system using this information.

A supervised classification learning algorithm is used to detect and classify patterns from the data. Additionally, insight into the reasoning is necessary to explicate the rules that define a pattern. Decision trees are evaluated in this thesis, as they are classification algorithms that output a hierarchy of rules. The Hellinger distance decision tree (HDDT) is a type of decision tree that uses the Hellinger distance for feature selection. HDDT has two major advantages: it is skew-insensitive and it can handle non-binary input. Its performance was tested on three of the patterns and its results are compared with the results from two commonly used decision tree-related methods: C4.5 and random forest.

Results show that simple patterns can be accurately detected, while complex patterns are more often misclassified. HDDTs achieved better results than random forests and similar results to C4.5. These results seem promising, but higher performance on complex patterns is necessary for a real-world product.

Keywords

Machine Learning, Sports Analysis, Supervised Learning, Classification, Decision Tree, Hellinger Distance


Contents

1 Introduction
  1.1 Terminology
  1.2 Research Question and Goals

2 Background
  2.1 Previous Work on Soccer Data
  2.2 Complex Patterns in American Football
  2.3 Classification Algorithms

3 Data
  3.1 Data Acquisition
  3.2 Patterns

4 Method and Approach
  4.1 Data Merging and Cleaning
  4.2 Feature Selection
  4.3 Classification
  4.4 Optimizing

5 Results
  5.1 Performance Measure
  5.2 Results

6 Conclusion
  6.1 Future Research

Appendices
  A Matches in Training Set
  B Matches in Test Set
  C List of Features


1 Introduction

The professional sports industry is a booming business. Professional clubs can have multimillion-dollar revenues and the biggest teams are valued at over a billion dollars¹. A club’s income depends heavily on its performance. For instance, TV rights are often distributed based on the final position in a league, and good performance draws more fans to the stadium. These developments have increased the demand for digital analysis tools that help to improve the performance of a team.

The demand for digital analysis tools has created new markets. GPS tracking of players to measure their individual statistics is an example of such a market²,³. Additionally, companies that deliver a full set of analysis tools have emerged, serving most of the major professional sports teams⁴,⁵. However, these tools provide all their users with the same type of data. There remains a need for adaptive analysis tools.

Since a coach tries to increase the performance of his team by applying tactics, a perfect analysis tool would be able to adapt to the tactics and needs of a coach, and should be able to check if players are adhering to these tactics. Players must follow tactics especially in crucial moments in a match. Detecting and classifying these moments is therefore an important task. Currently, this task is performed manually, as the definition of what is deemed crucial is subjective and heavily dependent on tactics. This manual classification is done by rewatching a match and labeling the important moments. This method is inefficient and time-consuming. Automation of this task could save time and increase accuracy.

Automation can be achieved through the use of machine learning. Machine learning is a subfield of computer science that explores algorithms that learn from data and can make predictions based on it. It has successfully been applied to classify important moments in American football matches. However, this sport has a moderately dynamic character [Mitchell et al., 1994]. This thesis explores the feasibility of machine learning on highly dynamic sports, such as soccer.

¹ http://www.forbes.com/soccer-valuations/list/
² http://gpsports.com/
³ http://statsports.com/
⁴ http://www.instatfootball.com/
⁵ http://www.soccerlab.com/index.php


1.1 Terminology

In this thesis some terms related to soccer data will be used. These terms require a clear definition in order to reduce ambiguity and clarify the data and methods used in this project.

• Possession: A possession is defined as the time span in which one of the teams controls the ball. The time in a soccer match is thus split into three categories, namely possession for the home team, possession for the away team and no possession. This last category occurs whenever the ball is out of play, or when the match is paused for a free kick, injury treatment or another reason.

• Event: Events are basic actions that occur in a match. A distinction is made between three event categories. The first category contains events that start a possession, such as set piece or recovery. The second category contains events that end a possession, such as ball out or ball lost to opponent. The third category contains events during a possession. In this thesis the only event in this category is pass. A possession always contains one event from each of the first two categories, and can contain zero, one or multiple passes. Additionally, an event can contain a subevent. For example, a pass can contain cross as subevent, and for events labeled "set piece" the type of set piece is defined. A possession can thus contain zero, one or multiple subevents.

• Trajectories: Trajectories consist of player positions. Positions are defined as the normalized x- and y-coordinates in a frame during a possession. Trajectory data is generated for each player by connecting all positions in a possession. In this thesis, the positions for all field players of the home team are available. A possession thus contains ten trajectories.

• Pattern: Patterns are series of actions in a match that have a high level of tactical involvement. Examples of pattern types are counter attack, organized attack, and applying pressure after losing the ball. Each type of pattern is defined by a sports coach or analyst. There are substantial differences between the types of patterns. Offensive patterns occur during possession for the home team, while defensive patterns occur during possession for the away team. Moreover, the complexity of patterns can differ significantly. A possession can contain zero, one, or multiple patterns. However, a possession cannot contain the same type of pattern more than once. Predicting whether a possession will contain one or more patterns is the main goal of this thesis.


1.2 Research Question and Goals

As mentioned previously, this thesis explores the feasibility of machine learning on highly dynamic sports. In order to do so, various machine learning methods will be applied to a data set consisting of player positions and ball events from professional soccer matches. Based on the supplied data, the following research question was formed: "Is it possible to develop a flexible method that can detect and classify patterns based on the trajectories of the players and the ball events?"

In order to find a thorough answer to this question the following sub-questions need to be answered:

• Is it possible to develop an algorithm that can explicitly explain why a pattern has been assigned to a class? An algorithm that gives insight into the classification process is preferred over an algorithm that does not. With this insight, implicit knowledge of sports coaches about their tactics can be made explicit.

• What amount of data is required as training set for classifying patterns with a high performance? The minimal amount of required data is very important to a coach as tactics tend to develop and change over time. Moreover, a coach would prefer to use a tool instantly after acquiring it.

• What role does complexity play in relation to performance? In exploring the feasibility of pattern recognition in soccer matches, it is important to discover what the current maximum complexity is. By answering this question it will be clear if highly complex combinations of actions can be identified or if research should first focus on identifying simpler patterns.

A future aim would be to explicate a coach’s knowledge about what defines a pattern. However, this is beyond the scope of this project. This thesis focuses on recovering explicit knowledge that has been expressed by the coach in a rule-based system. This system will be explained thoroughly in section 3.2.

This thesis will be structured as follows. Section 2 will explore what research has been conducted in the domains of machine learning on soccer data and pattern recognition on American football data. Additionally, various supervised classification learning algorithms will be explained in this section. In section 3 the data used in this project will be thoroughly explained. Furthermore, the methods utilized to generate this data are discussed. In section 4 the pipeline of this project will be clarified and the design choices will be explained. Section 5 will review the results of the algorithms on the data. In section 6 conclusions will be drawn from these results, the research questions will be answered and suggestions for future research will be provided.


2 Background

The domain of machine learning in sports analytics is a relatively new domain and has therefore not yet been thoroughly explored. For this reason, there is a limited number of resources on the topic of applying machine learning on sports data. The main limitation is the restricted availability of annotated data on ball behavior and possession. Additionally, research has taken an interest in classifying broadcast video segments instead of classifying patterns in panoramic videos. This is caused by the wide availability of broadcasts, whereas panoramic videos are less common. However, useful previous research exists and can be expanded on in this project.

2.1 Previous Work on Soccer Data

The GPS tracking mentioned in section 1 is not yet accurate enough for correctly tracking players in a video. Various methods and algorithms can be applied in the field of object tracking [Yilmaz et al., 2006]. Among these methods, some could prove suitable for tracking soccer players or sports players in general. Tracking of player positions on the field can be used to acquire player trajectory data. This tracking can be accomplished by applying foreground detection and measurement normalization on a multiview camera system [Xu et al., 2004]. Another method of tracking players uses iterated closest points and supervised learning [Lu et al., 2013]. The generated trajectory data enables the acquisition of player-based features over time. Examples of these features are absolute positions on the field and distances between players. In addition to this trajectory data, event-like data in the form of webcast text can be used to detect and classify patterns [Chen and Chen, 2014]. This data can provide information on, for instance, ball behavior. These two types of data can be combined in a single framework. A framework provided by Xu et al. is able to annotate sports video semantically and can personalize the retrieval of video based on queries entered by the user [Xu et al., 2008]. However, the moments captured in this framework are simple events, such as goals and shots on target. Complex patterns are not present in this framework.

Research conducted by Beetz et al. is similar in scope to this project [Beetz et al., 2009]. In the Automated Sports Game Analysis Model (ASPOGAMO) project, efforts have been made to develop an automated method for annotating sports games. In this project classification of various types of passes is attempted through the use of a combination of ontologies and decision trees.


2.2 Complex Patterns in American Football

Classification of plays in American football is an additional source of information for the classification of patterns in soccer matches. These sports are somewhat similar as both are team sports and highly dependent on tactics. However, the static nature of American football simplifies detecting the start of a play. This remains a problem in dynamic sports, such as soccer. Despite these differences, attempts at detecting plays are examples of recognizing patterns in a multi-agent system.

Various attempts have been made to classify the several types of plays. These plays have a higher complexity than the single passes detected by Beetz et al. [Beetz et al., 2009]. Belief networks can provide a probabilistic framework to compare plays to a description [Intille and Bobick, 1999]. Additionally, Hidden Markov Models [Swears and Hoogs, 2012] and Multiple Kernel Learning [Siddiquie et al., 2009] have been proposed as high-performance solutions to this problem. These results show that machine learning is capable of detecting and classifying complex patterns in sports data. However, these solutions provide little insight into the reasoning of the classifier, which is one of the goals of this project.

2.3 Classification Algorithms

The problem of detecting patterns is a supervised classification learning problem. Thus, a classification algorithm could solve this problem. C4.5 is a very commonly used classification algorithm that has been named among the top 10 data mining algorithms in a survey published by Springer and has been praised for its flexibility [Wu et al., 2008]. C4.5 uses entropy to find the most significant splits for the classes [Quinlan, 1996].

Random forest is another common classification algorithm. It is an ensemble method that uses multiple decision trees generated from randomized subsets of the data. This algorithm is able to handle various types of features and is fairly robust against overfitting [Breiman, 2001]. An additional advantage of this algorithm is the high level of insight generated by the output, since the output provides rules for classification. As opposed to other conventional algorithms such as neural networks, the resulting tree from random forest could clarify the definition of a pattern. However, random forest is sensitive to skewed data and irrelevant features.

Hellinger distance decision trees (HDDTs) are less sensitive to skew [Cieslak et al., 2012]. In addition, they utilize the Hellinger distance as splitting criterion instead of entropy. This has the advantage of omitting parameter selection, allowing for irrelevant variables to exist.

The performance of these three algorithms in soccer pattern classification is researched in this thesis.


3 Data

The data used in this project has been provided by Metrica Sports and consists of 23 matches of Villarreal C.F., a Spanish professional football team. Metrica Sports is a sports data company that created and maintains FootMapp, a soccer analysis tool. This tool is used by Villarreal to improve the overall performance of the first team. For this reason, not every match is selected for analysis. Firstly, matches with non-standard tactics are omitted, as no useful comparison to other matches can be made. Secondly, matches with a lot of second team players are omitted, as no individual progress can be shown and second team players might be unfamiliar with the tactics or analysis software. Lastly, matches in stadiums that do not allow for accurate recording are omitted, as this would cause noise in the positional data. The typical tactics consist of a 4-4-2 formation and have been described as "defensively strong"⁶,⁷. The matches in this project thus typically follow these tactics.

The training set is composed of matches that were available at the start of this project. These matches were played in several competitions in the period from September 2015 until February 2016⁸. However, the data set is a live data set and more matches were added until the end of the soccer season in May 2016. Eight matches that were played after February 2016 were used as test set⁹.

3.1 Data Acquisition

The data for every match consists of three separate parts. These are trajectories, ball events and patterns. Each is generated in a different way.

• Trajectories are created by combining positional data. An example of a trajectory can be seen in figure 1a. This positional data is acquired by a semi-automatic computer vision process, in which mistakes by the algorithm are corrected manually. The manual supervision ensures near-perfect accuracy of this data. The data consists of the normalized coordinates of the ball and the ten field players of Villarreal. The positions of goalkeepers and opponents are not recorded. It is important to note that the solution in this thesis would behave correctly on any normalized coordinates. Therefore, the computer vision method used to acquire the data is not relevant.

⁶ http://spielverlagerung.com/2016/04/30/villarreal-liverpool/
⁷ http://anfieldindex.com/22457/villarreal-vs-liverpool-depth-tactical-preview.html
⁸ See Appendix A
⁹ See Appendix B


• Events data is recorded manually. This data contains the events described in section 1.1, including the coordinates at which each event occurred. Passes have two sets of coordinates, one for the start and one for the end position of the ball.

• Patterns are generated by applying static rules to the positional and events data. Labels are extracted from these files. The exact contents are described in the following subsection.

3.2 Patterns

Fourteen patterns are classified by FootMapp, with some patterns containing sub-patterns. These sub-patterns are small variations on the patterns. For example, the pattern "pressure after losing the ball" contains the sub-patterns "right" and "wrong". The first requires possession to be recovered within four seconds without the opponent having made more than four passes. A "wrong" sub-pattern thus occurs whenever the recovery of possession is too slow or the opponent has made too many passes. Including these sub-patterns, 30 classes have been defined in total.

There is substantial variation between these 30 classes. This variation shows in the complexity of a sub-pattern and its number of occurrences. Whereas some patterns contain just a few rules, others are more complex and are bounded by more variables. Accurately predicting and thoroughly researching all 30 classes is beyond the scope of this project and thus a subset of these classes has been chosen. These classes are:

• Counter Attack: Whenever Villarreal recovers the ball and progresses over 20 meters towards the goal of the opponent. This may not take over ten seconds and there may not be more than seven passes to achieve this.

• Defend Organized Attack: Whenever the opponent has possession and has made over four passes. An example of this pattern is visualized in figure 1b. Red dots represent Villarreal players, blue dots represent players of the opposing team and black lines represent passes made.

• Defensive Shift Central Defenders: Whenever the opponent has possession and has made over four passes in the center area, and the distance between the two central defenders is over 15 meters for over two seconds.

These patterns have been selected for a number of reasons. Firstly, these patterns have a varying complexity. Due to this variation, the relation between complexity and performance can be described. Secondly, these patterns have relatively comprehensive descriptions and are easily understood. Thirdly, these patterns have a relatively high occurrence, as opposed to some other patterns that occur once or twice per match. This enables testing of the minimal size of the data set. Lastly, these patterns are largely disjoint. Since a possession can contain multiple patterns, classes generally overlap to a certain degree. For example, the pattern "Counter Attack" only occurs in possessions that also contain the pattern "Keeping Ball after Recovery", and thus there is a complete overlap between the two classes. Preventing this type of overlap will reduce the amount of noise and increase overall performance.
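The three selected rule definitions can be read as simple predicates over a possession. The following is a minimal Python sketch of that idea, not FootMapp's actual rule system: the `PossessionSummary` fields and the function names are hypothetical, and each threshold is applied to a pre-computed summary value, which simplifies the coach's definitions.

```python
from dataclasses import dataclass


@dataclass
class PossessionSummary:
    # All field names are hypothetical; the thesis does not list FootMapp's inputs.
    home_possession: bool       # True if Villarreal has the ball
    started_by_recovery: bool   # possession began with a ball recovery
    progress_m: float           # metres gained towards the opponent's goal
    duration_s: float           # seconds needed for that progress
    passes: int                 # passes made by the team in possession
    center_passes: int          # passes made in the center area
    cd_gap_over_15m_s: float    # seconds the central defenders were over 15 m apart


def is_counter_attack(p: PossessionSummary) -> bool:
    """Recovery, over 20 m progress, at most ten seconds and seven passes."""
    return (p.home_possession and p.started_by_recovery
            and p.progress_m > 20 and p.duration_s <= 10 and p.passes <= 7)


def is_defend_organized_attack(p: PossessionSummary) -> bool:
    """Opponent has possession and has made over four passes."""
    return (not p.home_possession) and p.passes > 4


def is_defensive_shift(p: PossessionSummary) -> bool:
    """Opponent made over four center passes; defenders over 15 m apart for over 2 s."""
    return ((not p.home_possession) and p.center_passes > 4
            and p.cd_gap_over_15m_s > 2)
```

The number of conditions per predicate gives a rough impression of the difference in complexity between the three patterns.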

(a) Trajectory of a single player (b) Pattern: Defense Organized

Figure 1: Data Visualization


4 Method and Approach

This section will cover the conducted research and explain the reasoning for the design choices that were made.

4.1 Data Merging and Cleaning

Before any classification algorithms could be applied to the data, the data from the three files had to be merged and cleaned. Possessions were formed based on the timestamps of the starting and ending events in the event files. These timestamps were converted to the corresponding frame numbers in order to match positional data to the event data. Labels from the pattern data were subsequently added to this merged file.

After merging the files, the data was cleaned. Possessions missing either a starting or ending event were discarded. Additionally, possessions with a negative duration were removed along with other outliers. Finally, possessions with missing values for one or more features were removed. Cleaning and merging of data was done in Python 3.5.1¹⁰.
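The merging and cleaning steps could look roughly as follows. This is an illustrative sketch rather than the code used in the project: the `Possession` structure, the function names and the frame rate of 25 frames per second are assumptions, and the removal of "other outliers" is left out because the thesis does not define it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

FRAME_RATE = 25  # assumed frames per second; not stated in the thesis


@dataclass
class Possession:
    start_time: Optional[float]         # seconds; None if the starting event is missing
    end_time: Optional[float]           # seconds; None if the ending event is missing
    features: Dict[str, Optional[float]]  # feature name -> value (None if missing)


def to_frame(timestamp: float, frame_rate: int = FRAME_RATE) -> int:
    """Convert an event timestamp in seconds to the nearest frame number."""
    return int(round(timestamp * frame_rate))


def clean(possessions: List[Possession]) -> List[Possession]:
    """Keep only possessions that pass the three cleaning rules."""
    kept = []
    for p in possessions:
        if p.start_time is None or p.end_time is None:
            continue  # missing starting or ending event
        if p.end_time - p.start_time < 0:
            continue  # negative duration
        if any(value is None for value in p.features.values()):
            continue  # missing value on one or more features
        kept.append(p)
    return kept
```

Frame numbers obtained with `to_frame` can then be used to select the positional data that belongs to a possession.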

4.2 Feature Selection

Feature selection is a crucial aspect of classification methods. In this particular problem, it is important to prevent selection of features based on the definition of patterns, as in a real-world situation the exact definition of a pattern is unknown. Additionally, features must be selected that can have relevance on all patterns, including patterns that have not yet been defined. This will ensure the flexibility of the algorithm.

Another problem within feature selection is discretizing the trajectory data. Trajectory data for each player consists of an x- and y-coordinate for that player for each frame within a possession. Creating a feature for the position in each frame, or sampled over a number of frames, would generate a lot of noise in most cases and thus not be efficient. There are various ways to reduce a trajectory to a set of single values. In this project, the maximum Euclidean distance between players was derived from the trajectory data. Whenever this distance was negative or greater than possible, the possession containing this distance was discarded.
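Reducing two trajectories to a single value can be sketched as below. The function names are hypothetical, and the validity bound assumes that coordinates are normalized to the range [0, 1] on both axes, which the thesis does not state explicitly.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # normalized (x, y) position in one frame

# Diagonal of the unit square; assumes coordinates normalized to [0, 1].
MAX_POSSIBLE_DISTANCE = math.sqrt(2.0)


def max_distance(traj_a: List[Point], traj_b: List[Point]) -> float:
    """Maximum Euclidean distance between two players over the frames of a possession."""
    return max(
        math.hypot(ax - bx, ay - by)
        for (ax, ay), (bx, by) in zip(traj_a, traj_b)
    )


def is_valid_distance(distance: float) -> bool:
    """A distance is kept only if it is non-negative and physically possible."""
    return 0.0 <= distance <= MAX_POSSIBLE_DISTANCE
```

A trajectory of any length is thus summarized by one number per pair of players, which keeps the number of features independent of the duration of a possession.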

A set of 59 features has been extracted from the data in which an effort has been made to ensure flexibility. The full list of features can be found in Appendix C.

¹⁰ Python Software Foundation. Python Language Reference, version 3.5.1. Available at http://www.python.org


4.3 Classification

First experiments on the data revealed a significant amount of overlap between the classes. Moreover, some patterns seemed to be subclasses of other patterns and only occurred in a possession simultaneously. Due to this overlap, multiclass classification with a binary output will not have a high accuracy, as smaller classes would get assigned to the bigger overlapping classes. As a solution to this problem, a conversion to one-vs-rest, or binary, classification was made. By using binary classification, the algorithm assigns a possession to either "pattern X" or "not pattern X". In the training file, possessions with "pattern X" had their other patterns removed in order to prevent the same data point from occurring in both classes. This removal drastically reduced the amount of noise in the data.
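The conversion to a binary problem can be illustrated with a short sketch. The representation of a possession as a set of pattern names and the function names are assumptions made for this example only.

```python
from typing import List, Set, Tuple


def to_binary(possessions: List[Set[str]], target: str) -> List[Tuple[Set[str], int]]:
    """Relabel possessions as 'pattern X' (1) or 'not pattern X' (0)."""
    relabeled = []
    for patterns in possessions:
        if target in patterns:
            # Other patterns are dropped so the data point occurs in one class only.
            relabeled.append(({target}, 1))
        else:
            relabeled.append((set(patterns), 0))
    return relabeled


def positive_rate(relabeled: List[Tuple[Set[str], int]]) -> float:
    """Fraction of possessions in the positive class; low values indicate skew."""
    return sum(label for _, label in relabeled) / len(relabeled)
```

For example, with four possessions of which one contains both "Counter Attack" and "Keeping Ball after Recovery", relabeling for "Counter Attack" yields a single positive example and a positive rate of 0.25.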

Although binary classification solves the problem of overlap, a different problem is intensified. In a multiclass system, most possessions have the class "no pattern". By switching to a binary classification method this skewness in the data is increased, as even more possessions are assigned to the class "not pattern X". Skewed data greatly reduces the performance of most prevalent classification methods. A classification method that is skew-insensitive is thus preferred. As mentioned in section 2, Hellinger Distance Decision Trees (HDDTs) are skew-insensitive. The Hellinger distance is defined as follows:

\[
d_H(P, Q) = \sqrt{\int \left(\sqrt{P} - \sqrt{Q}\right)^2 \, d\lambda} \tag{1}
\]

In this formula, P and Q are the probability distributions of the classes, while λ is the considered feature. The importance of feature λ is thus derived from the probability distributions of the classes. Due to this, the feature selection of HDDTs is able to deal with irrelevant features to a certain extent. This algorithm was thus expected to be effective for this problem. Additionally, HDDT is a solution to the first sub-question in section 1.2, as it outputs a hierarchy of rules that explicate the reasoning of the algorithm.

These advantages led to the hypothesis that HDDT will perform better than more commonly used decision tree-related methods in the described domain. This hypothesis was tested by implementing this method in Weka 3.7.14¹¹ and comparing its results to these alternatives.
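For a discrete or discretized feature and two classes, the integral in equation (1) reduces to a sum over the feature values. The sketch below assumes the discrete form described for HDDT by Cieslak et al. [2012], in which P and Q are the distributions of feature values within the positive and the negative class. The function name is hypothetical, and this is not the Weka implementation used in the experiments.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def hellinger_split_value(values: Sequence[Hashable], labels: Sequence[int]) -> float:
    """Hellinger distance between the class-conditional distributions of a feature.

    labels are 1 for 'pattern X' and 0 for 'not pattern X'.
    """
    pos_total = sum(1 for y in labels if y == 1)
    neg_total = len(labels) - pos_total
    if pos_total == 0 or neg_total == 0:
        return 0.0  # only one class present, so no split can separate anything
    pos_counts = Counter(v for v, y in zip(values, labels) if y == 1)
    neg_counts = Counter(v for v, y in zip(values, labels) if y != 1)
    total = 0.0
    for value in set(values):
        p = pos_counts[value] / pos_total  # share of positives with this value
        q = neg_counts[value] / neg_total  # share of negatives with this value
        total += (math.sqrt(p) - math.sqrt(q)) ** 2
    return math.sqrt(total)
```

Because both counts are normalized by the size of their own class, adding more negative examples with the same value distribution leaves the result unchanged, which illustrates the skew-insensitivity. A feature that separates the classes perfectly scores √2, while an uninformative feature scores 0.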

4.4 Optimizing

In order to improve the results, optimization is necessary. This optimization helps to prevent overfitting. In the case of decision trees, reduced error pruning is a simple and effective method against overfitting. The pseudocode of this pruning algorithm can be found in Algorithm 1 [Fürnkranz and Widmer, 1994].

¹¹ Weka Data Mining Software. Available at http://www.cs.waikato.ac.nz/ml/weka/

Reduced error pruning replaces a node in the decision tree with its most popular class. This replacement starts at the leaves of the decision tree and works bottom-up. The change is kept if the replacement increases the accuracy of the tree. In this way, splits that are caused by noise and create overfitting are discarded.

Algorithm 1 Incremental Reduced Error Pruning

procedure PRUNING(Pos, Neg, SplitRatio)
    Clauses = ∅
    while Pos ≠ ∅ do
        SplitExamples(SplitRatio, Pos, PosGrow, PosPrune)
        SplitExamples(SplitRatio, Neg, NegGrow, NegPrune)
        Clause = ∅
        while NegGrow ≠ ∅ do
            Clause = Clause ∪ FindLiteral(Clause, PosGrow, NegGrow)
            PosGrow = Cover(Clause, PosGrow)
            NegGrow = Cover(Clause, NegGrow)
        Clause = PruneClause(Clause, PosPrune, NegPrune)
        if Accuracy(Clause) ≤ Accuracy(fail) then
            return(Clauses)
        else
            Pos = Pos − Cover(Clause, Pos)
            Neg = Neg − Cover(Clause, Neg)
            Clauses = Clauses ∪ Clause
    return(Clauses)
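The bottom-up replacement described in the text, as opposed to the incremental rule-learning variant in Algorithm 1, can be sketched for a binary decision tree as follows. The `Node` class and all function names are hypothetical, and a separate pruning set is assumed for measuring accuracy. Following the text, a replacement is kept only if accuracy strictly increases.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Example = Tuple[Dict[str, float], int]  # (feature values, class label)


class Node:
    """Internal node (feature, threshold, children) or leaf (label)."""

    def __init__(self, feature: Optional[str] = None, threshold: float = 0.0,
                 left: "Optional[Node]" = None, right: "Optional[Node]" = None,
                 label: Optional[int] = None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right
        self.label = label  # set for leaves only

    def is_leaf(self) -> bool:
        return self.label is not None


def predict(node: Node, x: Dict[str, float]) -> int:
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label


def accuracy(root: Node, data: List[Example]) -> float:
    return sum(predict(root, x) == y for x, y in data) / len(data)


def prune(root: Node, node: Node, train: List[Example], prune_set: List[Example]) -> None:
    """Bottom-up reduced error pruning; `train` holds the examples reaching `node`."""
    if node.is_leaf() or not train:
        return
    left = [(x, y) for x, y in train if x[node.feature] <= node.threshold]
    right = [(x, y) for x, y in train if x[node.feature] > node.threshold]
    prune(root, node.left, left, prune_set)    # children first: bottom-up
    prune(root, node.right, right, prune_set)
    before = accuracy(root, prune_set)
    node.label = Counter(y for _, y in train).most_common(1)[0][0]  # tentative leaf
    if accuracy(root, prune_set) > before:
        node.left = node.right = None  # keep the replacement
    else:
        node.label = None              # revert to the original split
```

Calling `prune(root, root, train, prune_set)` prunes the whole tree in place.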


5 Results

This section will explain the choice of the performance measure and present the results of applying the algorithms on the data.

5.1 Performance Measure

A commonly used performance measure for evaluating classification errors is accuracy. However, the skewness in the class distribution of the data calls for a different performance measure. If all possessions are assigned to the biggest class, a high accuracy will be achieved while the outcome is very unfavorable. For this reason another performance measure is needed.

Since a sports coach needs to have access to all crucial moments, a high recall is more beneficial than a high precision. However, assigning all possessions to a pattern would lead to a recall of 100%. In order to solve this problem, a weighted version of the F-score was used in which recall is considered twice as important as precision. This measure is called the F2-score. The formula for calculating an F-score is as follows:

\[
F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}} \tag{2}
\]

Which leads to the F2 score being:

\[
F_2 = 5 \cdot \frac{\text{precision} \cdot \text{recall}}{(4 \cdot \text{precision}) + \text{recall}} \tag{3}
\]

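Precision is defined as:

\[
\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}
\]

Recall is defined as:

\[
\text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}
\]

The variable β is a weighting factor: recall is considered β times as important as precision.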
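Equations (2) and (3) can be checked with a small worked example. The function name and the counts below are made up for illustration and are not results from this project.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta score from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical classifier: 8 patterns found, 4 false alarms, 2 patterns missed.
score_f2 = f_beta(tp=8, fp=4, fn=2)            # 40 / 52, roughly 0.769
score_f1 = f_beta(tp=8, fp=4, fn=2, beta=1.0)  # 16 / 22, roughly 0.727

# Assigning all 100 possessions to a pattern that occurs 10 times:
# recall is 100%, but the low precision keeps the F2-score at roughly 0.357.
score_all_positive = f_beta(tp=10, fp=90, fn=0)
```

In the first case recall (0.8) exceeds precision (about 0.67), so the F2-score is higher than the F1-score. The last case shows why the F2-score still penalizes the trivial strategy of labeling every possession as a pattern.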

5.2 Results

In this work HDDT was compared to two other classification algorithms: C4.5 and random forest. Tests were run with various training set sizes. These sizes were one match, two matches, five matches, ten matches and all 23 matches listed in appendix A. The smaller sets consisted of matches taken from the largest set. Matches that were part of one of the smaller sets were not part of other small sets. The test set consisted of all eight matches listed in appendix B. The results are visible in figure 2, where the green line corresponds with HDDT.

The results showed a sizable difference between the three patterns that were researched in this project. The pattern Defense Organized was detected with a very high F2-score. Both HDDT and C4.5 classified this pattern near-perfectly. The patterns Counter and Defensive Shift had a lower F2-score with all tested classification algorithms. This difference was likely caused by the difference in complexity between the patterns. Whereas Counter and Defensive Shift depend on four and five variables respectively, Defense Organized depends on only two variables.

(a) Counter

(b) Defense Organized

(c) Defensive Shift

Figure 2: Results on various patterns

Figure 2 shows that random forest had the worst performance on the data, especially whenever a small amount of data was present in the training set. This was expected as random forest is generally quite sensitive to skewed class distributions. Although performance improved when more training data was made available, its performance was significantly lower than the performance of C4.5 and HDDT.

C4.5 and HDDT returned very similar results on the test data. Whereas C4.5 seemed to perform slightly better overall, HDDT had higher performance on small training sets. This could be due to the fact that C4.5 relies on entropy, which makes it heavily affected by the difference in class distribution between the training and test set. Whenever the training set is small, its class distribution is likely to have a sizable deviation from the average class distribution. This deviation would cause errors in the classification as the algorithm attempts to achieve a similar class distribution in the test set.

One of the factors limiting performance is overfitting. Although pruning was applied, there is still some evidence of overfitting. For instance, the performance of C4.5 decreases when the size of the training set is increased from ten to 23 matches. This unexpected decrease could be explained by overfitting. Additionally, the high variance between matches can be a cause for the peaks in the results.


6 Conclusion

This project aimed to find a flexible method to classify patterns in highly dynamic sports. Research suggested high performance for Hellinger distance decision trees (HDDTs). HDDT’s performance in classifying patterns was tested and compared against the performance of C4.5 and random forest on a set of data from matches by Villarreal.

The results of this test suggest that HDDT indeed has the flexibility that is desired from a classification method in the case of highly dynamic sports. Its performance was significantly better than that of random forest and comparable to that of C4.5. However, these results show that this algorithm is not yet applicable in a real product or application. Whereas the performance on a simple pattern is near perfect, the performance on more complex patterns is lower and shows a significant number of misclassifications.

The subquestions posed in section 1.2 can be answered from the results of this research.

• Firstly, insight into the reasoning of the classifier was preferred. Both HDDT and C4.5 give insight into the reasoning behind the classification choices. For this reason, a decision tree or another algorithm providing insight would be suitable for a final product.

• Secondly, the relation between training set size and performance was tested. The results of this research show that increases in performance after two matches are quite small for the tested algorithms. However, this increase could change if a different algorithm were to be used in a final product.

• Lastly, the relation between complexity and performance was explored. Results showed that complexity has a strong effect on performance. Patterns with a low complexity were detected accurately after even one match, while performance on complex patterns remained lower even after using all matches as training set.

In conclusion, the results show that classifying patterns in high dynamic sports through the use of machine learning is indeed possible. However, additional research needs to be conducted in order to improve the performance on highly complex patterns.


6.1 Future Research

Although the results are encouraging, the performance needs to be increased before a fully functional product can be created. This section covers some suggestions for future research that could improve this performance.

The flexibility of the approach needs to be tested further. This can be done on several scales. Firstly, new patterns can be classified in the same matches. There were 30 patterns available. However, researching and classifying all of these patterns was beyond the scope of this project. A final product should be able to detect any type of pattern. Secondly, the approach can be tested on matches of different soccer teams using different tactics and formations. In this way, it will become visible whether there are sizable differences in performance between teams. Lastly, the approach can be tested on matches of a different sport with high dynamics, such as lacrosse or (field) hockey [Mitchell et al., 1994].

Furthermore, more research is needed into the features that are to be used. New features could be generated from the data and their effect on the overall performance could be measured. Zoning of the pitch could generate extra information, for example by including the number of players on a section of the pitch as a feature. Additionally, the events data could be more specific. For instance, pass could be split into the categories high pass, low pass and header. Naturally, which features can be generated depends on what data is stored.
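The zoning feature suggested above could be sketched as follows. This is a minimal illustration, assuming that player positions are given in metres with the origin in one corner of the pitch. The pitch dimensions, the three-by-three grid and the function name are hypothetical choices, not properties of the data used in this thesis.

```python
def zone_counts(player_positions, pitch_length=105.0, pitch_width=68.0,
                zones_x=3, zones_y=3):
    """Count the players in each cell of a zones_x by zones_y grid over the pitch.

    player_positions is a list of (x, y) tuples. Positions on or beyond the
    pitch boundary are clamped to the nearest zone.
    """
    counts = [[0] * zones_y for _ in range(zones_x)]
    for x, y in player_positions:
        col = int(x / pitch_length * zones_x)
        row = int(y / pitch_width * zones_y)
        col = max(0, min(col, zones_x - 1))
        row = max(0, min(row, zones_y - 1))
        counts[col][row] += 1
    return counts


# Hypothetical positions of four players at one moment in a possession.
example_positions = [(10.0, 10.0), (50.0, 34.0), (104.0, 67.0), (105.0, 68.0)]
example_counts = zone_counts(example_positions)

# Flattening the grid yields one numeric feature per zone.
example_features = [count for column in example_counts for count in column]
```

Each zone count can then be added to the feature vector of a possession, for instance by taking the counts at the start of the possession or their maximum over its duration.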

Lastly, classification can be conducted with other classification algorithms. Binary classification was used in this project due to the overlap between classes. HDDT was chosen as a solution for this overlap. However, other solutions for the problem of overlap exist. In binary classification, class skew could be reduced by applying Tomek links to delete instances of the majority class that are borderline or noisy, that is, majority instances whose nearest neighbour is a minority instance and vice versa [Kubat et al., 1997]. Moreover, binary classification could be abandoned altogether. Multi-label classification could be used to assign multiple patterns to a single possession. Likewise, confidence ratings combined with a threshold might solve this problem. Furthermore, an algorithm that provides less insight into its reasoning might achieve a higher performance. Neural networks could prove to be a viable alternative to the decision trees tested in this research.
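As an illustration of the Tomek link idea mentioned above, the following sketch finds pairs of instances from opposite classes that are each other's nearest neighbour and removes the majority-class member of each pair. It is a brute-force sketch on hypothetical two-dimensional data, not the full one-sided selection procedure of Kubat et al. (1997), and all names are chosen for illustration.

```python
import math


def nearest_neighbour(index, points):
    """Return the index of the point closest to points[index] (Euclidean distance)."""
    best, best_dist = None, math.inf
    for j, other in enumerate(points):
        if j == index:
            continue
        dist = math.dist(points[index], other)
        if dist < best_dist:
            best, best_dist = j, dist
    return best


def tomek_links(points, labels):
    """Return all pairs (i, j), i < j, of mutual nearest neighbours with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest_neighbour(i, points)
        if (j is not None and j > i and labels[i] != labels[j]
                and nearest_neighbour(j, points) == i):
            links.append((i, j))
    return links


def remove_majority_in_links(points, labels, majority_label):
    """Drop every majority-class instance that takes part in a Tomek link."""
    to_drop = {k for pair in tomek_links(points, labels)
               for k in pair if labels[k] == majority_label}
    kept = [k for k in range(len(points)) if k not in to_drop]
    return [points[k] for k in kept], [labels[k] for k in kept]


# Hypothetical data: label 0 is the majority class, label 1 the minority class.
example_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0), (9.0, 9.0)]
example_labels = [0, 1, 0, 0, 0]
cleaned_points, cleaned_labels = remove_majority_in_links(
    example_points, example_labels, majority_label=0)
```

In this example only the first two instances form a Tomek link, so the majority instance at (0.0, 0.0) is removed and the minority instance next to it is kept.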

By researching these topics, a final product could be created which would be able to automatically detect and classify patterns in high dynamic sports.


References

[Beetz et al., 2009] Beetz, M., von Hoyningen-Huene, N., Kirchlechner, B., Gedikli, S., Siles, F., Durus, M., and Lames, M. (2009). Aspogamo: Automated sports game analysis models. International Journal of Computer Science in Sport, 8(1):1–21.

[Breiman, 2001] Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

[Chen and Chen, 2014] Chen, C.-M. and Chen, L.-H. (2014). A novel approach for semantic event extraction from sports webcast text. Multimedia Tools and Applications, 71(3):1937–1952.

[Cieslak et al., 2012] Cieslak, D. A., Hoens, T. R., Chawla, N. V., and Kegelmeyer, W. P. (2012). Hellinger distance decision trees are robust and skew-insensitive. Data Mining and Knowledge Discovery, 24(1):136–158.

[Furnkranz and Widmer, 1994] Furnkranz, J. and Widmer, G. (1994). Incremental reduced error pruning. In Proceedings of the 11th International Conference on Machine Learning (ML-94), pages 70–77.

[Intille and Bobick, 1999] Intille, S. S. and Bobick, A. F. (1999). A framework for recognizing multi-agent action from visual evidence. AAAI/IAAI, 99:518–525.

[Kubat et al., 1997] Kubat, M., Matwin, S., et al. (1997). Addressing the curse of imbalanced training sets: one-sided selection. In ICML, volume 97, pages 179–186. Nashville, USA.

[Lu et al., 2013] Lu, W.-L., Ting, J.-A., Little, J. J., and Murphy, K. P. (2013). Learning to track and identify players from broadcast sports videos. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(7):1704–1716.

[Mitchell et al., 1994] Mitchell, J. H., Haskell, W. L., and Raven, P. B. (1994). Classification of sports. Journal of the American College of Cardiology, 24(4):864–866.

[Quinlan, 1996] Quinlan, J. R. (1996). Bagging, boosting, and C4.5. In AAAI/IAAI, Vol. 1, pages 725–730.

[Siddiquie et al., 2009] Siddiquie, B., Yacoob, Y., and Davis, L. S. (2009). Recognizing plays in American football videos. University of Maryland, Tech. Rep, 111.


[Swears and Hoogs, 2012] Swears, E. and Hoogs, A. (2012). Learning and recognizing complex multi-agent activities with applications to American football plays. In Applications of Computer Vision (WACV), 2012 IEEE Workshop on, pages 409–416. IEEE.

[Wu et al., 2008] Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G. J., Ng, A., Liu, B., Yu, P. S., et al. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1–37.

[Xu et al., 2008] Xu, C., Wang, J., Lu, H., and Zhang, Y. (2008). A novel framework for semantic annotation and personalized retrieval of sports video. Multimedia, IEEE Transactions on, 10(3):421–436.

[Xu et al., 2004] Xu, M., Orwell, J., and Jones, G. (2004). Tracking football players with multiple cameras. In Image Processing, 2004. ICIP'04. 2004 International Conference on, volume 5, pages 2909–2912. IEEE.

[Yilmaz et al., 2006] Yilmaz, A., Javed, O., and Shah, M. (2006). Object tracking: A survey. ACM Computing Surveys (CSUR), 38(4):13.


Appendix

A Matches in Training Set

Table 1: Matches in Training Set

Date               Opponent             Home/Away  Competition
17 September 2015  Rapid Wien           Away       Europa League
20 September 2015  Athletic Bilbao      Home       Liga BBVA
26 September 2015  Atletico Madrid      Home       Liga BBVA
4 October 2015     Levante              Away       Liga BBVA
18 October 2015    Celta de Vigo        Home       Liga BBVA
25 October 2015    Las Palmas           Away       Liga BBVA
31 October 2015    Sevilla              Home       Liga BBVA
8 November 2015    Barcelona            Away       Liga BBVA
22 November 2015   Eibar                Home       Liga BBVA
29 November 2015   Getafe               Away       Liga BBVA
6 December 2015    Rayo Vallecano       Home       Liga BBVA
13 December 2015   Real Madrid          Home       Liga BBVA
20 December 2015   Real Sociedad        Away       Liga BBVA
3 January 2016     Deportivo La Coruna  Away       Liga BBVA
10 January 2016    Sporting Gijon       Home       Liga BBVA
16 January 2016    Real Betis           Home       Liga BBVA
23 January 2016    Espanyol             Away       Liga BBVA
30 January 2016    Granada              Home       Liga BBVA
6 February 2016    Athletic Bilbao      Away       Liga BBVA
13 February 2016   Malaga               Home       Liga BBVA
18 February 2016   Napoli               Home       Europa League
21 February 2016   Atletico Madrid      Away       Liga BBVA
28 February 2016   Levante              Home       Liga BBVA


B Matches in Test Set

Table 2: Matches in Test Set

Date           Opponent          Home/Away  Competition
5 March 2016   Las Palmas        Home       Liga BBVA
13 March 2016  Sevilla           Away       Liga BBVA
17 March 2016  Bayer Leverkusen  Away       Europa League
7 April 2016   Sparta Prague     Home       Europa League
10 April 2016  Getafe            Home       Liga BBVA
14 April 2016  Sparta Prague     Away       Europa League
24 April 2016  Real Sociedad     Home       Liga BBVA
28 April 2016  Liverpool         Home       Europa League

C List of Features

• Event starting possession

• Number of passes

• Event ending possession

• Duration of possession

• Team in possession

• Five Booleans: Does the possession contain a corner kick, free kick, throw-in, cross, or goal kick?

• Ball difference x: Difference between the x-coordinate at the start of a possession and the highest x-coordinate in that possession

• Ball speed x: Ball difference x divided by the duration of the possession

• Ball difference y: Difference between the y-coordinate at the start of a possession and the highest y-coordinate in that possession

• Ball speed y: Ball difference y divided by the duration of the possession

• Forty-five features containing the maximum Euclidean distances between the ten field players, one for each of the 45 pairs of players.
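The positional features in this list could be computed along the following lines. This is a sketch under stated assumptions, not the code used in this thesis: the ball difference is taken as the highest coordinate minus the starting coordinate, the maximum distance of each player pair is taken over all frames of a possession, and the function names and input layout are hypothetical.

```python
import math
from itertools import combinations


def ball_difference(coords):
    """Highest coordinate reached in a possession minus the coordinate at its start."""
    return max(coords) - coords[0]


def ball_speed(coords, duration):
    """Ball difference divided by the duration of the possession (0.0 if duration is 0)."""
    return ball_difference(coords) / duration if duration > 0 else 0.0


def max_pairwise_distances(frames):
    """Maximum Euclidean distance over all frames for each pair of players.

    frames is a list of frames; each frame is a list of (x, y) positions of the
    field players in a fixed order. Ten players yield 45 values.
    """
    n_players = len(frames[0])
    return [
        max(math.dist(frame[i], frame[j]) for frame in frames)
        for i, j in combinations(range(n_players), 2)
    ]


# Hypothetical possession: three ball x-coordinates and two frames of ten players.
ball_x = [10.0, 30.0, 25.0]
example_frames = [
    [(float(i), 0.0) for i in range(10)],
    [(2.0 * i, 0.0) for i in range(10)],
]
pair_features = max_pairwise_distances(example_frames)
```

With ten field players, `combinations(range(10), 2)` produces the 45 player pairs, which matches the number of distance features listed above.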
