Data Abstraction Using R
Data Modelling and Data Modification Using R
Abstract:
Introduction:
In any sport, assembling the best possible team is the foundation of success. Team formation
and selection is therefore a vital problem in building a good team. For this research paper I
have selected the game of cricket as an illustration, and look for a solution to the problem of
building a team from a group of players by using their recorded statistics. Cricket is a game
played between two teams of eleven members each, where one team bats while the other bowls, and
the roles are then reversed. The batting team tries to score as many runs as possible, and the
bowling team tries to restrict the scoring by dismissing the batsmen one after another. The win
percentage of a team depends on its bowling and batting strengths and on many other factors
such as fielding and captaincy.
Team selection is usually based on past experience and some basic techniques. The simplest
technique is to select the 3 to 4 best-performing batsmen and bowlers and to fill the remaining
positions with players who fit the remaining budget. This method does not give the best
solution, because matches are won by the effort of the whole team and not by a few of the best
batsmen and bowlers.
The aim of this research paper is to suggest a team formation that gives the best results
with respect to multiple criteria. Consider some of the criteria in developing a team. The
batting and bowling performances of a team are vital, and these two criteria conflict with each
other: a team can be built for the best batting or for the best bowling performance, but not
for both at once. Some of the minor criteria are fielding, captaincy, wicket-keeping and the
consistent form of the players. Forming high-performing teams with the best trade-off between
the important criteria is a challenging task.
Existing Methods of Cricket Team Selection:
Cricket is a game in which a player's statistics cover many criteria, such as number of matches
played, number of runs made, batting average, batting strike rate, bowling average, number of
wickets taken, number of overs bowled, etc. The main objective of the owner or franchise of a
team is to select a team of eleven players with the best bowling, batting and fielding
performances, and this has to be done within the budget. The rules also require the franchise
to include at least one player as captain and one player as wicket-keeper.
The batting average and bowling average of the players from the last four T20 cricket matches
are taken as measures of their performance in batting and bowling. Each player is assigned a
tag indicating the player's unique identity. Using the available data, the team selection
problem is formulated as a multi-objective optimization problem as follows:
Notice that the wicket-keeper (w) does not affect the bowling performance of a team. The team
is restricted to the following constraints:
T denotes the team, which consists of c (the tag of the captain of the team, chosen from a
captain list (CL) of 10 names, presented in Table 6), w (the tag of the wicket-keeper of
the team, chosen from a wicket-keeper list (WL) of 15 names, presented in Table 6), and p1, · · · ,
p9 (tags of the nine other players of the team, chosen from 129 total names arranged in a ranked
list (RL), presented in Tables 4 and 5, excluding the chosen captain and wicket-keeper) [1].
Since a team is a set of eleven distinct players, no two players in it can be the same.
Because the tournament considered is the Indian Premier League (IPL), players are also
tagged as foreign or Indian. The IPL rules say that a team must not have more than
4 foreign players and that the total cost of the team must be within
the specified budget. The procedures for computing the batting performance,
bowling performance and fielding performance are given below:
Batting Performance:
A player's batting average is the total number of runs he has scored divided by the number of
times he has been out [1]. To calculate the team's net batting performance, only the players
identified as designated batsmen, those who have scored at least 300 runs in T20 international
matches, are considered, and only their batting averages contribute to the team's batting
performance. This condition is used so that the team's batting performance can be maximized
without the poor batting averages of the other players counting against it.
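The batting measure described above can be sketched in a few lines of Python. This is only an illustration: the record fields are hypothetical names, and combining the qualifying averages by summation is an assumption, since the text does not say how the individual averages are aggregated.

```python
from typing import List, NamedTuple

# Qualification threshold stated in the text (runs in T20 internationals).
MIN_RUNS_FOR_BATSMAN = 300


class BattingRecord(NamedTuple):
    name: str        # hypothetical field names, chosen for this sketch
    runs: int        # total runs scored
    dismissals: int  # number of times the player has been out


def batting_average(rec: BattingRecord) -> float:
    """Total runs divided by the number of times out."""
    if rec.dismissals == 0:
        raise ValueError(f"{rec.name} has never been out; average undefined")
    return rec.runs / rec.dismissals


def team_batting_performance(team: List[BattingRecord]) -> float:
    """Sum of averages over designated batsmen only (summation is assumed)."""
    return sum(
        batting_average(p) for p in team if p.runs >= MIN_RUNS_FOR_BATSMAN
    )
```

A player with fewer than 300 runs is simply ignored by `team_batting_performance`, which mirrors the qualification rule in the text.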
Bowling Performance:
A bowler's bowling average is defined as the total number of runs conceded by the bowler
divided by the number of wickets taken by the bowler [1]. The lower a bowler's bowling average,
the better his performance. Only qualified bowlers, those who have taken at least 20 wickets in
T20 international matches, are taken into account in the team's net bowling average. The
bowling average of the team is a measure of the bowling performance of its bowlers: the lower
the net bowling average, the higher the bowling performance.
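A matching sketch for the bowling measure follows. As before, the field names are hypothetical and aggregating by summation is an assumption; the 20-wicket threshold comes from the text.

```python
from typing import List, NamedTuple

# Qualification threshold stated in the text (wickets in T20 internationals).
MIN_WICKETS_FOR_BOWLER = 20


class BowlingRecord(NamedTuple):
    name: str           # hypothetical field names, chosen for this sketch
    runs_conceded: int  # total runs conceded
    wickets: int        # total wickets taken


def bowling_average(rec: BowlingRecord) -> float:
    """Runs conceded divided by wickets taken; lower is better."""
    return rec.runs_conceded / rec.wickets


def team_bowling_performance(team: List[BowlingRecord]) -> float:
    """Sum of averages over qualified bowlers only (summation is assumed)."""
    qualified = [p for p in team if p.wickets >= MIN_WICKETS_FOR_BOWLER]
    return sum(bowling_average(p) for p in qualified)
```

Under this plain sum, a team with no qualified bowlers would score 0, the best possible value. A practical formulation would need to penalise that case, which the text does not specify.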
Fielding Performance:
A team's net fielding performance is the summation of the fielding performances of all its
individual players. The number of stumpings made by a wicket-keeper is taken as his
wicket-keeping performance measure.
Multi-Objective Formulation and NSGA-II:
To construct a team in NSGA-II, three constraints must be satisfied:
g1(T): c ∈ Captain list (CL),
g2(T): w ∈ Wicket-keeper list (WL),
g3(T): no two players in a team are identical.
To enforce these constraints, each player is given a tag. To assign the tags, the players are
sorted according to their cost and each is assigned a unique integer indicating the
player's rank within the range 1 to 129. The sorted and ranked list is called the Ranked List
(RL). From the sorted list the captains are identified, marked as C and arranged in ascending
order of cost; this list of captains is called the Captain List (CL). Using the same procedure
the wicket-keepers are identified and arranged in ascending order of price, and this list is
called the Wicket-Keeper List (WL).
A player may appear in both lists, so an entry in CL may refer to the same player as an entry
in WL, and all entries in CL and WL correspond to players present in RL.
The procedure for constructing a population member in the initial generation is discussed
below:
Every population member is represented with two vectors: (i) a code-vector and (ii) a variable
vector.
Procedure for creating a code vector:
The first element of the code vector is chosen at random from CL and denotes the captain c, the
second element is chosen from WL and denotes the wicket-keeper w, and the remaining nine players
are selected from RL so that there is no repetition among the nine elements. Thus they are nine
unique integers in the range 1 to 129. If c or w is identical to one of the nine team members,
another team member is chosen at random, and the process continues until all nine members are
different from c and w. The tags of the nine members are arranged in ascending order.
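The construction of a code vector can be sketched as follows. This is a minimal illustration, not the implementation from [1]: it assumes that CL and WL are given as sequences of RL tags, and that the first two elements of the code vector store positions in those lists (0-based here, because Python indexes from zero).

```python
import random
from typing import List, Sequence

RL_SIZE = 129  # tags in the ranked list run from 1 to 129


def make_code_vector(
    cl: Sequence[int], wl: Sequence[int], rng: random.Random
) -> List[int]:
    """Build [captain position, keeper position, nine sorted RL tags]."""
    c_idx = rng.randrange(len(cl))  # position of the captain in CL
    w_idx = rng.randrange(len(wl))  # position of the wicket-keeper in WL
    taken = {cl[c_idx], wl[w_idx]}  # tags the nine others must avoid

    others = set()
    while len(others) < 9:
        tag = rng.randint(1, RL_SIZE)
        if tag not in taken:  # redraw if it clashes with c or w
            others.add(tag)   # the set also rules out repeats

    return [c_idx, w_idx] + sorted(others)
```

Passing an explicit `random.Random` instance keeps the sketch reproducible, for example `make_code_vector([3, 17, 40], [3, 8, 55], random.Random(0))`.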
Procedure for creating the variable vector:
The variable vector is created from the code vector by copying the tag number of every
player one by one, starting with the tag of the captain, then the wicket-keeper, then the
nine other members. The difference between the code vector and the variable vector lies in
the first two elements. In the code vector they are the ranks of the captain and
wicket-keeper in CL and WL respectively. In the variable vector they are the tags of the
captain and wicket-keeper.
The solution is given below:
There is a chance that the captain and the wicket-keeper are the same person. In that case the
first two elements of the code vector refer to the same player, so although the code vector has
11 elements, the variable vector would contain the same tag twice and the team would have only
10 players, which is not acceptable. To solve this, the second element of the variable vector
is replaced by a random tag from RL that is not identical to any of the 10 players already
present in the variable vector. The revised variable vector is then used for computing the
objectives.
As a practical example, suppose the chosen captain is Dhoni of the Indian cricket team, who is
himself a wicket-keeper, but the chosen wicket-keeper is a different person (say Brendon
McCullum or Parthiv Patel). We still treat the situation as if the wicket-keeper were identical
to the captain (that is, the captain will also perform the job of wicket-keeping) and replace
the chosen wicket-keeper with a random player from RL (non-identical to any of the already
chosen players).
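The repair step can be sketched as below. It assumes the code-vector layout of two list positions followed by nine RL tags, with CL and WL given as sequences of RL tags. The function name is hypothetical. Testing `captain in wl` covers both cases in the text: the captain being the chosen wicket-keeper, and the captain being a wicket-keeper while someone else was chosen.

```python
import random
from typing import List, Sequence

RL_SIZE = 129  # tags in the ranked list run from 1 to 129


def to_variable_vector(
    code: Sequence[int],
    cl: Sequence[int],
    wl: Sequence[int],
    rng: random.Random,
) -> List[int]:
    """Translate a code vector into 11 distinct player tags."""
    captain = cl[code[0]]    # list position -> RL tag
    keeper = wl[code[1]]
    others = list(code[2:])  # already RL tags

    if captain in wl:
        # The captain keeps wicket, so the keeper slot is given to a
        # random RL player who is not already in the team.
        used = set(others) | {captain}
        candidates = [t for t in range(1, RL_SIZE + 1) if t not in used]
        keeper = rng.choice(candidates)

    return [captain, keeper] + others
```

After the repair the second element no longer has to be a wicket-keeper, which is consistent with the text: the captain takes over that job.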
With the above procedure all three constraints are satisfied by every population
member. The variable vector is then used to evaluate the remaining constraints: the number of
foreign players in the team is counted and the total budget for hiring the team is calculated.
The budget analysis then tells us which teams are feasible and which are infeasible.
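A feasibility check along these lines might look as follows. The limit of four foreign players comes from the text. The budget is left as a parameter because its value is not stated here, and the `Player` fields are hypothetical names chosen for this sketch.

```python
from typing import Dict, List, NamedTuple

MAX_FOREIGN_PLAYERS = 4  # IPL rule quoted in the text


class Player(NamedTuple):
    tag: int          # rank in RL
    cost: float       # hiring cost, in the same unit as the budget
    is_foreign: bool  # True for a foreign player


def is_feasible(
    team_tags: List[int], players: Dict[int, Player], budget: float
) -> bool:
    """True if the team respects the foreign-player limit and the budget."""
    team = [players[t] for t in team_tags]
    foreign_count = sum(1 for p in team if p.is_foreign)
    total_cost = sum(p.cost for p in team)
    return foreign_count <= MAX_FOREIGN_PLAYERS and total_cost <= budget
```

The function takes the variable vector (a list of 11 tags) as `team_tags`, so it can be applied directly to each population member.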
Multi-Objective Optimization Results:
The following graph shows the trade-off front obtained by the modified NSGA-II. Each point on
the trade-off front represents a team of 11 players. A few teams corresponding to the trade-off
points marked on the figure are shown in Table 1. Each team has a captain, a wicket-keeper and
at most four foreign players, and satisfies the budget constraint. For comparison, consider the
real team of 11 cricketers who played for Chennai Super Kings (CSK) in the IPL tournament that
took place in India during April and May 2011 and won the trophy. The batting and bowling
performances of this team are calculated by the same method as described above. The total cost
of the CSK team in this tournament is 7.5 million $, which makes it infeasible under the budget
constraint. Even so, the CSK team has bowling and batting performances that are worse than
those of a number of teams found by our NSGA-II. The graph shows that the CSK team is
non-optimal and costlier too.
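The notion of a team being "worse than" another on the trade-off front can be made precise with a dominance test. The sketch below assumes each team is summarised by a pair (batting performance, bowling performance), where batting is to be maximized and bowling average is to be minimized. It is a brute-force filter for illustration, not the sorting procedure used inside NSGA-II, and the numbers in the usage example are made up.

```python
from typing import List, Tuple

# (batting performance, bowling performance); higher batting and lower
# bowling average are better.
Score = Tuple[float, float]


def dominates(a: Score, b: Score) -> bool:
    """True if a is no worse than b in both objectives and better in one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def trade_off_front(teams: List[Score]) -> List[Score]:
    """Keep only the teams that no other team dominates."""
    return [t for t in teams if not any(dominates(o, t) for o in teams)]


# Hypothetical scores: the third team is dominated by the first.
front = trade_off_front([(300.0, 200.0), (280.0, 180.0), (270.0, 210.0)])
```

In these terms, the observation about CSK says that at least one team on the front dominates it.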
Disadvantages of the Existing System:
This approach may not give an optimal or a near-optimal solution. Teams are not balanced.
Proposed Method for selection of Cricket Team:
The proposed method should generate the maximum number of teams with best performance and
all the teams should be balanced.
Main Modules for Proposed Method:
The main modules of the proposed system are:
Clustering
Sorting
Binding
Clustering:
Clustering is a type of analysis that finds groups (clusters) of data objects that are similar
to each other. The data objects in a cluster are more like each other than they are like the
members of other clusters. Clustering is a data mining technique used to group data elements
into related groups that have similar properties, and it is considered an unsupervised learning
problem. Clustering can be defined as “The process of organizing objects into groups whose
members are similar in some way”. In general a cluster is a collection of objects that are
similar to one another and dissimilar to the objects belonging to other clusters.
Some useful background on what clustering is and how it is used is given below:
Data Clustering: 50 Years Beyond K-Means [2]:
Nowadays the volume of data is increasing rapidly in many forms, such as structured,
unstructured and raw data. Understanding and summarizing the data is called data
analysis. Data analysis is of two types, exploratory (or descriptive) and confirmatory:
“exploratory or descriptive, meaning that the investigator does not have pre-specified models or
hypotheses but wants to understand the general characteristics or structure of the high
dimensional data” [2], and “confirmatory or inferential, meaning that the investigator wants
to confirm the validity of a hypothesis or model or a set of assumptions given the
available data” [2].
In pattern recognition, data analysis is concerned with predictive modeling: given some
training data, we want to predict the behavior of unseen data. This task is called learning.
Learning problems are divided into (i) supervised learning, that is classification, and
(ii) unsupervised learning, that is clustering.
Webster [Merriam-Webster Online Dictionary, 2008] defines cluster analysis as “a statistical
classification technique for discovering whether the individuals of a population fall into different
groups by making quantitative comparisons of multiple characteristics”. A clustering example is
given below with figures:
The seven clusters in (a) (denoted by seven different colors in 1(b)) differ in shape, size, and
density.
Why Clustering:
Clustering is used to analyze data with many variables by grouping similar items together.
Clustering is used for three main purposes:
Underlying structure: to gain insight into data, generate hypotheses, detect anomalies, and
identify salient features.
Natural classification: to identify the degree of similarity among forms or organisms
(phylogenetic relationship).
Compression: as a method for organizing the data and summarizing it through cluster prototypes.
There are many clustering algorithms. According to the research, some of them are:
K-Means
Bayesian
K-Means Clustering Algorithm:
Experimental study of Data clustering using k-Means and modified algorithms [3]:
K-Means is a prominent clustering algorithm. The main idea is to partition the data objects
into K clusters, where K is the number of clusters and must be specified in advance, using an
iterative relocation technique that converges to a local minimum.
K-Means clustering consists of two separate phases: “First phase is to determine k centers at
random one for each cluster. Next phase is to determine distance between data points in Dataset
and the cluster centers and assigning the data point to its nearest cluster”. Once the clusters
have been formed, new centers are calculated by taking the mean of the points in each cluster.
This is necessary because the newly assigned points may change the centers of the clusters. The
iteration continues until the centers are no longer updated. The squared-error criterion is
used, which is defined by equation 1:
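In its standard form, the squared-error criterion sums the squared distance between each data point and the center of the cluster it is assigned to. The notation is assumed here for illustration: S_j is the set of points assigned to cluster j, and C_j is its center.

```latex
E = \sum_{j=1}^{k} \sum_{d_i \in S_j} \lVert d_i - C_j \rVert^{2}
\qquad (1)
```

Each iteration of K-Means can only lower E or leave it unchanged, which is why the procedure converges, although possibly to a local minimum.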
Pseudocode for the k-Means algorithm is as follows:
Input: Dataset D of N data points di (i = 1 to N)
Desired number of clusters = k
Output: N data points clustered into k clusters.
Steps:
1. Randomly select k data objects from dataset D as the initial cluster centers.
2. Repeat
3. Calculate the distance between each data point di (i = 1 to N) and all k cluster centers Cj
(j = 1 to k) and assign the data object di to the nearest cluster j.
4. For each cluster j, recalculate the cluster center.
5. Until the cluster centers no longer change.
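The pseudocode above translates fairly directly into Python. The sketch below uses only the standard library and squared Euclidean distance. The `seed` and `max_iter` parameters, and the rule of keeping the old center when a cluster becomes empty, are additions made for this illustration.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points: Sequence[Point]) -> Point:
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(
    data: Sequence[Point], k: int, seed: int = 0, max_iter: int = 100
) -> Tuple[List[int], List[Point]]:
    """Return (cluster label per point, final cluster centers)."""
    rng = random.Random(seed)
    centers = rng.sample(list(data), k)  # step 1: random initial centers
    labels = [0] * len(data)

    for _ in range(max_iter):            # step 2: repeat
        # Step 3: assign every point to its nearest center.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in data
        ]
        # Step 4: recompute each center as the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            # An empty cluster keeps its old center (assumed rule).
            new_centers.append(mean_point(members) if members else centers[j])
        # Step 5: stop when the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers

    return labels, centers
```

For player data, each point could be a pair such as (batting average, bowling average), so that players with similar profiles end up in the same cluster.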
Pros of K Means Algorithm:
Simple.
Fast for low dimensional data.
It can find pure sub-clusters if a large number of clusters is specified.
Cons of K Means Algorithm:
K-Means cannot handle non-globular data of different sizes and densities.
K-Means will not identify outliers.
K-Means is restricted to data which has the notion of a center (centroid).
Bayesian Algorithm:
Efficient Bayesian Methods for Clustering [4]:
The Bayesian approach is based on the probability rules proposed by Bayes and
Laplace.
Bayes' rule states that:
P(theta | x) = P(x | theta) P(theta) / P(x)
Here x is a data point and theta denotes the model parameters. P(theta) is the prior
probability of theta, that is, the probability of theta before observing any
information about x. P(x | theta) is the probability of x conditioned on
theta and is called the likelihood. P(theta | x) is the posterior
probability of theta after observing x, and P(x) is the normalizing
constant.
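A small worked example may help to fix the roles of prior, likelihood, posterior and normalizing constant. The sketch below assumes a discrete set of candidate values for theta, and the numbers are made up for illustration.

```python
from typing import List, Tuple


def posterior(
    prior: List[float], likelihood: List[float]
) -> Tuple[List[float], float]:
    """Apply Bayes' rule over a discrete set of candidate parameters.

    prior[i]      = P(theta_i)
    likelihood[i] = P(x | theta_i)
    Returns (posterior probabilities P(theta_i | x), evidence P(x)).
    """
    joint = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(joint)  # normalizing constant P(x)
    return [j / evidence for j in joint], evidence


# Three hypothetical coin biases with a uniform prior; we observe one head,
# so the likelihood of each candidate is simply its bias.
post, evidence = posterior([1 / 3, 1 / 3, 1 / 3], [0.2, 0.5, 0.8])
```

The posterior values sum to one, and the candidate with the highest likelihood receives the largest share of the probability.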
Bayesian Mixture Model:
To avoid overfitting when estimating the parameters theta, theta is integrated out. Since
theta is unknown, the average over all possible values of theta is taken. This computation is
often called computing the marginal likelihood. The equation of the Bayesian mixture model is
given below:
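The marginal likelihood described in this paragraph has the following standard form. Treating theta as a continuous parameter is an assumption made here; for a discrete parameter the integral becomes a sum.

```latex
P(x) = \int P(x \mid \theta)\, P(\theta)\, d\theta
```

This is the same quantity that appears as the normalizing constant in Bayes' rule.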
Selection of the Clustering Model:
Sorting:
Sorting arranges the data items within each set according to a common attribute. For
cricket team selection, sorting the data sets in ascending order is the most suitable
choice.
Binding from the clusters:
The data sets that have been clustered and sorted reside in multiple locations, so they
are combined to form teams before being forwarded for graphing. The data from the clusters
are added to the rows and columns of a matrix.
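One possible reading of the sorting and binding steps is sketched below. The text does not specify how rows and columns are filled, so the layout is an assumption: each cluster becomes one column, sorted in ascending order of a performance score, and each row of the resulting matrix is a candidate group of players. All names and scores are hypothetical.

```python
from typing import Dict, List, Tuple

Player = Tuple[str, float]  # (name, performance score), both hypothetical


def sort_clusters(
    clusters: Dict[int, List[Player]]
) -> Dict[int, List[Player]]:
    """Sort the members of every cluster in ascending order of score."""
    return {
        cid: sorted(members, key=lambda p: p[1])
        for cid, members in clusters.items()
    }


def bind_teams(sorted_clusters: Dict[int, List[Player]]) -> List[List[str]]:
    """Bind clusters column-wise: row i takes the i-th player of each."""
    columns = [sorted_clusters[cid] for cid in sorted(sorted_clusters)]
    n_rows = min(len(col) for col in columns)  # stop at the shortest cluster
    return [[col[i][0] for col in columns] for i in range(n_rows)]


clusters = {
    0: [("A", 30.0), ("B", 20.0)],  # e.g. a cluster of batsmen
    1: [("C", 5.0), ("D", 9.0)],    # e.g. a cluster of bowlers
}
teams = bind_teams(sort_clusters(clusters))
```

Because every row draws one player from each cluster, the rows mix player types, which is one way to obtain the balanced teams the proposed method aims for.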
Proposed Model Architecture explanation:
The architecture of the
Use case Diagram of the Proposed System:
The client supplies the player data with all the statistics needed. The admin takes the data
from the client and loads it into the R tool. Using the K-Means clustering algorithm in
RStudio, clusters are created according to the corresponding statistics in the data given by
the client. The clustered data is sorted and used for forming the teams. The generated teams
are then used to plot the graph of batting performance and bowling performance.