Data Abstraction Using R


Upload: shiva-bandaru

Post on 13-Apr-2016


DESCRIPTION

Data Modelling and Data Modification Using R

TRANSCRIPT


Abstract:

Introduction:

In any sport, putting together the best team is the key to success, so team formation and selection is a vital problem. For this research paper I have selected the game of cricket as an illustration, to find a solution to the problem of building a team from a group of players using their recorded statistics. Cricket is a game played between two teams of eleven members each, in which one team bats while the other bowls, and the roles are then reversed. The batting team tries to score as many runs as possible, and the bowling team tries to restrict the batting team's score by dismissing the batsmen one after another. The winning chances of the two teams depend on their batting and bowling strengths and on other factors such as fielding and captaincy.

Team selection is usually based on past experience and some basic techniques. The simplest technique is to select the 3 to 4 best-performing batsmen and bowlers and to fill the rest of the positions with players who fit the remaining budget. This method does not give the best solution, because matches are won by the effort of the whole team and not by having a few of the best batsmen and bowlers in it.

The aim of this research paper is to suggest a team formation that gives the best results with respect to multiple criteria. The batting and bowling performances of a team are the vital criteria, and they conflict with each other: in general a team can have the best batting performance or the best bowling performance, but not both. Some of the minor factors are fielding, captaincy, wicket-keeping and the consistent form of the players. Forming high-performing teams with the best trade-off between the important criteria is a challenging task.

Existing methods of Cricket team Selection:

Cricket is a game in which a player's statistics cover many criteria, such as the number of matches played, number of runs made, batting average, batting strike rate, bowling average, number of wickets taken and number of overs bowled. The main objective of the owner or franchise of a team is to select a team of eleven players with the best batting, bowling and fielding performances, and this has to be done within the budget. The rules require the franchise to select at least one player as captain and one player as wicket-keeper.

The batting average and bowling average of the players from the last four T20 cricket matches are taken as the measures of their batting and bowling performance. Each player is assigned a tag indicating the player's unique identity. Using the available data, the team selection problem is formulated as a multi-objective optimization problem as follows:

Notice that the wicket-keeper (w) does not affect the bowling performance of a team. The team

is restricted to the following constraints:

T is taken as the team, which consists of c (the tag of the captain, chosen from a captain list (CL) of 10 names, presented in Table 6), w (the tag of the wicket-keeper, chosen from a wicket-keeper list (WL) of 15 names, presented in Table 6), and p1, ..., p9 (the tags of the nine other players, chosen from 129 names arranged in a ranked list (RL), presented in Tables 4 and 5, excluding the chosen captain and wicket-keeper) [1]. Since a team is a set of eleven distinct players, no two players can be the same.

For the tournament considered here, the Indian Premier League (IPL), players are also tagged as foreign or Indian. The IPL rules say that a team must not have more than 4 foreign players and that the total cost of the team must be within the specified budget. The procedures for computing the batting, bowling and fielding performances are given below:

Batting Performance:

A player's batting average is the total number of runs he has scored divided by the number of times he has been out [1]. To calculate the team's net batting performance, only the players identified as designated batsmen, those who have scored at least 300 runs in T20 international matches, are considered, and only their averages count towards the team's batting average. This condition is used to maximize the team's batting average by excluding the poor batting averages of players who are not designated batsmen.

Bowling Performance:

A bowler's bowling average is defined as the total number of runs conceded by the bowler divided by the number of wickets taken by the bowler [1]. The lower the bowling average, the better the bowler's performance. Only the averages of qualified bowlers, those who have taken at least 20 wickets in T20 international matches, are counted in the team's net bowling average. The team's bowling average is a measure of the bowling performance of its bowlers: the lower the net bowling average, the higher the bowling performance.
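The batting and bowling measures described above can be sketched in a few lines. This is a minimal illustration in Python (the paper itself works in R), and it assumes that the team-level figure is the sum of the qualifying players' averages; the text does not say whether a sum or a mean is used. The `Player` fields and the sample figures are invented for the example.

```python
from dataclasses import dataclass

MIN_RUNS = 300     # qualifying threshold for designated batsmen (from the text)
MIN_WICKETS = 20   # qualifying threshold for bowlers (from the text)

@dataclass
class Player:
    name: str            # hypothetical field names, for illustration only
    runs_scored: int
    times_out: int
    runs_conceded: int
    wickets: int

def batting_average(p: Player) -> float:
    """Runs scored divided by the number of times out (higher is better)."""
    return p.runs_scored / p.times_out

def bowling_average(p: Player) -> float:
    """Runs conceded divided by wickets taken (lower is better)."""
    return p.runs_conceded / p.wickets

def team_batting(team: list[Player]) -> float:
    # Only designated batsmen count; summing is an assumption of this sketch.
    return sum(batting_average(p) for p in team
               if p.runs_scored >= MIN_RUNS and p.times_out > 0)

def team_bowling(team: list[Player]) -> float:
    # Only qualified bowlers count; summing is an assumption of this sketch.
    return sum(bowling_average(p) for p in team if p.wickets >= MIN_WICKETS)

team = [
    Player("A", 600, 20, 0, 0),      # invented figures
    Player("B", 150, 10, 500, 25),
    Player("C", 450, 15, 300, 5),
]
print(team_batting(team))   # 30.0 + 30.0 = 60.0 (B has fewer than 300 runs)
print(team_bowling(team))   # 500 / 25 = 20.0 (only B has 20 or more wickets)
```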

Fielding Performance:

The team's net fielding performance is the sum of the individual players' fielding performances. The number of stumpings made by a wicket-keeper is taken as the measure of his wicket-keeping performance.

Multi-Objective Formulation and NSGA-II:

For the construction of a team in NSGA-II, the three constraints are

g1(t): c ∈ Captain list,

g2(t): w ∈ Wicket-keeper list,

g3(t): No two players are identical in a team.

To handle these constraints, each player is given a tag. To assign the tags, the players are sorted according to their cost and each is assigned a unique integer indicating the player's rank, in the range 1 to 129. The sorted and ranked list is called the ranked list (RL). From the sorted list the captains are identified, marked as C and arranged in ascending order of cost; this list is called the captain list (CL). Using the same procedure, the wicket-keepers are identified and arranged in ascending order of price; this list is called the wicket-keeper list (WL).

Some entries in CL and WL may refer to the same player, and all the entries in CL and WL correspond to players present in RL.
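The construction of RL, CL and WL can be sketched as follows. This is a minimal Python illustration (the paper itself works in R) with four invented players instead of 129; it assumes that RL is sorted in ascending order of cost, which the text states only for CL and WL, and the tuple layout is hypothetical.

```python
# (name, cost, can_captain, can_keep_wicket): invented data for illustration
players = [
    ("P1", 900, True, False),
    ("P2", 300, False, True),
    ("P3", 500, True, True),
    ("P4", 100, False, False),
]

# RL: sort by cost (ascending order assumed) and tag each player 1..N.
ranked = sorted(players, key=lambda p: p[1])
RL = {tag: player for tag, player in enumerate(ranked, start=1)}

# CL and WL hold the tags of eligible players, in ascending order of cost.
CL = [tag for tag, player in RL.items() if player[2]]
WL = [tag for tag, player in RL.items() if player[3]]

print(CL)   # [3, 4]
print(WL)   # [2, 3]  (tag 3 appears in both lists: the same player)
```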

The construction of a population member in the initial generation is discussed below:

Every population member is represented by two vectors: (i) a code vector and (ii) a variable vector.

Procedure for creating a code vector:

The first element of the code vector is a random entry from CL, denoted c, the second element is a random wicket-keeper from WL, denoted w, and the remaining nine players are selected from RL so that there is no repetition among the nine elements. They are thus nine unique integers in the range 1 to 129. If c or w is identical to one of the nine team members, another team member is chosen at random, and the process is repeated until all nine members differ from c and w. The tags of the nine members are then arranged in ascending order.

Procedure for creating the variable vector:

The variable vector is created from the code vector by copying the tag of every player one by one, starting with the captain, then the wicket-keeper, and then the nine other team members. The difference between the code vector and the variable vector lies in the first two elements: in the code vector they are the ranks of the captain and wicket-keeper in CL and WL respectively, whereas in the variable vector they are the tags of the captain and wicket-keeper.
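The two vectors described above can be sketched as follows. This is a minimal Python illustration (the paper itself works in R), not the authors' implementation: ranks are 0-based here for convenience, the nine remaining tags are drawn from a pool that already excludes c and w (equivalent to redrawing on a clash), and the CL and WL contents are invented.

```python
import random

def make_code_vector(CL, WL, n_players, rng):
    """Code vector: [rank in CL, rank in WL, nine sorted tags from RL]."""
    c_rank = rng.randrange(len(CL))      # 0-based rank within CL
    w_rank = rng.randrange(len(WL))      # 0-based rank within WL
    taken = {CL[c_rank], WL[w_rank]}
    # Nine distinct tags in 1..n_players, none equal to the captain or keeper.
    pool = [t for t in range(1, n_players + 1) if t not in taken]
    others = sorted(rng.sample(pool, 9))
    return [c_rank, w_rank] + others

def make_variable_vector(code, CL, WL):
    """Variable vector: the code vector with the two ranks replaced by tags."""
    return [CL[code[0]], WL[code[1]]] + code[2:]

rng = random.Random(0)
CL = [5, 12, 20, 33, 47, 58, 71, 90, 104, 120]                      # 10 invented tags
WL = [3, 5, 9, 14, 26, 31, 40, 52, 63, 77, 85, 96, 108, 115, 127]   # 15 invented tags
code = make_code_vector(CL, WL, 129, rng)
variable = make_variable_vector(code, CL, WL)
print(code)
print(variable)
```

Note that tag 5 appears in both invented lists, so the captain and wicket-keeper can still turn out to be the same player; this sketch does not handle that case.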

The solution is given below:

There is a chance that the captain and the wicket-keeper are the same person, in which case there will be only 10 players in the team. That is, the first two elements of the code vector refer to the same player, so the variable vector contains the same tag twice and the team has only 10 players, which is not acceptable. The solution to this problem is to replace the second element of the variable vector with a random integer from RL, ensuring that it is not identical to any of the 10 players already present in the variable vector.

In other words, when the captain and the wicket-keeper are the same player, the code vector has 11 unique elements but the team has only 10 players. To solve this issue, the second element of the variable vector is replaced by another random tag from RL that differs from the tags of the other team players. The revised variable vector is then used for computing the objectives.

As a practical example, suppose the captain is Dhoni of the Indian cricket team, who is himself a wicket-keeper, but the chosen wicket-keeper is a different person (say Brendon McCullum or Parthiv Patel). We still treat the situation as if the wicket-keeper were identical to the captain (that is, the captain will also perform the job of wicket-keeping) and replace the chosen wicket-keeper with a random player from RL who is not identical to any of the already chosen players.
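The repair step can be sketched as follows. This is a minimal Python illustration (the paper itself works in R) under one reading of the text: the second element is replaced whenever the captain's tag appears in WL, which covers both the identical-tag case and the case of a captain who can keep wicket. The function name and the sample tags are hypothetical.

```python
import random

def repair_variable_vector(variable, WL, n_players, rng):
    """Replace the wicket-keeper slot when the captain can keep wicket himself."""
    if variable[0] not in WL:
        return list(variable)            # captain is not a keeper: nothing to do
    used = set(variable)
    # Candidate replacements: any tag in RL not already in the team.
    free = [t for t in range(1, n_players + 1) if t not in used]
    repaired = list(variable)
    repaired[1] = rng.choice(free)
    return repaired

WL = [3, 5, 9]                                   # invented wicket-keeper tags
same = [5, 5, 1, 2, 4, 6, 7, 8, 10, 11, 12]      # captain and keeper share tag 5
fixed = repair_variable_vector(same, WL, 129, random.Random(1))
print(len(set(same)), len(set(fixed)))           # 10 11
```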

With the above procedure, all three constraints are satisfied by every population member. The variable vector is then used to evaluate the remaining constraints, such as the budget: the number of foreign players in the team is counted and the total budget for hiring the team is calculated. The formula for the budget analysis given below tells us which teams are feasible and which are infeasible.
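As a minimal sketch of this feasibility check, assuming it amounts to a cap on the total cost plus the limit of four foreign players, the test could look like this in Python (the paper itself works in R). The budget value, the cost units and the team data are invented for the example.

```python
MAX_FOREIGN = 4   # IPL limit on foreign players (from the text)

def is_feasible(team, budget):
    """team: list of (cost, is_foreign) pairs; budget: assumed spending limit."""
    total_cost = sum(cost for cost, _ in team)
    n_foreign = sum(1 for _, is_foreign in team if is_foreign)
    return total_cost <= budget and n_foreign <= MAX_FOREIGN

# Eleven invented players, costs in arbitrary units; four are foreign.
team = [(900, True), (800, False), (700, True), (600, False), (500, True),
        (500, False), (400, True), (400, False), (300, False), (200, False),
        (100, False)]

print(is_feasible(team, budget=6000))   # True: cost 5400, 4 foreign players
print(is_feasible(team, budget=5000))   # False: cost exceeds the budget
```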

Multi-Objective Optimization Results:

The following graph shows the trade-off front obtained by the modified NSGA-II. Each point on the trade-off front represents a team of 11 players. A few teams corresponding to the trade-off points marked on the figure are shown in Table 1. Each team has a captain, a wicket-keeper and at most four foreign players, and each satisfies the budget constraint. For comparison, consider the real team of 11 cricketers who played for Chennai Super Kings (CSK) in the IPL tournament that took place in India from April 2011 to May 2011 and won the trophy. The batting and bowling performances of this team are calculated by the same method described above. The total cost of the CSK team in this tournament is $7.5 million, which makes the team infeasible under this formulation. Even so, the CSK team has bowling and batting performances that are worse than those of a number of teams found by our NSGA-II. The graph shows that the CSK team is non-optimal and costlier too.
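The idea of a trade-off front can be illustrated with a plain dominance filter. This Python sketch (the paper itself works in R) is not NSGA-II itself; it shows only the non-domination test on which such a front is based, assuming batting performance is maximized and the bowling average is minimized. The team figures are invented.

```python
def dominates(a, b):
    """a, b: (batting, bowling) pairs; batting is maximized, bowling minimized."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better

def trade_off_front(teams):
    """Keep the teams that no other team dominates."""
    return [t for t in teams if not any(dominates(o, t) for o in teams)]

teams = [(300.0, 120.0), (280.0, 100.0), (270.0, 125.0), (310.0, 140.0)]
print(trade_off_front(teams))
# [(300.0, 120.0), (280.0, 100.0), (310.0, 140.0)]
```

The team (270.0, 125.0) is dropped because (300.0, 120.0) bats better and bowls better, which is the same sense in which the CSK team is called non-optimal above.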

Disadvantages of the existing system:

This approach may not give an optimal or a near-optimal solution. Teams are not balanced.

Proposed Method for selection of Cricket Team:

The proposed method should generate the maximum number of teams with the best performance, and all the teams should be balanced.

Main Modules for Proposed Method:

The main modules of the proposed system are:

Clustering

Sorting

Binding


Clustering:

Clustering is a type of analysis that finds groups (clusters) of data objects which are similar to each other. The data objects in a cluster are more like each other than they are like the members of other clusters. Clustering is a data mining technique used to group data elements with similar properties into related groups, and it is considered an unsupervised learning problem. A definition of clustering would be "the process of organizing objects into groups whose members are similar in some way". In general, a cluster is a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters.

Some useful background on what clustering is and how it is used is given below:

Data Clustering: 50 Years Beyond K-Means [2]:

Nowadays the volume of data is increasing in many forms, such as structured, unstructured and raw data. Understanding and summarizing the data is called data analysis. Data analysis is of two types, exploratory (or descriptive) and confirmatory (or inferential): “exploratory or descriptive, meaning that the investigator does not have pre-specified models or hypotheses but wants to understand the general characteristics or structure of the high dimensional data”[2]. “confirmatory or inferential, meaning that the investigator wants to confirm the validity of a hypothesis or model or a set of assumptions given the available data”[2].

In pattern recognition, data analysis is concerned with predictive modeling: a model is built from training data and used to predict unseen data, and this task is called learning. Learning problems are divided into (i) supervised learning, that is classification, and (ii) unsupervised learning, that is clustering. Webster [Merriam-Webster Online Dictionary, 2008] defines cluster analysis as “a statistical classification technique for discovering whether the individuals of a population fall into different groups by making quantitative comparisons of multiple characteristics”. An example of clustering is given below with figures:


The seven clusters in (a) (denoted by seven different colors in 1(b)) differ in shape, size, and

density.

Why Clustering:

Clustering is used to analyse data with many variables by grouping similar items together. It is used mainly for three purposes:

Underlying structure: to gain insight into data, generate hypotheses, detect anomalies, and

identify salient features.

Natural classification: to identify the degree of similarity among forms or organisms

(phylogenetic relationship).

Compression: as a method for organizing the data and summarizing it through cluster prototypes.

There are many clustering algorithms. According to the research, some of them are:

K-Means

Bayesian

K-Means Clustering Algorithm:

Experimental study of Data clustering using k-Means and modified algorithms [3]:

K-Means is a prominent clustering algorithm. The main idea is to partition the data objects into K clusters, where K is the number of clusters and must be specified in advance; the algorithm is an iterative relocation technique that converges to a local minimum. K-Means clustering consists of two separate phases: “First phase is to determine k centers at random one for each cluster. Next phase is to determine distance between data points in Dataset and the cluster centers and assigning the data point to its nearest cluster”. Once the clusters have been formed, new centers are calculated by taking the mean of the points in each cluster. This is necessary because newly assigned points may change the centers of the clusters. The iteration continues until the centers are no longer updated. The square error criterion is used, which is defined by equation 1:
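Assuming the usual sum-of-squared-errors form of this criterion, with C_j the center of cluster j as in the pseudocode and S_j (a symbol introduced here) the set of data points assigned to cluster j, it can be written as:

```latex
E = \sum_{j=1}^{k} \sum_{d_i \in S_j} \left\lVert d_i - C_j \right\rVert^{2}
```

Each iteration of K-Means does not increase E, which is why the procedure settles at a local minimum.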

Pseudo code for k-Means algorithm is as follows:

Input: Dataset D of N data points di (i = 1 to N)

Desired number of clusters = k

Output: N data points clustered into k clusters.

Steps:

1. Randomly select k data objects from dataset D as the initial cluster centers.

2. Repeat

3. Calculate the distance between each data point di (i = 1 to N) and all k cluster centers Cj (j = 1 to k), and assign the data object di to the nearest cluster j.

4. For each cluster j, recalculate the cluster center.

5. Until the cluster centers no longer change.
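The steps above can be sketched in plain Python (the paper itself runs K-Means in R). This is a minimal illustration using squared Euclidean distance; the `max_iter` safety cap, the handling of empty clusters and the sample data are assumptions of the sketch, not part of the pseudocode.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(data, k, rng, max_iter=100):
    centers = rng.sample(data, k)                 # Step 1: random initial centers
    clusters = []
    for _ in range(max_iter):                     # Step 2: repeat
        # Step 3: assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for point in data:
            nearest = min(range(k),
                          key=lambda j: squared_distance(point, centers[j]))
            clusters[nearest].append(point)
        # Step 4: recompute each center (an empty cluster keeps its old center).
        new_centers = [mean_point(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:                # Step 5: centers unchanged, stop
            break
        centers = new_centers
    return centers, clusters

# Two well-separated groups of invented 2-D points.
data = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
centers, clusters = k_means(data, k=2, rng=random.Random(0))
print(sorted(sorted(c) for c in clusters))
```

In the cricket setting each point would be a player's statistics vector (for example batting average and bowling average) rather than an invented coordinate pair.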

Pros of K Means Algorithm:

Simple.

Fast for low dimensional data.

It can find pure sub-clusters if a large number of clusters is specified.

Cons of K Means Algorithm:

K-Means cannot handle non-globular data of different sizes and densities.

K-Means will not identify outliers.

K-Means is restricted to data which has the notion of a center (centroid).

Bayesian Algorithm:

Efficient Bayesian Methods for Clustering [4]:

The Bayesian approach is based on the mathematical framework proposed by Bayes and Laplace.

Bayes' rule states that:

P(theta | x) = P(x | theta) P(theta) / P(x)

Here x is a data point and theta denotes the model parameters. P(theta) is the prior probability of theta, that is, the probability of theta before observing any information about x. P(x | theta) is the probability of x conditioned on theta and is called the likelihood. P(theta | x) is the posterior probability of theta after observing x, and P(x) is the normalizing constant.

Bayesian Mixture Model:

To avoid overfitting when estimating the parameters theta, theta is integrated out. Since theta is unknown, the likelihood is averaged over all possible values of theta. This computation is often called computing the marginal likelihood. The equation of the Bayesian mixture model is given below:
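Assuming the standard form of the marginal likelihood that this description corresponds to, with x the data and theta the model parameters as defined for Bayes' rule, the averaging over theta can be written as:

```latex
P(x) = \int P(x \mid \theta)\, P(\theta)\, d\theta
```

This is the same P(x) that appears as the normalizing constant in Bayes' rule; the specific mixture-model form used in [4] is not reproduced here.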

Selection of the Clustering Model:

Sorting:

Sorting is the arrangement of data items in a defined order within sets whose members have something in common. Sorting the data sets in ascending order is the most suitable approach for cricket team selection.

Binding from the clusters:

The clustered and sorted data sets reside in multiple locations, so they are combined to form a team before being forwarded for graphing. The data sets from the clusters are added to a matrix as rows and columns.
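The sorting and binding modules can be sketched together. This is a minimal Python illustration (in R the binding step would typically be a row bind of data frames or matrices); the cluster names, the single statistic per player and the figures are invented, and the way the paper actually picks players from each cluster is not specified in the text.

```python
# Invented clusters: each row is (player, statistic used for sorting).
clusters = {
    "batsmen": [("P3", 28.5), ("P1", 41.0)],
    "bowlers": [("P7", 24.0), ("P5", 19.5)],
}

# Sorting: order every cluster in ascending order of its statistic.
sorted_clusters = {name: sorted(rows, key=lambda row: row[1])
                   for name, rows in clusters.items()}

# Binding: stack the rows of all clusters into one matrix,
# keeping a column that records the cluster each row came from.
team_matrix = [[name, player, value]
               for name, rows in sorted_clusters.items()
               for player, value in rows]

for row in team_matrix:
    print(row)
# ['batsmen', 'P3', 28.5]
# ['batsmen', 'P1', 41.0]
# ['bowlers', 'P5', 19.5]
# ['bowlers', 'P7', 24.0]
```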

Proposed Model Architecture explanation:

The architecture of the

Use case Diagram of the Proposed System:

The client provides the data of the players with all the statistics needed. The admin takes the data from the client and loads it into the R tool. Using the K-Means clustering algorithm in RStudio, clusters are created according to the corresponding statistics in the data given by the client. The clustered data is sorted and used for forming the teams. The generated teams are used to plot the graph of batting performance against bowling performance.