Getting Started with the Data Set
The data set used in this project was downloaded from Cricsheet. I plan to do the analysis in two phases: Phase 1 is a team-wise evaluation and Phase 2 is a player-wise evaluation. We will focus on the team-wise evaluation first, and the team chosen for it is Sri Lanka. You can download the data set used in this tutorial here (see diagram below).
In any kind of data mining analysis, the very first step is preparing the data. Missing values, repeated data, erroneous data and data type mismatches are some of the common issues to consider when working with raw data. In this post I will guide you through the process of getting our data ready.
Our data set is in YAML format (Cricsheet also provides experimental CSV and XML formats at the time of writing). Because Cricsheet describes those formats as experimental, I recommend working with the YAML format. However, base R has no built-in function for loading a YAML data set directly. As a solution, I have written a Java program that converts the YAML data set into CSV format. You can download the converted CSV file here. If you are interested in the Java program, you can find it through my GitHub profile.
Steps in Preparing the Data
- Load the CSV file.
- Convert the loaded file into a data frame.
- Convert columns into the appropriate data types (Character, Numeric, Boolean).
- Fill empty cells with values.
- (There are established methods for dealing with missing data, but in this data set we cannot substitute something like an average (mean) value because of the nature of the problem domain. Missing values are therefore represented as "unknown".)
- Remove stray white space from values. (This is not always necessary in a general data preparation process. Here, removing white space merges "Sri Lanka" and " Sri Lanka" into one attribute value. The latter has a leading space, probably a typo. Without this step, "Sri Lanka" and " Sri Lanka" would be treated as different values, which could produce erroneous results.)
- Remove repeated data.
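The steps above can be sketched in base R roughly as follows. This is a minimal illustration and not the attached script: the inline sample data and the column names (team, venue, runs, won) are assumptions made up for the example, and the white space step is assumed to mean trimming leading and trailing spaces with trimws().

```r
# Hypothetical sample standing in for the converted Cricsheet CSV.
# The column names and values below are illustrative assumptions.
csv_text <- "team,venue,runs,won
Sri Lanka,Colombo,250,true
 Sri Lanka,Colombo,250,true
India,,198,false
Sri Lanka,Galle,231,false"

# Steps 1-2: read.csv() loads the data and returns a data frame.
# For a real file, pass its path instead of text = csv_text.
# Reading every column as character keeps the raw values untouched.
matches <- read.csv(text = csv_text, colClasses = "character")

# Step 3: convert columns to appropriate data types.
matches$runs <- as.numeric(matches$runs)
matches$won  <- as.logical(matches$won)

# Steps 4-5: handle the text columns one by one.
text_cols <- c("team", "venue")
for (col in text_cols) {
  values <- matches[[col]]
  # Fill empty or missing cells with the placeholder "unknown".
  values[is.na(values) | values == ""] <- "unknown"
  # Trim leading and trailing white space, so that
  # " Sri Lanka" and "Sri Lanka" become the same value.
  matches[[col]] <- trimws(values)
}

# Step 6: remove repeated rows.
matches <- unique(matches)

print(matches)
```

Trimming before removing duplicates matters here: the first two sample rows only become identical once the leading space is gone, so unique() can then drop one of them.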
I have attached my R script.
Congratulations!!! You are about to start the Cricket Sabermetrics analysis.
