Posts

What is this Blog about?

This blog describes how to apply the Sabermetrics approach to cricket using machine learning techniques. It is an easy, step-by-step guide to applying machine learning concepts to model the game of cricket. The language used for programming and analysis is R. The ultimate goal of this project is to propose a data mining system for selecting a winning cricket team!

What is Cricket Sabermetrics?

In the 1980s, Bill James's work on the statistical analysis of baseball, named Sabermetrics, gained major popularity. Sabermetrics is the study of baseball through scientific and objective analysis. The term is derived from the acronym SABR, which stands for the Society for American Baseball Research. Recognition of this work led to a revolution in the game of baseball. The Oakland Athletics went on to win twenty consecutive games in the 2002 season, an American League record at the time, using Sabermetrics. Many teams and clubs adopted this approach in the following years and succeeded. After witnessing how sabermetrics transformed baseball and the positive outcomes it produced, many other sports, including cricket, have started to follow similar approaches. Nowadays, analysis in a particular subject area is often interrelated with data science and machine learning. Modern cricket sabermetrics can be defined as the application of machine learning to cricket data. I recommend to watch criti...

Getting Started with Data set

The data set used in this project is downloaded from Cricsheet. I am planning to do the analysis in two phases: Phase 1 is a team-wise evaluation and Phase 2 is a player-wise evaluation. First we will focus on the team-wise evaluation. The team chosen for the evaluation is Sri Lanka. You can download the data set used in this tutorial here (See diagram below). In any kind of data mining analysis, the very first step is preparing the data. Missing values, repeated data, erroneous data and data type mismatches are some of the common things to consider when working with raw data. So in this post I will guide you through the process of getting our data ready. Our data set is in YAML format (at the time of writing, Cricsheet also provides experimental CSV and XML formats). Since Cricsheet describes those formats as experimental, I recommend working with the YAML format. However, base R has no built-in function to load a YAML data set directly. As a solution I ...
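The data preparation checks listed above (repeated data, data type mismatches, missing values) can be sketched in base R. This is only a minimal illustration: the `matches` data frame, its column names and its values are made up for the example and are not the Cricsheet data.

```r
# Hypothetical match-level data; column names and values are invented for illustration.
matches <- data.frame(
  match_id = c(1, 2, 2, 3, 4),
  venue    = c("Colombo", "Galle", "Galle", NA, "Kandy"),
  runs     = c("250", "198", "198", "301", "n/a"),
  stringsAsFactors = FALSE
)

# 1. Repeated data: drop exact duplicate rows.
matches <- matches[!duplicated(matches), ]

# 2. Data type mismatch: runs was read as text, so convert it to numeric.
#    Entries that are not numbers (such as "n/a") become NA.
matches$runs <- suppressWarnings(as.numeric(matches$runs))

# 3. Missing values: count them per column, then keep only complete rows.
missing_per_column <- colSums(is.na(matches))
clean_matches <- matches[complete.cases(matches), ]

print(missing_per_column)
print(clean_matches)
```

Whether to drop incomplete rows or fill them in depends on the analysis; dropping them is simply the shortest option to show here.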

Plot and Play : Finding Correlation between Attributes

It's always good to get to know our data, and visualizing it is a great way to understand it intuitively. When it comes to visualization, plots and graphs will be our first choice. In statistical analysis, people call this procedure Plot and Play. I am going to use the ggplot2 package in R for basic plots. To install ggplot2 in R, you can simply type install.packages("ggplot2") . Relying as little as possible on domain knowledge is generally good practice in statistical analysis. But in practice, it is really hard to analyze data purely without domain knowledge. In the following section I am going to see what kind of relations exist between chosen attributes. These attributes are chosen based on domain knowledge, which means I think they can possibly be interrelated. But you are free to select any pair of attributes and identify patterns. I created a function called corPlot to plot graphs using ggplot2, given a data frame and a set of attributes. corPlot <- function (dataFrame, attribute_x, attribute_y, xN...
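Alongside the plots, the strength of a relation between two attributes can be checked numerically. The sketch below uses base R's `cor` function on a hypothetical `innings` data frame; the column names and values are invented for illustration and do not come from the data set used in this project.

```r
# Hypothetical innings-level data; column names and values are invented.
innings <- data.frame(
  run_rate     = c(4.2, 5.1, 5.8, 6.3, 4.9, 6.0),
  total_runs   = c(205, 260, 285, 320, 240, 298),
  wickets_lost = c(10, 8, 6, 5, 9, 7)
)

# Pearson correlation for one chosen pair of attributes.
r_runs <- cor(innings$run_rate, innings$total_runs)

# Correlation matrix for every pair of numeric attributes at once.
cor_matrix <- cor(innings)

print(r_runs)
print(round(cor_matrix, 2))
```

A value close to 1 or -1 suggests a strong linear relation worth plotting, while a value near 0 suggests there is little linear relation between the pair.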

Association Rules

If you don't know about Association Rules, don't worry. This post will introduce them in an easy way. Association Rule Learning is also known as Association Rule Mining. It simply means finding associations between attributes. For example, when you buy something from an online store, it will prompt you with "people who bought this also bought...". How do you think they give recommendations like this? The magic behind it is basically association rule mining. In this post we are going to find association relationships between attributes. But please note that the time taken for the analysis is considerably high. I was using an i3 machine with 4GB RAM, which ran out of memory most of the time. Therefore, I recommend using a more powerful machine. The following code finds the association rules between the 11 attributes. This code writes the generated association rules to a .CSV file. #loading necessary libraries library(ar...
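To make the idea of an association rule concrete, the three standard measures used to rank a rule (support, confidence and lift) can be computed by hand in base R. This is a minimal sketch on toy data, not the code used in this post: the `toy_matches` data frame, its columns and the helper `rule_measures` are all hypothetical.

```r
# Toy data: each row is a match, each column a made-up TRUE/FALSE attribute.
toy_matches <- data.frame(
  won_toss     = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE),
  batted_first = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE),
  won_match    = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE)
)

# Hypothetical helper: measures for the rule {lhs attributes} => {rhs attribute}.
rule_measures <- function(data, lhs, rhs) {
  lhs_true <- Reduce(`&`, data[lhs])   # rows where every lhs attribute is TRUE
  rhs_true <- data[[rhs]]              # rows where the rhs attribute is TRUE
  both     <- lhs_true & rhs_true

  support    <- mean(both)                  # share of rows matching the whole rule
  confidence <- sum(both) / sum(lhs_true)   # share of lhs rows that also match rhs
  lift       <- confidence / mean(rhs_true) # confidence relative to the rhs base rate

  c(support = support, confidence = confidence, lift = lift)
}

measures <- rule_measures(toy_matches, c("won_toss", "batted_first"), "won_match")
print(measures)
```

A lift above 1 means the left-hand side and right-hand side occur together more often than they would if they were independent, which is what makes a rule interesting.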