The more I write code in R, the more I am impressed by the facilities the language provides, which are perfectly tailored to cleaning and tidying data, one of the most crucial steps in statistical analysis. As you may agree, choosing the wrong tools for a development task can turn it into a futile endeavour. In that sense, R is absolutely the right sieve for a data miner.
Operations like selecting and filtering, combined in nested function calls while cleaning your data, can become a very tedious programming style. Not only does the readability of your code suffer, but your R programs also become fragile quickly. Fortunately, there are libraries out there that make chaining such data processing steps much easier.
Let’s start with an example in which we work on mtcars, a data set of cars that ships with R. It has features such as mpg for miles per gallon, cyl for the number of cylinders, hp for horsepower, and gear for the number of forward gears.
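A quick way to look at these features is to print the first few rows. This is only a sketch, assuming the built-in mtcars data set; the variable name cars_preview is made up for illustration.

```r
# mtcars is part of the "datasets" package that ships with R,
# so nothing has to be installed or loaded.
# Keep only the four columns mentioned in the text and show the first six rows.
cars_preview <- head(mtcars[, c("mpg", "cyl", "hp", "gear")])
print(cars_preview)
```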
Our first task is to select the cars that have 4 cylinders.
With the raw bracket selector and a condition, we can pick out the cars that have exactly 4 cylinders. However, this form of filtering requires an awkward expression and repeats the name mtcars, which tends to be noisy.
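In base R, such a selection would typically look like the sketch below. The variable name four_cyl is an assumption chosen for illustration.

```r
# Row selector in base R: the logical vector mtcars$cyl == 4 picks the rows,
# and the empty slot after the comma keeps all columns.
four_cyl <- mtcars[mtcars$cyl == 4, ]
print(four_cyl)
```

Note how mtcars appears twice, once as the object being indexed and once inside the condition.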
It gets even worse when you have more conditions, for example when selecting the cars with 4 cylinders and at least 100 hp: the data set name has to be repeated for every condition.
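A base R sketch of this two-condition selection could look as follows; strong_four_cyl is a hypothetical variable name.

```r
# Two conditions combined with the element-wise AND operator (&).
# The data set name is written three times in a single expression.
strong_four_cyl <- mtcars[mtcars$cyl == 4 & mtcars$hp >= 100, ]
print(strong_four_cyl)
```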
Let’s try the same expression with dplyr. If you haven’t installed it yet, install it from CRAN first. The dplyr package provides utility functions like select, mutate, filter, etc., which give you a more convenient way of juggling your data. Its filter function, for instance, takes the data set once, followed by the conditions.
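As a minimal sketch, assuming dplyr is available from CRAN, the same selection with filter might be written like this. The variable name strong_four_cyl is made up for illustration.

```r
# One-time installation from CRAN; uncomment if dplyr is not installed yet.
# install.packages("dplyr")
library(dplyr)

# filter() takes the data frame first, then the conditions.
# Column names are used directly, and conditions separated by commas
# are combined with a logical AND.
strong_four_cyl <- filter(mtcars, cyl == 4, hp >= 100)
print(strong_four_cyl)
```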
This is much easier to read and more intuitive. Let’s take it a step further and apply a sort of projection to our data, returning only the columns we are interested in, cyl and hp.
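One way to sketch this projection is to wrap the filtered result in dplyr’s select function. The variable name cyl_hp is an assumption for illustration.

```r
library(dplyr)

# The inner call filters the rows; the outer call keeps only cyl and hp.
# The expression has to be read from the inside out.
cyl_hp <- select(filter(mtcars, cyl == 4, hp >= 100), cyl, hp)
print(cyl_hp)
```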
We use select for that. However, nesting expressions can also become very cumbersome as the operations you perform on the data get longer. We need some sort of pipe operator that forwards the output of one function to another, like the pipe operator | in UNIX systems. Fortunately, dplyr has something similar: %>% (re-exported from the magrittr package).
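The nested filter and select calls can be rewritten as a pipeline. The sketch below assumes dplyr is loaded; the variable name cyl_hp is hypothetical.

```r
library(dplyr)

# Each step receives the result of the previous step as its first argument,
# so the operations read top to bottom in the order they are applied.
cyl_hp <- mtcars %>%
  filter(cyl == 4, hp >= 100) %>%
  select(cyl, hp)
print(cyl_hp)
```

The result should match the nested form select(filter(mtcars, cyl == 4, hp >= 100), cyl, hp); only the way the calls are written changes.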
You don’t even need to pass the name of the data set to the second function: the pipe supplies the result of the previous step as its first argument.
Erhan