R Programming Quick Notes :: Part - 4

In Part - 3, we explored the factor type, the date and time types, control structures, and functions in R.

In this part, we will explore how to read data from different sources and how to use the apply functions.

Hands-on With R - IV

Data can come in different forms, such as text or binary. In addition, data can come from different sources, such as files or the network.

Most often, data arrives as text stored in the .csv (comma-separated values) format.

Let us create two sample CSV data files named sample-data.csv and sample-data-2.csv.

The following is the sample-data.csv file:

And, the following is the sample-data-2.csv file:

Also, create a gzipped version of sample-data.csv called sample-data.csv.gz. For convenience, we have uploaded both the csv and the gzipped file at PolarSPARC.
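For readers who want to recreate the files locally, here is a minimal sketch. The rows are reconstructed from the data.frame printed in the output further down. The spelling of missing values as NA, the three note lines and the blank line in sample-data-2.csv (inferred from the skip=3 and blank.lines.skip=TRUE arguments used in the script), and the out_dir variable are all assumptions, not the original files.

```r
# Rows reconstructed from the printed data.frame (NA spelling is an assumption)
csv_lines <- c(
  'age,color,education,employed,height,weight',
  '29,Red,PHD,NA,5.1,150',
  '25,Orange,BS,TRUE,5.1,170',
  '26,Blue,BS,NA,5.0,170',
  '28,Yellow,MS,NA,5.9,150',
  '21,Blue,MS,NA,5.1,160',
  '23,Blue,MS,NA,5.9,150',
  '24,Orange,HS,FALSE,5.0,NA',
  '27,Green,BS,TRUE,5.5,NA',
  '30,Red,BS,FALSE,5.9,190',
  '22,Yellow,HS,NA,5.9,150'
)

# Hypothetical output location; use '.' to write into the working directory
out_dir <- tempdir()

# Plain CSV file
csv_path <- file.path(out_dir, 'sample-data.csv')
writeLines(csv_lines, csv_path)

# Assumed layout: three note lines before the header and one blank line in the data
csv2_path <- file.path(out_dir, 'sample-data-2.csv')
notes <- c('Sample data set', 'Used in Part - 4', 'Header follows')
writeLines(c(notes, csv_lines[1:6], '', csv_lines[7:11]), csv2_path)

# Gzipped copy written through a gzfile connection
gz_path <- file.path(out_dir, 'sample-data.csv.gz')
gz_con <- gzfile(gz_path, open = 'w')
writeLines(csv_lines, gz_con)
close(gz_con)
```

With out_dir set to '.', the calls shown in read_csv_data.R should be able to read all three files.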

Let us create the following R script named read_csv_data.R in RStudio:

Execute the R script read_csv_data.R in RStudio and the following is the output:

Output (read_csv_data.R)

        > a <- read.table('sample-data.csv', header=TRUE, sep=',')
        > str(a)
        'data.frame':	10 obs. of  6 variables:
         $ age      : int  29 25 26 28 21 23 24 27 30 22
         $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
         $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 1 1 2
         $ employed : logi  NA TRUE NA NA NA NA ...
         $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
         $ weight   : int  150 170 170 150 160 150 NA NA 190 150
        > print(a)
           age  color education employed height weight
        1   29    Red       PHD       NA    5.1    150
        2   25 Orange        BS     TRUE    5.1    170
        3   26   Blue        BS       NA    5.0    170
        4   28 Yellow        MS       NA    5.9    150
        5   21   Blue        MS       NA    5.1    160
        6   23   Blue        MS       NA    5.9    150
        7   24 Orange        HS    FALSE    5.0     NA
        8   27  Green        BS     TRUE    5.5     NA
        9   30    Red        BS    FALSE    5.9    190
        10  22 Yellow        HS       NA    5.9    150
        >
        > b <- read.table('sample-data-2.csv', header=TRUE, sep=',', skip=3, blank.lines.skip=TRUE)
        > str(b)
        'data.frame':	10 obs. of  6 variables:
         $ age      : int  29 25 26 28 21 23 24 27 30 22
         $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
         $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 1 1 2
         $ employed : logi  NA TRUE NA NA NA NA ...
         $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
         $ weight   : int  150 170 170 150 160 150 NA NA 190 150
        > print(b)
           age  color education employed height weight
        1   29    Red       PHD       NA    5.1    150
        2   25 Orange        BS     TRUE    5.1    170
        3   26   Blue        BS       NA    5.0    170
        4   28 Yellow        MS       NA    5.9    150
        5   21   Blue        MS       NA    5.1    160
        6   23   Blue        MS       NA    5.9    150
        7   24 Orange        HS    FALSE    5.0     NA
        8   27  Green        BS     TRUE    5.5     NA
        9   30    Red        BS    FALSE    5.9    190
        10  22 Yellow        HS       NA    5.9    150
        >
        > c <- read.csv('sample-data.csv')
        > str(c)
        'data.frame':	10 obs. of  6 variables:
         $ age      : int  29 25 26 28 21 23 24 27 30 22
         $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
         $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 1 1 2
         $ employed : logi  NA TRUE NA NA NA NA ...
         $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
         $ weight   : int  150 170 170 150 160 150 NA NA 190 150
        Warning message:
        closing unused connection 3 (sample-data.csv.gz)
        > print(c)
           age  color education employed height weight
        1   29    Red       PHD       NA    5.1    150
        2   25 Orange        BS     TRUE    5.1    170
        3   26   Blue        BS       NA    5.0    170
        4   28 Yellow        MS       NA    5.9    150
        5   21   Blue        MS       NA    5.1    160
        6   23   Blue        MS       NA    5.9    150
        7   24 Orange        HS    FALSE    5.0     NA
        8   27  Green        BS     TRUE    5.5     NA
        9   30    Red        BS    FALSE    5.9    190
        10  22 Yellow        HS       NA    5.9    150
        >
        > # Use connection to file
        > d <- file('sample-data.csv', open='r')
        > e <- read.csv(d)
        > str(e)
        'data.frame':	10 obs. of  6 variables:
         $ age      : int  29 25 26 28 21 23 24 27 30 22
         $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
         $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 1 1 2
         $ employed : logi  NA TRUE NA NA NA NA ...
         $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
         $ weight   : int  150 170 170 150 160 150 NA NA 190 150
        > print(e)
           age  color education employed height weight
        1   29    Red       PHD       NA    5.1    150
        2   25 Orange        BS     TRUE    5.1    170
        3   26   Blue        BS       NA    5.0    170
        4   28 Yellow        MS       NA    5.9    150
        5   21   Blue        MS       NA    5.1    160
        6   23   Blue        MS       NA    5.9    150
        7   24 Orange        HS    FALSE    5.0     NA
        8   27  Green        BS     TRUE    5.5     NA
        9   30    Red        BS    FALSE    5.9    190
        10  22 Yellow        HS       NA    5.9    150
        >
        > # Use connection to compressed file
        > f <- gzfile('sample-data.csv.gz', open='r')
        > g <- read.csv(f)
        Warning messages:
        1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
          seek on a gzfile connection returned an internal error
        2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
          seek on a gzfile connection returned an internal error
        > str(g)
        'data.frame':	10 obs. of  6 variables:
         $ age      : int  29 25 26 28 21 23 24 27 30 22
         $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
         $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 1 1 2
         $ employed : logi  NA TRUE NA NA NA NA ...
         $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
         $ weight   : int  150 170 170 150 160 150 NA NA 190 150
        > print(g)
           age  color education employed height weight
        1   29    Red       PHD       NA    5.1    150
        2   25 Orange        BS     TRUE    5.1    170
        3   26   Blue        BS       NA    5.0    170
        4   28 Yellow        MS       NA    5.9    150
        5   21   Blue        MS       NA    5.1    160
        6   23   Blue        MS       NA    5.9    150
        7   24 Orange        HS    FALSE    5.0     NA
        8   27  Green        BS     TRUE    5.5     NA
        9   30    Red        BS    FALSE    5.9    190
        10  22 Yellow        HS       NA    5.9    150
        >
        > # Use connection to a file URL
        > h <- url('file://./sample-data.csv')
        > i <- read.csv(h)
        > str(i)
        'data.frame':	10 obs. of  6 variables:
         $ age      : int  29 25 26 28 21 23 24 27 30 22
         $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
         $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 1 1 2
         $ employed : logi  NA TRUE NA NA NA NA ...
         $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
         $ weight   : int  150 170 170 150 160 150 NA NA 190 150
        > print(i)
           age  color education employed height weight
        1   29    Red       PHD       NA    5.1    150
        2   25 Orange        BS     TRUE    5.1    170
        3   26   Blue        BS       NA    5.0    170
        4   28 Yellow        MS       NA    5.9    150
        5   21   Blue        MS       NA    5.1    160
        6   23   Blue        MS       NA    5.9    150
        7   24 Orange        HS    FALSE    5.0     NA
        8   27  Green        BS     TRUE    5.5     NA
        9   30    Red        BS    FALSE    5.9    190
        10  22 Yellow        HS       NA    5.9    150
        >
        > # Use connection to a website
        > j <- url('http://www.polarsparc.com/data/sample-data.csv')
        > k <- read.csv(j)
        > str(k)
        'data.frame':	10 obs. of  6 variables:
         $ age      : int  29 25 26 28 21 23 24 27 30 22
         $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
         $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 1 1 2
         $ employed : logi  NA TRUE NA NA NA NA ...
         $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
         $ weight   : int  150 170 170 150 160 150 NA NA 190 150
        > print(k)
           age  color education employed height weight
        1   29    Red       PHD       NA    5.1    150
        2   25 Orange        BS     TRUE    5.1    170
        3   26   Blue        BS       NA    5.0    170
        4   28 Yellow        MS       NA    5.9    150
        5   21   Blue        MS       NA    5.1    160
        6   23   Blue        MS       NA    5.9    150
        7   24 Orange        HS    FALSE    5.0     NA
        8   27  Green        BS     TRUE    5.5     NA
        9   30    Red        BS    FALSE    5.9    190
        10  22 Yellow        HS       NA    5.9    150
        >
        > rm(list=ls())

The read.table() function reads data from a specified file and returns the data as a data.frame. The following are some of the important arguments to the function:

file - used to specify either the name or the connection to the input file
header - set to TRUE if the first line of the input file is a header line
sep - used to specify the separator character
skip - used to specify the count of lines to skip from the beginning of the input file
blank.lines.skip - set to TRUE (the default) to skip empty (or blank) lines in the input file

The read.csv() function is a wrapper around the read.table() function with defaults suited to CSV files: header is set to TRUE, sep is set to comma (,), fill is set to TRUE, etc.
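The relationship between the two functions can be checked with a small sketch. It uses the text argument with a few hypothetical inline rows so that no file is needed. Note that since R 4.0.0 the default for stringsAsFactors is FALSE, so str() reports text columns as chr rather than Factor (as in the output above) unless stringsAsFactors=TRUE is passed.

```r
# Inline CSV text (hypothetical rows) so the example needs no file on disk
txt <- 'age,color,height
29,Red,5.1
25,Orange,5.1
26,Blue,5.0'

# read.table() needs the header and the separator spelled out
a <- read.table(text = txt, header = TRUE, sep = ',')

# read.csv() supplies header=TRUE and sep=',' by default
b <- read.csv(text = txt)
print(all.equal(a, b))

# skip drops leading lines before the header line is read
txt2 <- paste('exported by some tool', 'second note line', txt, sep = '\n')
c2 <- read.csv(text = txt2, skip = 2)
print(all.equal(b, c2))

# stringsAsFactors=TRUE turns text columns into factors
d <- read.csv(text = txt, stringsAsFactors = TRUE)
str(d)
```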

The file() function returns a connection to the specified input file. The open argument specifies the mode of operation on the input file. The value r specifies the read mode.

The gzfile() function returns a connection to the specified compressed file. The open argument specifies the mode of operation on the compressed file. The value r specifies the read mode.
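One detail worth sketching: a connection that we open ourselves stays open after read.csv() returns, so it should be closed explicitly. The 'closing unused connection' warning in the output above is most likely R cleaning up such a leftover connection. The temporary file and its two rows below are hypothetical.

```r
# Hypothetical temporary gzipped file so the example is self-contained
path <- tempfile(fileext = '.csv.gz')
w <- gzfile(path, open = 'w')
writeLines(c('age,weight', '29,150', '25,170'), w)
close(w)

# A connection opened explicitly is left open by read.csv()
f <- gzfile(path, open = 'r')
g <- read.csv(f)
print(isOpen(f))
close(f)  # release the connection ourselves

# An unopened connection is opened and closed by read.csv() itself
h <- read.csv(gzfile(path))
print(h)
```

Passing an unopened connection (or just the file name) is the simpler choice when the file is read in one go.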

The url() function returns a connection to the specified URL. The URL schemes supported are http, https, ftp and file.

Assume we have a list with 3 elements, where each element is a vector with 5 elements. We want to find the max value of each element of the list. We can use the for-loop statement to achieve the result, but there is a much easier way to achieve the same using the apply family of functions in R.

Let us dive right in to explore the lapply and the sapply functions in R.

Let us create the following R script named apply_functions-1.R in RStudio:

Execute the R script apply_functions-1.R in RStudio and the following is the output:

Output (apply_functions-1.R)

        > a <- list(x = round(rnorm(5, mean=5, sd=1), digits=1),
        +           y = round(runif(5, min=1, max=10), digits=0),
        +           z = sample(LETTERS, 5))
        > print(a)
        $x
        [1] 4.9 4.6 5.4 4.6 4.9

        $y
        [1] 9 3 3 7 7

        $z
        [1] "Z" "X" "F" "V" "T"

        >
        > b <- vector('list', length(a))
        > c <- c('', '', '')
        > names(b) <- names(a)
        > names(c) <- names(a)
        > for (i in seq_along(a)) {
        +   v <- a[[i]]
        +   m <- v[1]
        +   for (j in seq_along(v)) {
        +     if (v[j] > m) {
        +       m <- v[j]
        +     }
        +   }
        +   b[[i]] <- m
        +   c[i] <- m
        + }
        >
        > print(b)
        $x
        [1] 5.4

        $y
        [1] 9

        $z
        [1] "Z"

        >
        > lapply(a, max)
        $x
        [1] 5.4

        $y
        [1] 9

        $z
        [1] "Z"

        >
        > print(c)
            x     y     z
        "5.4"   "9"   "Z"
        >
        > sapply(a, max)
            x     y     z
        "5.4"   "9"   "Z"
        >
        > rm(list=ls())

As is evident from the R script above, finding the max of each element with a for-loop takes far more code than a single call to lapply() or sapply().

The names() function gets or sets the names of the elements in a collection.

The seq_along() function generates the indices for the elements in a collection.
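A short sketch of both functions, using a hypothetical vector. It also shows why seq_along() is the safer choice in a for-loop than 1:length(): the two differ when the collection is empty.

```r
# Hypothetical numeric vector
v <- c(4.9, 4.6, 5.4)

# Set and then get the element names
names(v) <- c('first', 'second', 'third')
print(names(v))

# Iterate over the indices 1, 2, 3
for (i in seq_along(v)) {
  cat(i, names(v)[i], v[i], '\n')
}

# On an empty vector seq_along() yields no indices at all ...
empty <- numeric(0)
print(seq_along(empty))

# ... while 1:length() counts down from 1 to 0, so a loop would run twice
print(1:length(empty))
```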

The lapply() function stands for 'list apply'. It iterates over a list of elements and applies the specified function to each element of the list. The value returned is a list.

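The sapply() function stands for 'simplified list apply'. It works like lapply(), but then tries to simplify the result: if every result has length 1, it returns a vector; if every result has the same length greater than 1, it returns a matrix with one column per element; otherwise, it returns a list. When the results mix numbers and strings, as in the script above, the vector is coerced to character, which is why 5.4 and 9 are printed in quotes.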
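The different return shapes can be seen in a small sketch. The list below reuses the x and y values from the output above; the p and q elements are hypothetical.

```r
# Numeric elements only, taken from the x and y values printed earlier
a <- list(x = c(4.9, 4.6, 5.4, 4.6, 4.9), y = c(9, 3, 3, 7, 7))

# lapply() always returns a list
res_list <- lapply(a, max)
print(res_list)

# Each result has length 1, so sapply() returns a named numeric vector
res_vec <- sapply(a, max)
print(res_vec)

# Each result has length 2 (min and max), so sapply() returns a matrix
res_mat <- sapply(a, range)
print(res_mat)

# Results of different lengths cannot be simplified, so a list comes back
res_mixed <- sapply(list(p = 1:3, q = 1:2), rev)
print(res_mixed)
```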

Assume we have the data.frame loaded from sample-data.csv. We desire to solve the following problems on the data.frame:

Find the average height and weight for all the rows
Find the average weight grouped by the color
Find the average height and weight grouped by the education
Find the average height grouped by the education

Let us create the following R script named apply_functions-2.R in RStudio:

Execute the R script apply_functions-2.R in RStudio and the following is the output:

Output (apply_functions-2.R)

        > a <- read.csv('sample-data.csv')
        >
        > b <- apply(a[c('height', 'weight')], 2, mean, na.rm=TRUE)
        > print(b)
        height weight
          5.44 161.25
        >
        > c <- split(a[,'weight'], a$color)
        > print(c)
        $Blue
        [1] 170 160 150

        $Green
        [1] NA

        $Orange
        [1] 170  NA

        $Red
        [1] 150 190

        $Yellow
        [1] 150 150

        >
        > lapply(c, mean, na.rm=TRUE)
        $Blue
        [1] 160

        $Green
        [1] NaN

        $Orange
        [1] 170

        $Red
        [1] 170

        $Yellow
        [1] 150

        >
        > sapply(c, mean, na.rm=TRUE)
          Blue  Green Orange    Red Yellow
           160    NaN    170    170    150
        >
        > d <- split(a[,c('height', 'weight')], a$education)
        > print(d)
        $BS
          height weight
        2    5.1    170
        3    5.0    170
        8    5.5     NA
        9    5.9    190

        $HS
           height weight
        7     5.0     NA
        10    5.9    150

        $MS
          height weight
        4    5.9    150
        5    5.1    160
        6    5.9    150

        $PHD
          height weight
        1    5.1    150

        >
        > lapply(d, colMeans, na.rm=TRUE)
        $BS
          height   weight
          5.3750 176.6667

        $HS
        height weight
          5.45 150.00

        $MS
            height     weight
          5.633333 153.333333

        $PHD
        height weight
           5.1  150.0

        >
        > sapply(d, colMeans, na.rm=TRUE)
                     BS     HS         MS   PHD
        height   5.3750   5.45   5.633333   5.1
        weight 176.6667 150.00 153.333333 150.0
        >
        > e <- tapply(a[,'height'], a[,'education'], mean)
        > print(e)
              BS       HS       MS      PHD
        5.375000 5.450000 5.633333 5.100000
        >
        > rm(list=ls())

The apply() function evaluates the specified function on either the rows or columns of the specified tabular collection (matrix or data.frame).

The first argument specifies the tabular data to operate on
The second argument indicates the dimension (row or column) over which to apply the function. A value of 1 indicates rows, while a value of 2 indicates columns
The third argument specifies the function to be applied
The arguments beyond the third are passed as arguments to the specified function. The argument na.rm=TRUE tells the function to ignore NA values in the collection
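The arguments described above can be tried on a small matrix. This is a minimal sketch with hypothetical values, including one NA to show the effect of na.rm.

```r
# Hypothetical 2 x 3 matrix, filled column by column, with one NA
m <- matrix(c(1, 2, NA, 4, 5, 6), nrow = 2,
            dimnames = list(c('r1', 'r2'), c('a', 'b', 'c')))
print(m)

# Second argument 1: apply mean() to each row, ignoring NA values
row_means <- apply(m, 1, mean, na.rm = TRUE)
print(row_means)

# Second argument 2: apply mean() to each column, ignoring NA values
col_means <- apply(m, 2, mean, na.rm = TRUE)
print(col_means)

# Without na.rm=TRUE, any column containing an NA yields NA
col_means_na <- apply(m, 2, mean)
print(col_means_na)
```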

The split() function segregates the specified collection into groups based on the specified set of factors.

The colMeans() function computes the average of each column of the data. It is a faster shorthand for apply(data, 2, mean).
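The equivalence can be checked with a brief sketch. The data.frame below holds the height and weight values of the BS group from the output above; the variable names are illustrative.

```r
# Height and weight of the BS group shown in the output above
bs <- data.frame(height = c(5.1, 5.0, 5.5, 5.9),
                 weight = c(170, 170, NA, 190))

# Column averages via colMeans()
via_colmeans <- colMeans(bs, na.rm = TRUE)
print(via_colmeans)

# The same averages via apply() over the columns with mean()
via_apply <- apply(bs, 2, mean, na.rm = TRUE)
print(all.equal(via_colmeans, via_apply))
```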

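The tapply() function splits its first argument into groups defined by its second argument (a factor) and applies the specified function to each group, much like the split() and sapply() functions working together.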
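A minimal sketch comparing the two approaches. The three columns are copied from the sample data printed above, and the variable names are illustrative. Extra arguments such as na.rm=TRUE are passed through to the function, just as with apply().

```r
# Education, height and weight columns copied from the sample data
a <- data.frame(
  education = factor(c('PHD', 'BS', 'BS', 'MS', 'MS', 'MS', 'HS', 'BS', 'BS', 'HS')),
  height = c(5.1, 5.1, 5.0, 5.9, 5.1, 5.9, 5.0, 5.5, 5.9, 5.9),
  weight = c(150, 170, 170, 150, 160, 150, NA, NA, 190, 150)
)

# Two steps: split the heights by education, then average each group
by_split <- sapply(split(a$height, a$education), mean)
print(by_split)

# One step: tapply() groups and averages in a single call
by_tapply <- tapply(a$height, a$education, mean)
print(by_tapply)

# Arguments after the function are passed on to it, here to skip NA weights
w <- tapply(a$weight, a$education, mean, na.rm = TRUE)
print(w)
```

The values agree; tapply() returns them as a one-dimensional array rather than a plain named vector.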

More to come in Part-5 ...