R Programming Quick Notes :: Part - 4
Bhaskar S | 04/16/2017
Overview
In Part - 3, we explored the factor type, the date and time types, control structures, and functions in R.
In this part, we will explore how to read data from different sources and how to use the apply functions.
Hands-on With R - IV
Reading Data
Data can be in different forms such as textual or binary. In addition, data can come from different sources such as files or the network.
Most often, data comes in a textual form stored in the .csv format.
Let us create two sample CSV data files named sample-data.csv and sample-data-2.csv respectively.
The following is the sample-data.csv file:
And, the following is the sample-data-2.csv file:
Also, create a gzipped version of sample-data.csv called sample-data.csv.gz. For convenience, we have uploaded both the CSV file and the gzipped file to PolarSPARC.
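The three files can also be created from R itself. The following is a minimal sketch that rebuilds them from the data frame printed in the output further below. The exact text of the original files is an assumption: the empty fields standing in for missing values, the three preamble lines, and the positions of the blank lines in sample-data-2.csv are hypothetical, chosen only to be consistent with the skip=3 and blank.lines.skip=TRUE arguments used later.

```r
# Rows reconstructed from the printed data frame; empty fields are read back as NA
csv_lines <- c(
  "age,color,education,employed,height,weight",
  "29,Red,PHD,,5.1,150",
  "25,Orange,BS,TRUE,5.1,170",
  "26,Blue,BS,,5.0,170",
  "28,Yellow,MS,,5.9,150",
  "21,Blue,MS,,5.1,160",
  "23,Blue,MS,,5.9,150",
  "24,Orange,HS,FALSE,5.0,",
  "27,Green,BS,TRUE,5.5,",
  "30,Red,BS,FALSE,5.9,190",
  "22,Yellow,HS,,5.9,150"
)
writeLines(csv_lines, "sample-data.csv")

# Hypothetical layout: three preamble lines followed by the same rows,
# with a couple of blank lines mixed in
preamble <- c("Sample data set", "Second variant", "Data starts below")
writeLines(c(preamble, csv_lines[1], "", csv_lines[2:6], "", csv_lines[7:11]),
           "sample-data-2.csv")

# Gzipped copy of the first file, written through a gzfile connection
gz <- gzfile("sample-data.csv.gz", open = "w")
writeLines(csv_lines, gz)
close(gz)
```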
Let us create the following R script named read_csv_data.R in RStudio:
Execute the R script read_csv_data.R in RStudio and the following is the output:
> a <- read.table('sample-data.csv', header=TRUE, sep=',')
> str(a)
'data.frame': 10 obs. of 6 variables:
 $ age      : int  29 25 26 28 21 23 24 27 30 22
 $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
 $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 1 1 2
 $ employed : logi  NA TRUE NA NA NA NA ...
 $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
 $ weight   : int  150 170 170 150 160 150 NA NA 190 150
> print(a)
   age  color education employed height weight
1   29    Red       PHD       NA    5.1    150
2   25 Orange        BS     TRUE    5.1    170
3   26   Blue        BS       NA    5.0    170
4   28 Yellow        MS       NA    5.9    150
5   21   Blue        MS       NA    5.1    160
6   23   Blue        MS       NA    5.9    150
7   24 Orange        HS    FALSE    5.0     NA
8   27  Green        BS     TRUE    5.5     NA
9   30    Red        BS    FALSE    5.9    190
10  22 Yellow        HS       NA    5.9    150
>
> b <- read.table('sample-data-2.csv', header=TRUE, sep=',', skip=3, blank.lines.skip=TRUE)
> str(b)
'data.frame': 10 obs. of 6 variables:
 $ age      : int  29 25 26 28 21 23 24 27 30 22
 $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
 $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 1 1 2
 $ employed : logi  NA TRUE NA NA NA NA ...
 $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
 $ weight   : int  150 170 170 150 160 150 NA NA 190 150
> print(b)
   age  color education employed height weight
1   29    Red       PHD       NA    5.1    150
2   25 Orange        BS     TRUE    5.1    170
3   26   Blue        BS       NA    5.0    170
4   28 Yellow        MS       NA    5.9    150
5   21   Blue        MS       NA    5.1    160
6   23   Blue        MS       NA    5.9    150
7   24 Orange        HS    FALSE    5.0     NA
8   27  Green        BS     TRUE    5.5     NA
9   30    Red        BS    FALSE    5.9    190
10  22 Yellow        HS       NA    5.9    150
>
> c <- read.csv('sample-data.csv')
> str(c)
'data.frame': 10 obs. of 6 variables:
 $ age      : int  29 25 26 28 21 23 24 27 30 22
 $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
 $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 1 1 2
 $ employed : logi  NA TRUE NA NA NA NA ...
 $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
 $ weight   : int  150 170 170 150 160 150 NA NA 190 150
Warning message:
closing unused connection 3 (sample-data.csv.gz)
> print(c)
   age  color education employed height weight
1   29    Red       PHD       NA    5.1    150
2   25 Orange        BS     TRUE    5.1    170
3   26   Blue        BS       NA    5.0    170
4   28 Yellow        MS       NA    5.9    150
5   21   Blue        MS       NA    5.1    160
6   23   Blue        MS       NA    5.9    150
7   24 Orange        HS    FALSE    5.0     NA
8   27  Green        BS     TRUE    5.5     NA
9   30    Red        BS    FALSE    5.9    190
10  22 Yellow        HS       NA    5.9    150
>
> # Use connection to file
> d <- file('sample-data.csv', open='r')
> e <- read.csv(d)
> str(e)
'data.frame': 10 obs. of 6 variables:
 $ age      : int  29 25 26 28 21 23 24 27 30 22
 $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
 $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 1 1 2
 $ employed : logi  NA TRUE NA NA NA NA ...
 $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
 $ weight   : int  150 170 170 150 160 150 NA NA 190 150
> print(e)
   age  color education employed height weight
1   29    Red       PHD       NA    5.1    150
2   25 Orange        BS     TRUE    5.1    170
3   26   Blue        BS       NA    5.0    170
4   28 Yellow        MS       NA    5.9    150
5   21   Blue        MS       NA    5.1    160
6   23   Blue        MS       NA    5.9    150
7   24 Orange        HS    FALSE    5.0     NA
8   27  Green        BS     TRUE    5.5     NA
9   30    Red        BS    FALSE    5.9    190
10  22 Yellow        HS       NA    5.9    150
>
> # Use connection to compressed file
> f <- gzfile('sample-data.csv.gz', open='r')
> g <- read.csv(f)
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
  seek on a gzfile connection returned an internal error
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
  seek on a gzfile connection returned an internal error
> str(g)
'data.frame': 10 obs. of 6 variables:
 $ age      : int  29 25 26 28 21 23 24 27 30 22
 $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
 $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 1 1 2
 $ employed : logi  NA TRUE NA NA NA NA ...
 $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
 $ weight   : int  150 170 170 150 160 150 NA NA 190 150
> print(g)
   age  color education employed height weight
1   29    Red       PHD       NA    5.1    150
2   25 Orange        BS     TRUE    5.1    170
3   26   Blue        BS       NA    5.0    170
4   28 Yellow        MS       NA    5.9    150
5   21   Blue        MS       NA    5.1    160
6   23   Blue        MS       NA    5.9    150
7   24 Orange        HS    FALSE    5.0     NA
8   27  Green        BS     TRUE    5.5     NA
9   30    Red        BS    FALSE    5.9    190
10  22 Yellow        HS       NA    5.9    150
>
> # Use connection to file
> h <- url('file://./sample-data.csv')
> i <- read.csv(h)
> str(i)
'data.frame': 10 obs. of 6 variables:
 $ age      : int  29 25 26 28 21 23 24 27 30 22
 $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
 $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 1 1 2
 $ employed : logi  NA TRUE NA NA NA NA ...
 $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
 $ weight   : int  150 170 170 150 160 150 NA NA 190 150
> print(i)
   age  color education employed height weight
1   29    Red       PHD       NA    5.1    150
2   25 Orange        BS     TRUE    5.1    170
3   26   Blue        BS       NA    5.0    170
4   28 Yellow        MS       NA    5.9    150
5   21   Blue        MS       NA    5.1    160
6   23   Blue        MS       NA    5.9    150
7   24 Orange        HS    FALSE    5.0     NA
8   27  Green        BS     TRUE    5.5     NA
9   30    Red        BS    FALSE    5.9    190
10  22 Yellow        HS       NA    5.9    150
>
> # Use connection to a website
> j <- url('http://www.polarsparc.com/data/sample-data.csv')
> k <- read.csv(j)
> str(k)
'data.frame': 10 obs. of 6 variables:
 $ age      : int  29 25 26 28 21 23 24 27 30 22
 $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
 $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 1 1 2
 $ employed : logi  NA TRUE NA NA NA NA ...
 $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
 $ weight   : int  150 170 170 150 160 150 NA NA 190 150
> print(k)
   age  color education employed height weight
1   29    Red       PHD       NA    5.1    150
2   25 Orange        BS     TRUE    5.1    170
3   26   Blue        BS       NA    5.0    170
4   28 Yellow        MS       NA    5.9    150
5   21   Blue        MS       NA    5.1    160
6   23   Blue        MS       NA    5.9    150
7   24 Orange        HS    FALSE    5.0     NA
8   27  Green        BS     TRUE    5.5     NA
9   30    Red        BS    FALSE    5.9    190
10  22 Yellow        HS       NA    5.9    150
>
> rm(list=ls())
The read.table() function reads data from a specified file and returns the data as a data.frame. The following are some of the important arguments to the function:
file - used to specify either the name or the connection to the input file
header - set to TRUE if the first line of the input file is a header line
sep - used to specify the separator character
skip - used to specify the count of lines to skip from the beginning of the input file
blank.lines.skip - set to TRUE if the input file has empty (or blank) lines that need to be skipped
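To see these arguments working together on a small scale, the following is a minimal sketch that reads from an in-memory string through the text argument of read.table() instead of a file. The semicolon-separated data and its two preamble lines are made up for illustration.

```r
# Hypothetical input: two preamble lines, a header line, and rows
# separated by blank lines
raw <- paste(
  c("report: hypothetical export",
    "generated: n/a",
    "name;score",
    "",
    "ann;9.5",
    "",
    "bob;7.0"),
  collapse = "\n"
)

# skip=2 drops the preamble, header=TRUE takes the column names from the
# next line, and blank.lines.skip=TRUE ignores the empty lines
df <- read.table(text = raw, header = TRUE, sep = ";",
                 skip = 2, blank.lines.skip = TRUE)
print(df)
```

On R 4.0.0 or later, the name column comes back as character rather than Factor, since stringsAsFactors defaults to FALSE there; the Factor columns in the output above reflect the older default.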
The read.csv() function is similar to the read.table() function, except that several arguments have defaults suited to CSV files: header defaults to TRUE and sep defaults to a comma (,), among others.
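As a quick check of that relationship, the following sketch reads the same made-up CSV text both ways and compares the results. The csv_text value is illustrative only.

```r
# Small illustrative CSV held in a string
csv_text <- "age,color\n29,Red\n25,Orange\n"

# read.table() needs header and sep spelled out ...
via_table <- read.table(text = csv_text, header = TRUE, sep = ",")

# ... while read.csv() already defaults to header=TRUE and sep=","
via_csv <- read.csv(text = csv_text)

# Both calls produce the same data.frame for this input
print(isTRUE(all.equal(via_table, via_csv)))
```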
The file() function returns a connection to the specified input file. The open argument specifies the mode of operation on the input file. The value r specifies the read mode.
The gzfile() function returns a connection to the specified compressed file. The open argument specifies the mode of operation on the compressed file. The value r specifies the read mode.
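A connection that we open ourselves stays open after read.csv() returns, so it is good practice to close it. The following is a minimal sketch using temporary files (the file contents are made up). Closing each connection explicitly avoids leaving it for the garbage collector, which is what the 'closing unused connection' warning in the output above reports.

```r
# Write a small illustrative CSV to a temporary file
path <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,10", "2,20"), path)

# Write a gzipped copy through a gzfile connection opened for writing
gz_path <- paste0(path, ".gz")
out <- gzfile(gz_path, open = "w")
writeLines(readLines(path), out)
close(out)

# Read the plain file through an explicitly opened connection
con <- file(path, open = "r")
plain <- read.csv(con)
close(con)

# Read the compressed file the same way
gz_con <- gzfile(gz_path, open = "r")
packed <- read.csv(gz_con)
close(gz_con)

# Both reads yield the same data.frame
print(identical(plain, packed))
```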
The url() function returns a connection to the specified URL. The URL schemes supported are http, https, ftp, and file.
Apply Functions
Assume we have a list with 3 elements, where each element is a vector with 5 elements. We want to find the max value for each element of the list. We can use a for-loop to achieve the result, but there is a much easier way to achieve the same using the apply family of functions in R.
Let us dive right in to explore the lapply and the sapply functions in R.
Let us create the following R script named apply_functions-1.R in RStudio:
Execute the R script apply_functions-1.R in RStudio and the following is the output:
> a <- list(x = round(rnorm(5, mean=5, sd=1), digits=1),
+           y = round(runif(5, min=1, max=10), digits=0),
+           z = sample(LETTERS, 5))
> print(a)
$x
[1] 4.9 4.6 5.4 4.6 4.9

$y
[1] 9 3 3 7 7

$z
[1] "Z" "X" "F" "V" "T"

>
> b <- vector('list', length(a))
> c <- c('', '', '')
> names(b) <- names(a)
> names(c) <- names(a)
> for (i in seq_along(a)) {
+   v <- a[[i]]
+   m <- v[1]
+   for (j in seq_along(v)) {
+     if (v[j] > m) {
+       m <- v[j]
+     }
+   }
+   b[[i]] <- m
+   c[i] <- m
+ }
>
> print(b)
$x
[1] 5.4

$y
[1] 9

$z
[1] "Z"

>
> lapply(a, max)
$x
[1] 5.4

$y
[1] 9

$z
[1] "Z"

>
> print(c)
    x     y     z
"5.4"   "9"   "Z"
>
> sapply(a, max)
    x     y     z
"5.4"   "9"   "Z"
>
> rm(list=ls())
As is evident from the R script above, using nested for-loops to find the max of each element takes far more code than a single call to lapply() or sapply().
The names() function gets or sets the names of the elements in a collection.
The seq_along() function generates the indices for the elements in a collection.
The lapply() function stands for 'list apply'. It iterates over a list of elements and applies the specified function to each element of the list. The value returned is a list.
The sapply() function stands for 'simplify list apply'. It iterates over a list of elements and applies the specified function to each element of the list, then tries to simplify the result. The value returned is a vector if each result is of length 1, a matrix if all results have the same length greater than 1, and a list otherwise.
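The difference in return types is easiest to see on a fixed list. The following is a small sketch with hand-picked values (not the random ones from the script above). It also shows why the numbers come back quoted from sapply(): simplifying mixed numeric and character results into one vector coerces everything to character.

```r
# Fixed, illustrative list with two numeric vectors and one character vector
scores <- list(x = c(4.9, 4.6, 5.4), y = c(9, 3, 7), z = c("Z", "X", "F"))

# lapply() always returns a list, so each result keeps its own type
as_list <- lapply(scores, max)
str(as_list)

# sapply() simplifies length-1 results into a single (character) vector
as_vector <- sapply(scores, max)
print(as_vector)

# When every result has the same length greater than 1, sapply() returns
# a matrix with one column per list element
ranges <- sapply(scores[c("x", "y")], range)
print(ranges)
```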
Assume we have the data.frame loaded from sample-data.csv. We desire to solve the following problems on the data.frame:
Find the average height and weight for all the rows
Find the average weight grouped by the color
Find the average height and weight grouped by the education
Find the average height grouped by the education
Let us create the following R script named apply_functions-2.R in RStudio:
Execute the R script apply_functions-2.R in RStudio and the following is the output:
> a <- read.csv('sample-data.csv')
>
> b <- apply(a[c('height', 'weight')], 2, mean, na.rm=TRUE)
> print(b)
height weight
  5.44 161.25
>
> c <- split(a[,'weight'], a$color)
> print(c)
$Blue
[1] 170 160 150

$Green
[1] NA

$Orange
[1] 170  NA

$Red
[1] 150 190

$Yellow
[1] 150 150

>
> lapply(c, mean, na.rm=TRUE)
$Blue
[1] 160

$Green
[1] NaN

$Orange
[1] 170

$Red
[1] 170

$Yellow
[1] 150

>
> sapply(c, mean, na.rm=TRUE)
  Blue  Green Orange    Red Yellow
   160    NaN    170    170    150
>
> d <- split(a[,c('height', 'weight')], a$education)
> print(d)
$BS
  height weight
2    5.1    170
3    5.0    170
8    5.5     NA
9    5.9    190

$HS
   height weight
7     5.0     NA
10    5.9    150

$MS
  height weight
4    5.9    150
5    5.1    160
6    5.9    150

$PHD
  height weight
1    5.1    150

>
> lapply(d, colMeans, na.rm=TRUE)
$BS
  height   weight
  5.3750 176.6667

$HS
height weight
  5.45 150.00

$MS
    height     weight
  5.633333 153.333333

$PHD
height weight
   5.1  150.0

>
> sapply(d, colMeans, na.rm=TRUE)
             BS     HS         MS   PHD
height   5.3750   5.45   5.633333   5.1
weight 176.6667 150.00 153.333333 150.0
>
> e <- tapply(a[,'height'], a[,'education'], mean)
> print(e)
      BS       HS       MS      PHD
5.375000 5.450000 5.633333 5.100000
>
> rm(list=ls())
The apply() function evaluates the specified function on either the rows or columns of the specified tabular collection (matrix or data.frame).
The first argument specifies the tabular data to operate on
The second argument indicates the dimension (row or column) over which to apply the function. A value of 1 indicates rows, while a value of 2 indicates columns
The third argument specifies the function to be applied
The arguments beyond the third are passed on as arguments to the specified function. The argument na.rm=TRUE tells the function to ignore NA values in the collection
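The four points above can be sketched on a tiny matrix. The values and the row and column names below are made up for illustration.

```r
# 2 x 3 matrix, filled column by column, with one missing value
m <- matrix(c(1, 2, NA, 4, 5, 6), nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

# Second argument 1: apply mean() to each row; na.rm=TRUE is passed to mean()
row_means <- apply(m, 1, mean, na.rm = TRUE)
print(row_means)

# Second argument 2: apply mean() to each column
col_means <- apply(m, 2, mean, na.rm = TRUE)
print(col_means)

# Without na.rm=TRUE, the column that contains the NA yields NA
col_means_with_na <- apply(m, 2, mean)
print(col_means_with_na)
```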
The split() function segregates the specified collection into groups based on the specified set of factors.
The colMeans() function is a shorthand for apply(data, 2, mean), which computes the average of each column of data.
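A small sketch of that equivalence follows, using the height and weight values of the BS group from the output above. The variable names are illustrative.

```r
# Height and weight rows of the BS group
bs <- data.frame(height = c(5.1, 5.0, 5.5, 5.9),
                 weight = c(170, 170, NA, 190))

# Column means through the general-purpose apply() ...
via_apply <- apply(bs, 2, mean, na.rm = TRUE)

# ... and through the dedicated colMeans()
via_colmeans <- colMeans(bs, na.rm = TRUE)

print(via_colmeans)
print(isTRUE(all.equal(via_apply, via_colmeans)))
```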
The tapply() function applies a function to the groups of a vector defined by a factor. It is similar to the split() and sapply() functions working together.
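To make the comparison concrete, the following sketch computes the same grouped averages both ways on a handful of made-up values. The two results hold the same numbers under the same names; tapply() returns them as a one-dimensional array, while sapply() returns a plain named vector.

```r
# Illustrative heights and the group each one belongs to
height <- c(5.1, 5.1, 5.0, 5.9, 5.0, 5.9)
education <- c("PHD", "BS", "BS", "MS", "HS", "HS")

# One step: group by education and average each group
via_tapply <- tapply(height, education, mean)
print(via_tapply)

# Two steps: split into groups first, then average each group
groups <- split(height, education)
via_split <- sapply(groups, mean)
print(via_split)
```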
More to come in Part-5 ...