R Programming Quick Notes :: Part - 2
Bhaskar S | 03/17/2017 |
Overview
In Part - 1, we introduced the atomic data types, collection types, and some commonly used functions in R.
In this part, we will dive into the collection types and explore them further.
Hands-on With R - II
Vector
Let us start our journey with vector.
Let us assume a vector with at least n elements in it. The following are some of the operations one can perform on that vector:
[n] - return the element value at the n'th position
[-n] - return all the elements except the one at the n'th position
[m:n] - return all the elements starting at the m'th position and ending at the n'th position (where n > m)
[-(m:n)] - return all the elements except the ones starting at the m'th position and ending at the n'th position (where n > m)
[c(i, j, k, ...)] - return the elements in the specified positions i, j, k, ...
[-c(i, j, k, ...)] - return all the elements except those in the specified positions i, j, k, ...
Let us create the following R script named vector_ops.R in RStudio:
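A sketch of vector_ops.R, reconstructed from the commands echoed in the console output that follows. No random seed is set, so the sampled values differ from run to run.

```r
# vector_ops.R (reconstructed from the console transcript)

# Random samples: without and with replacement
a <- sample(1:25, 10)
print(a)

b <- sample(1:10, 10, replace = TRUE)
print(b)

# Positional indexing
print(a[5])
print(a[-5])
print(b[4:6])
print(b[-(4:6)])
print(a[c(3, 5, 7)])
print(a[-c(3, 5, 7)])

# Element-wise arithmetic
print(b/2)

# Logical indexing
c <- a > 15
print(a[c])

d <- (a > 15) & (a <= 20)
print(a[d])

e <- (b > 3) & (b <= 7)
print(b[e])

# Inspection and summary functions
str(a)
length(b)
which(a > 15)
table(b)
min(a)
max(b)
sum(a)
mean(b)
median(a)
summary(b)
```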
Execute the R script vector_ops.R in RStudio and the following is the output:
> a <- sample(1:25, 10)
> print(a)
 [1]  2 11  7 18 16 22 12  9 24 25
>
> b <- sample(1:10, 10, replace = TRUE)
> print(b)
 [1]  2  1  2  4  7 10  2  4  5  9
>
> print(a[5])
[1] 16
>
> print(a[-5])
[1]  2 11  7 18 22 12  9 24 25
>
> print(b[4:6])
[1]  4  7 10
>
> print(b[-(4:6)])
[1] 2 1 2 2 4 5 9
>
> print(a[c(3, 5, 7)])
[1]  7 16 12
>
> print(a[-c(3, 5, 7)])
[1]  2 11 18 22  9 24 25
>
> print(b/2)
 [1] 1.0 0.5 1.0 2.0 3.5 5.0 1.0 2.0 2.5 4.5
>
> c <- a > 15
> print(a[c])
[1] 18 16 22 24 25
>
> d <- (a > 15) & (a <= 20)
> print(a[d])
[1] 18 16
>
> e <- (b > 3) & (b <= 7)
> print(b[e])
[1] 4 7 4 5
>
> str(a)
 int [1:10] 2 11 7 18 16 22 12 9 24 25
>
> length(b)
[1] 10
>
> which(a > 15)
[1]  4  5  6  9 10
>
> table(b)
b
 1  2  4  5  7  9 10
 1  3  2  1  1  1  1
>
> min(a)
[1] 2
>
> max(b)
[1] 10
>
> sum(a)
[1] 146
>
> mean(b)
[1] 4.6
>
> median(a)
[1] 14
>
> summary(b)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0     2.0     4.0     4.6     6.5    10.0
The sample() function returns a vector of elements after taking a random sample from a specified vector (first argument), of the specified size (second argument) either with (replace = TRUE) or without replacement.
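Because sample() draws at random, a fresh run will not reproduce the values shown in these notes. A minimal sketch, assuming repeatable results are wanted, of fixing the random seed with set.seed() first. The seed value 42 is an arbitrary choice and is not part of the original script.

```r
# Same seed before each call, so both draws come out identical
set.seed(42)
x <- sample(1:25, 10)

set.seed(42)
y <- sample(1:25, 10)

print(identical(x, y))   # TRUE
```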
The logical expression a > 15 (or, for that matter, (a > 15) & (a <= 20)) returns a logical (TRUE or FALSE) vector that is the result of applying the logical operation to each element of the vector a. Indexing a with that logical vector keeps only the elements where it is TRUE.
The str() function displays the structure of any R object, such as a vector, list, matrix, or data.frame.
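To make the intermediate logical vector visible, here is a small sketch that hard-codes the values of a from the sample output above, so the result is reproducible. The variable name mask is only illustrative.

```r
# Values of a copied from the sample run shown earlier
a <- c(2, 11, 7, 18, 16, 22, 12, 9, 24, 25)

# & combines the two comparisons element by element
mask <- (a > 15) & (a <= 20)

print(mask)      # one TRUE/FALSE per element of a
print(a[mask])   # 18 16
```

Note that the single & is the element-wise operator; the double && form is meant for single TRUE/FALSE values and is not suitable for building an index vector.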
The length() function returns a count of the number of elements in the vector.
The which() function returns a vector of index positions where the specified logical expression is TRUE.
The table() function returns a frequency table that lists each unique element of the specified vector and how many times that element occurs.
The min() function finds the minimum value in the vector.
The max() function finds the maximum value in the vector.
The sum() function computes the sum of all the elements in the numeric vector.
The mean() function computes the arithmetic mean of all the elements in the numeric vector.
The median() function finds the median from all the elements in the numeric vector.
The summary() function outputs a summary of the distribution of the values in the numeric vector: the minimum, the first quartile, the median, the mean, the third quartile, and the maximum.
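The functions above can be checked against the sample output by hard-coding the two vectors from that run. This is only a sketch; the names idx and counts are illustrative.

```r
# Values of a and b copied from the sample run shown earlier
a <- c(2, 11, 7, 18, 16, 22, 12, 9, 24, 25)
b <- c(2, 1, 2, 4, 7, 10, 2, 4, 5, 9)

idx <- which(a > 15)     # positions (not values) where the test is TRUE
print(idx)               # 4 5 6 9 10

counts <- table(b)       # frequency of each distinct value in b
print(counts[["2"]])     # entries are looked up by name, a string: 3

print(sum(a))            # 146
print(mean(b))           # 4.6
print(median(a))         # 14
print(quantile(b))       # the quartiles that summary() reports
```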
List
Now, let us move on to explore list.
A list can be created in either of the following two ways:
list(value1, value2, value3, ...)
OR
list(name1 = value1, name2 = value2, name3 = value3, ...)
where name1, name2, name3, etc. are the names given to the elements.
Let us assume a list with at least n elements in it.
The following are some of the operations one can perform on a list:
[n] - return the element value at the n'th position as a list
[[n]] - return the element at the n'th position as is (not wrapped in a list), unlike the [n] operator
[m:n] - return all the elements starting at the m'th position and ending at the n'th position (where n > m) as a list
[c(i, j, k, ...)] - return the elements in the specified positions i, j, k, ... as a list
[[c(m, n)]] - return the element value at the n'th position of the collection located at the m'th position
[[m]][[n]] - is the same as [[c(m, n)]]
[['xyz']] - return the element value with the name xyz
$xyz - same as [['xyz']]
Let us create the following R script named list_ops.R in RStudio:
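A sketch of list_ops.R, reconstructed from the commands echoed in the console output that follows:

```r
# list_ops.R (reconstructed from the console transcript)

# An unnamed list holding elements of different types
a <- list(5, 'ABC', c(1, 2, 3), 7.5, FALSE)
print(a)

# [ ] returns a list, [[ ]] returns the element itself
print(a[3])
class(a[3])

print(a[[1]])
class(a[[1]])

print(a[2:3])
class(a[2:3])

print(a[c(1, 3)])

# Two equivalent ways to reach the 2nd value of the 3rd element
print(a[[c(3, 2)]])
print(a[[3]][[2]])

# A named list, accessed by name
b <- list(b1 = 10, b2 = 'PQR', b3 = 6:8, b4 = 8.75, b5 = TRUE)
print(b)

print(b[['b1']])
print(b$b3)
```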
Execute the R script list_ops.R in RStudio and the following is the output:
> a <- list(5, 'ABC', c(1, 2, 3), 7.5, FALSE)
> print(a)
[[1]]
[1] 5

[[2]]
[1] "ABC"

[[3]]
[1] 1 2 3

[[4]]
[1] 7.5

[[5]]
[1] FALSE

>
> print(a[3])
[[1]]
[1] 1 2 3

> class(a[3])
[1] "list"
>
> print(a[[1]])
[1] 5
> class(a[[1]])
[1] "numeric"
>
> print(a[2:3])
[[1]]
[1] "ABC"

[[2]]
[1] 1 2 3

> class(a[2:3])
[1] "list"
>
> print(a[c(1, 3)])
[[1]]
[1] 5

[[2]]
[1] 1 2 3

>
> print(a[[c(3, 2)]])
[1] 2
>
> print(a[[3]][[2]])
[1] 2
>
> b <- list(b1 = 10, b2 = 'PQR', b3 = 6:8, b4 = 8.75, b5 = TRUE)
> print(b)
$b1
[1] 10

$b2
[1] "PQR"

$b3
[1] 6 7 8

$b4
[1] 8.75

$b5
[1] TRUE

>
> print(b[['b1']])
[1] 10
>
> print(b$b3)
[1] 6 7 8
Matrix
Now, let us switch gears to explore matrix.
One of the ways by which a matrix can be created is using the format:
matrix(vector_of_data, no_of_rows, no_of_columns)
By default, the data elements in a matrix are filled column-wise. One can change that behavior to fill the data elements row-wise by specifying the argument byrow = TRUE.
One can assign names to the rows and columns of a matrix using the argument dimnames = list(row_names, column_names).
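A small sketch of the two fill orders and of dimnames, using a hypothetical 2 x 3 matrix (the variable names are illustrative):

```r
# Column-wise (default) versus row-wise fill of the same data
by_col <- matrix(1:6, 2, 3)
by_row <- matrix(1:6, 2, 3, byrow = TRUE)

print(by_col[2, 1])   # 2: the first column holds 1, 2
print(by_row[2, 1])   # 4: the first row holds 1, 2, 3

# Row-wise fill gives the transpose of a column-wise fill
# with the dimensions swapped
print(all(by_row == t(matrix(1:6, 3, 2))))   # TRUE

# Naming rows and columns at creation time
named <- matrix(1:6, 2, 3, byrow = TRUE,
                dimnames = list(c('r1', 'r2'), c('c1', 'c2', 'c3')))
print(named['r2', 'c1'])   # 4
```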
Let us assume a matrix with m rows and n columns.
The following are some of the operations one can perform on a matrix:
[x, y] - return the data element at the x'th row and y'th column (where x <= m and y <= n)
[x,] - return the data elements across all the columns in the x'th row (where x <= m) as a vector
[,x] - return the data elements across all the rows in the x'th column (where x <= n) as a vector
[x:y,] - return all the data elements across all the columns starting at the x'th row and ending at the y'th row (where x <= m and y <= m) as a matrix
[,x:y] - return all the data elements across all the rows starting at the x'th column and ending at the y'th column (where x <= n and y <= n) as a matrix
['r', 'c'] - return the data element at the row with the name 'r' and at the column with the name 'c'
%*% - perform a true matrix multiplication (not element-wise)
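Selecting a single row or column drops the result to a plain vector, as noted in the list above. When a one-row or one-column matrix is needed instead, the standard drop = FALSE argument of the [ operator keeps the dimensions. A minimal sketch:

```r
a <- matrix(1:20, 4, 5)

row_vec <- a[1, ]                 # plain vector, dimensions dropped
row_mat <- a[1, , drop = FALSE]   # still a matrix: 1 row, 5 columns

print(dim(row_vec))   # NULL
print(dim(row_mat))   # 1 5
```

This matters mostly before a %*% multiplication, where the shape of each operand determines whether the product is defined.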
Let us create the following R script named matrix_ops.R in RStudio:
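A sketch of matrix_ops.R, reconstructed from the commands echoed in the console output that follows. One caveat: on R 4.0 and later, class(a) reports "matrix" "array" rather than the single "matrix" shown in the recorded output.

```r
# matrix_ops.R (reconstructed from the console transcript)

# Column-wise (default) and row-wise fill
a <- matrix(1:20, 4, 5)
print(a)
class(a)
dim(a)

b <- matrix(1:20, 4, 5, byrow = TRUE)
print(b)

# Building matrices from vectors
m <- 1:5
n <- 6:10
o <- 11:15
p <- 16:20

c <- cbind(m, n, o, p)
print(c)

d <- rbind(m, n, o, p)
print(d)

# Named rows and columns
e <- matrix(1:20, 4, 5,
            dimnames = list(c('r1', 'r2', 'r3', 'r4'),
                            c('c1', 'c2', 'c3', 'c4', 'c5')))
print(e)
rownames(e)
colnames(e)

# Indexing by position and by name
print(a[1,])
print(b[,2])
print(c[2:3,])
print(d[,2:3])
print(e[2,4])
print(e['r2', 'c3'])

# Transpose, element-wise operations, and matrix multiplication
f <- t(e)
print(f)

a * b
a / b
a %*% f
```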
Execute the R script matrix_ops.R in RStudio and the following is the output:
> a <- matrix(1:20, 4, 5)
> print(a)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
>
> class(a)
[1] "matrix"
>
> dim(a)
[1] 4 5
>
> b <- matrix(1:20, 4, 5, byrow = TRUE)
> print(b)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15
[4,]   16   17   18   19   20
>
> m <- 1:5
> n <- 6:10
> o <- 11:15
> p <- 16:20
>
> c <- cbind(m, n, o, p)
> print(c)
     m  n  o  p
[1,] 1  6 11 16
[2,] 2  7 12 17
[3,] 3  8 13 18
[4,] 4  9 14 19
[5,] 5 10 15 20
>
> d <- rbind(m, n, o, p)
> print(d)
  [,1] [,2] [,3] [,4] [,5]
m    1    2    3    4    5
n    6    7    8    9   10
o   11   12   13   14   15
p   16   17   18   19   20
>
> e <- matrix(1:20, 4, 5,
+             dimnames = list(c('r1', 'r2', 'r3', 'r4'),
+                             c('c1', 'c2', 'c3', 'c4', 'c5')))
>
> print(e)
   c1 c2 c3 c4 c5
r1  1  5  9 13 17
r2  2  6 10 14 18
r3  3  7 11 15 19
r4  4  8 12 16 20
>
> rownames(e)
[1] "r1" "r2" "r3" "r4"
>
> colnames(e)
[1] "c1" "c2" "c3" "c4" "c5"
>
> print(a[1,])
[1]  1  5  9 13 17
>
> print(b[,2])
[1]  2  7 12 17
>
> print(c[2:3,])
     m n  o  p
[1,] 2 7 12 17
[2,] 3 8 13 18
>
> print(d[,2:3])
  [,1] [,2]
m    2    3
n    7    8
o   12   13
p   17   18
>
> print(e[2,4])
[1] 14
>
> print(e['r2', 'c3'])
[1] 10
>
> f <- t(e)
> print(f)
   r1 r2 r3 r4
c1  1  2  3  4
c2  5  6  7  8
c3  9 10 11 12
c4 13 14 15 16
c5 17 18 19 20
>
> a * b
     [,1] [,2] [,3] [,4] [,5]
[1,]    1   10   27   52   85
[2,]   12   42   80  126  180
[3,]   33   84  143  210  285
[4,]   64  136  216  304  400
>
> a / b
          [,1]      [,2]      [,3]      [,4]     [,5]
[1,] 1.0000000 2.5000000 3.0000000 3.2500000 3.400000
[2,] 0.3333333 0.8571429 1.2500000 1.5555556 1.800000
[3,] 0.2727273 0.5833333 0.8461538 1.0714286 1.266667
[4,] 0.2500000 0.4705882 0.6666667 0.8421053 1.000000
>
> a %*% f
      r1  r2  r3  r4
[1,] 565 610 655 700
[2,] 610 660 710 760
[3,] 655 710 765 820
[4,] 700 760 820 880
The dim() function returns the dimensions (rows, columns) for the specified matrix.
The rownames() function returns the row names for the specified matrix.
The colnames() function returns the column names for the specified matrix.
The cbind() function combines the specified list of vectors as columns of a matrix.
The rbind() function combines the specified list of vectors as rows of a matrix.
The t() function performs a transpose operation on the specified matrix.
Data Frame
Finally, we will explore data.frame, which is a tabular, spreadsheet-like data object.
Let us assume a data.frame with n rows and columns with names c1, c2, c3, ..., cm.
The following are some of the operations one can perform on a data.frame:
['c1'] - return a data.frame with just the column with name c1
[c('c2', 'c3')] - return a data.frame with the columns named c2 and c3
[['c4']] - return the data elements across all the rows for the column named c4 as a vector
$c5 - return the data elements across all the rows for the column named c5 as a vector
[x,] - return the data elements across all the columns in the x'th row (where x <= n) as a data.frame
[x, c('c6', 'c7')] - return the data elements for the columns named c6 and c7 in the x'th row (where x <= n) as a data.frame
[x:y,] - return all the data elements across all the columns starting at the x'th row and ending at the y'th row (where x <= n and y <= n) as a data.frame
[x:y, c('c8', 'c9')] - return the data elements for the columns named c8 and c9 starting at the x'th row and ending at the y'th row (where x <= n and y <= n) as a data.frame
[c(x, y, z),] - return all the data elements across all the columns for rows at positions x, y, and z (where x <= n, y <= n, and z <= n) as a data.frame
[c(x, y, z), c('c1', 'c2')] - return the data elements for the columns named c1 and c2 for rows at positions x, y, and z (where x <= n, y <= n, and z <= n) as a data.frame
[x, 'c3'] - return the data element at the x'th row for the column with the name 'c3'
[lexpr,] - return all the data elements across all the columns for the rows that satisfy the logical expression lexpr, where the logical expression will involve one or more of the columns
Let us create the following R script named dataframe_ops.R in RStudio:
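A sketch of dataframe_ops.R, reconstructed from the commands echoed in the console output that follows. No seed is set, so the generated data differ from run to run. Also, on R 4.0 and later, data.frame() keeps strings as character columns by default, so str(df) would show chr where the recorded output shows Factor.

```r
# dataframe_ops.R (reconstructed from the console transcript)

# Randomly generated columns, some of which may contain NA
a <- sample(21:30, 10)
b <- sample(c('Blue', 'Green', 'Orange', 'Red', 'Yellow'), 10, replace = TRUE)
c <- sample(c(NA, 'HS', 'BS', 'MS', 'PHD'), 10, replace = TRUE)
d <- sample(c(NA, TRUE, FALSE), 10, replace = TRUE)
e <- sample(round(runif(5, min = 5, max = 6.5), digits = 1), 10, replace = TRUE)
f <- sample(c(NA, 150, 160, 170, 180, 190, 200), 10, replace = TRUE)

df <- data.frame(age = a,
                 color = b,
                 education = c,
                 employed = d,
                 height = e,
                 weight = f)
print(df)

# Inspecting the data frame
class(df)
nrow(df)
ncol(df)
dim(df)
length(df)
str(df)
head(df)
tail(df)
names(df)

# Selecting columns
df['age']
df[c('color', 'education')]
df[['employed']]
df$age

# Selecting rows, optionally with columns
df[5,]
df[5, c('height', 'weight')]
df[2:5,]
df[2:5, c('age', 'education')]
df[c(1, 3, 5),]
df[c(1, 3, 5), c('employed', 'height')]
df[5, 'weight']

# Selecting rows with a logical expression
g <- df$weight > 160
df[g,]

h <- (df$age > 22) & (df$education == 'MS')
df[h,]
```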
Execute the R script dataframe_ops.R in RStudio and the following is the output:
> a <- sample(21:30, 10)
> b <- sample(c('Blue', 'Green', 'Orange', 'Red', 'Yellow'), 10, replace = TRUE)
> c <- sample(c(NA, 'HS', 'BS', 'MS', 'PHD'), 10, replace = TRUE)
> d <- sample(c(NA, TRUE, FALSE), 10, replace = TRUE)
> e <- sample(round(runif(5, min = 5, max = 6.5), digits = 1), 10, replace = TRUE)
> f <- sample(c(NA, 150, 160, 170, 180, 190, 200), 10, replace = TRUE)
>
> df <- data.frame(age = a,
+                  color = b,
+                  education = c,
+                  employed = d,
+                  height = e,
+                  weight = f)
> print(df)
   age  color education employed height weight
1   29    Red       PHD       NA    5.1    150
2   25 Orange        BS     TRUE    5.1    170
3   26   Blue        BS       NA    5.0    170
4   28 Yellow        MS       NA    5.9    150
5   21   Blue        MS       NA    5.1    160
6   23   Blue        MS       NA    5.9    150
7   24 Orange        HS    FALSE    5.0     NA
8   27  Green      <NA>     TRUE    5.5     NA
9   30    Red        BS    FALSE    5.9    190
10  22 Yellow        HS       NA    5.9    150
>
> class(df)
[1] "data.frame"
>
> nrow(df)
[1] 10
>
> ncol(df)
[1] 6
>
> dim(df)
[1] 10  6
>
> length(df)
[1] 6
>
> str(df)
'data.frame':   10 obs. of  6 variables:
 $ age      : int  29 25 26 28 21 23 24 27 30 22
 $ color    : Factor w/ 5 levels "Blue","Green",..: 4 3 1 5 1 1 3 2 4 5
 $ education: Factor w/ 4 levels "BS","HS","MS",..: 4 1 1 3 3 3 2 NA 1 2
 $ employed : logi  NA TRUE NA NA NA NA ...
 $ height   : num  5.1 5.1 5 5.9 5.1 5.9 5 5.5 5.9 5.9
 $ weight   : num  150 170 170 150 160 150 NA NA 190 150
>
> head(df)
  age  color education employed height weight
1  29    Red       PHD       NA    5.1    150
2  25 Orange        BS     TRUE    5.1    170
3  26   Blue        BS       NA    5.0    170
4  28 Yellow        MS       NA    5.9    150
5  21   Blue        MS       NA    5.1    160
6  23   Blue        MS       NA    5.9    150
>
> tail(df)
   age  color education employed height weight
5   21   Blue        MS       NA    5.1    160
6   23   Blue        MS       NA    5.9    150
7   24 Orange        HS    FALSE    5.0     NA
8   27  Green      <NA>     TRUE    5.5     NA
9   30    Red        BS    FALSE    5.9    190
10  22 Yellow        HS       NA    5.9    150
>
> names(df)
[1] "age"       "color"     "education" "employed"  "height"    "weight"
>
> df['age']
   age
1   29
2   25
3   26
4   28
5   21
6   23
7   24
8   27
9   30
10  22
>
> df[c('color', 'education')]
    color education
1     Red       PHD
2  Orange        BS
3    Blue        BS
4  Yellow        MS
5    Blue        MS
6    Blue        MS
7  Orange        HS
8   Green      <NA>
9     Red        BS
10 Yellow        HS
>
> df[['employed']]
 [1]    NA  TRUE    NA    NA    NA    NA FALSE  TRUE FALSE    NA
>
> df$age
 [1] 29 25 26 28 21 23 24 27 30 22
>
> df[5,]
  age color education employed height weight
5  21  Blue        MS       NA    5.1    160
>
> df[5, c('height', 'weight')]
  height weight
5    5.1    160
>
> df[2:5,]
  age  color education employed height weight
2  25 Orange        BS     TRUE    5.1    170
3  26   Blue        BS       NA    5.0    170
4  28 Yellow        MS       NA    5.9    150
5  21   Blue        MS       NA    5.1    160
>
> df[2:5, c('age', 'education')]
  age education
2  25        BS
3  26        BS
4  28        MS
5  21        MS
>
> df[c(1, 3, 5),]
  age color education employed height weight
1  29   Red       PHD       NA    5.1    150
3  26  Blue        BS       NA    5.0    170
5  21  Blue        MS       NA    5.1    160
>
> df[c(1, 3, 5), c('employed', 'height')]
  employed height
1       NA    5.1
3       NA    5.0
5       NA    5.1
>
> df[5, 'weight']
[1] 160
>
> g <- df$weight > 160
>
> df[g,]
     age  color education employed height weight
2     25 Orange        BS     TRUE    5.1    170
3     26   Blue        BS       NA    5.0    170
NA    NA   <NA>      <NA>       NA     NA     NA
NA.1  NA   <NA>      <NA>       NA     NA     NA
9     30    Red        BS    FALSE    5.9    190
>
> h <- (df$age > 22) & (df$education == 'MS')
>
> df[h,]
   age  color education employed height weight
4   28 Yellow        MS       NA    5.9    150
6   23   Blue        MS       NA    5.9    150
NA  NA   <NA>      <NA>       NA     NA     NA
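The all-NA rows in the last two results above appear because a comparison against NA yields NA, and indexing rows with NA produces a row of NA values. A sketch of two common ways to leave such rows out, using a small hypothetical data frame (the name people and its values are illustrative, not from the original script):

```r
# Hypothetical data with one missing weight
people <- data.frame(age = c(25, 26, 24, 30),
                     weight = c(170, 170, NA, 190))

g <- people$weight > 160
print(g)                                   # TRUE TRUE NA TRUE

with_na <- people[g, ]                     # keeps an all-NA row for the NA entry
no_na_1 <- people[which(g), ]              # which() returns only the TRUE positions
no_na_2 <- subset(people, weight > 160)    # subset() treats NA as FALSE

print(nrow(with_na))   # 4
print(nrow(no_na_1))   # 3
print(nrow(no_na_2))   # 3
```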
The runif() function returns the specified number (the first argument) of random values drawn from a uniform distribution over the interval from min to max.
The nrow() function returns the number of rows for the specified data.frame.
The ncol() function returns the number of columns for the specified data.frame.
The head() function returns the first 6 rows (by default) of the specified data.frame.
The tail() function returns the last 6 rows (by default) of the specified data.frame.
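Both functions also accept an n argument that controls how many rows are returned. A small sketch with a hypothetical data frame (the name squares is illustrative):

```r
# Hypothetical 10-row data frame
squares <- data.frame(id = 1:10, sq = (1:10)^2)

first_three <- head(squares, n = 3)   # rows 1 to 3
last_two <- tail(squares, n = 2)      # rows 9 and 10

print(first_three)
print(last_two)
```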
More to come in Part-3 ...