R Programming Quick Notes :: Part - 1
Bhaskar S | 03/11/2017
Overview
R is an open-source programming language and a favorite among statisticians and data scientists. It has the following characteristics:
Cross Platform
Interpreted
Dynamically Typed
Modular
Functional
Installation and Setup
We will assume an Ubuntu 16.04 based platform with the user id alice.
Download and install the following software:
Base R Package r-base from the Ubuntu repository
Integrated R development environment RStudio
Edit the file /etc/R/Rprofile.site to include the following lines:
.libPaths("/home/alice/Applications/R")
setwd("/home/alice/Projects/R")
Hands-on With R - I
The following are the basic atomic data types supported by R:
logical - TRUE or FALSE
numeric - signed decimal numbers (Ex: 5.25, -5.25)
integer - signed integers (Ex: 5L, -5L)
character - string of characters (Ex: 'Hello')
complex - a complex number with a real part and an imaginary part (Ex: 5 + 7i)
Let us create an R script named basic.R in RStudio containing the statements that appear after the > prompts in the output below.
To execute the R script basic.R, select all the lines from the script and press the CTRL+<Enter> keys. The following is the output:
> a <- TRUE
> print(a)
[1] TRUE
> class(a)
[1] "logical"
>
> b <- 2.5
> print(b)
[1] 2.5
> class(b)
[1] "numeric"
>
> c <- 5L
> print(c)
[1] 5
> class(c)
[1] "integer"
>
> d <- 'm'
> print(d)
[1] "m"
> class(d)
[1] "character"
>
> e <- 'hello'
> print(e)
[1] "hello"
> class(e)
[1] "character"
>
> f <- 2 +5i
> print(f)
[1] 2+5i
> class(f)
[1] "complex"
The operator <- is the assignment operator.
The print() function prints the value of the specified object.
The class() function returns the type of the specified object.
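As a small sketch of how class() tells the numeric and integer types apart, the snippet below compares a literal written with and without the L suffix. The variable names x, y, and z are illustrative and are not part of basic.R.

```r
# Illustrative names (x, y, z); not part of basic.R.
x <- 5              # no suffix: stored as a double, so class() reports "numeric"
y <- 5L             # L suffix: stored as an integer
z <- as.integer(x)  # explicit conversion from numeric to integer

print(class(x))
print(class(y))
print(class(z))

# The two values compare equal even though their types differ.
print(x == y)
print(identical(x, y))  # identical() also compares the type
```

The == operator compares values only, whereas identical() is stricter and takes the underlying type into account as well.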
The following are the basic collection types supported by R:
Vector - a sequence of data elements of the same basic atomic type
List - a generic vector that can contain any data type
Matrix - a collection of data elements of the same type arranged in a 2-dimensional tabular (rows and columns) layout
Array - a multi-dimensional collection of data elements
Data Frame - a multi-column data table where each column is a vector of the same length
Let us create an R script named collection.R in RStudio containing the statements that appear after the > prompts in the output below.
Execute the R script collection.R in RStudio and the following is the output:
> a <- c('A', 'B', 'C', 'D', 'E')
> print(a)
[1] "A" "B" "C" "D" "E"
> class(a)
[1] "character"
>
> b <- 1:5
> print(b)
[1] 1 2 3 4 5
> class(b)
[1] "integer"
>
> c <- seq(5, 10)
> print(c)
[1]  5  6  7  8  9 10
> class(c)
[1] "integer"
>
> d <- seq(1, 10, 2)
> print(d)
[1] 1 3 5 7 9
> class(d)
[1] "numeric"
>
> e <- list(1, 'Two', 3L, FALSE)
> print(e)
[[1]]
[1] 1

[[2]]
[1] "Two"

[[3]]
[1] 3

[[4]]
[1] FALSE

> class(e)
[1] "list"
>
> f <- matrix(1:8, 2, 4)
> print(f)
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8
> class(f)
[1] "matrix"
>
> g <- data.frame(a = 1:4, b = c('A', 'B', 'C', 'D'), c = c(TRUE, FALSE, FALSE, TRUE))
> print(g)
  a b     c
1 1 A  TRUE
2 2 B FALSE
3 3 C FALSE
4 4 D  TRUE
> class(g)
[1] "data.frame"
The c() function (short for combine, often read as concatenate) allows us to create a vector from its arguments.
The m:n expression is another way to create a vector from a sequence where m is the start of the sequence and n is the end of the sequence.
The seq() function is yet another way to create a vector and generates a sequence. The first argument specifies the start of the sequence, the second argument specifies the end of the sequence, and the third argument if specified is the increment.
The list() function allows us to create a list of data elements of different types.
The matrix() function allows us to create a matrix from the specified data elements with the specified number of rows (second argument) and columns (third argument). By default, the elements fill the matrix column by column.
The data.frame() function allows us to create tabular data where each column is a vector of a certain data type and of the same length.
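The functions described above take a few more arguments than collection.R uses. The following is a minimal sketch, with illustrative variable names (s1, s2, m_col, m_row, arr), of seq() with a named increment or a target length, of matrix() filled by row instead of by column, and of array(), which is listed among the collection types but not exercised in collection.R.

```r
# seq() with a named increment, and with a target length instead.
s1 <- seq(1, 10, by = 2)          # same as seq(1, 10, 2)
s2 <- seq(0, 1, length.out = 5)   # five evenly spaced values from 0 to 1

# matrix() fills column by column unless byrow = TRUE is given.
m_col <- matrix(1:6, nrow = 2, ncol = 3)
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

# array() generalizes matrix() to more than two dimensions:
# here 2 rows x 3 columns x 2 layers.
arr <- array(1:12, dim = c(2, 3, 2))

print(s1)
print(s2)
print(m_col)
print(m_row)
print(dim(arr))
print(arr[2, 3, 2])   # element in row 2, column 3, layer 2
```

Note that the class(f) output shown earlier reflects the R version in use at the time of writing; from R 4.0.0 onwards, class() on a matrix reports both "matrix" and "array".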
In the above R script collection.R, we have assigned (or bound) values to the variables (or symbols) a through g. In R, these name-value pairs are stored in the global environment (referred to as .GlobalEnv) of the current working session. Think of the global environment as the working memory holding a collection of R objects. The global environment is initialized when R first starts.
Execute the following R function to take a peek into what symbols are in the global environment:
ls()
The following is the typical output:
> ls()
[1] "a" "b" "c" "d" "e" "f" "g"
To remove a particular object from the global environment working memory, execute the following R function:
rm(object)
For example, to remove the object a, execute the following R function:
rm(a)
To remove all the objects from the global environment working memory, execute the following R function:
rm(list = ls())
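To try ls() and rm() without wiping your own session, the same calls can be pointed at a separate environment through their envir argument. The following is a minimal sketch; the environment name scratch is hypothetical and is used only so that the global environment is left alone.

```r
# 'scratch' is a hypothetical environment used so that the objects in
# your own global environment are left untouched.
scratch <- new.env()

assign("a", 1:3, envir = scratch)
assign("b", "hello", envir = scratch)
print(ls(envir = scratch))          # names currently bound in scratch

rm("a", envir = scratch)            # remove a single object
print(exists("a", envir = scratch, inherits = FALSE))

rm(list = ls(envir = scratch), envir = scratch)   # remove everything
print(length(ls(envir = scratch)))
```

Dropping the envir argument makes each call act on the current environment, which at the console prompt is the global environment.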
To display the current working directory, execute the following R function:
getwd()
The following is the typical output:
> getwd()
[1] "/home/alice/Projects/R"
Operations and functions in R are vectorized, meaning they work not only on a single data value but also, element by element, on a whole collection of data values in a single call.
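As a small illustration of vectorization, the sketch below multiplies two vectors element by element and then repeats the computation with an explicit loop for comparison. The data and the variable names (prices, qty, total, discounted, total_loop) are made up for this example.

```r
prices <- c(10, 20, 30)    # hypothetical data
qty    <- c(1, 2, 3)

total <- prices * qty      # element-wise: 10*1, 20*2, 30*3
print(total)

# A single value is recycled across the longer vector.
discounted <- prices * 0.9
print(discounted)

# The same result written as an explicit loop, for comparison.
total_loop <- numeric(length(prices))
for (i in seq_along(prices)) {
  total_loop[i] <- prices[i] * qty[i]
}
print(identical(total, total_loop))
```

The vectorized form is shorter and is generally the idiomatic way to write such computations in R.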
The following are some of the commonly used basic functions supported by R:
abs() - to find the absolute value of the specified numeric value(s)
sqrt() - to find the square root of the specified numeric value(s)
ceiling() - to find the ceiling, which returns the smallest integer not less than the specified numeric value(s)
floor() - to find the floor, which returns the largest integer not greater than the specified numeric value(s)
round() - to round the numeric value(s) to the specified number of decimal places
substr() - to extract substrings from the specified character value(s)
paste() - to concatenate character values, separated by the sep string (a single space by default)
tolower() - to convert characters to lower case
toupper() - to convert characters to upper case
rep() - to replicate the value(s) specified in the first argument a specified number of times (second argument)
Let us create an R script named vector.R in RStudio to demonstrate the functions mentioned above. It contains the statements that appear after the > prompts in the output below.
Execute the R script vector.R in RStudio and the following is the output:
> a <- seq(11, 20)
> b <- seq(-5, -50, -5)
> c <- c('abc', 'DEFGH', 'iJKlmnop', 'qrSTUVWxyz')
>
> d <- a + b
> print(d)
[1] 6 2 -2 -6 -10 -14 -18 -22 -26 -30
>
> e <- b / a
> print(e)
[1] -0.4545455 -0.8333333 -1.1538462 -1.4285714 -1.6666667 -1.8750000 -2.0588235 -2.2222222 -2.3684211 -2.5000000
>
> abs(b)
[1] 5 10 15 20 25 30 35 40 45 50
>
> f <- sqrt(a)
>
> ceiling(f)
[1] 4 4 4 4 4 4 5 5 5 5
>
> floor(f)
[1] 3 3 3 3 3 4 4 4 4 4
>
> round(f)
[1] 3 3 4 4 4 4 4 4 4 4
>
> substr(c, 2, 3)
[1] "bc" "EF" "JK" "rS"
>
> paste('Welcome', 'to', 'R', 'Programming')
[1] "Welcome to R Programming"
> paste('Weekdays are - Mon', 'Tue', 'Wed', 'Thu', 'Fri', sep = ',')
[1] "Weekdays are - Mon,Tue,Wed,Thu,Fri"
>
> tolower(c)
[1] "abc" "defgh" "ijklmnop" "qrstuvwxyz"
>
> toupper(c)
[1] "ABC" "DEFGH" "IJKLMNOP" "QRSTUVWXYZ"
>
> rep(1:5, 3)
[1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
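The script vector.R calls round() without a number of decimal places and rep() with a plain repeat count. The following is a minimal sketch of a few further arguments these functions accept (digits as the second argument of round(), each for rep(), and collapse for paste()); the variable names x and labels are illustrative.

```r
x <- sqrt(11:13)
print(round(x, 2))                   # keep two decimal places

# rep() can repeat each element in turn instead of the whole vector.
print(rep(c('a', 'b'), times = 2))   # whole vector repeated twice
print(rep(c('a', 'b'), each = 2))    # each element repeated twice

# paste() is vectorized too: the shorter argument is recycled.
labels <- paste('item', 1:3, sep = '-')
print(labels)

# collapse joins the elements of one vector into a single string.
print(paste(c('Mon', 'Tue', 'Wed'), collapse = ','))
```

The difference between sep and collapse is that sep goes between the arguments of paste(), while collapse goes between the elements of the resulting vector.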
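Missing data values in R are represented as either NA (not available) or NaN (not a number). A NaN value is also considered an NA value, but an NA is never a NaN: is.na() returns TRUE for both, while is.nan() returns TRUE only for NaN.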
Let us create an R script named missing.R in RStudio to demonstrate the missing values mentioned above. It contains the statements that appear after the > prompts in the output below.
Execute the R script missing.R in RStudio and the following is the output:
> a <- c(5, 10, NA, 20, NA, 30)
> b <- c(5, 10, NaN, 20, NA, 30)
>
> is.na(a)
[1] FALSE FALSE TRUE FALSE TRUE FALSE
>
> is.nan(a)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
>
> is.na(b)
[1] FALSE FALSE TRUE FALSE TRUE FALSE
>
> is.nan(b)
[1] FALSE FALSE TRUE FALSE FALSE FALSE
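The following is a short sketch of where a NaN can come from and of how missing values affect a summary function. It reuses the vector a from missing.R; the variable x and the use of mean() with its na.rm argument are additions for illustration.

```r
x <- 0 / 0                    # an undefined numeric result is NaN
print(is.nan(x))
print(is.na(x))               # NaN also counts as missing

a <- c(5, 10, NA, 20, NA, 30) # same vector as in missing.R
print(sum(is.na(a)))          # number of missing entries
print(mean(a))                # NA: missing values propagate
print(mean(a, na.rm = TRUE))  # drop missing values before averaging
```

Many summary functions in R, such as sum() and max(), accept the same na.rm argument.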
More to come in Part-2 ...