Note:

-An R Notebook is an R Markdown document with chunks that can be executed independently and interactively, with output visible immediately beneath the input.

-Notebook output is available as HTML, PDF, Word, or LaTeX.

-The HTML version of this notebook is best viewed in Google Chrome.

-The R code can be extracted as an Rmd file from the “Code” button in the notebook.

-This notebook uses iterative development: the process starts with a simple implementation of a small set of requirements and iteratively enhances the evolving versions until the complete version is implemented.

Getting started

#https://rstudio-education.github.io/hopr/

This book:

  • Teaches you how to program in R, with hands-on examples
  • Is a friendly introduction to the R language
  • Gives you newfound skills to solve practical data science problems
  • Takes you from loading data to writing your own functions
  • Helps you become a data scientist, as well as a computer scientist
  • Focuses on the programming skills that are most related to data science
  • Treats R purely as a programming language


==Project 1: Weighted Dice==

Improve your ability as a data scientist:

  • Memorize (store) entire data sets
  • Recall data values on demand
  • Perform complex calculations with large amounts of data
  • Do repetitive tasks without becoming careless or bored

Your first mission is simple: assemble R code that will simulate rolling a pair of dice, like at a craps table. In this project, you will learn how to:

  • Use the R and RStudio interfaces
  • Run R commands
  • Create R objects
  • Write your own R functions and scripts
  • Load and use R packages
  • Generate random samples
  • Create quick plots
  • Get help when you need it

You’ll need to have both R and RStudio installed on your computer before you can use them. Both are free and easy to download. See Appendix A


1. The Very Basics

In this chapter, you will build a pair of virtual dice that you can use to generate random numbers.

The R User Interface:

#https://rstudio-education.github.io/hopr/basics.html#the-r-user-interface

If you do not yet have R and RStudio installed on your computer (or do not know what I am talking about), visit Appendix A. The appendix will give you an overview of the two free tools and tell you how to download them.

When you type a command at the prompt and hit Enter, your computer executes the command and shows you the results. Then RStudio displays a fresh prompt for your next command. For example, if you type 1 + 1 and hit Enter, RStudio will display:

1 + 1 
[1] 2

You can mostly ignore the numbers that appear in brackets:

100:130 
 [1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122
[24] 123 124 125 126 127 128 129 130

If you type an incomplete command and press Enter, R will display a + prompt, which means R is waiting for you to type the rest of your command. Either finish the command or hit Escape to start over:

5 -
Error: Incomplete expression: 5 -

If you type a command that R doesn’t recognize, R will return an error message. If you ever see an error message, don’t panic. R is just telling you that your computer couldn’t understand or do what you asked it to do.

3 % 5
Error: unexpected input in "3 % 5"

Once you get the hang of the command line, you can easily do anything in R that you would do with a calculator. For example, you could do some basic arithmetic:

2 * 3
[1] 6
4 - 1
[1] 3
6 / (4 - 1)
[1] 2

____Exercise_The R User Interface:

  1. Choose any number and add 2 to it.
  2. Multiply the result by 3.
  3. Subtract 6 from the answer.
  4. Divide what you get by 3.

Solution:

10 + 2
[1] 12
12 * 3
[1] 36
36 - 6
[1] 30
30 / 3
[1] 10

Objects:

R lets you save data by storing it inside an R object. What’s an object? Just a name that you can use to call up stored data.

a <- 1 
a
[1] 1
  
die <- 1:6
die
[1] 1 2 3 4 5 6

#https://rstudio-education.github.io/hopr/basics.html#objects

R also understands capitalization (or is case-sensitive), so name and Name will refer to different objects:

Name <- 1 
Name
[1] 1
name <- 0
name  
[1] 0

You can see which object names you have already used with the function ls:

ls() 
[1] "a"    "die"  "name" "Name"

If you are a big fan of linear algebra (and who isn’t?), you may notice that R does not always follow the rules of matrix multiplication. Instead, R uses element-wise execution. When you manipulate a set of numbers, R will apply the same operation to each element in the set. So for example, when you run die - 1, R subtracts one from each element of die.

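When you use two or more vectors in an operation, R lines up the vectors and performs a sequence of individual operations. For example, when you run die * die, R multiplies the first element of die by the first element of die, the second by the second, and so on. The result will be a new vector the same length as the first two, as shown in Figure 1-3.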


#https://rstudio-education.github.io/hopr/basics.html#objects

If you give R two vectors of unequal lengths, R will repeat the shorter vector until it is as long as the longer vector, and then do the math, as shown in Figure 1-4.


#https://rstudio-education.github.io/hopr/basics.html#objects
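Both behaviors can be checked directly at the console. The following is a small sketch that assumes a six-sided die stored as 1:6; the expected results are shown as comments.

```r
die <- 1:6

# Element-wise execution: the same operation is applied to every element
die - 1      # 0 1 2 3 4 5
die * die    # 1 4 9 16 25 36

# Recycling: the shorter vector 1:2 is repeated as 1 2 1 2 1 2
die + 1:2    # 2 4 4 6 6 8
```

If the length of the longer vector is not a multiple of the length of the shorter one, R still returns a result but adds a warning.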

But don’t think that R has given up on traditional matrix multiplication. You just have to ask for it when you want it. You can do inner multiplication with the %*% operator and outer multiplication with the %o% operator:

die<-1:8
die
[1] 1 2 3 4 5 6 7 8
die*die  
[1]  1  4  9 16 25 36 49 64
die%*%die
     [,1]
[1,]  204
die%o%die   
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    1    2    3    4    5    6    7    8
[2,]    2    4    6    8   10   12   14   16
[3,]    3    6    9   12   15   18   21   24
[4,]    4    8   12   16   20   24   28   32
[5,]    5   10   15   20   25   30   35   40
[6,]    6   12   18   24   30   36   42   48
[7,]    7   14   21   28   35   42   49   56
[8,]    8   16   24   32   40   48   56   64

Functions:

Using a function is pretty simple. Just write the name of the function and then the data you want the function to operate on in parentheses:

round(3.1415) ## 3
[1] 3
factorial(3) ## 6 
[1] 6

The data that you pass into the function is called the function’s argument. The argument can be raw data, an R object, or even the results of another R function. In this last case, R will work from the innermost function to the outermost, as in Figure 1-5:

#https://rstudio-education.github.io/hopr/basics.html#objects
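A quick way to see the inside-out evaluation order is to nest two base R functions. This sketch assumes a six-sided die stored in die; mean is not used elsewhere in these notes and is chosen only for illustration.

```r
die <- 1:6

# R evaluates the innermost call first:
# mean(die) gives 3.5, and round(3.5) then gives 4
mean(die)          # 3.5
round(mean(die))   # 4
```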

If you’re not sure which names to use with a function, you can look up the function’s arguments with args. To do this, place the name of the function in the parentheses behind args. For example, you can see that the round function takes two arguments, one named x and one named digits:

round(3.444)
[1] 3
  args(round) ## function (x, digits = 0) ## NULL 
function (x, digits = 0) 
NULL
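Once you know an argument's name, you can pass it explicitly. The sketch below uses the digits argument of round and, as an extra illustration not taken from the text, lists the argument names of sample with formals.

```r
# Supplying the digits argument by name
round(3.1415, digits = 2)   # 3.14

# args prints the argument list of any function, for example sample
args(sample)

# formals returns the same information as a list you can work with
names(formals(sample))      # "x" "size" "replace" "prob"
```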

Sample with Replacement:

Sampling with replacement is an easy way to create independent random samples. Each value in your sample will be a sample of size one that is independent of the other values. This is the correct way to simulate a pair of dice:

die<-1:8
sample(die, size = 2, replace = TRUE) 
[1] 5 5
dice <- sample(die, size = 2, replace = TRUE) 
dice 
[1] 8 6
sum(dice) 
[1] 14

What would happen if you call dice multiple times? Would R generate a new pair of dice values each time? Let’s give it a try:

dice 
[1] 4 6
dice 
[1] 4 6
dice 
[1] 4 6

Nope. Each time you call dice, R will show you the result of that one time you called sample and saved the output to dice. R won’t rerun sample(die, 2, replace = TRUE) to create a new roll of the dice. This is a relief in a way. Once you save a set of results to an R object, those results do not change. Programming would be quite hard if the values of your objects changed each time you called them.
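If you want a random draw that can be reproduced later, one common approach is to fix the random seed before calling sample. set.seed is base R but is not used in the text above, so treat this as an optional aside; the seed value 2019 is arbitrary.

```r
die <- 1:6

set.seed(2019)                                   # arbitrary seed
first <- sample(die, size = 2, replace = TRUE)

set.seed(2019)                                   # same seed again
second <- sample(die, size = 2, replace = TRUE)

identical(first, second)   # TRUE: same seed, same draw
```

Without the second call to set.seed, the second sample would usually differ from the first.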

Writing Your Own Functions:

To recap, you already have working R code that simulates rolling a pair of dice:

die <- 1:6 
dice <- sample(die, size = 2, replace = TRUE) 
sum(dice) 
[1] 7

It would be easier to use your code if you wrapped it into its own function. Functions may seem mysterious or fancy, but they are just another type of R object. Instead of containing data, they contain code. This code is stored in a special format that makes it easy to reuse the code in new situations. You can write your own functions by recreating this format.

The Function Constructor:

Every function in R has three basic parts:

  1. a name
  2. a body of code, and
  3. a set of arguments

my_function <- function() {}

roll <- function() {  
      die <- 1:6  
      dice <- sample(die, size = 2, replace = TRUE)  
      sum(dice) 
      } 
    roll()
[1] 9
    roll
function() {  
      die <- 1:6  
      dice <- sample(die, size = 2, replace = TRUE)  
      sum(dice) 
      }

Arguments:

 roll2 <- function(bones = 1:6) {  
      dice <- sample(bones, size = 2, replace = TRUE) 
      sum(dice) 
      } 
    
    
    roll2()
[1] 5
    roll2
function(bones = 1:6) {  
      dice <- sample(bones, size = 2, replace = TRUE) 
      sum(dice) 
      }

Finally, you give your function a name by saving its output to an R object, as shown in Figure 1-6.

#https://rstudio-education.github.io/hopr/basics.html#objects
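The default value of bones is used only when you do not supply your own. The sketch below repeats the roll2 definition so that it runs on its own, calls it with a few different values, and then inspects its parts with formals and body (two base R helpers that the text does not mention).

```r
roll2 <- function(bones = 1:6) {
  dice <- sample(bones, size = 2, replace = TRUE)
  sum(dice)
}

roll2()                    # uses the default bones = 1:6
roll2(bones = 1:20)        # a pair of 20-sided dice
roll2(bones = c(6, 6))     # always 12, since every face is a 6

# The parts of the function can be inspected separately
formals(roll2)             # the arguments and their defaults
body(roll2)                # the code that runs on each call
```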

Scripts:

You can open an R script in RStudio by going to File > New File > R Script in the menu bar. RStudio will then open a fresh script above your console pane, as shown in Figure 1-7 and Figure 1-8.

#https://rstudio-education.github.io/hopr/basics.html#objects
#https://rstudio-education.github.io/hopr/basics.html#objects

Summary:

The two most important components of the R language are

  1. objects, which store data, and

  2. functions, which manipulate data.

In the next chapter, you’ll also look at two of the most useful components of the R language: R packages, which are collections of functions written by R’s talented community of developers, and R documentation, which is a collection of help pages built into R that explains every function and data set in the language.

2. Packages and Help Pages

Many of the most useful R tools come in R packages, so let’s take a moment to look at what R packages are and how you can use them.

Packages:

You’re not the only person writing your own functions with R. Many professors, programmers, and statisticians use R to design tools that can help people analyze data. They then make these tools free for anyone to use. Appendix 2: R Packages contains detailed instructions for downloading and updating R packages.

We’re going to use the qplot function to make some quick plots. qplot comes in the ggplot2 package, a popular package for making graphs. Before you can use qplot, or anything else in the ggplot2 package, you need to download and install it.

Install.packages:

Each R package is hosted at http://cran.r-project.org, the same website that hosts R. However, you don’t need to visit the website to download an R package; you can download packages straight from R’s command line in RStudio.

Here’s how:

  1. Open RStudio.
  2. Make sure you are connected to the Internet.
  3. Run install.packages("ggplot2") at the command line.

library:

Installing a package doesn’t place its functions at your fingertips just yet: it simply places them on your hard drive. To use an R package, you next have to load it in your R session with the command library("ggplot2").

The main thing to remember is that you only need to install a package once, but you need to load it with library each time you wish to use it in a new R session. R will unload all of its packages each time you close RStudio.
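The install-once, load-every-session pattern can be wrapped in a small helper. The function name ensure_package is hypothetical (it is not part of R or of the book); it relies only on the base functions requireNamespace, install.packages, and library.

```r
# Hypothetical helper: install a package only when it is missing,
# then load it for the current session.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # needs an Internet connection
  }
  library(pkg, character.only = TRUE)
}

# Example call, commented out so nothing is installed by accident:
# ensure_package("ggplot2")
```

Because the helper checks first, calling it at the top of a script does not reinstall a package that is already on your hard drive.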

The following code will make the plot that appears in Figure 2-1.

library(ggplot2)  # load the package so that qplot is available
x <- c(-1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1)
x
 [1] -1.0 -0.8 -0.6 -0.4 -0.2  0.0  0.2  0.4  0.6  0.8  1.0
y <- x^3
y
 [1] -1.000 -0.512 -0.216 -0.064 -0.008  0.000  0.008  0.064  0.216  0.512  1.000
qplot(x, y)

#https://rstudio-education.github.io/hopr/basics.html#objects

Give c all of the numbers that you want to appear in the vector, separated by a comma. c stands for concatenate, but you can think of it as “collect” or “combine”:

x <- c(-1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1) 
x ## -1.0 -0.8 -0.6 -0.4 -0.2  0.0  0.2  0.4  0.6  0.8  1.0
 [1] -1.0 -0.8 -0.6 -0.4 -0.2  0.0  0.2  0.4  0.6  0.8  1.0
y <- x^3 
y ## -1.000 -0.512 -0.216 -0.064 -0.008  0.000  0.008 ##  0.064  0.216  0.512  1.000
 [1] -1.000 -0.512 -0.216 -0.064 -0.008  0.000  0.008  0.064  0.216  0.512  1.000
plot(x, y)

The following code makes the left-hand plot in Figure 2-2 (we’ll worry about the right-hand plot in just a second). To make sure our graphs look the same, use the extra argument binwidth = 1:

#https://rstudio-education.github.io/hopr/basics.html#objects
x <- c(1, 2, 2, 2, 3, 3) 
qplot(x, binwidth = 1)  # qplot accepts binwidth; base hist does not

Let’s try another histogram. This code makes the right-hand plot in Figure 2-2.

x2 <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4) 
qplot(x2, binwidth = 1)  # a histogram; plot(x2) would draw a scatterplot instead

Exercise_1:

Let x3 be the following vector:

x3 <- c(0, 1, 1, 2, 2, 2, 3, 3, 4)
x3
[1] 0 1 1 2 2 2 3 3 4
  • Imagine what a histogram of x3 would look like.
  • Assume that the histogram has a bin width of 1.
  • How many bars will the histogram have?
  • Where will they appear?
  • How high will each be?

When you are done, plot a histogram of x3 with binwidth = 1, and see if you are right.

How can you use a histogram to check the accuracy of your dice? Well, if you roll your dice many times and keep track of the results, you would expect some numbers to occur more than others. This is because there are more ways to get some numbers by adding two dice together than to get other numbers, as shown in Figure 2-3.

#https://rstudio-education.github.io/hopr/basics.html#objects
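The number of ways to reach each total can also be counted in R rather than read from the figure. The sketch below is one possible approach: it uses outer (the general form of the %o% operator) with "+" to build every sum of two fair dice, and table to count them.

```r
die <- 1:6

# outer builds the 6 x 6 table of every possible sum of two dice
sums <- outer(die, die, "+")

# Count how many of the 36 combinations produce each total
table(sums)
# sums
#  2  3  4  5  6  7  8  9 10 11 12
#  1  2  3  4  5  6  5  4  3  2  1
```

A total of 7 can be made in six ways, while 2 and 12 can each be made in only one way, so a histogram of many fair rolls should peak at 7.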

This is where replicate comes in. Replicate provides an easy way to repeat an R command many times. To use it, first give replicate the number of times you wish to repeat an R command, and then give it the command you wish to repeat. Replicate will run the command multiple times and store the results as a vector:

roll <- function() {  
      die <- 1:6  
      dice <- sample(die, size = 2, replace = TRUE)  
      sum(dice) 
      } 
replicate(3, 1 + 1) 
[1] 2 2 2
replicate(10, roll()) 
 [1]  7 11  7 11  5  6  8  4 10  6
  

So let’s simulate 10,000 dice rolls and plot the results. Don’t worry; qplot and replicate can handle it. Figure 2-4:

#https://rstudio-education.github.io/hopr/basics.html#objects
roll <- function() {  
    die <- 1:6  
    dice <- sample(die, size = 2, replace = TRUE)  
    sum(dice) 
  } 
  # 10 rolls keep the printed vector short; use replicate(10000, roll()) for the full simulation
  rolls <- replicate(10, roll()) 
  qplot(rolls, binwidth = 1) 

rolls
 [1] 8 8 4 3 6 7 8 8 7 3

See Appendix B and Appendix C

Getting Help with Help Pages:

There are over 1,000 functions at the core of R, and new R functions are created all of the time. This can be a lot of material to memorize and learn! Luckily, each R function comes with its own help page, which you can access by typing the function’s name after a question mark:

?sqrt
?log10 
?sample

Here, almost every help page includes some example code that puts the function in action. Running this code is a great way to learn by example.

Note: If a function comes in an R package, R won’t be able to find its help page unless the package is loaded.

Parts of a Help Page:

Each help page is divided into sections. Which sections appear can vary from help page to help page, but you can usually expect to find these useful topics:

  • Description
  • Usage
  • Arguments
  • Details
  • Value

If you’d like to look up the help page for a function but have forgotten the function’s name, you can search by keyword.

To do this, type two question marks followed by a keyword in R’s command line. R will pull up a list of links to help pages related to the keyword. You can think of this as the help page for the help page:

??log

Exercise_2:

Rewrite the roll function to roll a pair of weighted dice:

roll <- function() {  
    die <- 1:6  
    dice <- sample(die, size = 2, replace = TRUE)  
    sum(dice) 
    } 

You will need to add a prob argument to the sample function inside of roll. This argument should tell sample to sample the numbers one through five with probability 1/8 and the number 6 with probability 3/8.

roll <- function() {  
       die <- 1:6  
       dice <- sample(die, size = 2, replace = TRUE,    prob = c(1/8, 1/8, 1/8, 1/8, 1/8, 3/8))  
       
       sum(dice) 
       } 
roll()
[1] 7

This will cause roll to pick 1 through 5 with probability 1/8 and 6 with probability 3/8.

rolls <- replicate(10000, roll()) 
qplot(rolls, binwidth = 1)  # qplot accepts binwidth; base hist does not

#https://rstudio-education.github.io/hopr/basics.html#objects
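The effect of the weights can also be worked out exactly, without simulation. The sketch below is an illustrative check, not part of the exercise: it assumes the two dice are independent and uses outer and tapply to compute the probability of each total.

```r
die   <- 1:6
probs <- c(1/8, 1/8, 1/8, 1/8, 1/8, 3/8)

sum(probs)            # 1, so the weights form a valid distribution

# Expected value of one weighted die and of the sum of two
sum(die * probs)      # 4.125
2 * sum(die * probs)  # 8.25 (a fair pair averages 7)

# Exact probability of each total for two independent weighted dice
joint  <- outer(probs, probs)        # P(first = i and second = j)
totals <- outer(die, die, "+")       # the total for each (i, j)
dist   <- tapply(joint, totals, sum) # P(total = 2), ..., P(total = 12)
round(dist, 3)
```

The probability of rolling a 12 rises from 1/36 for fair dice to (3/8)^2 = 9/64, which is why the histogram of weighted rolls is skewed toward high totals.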

Getting More Help:

  • R also comes with a super active community of users that you can turn to for help on the R-help mailing list. https://community.rstudio.com/

  • You can email the list with questions, but there’s a great chance that your question has already been answered.

  • Find out by searching the archives.

Even better than the R-help list is Stack Overflow, a website that allows programmers to answer questions and users to rank answers based on helpfulness.

Summary:

  • R’s packages and help pages can make you a more productive programmer.

  • Often the function that you want to write will already exist in an R package.

  • To use a package, install it once with install.packages, and then load it into each new R session with library.

  • R’s help pages will help you master the functions that appear in R and its packages.

==Project 2: Playing Cards==

This project will teach you how to store, retrieve, and change data values in your computer’s memory. These skills will help you save and manage data without accumulating errors.

Along the way, you will learn how to:

  • Save new types of data, like character strings and logical values
  • Save a data set as a vector, matrix, array, list, or data frame
  • Load and save your own data sets with R
  • Extract individual values from a data set
  • Change individual values within a data set
  • Write logical tests
  • Use R’s missing-value symbol, NA

We’ve divided the project into four tasks. Each task will teach you a new skill for managing data with R:

Task 1: build the deck. In R Objects, you will design and build a virtual deck of playing cards. This will be a complete data set, just like the ones you will use as a data scientist. You’ll need to know how to use R’s data types and data structures to make this work.

Task 2: write functions that deal and shuffle. Next, in R Notation, you will write two functions to use with the deck. One function will deal cards from the deck, and the other will reshuffle the deck. To write these functions, you’ll need to know how to extract values from a data set with R.

Task 3: change the point system to suit your game. In Modifying Values, you will use R’s notation system to change the point values of your cards to match the card games you may wish to play, like war, hearts, or blackjack. This will help you change values in place in existing data sets.

Task 4: manage the state of the deck. Finally, in Environments, you will make sure that your deck remembers which cards it has dealt. This is an advanced task, and it will introduce R’s environment system and scoping rules. To do it successfully, you will need to learn the minute details of how R looks up and uses the data that you have stored in your computer.

3. R Objects

Atomic Vectors:

You can make an atomic vector by grouping some values of data together with c:

die <- c(1, 2, 3, 4, 5, 6) 
die
[1] 1 2 3 4 5 6
is.vector(die) 
[1] TRUE
#is.vector returns TRUE for a vector (atomic or list) that has no attributes other than names, and FALSE otherwise.
#Use is.atomic to test specifically for atomic data.

You can also make an atomic vector with just one value. R saves single values as an atomic vector of length 1:

five <- 5 
five
[1] 5
is.vector(five) 
[1] TRUE
length(five) 
[1] 1
length(die) 
[1] 6
#length returns the length of an atomic vector.

Each atomic vector stores its values as a one-dimensional vector, and each atomic vector can only store one type of data. You can save different types of data in R by using different types of atomic vectors. Altogether, R recognizes six basic types of atomic vectors: doubles, integers, characters, logicals, complex, and raw.

You can choose the type of an atomic vector by using some simple conventions when you enter your data. For example, you can create an integer vector by including a capital L with your input. You can create a character vector by surrounding your input in quotation marks:

int <- 1L 
int
[1] 1
text <- "ace" 
text
[1] "ace"

If you’d like to make atomic vectors that have more than one element in them, you can combine the elements with the c function.

int <- c(1L, 5L) 
text <- c("ace", "hearts") 
  

You may wonder why R uses multiple types of vectors. Vector types help R behave as you would expect. For example, R will do math with atomic vectors that contain numbers, but not with atomic vectors that contain character strings:

sum(int) ## 6
[1] 6
sum(text) ## Error in sum(text) : invalid 'type' (character) of argument 
Error in sum(text) : invalid 'type' (character) of argument

But we’re getting ahead of ourselves! Get ready to say hello to the six types of atomic vectors in R.

  • 1-Doubles
die <- c(1, 2, 3, 4, 5, 6) 
die ## 1 2 3 4 5 6 
[1] 1 2 3 4 5 6
typeof(die) ##  "double" 
[1] "double"
    
  • 2-Integers
int <- c(-1L, 2L, 4L) 
    int ## -1  2  4
[1] -1  2  4
    typeof(int) ## "integer" 
[1] "integer"
  • 3-Characters
text <- c("Hello",  "World") 
    text ##  "Hello"  "World"
[1] "Hello" "World"
    typeof(text) ## "character"
[1] "character"
    typeof("Hello") ## "character" 
[1] "character"
  • 4-Logicals
3 > 4 
[1] FALSE
    ## FALSE 
    
    logic <- c(TRUE, FALSE, TRUE) 
    logic ##   TRUE FALSE  TRUE
[1]  TRUE FALSE  TRUE
    typeof(logic) ## "logical"
[1] "logical"
    typeof(F) ## "logical" 
[1] "logical"
    
  • 5-Complex
x <- 1
y <- 1
 
comp<- complex(real = x, imaginary = y)
 
typeof(comp) ## "complex" 
[1] "complex"
    
  • 6-Raw

Raw vectors store raw bytes of data. Making raw vectors gets complicated, but you can make an empty raw vector of length n with raw(n). See the help page of raw for more options when working with this type of data:

    raw(3) 
[1] 00 00 00
      ## 00 00 00
    typeof(raw(3))
[1] "raw"
    ## "raw"
  • Exercise

Create an atomic vector that stores just the face names of the cards in a royal flush, for example, the ace of spades, king of spades, queen of spades, jack of spades, and ten of spades. The face name of the ace of spades would be “ace,” and “spades” is the suit. Which type of vector will you use to save the names?

hand <- c("ace", "king", "queen", "jack", "ten") 
    hand 
[1] "ace"   "king"  "queen" "jack"  "ten"  
    ## "ace"   "king"  "queen" "jack"  "ten"
    typeof(hand)
[1] "character"
    ## "character" 

Attributes:

You can think of an attribute as “metadata”; it is just a convenient place to put information associated with an object.

die <- c(1, 2, 3, 4, 5, 6) 
attributes(die)
NULL
  ## NULL

The most common attributes to give an atomic vector are names, dimensions (dim), and classes.

Note: R uses NULL to represent the null set, an empty object. NULL is often returned by functions whose values are undefined. You can create a NULL object by typing NULL in capital letters.

  • Names:
die <- c(1, 2, 3, 4, 5, 6) 
names(die) ## NULL 
NULL
names(die) <- c("one", "two", "three", "four", "five", "six") 
names(die) ## "one"   "two"   "three" "four"  "five"  "six"
[1] "one"   "two"   "three" "four"  "five"  "six"  
attributes(die)
$names
[1] "one"   "two"   "three" "four"  "five"  "six"  
  

R will display the names above the elements of die whenever you look at the vector:

 die
  one   two three  four  five   six 
    1     2     3     4     5     6 

To remove the names attribute, set it to NULL:

names(die) <- NULL 
  die 
[1] 1 2 3 4 5 6
  • Dim:

You can transform an atomic vector into an n-dimensional array by giving it a dimensions attribute with dim.

die <- c(1, 2, 3, 4, 5, 6)  
  dim(die) <- c(2, 3) 
die   
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
    
dim(die) <- c(3, 2) 
die 
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
  
dim(die) <- c(1, 2, 3)
die
, , 1

     [,1] [,2]
[1,]    1    2

, , 2

     [,1] [,2]
[1,]    3    4

, , 3

     [,1] [,2]
[1,]    5    6

Note that you don’t have much control over how R reorganizes the values into rows and columns. For example, R always fills up each matrix by columns, instead of by rows. If you’d like more control over this process, you can use one of R’s helper functions, matrix or array. They do the same thing as changing the dim attribute, but they provide extra arguments to customize the process.

Matrices:

die <- c(1, 2, 3, 4, 5, 6)
m <- matrix(die, nrow = 2) 
m 
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Matrix will fill up the matrix column by column by default, but you can fill the matrix row by row if you include the argument byrow = TRUE:

mm <- matrix(die, nrow = 2, byrow = TRUE) 
mm
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

Arrays:

The array function creates an n-dimensional array.

ar <- array(c(11:14, 21:24, 31:34), dim = c(2, 2, 3)) 
ar
, , 1

     [,1] [,2]
[1,]   11   13
[2,]   12   14

, , 2

     [,1] [,2]
[1,]   21   23
[2,]   22   24

, , 3

     [,1] [,2]
[1,]   31   33
[2,]   32   34

Exercise:

Create the following matrix, which stores the name and suit of every card in a royal flush.

hand1 <- c("ace", "king", "queen", "jack", "ten", "spades", "spades",  "spades", "spades", "spades")
matrix(hand1, nrow = 5) 
     [,1]    [,2]    
[1,] "ace"   "spades"
[2,] "king"  "spades"
[3,] "queen" "spades"
[4,] "jack"  "spades"
[5,] "ten"   "spades"
matrix(hand1, ncol = 2) 
     [,1]    [,2]    
[1,] "ace"   "spades"
[2,] "king"  "spades"
[3,] "queen" "spades"
[4,] "jack"  "spades"
[5,] "ten"   "spades"
dim(hand1) <- c(5, 2) 
hand1
     [,1]    [,2]    
[1,] "ace"   "spades"
[2,] "king"  "spades"
[3,] "queen" "spades"
[4,] "jack"  "spades"
[5,] "ten"   "spades"
test<-matrix(hand1,nrow=5, ncol=2)
test
     [,1]    [,2]    
[1,] "ace"   "spades"
[2,] "king"  "spades"
[3,] "queen" "spades"
[4,] "jack"  "spades"
[5,] "ten"   "spades"

Class:

class("Hello")
[1] "character"
class(5)
[1] "numeric"
  • Dates and Times:
now <- Sys.time() 
now
[1] "2019-04-14 11:35:59 CEST"
typeof(now) ##  "double"
[1] "double"
class(now) ## "POSIXct" "POSIXt" 
[1] "POSIXct" "POSIXt" 
  • Factors:

Factors are R’s way of storing categorical information, like ethnicity or eye color. Think of a factor as something like a gender; it can only have certain values (male or female).

gender <- factor(c("male", "female", "female", "male"))
gender
[1] male   female female male  
Levels: female male
typeof(gender) 
[1] "integer"
attributes(gender) 
$levels
[1] "female" "male"  

$class
[1] "factor"

You can see exactly how R is storing your factor with unclass:

gender <- factor(c("male", "female", "female", "male"))
unclass(gender)
[1] 2 1 1 2
attr(,"levels")
[1] "female" "male"  
gender
[1] male   female female male  
Levels: female male
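The integer codes shown by unclass explain a common surprise when a factor holds numbers stored as text. The sketch below uses a made-up factor named sizes (not from the book) to show the difference between converting the codes and converting the labels.

```r
sizes <- factor(c("10", "20", "10", "5"))

# The levels are sorted as text: "10" "20" "5"
levels(sizes)

# as.numeric returns the integer codes, not the original numbers
as.numeric(sizes)                 # 1 2 1 3

# Convert to character first to recover the values
as.numeric(as.character(sizes))   # 10 20 10 5
```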

Factors make it easy to put categorical variables into a statistical model because the variables are already coded as numbers. However, factors can be confusing since they look like character strings but behave like integers.

You can convert a factor to a character string with the as.character function. R will retain the display version of the factor, not the integers stored in memory:

gender <- factor(c("male", "female", "female", "male"))
as.character(gender) 
[1] "male"   "female" "female" "male"  

Exercise:

Many card games assign a numerical value to each card. For example, in blackjack, each face card is worth 10 points, each number card is worth between 2 and 10 points, and each ace is worth 1 or 11 points, depending on the final score. Make a virtual playing card by combining “ace,” “heart,” and 1 into a vector. What type of atomic vector will result? Check if you are right.

card <- c("ace", "hearts", 1) 
card 

Because an atomic vector can store only one type of data, R coerces all three values to character strings, so the point value is saved as the string "1". This will cause trouble if you want to do math with that point value, for example, to see who won your game of blackjack. Since matrices and arrays are special cases of atomic vectors, they suffer from the same behavior. Each can only store one type of data. This creates a couple of problems. First, many data sets contain multiple types of data. Simple programs like Excel and Numbers can save multiple types of data in the same data set, and you should hope that R can too. Don’t worry, it can.

Coercion:

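If a character string is present in an atomic vector, R will convert everything else in the vector to character strings. If a vector only contains logicals and numbers, R will convert the logicals to numbers; every TRUE becomes a 1, and every FALSE becomes a 0, as shown in Figure 3-1.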

R uses the same coercion rules when you try to do math with logical values. So the following code:

sum(c(TRUE, TRUE, FALSE, FALSE)) 
[1] 2
#will become:
sum(c(1, 1, 0, 0)) 
[1] 2

You can explicitly ask R to convert data from one type to another with the as functions. R will convert the data whenever there is a sensible way to do so:

as.character(1) 
[1] "1"
  ## "1"
as.logical(1) 
[1] TRUE
  ## TRUE
as.numeric(FALSE) 
[1] 0
  ## 0 
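You can watch the coercion rules at work by mixing types inside c and checking the result with typeof. This is a small sketch of the behavior described above; the values are arbitrary.

```r
typeof(c(1, TRUE, FALSE))    # "double": logicals become 1 and 0
typeof(c(1L, TRUE))          # "integer"
typeof(c(1, "ace", TRUE))    # "character": everything becomes a string
c(1, "ace", TRUE)            # "1" "ace" "TRUE"

# When no sensible conversion exists, R returns NA with a warning
suppressWarnings(as.numeric("ace"))   # NA
```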

Lists:

Lists are like atomic vectors because they group data into a one-dimensional set. However, lists do not group together individual values; lists group together R objects, such as atomic vectors and other lists.

list1 <- list(100:130, "R", list(TRUE, FALSE)) 
print(list1)
[[1]]
 [1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122
[24] 123 124 125 126 127 128 129 130

[[2]]
[1] "R"

[[3]]
[[3]][[1]]
[1] TRUE

[[3]][[2]]
[1] FALSE

Exercise:

Use a list to store a single playing card, like the ace of hearts, which has a point value of one. The list should save the face of the card, the suit, and the point value in separate elements.

card <- list("ace", "hearts", 1) 
print(card)
[[1]]
[1] "ace"

[[2]]
[1] "hearts"

[[3]]
[1] 1

Data Frames:

You can think of a data frame as R’s equivalent to the Excel spreadsheet because it stores data in a similar format. Data frames group vectors together into a two-dimensional table. Each vector becomes a column in the table. As a result, each column of a data frame can contain a different type of data; but within a column, every cell must be the same type of data, as in Figure 3-2.

df <- data.frame(face = c("ace", "two", "six"),  suit = c("clubs", "clubs", "clubs"), value = c(1, 2, 3)) 
df

In fact, each data frame is a list with class data.frame.

typeof(df)
[1] "list"
class(df) 
[1] "data.frame"
str(df) 
'data.frame':   3 obs. of  3 variables:
 $ face : Factor w/ 3 levels "ace","six","two": 1 3 2
 $ suit : Factor w/ 1 level "clubs": 1 1 1
 $ value: num  1 2 3

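Notice that R saved your character strings as factors. R likes factors! (This was the default in R versions before 4.0.0; since R 4.0.0, data.frame keeps character strings as characters unless you ask otherwise.) It is not a very big deal here, but you can prevent this behavior by adding the argument stringsAsFactors = FALSE to data.frame: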

df <- data.frame(face = c("ace", "two", "six"),  suit = c("clubs", "clubs", "clubs"), value = c(1, 2, 3),  stringsAsFactors = FALSE) 
df
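To confirm what each column holds, you can ask for the class of every column. The sketch below rebuilds df with stringsAsFactors = FALSE so that it runs on its own; sapply is a base R function that the text has not introduced and is used here only to apply class to each column.

```r
df <- data.frame(
  face  = c("ace", "two", "six"),
  suit  = c("clubs", "clubs", "clubs"),
  value = c(1, 2, 3),
  stringsAsFactors = FALSE
)

sapply(df, class)   # face and suit are "character", value is "numeric"
dim(df)             # 3 3: three rows and three columns
is.list(df)         # TRUE: a data frame is a list of equal-length columns
```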

You should avoid typing large data sets in by hand whenever possible. Typing invites typos and errors, not to mention RSI. It is always better to acquire large data sets as a computer file. You can then ask R to read the file and store the contents as an object.

Loading Data:

Dataset: https://gist.github.com/garrettgman/9629323

  • Most data-science applications can open plain-text files and export data as plain-text files.
  • This makes plain-text files a sort of lingua franca for data science.
  • To load a plain-text file into R, click the Import Dataset icon in RStudio, shown in Figure 3-3.
  • Then select “From text file.”

RStudio will ask you to select the file you want to import, then it will open a wizard to help you import the data, as in Figure 3-4. You can also unclick the box “Strings as factors” in the wizard. I recommend doing this. If you do, R will load all of your character strings as character strings. If you do not, R will convert them to factors.

deck
class(deck)
[1] "data.frame"

Once everything looks right, click Import. RStudio will read in the data and save it to a data frame. RStudio will also open a data viewer, so you can see your new data in a spreadsheet format. This is a good way to check that everything came through as expected.

Saving Data:

You can save any data frame in R to a .csv file with the command write.csv. To save deck, run:

write.csv(deck, file = "cards.csv", row.names = FALSE)

R will turn your data frame into a plain-text file with the comma-separated values format and save the file to your working directory. To see where your working directory is, run getwd().

getwd()
[1] "C:/Users/cevi herdian/Documents/MEGA/Data-Sciences/R Programming/tutorial/Hands On Programming with R/Github/hopr"
  

To change the location of your working directory, visit Session > Set Working Directory > Choose Directory in the RStudio menu bar.
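A minimal sketch of the save-and-reload round trip, assuming a small stand-in data frame called cards (hypothetical, not the full deck) and a temporary file so that nothing in your working directory is touched:

```r
# A small stand-in for deck (hypothetical three-card data frame)
cards <- data.frame(face = c("king", "queen", "jack"),
                    suit = c("spades", "spades", "spades"),
                    value = c(13, 12, 11),
                    stringsAsFactors = FALSE)

# Write to a temporary file instead of the working directory
path <- tempfile(fileext = ".csv")
write.csv(cards, file = path, row.names = FALSE)

# Read the file back; stringsAsFactors = FALSE keeps strings as characters
cards2 <- read.csv(path, stringsAsFactors = FALSE)
cards2
```

With row.names = FALSE the file holds only the three columns, so the reloaded data frame has the same columns as the original.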

Summary:

You can save data in R with five different objects, which let you store different types of values in different types of relationships, as in Figure 3-6. Data frames store one of the most common forms of data used in data science, tabular data. You can load tabular data into a data frame with RStudio’s Import Dataset button, so long as the data is saved as a plain-text file. If your data lives in another format, export it to plain text from the program that created it: no program is better at converting Excel files than Excel, no program is better at converting SAS Xport files than SAS, and so on.

  • Six types of R atomic vectors:
  1. Integer
  2. Double
  3. Logical
  4. Character
  5. Complex
  6. Raw
  • Five R objects:
  1. Vector
  2. Matrix
  3. Array
  4. List
  5. Data frame

See Appendix D.

4. R Notation

Selecting Values:

First of all, import deck.csv. To extract a value or set of values from a data frame, write the data frame’s name followed by a pair of hard brackets:

deck
class(deck)
[1] "data.frame"
deck[ , ]

You have a choice when it comes to writing indexes. There are six different ways to write an index for R, and each does something slightly different.

  1. Positive integers
  2. Negative integers
  3. Zero
  4. Blank spaces
  5. Logical values
  6. Names

The simplest of these to use is positive integers.

  1. Positive Integers

R treats positive integers just like ij notation in linear algebra: deck[i,j] will return the value of deck that is in the ith row and the jth column, Figure 4-1.

deck[1, 1]
[1] king
Levels: ace eight five four jack king nine queen seven six ten three two

To extract more than one value, use a vector of positive integers. For example, you can return the first row of deck with deck[1, c(1, 2, 3)] or deck[1, 1:3]:

deck[1, c(1, 2, 3)] 

R will give you a new set of values which are copies of the original values. You can then save this new set to an R object with R’s assignment operator:

new <- deck[1, c(1, 2, 3)] 
new 

R’s notation system is not limited to data frames. You can use the same syntax to select values in any R object, as long as you supply one index for each dimension of the object. So, for example, you can subset a vector (which has one dimension) with a single index:

vec <- c(6, 1, 3, 6, 10, 5)
vec[1:3]
[1] 6 1 3
  2. Negative Integers

Negative integers do the exact opposite of positive integers when indexing. R will return every element except the elements in a negative index.

deck[-(2:52), 1:3] 

Negative integers are a more efficient way to subset than positive integers if you want to include the majority of a data frame’s rows or columns.

  3. Zero

R will return nothing from a dimension when you use zero as an index. This creates an empty object:

deck[0, 0] 

data frame with 0 columns and 0 rows

To be honest, indexing with zero is not very helpful.

  4. Blank Spaces

You can use a blank space to tell R to extract every value in a dimension.

deck[1, ] 
  5. Logical Values

If you supply a vector of TRUEs and FALSEs as your index, R will match each TRUE and FALSE to a row in your data frame (or a column depending on where you place the index). R will then return each row that corresponds to a TRUE, Figure 4-2.

deck[1, c(TRUE, TRUE, FALSE)] 
  6. Names

Finally, you can ask for the elements you want by name, if your object has names.

deck[1, c("face", "suit", "value")]
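To compare the six kinds of index side by side, here is a small sketch on a hypothetical four-card data frame named cards (a stand-in for deck). Several of the selections are written so that they pick out the same two rows, which makes it easy to check that the notations agree.

```r
# Hypothetical four-card stand-in for deck
cards <- data.frame(face = c("king", "queen", "jack", "ten"),
                    suit = c("spades", "spades", "spades", "spades"),
                    value = c(13, 12, 11, 10),
                    stringsAsFactors = FALSE)

pos   <- cards[1:2, ]                            # positive integers: rows 1 and 2
neg   <- cards[-(3:4), ]                         # negative integers: drop rows 3 and 4
zero  <- cards[0, ]                              # zero: no rows, columns kept
blank <- cards[ , "value"]                       # blank: every row of one column
lgl   <- cards[c(TRUE, TRUE, FALSE, FALSE), ]    # logical: keep rows marked TRUE
nm    <- cards[1:2, c("face", "suit", "value")]  # names: columns chosen by name

all.equal(pos, neg)
all.equal(pos, lgl)
all.equal(pos, nm)
nrow(zero)
blank
```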

Deal a Card:

Complete the following code to make a function that returns the first row of a data frame:

deal <- function(cards) {
  cards[1, ]
}

deal(deck)

Shuffle the Deck:

deck2 <- deck[1:52, ]
deck2

head(deck2)

deck3 <- deck[c(2, 1, 3:52), ]
deck3

How could you generate such a random collection of integers? With our friendly neighborhood sample function:

random <- sample(1:52, size = 52)
random
 [1] 33 38 48 47 36  5 28  6  2 10 31  4 16 19 39 18  8 52 15 34 23 12  7 42 43 30 14 13 32 40 17
[32] 25 35  3 51 44 26 29 49  9  1 20 21 22 24 37 46 45 41 11 27 50

deck4 <- deck[random, ]
head(deck4)

Exercise:

Use the preceding ideas to write a shuffle function. Shuffle should take a data frame and return a shuffled copy of the data frame. Your shuffle function will look like the one that follows:

shuffle <- function(cards) {
  random <- sample(1:52, size = 52)
  cards[random, ]
}

deal(deck)
deck2 <- shuffle(deck)
deal(deck2)
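The shuffle function above assumes exactly 52 rows. As a hedged variation (not the original code), the sketch below sizes the permutation from the input with nrow, so it should work on a data frame of any length. The names shuffle_any and small are hypothetical.

```r
# Variant of shuffle that sizes itself from the input (an assumption, not the original code)
shuffle_any <- function(cards) {
  random <- sample(seq_len(nrow(cards)))  # random permutation of the row numbers
  cards[random, ]
}

small <- data.frame(face = c("king", "queen", "jack"),
                    value = c(13, 12, 11),
                    stringsAsFactors = FALSE)

set.seed(1)                # only to make the example repeatable
shuffled <- shuffle_any(small)
sort(shuffled$value)       # same cards, possibly in a new order
```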

Dollar Signs and Double Brackets:

Two types of object in R obey an optional second system of notation. You can extract values from data frames and lists with the $ syntax.

deck$value 
 [1] 13 12 11 10  9  8  7  6  5  4  3  2  1 13 12 11 10  9  8  7  6  5  4  3  2  1 13 12 11 10  9
[32]  8  7  6  5  4  3  2  1 13 12 11 10  9  8  7  6  5  4  3  2  1
mean(deck$value) 
[1] 7
median(deck$value)
[1] 7

You can use the same $ notation with the elements of a list, if they have names.

lst <- list(numbers = c(1, 2), logical = TRUE, strings = c("a", "b", "c")) 
lst 
$numbers
[1] 1 2

$logical
[1] TRUE

$strings
[1] "a" "b" "c"

And then subset it:

lst[1] 
$numbers
[1] 1 2
lst$numbers 
[1] 1 2
lst[[1]] 
[1] 1 2
lst[["numbers"]] 
[1] 1 2
  

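In the R community, there is a popular and helpful way to think about the difference between single and double brackets: if a list is a train carrying objects, single brackets return a shorter train (still a list), while double brackets return the contents of one car, Figure 4-3.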
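The practical consequence of the two bracket styles can be sketched as follows, reusing the lst object from above. The variable names one_bracket, two_bracket, total, and failed are only for illustration.

```r
lst <- list(numbers = c(1, 2), logical = TRUE, strings = c("a", "b", "c"))

one_bracket <- lst[1]      # a smaller list that still wraps the values
two_bracket <- lst[[1]]    # the values themselves

class(one_bracket)
class(two_bracket)

# Functions that expect a vector need the double-bracket (or $) form
total <- sum(lst[[1]])
total

# sum(lst[1]) stops with an error, because its argument is a list
failed <- inherits(try(sum(lst[1]), silent = TRUE), "try-error")
failed
```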

Summary:

You have learned how to access values that have been stored in R. You can retrieve a copy of values that live inside a data frame and use the copies for new computations.

5. Modifying Values

Changing Values in Place:

vec <- c(0, 0, 0, 0, 0, 0) 
vec
[1] 0 0 0 0 0 0

Here’s how you can select the first value of vec:

vec[1]
[1] 0

And here is how you can modify it:

vec[1] <- 1000
vec 
[1] 1000    0    0    0    0    0

You can replace multiple values at once as long as the number of new values equals the number of selected values:

vec[c(1, 3, 5)] <- c(1, 1, 1) 
vec
[1] 1 0 1 0 1 0
vec[4:6] <- vec[4:6] + 1
vec 
[1] 1 0 1 1 2 1

This provides a great way to add new variables to your data set. First, import deck.csv:

deck
deck2<-deck
deck2$new <- 1:52
deck2

You can also remove columns from a data frame (and elements from a list) by assigning them the symbol NULL:

deck2$new <- NULL
head(deck2)

You can single out just the values of the aces by subsetting the columns dimension of deck2. Or, even better, you can subset the column vector deck2$value:

deck2[c(13, 26, 39, 52), 3]
[1] 1 1 1 1
deck2$value[c(13, 26, 39, 52)]
[1] 1 1 1 1

Now all you have to do is assign a new set of values to these old values.

deck2$value[c(13, 26, 39, 52)] <- c(14, 14, 14, 14)
# or, letting R recycle the single value
deck2$value[c(13, 26, 39, 52)] <- 14
deck2

Logical Subsetting:

vec <- c(1, 0, 1, 1, 0, 2) 
vec
[1] 1 0 1 1 0 2
vec[c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)]
[1] 0

At first glance, this system might seem impractical. Who wants to type out long vectors of TRUEs and FALSEs? No one. But you don’t have to. You can let a logical test create a vector of TRUEs and FALSEs for you.

Logical Tests:

A logical test is a comparison like “is one less than two?”, 1 < 2, or “is three greater than four?”, 3 > 4. R provides seven logical operators that you can use to make comparisons, shown in Table 7-1 and Table 7-2.

Each operator returns a TRUE or a FALSE. If you use an operator to compare vectors, R will do element-wise comparisons, just like it does with the arithmetic operators:

1 > 2 
[1] FALSE
1 > c(0, 1, 2) 
[1]  TRUE FALSE FALSE
c(1, 2, 3) == c(3, 2, 1) 
[1] FALSE  TRUE FALSE

%in% is the only operator that does not do normal element-wise execution. %in% tests whether the value(s) on the left side are in the vector on the right side.

1 %in% c(3, 4, 5) 
[1] FALSE
c(1, 2) %in% c(3, 4, 5) 
[1] FALSE FALSE
c(1, 2, 3) %in% c(3, 4, 5)
[1] FALSE FALSE  TRUE
c(1, 2, 3, 4) %in% c(3, 4, 5)
[1] FALSE FALSE  TRUE  TRUE

Notice that you test for equality with a double equals sign, ==, and not a single equals sign, =, which is another way to write <-. You can compare any two R objects with a logical operator; however, logical operators make the most sense if you compare two objects of the same data type. If you compare objects of different data types, R will use its coercion rules to coerce the objects to the same type before it makes the comparison.
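Putting logical tests and logical subsetting together, a minimal sketch might look like this. The face vector is a hypothetical stand-in for deck$face, and is_positive is just an illustrative name.

```r
vec <- c(1, 0, 1, 1, 0, 2)

is_positive <- vec > 0     # element-wise test: one TRUE/FALSE per value
is_positive

vec[is_positive]           # keep only the values where the test is TRUE

# Hypothetical face column: which cards are aces or kings?
face <- c("king", "ace", "two", "ace")
face %in% c("ace", "king")
sum(face == "ace")         # TRUE counts as 1, so this counts the aces
```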

Missing Information:

Missing information problems happen frequently in data science. The NA character is a special symbol in R. It stands for “not available” and can be used as a placeholder for missing information. Generally, NAs will propagate whenever you use them in an R operation or function. This can save you from making errors based on missing data.

  • na.rm:

Missing values can help you work around holes in your data sets, but they can also create some frustrating problems. Suppose, for example, that you’ve collected 1,000 observations and wish to take their average with R’s mean function. If even one of the values is NA, your result will be NA:

c(NA, 1:50) 
 [1] NA  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
[32] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
mean(c(NA, 1:50)) 
[1] NA

Understandably, you may prefer a different behavior. Most R functions come with the optional argument, na.rm, which stands for NA remove. R will ignore NAs when it evaluates a function if you add the argument na.rm = TRUE:

mean(c(NA, 1:50), na.rm = TRUE) 
[1] 25.5
  • is.na:

On occasion, you may want to identify the NAs in your data set with a logical test, but that too creates a problem. How would you go about it? If something is a missing value, any logical test that uses it will return a missing value, even this test:

NA == NA
[1] NA

Which means that tests like this won’t help you find missing values:

c(1, 2, 3, NA) == NA 
[1] NA NA NA NA

But don’t worry too hard; R supplies a special function that can test whether a value is an NA. The function is sensibly named is.na:

is.na(NA) ## TRUE
[1] TRUE
vec <- c(1, 2, 3, NA)
is.na(vec)
[1] FALSE FALSE FALSE  TRUE
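Because is.na returns a logical vector, it combines naturally with the subsetting and assignment notation from earlier. The sketch below shows three common uses; the choice of 0 as a placeholder is only an assumption for the example.

```r
vec <- c(1, 2, 3, NA)

n_missing <- sum(is.na(vec))   # how many values are missing
n_missing

complete <- vec[!is.na(vec)]   # drop the missing values
complete

filled <- vec
filled[is.na(filled)] <- 0     # or replace them with a chosen placeholder
filled
```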

Let’s set all of your ace values to NA. This will accomplish two things. First, it will remind you that you do not know the final value of each ace. Second, it will prevent you from accidentally scoring a hand that has an ace before you determine the ace’s final value.

deck5<-deck
deck5$value[deck5$face == "ace"] <- NA
deck5

Summary:

You can modify values in place inside an R object when you combine R’s notation syntax with the assignment operator, <-. This lets you update your data and clean your data sets.

6. Environments-X

Environments:

Consider how your computer stores files: each file is saved in a folder, and each folder can sit inside another folder, which forms a hierarchy. For example, Figure 8.1 shows part of the file system on my computer. I have tons of folders. Inside one of them is a subfolder named Documents, inside of that subfolder is a sub-subfolder named ggsubplot, inside of that folder is a folder named inst, inside of that is a folder named doc, and inside of that is a file named manual.pdf.

R uses a similar system to save R objects. Each object is saved inside of an environment, a list-like object that resembles a folder on your computer. Each environment is connected to a parent environment, a higher-level environment, which creates a hierarchy of environments.

You can see R’s environment system with the parenvs function in the pryr package (note parenvs came in the pryr package when this book was first published). parenvs(all = TRUE) will return a list of the environments that your R session is using.

library(pryr)
parenvs(all = TRUE)

It takes some imagination to interpret this output, so let’s visualize the environments as a system of folders, Figure 8.2.

Working with Environments:

R comes with some helper functions that you can use to explore your environment tree.

as.environment("package:stats")
<environment: package:stats>
attr(,"name")
[1] "package:stats"
attr(,"path")
[1] "C:/Program Files/R/R-3.5.3/library/stats"

Three environments in your tree also come with their own accessor functions.

globalenv()
<environment: R_GlobalEnv>
baseenv()
<environment: base>
emptyenv()
<environment: R_EmptyEnv>

Next, you can look up an environment’s parent with parent.env:

parent.env(globalenv())
<environment: package:pryr>
attr(,"name")
[1] "package:pryr"
attr(,"path")
[1] "C:/Users/cevi herdian/Documents/R/win-library/3.5/pryr"

You can view the objects saved in an environment with ls or ls.str.

ls(globalenv())
[1] "a"     "b"     "c"     "data"  "deck"  "poker" "stuff"

You can use R’s $ syntax to access an object in a specific environment. For example, you can access deck from the global environment:

head(globalenv()$deck, 3)

And you can use the assign function to save an object into a particular environment.

assign("new", "Hello Global", envir = globalenv())
globalenv()$new
[1] "Hello Global"
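The same tools work on any environment, not only the global one. As a sketch, the example below creates a throwaway environment called scratch (a hypothetical name) with new.env, so experimenting does not clutter your workspace.

```r
scratch <- new.env()                 # hypothetical environment for this example

assign("greeting", "Hello Scratch", envir = scratch)
scratch$greeting                     # $ works on environments too
get("greeting", envir = scratch)     # get is the counterpart of assign

ls(scratch)                          # names of the objects stored in scratch
```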
  • The Active Environment

At any moment in time, R is working closely with a single environment. R will store new objects in this environment (if you create any), and R will use this environment as a starting point to look up existing objects (if you call any).

environment()
<environment: R_GlobalEnv>

The global environment plays a special role in R. It is the active environment for every command that you run at the command line. As a result, any object that you create at the command line will be saved in the global environment. You can think of the global environment as your user workspace.

When you call an object at the command line, R will look for it first in the global environment. But what if the object is not there? In that case, R will follow a series of rules to look up the object.

Scoping Rules:

R follows a special set of rules to look up objects. These rules are known as R’s scoping rules:

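  1. R looks for objects in the current active environment.
  2. When you work at the command line, the active environment is the global environment. Hence, R looks up objects that you call at the command line in the global environment as in Figure 8.3.
  3. When R does not find an object in an environment, R looks in the environment’s parent environment, then the parent of the parent, and so on, until R finds the object or reaches the empty environment.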
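When a name is not found in the starting environment, the lookup continues in its parent. This can be sketched with two hypothetical environments, outer_env and inner_env, built with new.env so that the example does not depend on what is in your workspace.

```r
outer_env <- new.env()
inner_env <- new.env(parent = outer_env)   # inner_env's parent is outer_env

assign("x", "defined in outer", envir = outer_env)

# x is not stored in inner_env itself...
exists("x", envir = inner_env, inherits = FALSE)

# ...but a normal lookup climbs to the parent and finds it
get("x", envir = inner_env)

# An object in the nearer environment masks the one in the parent
assign("x", "defined in inner", envir = inner_env)
get("x", envir = inner_env)
```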

Assignment:

When you assign a value to an object, R saves the value in the active environment under the object’s name. If an object with the same name already exists in the active environment, R will overwrite it.

new
[1] "Hello Global"
new <- "Hello Active"
new
[1] "Hello Active"

This arrangement creates a quandary for R whenever R runs a function. Many functions save temporary objects that help them do their jobs.

roll <- function() {
  die <- 1:6
  dice <- sample(die, size = 2, replace = TRUE)
  sum(dice)
}
roll()
[1] 9

Evaluation:

R creates a new environment each time it evaluates a function. R will use the new environment as the active environment while it runs the function, and then R will return to the environment that you called the function from, bringing the function’s result with it.

We want to know what the environments look like: what are their parent environments, and what objects do they contain? show_env is designed to tell us:

show_env <- function(){
  list(ran.in = environment(), 
    parent = parent.env(environment()), 
    objects = ls.str(environment()))
}

show_env is itself a function, so when we call show_env(), R will create a runtime environment to evaluate the function in.

show_env()
$ran.in
<environment: 0x000000000a0831e0>

$parent
<environment: R_GlobalEnv>

$objects

This time show_env ran in a new environment, 0x000000000a0831e0
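To see that every call gets its own runtime environment, a small sketch can return the environment from the function and compare two calls. The function name make_env and the object temp are hypothetical.

```r
make_env <- function() {
  temp <- "only visible inside this call"   # saved in the runtime environment
  environment()                              # return that environment
}

e1 <- make_env()
e2 <- make_env()

identical(e1, e2)                                  # each call gets a fresh environment
identical(parent.env(e1), environment(make_env))   # parent is where make_env was defined
ls(e1)                                             # objects created during the call
```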

==Project 3: Slot Machine==

7. Programs-X

The following function generates three symbols from a group of common slot machine symbols:

get_symbols <- function() {
  wheel <- c("DD", "7", "BBB", "BB", "B", "C", "0")
  sample(wheel, size = 3, replace = TRUE, 
    prob = c(0.03, 0.03, 0.06, 0.1, 0.25, 0.01, 0.52))
}
get_symbols()
[1] "0"   "0"   "BBB"

get_symbols uses the probabilities observed in a group of video lottery terminals from Manitoba, Canada. The Manitoba slot machines use the complicated payout scheme shown in Table 9.1. A player will win a prize if he gets:

  1. Three of the same type of symbol (except for three zeroes)
  2. Three bars (of mixed variety)
  3. One or more cherries
  • Sequential Steps

To have R execute steps in sequence, place the steps one after another in an R script or function body. See Figure 9.1

  • Parallel Cases

Another way to divide a task is to spot groups of similar cases within the task. Some tasks require different algorithms for different groups of input. If you can identify those groups, you can work out their algorithms one at a time. See Figure 9.2 and Figure 9.3

if Statements:

An if statement tells R to do a certain task for a certain case. In English you would say something like, “If this is true, do that.” In R, you would say:

# if (this) {
#   that
# }
x <- 1
if (3 == 3) {
  x <- 2
}
x
[1] 2

else Statements:

# if (this) {
#   Plan A
# } else {
#   Plan B
# }
a <- 1
b <- 1
if (a > b) {
  print("A wins!")
} else if (a < b) {
  print("B wins!")
} else {
  print("Tie.")
}
[1] "Tie."
# if ( # Case 1: all the same <1>) {
#   prize <- # look up the prize <3>
# } else if ( # Case 2: all bars <2> ) {
#   prize <- # assign $5 <4>
# } else {
#   # count cherries <5>
#   prize <- # calculate a prize <7>
# }

If you like, you can reorganize your flow chart around these tasks, as in Figure 9.4. The chart will describe the same strategy, but in a more precise way. I’ll use a diamond shape to symbolize an if else decision.

Lookup Tables:

Very often in R, the simplest way to do something will involve subsetting. How could you use subsetting here? Since you know the exact relationship between the symbols and their prizes, you can create a vector that captures this information.

payouts <- c("DD" = 100, "7" = 80, "BBB" = 40, "BB" = 25, 
  "B" = 10, "C" = 10, "0" = 0)
payouts
 DD   7 BBB  BB   B   C   0 
100  80  40  25  10  10   0 

Now you can extract the correct prize for any symbol by subsetting the vector with the symbol’s name:

payouts["DD"]
 DD 
100 
payouts["B"]
 B 
10 

If you want to leave behind the symbol’s name when subsetting, you can run the unname function on the output:

unname(payouts["DD"])
[1] 100

unname returns a copy of an object with the names attribute removed.

payouts is a type of lookup table, an R object that you can use to look up values. Subsetting payouts provides a simple way to find the prize for a symbol. It doesn’t take many lines of code, and it does the same amount of work whether your symbol is DD or 0.

Sadly, our method is not quite automatic; we need to tell R which symbol to look up in payouts:

symbols <- c("7", "7", "7")
symbols[1]
[1] "7"
payouts[symbols[1]]
 7 
80 
symbols <- c("C", "C", "C")
payouts[symbols[1]]
 C 
10 

You don’t need to know the exact symbol to look up because you can tell R to look up whichever symbol happens to be in symbols.
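Combining the lookup table with an if else statement gives a first piece of the scoring logic. The sketch below covers only the three-of-a-kind case; the helper name three_of_a_kind_prize is hypothetical, and mixed bars and cherries are deliberately left out.

```r
payouts <- c("DD" = 100, "7" = 80, "BBB" = 40, "BB" = 25,
             "B" = 10, "C" = 10, "0" = 0)

# Hypothetical helper: prize for three identical symbols, 0 otherwise
three_of_a_kind_prize <- function(symbols) {
  same <- symbols[1] == symbols[2] && symbols[2] == symbols[3]
  if (same) {
    unname(payouts[symbols[1]])   # look up the shared symbol
  } else {
    0                             # mixed bars and cherries are not handled here
  }
}

three_of_a_kind_prize(c("7", "7", "7"))
three_of_a_kind_prize(c("DD", "7", "7"))
```

Because "0" maps to 0 in payouts, three zeroes fall out of the lookup with no prize and need no special case.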

8. S3-X

9. Loops-X

10. Speed-X

Appendix

A. Installing R and Rstudio

You’ll go from downloading R to opening your first R session. Both R and RStudio are free and easy to download (Open Source).

How to Download and Install R:

R is maintained by an international team of developers who make the language available through the web page of The Comprehensive R Archive Network. The top of the web page provides three links for downloading R. Follow the link that describes your operating system: Windows, Mac, or Linux. https://cran.r-project.org/

  • Windows/ Mac/ Linux

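To install R on Windows, click the “Download R for Windows” link. Then click the “base” link. Next, click the first link at the top of the new page. On Mac or Linux, start from the download link for your operating system instead and follow the instructions on that page.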

  • 32-bit Versus 64-bit

64-bit R uses 64-bit memory pointers, and 32-bit R uses 32-bit memory pointers. This means 64-bit R has a larger memory space to use (and search through). The terms 32-bit and 64-bit refer to the way a computer’s processor (also called a CPU), handles information. The 64-bit version of Windows handles large amounts of random access memory (RAM) more effectively than a 32-bit system. Source: https://support.microsoft.com/en-us/help/15056/windows-7-32-64-bit-faq

Using R:

R isn’t a program that you can open and start using, like Microsoft Word or Internet Explorer. Instead, R is a computer language, like C or C++. Now almost everyone uses R with an application called the RStudio IDE (integrated development environment).

Windows and Mac users usually do not program from a terminal window, so the Windows and Mac downloads for R come with a simple program that opens a terminal-like window for you to run R code in. You may hear people refer to them as the Windows or Mac R GUIs (graphical user interfaces).

When you open RStudio, a window appears with three panes in it, as in Figure A-1. The largest pane is a console window. This is where you’ll run your R code and see results. The console window is exactly what you’d see if you ran R from a UNIX console or the Windows or Mac GUIs.

RStudio:

RStudio is an application like Microsoft Word, except that instead of helping you write in English, RStudio helps you write in R.

Getting started R Programming:

Now that you have both R and RStudio on your computer, you can begin using R by opening the RStudio program. Open RStudio just as you would any program, by clicking on its icon or by typing “RStudio” at the Windows Run prompt.

B. R Packages

  • Many of R’s most useful functions do not come preloaded when you start R, but reside in packages that can be installed on top of R.
  • R packages are similar to libraries in C, C++, and JavaScript, packages in Python, and gems in Ruby. An R package bundles together useful functions, help files, and data sets. You can use these functions within your own R code once you load the package they live in.

R packages will let you take advantage of R’s most useful features:

  • its large community of package writers (many of whom are active data scientists)
  • and its prewritten routines for handling many common (and exotic) data-science tasks

Base R: It is just the collection of R functions that gets loaded every time you start R. These functions provide the basics of the language, and you don’t have to load a package before you can use them.

Installing Packages:

The easiest way to install an R package is with the install.packages R function:

install.packages("package name")

This will search for the specified package in the collection of packages hosted on the CRAN site. When R finds the package, it will download it into a libraries folder on your computer. R can access the package here in future R sessions without reinstalling it.

You can install multiple packages at once by linking their names with R’s concatenate function, c. For example, to install the ggplot2, reshape2, and dplyr packages, run:

install.packages(c("ggplot2", "reshape2", "dplyr"))
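If you rerun a notebook often, you may prefer to install only what is missing. The following is a minimal sketch under that assumption; missing_packages is a hypothetical helper, and the demonstration uses stats and utils because they ship with R, so nothing is downloaded.

```r
# Hypothetical helper: report which packages are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# stats and utils ship with R, so nothing should be reported here
to_install <- missing_packages(c("stats", "utils"))
to_install

# Install only when something is actually missing
if (length(to_install) > 0) {
  install.packages(to_install)
}
```

To apply this to the packages above, you would call missing_packages(c("ggplot2", "reshape2", "dplyr")) instead.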

Loading Packages:

Installing a package doesn’t immediately place its functions at your fingertips. It just places them on your computer. To use an R package, you next have to load it in your R session with the command:

library(package name)

library will make all of the package’s functions, data sets, and help files available to you until you close your current R session. The next time you begin an R session, you’ll have to reload the package with library if you want to use it, but you won’t have to reinstall it.

You only have to install each package once. After that, a copy of the package will live in your R library. To see which packages you currently have in your R library, run:

library()

library() also shows the path to your actual R library, which is the folder that contains your R packages. You may notice many packages that you don’t remember installing. This is because R automatically downloads a set of useful packages when you first install R.

Install packages from (almost) anywhere:

The devtools R package makes it easy to install packages from locations other than the CRAN website. devtools provides functions like

  • install_github
  • install_gitorious
  • install_bitbucket
  • install_url

These work similarly to install.packages, but they search new locations for R packages. install_github is especially useful because many R developers provide development versions of their packages on GitHub. The development version of a package will contain a sneak peek of new functions and patches but may not be as stable or as bug free as the CRAN version.

Why does R make you bother with installing and loading packages?

  • If R loaded every package at startup, it would be a very large and slow program.
  • It is simpler to only install and load the packages that you want to use when you want to use them.
  • It is possible to update your copy of an R package without updating your entire copy of R.

What’s the best way to learn about R packages?

C. Updating R and Its Packages

The R Core Development Team continuously hones the R language by catching bugs, improving performance, and updating R to work with new technologies. The easiest way to stay current with R is to periodically check the CRAN website: https://cran.r-project.org

RStudio also constantly improves its product. You can acquire the newest updates just by downloading them from RStudio.

  • R Packages

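Package authors occasionally release new versions of their packages to add functions, fix bugs, or improve performance. To check for and install newer versions of specific packages, pass their names to the oldPkgs argument of update.packages (the first positional argument is lib.loc, a library path, not a list of package names):

update.packages(oldPkgs = c("ggplot2", "reshape2", "dplyr"))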

D. Loading and Saving Data in R

  • Data Sets in Base R

Import deck.

deck

These data sets are not very interesting, but they give you a chance to test code or make a point without having to load a data set from outside R. You can see a list of R’s data sets as well as a short description of each by running: help(package = "datasets")

help(package = "datasets")

To use a data set, just type its name. Each data set is already presaved as an R object. For example: iris (Edgar Anderson’s Iris Data)

iris
  • Working Directory

Each time you open R, it links itself to a directory on your computer, which R calls the working directory. This is where R will look for files when you attempt to load them, and it is where R will save files when you save them.

To determine which directory R is using as your working directory, run: getwd()

getwd()

You can place data files straight into the folder that is your working directory, or you can move your working directory to where your data files are. You can move your working directory to any folder on your computer with the function setwd().

You can also change your working directory by clicking on Session > Set Working Directory > Choose Directory in the RStudio menu bar.

You can see what files are in your working directory with list.files().

list.files()
  • Plain-text Files

Plain-text files are one of the most common ways to save data. They are very simple and can be read by many different computer programs-even the most basic text editors. For this reason, public data often comes as plain-text files.

All plain-text files can be saved with the extension .txt (for text), but sometimes a file will receive a special extension that advertises how it separates data-cell entries. Since entries in the data set mentioned earlier are separated with a comma, this file would be a comma-separated-values file and would usually be saved with the extension .csv.

  • read.table

To load a plain-text file, use read.table. If the royal flush data set was saved as a file named poker.csv in your working directory, you could load it with:

poker <- read.table("poker.csv", sep = ",", header = TRUE)

For more about read.table, see the documentation: https://www.rdocumentation.org/packages/utils/versions/3.5.3/topics/read.table

sep: Use sep to tell read.table what character your file uses to separate data entries.

header: Use header to tell read.table whether the first line of the file contains variable names instead of values.

na.strings: Oftentimes data sets will use special symbols to represent missing information. If you know that your data uses a certain symbol to represent missing entries, you can tell read.table (and the preceding functions) what the symbol is with the na.strings argument.

skip and nrows: Use skip to tell R to skip a specific number of lines before it starts reading in values from the file. Use nrows to tell R to stop reading in values after it has read in a certain number of lines.

stringsAsFactors: Setting the argument stringsAsFactors to FALSE will ensure that R saves any character strings in your data set as character strings, not factors.
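The arguments described above can be tried without a file on disk, because read.table also accepts the file contents through its text argument. The data in raw is made up for this sketch: a leading note line, a header, and "." standing in for a missing value.

```r
# Hypothetical file contents: a note line, a header, and "." for a missing value
raw <- "exported on some date
face,suit,value
king,spades,13
queen,spades,.
jack,spades,11
ten,spades,10"

poker <- read.table(text = raw,
                    sep = ",",                 # cells are separated by commas
                    header = TRUE,             # first line read holds the names
                    skip = 1,                  # ignore the leading note line
                    nrows = 3,                 # stop after three data rows
                    na.strings = ".",          # treat "." as missing
                    stringsAsFactors = FALSE)  # keep strings as characters
poker
str(poker)
```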

If you will be loading more than one data file, you can change the default factoring behavior at the global level with: options(stringsAsFactors = FALSE)

Or

You can use Import Dataset in the toolbar of the Environment pane and choose From Text, From Excel, and so on.

The read Family:

R also comes with some prepackaged short cuts for read.table, shown in Table D-1.

‘read.delim2’ and ‘read.csv2’ exist for European R users. These functions tell R that the data uses a comma instead of a period to denote decimal places. (If you’re wondering how this works with CSV files, CSV2 files usually separate cells with a semicolon, not a comma.)
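As a small illustration of the European conventions, the sketch below feeds read.csv2 a made-up string (raw) that uses semicolons between cells and a comma as the decimal mark. The column names item and price are hypothetical.

```r
# Hypothetical European-style data: semicolons between cells, comma as decimal mark
raw <- "item;price\ncoffee;2,50\ntea;1,75"

prices <- read.csv2(text = raw, stringsAsFactors = FALSE)
prices$price          # parsed as numbers, not strings
sum(prices$price)
```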

  • read.fwf

One type of plain-text file defies the pattern by using its layout to separate data cells. Each row is placed in its own line (as with other plain-text files), and then each column begins at a specific number of characters from the lefthand side of the document.

You can read fixed-width files into R with the function read.fwf. The function takes the same arguments as read.table but requires an additional argument, widths, which should be a vector of numbers. Each ith entry of the widths vector should state the width (in characters) of the ith column of the data set.

poker <- read.fwf("poker.fwf", widths = c(10, 7, 6), header = TRUE)
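To experiment with widths, you can generate a small fixed-width file yourself. This sketch builds one in a temporary file with sprintf; the layout (10, 7, and 2 characters) and the column names are assumptions chosen for the example, and strip.white = TRUE is passed through to read.table to drop the padding spaces.

```r
# Build a small fixed-width file: 10 characters for face, 7 for suit, 2 for value
lines <- sprintf("%-10s%-7s%2d", c("king", "queen", "jack"), "spades", c(13L, 12L, 11L))
path <- tempfile(fileext = ".fwf")
writeLines(lines, path)

poker <- read.fwf(path,
                  widths = c(10, 7, 2),
                  col.names = c("face", "suit", "value"),
                  strip.white = TRUE,          # drop the padding spaces
                  stringsAsFactors = FALSE)
poker
```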

  • HTML Links

You can pass a web address into the file name argument for any of R’s data-reading functions:

poker <- read.csv("http://.../poker.csv")

Note that websites that begin with https:// are secure websites, which means R may not be able to access the data provided at these links.

Saving Plain-Text Files:

Once your data is in R, you can save it to any file format that R supports. The three basic write functions appear in Table D-2.

Use write.csv to save your data as a .csv file and write.table to save your data as a tab delimited document or a document with more exotic separators:

write.csv(poker, "data/poker.csv", row.names = FALSE)

  • Compressing Files

To compress a plain-text file, surround the file name or file path with the function bzfile, gzfile, or xzfile. For example:

write.csv(poker, file = bzfile("data/poker.csv.bz2"), row.names = FALSE)

Each of these functions will compress the output with a different type of compression format, shown in Table D-3.
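A round trip through a compressed file can be sketched as follows, here with gzfile and a temporary path. The two-row poker data frame is a hypothetical stand-in for the real data.

```r
poker <- data.frame(face = c("king", "queen"),
                    value = c(13, 12),
                    stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv.gz")

# gzfile() wraps the path in a connection that compresses on write
write.csv(poker, file = gzfile(path), row.names = FALSE)

# The same kind of connection decompresses on read
poker2 <- read.csv(gzfile(path), stringsAsFactors = FALSE)
poker2
```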

  • R Files

R provides two file formats of its own for storing data, .RDS and .RData. RDS files can store a single R object, and RData files can store multiple R objects.

You can open an RDS file with readRDS. For example, if the royal flush data was saved as poker.RDS, you could open it with:

poker <- readRDS("poker.RDS")

Opening RData files is even easier. Simply run the function load with the file:

load("file.RData")

Both readRDS and load take a file path as their first argument, just like R’s other read and write functions. If your file is in your working directory, the file path will be the file name.

  • Saving R Files

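You can save a single R object as an RDS file with saveRDS, or save one or more objects in an RData file with save. For example, if you have three R objects, a, b, and c, you could save them all in the same RData file and then reload them in another R session:

a <- 1
b <- 2
c <- 3
# Save all three objects, by name, in one RData file
save(a, b, c, file = "stuff.RData")
# load() restores the objects and returns their names
stuff <- load("stuff.RData")
stuff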

Saving your data as an R file offers some advantages over saving your data as a plaintext file. R automatically compresses the file and will also save any R-related metadata associated with your object. This can be handy if your data contains factors, dates and times, or class attributes. You won’t have to reparse this information into R the way you would if you converted everything to a text file.
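The chunk below is a rough sketch of that difference. It saves a made-up data frame containing a factor and a date column both as an RDS file and as a .csv file, then reads each back. The data, the object names, and the temporary file paths are assumptions for illustration.

```r
# Hypothetical data with a factor column and a Date column
hands <- data.frame(
  suit  = factor(c("spades", "hearts"),
                 levels = c("spades", "hearts", "clubs", "diamonds")),
  dealt = as.Date(c("2019-04-13", "2019-04-14"))
)

# Round trip through an RDS file
rds_path <- tempfile(fileext = ".RDS")
saveRDS(hands, rds_path)
from_rds <- readRDS(rds_path)

# Round trip through a plain-text file
csv_path <- tempfile(fileext = ".csv")
write.csv(hands, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

# The RDS copy keeps the factor levels and the Date class
levels(from_rds$suit)
class(from_rds$dealt)

# The CSV copy comes back as plain text and must be reparsed
class(from_csv$suit)
class(from_csv$dealt)
```

To recover the original types from the .csv copy, you would have to rebuild the factor with its full set of levels and convert the dates with as.Date.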

On the other hand, R files cannot be read by many other programs, which makes them inefficient for sharing. They may also create a problem for long-term storage if you don’t think you’ll have a copy of R when you reopen the files.

Excel Spreadsheets:

Microsoft Excel is a popular spreadsheet program that has become almost industry standard in the business world. There is a good chance that you will need to work with an Excel spreadsheet in R at least once in your career. You can read spreadsheets into R and also save R data as a spreadsheet in a variety of ways.

  • Export from Excel

The best method for moving data from Excel to R is to export the spreadsheet from Excel as a .csv or .txt file. Not only will R be able to read the text file, so will any other data analysis software. Text files are the lingua franca of data storage.

To export data from Excel, open the Excel spreadsheet and then go to Save As in the Microsoft Office Button menu. Then choose CSV in the Save as type box that appears and save the file. You can then read the file into R with the read.csv function.

  • Copy and Paste

You can also copy portions of an Excel spreadsheet and paste them into R. To do this, open the spreadsheet and select the cells you wish to read into R. Then select Edit > Copy in the menu bar (or use a keyboard shortcut) to copy the cells to your clipboard. On most operating systems, you can read the data stored in your clipboard into R with:

read.table("clipboard")

On a Mac, use read.table(pipe("pbpaste")) instead.

  • XLConnect

Many packages have been written to help you read Excel files directly into R. Unfortunately, many of these packages do not work on all operating systems. Others have been made out of date by the .xlsx file format. One package that does work on all operating systems (and gets good reviews) is the XLConnect package. To use it, you’ll need to install and load the package:

install.packages("XLConnect")
library(XLConnect)

  • Reading Spreadsheets

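You can use XLConnect to read in an Excel spreadsheet with either a one- or a two-step process. In the two-step process, first load the workbook with loadWorkbook, then read a worksheet from it with readWorksheet: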

wb <- loadWorkbook("file.xlsx")

sheet1 <- readWorksheet(wb, sheet = 1, startRow = 0, startCol = 0, endRow = 100, endCol = 3)

You can combine these two steps with readWorksheetFromFile. It takes the file argument from loadWorkbook and combines it with the arguments from readWorksheet. You can use it to read one or more sheets straight from an Excel file:

sheet1 <- readWorksheetFromFile("file.xlsx", sheet = 1, startRow = 0, startCol = 0, endRow = 100, endCol = 3)

  • Writing Spreadsheets

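Writing to an Excel spreadsheet is a four-step process: set up a workbook object, create a worksheet in it, write the data to the worksheet, and save the workbook to the file.

  1. wb <- loadWorkbook("file.xlsx", create = TRUE)
  2. createSheet(wb, "Sheet 1")
  3. writeWorksheet(wb, data = poker, sheet = "Sheet 1")
  4. saveWorkbook(wb)

You can collapse these steps into a single call with writeWorksheetToFile:

writeWorksheetToFile("file.xlsx", data = poker, sheet = "Sheet 1", startRow = 1, startCol = 1)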

Loading Files from Other Programs:

You should follow the same advice I gave you for Excel files whenever you wish to work with file formats native to other programs: open the file in the original program and export the data as a plain-text file, usually a CSV. This will ensure the most faithful transcription of the data in the file, and it will usually give you the most options for customizing how the data is transcribed.

Sometimes, however, you may acquire a file but not the program it came from. As a result, you won’t be able to open the file in its native program and export it as a text file. In this case, you can use one of the functions in Table D-4 to open the file.

  • Connecting to Databases

Working with a database will require experience that goes beyond the skill set of a typical R user. However, if you are interested in doing this, the best place to start is by downloading these R packages and reading their documentation. Use the DBI package to connect to databases through individual drivers.

For MySQL use RMySQL, for SQLite use RSQLite, for Oracle use ROracle, for PostgreSQL use RPostgreSQL, and for databases that use drivers based on the Java Database Connectivity (JDBC) API use RJDBC. Once you have loaded the appropriate driver package, you can use the commands provided by DBI to access your database.
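As a minimal sketch of the DBI workflow, the chunk below uses the RSQLite driver with a temporary in-memory database, so no database server is needed. The table name, the data, and the query are assumptions for illustration, and the chunk only runs its database steps if both DBI and RSQLite are installed.

```r
# Hypothetical data frame standing in for the poker data
poker_demo <- data.frame(
  card  = c("ace", "king", "queen"),
  suit  = c("spades", "spades", "spades"),
  value = c(14L, 13L, 12L)
)

if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  # Connect through the SQLite driver to a temporary in-memory database
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

  # Copy the data frame into a database table named "poker"
  DBI::dbWriteTable(con, "poker", poker_demo)

  # Send an SQL query and collect the result as a data frame
  high_cards <- DBI::dbGetQuery(
    con, "SELECT card, value FROM poker WHERE value >= 13"
  )

  # Always close the connection when you are done
  DBI::dbDisconnect(con)
}
```

Switching to another database should mostly mean swapping the driver passed to dbConnect and supplying that database's connection details; the other DBI calls keep the same form.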

E. Debugging R Code-X

Change log update

  • 19.01.2019
  • 13.04.2019
  • 14.04.2019
  • 15.04.2019
  • 16.04.2019
  • 17.04.2019


License

MIT

