A R programming

This appendix on R programming is a brief overview of important programming constructs and concepts in R that are used in the book. For a detailed and much more extensive coverage of R as a programming language the reader is referred to the book Advanced R.

Depending on your background, the appendix can serve different purposes. If you already have experience with programming in R, this appendix can serve as a brush-up on some basic aspects of the R language that are used throughout the book. If you have experience with programming in languages other than R, it can serve as an introduction to typical R programming techniques, some of which may differ from what you know from other languages. If you have little prior experience with programming, this appendix can teach you the most important things for getting started with the book, but you are encouraged to follow up with additional and more detailed material like Advanced R.

For everybody, this appendix covers specific topics relevant for reading the book. These topics include

  • data types, comparisons and numerical precision
  • vectorized computations
  • functions, environments and function factories
  • performance assessment and improvement
  • S3 objects and methods

Several important topics, such as S4 and R6 objects, expressions, and data wrangling, are not covered in any detail. The book is on implementations of correct and efficient numerical algorithms used in statistics, and this is reflected in the topics covered in this appendix.

A.1 Data structures

The fundamental data structure in R is a vector. Even variables that look and behave like single numbers are vectors of length one. Vectors come in two flavors: atomic vectors and lists.

An atomic vector is an indexed collection of data elements that are all of the same type, e.g.

  • integers
  • floating point numbers
  • logical values
  • character strings

A list is an indexed collection of elements without any type restrictions on the individual elements. An element in a list can, for instance, be a list itself.
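
As a small illustration (the name mixed_list is our own), a list can mix a numeric vector, a string and another list.

mixed_list <- list(c(1, 2, 3), "a string", list(TRUE))
length(mixed_list)
## [1] 3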

A.1.1 Atomic vectors

You can construct a vector in R by simply typing in its elements, e.g. 

first_vector <- c(1, 2, 3)  # Note the 'c'
first_vector
## [1] 1 2 3

The constructed vector contains the numbers 1, 2 and 3. We use the classical assignment operator <- throughout, though R also supports = if you prefer. The c() used on the right-hand side of the assignment is short for combine (or concatenate), and it is also used to combine two vectors into one.

second_vector <- c(4, 5, 6)
c(first_vector, second_vector)
## [1] 1 2 3 4 5 6

There are several convenient techniques in R for constructing vectors with a regular structure, e.g. sequences. The following example shows how to construct a vector containing the integers from 1 to 10. The type of the vector is integer, indicating that the elements of the vector are stored as integers.

integer_vector <- 1:10
integer_vector
##  [1]  1  2  3  4  5  6  7  8  9 10
typeof(integer_vector)  # `typeof` reveals the internal storage mode
## [1] "integer"

We can access the individual elements as well as subsets of a vector by indices using square brackets.

integer_vector[3]
## [1] 3
integer_vector[first_vector]  # The first three elements
## [1] 1 2 3

The function seq generalizes the colon operator, 1:10, as a way to generate regular sequences. The following example shows how to generate a sequence, double_vector, from 0.1 to 1.0 with increments of size 0.1. The type of the resulting vector is double, which indicates that the elements of double_vector are stored as doubles. That is, the numbers are stored as floating point numbers using 64 bits of storage with a precision of roughly 16 significant decimal digits.

double_vector <- seq(0.1, 1, by = 0.1)
double_vector
##  [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
typeof(double_vector)
## [1] "double"

Integers are often stored as or coerced into doubles automatically. The supposedly integer vector, first_vector, is actually of type double.

typeof(first_vector)
## [1] "double"

In R, numerical data of either type integer or double is collectively referred to as numeric and has mode (and class) numeric. This is confusing, in particular because “numeric” is also used as a synonym for the type double, but it is rarely a practical problem. A vector of type integer or double is often just said to be numeric, and the function is.numeric() reflects this.

It is possible to insist that integers are actually stored as integers by appending L to each integer, e.g.

typeof(c(1L, 2L, 3L))
## [1] "integer"

When sequences are generated using the colon operator with the endpoints being integers, as in 1:10, the result will be a vector of type integer. This is an exception. Apparent integers are usually – and silently – converted into doubles if they are not explicitly marked as integers by L.
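
A quick comparison of the two constructions illustrates this.

typeof(1:3)
## [1] "integer"
typeof(c(1, 2, 3))
## [1] "double"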

Vectors of any length can be created by the general constructor vector(), or by type-specific functions such as numeric(), which creates vectors of type double.

vector("numeric", length = 10) 
##  [1] 0 0 0 0 0 0 0 0 0 0
##  [1] 0 0 0 0 0 0 0 0 0 0

Both vectors above are of type double and of length 10 and initialized with all elements being 0.

A logical vector is another example of a useful atomic vector. The default type of a vector created by vector() is logical, with all elements being FALSE.

vector(length = 10)
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Logical vectors are encountered when we compare the elements of one vector to another vector or to a number.

logical_vector <- integer_vector > 4
logical_vector
##  [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
typeof(logical_vector)
## [1] "logical"

While a logical vector has its own type and is stored efficiently as such, it behaves in many ways like a numeric vector with FALSE equivalent to 0 and TRUE equivalent to 1. If we want to compute the relative frequency of elements in integer_vector that are (strictly) larger than 4, say, we can simply take the mean of logical_vector.

mean(logical_vector)
## [1] 0.6

The mean of a logical vector behaves as if the logical values are coerced into zeros and ones before the mean is computed.
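
Similarly, sum() applied to a logical vector counts the number of TRUE values.

sum(logical_vector)  # The number of elements (strictly) larger than 4
## [1] 6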

A final example of an atomic vector is a character vector. In this example, the vector is combined from six individual elements to form a vector of length 6. Combining strings into a vector does not paste the strings together – it forms a vector whose elements are the individual strings.

character_vector <- c("A", "vector", "of", "length", 6, ".")
character_vector
## [1] "A"      "vector" "of"     "length" "6"      "."
typeof(character_vector)
## [1] "character"

The type of the vector is character. Elements of a vector of type character are strings. Note how the numeric value 6 in the construction of the vector was automatically coerced into the string "6".

It is possible to paste together the strings in a character vector.

paste(character_vector, collapse = " ")  # Now a character vector of length 1!
## [1] "A vector of length 6 ."

It is likewise possible to split a string according to a pattern, for instance into its individual characters.

strsplit(character_vector[2], split = "")  # Split "vector" into characters
## [[1]]
## [1] "v" "e" "c" "t" "o" "r"

Various additional string operations are available – see ?character for more information.

In summary, atomic vectors are the primitive data structures used in R, with elements being accessible via indices (random access). Typical vectors contain numbers, logical values or strings. There are no declarations of data types – they are inferred from data or computations. This is a flexible type system, with many operations in R silently coercing elements in a vector from one type to another.
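
For instance, combining elements of different types in a single vector silently coerces them all to the most general type.

typeof(c(1L, 1.5))   # Integer and double combine to double
## [1] "double"
typeof(c(1, "one"))  # Double and character combine to character
## [1] "character"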

A.1.2 Comparisons and numerical precision

We can compare vectors using the equality operator ==, which compares two vectors element-by-element. The result of the following comparison might be a surprise at first sight.

double_vector[2:3]
## [1] 0.2 0.3
double_vector[2:3] == c(0.2, 0.3)
## [1]  TRUE FALSE

The result of the comparison is a vector of length 2 containing the logical values TRUE and FALSE, even though double_vector[2:3] appears to be equal to c(0.2, 0.3). The difference shows up if we increase the number of printed digits from the default (which is 7) to 20.

c(0.2, 0.3)
## [1] 0.2000000000000000111 0.2999999999999999889
double_vector[2:3]
## [1] 0.20000000000000001110 0.30000000000000004441

The 0.3 produced by seq is computed as 0.1 + 0.1 + 0.1, while the 0.3 in the vector c(0.2, 0.3) is converted directly into a double precision number. The difference arises because neither 0.1 nor 0.3 is exactly representable in the binary numeral system, and the arithmetic operations induce rounding errors. The function numToBits can reveal the exact difference in the three least significant bits.

numToBits(0.3)[1:3] 
## [1] 01 01 00
numToBits(0.1 + 0.1 + 0.1)[1:3] 
## [1] 00 00 01

Differences in the least significant bits are tolerable when we do numerical computations but can be a nuisance for equality testing. When comparing vectors containing doubles, we are therefore often interested in testing for approximate equality instead, using e.g. all.equal().

all.equal(double_vector, c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0))
## [1] TRUE

The function all.equal has a tolerance argument controlling whether numerical differences are regarded as actual differences. Another way of comparing numerical vectors is by computing the range of their difference.

range(double_vector - c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0))
## [1] 0.000000e+00 1.110223e-16

This shows the largest positive and negative difference. Usually, the size of the differences should be assessed relative to the magnitudes of the numbers compared. For numbers approximately of magnitude 1, differences of the order \(2^{-52} \simeq 2.22 \times 10^{-16}\) (the “machine epsilon”) can safely be regarded as rounding errors. The default (relative) tolerance of all.equal is the more permissive number \(\sqrt{2^{-52}} \simeq 1.5 \times 10^{-8}\). This number is an ad hoc but commonly used tolerance. See ?all.equal for more details.
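
The machine epsilon is directly available in R and can be consulted when choosing a tolerance.

.Machine$double.eps  # The machine epsilon
## [1] 2.220446e-16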

A.1.3 Lists

A.1.4 Other data structures

Factors, dates, data frames

A.2 Functions

When you write an R script and source it, the different R expressions in the script are evaluated sequentially. Intermediate results may be stored in variables and used further down in the script. It may not be obvious, but all the expressions in the script are, in fact, function calls. Your script relies on functions implemented in either core R packages or in other installed packages.

It is possible to use R as a scripting language without ever writing your own functions, but writing new R functions is how you extend the language, and it is how you modularize and reuse your code in an efficient way. As a first example, we implement the Gaussian kernel with bandwidth \(h\) (see Section 2.2) \[K_h(x) = \frac{1}{h \sqrt{2 \pi}} e^{- \frac{x^2}{2h^2}}.\] We call the function gauss() to remind us that this is the Gaussian kernel.

gauss <- function(x, h = 1) {
  exp(- x^2 / (2 * h^2)) / (h * sqrt(2 * pi))
}

The body of the function is the R expression exp(- x^2 / (2 * h^2)) / (h * sqrt(2 * pi)). When the function is called with specific numeric values of its two formal arguments x and h, the body is evaluated with the formal arguments replaced by their values. The value of the (last) evaluated expression in the body is returned by the function. The bandwidth argument h is given a default value 1, so if that argument is not specified when the function is called, it gets the value 1.

The Gaussian kernel with bandwidth \(h\) is the Gaussian density with standard deviation \(h\). Thus we can compare our implementation with dnorm() in R that computes the Gaussian density.

c(gauss(1), dnorm(1))
## [1] 0.2419707 0.2419707
c(gauss(0.1, 0.1), dnorm(0.1, sd = 0.1))  # Bandwidth changed to 0.1
## [1] 2.419707 2.419707

For those two cases the functions compute the same values (at least up to the printed precision). Note how the formal argument sd is given the value 0.1 in the second call of dnorm(). The argument sd is the third argument of dnorm(), but we do not need to specify the second argument, as in dnorm(0.1, 0, 0.1), just to set the third. We can instead pass it by name, in which case the second argument of dnorm() gets its default value 0.

There are several alternative ways to implement the Gaussian kernel that illustrate how functions can be written in R.

# A one-liner without curly brackets
gauss_one_liner <- function(x, h = 1) 
  exp(- x^2 / (2 * h^2)) / (h * sqrt(2 * pi)) 

# A stepwise implementation computing the exponent first 
gauss_step <- function(x, h = 1) {
  exponent <- (x / h)^2 / 2
  exp(- exponent) / (h * sqrt(2 * pi))
}

# A stepwise implementation with an explicit return statement
gauss_step_return <- function(x, h = 1) {
  exponent <- (x / h)^2 / 2
  value <- exp(- exponent) / (h * sqrt(2 * pi))
  return(value)
}

The following small test shows two cases where all implementations compute the same values.

c(gauss(1), gauss_one_liner(1), gauss_step(1), gauss_step_return(1))
## [1] 0.2419707 0.2419707 0.2419707 0.2419707
c(
  gauss(0.1, 0.1), 
  gauss_one_liner(0.1, 0.1), 
  gauss_step(0.1, 0.1), 
  gauss_step_return(0.1, 0.1)
)
## [1] 2.419707 2.419707 2.419707 2.419707

A function should always be tested. A test is a comparison of the return value of the function with the expected value for some specific argument(s). The expected value can either be computed by hand or by another implementation – as in the comparisons of gauss() and dnorm(). Such tests cannot prove that the implementation is correct, but discrepancies can help you catch and correct bugs. Remember that when testing numerical computations that use floating point numbers, we cannot expect exact equality. Thus tests should reveal whether return values are within an acceptable tolerance of the expected results.
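
As an example, the comparison of gauss() and dnorm() can be turned into such a test over a grid of points, using all.equal() to allow for rounding errors.

x <- seq(-3, 3, 0.1)
all.equal(gauss(x, 0.5), dnorm(x, sd = 0.5))
## [1] TRUE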

Larger pieces of software – such as an entire R package – should include a number of tests of each function it implements. This is known as unit testing (each unit, that is, each function, is tested), and there are packages supporting the systematic development of unit tests for R package development. A comprehensive set of unit tests also helps when functions are rewritten to e.g. improve performance or extend functionality. If the rewritten function passes all tests, chances are that we didn’t break anything by rewriting the function.
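
As a sketch of how such a unit test could look – assuming the testthat package is installed, though other testing frameworks exist – the comparison of gauss() and dnorm() could be phrased as follows.

library(testthat)
test_that("gauss() agrees with dnorm()", {
  x <- seq(-3, 3, 0.1)
  expect_equal(gauss(x), dnorm(x))                 # Default bandwidth h = 1
  expect_equal(gauss(x, 0.1), dnorm(x, sd = 0.1))  # Bandwidth h = 0.1
})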

It is good practice to write functions that do one well-defined computation and to keep the body of a function relatively small. Then it is easier to reason about what the function does, and it is easier to test it comprehensively. Complex behavior is achieved by composing small and well-tested functions.

To summarize:

  • The body is enclosed by curly brackets {}, which may be left out if the body is only one line.
  • The function returns the value of the last expression in the body except when the body contains an explicit return statement.
  • Formal arguments can be given default values when the function is implemented, and arguments can be passed to the function by position as well as by name.
  • Functions should do a single well-defined computation and be well tested.

A.2.1 Vectorization

In Section A.1 it was shown how comparison operators work in a vectorized way. In R, comparing a vector to another vector or a number leads to element-by-element comparisons with a logical vector as the result. This is one example of how many operations and function evaluations in R are natively vectorized, which means that when the function is evaluated with a vector argument the function body is effectively evaluated for each entry in the vector.

Our gauss() function is another example of a function that automatically works as a vectorized function.

gauss(c(1, 0.1), c(1, 0.1))
## [1] 0.2419707 2.4197072

It works as expected for the vector input because all the functions in the body of gauss() are vectorized: the arithmetic operators, the square root and the exponential function.

It is good practice to write R programs that use vectorized computations whenever possible. The alternative, an explicit for-loop, can be much slower; see the sketch below. Several examples in the book illustrate the computational benefits of vectorized implementations. It may, however, not always be obvious how to correctly implement a vectorized function.
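
For illustration, a loop-based equivalent of evaluating gauss() on a vector could look as follows (a sketch for comparison only; the name gauss_loop is our own).

gauss_loop <- function(x, h = 1) {
  value <- numeric(length(x))   # Preallocate the result vector
  for (i in seq_along(x))
    value[i] <- gauss(x[i], h)  # One scalar evaluation per element
  value
}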

Suppose we want to implement the following function \[\overline{f}_h(x) = \frac{1}{n} \sum_{i=1}^n K_h(x - x_i)\] for a data set \(x_1, \ldots, x_n\) and \(K_h\) the Gaussian kernel. This is the Gaussian kernel density estimator considered in Section 2.4. A straightforward implementation is

xs <- rnorm(10)  # A data set with 10 observations
f_bar <- function(x, h) 
  mean(gauss(x - xs, h))

This implementation works correctly when x and h are single numbers, but when e.g. x is a vector, the function returns a single number (not a vector) unrelated to \(\overline{f}_h(0)\) and \(\overline{f}_h(1)\).

c(f_bar(0, 1), f_bar(1, 1))
## [1] 0.2289101 0.2434550
f_bar(c(0, 1), 1) # Computation is not vectorized
## [1] 0.2201532

A quick fix is the following explicit vectorization.

f_bar_vec <- Vectorize(f_bar)

The R function Vectorize() is a function operator: it takes a function as argument and returns a function, in this case a vectorized version of the input function. That is, f_bar_vec() can be applied to a vector, which results in applying f_bar() to each element of the vector. We test that f_bar_vec() works correctly, both when the first and when the second argument is given as a vector.

f_bar_vec(c(0, 1), 1)
## [1] 0.2289101 0.2434550
c(f_bar(0, 1), f_bar(1, 1)) # Same result as the vectorized computation
## [1] 0.2289101 0.2434550
c(f_bar(1, 1), f_bar(1, 0.1))
## [1] 0.2434550 0.4022502
f_bar_vec(1, c(1, 0.1)) # Same result as the vectorized computation
## [1] 0.2434550 0.4022502

The function Vectorize() basically works by looping over the elements in the vector argument(s) and applying the function to each element. For prototyping and quick implementations it can be very convenient, but it is not a shortcut to efficient vectorized computations.
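
One way to write a genuinely vectorized implementation (in the argument x) is to form all pairwise differences with outer(). This is a sketch assuming xs and gauss() as defined above; the name f_bar_outer is our own. It trades memory, an entire matrix of differences, for speed.

f_bar_outer <- function(x, h) {
  # outer() forms the matrix of all differences x[i] - xs[j]; gauss() is
  # applied elementwise, and rowMeans() averages over the data set
  rowMeans(gauss(outer(x, xs, "-"), h))
}
f_bar_outer(c(0, 1), 1)  # Agrees with f_bar_vec(c(0, 1), 1)
## [1] 0.2289101 0.2434550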

Vectorize() is used in Section 2.4 to implement \(\overline{f}_h\) based on dnorm(). The purpose of that implementation is to be able to compute \(\overline{f}_h(x)\) for arbitrary \(x\) in a vectorized way, so that the function can be used together with curve() and in the likelihood computations of that section. The implementations in Section 2.2.1 differ by computing and returning \[\overline{f}_h(\tilde{x}_1), \ldots, \overline{f}_h(\tilde{x}_m).\] That is, those implementations return the evaluations of \(\overline{f}_h\) in an equidistant grid \(\tilde{x}_1, \ldots, \tilde{x}_m\) and not \(\overline{f}_h\) itself.

A.2.2 Environments

Something

Use xs above as example.

A.2.3 Function factories

Something

The random number streams from Chap. 4. Include implementation in package.

A.3 Performance

Something

Mention parallel computations

A.3.1 Tracing

A.3.2 Rcpp

Use the mean_numeric and mean_complex examples.

A.4 Objects and methods

Something

A.5 Exercises

Functions

Exercise A.1 Explain the result of evaluating the following R expression.

(0.1 + 0.1 + 0.1) > 0.3
## [1] TRUE

Exercise A.2 Write a function that takes a numeric vector x and a threshold value h as arguments and returns the vector of all values in x greater than h. Test the function on seq(0, 1, 0.1) with threshold 0.3. Have the example from Exercise A.1 in mind.

Exercise A.3 Investigate how your function from Exercise A.2 treats missing values (NA), infinite values (Inf and -Inf) and the special value “Not a Number” (NaN). Rewrite your function (if necessary) to exclude all or some of such values from x.

Hint: The functions is.na, is.nan and is.finite are useful.

Histograms with non-equidistant breaks

The following three exercises will use a data set consisting of measurements of infrared emissions from objects outside of our galaxy. We will focus on the variable F12, which is the total 12 micron band flux density.

infrared <- read.table("data/infrared.txt", header = TRUE)
F12 <- infrared$F12

The purpose of this exercise is two-fold. First, you will get familiar with the data and see how different choices of visualizations using histograms can affect your interpretation of the data. Second, you will learn more about how to write functions in R and gain a better understanding of how they work.

Exercise A.4 Plot a histogram of log(F12) using the default value of the argument breaks. Experiment with alternative values of breaks.

Exercise A.5 Write your own function, called my_breaks, which takes two arguments, x (a vector) and h (a positive integer). Let h have default value 5. The function should first sort x into increasing order and then return the vector that: starts with the smallest entry in x; contains every \(h\)th unique entry from the sorted x; ends with the largest entry in x.

For example, if h = 2 and x = c(1, 3, 2, 5, 10, 11, 1, 1, 3) the function should return c(1, 3, 10, 11). To see this, first sort x, which gives the vector c(1, 1, 1, 2, 3, 3, 5, 10, 11), whose unique values are c(1, 2, 3, 5, 10, 11). Every second unique entry is c(1, 3, 10), and then the largest entry 11 is concatenated.

Hint: The functions sort and unique can be useful.

Use your function to construct breakpoints for the histogram for different values of h, and compare with the histograms obtained in Exercise A.4.

Exercise A.6 If there are no ties in the data set, the function above will produce breakpoints with h observations in the interval between two consecutive breakpoints (except perhaps between the last two). If there are ties, the function will by construction return unique breakpoints, but there may be more than h observations in some intervals.

The intention is now to rewrite my_breaks so that, if possible, each interval contains h observations.

Modify the my_breaks function with this intention and so that it has the following properties:

  • All breakpoints must be unique.
  • The range of the breakpoints must cover the range of x.
  • For two subsequent breakpoints, \(a\) and \(b\), there must be at least h observations in the interval \((a,b],\) provided h < length(x). (With the exception that for the first two breakpoints, the interval is \([a,b].\))

Functions and objects

The following exercises build on having implemented a function that computes breakpoints for a histogram either as in Exercise A.5 or as in Exercise A.6.

Exercise A.7 Write a function called my_hist, which takes a single argument h and plots a histogram of log(F12). Extend the implementation so that any additional argument specified when calling my_hist is passed on to hist. Investigate and explain what happens when executing the following function calls.

my_hist()
my_hist(h = 5, freq = TRUE)
my_hist(h = 0)

Exercise A.8 Modify your my_hist function so that it returns an object of class my_histogram, which is not plotted. Write a print method for objects of this class, which prints just the number of cells.

Hint: It can be useful to know about the function cat.

How can you assign a class label to the returned object so that it is printed using your new print method, but it is still plotted as a histogram when given as argument to plot?

Exercise A.9 Write a summary method that returns a data frame with two columns containing the midpoints of the cells and the counts.

Exercise A.10 Write a new plot method for objects of class my_histogram that uses ggplot2 for plotting the histogram.

Functions and environments

The following exercises assume that you have implemented a my_hist function as in Exercise A.7.

Exercise A.11 What happens if you remove the data and call my_hist subsequently? What is the environment of my_hist? Change it to a new environment, and assign (using the function assign) the data to a variable with an appropriate name in that environment. Once this is done, check what now happens when calling my_hist after the data is removed from the global environment.

Exercise A.12 Write a function that takes an argument x (the data) and returns a function, where the returned function takes an argument h (just as my_hist) and plots a histogram (just as my_hist). Because the return value is a function, we may refer to the function as a function factory.

What is the environment of the function created by the function factory? What is in the environment? Does it have any effect when calling the function whether the data is altered or removed from the global environment?

Exercise A.13 Evaluate the following function call:

tmp <- my_hist(10, plot = FALSE)

What is the type and class of tmp? What happens when plot(tmp, col = "red") is executed? How can you find help on what plot does with an object of this class? Specifically, how do you find the documentation for the argument col, which is not an argument of plot?