Chapter 1 Introduction

Computational statistics is about turning theory and methods into algorithms and actual numerical computations with data. It is about solving real computational problems that arise when we visualize, analyze and model data.

Computational statistics is not a single coherent topic but rather a large number of vaguely related computational techniques that we use in statistics. This short book is in no way attempting to be comprehensive. Instead, a few selected statistical topics are treated in some detail with the intention that good computational practice can be learned from these topics and transferred to other parts of statistics as needed. Though the topics are arguably fundamental, they reflect the knowledge and interests of the author, and different topics could clearly have been chosen.

The demarcation line between statistical methodology and computational statistics is also blurred. Most methodology involves mathematical formulas and even algorithms for computing estimates and other statistics of interest from data, or for evaluating probabilities or integrals numerically or via simulations. The viewpoint taken in this book is that the transition from methodology to computational statistics happens when the methodology is to be implemented, that is, when formulas, algorithms and pseudo code are transformed into actual code and statistical software. It is during this transition that a number of practical challenges reveal themselves, such as actual run time and memory usage, the limitations of finite precision arithmetic, and the value of approximate solutions that may be theoretically suboptimal but sufficiently accurate for any pratical purpose.

Statistical software development also requires some basic software engineering skills and knowledge of the most common programming paradigms. Implementing a single algorithm for a specific problem is one thing, but developing a piece of statistical software for others to use is something quite different. This book is not an introduction to statistical software development as such, but the process of developing good software plays a prominent role. Thus solutions are not presented as code that magically manifests itself, but code is developed and analyzed in cycles that resemble how software development takes place.

There is a notable practical and experimental component to software development. However important theoretical considerations are regarding correctness and complexity of algorithms, say, the actual code has to strike a balance between generality, readability, efficiency, accuracy, ease of usage and ease of development among other things. Finding a good balance requires that one is able to reason about benefits and deficiencies of different implementations. It is a major point of this book that such reasoning should rely on experiments and empirical facts and not speculations.

R and RStudio is used throughout, and the reader is expected to have some basic knowledge of R programming. While RStudio is not a requirement for most of the book, it is a recommendable IDE (integrated development environment) for R, which offers a convenient framework for developing, benchmarking, profiling, testing, documenting and experimenting with statistical software. Appendix A covers some basic and important aspects of R programming and can serve as a survey and quick reference. For an in-depth treatment of R as a programming language the reader is referred to Advanced R by Hadley Wickham. In fact, direct references to that book are given for detailed explanations of R programming language concepts such as functional programming and object oriented programming.

This book is organized into three parts on smoothing, Monte Carlo methods and optimization. Each part is introduced in the following three sections to give the reader an overview of the topics covered, how they are related to each other and how they are related to some main trends and challenges in contemporary computational statistics. In this introduction, several R functions from various packages are used to illustrate how smoothing, simulation of random variables and optimization play roles in statistics. Thus the introduction relies largely on already implemented solutions, and some data analysts will never want to move beyond that use of R. However, the remaining part of the book is written for those who want to move on and learn how to develop their own solutions and not just how to use interfaces to the plethora of already existing implementations.