3 An R Package Engineering Workflow

BBS Course: Good Software Engineering Practice for R Packages

Friedrich

February 10, 2023

Motivation

From an idea to a production-grade R package

Example scenario: in your daily work, you notice that you need certain one-off scripts again and again.

The idea of creating an R package was born because you understood that “copy and paste” R scripts is inefficient and on top of that, you want to share your helpful R functions with colleagues and the world…

Professional Workflow

Photo CC0 by ELEVATE on pexels.com

Typical work steps

  1. Idea
  2. Concept creation
  3. Validation planning
  4. Specification:
    1. User Requirements Spec (URS),
    2. Functional Spec (FS), and
    3. Software Design Spec (SDS)
  1. R package programming
  2. Documented verification
  3. Completion of formal validation
  4. R package release
  5. Use in production
  6. Maintenance

Workflow in Practice

Photo CC0 by Chevanon Photography on pexels.com

Frequently Used Workflow in Practice

  1. Idea
  2. R package programming
  3. Use in production
  4. Bug fixing
  5. Use in production
  1. Bug fixing + Documentation
  2. Use in production
  3. Bug fixing + Further development
  4. Use in production
  5. Bug fixing + …

Bad practice!

Why?

Why practice good engineering?

Warning: Paket 'ggplot2' wurde unter R Version 4.2.3 erstellt
Warning: Paket 'dplyr' wurde unter R Version 4.2.3 erstellt

Attache Paket: 'dplyr'
Die folgenden Objekte sind maskiert von 'package:stats':

    filter, lag
Die folgenden Objekte sind maskiert von 'package:base':

    intersect, setdiff, setequal, union

Cost distribution among software process activities

doi:10.14569/IJACSA.2020.0110375

Why practice good engineering?

Origin of errors in system development

Boehm, B. (1981). Software Engineering Economics. Prentice Hall.

Why practice good engineering?

  • Don’t waste time on maintenance
  • Be faster with release on CRAN
  • Don’t waste time with inefficient and buggy further development
  • Fulfill regulatory requirements1
  • Save refactoring time when the PoC becomes the release version
  • You don’t have to be shy any longer about inviting other developers to contribute to the package on GitHub

Why practice good engineering?

Invest time in

  • requirements analysis,
  • software design, and
  • architecture…

… but in many cases the workflow must be workable for a single developer or a small team.

Workable Workflow

Photo CC0 by Kateryna Babaieva on pexels.com

Suggestion for a Workable Workflow

  1. Idea
  2. Design docs
  3. R package programming
  4. Quality check (see Ensuring Quality by Friedrich)
  5. Publication (see Publication by Daniel and Friedrich)
  6. Use in production

Example - Step 1: Idea

Let’s assume that you used some lines of code to create simulated data in multiple projects:

dat <- data.frame(
    group = c(rep(1, 50), rep(2, 50)),
    values = c(
        rnorm(n = 50, mean = 8, sd = 12),
        rnorm(n = 50, mean = 14, sd = 11)
    )
)

Idea: put the code into a package

Example - Step 2: Design docs

  1. Describe the purpose and scope of the package
  2. Analyse and describe the requirements in clear and simple terms (“prose”)
Obligation level Key word1 Description
Duty shall “must have”
Desire should “nice to have”
Intention will “optional”

Example - Step 2: Design docs

Purpose and Scope

The R package simulatr shall enable the creation of reproducible fake data.

Package Requirements

simulatr shall provide a function to generate normal distributed random data for two independent groups. The function shall allow flexible definition of sample size per group, mean per group, standard deviation per group. The reproducibility of the simulated data shall be ensured via an optional seed It should be possible to print the function result. A graphical presentation of the simulated data will also be possible.

Example - Step 2: Design docs

Useful formats / tools for design docs:

UML Diagram

Example - Step 3: Packaging

R package programming

  1. Create basic package project (see R Packages by Daniel)
  2. C&P existing R scripts (one-off scripts, prototype functions) and refactor1 it if necessary
  3. Create R generic functions
  4. Document all functions

Example - Step 3: Packaging

One-off script as starting point:

sim.data <- function(n1, n2, m1, m2, s1, s2) {
    data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = m1, sd = s1),
            rnorm(n = n2, mean = m2, sd = s2)
        )
    )
}

Example - Step 3: Packaging

Refactored script:

getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2) {
    data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
}

Almost all functions, arguments, and objects should be self-explanatory due to their names.

Example - Step 3: Packaging

Define that the result is a list1 which is defined as class2:

getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2) {
    result <- list(n1 = n1, n2 = n2, 
         mean1 = mean1, mean2 = mean2, sd1 = sd1, sd2 = sd2)
    result$data <- data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
    # set the class attribute
    result <- structure(result, class = "SimulationResult")
    return(result)
}

Example - Step 3: Packaging

The output is impractical, e.g., we need to scroll down:

x <- getSimulatedTwoArmMeans(n1 = 50, n2 = 50, mean1 = 5, mean2 = 7, sd1 = 3, sd2 = 4)
x
$n1
[1] 50

$n2
[1] 50

$mean1
[1] 5

$mean2
[1] 7

$sd1
[1] 3

$sd2
[1] 4

$data
    group      values
1       1  8.68064083
2       1  6.39745714
3       1  7.21711711
4       1  6.50761044
5       1  9.59162376
6       1  2.26972752
7       1  1.91521545
8       1  7.11236143
9       1 10.43559901
10      1  4.43742302
11      1  6.13109610
12      1  4.59842525
13      1  2.07578083
14      1  9.25296403
15      1  4.56812262
16      1 -1.12312973
17      1  5.10313844
18      1  9.31141402
19      1  7.28996543
20      1  4.87300823
21      1  9.20348007
22      1  4.58457001
23      1  5.18557564
24      1  5.78791703
25      1  2.78854415
26      1  3.52937152
27      1  5.67552526
28      1  3.02615791
29      1  4.50510663
30      1  6.05430704
31      1  9.21213913
32      1  5.68745072
33      1  7.44154813
34      1  6.03938427
35      1  7.76094984
36      1  5.71422534
37      1  8.21274487
38      1 10.88677276
39      1  0.09642121
40      1  7.38281043
41      1  3.71841030
42      1  8.92071903
43      1  2.38139275
44      1  1.06896357
45      1  7.84577139
46      1  2.90588667
47      1  1.28299573
48      1  1.98017719
49      1  5.72716612
50      1  4.52133215
51      2  7.07243697
52      2 11.89041739
53      2  9.32211607
54      2  9.29736751
55      2  6.38621604
56      2  9.93512784
57      2  6.90791548
58      2 13.59539939
59      2  8.19040909
60      2  8.90092196
61      2  7.55025224
62      2 -0.81017866
63      2 13.92597513
64      2 10.47336492
65      2  4.36222036
66      2  6.33767625
67      2 -3.90688152
68      2  4.20361575
69      2  9.45969799
70      2 15.24757853
71      2  0.39044942
72      2 12.75914893
73      2  9.62574650
74      2 12.19362357
75      2  9.30336106
76      2 13.39287508
77      2  6.76688953
78      2 10.25251649
79      2  9.48878446
80      2  7.70298784
81      2 11.08440991
82      2 15.22121843
83      2  5.24498197
84      2  0.33035643
85      2 10.54055221
86      2  1.10215483
87      2  5.90912634
88      2 11.25683659
89      2  8.88871960
90      2  4.81402524
91      2 14.42766359
92      2  4.26065880
93      2 10.79008307
94      2  7.89656300
95      2 11.78236101
96      2  3.08428371
97      2 12.07637845
98      2  6.93974642
99      2  3.67230448
100     2  9.78703927

attr(,"class")
[1] "SimulationResult"

Solution: implement generic function print

Example - Step 3: Packaging

Generic function print:

print.SimulationResult <- function(x, ...) {
    args <- list(n1 = x$n1, n2 = x$n2, 
        mean1 = x$mean1, mean2 = x$mean2, sd1 = x$sd1, sd2 = x$sd2)
    
    print(list(
        args = format(args), 
        data = dplyr::tibble(x$data)
    ), ...)
}
x
#' @title
#' Print Simulation Result
#'
#' @description
#' Generic function to print a `SimulationResult` object.
#'
#' @param x a \code{SimulationResult} object to print.
#' @param ... further arguments passed to or from other methods.
#' 
#' @examples
#' x <- getSimulatedTwoArmMeans(n1 = 50, n2 = 50, mean1 = 5, 
#'      mean2 = 7, sd1 = 3, sd2 = 4, seed = 123)
#' print(x)
#'
#' @export
$args
   n1    n2 mean1 mean2   sd1   sd2 
 "50"  "50"   "5"   "7"   "3"   "4" 

$data
# A tibble: 100 × 2
   group values
   <dbl>  <dbl>
 1     1   8.68
 2     1   6.40
 3     1   7.22
 4     1   6.51
 5     1   9.59
 6     1   2.27
 7     1   1.92
 8     1   7.11
 9     1  10.4 
10     1   4.44
# ℹ 90 more rows

Exercise

Photo CC0 by Pixabay on pexels.com

Preparation

  1. Download the unfinished R package simulatr
  2. Extract the package zip file
  3. Open the project with RStudio
  4. Complete the tasks below

Tasks

Add assertions to improve the usability and user experience

Tip on assertions

Use the package checkmate to validate input arguments.

Example:

playWithAssertions <- function(n1) {
  checkmate::assertInt(n1, lower = 1)
}
playWithAssertions(-1)

Error in playWithAssertions(-1) : Assertion on ‘n1’ failed: Element 1 is not >= 1.

Add three additional results:

  1. n total,
  2. creation time, and
  3. allocation ratio

Tip on creation time

Sys.time(), format(Sys.time(), '%B %d, %Y'), Sys.Date()

Add an additional result: t.test result

Add an optional alternative argument and pass it through t.test:

alternative = c("two.sided", "less", "greater")

Implement the generic functions print and plot.

Tip on print

Use the plot example function from above and extend it.

Tip on plot

Use R base plot or ggplot2 to create a grouped boxplot of the fake data.

Optional extra tasks:

  • Implement the generic functions summary and cat

  • Implement the function kable known from the package knitr as generic. Tip: use

    kable <- function(x) UseMethod("kable")

    to define kable as generic

Optional extra task1:

Document your functions with Roxygen2

  1. If you are already familiar with Roxygen2

References

  • Gillespie, C., & Lovelace, R. (2017). Efficient R Programming: A Practical Guide to Smarter Programming. O’Reilly UK Ltd. [Book | Online]
  • Grolemund, G. (2014). Hands-On Programming with R: Write Your Own Functions and Simulations (1. Aufl.).
    O’Reilly and Associates. [Book | Online]
  • Rupp, C., & SOPHISTen, die. (2009). Requirements-Engineering und -Management: Professionelle, iterative Anforderungsanalyse für die Praxis (5. Ed.). Carl Hanser Verlag GmbH & Co. KG. [Book]
  • Wickham, H. (2015). R Packages: Organize, Test, Document, and Share Your Code (1. Aufl.). O’Reilly and Associates. [Book | Online]
  • Wickham, H. (2019). Advanced R, Second Edition.
    Taylor & Francis Ltd. [Book | Online]

License information