Introduction

One of the strengths of R is that users can define their own functions. A function is simply a piece of R code that performs a specific task. Functions are particularly useful if you have some R code that you want to use repeatedly. You don’t strictly need functions, since any task can be done with plain code, but functions have several advantages:

Your code might be shorter and easier to interpret
Your code is easier to maintain
It is easier to reuse functions in future R scripts

You can write your own functions with the following syntax:

my_function <- function(my_arguments) {
# some R code to be executed
return(my_result)
}

We define the function my_function with the keyword function. The names of one or more arguments can be specified in round parentheses. Most functions require at least one argument, but even if a function does not require any arguments (e.g. dir()), you have to type the round parentheses. Next, the body of the function, i.e. the R code to be executed, is contained within curly braces. Finally, with return(my_result) you specify which value (i.e. which result) should be returned when the function is called. If you don’t explicitly define a return value, the function will return the value of the last expression evaluated within the curly braces.
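The difference between an explicit and an implicit return value can be sketched as follows. The function names square_explicit and square_implicit are hypothetical and chosen only for this illustration.

```r
# Explicit: the result is handed back with return()
square_explicit <- function(x) {
  result <- x^2
  return(result)
}

# Implicit: the value of the last evaluated expression is returned
square_implicit <- function(x) {
  x^2
}

square_explicit(4)
square_implicit(4)
```

Both calls give the same value. Which style to use is largely a matter of taste; an explicit return() can make longer functions easier to read.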

Traditionally, the first program written by people learning a new programming language is the famous Hello World program, which displays the message “Hello, World!”. Our function takes no arguments because the output is always the same text message. We define a new function called hello, and in the function body we simply write some code to print “Hello, World!”.

hello <- function() {
   print("Hello, World!")
}

hello()

You can see several things:

Although we don’t use any arguments, the round parentheses are still required after function.
R usually prints values automatically if they are not assigned to a new object, but this auto-printing applies only to expressions typed at the top level, not to expressions evaluated inside the curly braces of a function. We therefore use the print function here to display the text explicitly.
The function is called by typing the name of the function followed by parentheses. If you omit the parentheses and type only hello, R will show the code stored in the function.
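The printing behaviour described above can be checked with a small sketch. The function show_twice is a hypothetical example, not part of the tutorial's code.

```r
show_twice <- function() {
  "first"          # evaluated and discarded: nothing is printed
  print("second")  # printed explicitly
  "third"          # last expression: becomes the return value
}

# Because the result is assigned, only the explicit print() produces output
value <- show_twice()
value
```

Only "second" appears while the function runs; "third" is displayed afterwards, when value is typed at the top level.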

A simple function

Base R does not have a function to calculate the geometric mean, so we will define one. The geometric mean is the nth root of the product of n numbers. However, the product can be extremely large when n becomes large, so it is usually calculated as the exponential of the arithmetic mean of the logarithms, which in R is:
exp(mean(log(x)))
where x is a numeric vector. If you don’t know the term numeric vector, it is simply a collection of numbers, e.g. c(2, 3.1, 0) or 5:10. Have a look at the code below to see how the function is defined:

geom.mean <- function(x) {
   result <- exp(mean(log(x)))
   return(result)
}
some_numbers <- c(1,10,100,1000)
geom.mean(some_numbers)

We call the function geom.mean together with the argument x. x contains the numbers to calculate the geometric mean from. You could also define the numbers directly in the function call:
geom.mean(c(1,10,100,1000))
but usually you will apply the function to a variable in a dataset, e.g. the variable Wind in the dataset airquality, which is included in R. Add the line below to the code above and run it again.
geom.mean(airquality$Wind)
The code behaves in the same way as plain R code, e.g. if x contains any negative numbers, R returns NaN and shows a warning message:
[1] NaN
Warning message:
In log(x) : NaNs produced

Of course, you can assign the result to a new R object for later use, just as you do with other functions:
gm1 <- geom.mean(c(1,10,100,1000))
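The reason given earlier for using logarithms, namely that the product can become extremely large, can be illustrated with a small sketch. The vector x and the object names naive and stable are made up for this example.

```r
x <- rep(1000, 200)                # 200 identical values; the geometric mean is 1000

naive  <- prod(x)^(1 / length(x))  # the product exceeds the largest representable number
stable <- exp(mean(log(x)))        # the logarithmic form stays within range

naive
stable
```

The direct calculation fails because the intermediate product overflows to Inf, whereas the logarithmic form returns the expected value.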

Exercise
The variable Ozone in the dataset airquality has several missing observations. Therefore, the following line:
geom.mean(airquality$Ozone)
will return NA.
Can you modify the code within the curly parentheses to omit the missing values before the geometric mean is calculated (na.rm = T)?

Multiple arguments

In most functions you can specify more than one argument. The code below defines a new function exponentiate which has two arguments: the base b and the exponent n.

exponentiate <- function(b, n = 2) {
   result <- b^n
   return(result)
}

exponentiate(2, 10)

If you have multiple arguments, R expects you to provide them in the same order as in the function definition. If you would like to specify them in a different order, you also have to provide the argument names, i.e.: exponentiate(n = 10, b = 2)
You have likely noticed that the argument n has a default value (n = 2). If you omit the argument, R will use the default value, e.g.: exponentiate(7) # result will be 7^2 = 49
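The three ways of calling the function can be compared side by side. This sketch repeats the definition of exponentiate so that it runs on its own.

```r
exponentiate <- function(b, n = 2) {
  b^n
}

by_position <- exponentiate(2, 10)          # arguments matched by position: 2^10
by_name     <- exponentiate(n = 10, b = 2)  # arguments matched by name, order is free
with_default <- exponentiate(7)             # n omitted, default n = 2 is used: 7^2

by_position
by_name
with_default
```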

Exercise: Ancient Babylonian multiplication
Multiplying large numbers by hand is laborious. Ancient Babylonian scribes are said to have reduced multiplication to looking up squares in tables, using the following formula to multiply 2 numbers (x * y):
((x + y)^2 - (x - y)^2) / 4

Complete the code below to define the function babylon.

babylon <- function(x, y) {
   # type the code for the Babylonian multiplication here
   return(result)
}

babylon(66, 55)

Solution:

babylon <- function(x, y) {
   result <- ((x + y)^2 - (x - y)^2) / 4
   return(result)
}

babylon(66, 55)

Multiple results

If you want to return more than one result, the best way is to return them as a list. If you don’t know what a list is: it is a collection of elements, like a vector, but the elements can be of different classes, e.g. numbers, text, data frames, etc. The elements are usually named and can be accessed like a variable in a data frame with the $ sign. Let us define a function that calculates the odds ratio of a 2x2 table and its 95% confidence interval. The confidence interval can be approximated on the logarithmic scale using the following formula:
exp(log(OR) +/- 1.96 * sqrt(1/A + 1/B + 1/C + 1/D))

# A = number of exposed with disease, .., D = number unexposed without disease
OR <- function(A, B, C, D) {
   OR <- (A * D) / (B * C)
   # standard error of log(OR)
   se <- sqrt(1/A + 1/B + 1/C + 1/D)
   conf.int.low <- exp(log(OR) - 1.96 * se)
   conf.int.high <- exp(log(OR) + 1.96 * se)
   return(list(OR = OR, conf.int = c(conf.int.low, conf.int.high)))
}

or1 <- OR(30, 70, 40, 80)
or1
# Return only the confidence interval
or1$conf.int
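Accessing the elements of a returned list can be sketched with a smaller example. The helper spread and its element names are hypothetical and serve only to show the access syntax.

```r
# Hypothetical helper: returns several results of different types in one list
spread <- function(x) {
  list(
    n = length(x),         # a single number
    limits = range(x),     # a numeric vector of length 2
    label = "min and max"  # a character string
  )
}

s <- spread(c(4, 8, 15, 16, 23, 42))
s$n            # access by name with $
s[["limits"]]  # equivalent access with [[ ]]
names(s)       # names of all elements
```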

Conditional execution with `if`

Let us revisit the geometric mean example. The geometric mean is often used for parasitic egg counts in stool samples. However, there is one problem: there are usually also some negative stool samples, i.e. stool samples with 0 eggs. If there is at least one 0, the geometric mean is 0. Therefore, the formula is usually modified as:
exp(mean(log(x + 1))) - 1
Our aim is to write a function which calculates the modified geometric mean if there is at least one 0 in x and the original formula if all numbers are positive. To check if there is at least one 0, we can use:

if (any(x == 0)){
# code to be executed
}

If the condition given in round brackets after if is TRUE, the code in the curly brackets is executed; otherwise it is skipped. As you can see, the brackets follow more or less the same logic as in a function definition.

geom.mean <- function(x) {
  if(any(x == 0)) {
       print("0 detected, +1 added")
       result <- exp(mean(log(x + 1))) - 1
  }
  if(all(x > 0)) {
       result <- exp(mean(log(x)))
  }
  return(result)
}

geom.mean(c(0,10,100,1000))

Exercise
It is impossible to calculate the geometric mean with negative numbers. Modify the code above so that it stops with the error message "Negative numbers", using stop("Negative numbers"), if there are any negative numbers in x.

Solution:

geom.mean <- function(x) {
  if(any(x < 0)) {
       stop("Negative numbers")
  }
  if(any(x == 0)) {
       print("0 detected, +1 added")
       result <- exp(mean(log(x + 1))) - 1
  }
  if(all(x > 0)) {
       result <- exp(mean(log(x)))
  }
  return(result)
}

geom.mean(c(-1,10,100,1000))
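The two cases, at least one 0 versus all numbers positive, can also be written as a single if ... else statement. The following is only a sketch under the assumption that x contains no negative numbers and no missing values; the name geom.mean2 is hypothetical.

```r
# Assumes x has no negative numbers and no NA values
geom.mean2 <- function(x) {
  if (any(x == 0)) {
    # at least one 0: use the modified formula
    result <- exp(mean(log(x + 1))) - 1
  } else {
    # no 0 present: use the original formula
    result <- exp(mean(log(x)))
  }
  return(result)
}

geom.mean2(c(0, 10, 100, 1000))
geom.mean2(c(1, 10, 100, 1000))
```

With else, exactly one of the two branches is executed, so result is always defined under the stated assumption.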

Anonymous functions

Sometimes you just need a simple function and you only need it once, for example a custom function inside a by or apply call. It is possible to define such a function directly, without a separate definition and without a name. In the following example, we want to calculate the mean and standard deviation of the variable Wind, stratified by month (airquality data).
Remember: by is used in the following way:
by(object_to_analyse, vector_defining_strata, function_to_apply)

You need only a single line!

by(airquality$Wind, airquality$Month, function(x) c(mean(x), sd(x)))
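Anonymous functions work the same way inside the apply family. As a sketch, the same stratified summary can be produced with split and sapply, here with named elements so that the output is labelled; the object name wind_summary is made up for this example.

```r
# Split Wind by month, then apply an anonymous function to each group
wind_summary <- sapply(
  split(airquality$Wind, airquality$Month),
  function(x) c(mean = mean(x), sd = sd(x))
)

wind_summary  # one column per month, rows "mean" and "sd"
```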

Passing arguments with `...`

Sometimes a function is called inside another function, e.g. in our geometric mean example we call the built-in mean function inside our own geom.mean function. You may want some arguments of your newly defined function to be passed to a function called within it. If you know in advance which arguments you want to pass, you can simply add them as additional arguments. However, sometimes you do not know exactly which additional arguments might be useful at the time of writing the code. In such a case, you can simply add the placeholder ... (three dots) to the list of arguments.

In our initial geometric mean example we had some difficulties with missing values. We could expand the arguments with the 3 dots as follows:

geom.mean <- function(x, ...) {
   result <- exp(mean(log(x), ...))
   return(result)
}
some_numbers <- c(NA,10,100,1000)
geom.mean(some_numbers)
geom.mean(some_numbers, na.rm = T)
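Because everything captured by the three dots is handed on to mean, any other argument of mean can be passed as well, for example trim. The sketch below applies this to the variable Ozone of the airquality data, which contains missing values; the object names gm_all, gm_complete and gm_trimmed are made up for this example.

```r
geom.mean <- function(x, ...) {
  # all additional arguments are forwarded unchanged to mean()
  exp(mean(log(x), ...))
}

gm_all      <- geom.mean(airquality$Ozone)                            # NA: missing values present
gm_complete <- geom.mean(airquality$Ozone, na.rm = TRUE)              # missing values omitted
gm_trimmed  <- geom.mean(airquality$Ozone, na.rm = TRUE, trim = 0.1)  # additionally trims 10% from each end

gm_all
gm_complete
gm_trimmed
```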

Writing own functions in R