One of the strengths of R is the ability for users to define their
own functions. A function is simply a piece of R code that performs a
specific task. They are particularly useful if you have some R code that
you want to use repeatedly. You don’t strictly need functions, because any task can be
done with plain code, but functions have several advantages:
Your code might be shorter and easier to interpret
Your code is easier to maintain
It is easier to reuse functions in future R scripts
You can write your own functions with the following syntax:
my_function <- function(my_arguments) {
  # some R code to be executed
  return(my_result)
}
We define the function my_function with the keyword
function. The names of one or more arguments can be
specified in round parentheses. Most functions require at least one
argument, but even if a function does not require any arguments
(e.g. dir()), you have to type the round parentheses. Next, the
body of the function, i.e. the R code to be executed, is contained within
curly braces. Finally, with return(my_result) you specify which value
(i.e. which result) should be returned when the function is called.
If you don’t explicitly define a return value,
the function will return the value of the last expression evaluated within the
curly braces.
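To illustrate the two ways of returning a value, here is a minimal sketch with two hypothetical functions (the names add_explicit and add_implicit are made up for this example). Both should give the same result; the first uses return(), the second relies on the last evaluated expression.

```r
# Explicit return value
add_explicit <- function(a, b) {
  result <- a + b
  return(result)
}

# Implicit return value: the last evaluated expression is returned
add_implicit <- function(a, b) {
  a + b
}

add_explicit(2, 3)
add_implicit(2, 3)
```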
Traditionally, the first program written by people learning a new
programming language is the famous Hello World program, a program that
displays the message “Hello, World!”. This function takes no arguments,
because whatever the input, the output is always the same text message.
We define a new function called hello, and in the function body
we simply write some code to print “Hello, World!”.
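A minimal sketch of such a function, assuming it does nothing but print the message:

```r
# Define a function without arguments
hello <- function() {
  print("Hello, World!")
}

# Call the function
hello()
```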
You can see several things:
Although we don’t use any arguments, the round parentheses are still required after function.
R usually prints a value automatically if it is not assigned to a new object (in this case the text “Hello, World!”). In this example we nevertheless used the print function to make the output explicit, because expressions evaluated inside the curly braces of a function body are not printed automatically.
The function is called by typing the name of the function followed by parentheses. If you omit the parentheses and type only hello, R will print the code stored in the function.
Base R does not have a function to calculate the geometric mean, so we
will define one. The geometric mean is the nth root of the product of n
numbers. However, the product can be extremely large when n becomes
large, so it is usually calculated as the exponential of the arithmetic
mean of the logarithms, which in R is written as:
exp(mean(log(x)))
where x is a numeric vector. If you don’t
know the term numeric vector, it is simply a collection of numbers,
e.g. c(2, 3.1, 0) or 5:10. Have a look at the code
below to see how the function is defined:
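A minimal sketch of such a definition, assuming the function takes a single numeric vector x of positive numbers:

```r
# Geometric mean as the exponential of the mean of the logarithms
geom.mean <- function(x) {
  exp(mean(log(x)))
}

x <- c(1, 10, 100, 1000)
geom.mean(x)  # 10^1.5, about 31.62
```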
We call the function geom.mean together with the argument x.
x contains the numbers to calculate the geometric mean from. You could
also define the numbers directly in the function call:
geom.mean(c(1,10,100,1000))
but usually you will apply the
function to a variable in a dataset, e.g. the variable Wind in
the dataset airquality, which is included in R. Copy the line
below into the code window above and execute the code again.
geom.mean(airquality$Wind)
The code behaves in the same way
as plain R code, e.g. if x contains any negative numbers, R returns NaN and shows
a warning message:
[1] NaN
Warning message:
In log(x) : NaNs produced
Of course,
you can assign the result to a new R object for later use in the same
way as you do with other functions:
gm1 <- geom.mean(c(1,10,100,1000))
Exercise
The variable Ozone in the dataset
airquality has several missing observations. Therefore, the
following line:
geom.mean(airquality$Ozone)
will return NA.
Can you modify the code within the curly braces to
omit the missing values before the geometric mean is calculated
(hint: na.rm = TRUE)?
In most functions you can specify more than one argument. The code
below defines a new function exponentiate, which has two
arguments: the base b and the exponent n.
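A minimal sketch of such a definition, assuming the base b comes first and the exponent n has the default value 2 (which matches the calls discussed in this section):

```r
# b is the base, n the exponent (default: 2)
exponentiate <- function(b, n = 2) {
  b^n
}

exponentiate(2, 10)          # arguments by position: 2^10 = 1024
exponentiate(n = 10, b = 2)  # arguments by name, any order: 1024
exponentiate(7)              # n omitted, default used: 7^2 = 49
```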
If you have multiple arguments, R expects that you provide them in
the same order as in the function definition. If you would like to specify
them in a different order, you have to provide the argument
names as well, e.g.:
exponentiate(n = 10, b = 2)
You have likely
noticed that the argument n has a default value (n = 2). If you
omit the argument, R will use the default value, e.g.:
exponentiate(7) # result will be 7^2 = 49
Exercise: Ancient Babylonian multiplication
In ancient Babylon, multiplication was not easy to carry out by hand, so
the following formula is said to have been used to multiply two numbers (x * y):
((x + y)^2 - (x - y)^2) / 4
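The formula works because (x + y)^2 - (x - y)^2 expands to 4xy. A quick numeric check in plain R code, using two arbitrary example values (12 and 7 are chosen here purely for illustration):

```r
x <- 12
y <- 7

# (19^2 - 5^2) / 4 = (361 - 25) / 4 = 84
via_squares <- ((x + y)^2 - (x - y)^2) / 4

via_squares == x * y  # TRUE
```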
Complete the code below to define the function babylon.
If you want to return more than one result, a common way is to return
them as a list. If you don’t know what a list is: it’s a collection of
elements, like a vector, but the elements can be of different classes,
e.g. numbers, text, data frames, etc. The elements are usually named
and can be accessed like a variable in a
data frame with the $ sign. Let us define a function that calculates the
odds ratio of a 2x2 table and its confidence interval. With the cell counts A, B, C and D, the 95% confidence
interval can be approximated on the log scale using the following formula:
exp(log(OR) +/- 1.96 * sqrt(1/A + 1/B + 1/C + 1/D))
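A minimal sketch of such a function. It makes several assumptions that are not given in the text: the table is a 2x2 matrix with A and B in the first row and C and D in the second row, the odds ratio is (A * D) / (B * C), and the interval is calculated on the log scale and then transformed back. The function name odds_ratio, the element names and the cell counts are invented for illustration.

```r
odds_ratio <- function(tab) {
  # Assumed cell layout: first row A, B; second row C, D
  A <- tab[1, 1]
  B <- tab[1, 2]
  C <- tab[2, 1]
  D <- tab[2, 2]

  or <- (A * D) / (B * C)
  se_log_or <- sqrt(1/A + 1/B + 1/C + 1/D)  # standard error of log(OR)

  # Return several results as a named list
  list(
    or    = or,
    lower = exp(log(or) - 1.96 * se_log_or),
    upper = exp(log(or) + 1.96 * se_log_or)
  )
}

# Invented example counts, for illustration only
tab <- matrix(c(20, 10,
                15, 30), nrow = 2, byrow = TRUE)
res <- odds_ratio(tab)
res$or     # (20 * 30) / (10 * 15) = 4
res$lower
res$upper
```

The individual elements of the returned list are accessed with the $ sign, just like variables in a data frame.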
Let us revisit the geometric mean example. The geometric mean is
often used for parasitic egg counts in stool samples. However, there is
one problem: there are usually also some negative stool samples,
i.e. stool samples with 0 eggs. If there is at least one 0, the geometric
mean is 0. Therefore, the formula is usually modified as:
exp(mean(log(x + 1))) - 1
Our aim is to write a function
which calculates the modified geometric mean if there is at least one 0
in x and the original formula if all numbers are positive. To check if
there is at least one 0, we can use:
if (any(x == 0)) {
  # code to be executed
}
If a condition is given in round parentheses after if and the
condition is TRUE, the code in the curly braces will be
executed. As you can see, it follows more or less the same logic as the
parentheses and braces of a function definition.
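Putting the pieces together, a function along these lines could look as follows. This is a sketch under the assumption that x contains no missing values and no negative numbers; the else branch handles the case in which all numbers are positive.

```r
geom.mean <- function(x) {
  if (any(x == 0)) {
    # At least one 0: modified formula
    exp(mean(log(x + 1))) - 1
  } else {
    # All numbers positive: original formula
    exp(mean(log(x)))
  }
}

geom.mean(c(2, 8))  # original formula: 4
geom.mean(c(0, 3))  # modified formula: exp(mean(log(c(1, 4)))) - 1 = 1
```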
Exercise
It is impossible to calculate the geometric mean
with negative numbers. Modify the code above so that it stops with the
error message stop("Negative numbers") if there are
any negative numbers in x.
Sometimes you just need a simple function and you only need it once,
for example if you need a custom function inside a by or
apply function. It is possible to define such a function
directly, without a separate definition and without a name (a so-called anonymous function). In the
following example, you want to calculate the mean and standard deviation
of the variable Wind, stratified by month (airquality data).
Remember: by is used in the
following way:
by(object_to_analyse, vector_defining_strata, function_to_apply)
You need only a single line!
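One possible way to write this, as a sketch: the anonymous function is defined directly in the third argument of by and returns a named vector with both statistics. The object name wind_stats and the element names mean and sd are chosen for this example only.

```r
# Mean and standard deviation of Wind for each month
wind_stats <- by(airquality$Wind, airquality$Month,
                 function(x) c(mean = mean(x), sd = sd(x)))
wind_stats
```

The function exists only inside this call; it is not stored under a name and cannot be reused afterwards.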
Sometimes a function is called inside another function, e.g. in our
geometric mean example we call the built-in mean function
inside our own geom.mean function. You may want some arguments of your
newly defined function to be passed to a function called within it. If
you know in advance which arguments you want to pass, you can simply add
them as additional arguments. However, sometimes you do not know exactly
which additional arguments might be useful at the time of writing the
code. In such a case, you can simply add the placeholder ... (three dots) to the list of
arguments.
In our initial geometric mean example we
had some difficulties with missing values. We could expand the arguments
with the three dots as follows:
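A minimal sketch, assuming the simple one-line version of geom.mean. Every argument given after x is handed on unchanged to the built-in mean function:

```r
geom.mean <- function(x, ...) {
  # ... is passed on to mean(), e.g. na.rm = TRUE
  exp(mean(log(x), ...))
}

geom.mean(airquality$Ozone)                # NA because of the missing values
geom.mean(airquality$Ozone, na.rm = TRUE)  # missing values are removed by mean()
```

Because the three dots are passed to mean, any other argument of mean (for example trim) could be used in the same way without changing the definition of geom.mean.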