Introduction to `ggplot2`

R's package ggplot2 is based on the 'grammar of graphics', a framework that constructs graphics in a structured, layered manner. Every graphic in this package is built from layers, and a plot is described by the following components:

Data - the source of the information to be visualized.
ggplot(data=airquality, ...
Mapping - the 'aesthetics', i.e. how the variables are applied to the graph in terms of X and Y position, stratification, colors etc.
ggplot(data=airquality, mapping = aes(x=Temp)) + ...
Statistical transformation - includes value transformations, calculating means for bar plots, adding a regression line etc. Every geometric object function has a default statistic.
Geometric objects - points, lines, bars, text
... geom_density()
Scales - A scale controls how data interact with aesthetic attributes. Particularly important are the color scales.
... scale_fill_manual(values = c("red", "blue", "green"))
Coordinate system - in most cases you will use the default, the Cartesian coordinate system.
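
The fragments above can be combined into one complete plot. The following is a minimal sketch only: the choice of `geom_density`, the condition `Month > 6` and the two colors are assumptions made for illustration, not part of the original examples.

```r
library(ggplot2)

# Data and mapping: temperature on the x axis, fill colour by part of the season
p <- ggplot(data = airquality, mapping = aes(x = Temp, fill = Month > 6)) +
  # Geometric object; geom_density() uses the density statistic by default
  geom_density(alpha = 0.5) +
  # Scale: one manually chosen colour per category (FALSE, TRUE)
  scale_fill_manual(values = c("red", "blue")) +
  # Coordinate system: Cartesian is the default, written out here for clarity
  coord_cartesian()
plot(p)
```

Each `+` adds one component, so single parts can be exchanged without touching the rest of the plot.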

If you are new to ggplot2, I highly recommend working through the entire tutorial. If you are only interested in a specific graph, go directly to the corresponding section.

Example data `airquality`

Data represent daily air quality measurements in New York, May to September 1973. The variables Ozone and Solar.R have several missing values.

Table 1: Example dataset airquality
No	Var	Type	Description
1	Ozone	number	Ozone (ppb)
2	Solar.R	number	Solar radiation (lang)
3	Wind	number	Wind speed (mph)
4	Temp	number	Temperature (°F)
5	Month	number	Month (1--12)
6	Day	number	Day of month (1--31)

Example data `birthwt`

This dataset is in the package MASS. The data were collected at a medical center in the US during 1986 to assess risk factors associated with low infant birth weight. 189 women were enrolled in the study.

Table 2: Example dataset birthwt
No	Var	Type	Description
1	low	number	Birth weight less than 2.5 kg (0/1)
2	age	number	Mother's age [years]
3	lwt	number	Mother's weight at last menstrual period [pounds]
4	race	number	Mother's race (1 = white, 2 = black, 3 = other)
5	smoke	number	Smoking status during pregnancy (0/1)
6	ptl	number	Number of previous premature labours
7	ht	number	Hypertension (0/1)
8	ui	number	Uterine irritability (0/1)
9	ftv	number	Number of physician visits during the first trimester
10	bwt	number	Birth weight [g]

Example data `ChickWeight`

Experiment on the effect of diet on the early growth of 50 chicks. The idea was to measure every chick 12 times over 3 weeks, but a few chicks were measured less often.

Table 3: Example dataset ChickWeight
No	Var	Type	Description
1	weight	number	Weight [g]
2	Time	number	Time since birth [days]
3	Chick	factor	Chick ID
4	Diet	factor	Diet (1--4)

Example data `swiss`

Swiss fertility and socioeconomic indicators from about 1888 for each of 47 French-speaking provinces.

Table 4: Example dataset swiss
No	Var	Type	Description
-	(rownames)	text	Province
1	Fertility	number	'Ig' index of marital fertility
2	Agriculture	number	Males involved in agriculture as occupation [%]
3	Examination	number	Draftees receiving highest mark on army examination [%]
4	Education	number	Education beyond primary school for draftees [%]
5	Catholic	number	Catholic (as opposed to protestant) [%]
6	Infant.Mortality	number	Live births who live less than 1 year [%]

Visualising distributions

Histograms `geom_histogram`

The most common graphical representation of the distribution of numerical data is the histogram. It groups the values into "bins" and displays the bin counts across the range of values. The code below generates a histogram of the variable Temp of the dataset airquality.

library(ggplot2)
p1 <- ggplot(data = airquality, aes(x = Temp)) +
  geom_histogram()
plot(p1)

An important characteristic of a histogram is the number of bins (parts) used for the distribution. You can specify either the number of bins with bins or the width of the bins with binwidth. Change the geom_histogram() line in the code above to:
geom_histogram(bins=10)
and run the code again.
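
The difference between the two arguments can be checked directly. This is a small sketch under the assumption that a bin width of 5 degrees is a sensible choice for Temp; the object names are made up for illustration.

```r
library(ggplot2)

base <- ggplot(airquality, aes(x = Temp))

# Fix the number of bins; the bin width follows from the data range
h_bins <- base + geom_histogram(bins = 10)

# Fix the bin width (here 5 degrees Fahrenheit); the number of bins follows
h_width <- base + geom_histogram(binwidth = 5)

plot(h_bins)
plot(h_width)

# layer_data() returns the computed bins, one row per bar
nrow(layer_data(h_bins))
nrow(layer_data(h_width))
```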

Overlaid histograms

Sometimes you would like to compare the distribution of a numeric variable across different subgroups. If you specify a categorical variable in one of the aesthetics group, fill or color, ggplot2 will show a separate histogram for each unique value of this variable. You can also convert a numeric variable into categories inside the aes argument. Conditions that evaluate to TRUE or FALSE are likewise interpreted as two categories, e.g. show a histogram of the variable Temp stratified by the condition Month > 6:

library(ggplot2)
p2 <- ggplot(data = airquality, aes(x = Temp, fill = Month > 6)) +
  geom_histogram()
plot(p2)

If you change an argument in geom_histogram, e.g.:
geom_histogram(binwidth = 4)
the changes will be applied to both histograms.
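
Note that geom_histogram() stacks the groups on top of each other by default (position = "stack"). If the histograms should really be drawn over each other, one option is position = "identity" together with some transparency. A minimal sketch, with the alpha value chosen arbitrarily:

```r
library(ggplot2)

p_overlay <- ggplot(airquality, aes(x = Temp, fill = Month > 6)) +
  # "identity" draws every group from the baseline instead of stacking them
  geom_histogram(binwidth = 4, position = "identity", alpha = 0.5)
plot(p_overlay)
```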

Density plots `geom_density`

Density plots and histograms

A density plot is like a smoothed histogram. In the figure above you see how both plots are related. The advantage of density plots over histograms becomes obvious if you would like to compare more than 2 distributions. In the next example we would like to visualize the distribution of ozone separately for each month:

library(ggplot2)
p3 <- ggplot(data = airquality, aes(x = Ozone, fill = factor(Month))) +
  geom_density()
plot(p3)

The R console shows the warning message:
Removed 37 rows containing non-finite values (stat_density)
indicating that the variable Ozone contains 37 missing values.
Overplotting is obviously a problem here, but we can use transparent colors with the argument alpha. Values between 0 and 1 are allowed for alpha. It represents the fraction of opacity, i.e. values close to 0 are very transparent and values close to 1 are almost opaque. Run the code below with different values for alpha.

library(ggplot2)
# Not so nice. We need to specify transparency.
# Replace geom_density() with geom_density(alpha = 0.5)
p4 <- ggplot(data = airquality, aes(x = Ozone, fill = factor(Month))) +
  geom_density(alpha = 0.5)
plot(p4)

Violin plots `geom_violin`

If you would like to see the distributions next to each other use a violin plot.

library(ggplot2)
p5 <- ggplot(data = airquality, aes(x = factor(Month), y = Ozone)) +
  geom_violin()
plot(p5)

You can also add quantiles to the plot. Replace the geom_violin() call in the code above with:
geom_violin(draw_quantiles = c(0.25,0.5,0.75))

Box-plots `geom_boxplot`

The traditional way of presenting distributions is the boxplot. We can easily generate a boxplot of the chick weight for each time point. Note: Time is a numerical variable in the dataset ChickWeight. To get one box per time point, convert it to a categorical variable with the function factor.

library(ggplot2)
# y = weight belongs inside aes(); the parenthesis was closed one argument too early
p6 <- ggplot(ChickWeight, aes(x = factor(Time), y = weight)) +
  geom_boxplot()
plot(p6)

This geometry can also be modified in various ways, e.g. with outlier.color, outlier.shape or outlier.size to change the appearance of the outlier dots. With geom_boxplot(notch = TRUE), notches are drawn that indicate an approximate confidence interval of the median.
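
A short sketch of these arguments in combination; the particular color, shape and size values are arbitrary choices for illustration.

```r
library(ggplot2)

p_box <- ggplot(ChickWeight, aes(x = factor(Time), y = weight)) +
  geom_boxplot(
    notch = TRUE,            # notches around the median
    outlier.colour = "red",  # colour of the outlier dots
    outlier.shape = 1,       # open circles
    outlier.size = 2         # dot size
  )
plot(p_box)
```

With small groups ggplot2 may print a message that a notch went outside the hinges; the plot is still drawn.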

+ `geom_violin`

We can also combine box and violin plots, although we have to add some additional arguments. It is useful to specify scale = "width" so that all violins have the same maximum width. At the same time we reduce the width of the boxes.

library(ggplot2)
# Violin plot with a narrower boxplot on top (Time is the x variable)
p7 <- ggplot(ChickWeight, aes(x = factor(Time), y = weight)) +
  geom_violin(scale = "width", fill = "gray") +
  geom_boxplot(width = 0.4, fill = "white")
plot(p7)

+ `geom_dotplot`

Boxplots combined with dotplots have recently become quite popular. We can add a new aes statement within a geom to overwrite the defaults from the ggplot function. In the example below we would like to have different dot colors for different diets.

library(ggplot2)
p8 <- ggplot(ChickWeight, aes(x = factor(Time), y = weight)) +
  geom_boxplot() +
  geom_dotplot(aes(fill = Diet), binaxis = 'y', stackdir = 'center', dotsize = .5, binwidth = 10)
plot(p8)

Fun fact: The diameter of the dots specified in dotsize (default value is 1) is relative to the binwidth. If you change the code to ... binwidth = 20) it will change the size of the dots as well.

+ `geom_line`

In case of grouped observations, like repeated measurements over time, you can add lines. In the plot below a line for each chick is added to visualize the individual weight gain.

library(ggplot2)
p9 <- ggplot(ChickWeight, aes(y = weight, x = factor(Time))) +
  geom_boxplot() +
  geom_line(aes(group = Chick, color = Diet))
plot(p9)

Scatterplots

Scatterplots are the most commonly used visualisation technique to show the correlation between two numeric variables.

Basic scatterplot `geom_point`

The basic scatterplot is easily done using the geom geom_point. Besides the X and Y variables you can select a 3rd variable to specify a color code. We would like to plot the temperature (in Fahrenheit) against ozone (in ppb), and the dots should have different colors depending on the month of measurement. Month is stored as a numeric variable in the dataset (5 to 9). By default ggplot2 will use a color gradient instead of discrete colors unless you convert Month into a categorical variable with factor(Month). We will discuss colors in more detail later. For the moment simply run the code below and compare the two graphs.

library(ggplot2)
p10 <- ggplot(airquality, aes(y = Ozone, x = Temp, color = Month)) +
  geom_point()
plot(p10)

p11 <- ggplot(airquality, aes(y = Ozone, x = Temp, color = factor(Month))) +
  geom_point()
plot(p11)

There are some value combinations which occur more than once. They are plotted on top of each other and only one is visible. If we use geom_jitter we can add small random noise.

library(ggplot2)
p12 <- ggplot(airquality, aes(y = Ozone, x = Temp, color = factor(Month))) +
  geom_jitter(width = 0.25, height = 0.25)
plot(p12)

Regression lines

Sometimes, you would like to add linear regression lines to the scatterplot. One nice feature of ggplot2: if you specify a categorical variable for color, group or fill, individual regression lines will be calculated for each value of this variable. The argument se specifies whether confidence intervals around the regression lines should be shown (default) or omitted.

library(ggplot2)
p13 <- ggplot(airquality, aes(y = Ozone, x = Temp, color = factor(Month))) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)
plot(p13)

What can you do if you would like to have different colors but only a single regression line? Within each geom you can overwrite the aesthetics specified in the ggplot() call. Therefore, you could either omit the color argument from the first line and specify it later as:
geom_point(aes(color = factor(Month))) +
Or, the other way round, you specify in the smoothing geom that there is no color stratification:
geom_smooth(method="lm", aes(color = NULL))
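
Written out in full, the two variants could look as follows. This is a sketch; the object names p_a and p_b are made up, and formula = y ~ x is stated explicitly only to avoid the message about the default formula.

```r
library(ggplot2)

# Variant 1: colour is mapped only in the point layer
p_a <- ggplot(airquality, aes(x = Temp, y = Ozone)) +
  geom_point(aes(color = factor(Month))) +
  geom_smooth(method = "lm", formula = y ~ x)

# Variant 2: colour is mapped globally and switched off for the smoother
p_b <- ggplot(airquality, aes(x = Temp, y = Ozone, color = factor(Month))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, aes(color = NULL))

plot(p_a)
plot(p_b)
```

Both variants should give colored points with one common regression line.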

Loess smoother

A linear regression line represents the data well only if there is a linear relationship between the two variables, which is often not the case. Instead of a regression line, you can add a smoothed line which tries to find a nice curve in proximity to the data. A very popular approach is the locally weighted scatterplot smoother (loess or lowess), which is the default smoothing algorithm in ggplot2 for datasets with fewer than 1,000 observations.

library(ggplot2)
p14 <- ggplot(airquality, aes(y = Ozone, x = Temp)) +
  geom_point() +
  geom_smooth()
plot(p14)

Alternatively, you can use a quadratic regression line via:
stat_smooth(method = "lm", formula = y ~ poly(x, 2))
or an even higher polynomial:
stat_smooth(method = "lm", formula = y ~ poly(x, 4))
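
As a complete example, both polynomial fits can be drawn in one plot for comparison. A minimal sketch; switching off the confidence band and the red color for the second curve are choices made here for readability.

```r
library(ggplot2)

p_poly <- ggplot(airquality, aes(x = Temp, y = Ozone)) +
  geom_point() +
  # Quadratic fit; se = FALSE hides the confidence band
  stat_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
  # Fourth-degree polynomial for comparison
  stat_smooth(method = "lm", formula = y ~ poly(x, 4), se = FALSE, color = "red")
plot(p_poly)
```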

Binary smoothing

It is also straightforward to show the regression line from a logistic regression. However, there is one important thing to know: R's glm function for logistic regression accepts as outcome variable numerical data coded as 0/1, categorical variables with 2 categories, or logical variables coded as TRUE/FALSE. In contrast, geom_smooth accepts only data coded as 0/1. Therefore, we can't use the function factor to split a numeric variable into Yes/No for logistic regression; we have to use the function as.numeric. In the example below we would like to see if wind speed is associated with high ozone levels, defined as concentrations > 50 ppb. Note that the Y-axis shows the predicted probabilities, i.e. the values on the probability scale.

library(ggplot2)
airquality$high.ozon <- as.numeric(airquality$Ozone > 50)
p15 <- ggplot(airquality, aes(x = Wind, y = high.ozon)) +
  geom_jitter(height = 0.01, width = 0.5) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"))
plot(p15)

Do we have something similar to loess for binary data? Yes, it is called GAM, for generalized additive models. The formula argument can be omitted in recent ggplot2 versions, but if you specify it you are on the safe side.

library(ggplot2)
airquality$high.ozon <- as.numeric(airquality$Ozone > 50)
p16 <- ggplot(airquality, aes(x = Solar.R, y = high.ozon)) +
  geom_jitter(height = 0.01, width = 0.5) +
  geom_smooth(method = "gam", formula = y ~ s(x), method.args = list(family = "binomial"))
plot(p16)

Smoothing & transformations

A note of caution: If you run a linear regression and plot the regression line on a linear axis scale you get a straight line. In contrast, it should be a curve if one of the axes is transformed to log scale. However, if the axis is transformed with a scale function, ggplot2 first transforms the data points and then estimates the regression line, so the result is again a straight line. Compare the 3 graphs below.

library(ggplot2)
p17 <- ggplot(ChickWeight, aes(x = Time, y = weight)) +
  geom_jitter() +
  geom_smooth(method = "lm", se = FALSE) +
  coord_trans(y = "log")
plot(p17)

p18 <- ggplot(ChickWeight, aes(x = Time, y = weight)) +
  geom_jitter() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_y_log10()
plot(p18)

p19 <- ggplot(ChickWeight, aes(x = Time, y = log10(weight))) +
  geom_jitter() +
  geom_smooth(method = "lm", se = FALSE)
plot(p19)

Can you see the difference? Why are there straight lines in p18 and p19 but a curve in p17?

Annotations

The main geometry for text annotations is geom_text. geom_label works in much the same way, except that the text is wrapped in a rectangle that you can customize. With nudge_x and nudge_y you shift the text away from the marker but, especially if the length of the text differs, it is better to use the arguments hjust and vjust. The argument check_overlap = TRUE does not move labels around: a label that would overlap a previously drawn label in the same layer is simply omitted, which is why some labels are missing in the example. If you often work with annotations, check out the package ggrepel to repel overlapping text labels.

library(ggplot2)
p20 <- ggplot(swiss, aes(x = Education, y = Infant.Mortality)) +
  geom_point() +
  geom_text(label = rownames(swiss), hjust = "left", vjust = "outward", check_overlap = TRUE)
plot(p20)

Instead of a variable you can also specify a single number for x and y to put one label at a certain position:
geom_text(label="Hi!", x = 40, y = 20)
Alternatively, you can specify a condition to annotate only a subsample of the points. This is usually done via geom_text(data=subset(... but unfortunately this does not work well with rownames. However, there is a trick using ifelse which works perfectly fine.

library(ggplot2)
p21 <- ggplot(swiss, aes(x = Education, y = Infant.Mortality)) +
  geom_point() +
  geom_label(label = ifelse(swiss$Agriculture > 80, rownames(swiss), NA))
plot(p21)
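
If you prefer the subset approach, one possible workaround is to copy the row names into an ordinary column first. The following is only a sketch; the data frame swiss2 and the column Name are hypothetical names introduced here.

```r
library(ggplot2)

# Copy the row names into a regular column so that subset() keeps them
swiss2 <- swiss
swiss2$Name <- rownames(swiss)

p_sub <- ggplot(swiss2, aes(x = Education, y = Infant.Mortality)) +
  geom_point() +
  # Only rows with Agriculture > 80 are passed to the text layer
  geom_text(data = subset(swiss2, Agriculture > 80),
            aes(label = Name), hjust = "left", vjust = "outward")
plot(p_sub)
```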

Bar plots

There are two different types of bar plots.
Summarising numeric variables across several categories (with or without error bars), e.g. mean birth weight for mothers of different age groups.
Main ggplot2 functions:
geom_bar() stat_summary()

Calculating the proportions of a binary or categorical variable across several categories, e.g. proportion of children with extremely low birth weight, very low birth weight, low birth weight and normal birth weight for mothers of different age groups.
Main ggplot2 function:
geom_col()

In this chapter we will use the birthwt data from the package MASS, a dataset on risk factors (such as race or smoking status) for low infant birth weight.
First, we would like to generate a bar plot showing the mean birth weight by race (1: white, 2: black, 3: other). You can generate the graph in two different ways.
IMPORTANT NOTE: Since the release of ggplot2 3.3.0 the arguments fun.ymin, fun.y and fun.ymax have been replaced by fun.min, fun and fun.max.

library(ggplot2)
library(MASS)
head(birthwt)

p22 <- ggplot(birthwt, aes(x = race, y = bwt)) +
  geom_bar(stat = "summary", fun = "mean")  # before ggplot2 3.3.0: fun.y = "mean"
plot(p22)

# same graph as above
p22 <- ggplot(birthwt, aes(x = race, y = bwt)) +
  stat_summary(geom = "bar")
plot(p22)

Error bars

To add error bars you can simply add stat_summary().

library(ggplot2)
library(MASS)
head(birthwt)

p23 <- ggplot(birthwt, aes(x = race, y = bwt)) +
  geom_bar(stat = "summary", fun = "mean") +  # before ggplot2 3.3.0: fun.y = "mean"
  stat_summary()
plot(p23)

As the notification in the R console indicates, the default stat_summary() presents the mean and standard error. This is rarely useful. Usually you would like to show the standard deviation or the confidence interval. For confidence intervals and the standard deviation ggplot2 has some additional functions. However, they require that the Hmisc package is installed:
stat_summary(fun.data = "mean_cl_normal") # mean and confidence interval based on the t-distribution
stat_summary(fun.data = "mean_cl_boot") # mean and confidence interval estimated by bootstrapping
stat_summary(fun.data = "mean_sdl") # mean plus/minus a multiple of the standard deviation (default: 2)
stat_summary(fun.data = "median_hilow") # median and a pair of outer quantiles (default: 2.5% and 97.5%)
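
If Hmisc is not available, fun.data also accepts a user-defined function that returns a data frame with the columns y, ymin and ymax. The sketch below assumes a hypothetical helper mean_sd() for the mean plus/minus one standard deviation, and uses the fun argument of ggplot2 3.3.0 or later.

```r
library(ggplot2)
library(MASS)  # provides the birthwt data

# Hypothetical helper: mean and mean +/- 1 standard deviation
mean_sd <- function(x) {
  data.frame(y = mean(x), ymin = mean(x) - sd(x), ymax = mean(x) + sd(x))
}

p_sd <- ggplot(birthwt, aes(x = factor(race), y = bwt)) +
  geom_bar(stat = "summary", fun = "mean") +
  stat_summary(fun.data = mean_sd, geom = "errorbar", width = 0.2)
plot(p_sd)
```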

Another way to show error bars is to calculate them first and to add them afterwards:

library(ggplot2)
library(MASS)
library(dplyr)
mydata <- birthwt %>% group_by(race) %>%
  summarize(meanbwt = mean(bwt, na.rm = TRUE),
            sdbwt = sd(bwt, na.rm = TRUE))
mydata

p24 <- ggplot(mydata, aes(x = race, y = meanbwt)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = meanbwt - sdbwt, ymax = meanbwt + sdbwt), width = .2)
plot(p24)

Grouped

For grouped bar plots you have to specify a 2nd grouping variable with the group aesthetic. You now have two stratification variables: the one defined with 'x' and the one defined with 'group'. Usually, you would like to fill the bars with different colors for different groups.

library(ggplot2)
library(MASS)
head(birthwt)

p25 <- ggplot(birthwt, aes(x = race, y = bwt, fill = factor(smoke), group = smoke)) +
  geom_bar(stat = "summary", fun = "mean",  # before ggplot2 3.3.0: fun.y = "mean"
           position = "dodge") +
  stat_summary(fun.data = "mean_cl_boot", position = position_dodge(0.9))
plot(p25)

3 strata

Commonly you have not only one but several stratification variables. Of course you could generate a new variable with one category for each combination of the single stratification variables. However, it is easier to do this directly in ggplot2 via interaction(var1, var2) as shown below:

library(ggplot2)
library(MASS)
head(birthwt)

p26 <- ggplot(birthwt, aes(x = race, y = bwt, fill = interaction(smoke, ht), group = interaction(smoke, ht))) +
  geom_bar(stat = "summary", fun = "mean",  # before ggplot2 3.3.0: fun.y = "mean"
           position = "dodge") +
  stat_summary(fun.data = "mean_cl_boot", position = position_dodge(0.9))
plot(p26)

You can even specify more than 2 variables. Try:
... fill=interaction(ui, smoke, ht), group = interaction(ui, smoke, ht)))

Proportions

Proportions can be visualized in various ways. A common approach is to first generate a new dataset which contains the proportions, e.g. with prop.table(table(..., and to use this dataset for the graph afterwards.

There is an alternative way which works if the binary variable is coded either 0/1 or T/F. In this case the mean is equal to the proportion. Therefore, you could show the proportions with:
ggplot(data = birthwt, aes(x = race, y = low)) + stat_summary(geom = "bar")

With percentages

You can easily add the numbers to the plot. The argument vjust specifies whether they should be printed above or below the top of each bar.

library(ggplot2)
library(MASS)
head(birthwt)

mydata <- data.frame(prop.table(table(birthwt$race, birthwt$low), 1))

p28 <- ggplot(data = subset(mydata, Var2 == 1), aes(x = Var1, y = Freq)) +
  geom_col() +
  geom_text(aes(label = round(..y.. * 100)), vjust = 2, color = "white", size = 10)
plot(p28)

..y.. is a so-called internal variable, i.e. a variable existing only temporarily within the graph (since ggplot2 3.3.0 the same value can also be written as after_stat(y)). Unfortunately, this approach does not work very well together with geom_bar or stat_summary.

Several categories

If you have not only one proportion but several categories and you would like to visualize them in a stacked bar plot, you can use the same approach: first calculate the proportions and then plot them. In the code below we first generate a new categorical variable with different birth weight categories.

library(ggplot2)
library(MASS)
head(birthwt)

birthwt$bwtcat <- cut(birthwt$bwt, seq(500, 5500, 1000))
mydata <- data.frame(prop.table(table(birthwt$race, birthwt$bwtcat), 1))
mydata

p29 <- ggplot(data = mydata, aes(x = Var1, y = Freq, fill = Var2)) + geom_col()
plot(p29)

It is worth mentioning that R's basic barplot function also does a good job of generating stacked bar plots:

library(MASS)
head(birthwt)

birthwt$bwtcat <- cut(birthwt$bwt, seq(500, 5500, 1000))

barplot(prop.table(table(birthwt$bwtcat, birthwt$race), 2),
  col = rainbow(5),    # change colors from grayscale
  legend.text = TRUE,  # add legend
  xlim = c(0, 6))      # make sure that there is space for the legend

Line plots

Time series

Time series or other repeated measures over time can be displayed with geom_line. There is only one tricky issue: If there are missing values in the data series, ggplot2 will interrupt the line. So you first have to drop the missing observations. With subset in the data argument you can drop the specific records. However, if you have several variables in the graph with different missing patterns, it is often better to 'update' the data in the geom. Compare the code of the two graphs below. Note that we first use the function ISOdate to generate a proper date variable.

library(ggplot2)
# Convert variables Month and Day into Date
airquality$Date <- ISOdate(1973, airquality$Month, airquality$Day)

p30 <- ggplot(airquality, aes(x = Date, y = Ozone)) +
  geom_line()
plot(p30)

# Omit missing data to have connected lines
p31 <- ggplot(airquality, aes(x = Date, y = Ozone)) +
  geom_line(data = subset(airquality, !is.na(Ozone)))
plot(p31)

2nd Y-axis

It is also possible to use a 2nd Y-axis:

library(ggplot2)
# Convert variables Month and Day into Date
airquality$Date <- ISOdate(1973, airquality$Month, airquality$Day)

p32 <- ggplot(airquality, aes(x = Date, y = Ozone)) +
  geom_line(data = subset(airquality, !is.na(Ozone))) +
  scale_y_continuous(sec.axis = sec_axis(~. / 3 + 50, name = "Temperature [F]")) +
  geom_line(aes(y = ((Temp - 50) * 3)), color = "red")
plot(p32)
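
A secondary axis in ggplot2 is only a relabelled copy of the primary axis. The second variable therefore has to be rescaled to the primary axis, and sec_axis() has to apply the inverse transformation for the labels. The sketch below uses the rescaling (Temp - 50) * 3 and its inverse . / 3 + 50; the two helper functions and the object names are made up to make the inverse relationship explicit.

```r
library(ggplot2)

# Hypothetical helpers: rescale Temp to the Ozone axis and back again
temp_to_ozone_axis <- function(temp) (temp - 50) * 3
ozone_axis_to_temp <- function(y) y / 3 + 50

airq <- airquality
airq$Date <- ISOdate(1973, airq$Month, airq$Day)

p_sec <- ggplot(airq, aes(x = Date)) +
  geom_line(data = subset(airq, !is.na(Ozone)), aes(y = Ozone)) +
  # Temperature, rescaled to the primary (Ozone) axis
  geom_line(aes(y = (Temp - 50) * 3), color = "red") +
  scale_y_continuous(
    name = "Ozone [ppb]",
    # Inverse rescaling, so the right-hand axis reads in degrees Fahrenheit
    sec.axis = sec_axis(~ . / 3 + 50, name = "Temperature [F]")
  )
plot(p_sec)

# The two transformations cancel each other out
all.equal(ozone_axis_to_temp(temp_to_ozone_axis(airq$Temp)), as.numeric(airq$Temp))
```

If the two formulas are not exact inverses of each other, the right-hand axis labels no longer match the plotted line.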

Line and barplot

Sometimes you would like to have a line plot in front of a bar plot. In this case it is important to use geom_col.

library(ggplot2)
# Convert variables Month and Day into Date
airquality$Date <- ISOdate(1973, airquality$Month, airquality$Day)

p33 <- ggplot(airquality, aes(x = Date, y = Ozone)) +
  scale_y_continuous(sec.axis = sec_axis(~. / 3 + 50, name = "Temperature [F]")) +
  geom_col(aes(y = ((Temp - 50) * 3)), color = "red") +
  geom_line(data = subset(airquality, !is.na(Ozone)), lwd = 2, color = "blue")
plot(p33)

Spaghetti plots

If you have many time series, as in the ChickWeight dataset, you might like to show the individual time series and, in addition, a summary, e.g. the mean of all time series. In the example below, we first generate a new data frame with the mean weights for each time point stratified by diet. Afterwards, we generate two line plots: one with the individual data and a 2nd with the group means. Because the means are stored in a different variable in a different data set, we have to update data and aes(y = ...

library(ggplot2)
library(dplyr)
meanChick <- ChickWeight %>% group_by(Diet, Time) %>%
  summarise(meanWeight = mean(weight))

p34 <- ggplot(ChickWeight, aes(x = Time, y = weight, color = Diet)) +
  geom_line(aes(group = Chick)) +
  geom_line(data = meanChick, aes(y = meanWeight), lwd = 2)
plot(p34)

Multiple plots

Subplots for 1 variable

Multiple plots, or more precisely conditional plots, are called facets in ggplot2 terminology. If you have only one stratification variable, use the function facet_wrap.

library(ggplot2)
p35 <- ggplot(airquality, aes(y = Ozone, x = Temp)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~Month)
plot(p35)

Subplots for several variables

Subplots for all value combinations of two stratification variables are created with facet_grid:

library(ggplot2)
library(MASS)
p36 <- ggplot(birthwt, aes(x = age, y = bwt)) +
  geom_point() +
  geom_smooth() +
  facet_grid(smoke ~ ui, labeller = label_both)
plot(p36)

In newer ggplot2 versions, the variables can also be specified via:
facet_grid(rows = vars(smoke), cols = vars(ui))
However, the formula syntax is more powerful because it can easily be extended to more than 2 strata:

library(ggplot2)
library(MASS)
p37 <- ggplot(birthwt, aes(x = age, y = bwt)) +
  geom_point() +
  facet_grid(age < 21 ~ age < 21 + smoke + ui, labeller = label_both)
plot(p37)

It might happen that not all combinations of 2 stratification variables occur. To save space in the figure it might be better to show only those combinations which actually occur. With scales = "free" non-existing combinations are dropped and space = "free" allows different subplot dimensions. Have a look at the 3 different plots generated below:

library(ggplot2)
library(MASS)
p38 <- ggplot(airquality, aes(x = Ozone, y = factor(Month))) +
  stat_summary(fun.data = "median_hilow") +
  facet_grid(cut(Temp, seq(50, 100, 10)) ~ .)
plot(p38)

p39 <- ggplot(airquality, aes(x = Ozone, y = factor(Month))) +
  stat_summary(fun.data = "median_hilow") +
  facet_grid(cut(Temp, seq(50, 100, 10)) ~ ., scales = "free")
plot(p39)

p40 <- ggplot(airquality, aes(x = Ozone, y = factor(Month))) +
  stat_summary(fun.data = "median_hilow") +
  facet_grid(cut(Temp, seq(50, 100, 10)) ~ ., scales = "free", space = "free")
plot(p40)

Colors

ggplot2 provides various options to customize colors. The package distinguishes carefully between categorical (discrete) and numeric (continuous) variables. For numeric variables ggplot2 expects a color gradient, e.g. from red to blue; for categorical variables it expects a vector of colors.

If you specify a set of discrete colors for a numeric variable or vice versa you will get one of the following error messages:

Error: Discrete value supplied to continuous scale
Error: Continuous value supplied to discrete scale

You can specify colors in the RGB color space as "#RRGGBB", where RR, GG and BB represent the red, green and blue levels from 0 to 255, written as hexadecimal values (00 to FF). This sounds a bit complicated and in fact it is. However, if you use only the colors of your institute's corporate identity in your presentation, it looks very professional. E.g. the colors of the SwissTPH logo are R:191, G:50, B:39 (red) and R:70, G:138, B:178 (blue), which translate to "#BF3227" and "#468AB2".
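The conversion between decimal levels and the "#RRGGBB" notation does not have to be done by hand. A minimal sketch using the base R functions rgb() and col2rgb() (the variable names are chosen for this illustration only):

```r
# Convert 0-255 RGB levels to the "#RRGGBB" notation
red_hex  <- rgb(191, 50, 39, maxColorValue = 255)
blue_hex <- rgb(70, 138, 178, maxColorValue = 255)
print(c(red_hex, blue_hex))

# And back again: col2rgb() returns the decimal levels as a matrix
levels_dec <- col2rgb(c("#BF3227", "#468AB2"))
print(levels_dec)
```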

Otherwise R has a huge number of beautiful colors pre-specified.

Let us revisit our first scatter plot example:
ggplot(airquality, aes(y = Ozone, x = Temp, color = factor(Month)))
Month has been converted to a categorical variable via factor(Month); consequently discrete colors are expected (at least as many as there are unique values of Month, i.e. 5).
scale_colour_manual(values = c("red","navy","gold","tan","plum"), na.value="gray")
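Put together, the two fragments above might look as follows. This is a sketch, not the tutorial's original widget code. The second variant uses a named vector, which ties each color to a specific factor level instead of relying on the order of the levels.

```r
library(ggplot2)

# Manual colors matched by position (order of the factor levels 5, 6, 7, 8, 9)
p_manual <- ggplot(airquality, aes(y = Ozone, x = Temp, color = factor(Month))) +
  geom_point(na.rm = TRUE) +   # na.rm = TRUE drops missing Ozone values silently
  scale_colour_manual(values = c("red", "navy", "gold", "tan", "plum"),
                      na.value = "gray")
plot(p_manual)

# Same colors, matched by level name
month_cols <- c("5" = "red", "6" = "navy", "7" = "gold", "8" = "tan", "9" = "plum")
p_named <- ggplot(airquality, aes(y = Ozone, x = Temp, color = factor(Month))) +
  geom_point(na.rm = TRUE) +
  scale_colour_manual(values = month_cols, name = "Month")
plot(p_named)
```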

color (or colour) specifies the colors of symbols and all lines in a plot. For bar plots, histograms or filled symbols (pch 21-25) it specifies the border around the filled areas (this differs from most base R graphics). The color of areas such as bars, confidence bands etc. can be specified similarly via scale_fill_manual.

If you don't want to specify the colors by hand, you can use one of the pre-specified color palettes, e.g. scale_colour_brewer(palette = "Set1")

To get an overview of which palettes are available, use the following code:
library(ggplot2)
library(RColorBrewer)
display.brewer.all()

The color palettes in the middle are called "Qualitative sets". They are especially useful for unordered categorical data (Accent, Dark2, Paired, Pastel1, Pastel2, Set1, Set2, Set3).
All color schemes provided are from the ColorBrewer project:
https://colorbrewer2.org
You should definitely visit this site, because there you can identify schemes with special features, such as color-blind friendly or printer friendly sets.
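The same information can also be queried from within R. The sketch below assumes the RColorBrewer package is installed and uses its brewer.pal.info table to list the qualitative palettes flagged as color-blind friendly.

```r
library(RColorBrewer)

# brewer.pal.info has one row per palette with the columns
# maxcolors, category ("div", "qual", "seq") and colorblind (TRUE/FALSE)
info <- brewer.pal.info
qual_cb <- info[info$category == "qual" & info$colorblind, ]
print(qual_cb)

# Extract the hex codes of a palette, here 3 colors from Dark2
dark2_cols <- brewer.pal(3, "Dark2")
print(dark2_cols)
```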

In the example below we would like to produce a scatter plot with different symbols for each month, symbol fill colors according to different levels of wind speed (manually specified colors) and symbol border lines colored according to different levels of solar radiation, using a pre-specified color-blind friendly palette:
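A minimal sketch of such a plot is given here. The cut points and the helper columns Wind.cat and Solar.cat are assumptions made for this illustration, and Dark2 is used as one of the ColorBrewer palettes flagged as color-blind friendly.

```r
library(ggplot2)

aq <- airquality
# Hypothetical categories for wind speed and solar radiation
aq$Wind.cat  <- cut(aq$Wind, c(0, 5, 10, 20, 40))
aq$Solar.cat <- cut(aq$Solar.R, c(0, 100, 200, 300, 400))

p_sym <- ggplot(aq, aes(x = Temp, y = Ozone,
                        shape = factor(Month),   # one symbol per month
                        fill = Wind.cat,         # fill color by wind speed
                        colour = Solar.cat)) +   # border color by solar radiation
  geom_point(size = 3, stroke = 1.5, na.rm = TRUE) +
  scale_shape_manual(values = 21:25) +           # fillable symbols only
  scale_fill_manual(values = c("#BF3227", "#468AB2", "tan", "plum"),
                    na.value = "gray") +
  scale_colour_brewer(palette = "Dark2")
plot(p_sym)
```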

For numeric (continuous) variables we can specify a gradient from one color to another with:
scale_colour_gradient(low = "green", high = "red")
There is also an alternative function that lets you specify additional colors for the mid-level values, e.g. to obtain the red, yellow, green traffic light color code:
scale_colour_gradientn(colors = c("green", "yellow", "red"))

The equivalent of the function scale_colour_brewer for numeric data is scale_colour_distiller.
Compare the two examples below:

library(ggplot2)
p42 <- ggplot(airquality, aes(y = Ozone, x = Temp, fill = Solar.R, color = Solar.R)) +
              geom_point(stroke = 2, shape=21) +    # 'stroke' for thicker symbol lines
              scale_colour_gradient(low = "green", high = "red") +
              scale_fill_gradientn(colors = c("green", "yellow", "red"))

plot(p42)

p43 <- ggplot(airquality, aes(y = Ozone, x = Temp, fill = Solar.R, color = Solar.R)) +
              geom_point(stroke = 2, shape=21) +    # 'stroke' for thicker symbol lines
              scale_colour_distiller(palette = "RdYlGn") +
              scale_fill_distiller(palette = "Set1")

plot(p43)

Labels & legends

If you have only one legend, it is relatively easy to manipulate it. If you have multiple legends, the fine tuning requires a bit of experience. Some legend features are manipulated directly within the scale_... functions (e.g. to change the title of the legend or the labels of the different items), others with the function theme. For instance, theme(legend.position = "top") will display the legend on top of the graph, whereas 2 numbers between 0 and 1 will place the legend within the plotting area (from ggplot2 3.5.0 on, this numeric form is deprecated in favor of legend.position = "inside" together with legend.position.inside). If you have more than 1 legend, legend.box = "horizontal" will arrange the different legends next to each other. In contrast, legend.direction = "horizontal" will arrange the single items of a legend next to each other, while "vertical" stacks them.
The labs function is self-explanatory. Simply check out the code below:

library(ggplot2)
# generate categories from variables Wind and Solar.R
airquality$Wind.cat <- cut(airquality$Wind, c(0,5,10,20,40))
airquality$Solar.cat <- cut(airquality$Solar.R, c(0,100,200,300,400))

p44 <- ggplot(airquality, aes(y = Ozone, x = Temp, shape = factor(Month),
                              color = Wind.cat, fill = Solar.cat, size = Temp)) +
          geom_point(stroke = 2) +                    # 'stroke' for thicker symbol lines
          scale_fill_manual(values = c("#BF3227","#468AB2","tan","plum"), na.value="gray",
              guide = guide_legend(override.aes = list(shape = 21))) + # change symbol of fill legend
          scale_colour_brewer(palette = "Dark2") +  # color-blind safe palette
          scale_shape_manual(values = 21:25, name = "", labels = c("M","J","J","A","S")) +
          scale_size_continuous(name = "F") +
          theme(legend.position=c(0.22, 0.75), # legend position
                legend.text = element_text(color = "gray", size = 9),
                legend.direction = "vertical",
                legend.box = "horizontal",
                legend.spacing.x = unit(0.02, 'cm')) +
          labs(title = "Effect of temperature on ozone",
              subtitle = "Colors represent wind speed and solar radiation",
              caption = "Data source: airquality",
              x = expression(paste("Temperature [",degree,"F]")),
              y = "Ozone [ppb]",
              tag = "A")

plot(p44)

Finally, you can use:
theme(legend.position="none") to hide the legend.
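For a plot with two legends, the theme settings discussed in this section can be tried out in a smaller setting. The following is a sketch only; the month labels and the choice of Wind for the size aesthetic are assumptions made for the illustration.

```r
library(ggplot2)

p_leg <- ggplot(airquality, aes(x = Temp, y = Ozone,
                                colour = factor(Month), size = Wind)) +
  geom_point(na.rm = TRUE) +
  # legend title and item labels are set in the scale function
  scale_colour_brewer(name = "Month", palette = "Set1",
                      labels = c("May", "Jun", "Jul", "Aug", "Sep"))

# Legends on top: items of each legend side by side, legends stacked
p_top <- p_leg + theme(legend.position = "top",
                       legend.direction = "horizontal",
                       legend.box = "vertical")
plot(p_top)

# No legend at all
p_none <- p_leg + theme(legend.position = "none")
plot(p_none)
```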

Axis limits

If you would like to zoom into the plotting area, don't use
scale_y_continuous(limits = ...
because this discards the data points outside the limits, which is likely not what you want. Although the points that remain visible are unchanged, anything calculated from the data is affected, e.g. a regression line or the median of a boxplot. Instead, use coord_cartesian, which zooms into the plot without discarding any data.

library(ggplot2)
p6 <- ggplot(ChickWeight, aes(y = weight, x = factor(Time))) +
          geom_boxplot()
plot(p6)

p42 <- ggplot(ChickWeight, aes(y = weight, x = factor(Time))) +
          geom_boxplot() +
          coord_cartesian(ylim = c(40, 260)) # boxes are calculated correctly from all data
plot(p42)

p43 <- ggplot(ChickWeight, aes(y = weight, x = factor(Time))) +
          geom_boxplot() +
          scale_y_continuous(limits = c(40, 260)) # boxes are incorrect: out-of-range data are dropped first
plot(p43)
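Why the two approaches differ can also be checked numerically without any plotting. The sketch below mimics what scale limits do to the data for the last time point of ChickWeight. The limits 40 and 260 are taken from the zoom example, and the variable names are chosen for this illustration.

```r
# Weights at the last measurement (Time == 21)
w21 <- ChickWeight$weight[ChickWeight$Time == 21]

# Scale limits drop everything outside [40, 260] before statistics are computed
kept <- w21[w21 >= 40 & w21 <= 260]

print(c(n_all = length(w21), n_kept = length(kept)))
print(c(median_all = median(w21), median_kept = median(kept)))
```

coord_cartesian corresponds to the first median (all data, only the view is clipped), the scale limits to the second.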

Themes

ggplot2 has several standard themes. The default theme theme_grey is useful for screens or presentations but not for publications or printing. For publications you would usually choose theme_classic or theme_bw.

You can find additional examples here:
https://ggplot2.tidyverse.org/reference/ggtheme.html
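Switching themes only requires adding the theme function to an existing plot. A short sketch (the base plot is an assumption chosen for illustration):

```r
library(ggplot2)

p_base <- ggplot(airquality, aes(x = Temp, y = Ozone)) +
  geom_point(na.rm = TRUE)

p_grey    <- p_base                                  # default theme_grey
p_bw      <- p_base + theme_bw()                     # white background, grid lines
p_classic <- p_base + theme_classic(base_size = 12)  # axis lines only, no grid

plot(p_bw)
plot(p_classic)
```

To change the default for all subsequent plots of a session, theme_set(theme_bw()) can be called once at the top of a script.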

Export

You can export the figures in high quality for publications with the function ggsave. Provide a name for the file and the name of the plot to save it to your hard disk. The function uses the file extension to save the plot in the right format. Allowed file extensions include pdf, jpg and tiff. The latter is the format of choice for many scientific journals. Always add the argument compression = "lzw" if you export figures as tiff. If you simply save the figure, it will be saved with a dimension of 7 x 7 inch, which is likely too large. If you change width and height, it may happen that the text is now too large or too small. Play a bit with the arguments height, width, units, scale, and dpi to get it right, e.g.
ggsave("My_scatter.jpg", p43, width = 7, height = 4, units = "in", dpi = 300)
Note: the figure above will be saved at 7 x 4 inch at 300 dpi resolution. If you think that text and symbols should be larger, you can modify the figure directly in ggsave, e.g. to increase the size use:
ggsave("My_scatter.jpg", p43 + theme_bw(base_size = 14), width=7, height=4, units="in", dpi=300)
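For the tiff case mentioned above, an export might look like the sketch below. It writes to a temporary directory so that nothing is left in the working directory. The plot and the file name are assumptions for illustration, and the example presumes that the R installation has tiff support.

```r
library(ggplot2)

p_exp <- ggplot(airquality, aes(x = Temp, y = Ozone)) +
  geom_point(na.rm = TRUE)

out_file <- file.path(tempdir(), "My_scatter.tiff")

# 'compression' is passed on to the tiff device
ggsave(out_file, p_exp + theme_bw(base_size = 14),
       width = 7, height = 4, units = "in", dpi = 300,
       compression = "lzw")

print(file.exists(out_file))
```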

Now have fun with ggplot2
