Who uses ggplot2

Introduction to R

Graphics are very important for data analysis. On the one hand, we can use them for exploratory data analysis in order to discover any hidden connections or simply to get an overview. On the other hand, we need graphics to show results and communicate them to others.

We have already created graphics with the `ggplot2` package several times in this script without looking closely at the code. In this chapter we will get to know the syntax of `ggplot2`.

`ggplot2` is based on an intuitive syntax, the so-called grammar of graphics. Once you get used to it, you can create very complex graphics using an elegant and consistent “grammar”. `ggplot2` is designed to work with data sets in long format. Graphics are always created according to the same principle:

Step 1: We start with a data set and create a plot object with the function `ggplot()`.

Step 2: We define so-called “aesthetic mappings”, i.e. we determine which variables are to be displayed on the X or Y axes, and which variables are used to group the data. The function we use for this is called `aes()`.

Step 3: We add one or more “layers” to the plot. These layers define how something should be displayed, e.g. as a line or as a histogram. The functions start with the prefix `geom_`, e.g. `geom_line()`.

To combine these steps, we need an additional operator: `+`. You already know this as a mathematical operator, but in this context `+` means that we combine individual elements of a plot object.

After this somewhat abstract introduction, we will illustrate these steps with a practical example.

At the end of the last chapter we examined the relationship between psychological stress and gender. We now load the data set again:

We then create a data set that contains only the variables we need.

In this data set we have a numeric variable (psychological stress) and a grouping variable (gender). Our question was whether the numeric variable is related to the grouping variable. We could represent this relationship graphically in different ways: with dots, a box plot or a violin plot. In `ggplot2` terms these are three different geoms: `geom_point()`, `geom_boxplot()` and `geom_violin()`. There is also a function, `geom_jitter()`, that does not draw the points on top of each other in a dot plot, but adds a small random offset (“jittering”).

We can load the `ggplot2` package either individually or as part of the `tidyverse`:
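A minimal sketch of the two ways of loading the package. The second call is commented out because it assumes the `tidyverse` meta-package is installed.

```r
# Load ggplot2 on its own ...
library(ggplot2)

# ... or load it together with the other tidyverse packages
# (assumption: the tidyverse meta-package is installed).
# library(tidyverse)
```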

5.1 Step 1: Create a plot object

We start with a data set and create a plot object with the function `ggplot()`. This function has a data frame as its first argument. This means that we can use the pipe operator (`%>%`):

So we have two options. We prefer the pipe notation here, but it is of course also possible to specify the data frame within `ggplot()` as an argument. At the same time we assign the plot object to a variable.
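The two options can be sketched as follows. Since the chapter's own data set is not reproduced here, the built-in `ToothGrowth` data serve as a stand-in (an assumption), and the object names `p1` and `p2` are hypothetical. The sketch uses R's native pipe `|>` (R 4.1 or later); `%>%` from `magrittr` works the same way.

```r
library(ggplot2)

# Stand-in data (assumption): ToothGrowth ships with R and has a numeric
# variable (len) and a two-level grouping variable (supp).

# Option 1: pipe the data frame into ggplot()
p1 <- ToothGrowth |> ggplot()

# Option 2: pass the data frame as the first argument
p2 <- ggplot(data = ToothGrowth)
```

Both calls produce an equivalent plot object; which one to use is a matter of style.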

5.2 Step 2: Aesthetic mappings

Now we define the “aesthetic mappings” with the second argument. These determine how the variables are used to represent the data and are defined with the function `aes()`. We want to represent the grouping variable on the X-axis, and the numeric variable should be displayed on the Y-axis. In addition, `aes()` can have further arguments such as `fill`, `color`, `shape`, `linetype` and `group`. These are used to assign different colors, shapes, lines, etc. to the levels of the grouping variables.

In this example the grouping variable is gender, and we want its two levels to have different outline colors (`color`) and to be “filled in” with different colors (`fill`).
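A minimal sketch of such a mapping, again using the built-in `ToothGrowth` data as a stand-in (an assumption): `supp` plays the role of the grouping variable and `len` the role of the numeric variable. The name `p` is hypothetical.

```r
library(ggplot2)

# Grouping variable on the X-axis, numeric variable on the Y-axis;
# outline color and fill color both depend on the grouping variable.
p <- ggplot(ToothGrowth,
            aes(x = supp, y = len, color = supp, fill = supp))

# Printing the object draws labeled axes, but no data yet,
# because no layer has been added.
p
```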

If we define the “aesthetic mappings” within `ggplot()`, they apply to all “layers”, i.e. to all elements of the plot. We could also define these mappings separately for each “layer”.

The result is now an “empty” plot object. We can look at it, but no data are shown because it does not contain any “layers” yet. A plot object is displayed by evaluating it on the console, either with or without `print()`.

We see that the axes have already been labeled based on the variable names.

5.3 Step 3: Add geoms

We can now add “layers” to the plot object using `geom_` functions. The syntax works like this: we “add” a geom to the plot object with `+`.

5.3.1 Scatter diagram

We first try to represent the observations as points:

The points are now shown in different colors, but points within a gender may be plotted on top of each other if they have the same value (overplotting). For this case there is a function, `geom_jitter()`, that draws the points next to each other with “jittering”:

`geom_jitter()` has an argument, `width`, that we can use to determine how wide the spread of the points is.

`geom_jitter()` has further arguments: `size` determines the diameter of the points, and `alpha` determines the transparency.
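The difference between plain and jittered points can be sketched like this. The data (`ToothGrowth`) and the specific values for `width`, `size` and `alpha` are assumptions chosen for illustration.

```r
library(ggplot2)

p <- ggplot(ToothGrowth,
            aes(x = supp, y = len, color = supp, fill = supp))

# Plain points: identical values within a group overlap (overplotting)
p + geom_point()

# Jittered points: width sets the horizontal spread, size the point
# diameter and alpha the transparency (0 = invisible, 1 = opaque)
p_jitter <- p + geom_jitter(width = 0.2, size = 3, alpha = 0.6)
p_jitter
```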

5.3.2 Representing the distribution graphically

Another possibility would be to show the central tendency and dispersion of the data with a box plot or violin diagram.

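A box plot shows the median as a line; the rectangle represents the middle 50% of the data, and the “whiskers” extend to the most extreme observations that lie within 1.5 times the interquartile range of the box. Observations beyond the whiskers are shown as dots (outliers). When both `color` and `fill` are mapped to the same variable, the median line has the same color as the box and is hard to see, so it is better to leave out one of the two attributes:

A violin plot is similar to a box plot, but shows a “kernel density estimate” instead of the quantiles. A violin plot looks best when we use the `fill` attribute.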
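A minimal sketch of both geoms, using `ToothGrowth` as stand-in data (an assumption). Only `fill` is mapped here, so that the median line of the box plot stays visible; which attribute the chapter itself omits is also an assumption.

```r
library(ggplot2)

p <- ggplot(ToothGrowth, aes(x = supp, y = len, fill = supp))

# Box plot: median, middle 50% (box), whiskers and outliers
p_box <- p + geom_boxplot()
p_box

# Violin plot: kernel density estimate per group
p_violin <- p + geom_violin()
p_violin
```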

If we find that a mapping should not apply to all “layers”, then we can define it individually for each “layer” instead of in `ggplot()`:
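A sketch of a layer-specific mapping, with `ToothGrowth` as stand-in data (an assumption): the axis mappings are global, while `fill` is defined only inside the box plot layer.

```r
library(ggplot2)

# Global mappings: only the axes
p <- ggplot(ToothGrowth, aes(x = supp, y = len))

# Layer-specific mapping: fill applies to this geom only
p_layer <- p + geom_boxplot(aes(fill = supp))
p_layer
```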

5.3.3 Combine several layers

We can also use several “layers”. We just need to combine several `geom_` functions with `+`:
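As an illustration, the following sketch stacks a violin layer and a jittered point layer, and adds a theme with a white background. The data (`ToothGrowth`), the parameter values and the choice of `theme_bw()` are assumptions.

```r
library(ggplot2)

p_combined <- ggplot(ToothGrowth, aes(x = supp, y = len)) +
  geom_violin(aes(fill = supp), alpha = 0.4) +  # layer 1: distribution
  geom_jitter(width = 0.1, alpha = 0.6) +       # layer 2: raw observations
  theme_bw()                                    # white background
p_combined
```

Layers are drawn in the order in which they are added, so the points end up on top of the violins.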

Take a look at the graphic examples in the previous chapters. Do you understand the code now?

In the previous examples we did not create a plot object, but rather passed the data set to `ggplot()` with the pipe operator, and then added the geoms directly with `+`. We have also used other functions, e.g. a theme function such as `theme_bw()` to display the background in white.

5.4 Geoms for different data types

To summarize: so far we have learned that we put together a plot in several steps. We start with a data frame and define a plot object with the function `ggplot()`. With the function `aes()` we assign variables of a data frame to the X or Y axis and define further “aesthetic mappings”, e.g. a color coding based on a grouping variable. Then we add graphic elements with `geom_` functions as “layers” to the plot object.

Now let's look at a selection of geoms for different combinations of variables. We can either represent one variable on the X-axis or two variables on the X- and Y-axes, and these variables can be either continuous or categorical.

For the following examples we will use two data sets:

5.4.1 One variable

If we only want to graph one variable on the X-axis, we still have to display values on the Y-axis. This will often be a descriptive summary such as frequencies.

Categorical variables

When we graph a categorical variable, we often use a bar chart. This shows, for example, the frequencies of the various categories as rectangular bars. The function used for this is `geom_bar()`.

As an example, let's plot the frequencies of the father's four levels of education.

The function `colors()` gives you an overview of the possible color names. There are 657 of them; here we show only 15 randomly selected ones:
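A short sketch of how such a random selection could be drawn. The name `all_colors` and the seed value are hypothetical; the seed only makes the selection reproducible.

```r
# All built-in color names known to R
all_colors <- colors()
length(all_colors)

# Draw 15 names at random (seed chosen arbitrarily)
set.seed(1)
sample(all_colors, 15)
```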

Here, too, we can specify a grouping variable that we use to color-code the rectangles.

By default, `geom_bar()` creates a stacked bar chart, i.e. the rectangles are stacked on top of each other. If this is not desired, we can use the function's `position` argument. With `position = "dodge"` we specify that the bars should be drawn next to each other.

As a third variant we can use `position = "identity"`; the bars are then drawn overlapping, one in front of the other. Since the rectangle at the back would no longer be visible, we use the `alpha` argument to make the bars transparent.
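The three variants can be compared with the following sketch. It uses the `mpg` data set that ships with `ggplot2` as a stand-in (an assumption), with `class` on the X-axis and `drv` as the grouping variable.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = class, fill = drv))

# Default: stacked bars
p_stack <- p + geom_bar()
p_stack

# Bars next to each other
p_dodge <- p + geom_bar(position = "dodge")
p_dodge

# Overlapping bars, made transparent so the ones at the back stay visible
p_identity <- p + geom_bar(position = "identity", alpha = 0.5)
p_identity
```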

Continuous variables

If the variable that we want to display graphically is not categorical but continuous, a histogram can be used; we generate this with the function `geom_histogram()`. As an example, consider psychological stress.

A histogram provides a graphical representation of the distribution of a numeric variable. To do this, the values of the variable are divided into discrete intervals, or bins. The frequencies in the respective intervals are then displayed on the Y-axis, similar to a bar chart. Determining the width of the intervals is critical. If we don't specify anything, `ggplot2` chooses one itself, but we can also specify it ourselves with the `binwidth` argument.

The choice of `binwidth` depends, of course, on the scale of the variable and should be neither too fine nor too coarse.
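The effect of the bin width can be sketched as follows, using the numeric variable `len` from the built-in `ToothGrowth` data as a stand-in (an assumption). The two bin widths are arbitrary illustration values.

```r
library(ggplot2)

p <- ggplot(ToothGrowth, aes(x = len))

# Narrow bins: a detailed but noisy picture
p_fine <- p + geom_histogram(binwidth = 1)
p_fine

# Wide bins: a smoother but coarser picture
p_coarse <- p + geom_histogram(binwidth = 5)
p_coarse
```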

Try several values for the binwidth. What is optimal?

If we want to see the relative frequencies on the Y-axis instead of the absolute frequencies, we can map `y` to a computed statistic as an argument of the `aes()` function.
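One possible way to obtain relative frequencies is sketched below; it is an assumption, not necessarily the expression used in this chapter. `after_stat()` gives access to the values computed by the histogram statistic, so `count / sum(count)` is the proportion of observations per bin. The data (`ToothGrowth`) are a stand-in.

```r
library(ggplot2)

p_rel <- ggplot(ToothGrowth, aes(x = len)) +
  # proportion of observations per bin instead of the raw count
  geom_histogram(aes(y = after_stat(count / sum(count))), binwidth = 5) +
  ylab("relative frequency")
p_rel
```

Using `after_stat(density)` instead would scale the bars so that their total area is 1, which is a density rather than a relative frequency.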

Of course, there is also the option of using a grouping variable for histograms.

As with the bar chart, the histograms are “stacked” by default. If we want them overlaid instead, we use `position = "identity"`.

Side by side also works:

5.4.2 Two variables

Now we show two variables of a data set together. Here, too, the possible geoms depend on the data type of the variables.

X and Y continuous

If both variables are continuous, we can show their relationship using a scatter plot or a line diagram. We use the functions `geom_point()` and `geom_line()`, respectively.

As an example we want to visualize the connection between psychological stress and life satisfaction.

We already got to know the arguments `size` and `alpha` above, as well as the possibility of avoiding overplotting with the function `geom_jitter()`. Both `geom_point()` and `geom_jitter()` also have a `color` and a `fill` argument.

The grouping based on a categorical variable also works here. We use both the color and the shape of the dots to better distinguish the categories.
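A sketch of a grouped scatter plot, using the `mpg` data set from `ggplot2` as a stand-in (an assumption): `displ` and `hwy` are continuous, `drv` is the categorical grouping variable. The jitter and size values are arbitrary.

```r
library(ggplot2)

p_scatter <- ggplot(mpg,
                    aes(x = displ, y = hwy, color = drv, shape = drv)) +
  # slight jitter against overplotting; color and shape encode the group
  geom_jitter(width = 0.05, height = 0.5, size = 2, alpha = 0.7)
p_scatter
```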

With the function `geom_line()` we can create line graphs. As an example, we want to calculate the mean school grade for the father's various educational levels in a new data frame, and then display it graphically. Before we plot the average grade, let's convert the factor to a numeric variable.

We could also use it as a factor, but would then have to define a grouping variable for `geom_line()`. In this case, since we only have one group, we would use `group = 1`:
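Both variants can be sketched with stand-in data (an assumption): the mean of `len` per `dose` in the built-in `ToothGrowth` data takes the place of the mean grade per educational level. The names `means`, `dose_f`, `p_num` and `p_fac` are hypothetical.

```r
library(ggplot2)

# Group means in a new data frame
means <- aggregate(len ~ dose, data = ToothGrowth, FUN = mean)

# Numeric X variable: no grouping needed
p_num <- ggplot(means, aes(x = dose, y = len)) +
  geom_line()
p_num

# Factor on the X-axis: group = 1 tells geom_line() that all
# points belong to one line
means$dose_f <- factor(means$dose)
p_fac <- ggplot(means, aes(x = dose_f, y = len, group = 1)) +
  geom_line()
p_fac
```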

As in the “Honeymoon or Hangover” example, we can now add dots to the line diagram:

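`geom_line()` also has arguments to change the line properties. In this case we use the `linetype` argument, which can take values such as `"solid"`, `"dashed"`, `"dotted"`, `"dotdash"`, `"longdash"` and `"twodash"`.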
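A sketch of a line diagram with added points and a changed line type. The data (group means from `ToothGrowth`) are a stand-in, and the chosen line type and point size are arbitrary.

```r
library(ggplot2)

means <- aggregate(len ~ dose, data = ToothGrowth, FUN = mean)

p_line <- ggplot(means, aes(x = dose, y = len)) +
  geom_line(linetype = "dashed") +  # line property set as a fixed value
  geom_point(size = 3)              # points on top of the line
p_line
```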

X categorical and Y continuous

If one of the variables is categorical, instead of using it as a grouping variable, we can plot it on an axis.

We have already seen examples of this above: there we plotted psychological stress against gender and used functions such as `geom_boxplot()` and `geom_violin()`. But we can also use the function `geom_bar()` for two variables. In this case, the variable on the Y-axis is totaled for all observations in the categories on the X-axis. Since this does not require a statistical transformation, we use the argument `stat = "identity"`.
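The totaling behavior can be sketched with stand-in data (an assumption): in `ToothGrowth`, the values of `len` are summed within each level of `supp`. `geom_col()` is a shorthand for `geom_bar(stat = "identity")`.

```r
library(ggplot2)

# The Y values are used as they are and stacked within each category,
# so the bar height is the sum of len per level of supp
p_sum <- ggplot(ToothGrowth, aes(x = supp, y = len)) +
  geom_bar(stat = "identity")
p_sum

# The totals the bars should reach
tapply(ToothGrowth$len, ToothGrowth$supp, sum)
```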

As an example, consider the fertility data set. In this study, participants were asked whether they want a child or not (binary answer). In addition, the closeness to one's own mother, the emotional attitude towards children and the gender were recorded. On the Y-axis we show the absolute frequencies of a “yes” answer.

In order to better understand this graph, we also calculate the relative frequencies of a “yes” answer per gender.

X and Y categorical

Finally, the variables can be categorical on both the X and Y axes. In this case, it makes sense to display the joint frequencies graphically. That's what the function `geom_count()` is for.

As an example, let us consider the joint frequency distribution of the education of the father and the education of the mother.

`geom_count()` counts the joint frequencies of the categories of the two variables and displays them as the size of the points.
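A sketch with stand-in data (an assumption): in the `mpg` data set from `ggplot2`, `drv` and the number of cylinders `cyl` (converted to a factor) play the role of the two categorical variables.

```r
library(ggplot2)

# One point per combination of categories; the point size
# reflects how many observations fall into that combination
p_count <- ggplot(mpg, aes(x = drv, y = factor(cyl))) +
  geom_count()
p_count

# The same joint frequencies as a table
table(mpg$drv, mpg$cyl)
```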

Examples

We now consider two worked examples.

Job satisfaction

In the first example we reload the job satisfaction data set from the last chapter and take a closer look at the graphic.

We now use a line diagram to show the mean job satisfaction across the measurement times, and we use a grouping factor. The `group` argument is important here; the other two arguments are only there for aesthetic reasons and could be left out.

As already indicated above, the `group` argument is needed if we want to create a line chart with a categorical variable on the X-axis.
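The role of `group` can be sketched with stand-in data (an assumption), since the job satisfaction data set is not reproduced here: `dose` from `ToothGrowth`, converted to a factor, takes the place of the measurement time, and `supp` the place of the grouping factor. Using `color` and `linetype` as the two purely aesthetic arguments is also an assumption.

```r
library(ggplot2)

# Mean per combination of X category and group
means <- aggregate(len ~ dose + supp, data = ToothGrowth, FUN = mean)
means$dose <- factor(means$dose)

p_groups <- ggplot(means,
                   aes(x = dose, y = len,
                       group = supp,       # needed: one line per group
                       color = supp,       # optional, aesthetic only
                       linetype = supp)) + # optional, aesthetic only
  geom_line() +
  geom_point()
p_groups
```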

Line charts are often used to show the course of a variable over time. This means that we represent the time on the X-axis, as in this example. Most of the time, however, time is used as a continuous variable and not as a factor, as is the case here.

Wide vs. Long: Parental Education

Using the next example, let's look at the difference between the long and the wide format. We used the wide format when we plotted the joint frequency distribution of the father's and the mother's education: the two were separate variables, which we plotted on separate axes. However, we could also treat father and mother as levels of a repeated-measures factor, and the educational levels as a measurement variable, i.e. as a key/value pair. We do this when we want to use the educational level as a variable on an axis and the parent as a grouping variable.

This may not be easy to understand, so let's look at a specific example. As above, we want to graphically display the mean school grade for the various educational levels of the parents. This time we do this for both parents, and we want separate lines for father and mother. For this it is important that we have a data set in long format.
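The reshaping step can be sketched as follows. The data frame `wide` and all of its values are hypothetical and serve only to illustrate the structure; the column names `id`, `grade`, `father`, `mother`, `parent` and `education` are assumptions as well. The sketch uses base R only; `tidyr::pivot_longer()` achieves the same result.

```r
library(ggplot2)

# Hypothetical wide data: one row per person, one column per parent
wide <- data.frame(
  id     = 1:6,
  grade  = c(4.5, 5.0, 3.5, 4.0, 5.5, 4.5),
  father = c(1, 2, 1, 3, 3, 2),   # educational level of the father
  mother = c(2, 2, 1, 3, 2, 1)    # educational level of the mother
)

# Long format: one row per person and parent (key = parent,
# value = education)
long <- data.frame(
  id        = rep(wide$id, times = 2),
  grade     = rep(wide$grade, times = 2),
  parent    = rep(c("father", "mother"), each = nrow(wide)),
  education = c(wide$father, wide$mother)
)

# Mean grade per educational level, separately for each parent
means <- aggregate(grade ~ education + parent, data = long, FUN = mean)

p_parents <- ggplot(means, aes(x = education, y = grade, color = parent)) +
  geom_line() +
  geom_point()
p_parents
```

In the long format, `parent` is an ordinary variable, so it can be mapped to `color` to obtain one line per parent.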