VizW(h)iz 1: Plotting Raw Data

R-Ladies Sydney

2018/12/20

Before we begin, a preamble…

Why plot raw data?

Until I1 started learning R about a year ago, I had only ever plotted averaged data. My plots looked like this…

Or this….

In 2016, I remember seeing #barbarplots t-shirts and tote bags at a conference, but back then I didn’t know that the figures were made in R.

Check out this video for more info about the #barbarplots kickstarter campaign. And learn how to explode bar plots using code from here.

So, if not bar plots, then what?

Bar plots with standard error bars obviously have their limitations, but what is the alternative? In this lesson, we’re going to learn about how to plot raw data using ggplot2. As in prior units, we’ll be using the sydneybeaches dataset, so be sure to have that on hand!

Lesson Outcomes

By the end of this lesson, you should:

    1.1 Be able to use geom_point, geom_jitter, and geom_quasirandom to plot raw data     1.2 Be able to use colour to differentiate subsets of data     1.3 Be able to use facet_wrap to plot subsets of data separately     1.4 Be able to combine dpylr functions like filter with ggplot     1.5 Be able to use ggsave to export plots

1.1 Plotting bug levels by year

If the councils that look after our beaches are doing their job, beaches should be less contaminated than they were 5 years ago, right? Let’s plot bug levels over time and see whether things seem to be getting better.

In this screencast, we’ll review:

Here’s the plot for reference:

Watch the video and then carry out the following steps:

  1. Use geom_point to plot buglevels by year
  2. Play with geom_jitter and geom_quasirandom; which one is your favourite?
  3. Check out the ggbeeswarm vignette and try to make your quasirandom plot “smiley” or “frowny”
  4. Try geom_beeswarm too!

1.2 Using colour to plot bug levels by site

There doesn’t seem to be an obvious improvement in bug levels in the past 5 years. I wonder whether some beaches are just more variable than others. Let’s use colour to differentiate between different sites.

In this screencast, we’ll review:

Here’s the plot for reference:

Watch the video and then carry out the following steps:

  1. Use geom_jitter to plot bug levels for each site, differentiating between values from 2013-2018 with different coloured points
  2. Try colouring the points by another variable (perhaps council or month). How does the visualisation change?

1.3 Using facet_wrap() to plot sites separately

Using colour is one way to differentiate between data points associated with different variables. Alternatively, you can group the data into different mini-plots using facet_wrap. In this screencast, we’ll review:

Here’s the plot for reference:

Watch the video and perform the following steps:

  1. Use geom_jitter and facet_wrap to plot buglevels by year, separately for each site
  2. Try plotting buglevels by site, separately for each year. Hint: maybe using coord_flip is a good idea!

1.4 Putting it all together (dplyr + ggplot)

What if you only want to compare a couple of sites? Or restrict the range of scores to exclude obvious outliers? You can combine dyplyr functions like filter with ggplot using the pipe %>%. (The pipe was covered in #RYouWithMe Unit Clean It Up Lesson 1! Need a refresher? Click here!)

In this screencast, we’ll review:

Here’s the plot for reference:

Watch the video and perform the following steps:

  1. Pipe a filter function and a ggplot together to plot only values that are less than 1000
  2. Try changing the filter value to only plot values less than 500
  3. Use filter and ggplot to compare data from Coogee and Bondi
  4. Pick your two favorite beaches and make a plot that compares the variability in beachbug values

1.5 How to get your plots out of R Studio

Of course, these beautiful plots are not much use if they are stuck in R Studio. The easiest way to export your plots and save them elsewhere on your computer is by using ggsave.

In this screencast, we’ll review:

Watch the video and perform the following steps:

  1. Add a ggsave call after each chunk of code in your script to save your plots to a folder in your RYouWithMe project called “output”
  2. Use the source button to run your code from top to bottom.
  3. For Sydney-based R-Ladies: Pick your favourite exported plot and post it to the #ryouwithme_3_vizwhiz channel on Slack!

Challenge - plot your own data

The next step is to plot your own data as raw values.

Steps to plotting your own raw data:

  1. Remember that ggplot only likes “long” data, so if you have observations across several columns, go back to Clean it up Lesson 4 and brush up on how to convert your wide data into long format
  2. Pick a categorical variable for your x axis
  3. Pick a continuous variable for your y axis
  4. Try out geom_point, geom_jitter, or geom_quasirandom and see which one makes the most sense for your data
  5. Export using ggsave

On to Lesson 2

If you are one of our Sydney-based RLadies, share your success (and /or your frustration!) in our Slack channel #ryouwithme_3_vizwhiz!

Looking for more?

Here are some additional resources:

  • Paul van der Laken makes a good argument against bar graphs in his blog post here
  • Hadley Wickham’s book R for Data Science is a good place to start re data visualisation.
  • Nick Tierney and Saskia Freytag were talking data visualisation with Di Cook on the Credibly Curious podcast recently. Check out the episode here.

  1. Jen Richmond, R-Ladies Sydney co-founder and #RYouWithMe screencaster! (Twitter: @JenRichmondPhD)