Clean It Up 2: Exploring Rows

R-Ladies Sydney


Now that we have cleaned up our column names to make them easier to work with, we can start to answer some questions about what’s in those rows! In this lesson, we’re going to filter, arrange, group_by and summarise the beaches data to answer the following questions:

Lesson Outcomes

By the end of the lesson, you should:

    2.1 Know how to use arrange to sort a dataframe and filter to select parts of the dataframe
    2.2 Know how to use group_by and summarise to get summary statistics
    2.3 Be able to pipe these functions together to answer questions about your data

Question A: Which beach has the highest recorded bacteria levels?

When we first looked at a summary of the sydneybeaches data, we could see that the highest value of beach bacteria in the dataset was 4900. I wonder which beach that came from? Here, we use arrange to sort the data by beachbugs in descending order, so the highest values come first. We can also use the pipe to combine filter and arrange to look at extreme values within a particular site.
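Here is a minimal sketch of that idea. It runs on a tiny made-up data frame called toy_beaches (a hypothetical name), not the real beaches data, so the numbers mean nothing. The column names (council, site, beachbugs) and the way the site names are written are assumptions based on the cleaned data from the previous lesson; swap in your own.

```r
library(dplyr)

# Made-up illustrative values, NOT the real beach data.
# Column names and site/council spellings are assumptions.
toy_beaches <- data.frame(
  council   = c("Randwick Council", "Randwick Council", "Waverley Council",
                "Waverley Council", "Randwick Council"),
  site      = c("Coogee Beach", "Coogee Beach", "Bondi Beach",
                "Bondi Beach", "Clovelly Beach"),
  beachbugs = c(120, 15, 40, 300, 8)
)

# Sort the whole data frame so the highest bacteria values come first
sorted_all <- toy_beaches %>%
  arrange(desc(beachbugs))

# Keep only one site, then sort it to see that site's most extreme values
worst_coogee <- toy_beaches %>%
  filter(site == "Coogee Beach") %>%
  arrange(desc(beachbugs))

sorted_all
worst_coogee
```

Without desc(), arrange sorts from lowest to highest, which is handy if you want to find the cleanest days instead.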

In this screencast, we’ll review:

Watch the video and then carry out the following steps:

  1. Sort the Sydney beaches data by beachbugs in descending order
  2. Pick your favourite beach and determine whether its most extreme beachbug values are higher or lower than the worst day at Coogee.

Question B: Does Coogee or Bondi have more extreme bacteria levels? Which beach has the worst bacteria levels on average?

“Where should I swim?” you might ask… Well, to answer that question we need to compare bacteria levels across sites.

To do this, you can give filter more than one value to match. For example, you can use the %in% operator to keep rows from either Coogee or Bondi.
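As a rough sketch, filtering for two sites might look like the code below. It uses the same kind of made-up toy_beaches data frame (hypothetical name and invented values), and the column and site names are assumptions, so adjust them to match your cleaned data.

```r
library(dplyr)

# Made-up illustrative values, NOT the real beach data.
# Column names and site/council spellings are assumptions.
toy_beaches <- data.frame(
  council   = c("Randwick Council", "Randwick Council", "Waverley Council",
                "Waverley Council", "Randwick Council"),
  site      = c("Coogee Beach", "Coogee Beach", "Bondi Beach",
                "Bondi Beach", "Clovelly Beach"),
  beachbugs = c(120, 15, 40, 300, 8)
)

# %in% keeps rows whose site matches ANY of the values in the vector
two_sites <- toy_beaches %>%
  filter(site %in% c("Coogee Beach", "Bondi Beach"))

# The same filter written with | (which means "or")
either_site <- toy_beaches %>%
  filter(site == "Coogee Beach" | site == "Bondi Beach")

# Conditions separated by commas must ALL be true (they are combined with "and")
coogee_high <- toy_beaches %>%
  filter(site == "Coogee Beach", beachbugs > 100)

two_sites
coogee_high
```

One thing to watch: separating conditions with a comma means "and", so filter(site == "Coogee Beach", site == "Bondi Beach") returns no rows at all. For "either/or" questions, reach for %in% or |.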

In this screencast, we’ll review:

Watch the video and then carry out the following steps:

  1. Pick two beaches to compare using filter and the %in% operator
  2. Use group_by and summarise to work out which beach has the worst bacteria levels on average.
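The two steps above could be piped together along these lines. This is only a sketch on a made-up toy_beaches data frame (hypothetical name, invented values); the column names, the site names, and the summary names mean_bugs and max_bugs are all assumptions chosen for illustration.

```r
library(dplyr)

# Made-up illustrative values, NOT the real beach data.
# Column names and site/council spellings are assumptions.
toy_beaches <- data.frame(
  council   = c("Randwick Council", "Randwick Council", "Waverley Council",
                "Waverley Council", "Randwick Council"),
  site      = c("Coogee Beach", "Coogee Beach", "Bondi Beach",
                "Bondi Beach", "Clovelly Beach"),
  beachbugs = c(120, 15, 40, 300, 8)
)

site_summary <- toy_beaches %>%
  # Step 1: keep only the two beaches we want to compare
  filter(site %in% c("Coogee Beach", "Bondi Beach")) %>%
  # Step 2: make one group per site, then collapse each group to one row
  group_by(site) %>%
  summarise(
    # na.rm = TRUE skips missing values, in case your data has any
    mean_bugs = mean(beachbugs, na.rm = TRUE),
    max_bugs  = max(beachbugs, na.rm = TRUE)
  )

site_summary
```

The result has one row per site, so you can compare the average (mean_bugs) and the most extreme value (max_bugs) side by side.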

Question C: Which council does the worst job at keeping their beaches clean?

Let’s practise our new dplyr skills by grouping by council (instead of site) and using summarise to see which council does the worst job at keeping its beaches clean.
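One possible way to set this up is sketched below, again on a made-up toy_beaches data frame (hypothetical name, invented values, assumed column and council names). Adding arrange at the end of the pipe ranks the councils, so the one with the highest average bacteria level lands in the first row.

```r
library(dplyr)

# Made-up illustrative values, NOT the real beach data.
# Column names and site/council spellings are assumptions.
toy_beaches <- data.frame(
  council   = c("Randwick Council", "Randwick Council", "Waverley Council",
                "Waverley Council", "Randwick Council"),
  site      = c("Coogee Beach", "Coogee Beach", "Bondi Beach",
                "Bondi Beach", "Clovelly Beach"),
  beachbugs = c(120, 15, 40, 300, 8)
)

council_summary <- toy_beaches %>%
  # One group per council this time, rather than per site
  group_by(council) %>%
  summarise(mean_bugs = mean(beachbugs, na.rm = TRUE)) %>%
  # Highest average first, so the "worst" council is in row 1
  arrange(desc(mean_bugs))

council_summary
```

An average is only one way to define "worst". You could also summarise with max or median and see whether the ranking changes.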

In this screencast, we’ll review:

Now have a go with your own data!

Sydney-based R-Ladies - share your successes and any challenges you’ve faced in the #ryouwithme_2_cleaning Slack channel!

Next up - Clean It Up Lesson 3: Making New Variables

P.S. Interested in more dplyr tutorials?

Check out this blog series by R-Lady Suzan Baert!