CleanItUp 3

Making New Variables

What if the data you are really interested in isn’t in your dataframe yet? Perhaps you want to break data contained in one variable across many variables, or to combine data from several columns into one. Maybe you want to transform your data or compute difference scores.

In this lesson, we will continue to explore the sydneybeaches data, learning how to make new variables in using separate, unite, mutate and other functions from dpylr.

Lesson Outcomes

By the end of the lesson, you should:

  • 3.1 Know how to use separate and unite to create new variables in your data
  • 3.2 Know how to use mutate to compute new variables (numeric and logical)
  • 3.3 Know how to pipe filter, arrange, group_by, and mutate together to accomplish a lot, with relatively few lines of code.

3.1 Use separate and unite to create new variables

We are going to cheat a little bit with the date column here. We will learn how to use the lubridate package eventually, but for now, we can capitalise on the fact that R thinks our date column contains characters to practice splitting a single variable into several variables using the separate function.

In this screencast, we’ll review:

  • How to separate the date column into day, month, year
  • How to unite data from the site and council columns to create a new variable called site_council

Watch the video and then carry out the following steps:

  1. Split the date column into a day, month, and year column
  2. Combine the site and council columns into a single variable

3.2 Use mutate to compute new variables

Sometimes the data you are most interested are not in your dataframe yet, you need to computer them. The mutate function allows you to compute a new variable and add it to your dataframe.

In this screencast, we’ll review:

  • How to use the mutate function to
    • transform your data
    • compute numeric variables
    • compute logical variables

Watch the video and then carry out the following steps:

  1. Compute a variable that log transforms the beachbugs data
  2. Compute a variable that contains beachbugs difference scores
  3. Compute a variable that contains TRUE/FALSE according to whether each reading is greater than the mean bug levels

3.3 Pipe it all together

In Clean It Up Lesson 1 you learned about the pipe %>% - which can help you to string a whole series of wrangling functions together. To review, you can take your data, apply a function, take that output, apply another function, etc etc until you have added a series of new variables, all in a single chunk of code.

In this screencast, we’ll review:

  • How to pipe together a sequence of dplyr functions and assign the output to a new object in your environment

Watch the video and then create a new dataframe called cleanbeaches_new by piping together the following steps…

  1. Separate the date column into day, month, year
  2. Create a new column that contains the log transformed beach bugs data
  3. Create a new column that contains the difference scores
  4. Create a new column that contains a logical vector re whether each beachbug reading is higher than average
  5. Group_by site
  6. Create a new column that contains a logical vector re whether each beachbug reading is higher than average, for each site

Now have a go with your own data!

  • Choose a variable in character format and separate it into several columns
  • Pick two character vectors, and combine them using the unite function
  • Use mutate to transform your data, compute a numeric variable, and compute a new logical variable.

Next up - Clean It Up Lesson 4: Wide to Long

Sydney-based R-Ladies - share your successes and any challenges you’ve faced in the #ryouwithme_2_cleaning Slack channel!

Previous
Next