A 'How-To' on Data Wrangling
Katherine Wood is a graduate student in the University of Illinois at Urbana-Champaign Psychology Department studying failures of awareness and is an advocate for open, reproducible science and open-source software. Also a programmer and statistical consultant, she blogs at Inattentional Coffee, and this post from that blog is reposted with permission.
I found this Twitter thread on the vagaries of data wrangling, and the way they can kill momentum, interesting: particularly the notion of how frustrating it must be to be at point A with your data, see point B, where you’d like to go, and have no idea how to get there. Even for those with programming experience, data wrangling can be an enormous chore.
To that end, I thought I’d walk through a basic overview of how to accomplish some of the operations you might commonly encounter when you first get a data set. If you’re generating the data yourself, you can try to make your life easier by saving it in the format you want. Where possible, I’ll show multiple ways to accomplish something and try to highlight packages that will make things easier.

MANDATORY DISCLAIMER: There are often at least six ways to do anything in R. If you don’t like any of the methods here, rest assured that there are others (and probably better ones, too); you can almost certainly find something that suits your style.
Reading in data
CSV Files
First, I’m going to create some data to actually read in. The following code chunk will write three csv files to the current directory, each with 3 columns of random data. This is meant to simulate raw data from three different subjects.
#Generate some dummy data
data <- replicate(3, mapply(rnorm, c(100, 100, 100), c(10, 100, 1), c(2, 25, .5)),
                  simplify = FALSE)
catch_output <- mapply(write.csv, data,
                       mapply(paste, rep("data", times=length(data)),
                              seq(1, length(data)),
                              rep(".csv"), sep=""),
                       row.names=FALSE)
Here’s a common situation: one spreadsheet has all of a subject’s raw data, and you have a few dozen spreadsheets. First, let’s talk about an easy way to read that in as painlessly as possible. No for-loops needed here; we’ll just use the trusty apply() family from base R.
#Note that if your data are not in your current directory, you need to either:
#Call, for ex., setwd('~/my_data_folder') to set the data folder as the current directory
#Specify the path in list.files and then have it return the full path name of each file, rather than
#the relative path.
alldata <- lapply(list.files(pattern = '*.csv'), read.csv)
We’re accomplishing a few things in one line. First, the call to list.files simply lists all of the files in your current directory. It has a bunch of optional arguments, too. You can specify a pattern, which is just a regex expression that specifies what you want. Here, I only want .csv files, so I specify that I want any file (“*” is a wildcard symbol allowing anything) that ends in the extension .csv. I can specify other arguments, like whether I want the full path names returned, whether I want it to also search sub-directories for files, and so on.
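For example, here’s a quick sketch (the folder name is a made-up stand-in) of pulling in files that live somewhere other than the current directory, sub-directories and all:

#'~/my_data_folder' is a hypothetical location for your raw files
#full.names=TRUE returns complete paths, so read.csv can find each file
files <- list.files(path = '~/my_data_folder', pattern = '*.csv',
                    full.names = TRUE, recursive = TRUE)
alldata <- lapply(files, read.csv)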
After we have this list of files, we simply iterate over it and call read.csv on each one. The end result is a list, wherein each element is one subject’s data frame.

Now, a list of data frames is not the most useful data format to have. Fortunately, it’s easy to bind this list together into one big data frame. Here’s how to bring it all together in base R.
subjects_all <- do.call('rbind', alldata)
head(subjects_all)

## V1 V2 V3
## 1 8.949957 124.80946 1.3867713
## 2 11.926203 117.26313 1.0590077
## 3 6.922328 78.01893 1.1814264
## 4 9.323430 91.66125 0.5276688
## 5 11.648274 140.93698 1.6975092
## 6 6.048526 65.44654 1.3584207
Note that the information about which data belong to each subject is lost here. You’ll need to add an identifier column, or make sure that each file has one, before reading it in this way.
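One way to do that in base R is to stamp each data frame with its position in the list before binding; here’s a minimal sketch:

#Map() pairs each data frame with its index in the list,
#and the new subject column records that index
alldata_labeled <- Map(function(df, id) {df$subject <- id; df},
                       alldata, seq_along(alldata))
subjects_all <- do.call('rbind', alldata_labeled)

How about some good old dplyr?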
library(dplyr)
subjects_all <- bind_rows(alldata, .id='subject')
head(subjects_all)

## subject V1 V2 V3
## 1 1 8.949957 124.80946 1.3867713
## 2 1 11.926203 117.26313 1.0590077
## 3 1 6.922328 78.01893 1.1814264
## 4 1 9.323430 91.66125 0.5276688
## 5 1 11.648274 140.93698 1.6975092
## 6 1 6.048526 65.44654 1.3584207
unique(subjects_all$subject)
## [1] "1" "2" "3"
We can use the .id argument to specify an ID column, which will keep track of where the data comes from.

We can also use the handy rbindlist function from the data.table/dtplyr package. This will label the data automatically for us according to which data frame it came from; we can call this new column (specified by the idcol argument) anything we like.
library(data.table)
subjects_all <- rbindlist(alldata, idcol='subject')
head(subjects_all)

## subject V1 V2 V3
## 1: 1 8.949957 124.80946 1.3867713
## 2: 1 11.926203 117.26313 1.0590077
## 3: 1 6.922328 78.01893 1.1814264
## 4: 1 9.323430 91.66125 0.5276688
## 5: 1 11.648274 140.93698 1.6975092
## 6: 1 6.048526 65.44654 1.3584207
unique(subjects_all$subject)
## [1] 1 2 3
Note also that rbindlist() is an order of magnitude faster than do.call. If you’ve got a lot of data, you’ll probably want to go with this function. data.tables are extremely fast and memory efficient in general, and might be a good option if you’re working with truly huge amounts of data. For most uses, though, this kind of optimization isn’t really necessary.
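If you’re curious, here’s a purely illustrative taste of data.table syntax, using the combined data from above:

#rbindlist already returns a data.table; as.data.table would convert a plain data frame
subjects_dt <- as.data.table(subjects_all)
#dt[i, j, by] syntax: compute each subject's mean of V1
subjects_dt[, .(mean_V1 = mean(V1)), by = subject]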
Text files
Same process. I’ll make some real quick:
catch_output <- mapply(write.table, data,
                       mapply(paste, rep("data", times=length(data)),
                              seq(1, length(data)),
                              rep(".txt"), sep=""),
                       row.names=FALSE)
To read it in, we just call read.table instead.
allsubjs <- lapply(list.files(pattern = '*.txt'), read.table, header=TRUE, colClasses=c('double'))
head(allsubjs[[1]])

## V1 V2 V3
## 1 8.949957 124.80946 1.3867713
## 2 11.926203 117.26313 1.0590077
## 3 6.922328 78.01893 1.1814264
## 4 9.323430 91.66125 0.5276688
## 5 11.648274 140.93698 1.6975092
## 6 6.048526 65.44654 1.3584207
Just like before, we end up with a list of data frames, one for each subject. In read.table(), unlike read.csv, the header argument defaults to FALSE, so be sure to change that. I also specify colClasses here to tell R what type of data the content of the columns is. Without that, these doubles get read in as factors; doing it now saves a little work later. We can then bind these together with rbindlist just like we did when we used read.csv.
XLS(X)
Need to read in Excel files, or read in each sheet in one file as a separate set of data? Jenny Bryan et al.’s readxl package to the rescue.
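Here’s a sketch of how that might look (the file name is a hypothetical stand-in):

library(readxl)
#Read a single sheet, by position or by name
one_sheet <- read_excel('experiment.xlsx', sheet = 1)
#Read every sheet into a list of data frames, one element per sheet
path <- 'experiment.xlsx'
all_sheets <- lapply(excel_sheets(path), function(s) read_excel(path, sheet = s))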
Some general notes
There are a lot of arguments you can specify in the read.csv function call that can save you work down the line; I used some of them when I was reading in the text files, but there are many more. You can even tell the function what strings should be read as NA values! This is really handy if you have NULL and want R to treat that as NA. You can also read in only part of the file, which is useful if you have a monster file and want to read it in in chunks.
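For instance, a sketch against the data1.csv file from earlier:

#Treat the literal string 'NULL' (and empty cells) as NA on the way in
d <- read.csv('data1.csv', na.strings = c('NA', 'NULL', ''))
#Read a 50-row chunk: skip the header line, read 50 rows, re-supply column names
chunk <- read.csv('data1.csv', skip = 1, nrows = 50, header = FALSE,
                  col.names = c('V1', 'V2', 'V3'))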
Reshaping Data
It’s helpful to be able to switch at will between data in wide format, where each row is a subject and each column contains a variable, and long format, where each row is a value at a given time or measure (for repeated measures).

Here’s a very simple data set. Each row is a subject’s scores. Column 1 is their subject number, followed by their scores in the control, treatment 1, and treatment 2 conditions. We tend to be most accustomed to seeing data this way. This is “wide” format.
traits <- data.frame('id'=seq(1, 10),
                     'control'=floor(rnorm(10, 30, 5)),
                     'treat1'=floor(rnorm(10, 10, 2)),
                     'treat2'=floor(rnorm(10, 15, 3)))
print(traits)

## id control treat1 treat2
## 1 1 36 8 21
## 2 2 33 11 9
## 3 3 27 9 13
## 4 4 26 6 14
## 5 5 23 9 8
## 6 6 33 12 13
## 7 7 30 9 19
## 8 8 29 13 11
## 9 9 22 12 11
## 10 10 23 10 16
Wide to Long
Now, when we convert this to “long” format, we will have the id column (the subject number), a variable column (in this case, which test was taken), and the value column (the score on each test). Here it is, melted two ways. In base:
traits_long_base <- reshape(traits, idvar="id", direction='long', v.names=c('score'),
                            timevar='test', times=c('control', 'treat1', 'treat2'),
                            varying=seq(2, 4))
print(traits_long_base)

## id test score
## 1.control 1 control 36
## 2.control 2 control 33
## 3.control 3 control 27
## 4.control 4 control 26
## 5.control 5 control 23
## 6.control 6 control 33
## 7.control 7 control 30
## 8.control 8 control 29
## 9.control 9 control 22
## 10.control 10 control 23
## 1.treat1 1 treat1 8
## 2.treat1 2 treat1 11
## 3.treat1 3 treat1 9
## 4.treat1 4 treat1 6
## 5.treat1 5 treat1 9
## 6.treat1 6 treat1 12
## 7.treat1 7 treat1 9
## 8.treat1 8 treat1 13
## 9.treat1 9 treat1 12
## 10.treat1 10 treat1 10
## 1.treat2 1 treat2 21
## 2.treat2 2 treat2 9
## 3.treat2 3 treat2 13
## 4.treat2 4 treat2 14
## 5.treat2 5 treat2 8
## 6.treat2 6 treat2 13
## 7.treat2 7 treat2 19
## 8.treat2 8 treat2 11
## 9.treat2 9 treat2 11
## 10.treat2 10 treat2 16
We have to be pretty careful about specifying our arguments here. The idvar indicates which column identifies our units of observation; here, we want the subject number, so that each subject’s score on each test is labeled with their unique ID. Direction is fairly self-explanatory; we’re going to long form here. v.names is the name (or names) of the new columns; here, we’re collapsing everybody’s scores into a single column, so we call it 'score'. timevar is the variable that changes over time, or over repeated measures; here it’s which test they took, so we call the new column 'test'. Then we tell it which values to use in this new times column with times; we want the name of the test. Finally, we tell it which columns of our data vary over time/are our repeated measures; here it’s the final three columns (you can also specify a vector of strings).

Here are the same results with the melt() function from data.table or reshape2. We specify again which column represents our data labels, and then we tell it which columns we want it to treat as our “measures,” which in our case is our three tests (if unspecified, it just uses all non-id variables, so we could have left it out here):
traits_long_m <- melt(traits, id.vars="id", measure.vars=c('control', 'treat1', 'treat2'),
                      variable.name='test', value.name='score')
print(traits_long_m)

## id test score
## 1 1 control 36
## 2 2 control 33
## 3 3 control 27
## 4 4 control 26
## 5 5 control 23
## 6 6 control 33
## 7 7 control 30
## 8 8 control 29
## 9 9 control 22
## 10 10 control 23
## 11 1 treat1 8
## 12 2 treat1 11
## 13 3 treat1 9
## 14 4 treat1 6
## 15 5 treat1 9
## 16 6 treat1 12
## 17 7 treat1 9
## 18 8 treat1 13
## 19 9 treat1 12
## 20 10 treat1 10
## 21 1 treat2 21
## 22 2 treat2 9
## 23 3 treat2 13
## 24 4 treat2 14
## 25 5 treat2 8
## 26 6 treat2 13
## 27 7 treat2 19
## 28 8 treat2 11
## 29 9 treat2 11
## 30 10 treat2 16
And let’s not leave out tidyr:
library(tidyr)
traits_long_t <- gather(traits, key=test, value=score, control, treat1, treat2)
print(traits_long_t)

## id test score
## 1 1 control 36
## 2 2 control 33
## 3 3 control 27
## 4 4 control 26
## 5 5 control 23
## 6 6 control 33
## 7 7 control 30
## 8 8 control 29
## 9 9 control 22
## 10 10 control 23
## 11 1 treat1 8
## 12 2 treat1 11
## 13 3 treat1 9
## 14 4 treat1 6
## 15 5 treat1 9
## 16 6 treat1 12
## 17 7 treat1 9
## 18 8 treat1 13
## 19 9 treat1 12
## 20 10 treat1 10
## 21 1 treat2 21
## 22 2 treat2 9
## 23 3 treat2 13
## 24 4 treat2 14
## 25 5 treat2 8
## 26 6 treat2 13
## 27 7 treat2 19
## 28 8 treat2 11
## 29 9 treat2 11
## 30 10 treat2 16
Here, the key/value pairing tells us about our outcome columns, and then we just list the columns to gather up. Three roads, same destination. I find melt and gather both much more intuitive than reshape, with gather the easiest of them all to use, but your mileage may vary. Data is really easy to plot this way:
library(ggplot2)
plot <- ggplot(traits_long_t, aes(x=test, y=score, color=test)) +
  geom_point()
print(plot)
Simplicity itself.
Long to Wide
Now, if we want to go the other direction (long to wide), in base R, we call the same function with different arguments:
traits_wide_base <- reshape(traits_long_base, direction='wide', timevar='test', idvar='id')
print(traits_wide_base)

## id score.control score.treat1 score.treat2
## 1.control 1 36 8 21
## 2.control 2 33 11 9
## 3.control 3 27 9 13
## 4.control 4 26 6 14
## 5.control 5 23 9 8
## 6.control 6 33 12 13
## 7.control 7 30 9 19
## 8.control 8 29 13 11
## 9.control 9 22 12 11
## 10.control 10 23 10 16
Now we have the original structure of our data back. The inverse of melt() is dcast:
traits_wide_m <- dcast(traits_long_m, id ~ test, value.var='score')
print(traits_wide_m)

## id control treat1 treat2
## 1 1 36 8 21
## 2 2 33 11 9
## 3 3 27 9 13
## 4 4 26 6 14
## 5 5 23 9 8
## 6 6 33 12 13
## 7 7 30 9 19
## 8 8 29 13 11
## 9 9 22 12 11
## 10 10 23 10 16
Right back to where we were. And to undo gather, we spread:
traits_wide_t <- spread(traits_long_t, test, score)
print(traits_wide_t)

## id control treat1 treat2
## 1 1 36 8 21
## 2 2 33 11 9
## 3 3 27 9 13
## 4 4 26 6 14
## 5 5 23 9 8
## 6 6 33 12 13
## 7 7 30 9 19
## 8 8 29 13 11
## 9 9 22 12 11
## 10 10 23 10 16
Reshaping with more variables
Here’s a more complex example.
traittest <- data.frame('traitA'=factor(rep(c('high', 'med', 'low'), each=4)),
                        'traitB'=factor(rep(c('positive', 'negative'), times=6)),
                        'test1'=floor(rnorm(12, 10, 2)),
                        'test2'=floor(rnorm(12, 15, 2)))
head(traittest)

## traitA traitB test1 test2
## 1 high positive 9 14
## 2 high negative 7 16
## 3 high positive 11 18
## 4 high negative 12 16
## 5 med positive 8 16
## 6 med negative 8 16
There are a lot of ways to melt this data. Maybe we want to collapse the tests into a single column; in this case, the traits are the identifier variables. In base:
tt_bytrait_base <- reshape(traittest, direction='long', v.names='score',
                           timevar='test', times=c('test1', 'test2'),
                           varying=c('test1','test2'))
print(tt_bytrait_base)

## traitA traitB test score id
## 1.test1 high positive test1 9 1
## 2.test1 high negative test1 7 2
## 3.test1 high positive test1 11 3
## 4.test1 high negative test1 12 4
## 5.test1 med positive test1 8 5
## 6.test1 med negative test1 8 6
## 7.test1 med positive test1 10 7
## 8.test1 med negative test1 9 8
## 9.test1 low positive test1 6 9
## 10.test1 low negative test1 11 10
## 11.test1 low positive test1 13 11
## 12.test1 low negative test1 9 12
## 1.test2 high positive test2 14 1
## 2.test2 high negative test2 16 2
## 3.test2 high positive test2 18 3
## 4.test2 high negative test2 16 4
## 5.test2 med positive test2 16 5
## 6.test2 med negative test2 16 6
## 7.test2 med positive test2 16 7
## 8.test2 med negative test2 11 8
## 9.test2 low positive test2 17 9
## 10.test2 low negative test2 13 10
## 11.test2 low positive test2 18 11
## 12.test2 low negative test2 12 12
With melt():
tt_bytrait_m <- melt(traittest, measure.vars=c('test1', 'test2'), variable.name='test',
                     value.name='score')
head(tt_bytrait_m)

## traitA traitB test score
## 1 high positive test1 9
## 2 high negative test1 7
## 3 high positive test1 11
## 4 high negative test1 12
## 5 med positive test1 8
## 6 med negative test1 8
With gather:
tt_bytrait_t <- gather(traittest, test, score, test1, test2)
head(tt_bytrait_t)

## traitA traitB test score
## 1 high positive test1 9
## 2 high negative test1 7
## 3 high positive test1 11
## 4 high negative test1 12
## 5 med positive test1 8
## 6 med negative test1 8
Or, we can let the tests be the identifiers, and collapse the traits into a single column. Base:
tt_bytest_base <- reshape(traittest, direction='long', v.names='rating',
                          timevar='trait', times=c('traitA', 'traitB'),
                          varying=c('traitA','traitB'))
print(tt_bytest_base)

## test1 test2 trait rating id
## 1.traitA 9 14 traitA high 1
## 2.traitA 7 16 traitA high 2
## 3.traitA 11 18 traitA high 3
## 4.traitA 12 16 traitA high 4
## 5.traitA 8 16 traitA med 5
## 6.traitA 8 16 traitA med 6
## 7.traitA 10 16 traitA med 7
## 8.traitA 9 11 traitA med 8
## 9.traitA 6 17 traitA low 9
## 10.traitA 11 13 traitA low 10
## 11.traitA 13 18 traitA low 11
## 12.traitA 9 12 traitA low 12
## 1.traitB 9 14 traitB positive 1
## 2.traitB 7 16 traitB negative 2
## 3.traitB 11 18 traitB positive 3
## 4.traitB 12 16 traitB negative 4
## 5.traitB 8 16 traitB positive 5
## 6.traitB 8 16 traitB negative 6
## 7.traitB 10 16 traitB positive 7
## 8.traitB 9 11 traitB negative 8
## 9.traitB 6 17 traitB positive 9
## 10.traitB 11 13 traitB negative 10
## 11.traitB 13 18 traitB positive 11
## 12.traitB 9 12 traitB negative 12
With melt():
tt_bytest_m <- melt(traittest, measure.vars=c('traitA', 'traitB'),
                    variable.name='trait', value.name='rating')

## Warning: attributes are not identical across measure variables; they will
## be dropped
head(tt_bytest_m)
## test1 test2 trait rating
## 1 9 14 traitA high
## 2 7 16 traitA high
## 3 11 18 traitA high
## 4 12 16 traitA high
## 5 8 16 traitA med
## 6 8 16 traitA med
(We can ignore the warning; it’s telling us that we’re combining two factors that don’t share levels, so it’s coercing them all to characters.) With gather:
tt_bytest_t <- gather(traittest, trait, rating, traitA, traitB)
## Warning: attributes are not identical across measure variables; they will
## be dropped
head(tt_bytest_t)
## test1 test2 trait rating
## 1 9 14 traitA high
## 2 7 16 traitA high
## 3 11 18 traitA high
## 4 12 16 traitA high
## 5 8 16 traitA med
## 6 8 16 traitA med
(Same warning as above.)
Reformatting Data
So we’ve read data in, and can flip it between long and wide at will. Great, but what if the data itself needs to be fixed?
Recoding values
Let’s say you have some data that look like this:
yesno <- data.frame('subj'=seq(1,10), 'resp'=rep(c('Y','N'), each=5))
print(yesno)

## subj resp
## 1 1 Y
## 2 2 Y
## 3 3 Y
## 4 4 Y
## 5 5 Y
## 6 6 N
## 7 7 N
## 8 8 N
## 9 9 N
## 10 10 N
So we have 10 subjects, and each one responded either yes (Y) or no (N) to… something. But maybe we don’t like the way this is coded; Y and N are hard to work with if we want to find average accuracy, for example. Maybe we want 1’s and 0’s instead, which make calculations easy.

If we want to recode these values, we have a few options. We can use indexing, of course, but there are also some functions that will save you some work. Base has the ifelse function, which performs a logical comparison and, if true, returns the first value; else, the second:
yesno$resp <- ifelse(yesno$resp == 'Y', 1, 0)
print(yesno)

## subj resp
## 1 1 1
## 2 2 1
## 3 3 1
## 4 4 1
## 5 5 1
## 6 6 0
## 7 7 0
## 8 8 0
## 9 9 0
## 10 10 0
If we have more than two alternatives, we’ll have to use something like a switch statement:
yesnomaybe <- data.frame('subj'=seq(1,15), 'resp'=rep(c('Y','N','M'), each=5))
print(yesnomaybe)

## subj resp
## 1 1 Y
## 2 2 Y
## 3 3 Y
## 4 4 Y
## 5 5 Y
## 6 6 N
## 7 7 N
## 8 8 N
## 9 9 N
## 10 10 N
## 11 11 M
## 12 12 M
## 13 13 M
## 14 14 M
## 15 15 M
Now we have three options. Maybe we want ‘yes’ to be 1, ‘no’ to be -1, and ‘maybe’ to be 0. Here’s how you can do it with a switch statement and sapply to call it on each element:
yesnomaybe$resp <- sapply(yesnomaybe$resp,
                          function(x) switch(as.character(x), 'Y'=1, 'N'=-1, 'M'=0))
print(yesnomaybe)

## subj resp
## 1 1 1
## 2 2 1
## 3 3 1
## 4 4 1
## 5 5 1
## 6 6 -1
## 7 7 -1
## 8 8 -1
## 9 9 -1
## 10 10 -1
## 11 11 0
## 12 12 0
## 13 13 0
## 14 14 0
## 15 15 0
In dplyr, we have the recode function:
yesnomaybe$dplyr_recode <- recode(yesnomaybe$resp, `1`='yes', `-1`='no', `0`='maybe')
print(yesnomaybe)

## subj resp dplyr_recode
## 1 1 1 yes
## 2 2 1 yes
## 3 3 1 yes
## 4 4 1 yes
## 5 5 1 yes
## 6 6 -1 no
## 7 7 -1 no
## 8 8 -1 no
## 9 9 -1 no
## 10 10 -1 no
## 11 11 0 maybe
## 12 12 0 maybe
## 13 13 0 maybe
## 14 14 0 maybe
## 15 15 0 maybe
Recoding, assuming you don’t have to do it for a huge number of possibilities, goes pretty fast.
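If you do have a lot of possibilities, one more trick worth knowing is a named lookup vector, where the names are the old values and the elements are the new ones (the from_lookup column here is just for illustration):

#Character indexing into a named vector acts as a lookup table
lookup <- c('yes' = 1, 'no' = -1, 'maybe' = 0)
yesnomaybe$from_lookup <- unname(lookup[yesnomaybe$dplyr_recode])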
Adding variables
Variables can be added to an existing data frame just with the $ operator:
df <- data.frame('x'=rnorm(20, 6), 'y'=rnorm(20))
print(df)

## x y
## 1 6.182055 0.34012358
## 2 7.436347 -0.59966661
## 3 6.315306 0.36833177
## 4 6.896922 1.50183433
## 5 6.941559 2.49651251
## 6 5.310434 -0.03238655
## 7 5.571108 -1.14902213
## 8 6.196074 0.84410316
## 9 6.070575 -0.98276613
## 10 7.307604 -0.41943093
## 11 6.639640 0.85786488
## 12 6.036730 0.59530225
## 13 5.319429 0.51228832
## 14 4.451193 -0.78561172
## 15 5.352494 0.87007150
## 16 8.215930 -0.67282494
## 17 5.485161 -0.17377787
## 18 6.126534 -0.03902882
## 19 7.063122 -0.57183073
## 20 5.599426 0.61108155
df$z <- rnorm(20, 10)
print(df)

## x y z
## 1 6.182055 0.34012358 10.298640
## 2 7.436347 -0.59966661 9.799344
## 3 6.315306 0.36833177 9.668752
## 4 6.896922 1.50183433 10.097066
## 5 6.941559 2.49651251 10.815749
## 6 5.310434 -0.03238655 9.597003
## 7 5.571108 -1.14902213 11.245294
## 8 6.196074 0.84410316 8.256783
## 9 6.070575 -0.98276613 10.099128
## 10 7.307604 -0.41943093 11.489885
## 11 6.639640 0.85786488 9.110691
## 12 6.036730 0.59530225 8.862010
## 13 5.319429 0.51228832 10.119818
## 14 4.451193 -0.78561172 10.767057
## 15 5.352494 0.87007150 9.525280
## 16 8.215930 -0.67282494 9.805977
## 17 5.485161 -0.17377787 10.306639
## 18 6.126534 -0.03902882 9.573121
## 19 7.063122 -0.57183073 9.948351
## 20 5.599426 0.61108155 9.301278
If you need to manipulate numeric columns, you can just add, multiply, etc. them together, and the operations happen elementwise:
df$total <- with(df, x + y + z)
head(df)

## x y z total
## 1 6.182055 0.34012358 10.298640 16.82082
## 2 7.436347 -0.59966661 9.799344 16.63602
## 3 6.315306 0.36833177 9.668752 16.35239
## 4 6.896922 1.50183433 10.097066 18.49582
## 5 6.941559 2.49651251 10.815749 20.25382
## 6 5.310434 -0.03238655 9.597003 14.87505
You also have a lot of options beyond the $ operator, both in base R and in the dplyr library. Base has transform:
df <- transform(df, x = -x)
head(df)

## x y z total
## 1 -6.182055 0.34012358 10.298640 16.82082
## 2 -7.436347 -0.59966661 9.799344 16.63602
## 3 -6.315306 0.36833177 9.668752 16.35239
## 4 -6.896922 1.50183433 10.097066 18.49582
## 5 -6.941559 2.49651251 10.815749 20.25382
## 6 -5.310434 -0.03238655 9.597003 14.87505
But now that we’ve updated a column, our total is wrong. Let’s fix it with dplyr’s mutate:
df <- mutate(df, corrected_total = x + y + z)
head(df)

## x y z total corrected_total
## 1 -6.182055 0.34012358 10.298640 16.82082 4.456709
## 2 -7.436347 -0.59966661 9.799344 16.63602 1.763331
## 3 -6.315306 0.36833177 9.668752 16.35239 3.721778
## 4 -6.896922 1.50183433 10.097066 18.49582 4.701978
## 5 -6.941559 2.49651251 10.815749 20.25382 6.370702
## 6 -5.310434 -0.03238655 9.597003 14.87505 4.254183
Maybe I now want a dataframe of just two things: whether the (floored) values in the x column are even, and the residuals from total and corrected total (for… reasons). transmute is like mutate, but it throws away all the other columns:
df_even <- transmute(df, x_ev=floor(x)%%2==0, residuals=total-corrected_total)
head(df_even)

## x_ev residuals
## 1 FALSE 12.36411
## 2 TRUE 14.87269
## 3 FALSE 12.63061
## 4 FALSE 13.79384
## 5 FALSE 13.88312
## 6 TRUE 10.62087
If none of these methods fit the bill, you can call apply along all the columns or rows of your data frame and write a custom function to do whatever processing you need.
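For example, a quick sketch of both directions:

#Row-wise (MARGIN=1): the largest of x, y, and z in each row
df$row_max <- apply(df[, c('x', 'y', 'z')], 1, max)
#Column-wise (MARGIN=2): center each column on its mean (returns a matrix)
centered <- apply(df[, c('x', 'y', 'z')], 2, function(col) col - mean(col))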
Factor levels as column labels
Let’s take the unfortunate case of levels-as-columns, in which all the levels of a factor are columns, and people get a 1 or a 0 for each level instead of their value. Here’s some example data:
levs <- data.frame('subj'=seq(1, 4), 'a'=c(0, 0, 1, 0), 'b'=c(1, 0, 0, 1), 'c'=c(0, 1, 0, 0))
print(levs)

## subj a b c
## 1 1 0 1 0
## 2 2 0 0 1
## 3 3 1 0 0
## 4 4 0 1 0
So, what we have are four subjects, and a factor with three possible levels: A, B, and C. What we want is the subject and the actual level of their factor, so we need a two-column data frame.

Here’s one way we might do that (there are others) that uses some procedures I’ve already shown. First, we’ll reshape the dataframe so that the factors end up in one column. This has the advantage of putting the actual values of the factors we want all in one place. Then we filter out the 0s, leaving behind only the levels the subject actually selected, drop the redundant ones column, then put the subjects back in the right order.

For these examples, I’ll print out each intermediate stage of manipulation so that you can see what’s happening. All about that base:
(lev_long <- reshape(levs, idvar='subj', direction='long', v.names='value', timevar='trait',
                     times=c('a', 'b', 'c'), varying=c('a', 'b', 'c')))

## subj trait value
## 1.a 1 a 0
## 2.a 2 a 0
## 3.a 3 a 1
## 4.a 4 a 0
## 1.b 1 b 1
## 2.b 2 b 0
## 3.b 3 b 0
## 4.b 4 b 1
## 1.c 1 c 0
## 2.c 2 c 1
## 3.c 3 c 0
## 4.c 4 c 0
First, we reshape the data. We need all of the factor-related pieces of information in a single column. We have a column with the possible factor levels, and a column indicating 0 (not the subject’s level) or 1 (the subject’s level).
(lev_filtered <- with(lev_long, lev_long[value == 1, 1:2]))
## subj trait
## 3.a 3 a
## 1.b 1 b
## 4.b 4 b
## 2.c 2 c
The second step just uses good old-fashioned indexing to keep all rows where the value is 1 (aka, the subject has that level), and to keep only the useful subject and trait columns. The with function tells R to perform all operations with the supplied data set, so we can reference columns by isolated names rather than using the verbose data_frame$column syntax.
(lev_reformed_base <- lev_filtered[order(lev_filtered$subj),])
## subj trait
## 1.b 1 b
## 2.c 2 c
## 3.a 3 a
## 4.b 4 b
The final step is reordering the data according to the subject column in ascending order. Now we’ve got our data in a much more sensible format.

Tidyr and dplyr make quick work of this. First, we gather:
(lev_g <- gather(levs, trait, value, a, b, c))
## subj trait value
## 1 1 a 0
## 2 2 a 0
## 3 3 a 1
## 4 4 a 0
## 5 1 b 1
## 6 2 b 0
## 7 3 b 0
## 8 4 b 1
## 9 1 c 0
## 10 2 c 1
## 11 3 c 0
## 12 4 c 0
Filter out the 0s:
(lev_f <- filter(lev_g, value != 0))
## subj trait value
## 1 3 a 1
## 2 1 b 1
## 3 4 b 1
## 4 2 c 1
Retain only the useful columns:
(lev_s <- select(lev_f, subj, trait))
## subj trait
## 1 3 a
## 2 1 b
## 3 4 b
## 4 2 c
Finally, put the subjects back in order:
(lev_reform <- arrange(lev_s, subj))
## subj trait
## 1 1 b
## 2 2 c
## 3 3 a
## 4 4 b
Here are those steps strung together with piping, which obviates the need for all those separate variable assignments:
levs_reformed <- gather(levs, trait, value, a, b, c) %>%
  filter(value != 0) %>%
  select(subj, trait) %>%
  arrange(subj)
print(levs_reformed)

## subj trait
## 1 1 b
## 2 2 c
## 3 3 a
## 4 4 b
Charming! What if we have multiple factors? Here we have a test and a report, each of which has three possible levels: ABC and XYZ, respectively.
mfac <- data.frame('subj'=seq(1, 4), 'test.A'=c(0, 1, 0, 1), 'test.B'=c(1, 0, 0, 0),
                   'test.C'=c(0, 0, 1, 0), 'report.X'=c(1, 0, 0, 0),
                   'report.Y'=c(0, 1, 1, 0), 'report.Z'=c(0, 0, 0, 1))
print(mfac)

## subj test.A test.B test.C report.X report.Y report.Z
## 1 1 0 1 0 1 0 0
## 2 2 1 0 0 0 1 0
## 3 3 0 0 1 0 1 0
## 4 4 1 0 0 0 0 1
So what we want is a dataframe with three columns: subject number, test, and report. Subject 1 picked test B and report X, subject 2 picked test A and report Y, and so on.

This gets a little more complicated. If we collapse everything into one column, we’re going to have to then spread it back out to separate the factors. We’ve also got the item label merged to its type, which is a problem if we only want the letter designation.

Let’s try with base. Here’s the reshape-filter method:
mfac_long <- reshape(mfac, idvar='subj', direction='long', v.names='value', timevar='measure',
                     times=colnames(mfac)[-1], varying=colnames(mfac)[-1])
mfac_filtered <- with(mfac_long, mfac_long[value == 1, 1:2])
type_splits <- do.call(rbind, strsplit(mfac_filtered$measure, '.', fixed=TRUE))
mfac_sep <- data.frame('subj'=mfac_filtered$subj,
                       'type'=type_splits[,1],
                       'version'=type_splits[,2])
mfac_wide <- reshape(mfac_sep, idvar='subj', direction='wide', timevar='type')
(mfac_reformed_base <- mfac_wide[order(mfac_wide$subj),])

## subj version.test version.report
## 3 1 B X
## 1 2 A Y
## 4 3 C Y
## 2 4 A Z
Pulling this off takes more finagling. Things are fine when we reshape and filter (note the trick used to save some verbiage in reshape(): indexing with a negative excludes that item, so we’re saying we want all column names except the first), but then we have to recover whether our factor was a test or a report separately from its type. This means we have to split the string using strsplit, bind the results into a matrix (because they automatically come out as a list), and then take those newly-made factors and reshape them wide again with the test type and report type as their own columns. One nice thing about this approach, in spite of its many steps, is that it’s totally blind to the content of the labels (provided they are consistently delimited). If they’re labeled in a cooperative way, you don’t need to know how many labels there are or what they say, and they can be in any order.

Here’s another base approach, from my BFF Kelly Chang. This one uses the apply function to sweep a filter down the dataframe, then repackage the results:
labels <- c('A', 'B', 'C', 'X', 'Y', 'Z')
filtered <- t(apply(mfac[,2:ncol(mfac)], 1, function(x) labels[x==1]))
mfac_kc <- data.frame(mfac$subj, filtered)
colnames(mfac_kc) <- c('subj', 'test', 'report')
print(mfac_kc)

## subj test report
## 1 1 B X
## 2 2 A Y
## 3 3 C Y
## 4 4 A Z
Here, you would supply the labels, rather than recovering them from the data itself (as was done in the previous approach). Order matters, though: the labels need to be in the same order as the corresponding columns for the filter to work.

With tidyr and dplyr, this approach can look something like this (still agnostic to the label content):
mfac_reformed <- gather(mfac, measure, value, -subj) %>%
  filter(value != 0) %>%
  select(subj, measure) %>%
  separate(measure, c('test', 'type')) %>%
  spread(test, type) %>%
  arrange(subj)
print(mfac_reformed)

## subj report test
## 1 1 X B
## 2 2 Y A
## 3 3 Y C
## 4 4 Z A
The first few steps are the same; melt everything down and toss the zero values. Then, we need a step to yank apart the measure’s letter designation and its type. Fortunately, tidyr has a handy separate function that does just this; it pulls apart the joined values into two columns that we can label right away. Then, we need to spread our now-distinct factor types back into columns, one for the test and one for the report, and sort by subject.

Note also that the intermediate steps in this last example, where we had to split the joined report.X-style labels and reshape the data, can be useful whenever you have data in that form, or if you have one big code for a condition or trial and at some point want to split it into its components. You can also use the colsplit() function from the reshape2 package for this purpose.
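Here’s a quick sketch of colsplit() on labels like ours:

library(reshape2)
#colsplit splits each string at the pattern and names the resulting columns
colsplit(c('test.A', 'report.X'), pattern = '\\.', names = c('type', 'version'))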
Parting Thoughts
And there you have it–a brief introduction to some common data manipulation tasks, and a few ways to handle them. This is only the thinnest of samples of methods. There are lots of different ways to accomplish things, and packages to help you do it. Many of these methods will undoubtedly have my fingerprints all over them; one of the reasons I approached these problems the way I did is to show how learning a skill in one context–reshaping data for plotting, for example–can be useful in other contexts, like changing a data frame’s fundamental structure. Many roads lead to the same place, and if you don’t like this one, another will get you there just as comfortably, if not more so.