Split columns in dataframe with NA
Clash Royale CLAN TAG#URR8PPP
up vote
11
down vote
favorite
I have a df
like this:
df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C'))
> df
FOO
1 A|B|C
2 A|B
3 B|C
4 A
5 C
And I would like to have an output like this:
> df
X1 X2 X3
1 A B C
2 A B
3 B C
4 A
5 C
So far I tried with this example: Split column at delimiter in data frame but it is not spliting the columns without repeating values, what I get there is:
df <- data.frame(do.call('rbind', strsplit(as.character(df$FOO),'|',fixed=TRUE)))
> df
X1 X2 X3
1 A B C
2 A B A
3 B C B
4 A A A
5 C C C
And I also get this warning:
Warning message: In rbind(c("A", "B", "C"), c("A", "B"), c("B", "C"),
"A", "C") : number of columns of result is not a multiple of vector
length (arg 2)
What can I do in those cases? Preferably with baseR
r dataframe split
 |Â
show 3 more comments
up vote
11
down vote
favorite
I have a df
like this:
df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C'))
> df
FOO
1 A|B|C
2 A|B
3 B|C
4 A
5 C
And I would like to have an output like this:
> df
X1 X2 X3
1 A B C
2 A B
3 B C
4 A
5 C
So far I tried with this example: Split column at delimiter in data frame but it is not spliting the columns without repeating values, what I get there is:
df <- data.frame(do.call('rbind', strsplit(as.character(df$FOO),'|',fixed=TRUE)))
> df
X1 X2 X3
1 A B C
2 A B A
3 B C B
4 A A A
5 C C C
And I also get this warning:
Warning message: In rbind(c("A", "B", "C"), c("A", "B"), c("B", "C"),
"A", "C") : number of columns of result is not a multiple of vector
length (arg 2)
What can I do in those cases? Preferably with baseR
r dataframe split
Would the dataframe have values A, B, C... always ? I mean how do we decide which column a particular value should go?
â Ronak Shah
Aug 16 at 8:08
No, values might change but names remain the same. For example:row1 = A|B|C
androw22 = B|D|F
.
â Biocrazy
Aug 16 at 8:12
So if you have only these two rows how many columns would you have? 6 (A, B, C, D, E, F) or 5 (A, B, C, D, F) ?
â Ronak Shah
Aug 16 at 8:14
At the begining (indf
) I have only one column with a changing number of rows. What it is inside the rows (A|B|C
) always changes. Sometimes the max length can be 7 (A|B|C|D|E|F|G
), others 1 (A
), others 2 (A|C
), and so on.
â Biocrazy
Aug 16 at 8:18
1
One last thing I want to tell you: When you look atdf
you see your column is of typefactor
. That's often a very unwanted thing. So you could add,stringsAsFactors=F
into the data.frame definition.
â Andre Elrico
Aug 16 at 8:54
 |Â
show 3 more comments
up vote
11
down vote
favorite
up vote
11
down vote
favorite
I have a df
like this:
df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C'))
> df
FOO
1 A|B|C
2 A|B
3 B|C
4 A
5 C
And I would like to have an output like this:
> df
X1 X2 X3
1 A B C
2 A B
3 B C
4 A
5 C
So far I tried with this example: Split column at delimiter in data frame but it is not spliting the columns without repeating values, what I get there is:
df <- data.frame(do.call('rbind', strsplit(as.character(df$FOO),'|',fixed=TRUE)))
> df
X1 X2 X3
1 A B C
2 A B A
3 B C B
4 A A A
5 C C C
And I also get this warning:
Warning message: In rbind(c("A", "B", "C"), c("A", "B"), c("B", "C"),
"A", "C") : number of columns of result is not a multiple of vector
length (arg 2)
What can I do in those cases? Preferably with baseR
r dataframe split
I have a df
like this:
df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C'))
> df
FOO
1 A|B|C
2 A|B
3 B|C
4 A
5 C
And I would like to have an output like this:
> df
X1 X2 X3
1 A B C
2 A B
3 B C
4 A
5 C
So far I tried with this example: Split column at delimiter in data frame but it is not spliting the columns without repeating values, what I get there is:
df <- data.frame(do.call('rbind', strsplit(as.character(df$FOO),'|',fixed=TRUE)))
> df
X1 X2 X3
1 A B C
2 A B A
3 B C B
4 A A A
5 C C C
And I also get this warning:
Warning message: In rbind(c("A", "B", "C"), c("A", "B"), c("B", "C"),
"A", "C") : number of columns of result is not a multiple of vector
length (arg 2)
What can I do in those cases? Preferably with baseR
r dataframe split
edited Aug 16 at 8:20
asked Aug 16 at 7:58
Biocrazy
14312
14312
Would the dataframe have values A, B, C... always ? I mean how do we decide which column a particular value should go?
â Ronak Shah
Aug 16 at 8:08
No, values might change but names remain the same. For example:row1 = A|B|C
androw22 = B|D|F
.
â Biocrazy
Aug 16 at 8:12
So if you have only these two rows how many columns would you have? 6 (A, B, C, D, E, F) or 5 (A, B, C, D, F) ?
â Ronak Shah
Aug 16 at 8:14
At the begining (indf
) I have only one column with a changing number of rows. What it is inside the rows (A|B|C
) always changes. Sometimes the max length can be 7 (A|B|C|D|E|F|G
), others 1 (A
), others 2 (A|C
), and so on.
â Biocrazy
Aug 16 at 8:18
1
One last thing I want to tell you: When you look atdf
you see your column is of typefactor
. That's often a very unwanted thing. So you could add,stringsAsFactors=F
into the data.frame definition.
â Andre Elrico
Aug 16 at 8:54
 |Â
show 3 more comments
Would the dataframe have values A, B, C... always ? I mean how do we decide which column a particular value should go?
â Ronak Shah
Aug 16 at 8:08
No, values might change but names remain the same. For example:row1 = A|B|C
androw22 = B|D|F
.
â Biocrazy
Aug 16 at 8:12
So if you have only these two rows how many columns would you have? 6 (A, B, C, D, E, F) or 5 (A, B, C, D, F) ?
â Ronak Shah
Aug 16 at 8:14
At the begining (indf
) I have only one column with a changing number of rows. What it is inside the rows (A|B|C
) always changes. Sometimes the max length can be 7 (A|B|C|D|E|F|G
), others 1 (A
), others 2 (A|C
), and so on.
â Biocrazy
Aug 16 at 8:18
1
One last thing I want to tell you: When you look atdf
you see your column is of typefactor
. That's often a very unwanted thing. So you could add,stringsAsFactors=F
into the data.frame definition.
â Andre Elrico
Aug 16 at 8:54
Would the dataframe have values A, B, C... always ? I mean how do we decide which column a particular value should go?
â Ronak Shah
Aug 16 at 8:08
Would the dataframe have values A, B, C... always ? I mean how do we decide which column a particular value should go?
â Ronak Shah
Aug 16 at 8:08
No, values might change but names remain the same. For example:
row1 = A|B|C
and row22 = B|D|F
.â Biocrazy
Aug 16 at 8:12
No, values might change but names remain the same. For example:
row1 = A|B|C
and row22 = B|D|F
.â Biocrazy
Aug 16 at 8:12
So if you have only these two rows how many columns would you have? 6 (A, B, C, D, E, F) or 5 (A, B, C, D, F) ?
â Ronak Shah
Aug 16 at 8:14
So if you have only these two rows how many columns would you have? 6 (A, B, C, D, E, F) or 5 (A, B, C, D, F) ?
â Ronak Shah
Aug 16 at 8:14
At the begining (in
df
) I have only one column with a changing number of rows. What it is inside the rows (A|B|C
) always changes. Sometimes the max length can be 7 (A|B|C|D|E|F|G
), others 1 (A
), others 2 (A|C
), and so on.â Biocrazy
Aug 16 at 8:18
At the begining (in
df
) I have only one column with a changing number of rows. What it is inside the rows (A|B|C
) always changes. Sometimes the max length can be 7 (A|B|C|D|E|F|G
), others 1 (A
), others 2 (A|C
), and so on.â Biocrazy
Aug 16 at 8:18
1
1
One last thing I want to tell you: When you look at
df
you see your column is of type factor
. That's often a very unwanted thing. So you could add ,stringsAsFactors=F
into the data.frame definition.â Andre Elrico
Aug 16 at 8:54
One last thing I want to tell you: When you look at
df
you see your column is of type factor
. That's often a very unwanted thing. So you could add ,stringsAsFactors=F
into the data.frame definition.â Andre Elrico
Aug 16 at 8:54
 |Â
show 3 more comments
5 Answers
5
active
oldest
votes
up vote
7
down vote
accepted
Simply do:
splt <- strsplit(as.character(df$FOO),"\|")
all_val <- sort(unique(unlist(splt)))
t(sapply(splt,function(x)all_val[!(all_val %in% x)]<-NA;all_val))
# [,1] [,2] [,3]
#[1,] "A" "B" "C"
#[2,] "A" "B" NA
#[3,] NA "B" "C"
#[4,] "A" NA NA
#[5,] NA NA "C"
data:
df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C'))
Please note:
My version is base::
(no libraries needed) and general:
It would also work with:
df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C', 'B|D|F'))
1
Short, general and usingbase::
... It works great! Thank you.
â Biocrazy
Aug 16 at 8:42
3
It was a well-posed question and an interesting problem!
â Andre Elrico
Aug 16 at 8:44
1
Please note the result is amatrix
if that is problematic for you, just wrapas.data.frame()
around it.
â Andre Elrico
Aug 16 at 8:47
add a comment |Â
up vote
5
down vote
Overlooked that OP asked for a base R
solution. Please try @AndreElrico's, @r.user.05apr's or @milan's solutions.
This can be done with cSplit_e
from the splitstackshape
package:
library(splitstackshape)
cSplit_e(
data = df,
split.col = "FOO",
sep = "|",
mode = "value",
type = "character",
fill = " ",
drop = TRUE
)
# FOO_A FOO_B FOO_C
#1 A B C
#2 A B
#3 B C
#4 A
#5 C
Does also work in case of the following df (see OP's comment above).
(df1 <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C', 'B|D|F')))
# FOO
#1 A|B|C
#2 A|B
#3 B|C
#4 A
#5 C
#6 B|D|F
cSplit_e(df1, "FOO", "|", "value", "character", TRUE, fill = " ")
# FOO_A FOO_B FOO_C FOO_D FOO_F
#1 A B C
#2 A B
#3 B C
#4 A
#5 C
#6 B D F
add a comment |Â
up vote
4
down vote
In base R:
df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C'))
dummy <- strsplit(as.character(df$FOO), "[|]")
want <- data.frame(values = unlist(dummy),
ids = rep(1:length(dummy), unlist(lapply(dummy, length))),
stringsAsFactors = FALSE)
library(reshape2)
want <- dcast(want, ids ~ values, value.var = "values", fill = " ")[, -1] # first col removed
names(want) <- paste0("X", seq_along(unique(unlist(dummy))))
want
# X1 X2 X3
#1 A B C
#2 A B
#3 B C
#4 A
#5 C
Thanks for your help, but why does this retrieves the values as rows? Is there a way to keep the columns of thedf
?
â Biocrazy
Aug 16 at 8:28
Sorry! I just saw you used a library to "reshape" the columns. I thought they were different examples :-P
â Biocrazy
Aug 16 at 8:30
add a comment |Â
up vote
2
down vote
Use unique
and strsplit
to find all unique values (A
, B
and C
in this case). Use grep
to search for the unique values, and return the values when there's a match or character(0)
otherwise. cbind
the resulting characters. Use apply
and ifelse
to replace character(0)
with NA
.
vals <- unique(unlist(sapply(a1, function(x) strsplit(x, '|', fixed = T))))
out <- NULL
for(i in vals)
out <- cbind(out, as.character((lapply(df$FOO, function(x) grep(x, i, value=T)))))
apply(out, 2, function(x) ifelse(x=="character(0)", NA, x))
[,1] [,2] [,3]
[1,] "A" "B" "C"
[2,] "A" "B" NA
[3,] NA "B" "C"
[4,] "A" NA NA
[5,] NA NA "C"
Let's say I have 80 letters (A
,B
,C
,D
...) then do I have to usegrep
for each one of the cases?
â Biocrazy
Aug 16 at 9:01
Changed it now so that you can easily apply it to many values.
â milan
Aug 16 at 9:06
add a comment |Â
up vote
0
down vote
You can try a tidyverse
as well
library(tidyverse)
df %>%
rownames_to_column() %>%
separate_rows(FOO, sep="[|]") %>%
mutate(L=factor(FOO, labels = paste0("X",1:length(unique(FOO))))) %>%
spread(L, FOO) %>%
select(-1)
X1 X2 X3
1 A B C
2 A B <NA>
3 <NA> B C
4 A <NA> <NA>
5 <NA> <NA> C
It is also generally working e.g. df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C', 'B|D|F'))
. In addition you can set the levels e.g. B>C>A by yourself using levels = c("B", "C", "A")
in the factor function in the mutate step.
add a comment |Â
5 Answers
5
active
oldest
votes
5 Answers
5
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
7
down vote
accepted
Simply do:
splt <- strsplit(as.character(df$FOO),"\|")
all_val <- sort(unique(unlist(splt)))
t(sapply(splt,function(x)all_val[!(all_val %in% x)]<-NA;all_val))
# [,1] [,2] [,3]
#[1,] "A" "B" "C"
#[2,] "A" "B" NA
#[3,] NA "B" "C"
#[4,] "A" NA NA
#[5,] NA NA "C"
data:
df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C'))
Please note:
My version is base::
(no libraries needed) and general:
It would also work with:
df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C', 'B|D|F'))
1
Short, general and usingbase::
... It works great! Thank you.
â Biocrazy
Aug 16 at 8:42
3
It was a well-posed question and an interesting problem!
â Andre Elrico
Aug 16 at 8:44
1
Please note the result is amatrix
if that is problematic for you, just wrapas.data.frame()
around it.
â Andre Elrico
Aug 16 at 8:47
add a comment |Â
up vote
7
down vote
accepted
Simply do:
splt <- strsplit(as.character(df$FOO),"\|")
all_val <- sort(unique(unlist(splt)))
t(sapply(splt,function(x)all_val[!(all_val %in% x)]<-NA;all_val))
# [,1] [,2] [,3]
#[1,] "A" "B" "C"
#[2,] "A" "B" NA
#[3,] NA "B" "C"
#[4,] "A" NA NA
#[5,] NA NA "C"
data:
df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C'))
Please note:
My version is base::
(no libraries needed) and general:
It would also work with:
df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C', 'B|D|F'))
1
Short, general and usingbase::
... It works great! Thank you.
â Biocrazy
Aug 16 at 8:42
3
It was a well-posed question and an interesting problem!
â Andre Elrico
Aug 16 at 8:44
1
Please note the result is amatrix
if that is problematic for you, just wrapas.data.frame()
around it.
â Andre Elrico
Aug 16 at 8:47
add a comment |Â
up vote
7
down vote
accepted
up vote
7
down vote
accepted
Simply do:
splt <- strsplit(as.character(df$FOO),"\|")
all_val <- sort(unique(unlist(splt)))
t(sapply(splt,function(x)all_val[!(all_val %in% x)]<-NA;all_val))
# [,1] [,2] [,3]
#[1,] "A" "B" "C"
#[2,] "A" "B" NA
#[3,] NA "B" "C"
#[4,] "A" NA NA
#[5,] NA NA "C"
data:
df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C'))
Please note:
My version is base::
(no libraries needed) and general:
It would also work with:
df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C', 'B|D|F'))
Simply do:
splt <- strsplit(as.character(df$FOO),"\|")
all_val <- sort(unique(unlist(splt)))
t(sapply(splt,function(x)all_val[!(all_val %in% x)]<-NA;all_val))
# [,1] [,2] [,3]
#[1,] "A" "B" "C"
#[2,] "A" "B" NA
#[3,] NA "B" "C"
#[4,] "A" NA NA
#[5,] NA NA "C"
data:
df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C'))
Please note:
My version is base::
(no libraries needed) and general:
It would also work with:
df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C', 'B|D|F'))
edited Aug 16 at 8:38
answered Aug 16 at 8:32
Andre Elrico
2,758723
2,758723
1
Short, general and usingbase::
... It works great! Thank you.
â Biocrazy
Aug 16 at 8:42
3
It was a well-posed question and an interesting problem!
â Andre Elrico
Aug 16 at 8:44
1
Please note the result is amatrix
if that is problematic for you, just wrapas.data.frame()
around it.
â Andre Elrico
Aug 16 at 8:47
add a comment |Â
1
Short, general and usingbase::
... It works great! Thank you.
â Biocrazy
Aug 16 at 8:42
3
It was a well-posed question and an interesting problem!
â Andre Elrico
Aug 16 at 8:44
1
Please note the result is amatrix
if that is problematic for you, just wrapas.data.frame()
around it.
â Andre Elrico
Aug 16 at 8:47
1
1
Short, general and using
base::
... It works great! Thank you.â Biocrazy
Aug 16 at 8:42
Short, general and using
base::
... It works great! Thank you.â Biocrazy
Aug 16 at 8:42
3
3
It was a well-posed question and an interesting problem!
â Andre Elrico
Aug 16 at 8:44
It was a well-posed question and an interesting problem!
â Andre Elrico
Aug 16 at 8:44
1
1
Please note the result is a
matrix
if that is problematic for you, just wrap as.data.frame()
around it.â Andre Elrico
Aug 16 at 8:47
Please note the result is a
matrix
if that is problematic for you, just wrap as.data.frame()
around it.â Andre Elrico
Aug 16 at 8:47
add a comment |Â
up vote
5
down vote
Overlooked that OP asked for a base R
solution. Please try @AndreElrico's, @r.user.05apr's or @milan's solutions.
This can be done with cSplit_e
from the splitstackshape
package:
library(splitstackshape)
cSplit_e(
data = df,
split.col = "FOO",
sep = "|",
mode = "value",
type = "character",
fill = " ",
drop = TRUE
)
# FOO_A FOO_B FOO_C
#1 A B C
#2 A B
#3 B C
#4 A
#5 C
Does also work in case of the following df (see OP's comment above).
(df1 <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C', 'B|D|F')))
# FOO
#1 A|B|C
#2 A|B
#3 B|C
#4 A
#5 C
#6 B|D|F
cSplit_e(df1, "FOO", "|", "value", "character", TRUE, fill = " ")
# FOO_A FOO_B FOO_C FOO_D FOO_F
#1 A B C
#2 A B
#3 B C
#4 A
#5 C
#6 B D F
add a comment |Â
up vote
5
down vote
Overlooked that OP asked for a base R
solution. Please try @AndreElrico's, @r.user.05apr's or @milan's solutions.
This can be done with cSplit_e
from the splitstackshape
package:
library(splitstackshape)
cSplit_e(
data = df,
split.col = "FOO",
sep = "|",
mode = "value",
type = "character",
fill = " ",
drop = TRUE
)
# FOO_A FOO_B FOO_C
#1 A B C
#2 A B
#3 B C
#4 A
#5 C
Does also work in case of the following df (see OP's comment above).
(df1 <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C', 'B|D|F')))
# FOO
#1 A|B|C
#2 A|B
#3 B|C
#4 A
#5 C
#6 B|D|F
cSplit_e(df1, "FOO", "|", "value", "character", TRUE, fill = " ")
# FOO_A FOO_B FOO_C FOO_D FOO_F
#1 A B C
#2 A B
#3 B C
#4 A
#5 C
#6 B D F
add a comment |Â
up vote
5
down vote
up vote
5
down vote
Overlooked that OP asked for a base R
solution. Please try @AndreElrico's, @r.user.05apr's or @milan's solutions.
This can be done with cSplit_e
from the splitstackshape
package:
library(splitstackshape)
cSplit_e(
data = df,
split.col = "FOO",
sep = "|",
mode = "value",
type = "character",
fill = " ",
drop = TRUE
)
# FOO_A FOO_B FOO_C
#1 A B C
#2 A B
#3 B C
#4 A
#5 C
Does also work in case of the following df (see OP's comment above).
(df1 <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C', 'B|D|F')))
# FOO
#1 A|B|C
#2 A|B
#3 B|C
#4 A
#5 C
#6 B|D|F
cSplit_e(df1, "FOO", "|", "value", "character", TRUE, fill = " ")
# FOO_A FOO_B FOO_C FOO_D FOO_F
#1 A B C
#2 A B
#3 B C
#4 A
#5 C
#6 B D F
Overlooked that OP asked for a base R
solution. Please try @AndreElrico's, @r.user.05apr's or @milan's solutions.
This can be done with cSplit_e
from the splitstackshape
package:
library(splitstackshape)
cSplit_e(
data = df,
split.col = "FOO",
sep = "|",
mode = "value",
type = "character",
fill = " ",
drop = TRUE
)
# FOO_A FOO_B FOO_C
#1 A B C
#2 A B
#3 B C
#4 A
#5 C
Does also work in case of the following df (see OP's comment above).
(df1 <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C', 'B|D|F')))
# FOO
#1 A|B|C
#2 A|B
#3 B|C
#4 A
#5 C
#6 B|D|F
cSplit_e(df1, "FOO", "|", "value", "character", TRUE, fill = " ")
# FOO_A FOO_B FOO_C FOO_D FOO_F
#1 A B C
#2 A B
#3 B C
#4 A
#5 C
#6 B D F
edited Aug 16 at 17:46
answered Aug 16 at 8:22
markus
4,971626
4,971626
add a comment |Â
add a comment |Â
up vote
4
down vote
In base R:
df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C'))
dummy <- strsplit(as.character(df$FOO), "[|]")
want <- data.frame(values = unlist(dummy),
ids = rep(1:length(dummy), unlist(lapply(dummy, length))),
stringsAsFactors = FALSE)
library(reshape2)
want <- dcast(want, ids ~ values, value.var = "values", fill = " ")[, -1] # first col removed
names(want) <- paste0("X", seq_along(unique(unlist(dummy))))
want
# X1 X2 X3
#1 A B C
#2 A B
#3 B C
#4 A
#5 C
Thanks for your help, but why does this retrieves the values as rows? Is there a way to keep the columns of thedf
?
â Biocrazy
Aug 16 at 8:28
Sorry! I just saw you used a library to "reshape" the columns. I thought they were different examples :-P
â Biocrazy
Aug 16 at 8:30
add a comment |Â
up vote
4
down vote
In base R:
df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C'))
dummy <- strsplit(as.character(df$FOO), "[|]")
want <- data.frame(values = unlist(dummy),
ids = rep(1:length(dummy), unlist(lapply(dummy, length))),
stringsAsFactors = FALSE)
library(reshape2)
want <- dcast(want, ids ~ values, value.var = "values", fill = " ")[, -1] # first col removed
names(want) <- paste0("X", seq_along(unique(unlist(dummy))))
want
# X1 X2 X3
#1 A B C
#2 A B
#3 B C
#4 A
#5 C
Thanks for your help, but why does this retrieves the values as rows? Is there a way to keep the columns of thedf
?
â Biocrazy
Aug 16 at 8:28
Sorry! I just saw you used a library to "reshape" the columns. I thought they were different examples :-P
â Biocrazy
Aug 16 at 8:30
add a comment |Â
up vote
4
down vote
up vote
4
down vote
In base R:
df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C'))
dummy <- strsplit(as.character(df$FOO), "[|]")
want <- data.frame(values = unlist(dummy),
ids = rep(1:length(dummy), unlist(lapply(dummy, length))),
stringsAsFactors = FALSE)
library(reshape2)
want <- dcast(want, ids ~ values, value.var = "values", fill = " ")[, -1] # first col removed
names(want) <- paste0("X", seq_along(unique(unlist(dummy))))
want
# X1 X2 X3
#1 A B C
#2 A B
#3 B C
#4 A
#5 C
In base R:
df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C'))
dummy <- strsplit(as.character(df$FOO), "[|]")
want <- data.frame(values = unlist(dummy),
ids = rep(1:length(dummy), unlist(lapply(dummy, length))),
stringsAsFactors = FALSE)
library(reshape2)
want <- dcast(want, ids ~ values, value.var = "values", fill = " ")[, -1] # first col removed
names(want) <- paste0("X", seq_along(unique(unlist(dummy))))
want
# X1 X2 X3
#1 A B C
#2 A B
#3 B C
#4 A
#5 C
edited Aug 16 at 12:34
answered Aug 16 at 8:24
r.user.05apr
1,5231619
1,5231619
Thanks for your help, but why does this retrieves the values as rows? Is there a way to keep the columns of thedf
?
â Biocrazy
Aug 16 at 8:28
Sorry! I just saw you used a library to "reshape" the columns. I thought they were different examples :-P
â Biocrazy
Aug 16 at 8:30
add a comment |Â
Thanks for your help, but why does this retrieves the values as rows? Is there a way to keep the columns of thedf
?
â Biocrazy
Aug 16 at 8:28
Sorry! I just saw you used a library to "reshape" the columns. I thought they were different examples :-P
â Biocrazy
Aug 16 at 8:30
Thanks for your help, but why does this retrieves the values as rows? Is there a way to keep the columns of the
df
?â Biocrazy
Aug 16 at 8:28
Thanks for your help, but why does this retrieves the values as rows? Is there a way to keep the columns of the
df
?â Biocrazy
Aug 16 at 8:28
Sorry! I just saw you used a library to "reshape" the columns. I thought they were different examples :-P
â Biocrazy
Aug 16 at 8:30
Sorry! I just saw you used a library to "reshape" the columns. I thought they were different examples :-P
â Biocrazy
Aug 16 at 8:30
add a comment |Â
up vote
2
down vote
Use unique
and strsplit
to find all unique values (A
, B
and C
in this case). Use grep
to search for the unique values, and return the values when there's a match or character(0)
otherwise. cbind
the resulting characters. Use apply
and ifelse
to replace character(0)
with NA
.
vals <- unique(unlist(sapply(a1, function(x) strsplit(x, '|', fixed = T))))
out <- NULL
for(i in vals)
out <- cbind(out, as.character((lapply(df$FOO, function(x) grep(x, i, value=T)))))
apply(out, 2, function(x) ifelse(x=="character(0)", NA, x))
[,1] [,2] [,3]
[1,] "A" "B" "C"
[2,] "A" "B" NA
[3,] NA "B" "C"
[4,] "A" NA NA
[5,] NA NA "C"
Let's say I have 80 letters (A
,B
,C
,D
...) then do I have to usegrep
for each one of the cases?
â Biocrazy
Aug 16 at 9:01
Changed it now so that you can easily apply it to many values.
â milan
Aug 16 at 9:06
add a comment |Â
up vote
2
down vote
Use unique
and strsplit
to find all unique values (A
, B
and C
in this case). Use grep
to search for the unique values, and return the values when there's a match or character(0)
otherwise. cbind
the resulting characters. Use apply
and ifelse
to replace character(0)
with NA
.
vals <- unique(unlist(sapply(a1, function(x) strsplit(x, '|', fixed = T))))
out <- NULL
for(i in vals)
out <- cbind(out, as.character((lapply(df$FOO, function(x) grep(x, i, value=T)))))
apply(out, 2, function(x) ifelse(x=="character(0)", NA, x))
[,1] [,2] [,3]
[1,] "A" "B" "C"
[2,] "A" "B" NA
[3,] NA "B" "C"
[4,] "A" NA NA
[5,] NA NA "C"
Let's say I have 80 letters (A
,B
,C
,D
...) then do I have to usegrep
for each one of the cases?
â Biocrazy
Aug 16 at 9:01
Changed it now so that you can easily apply it to many values.
â milan
Aug 16 at 9:06
add a comment |Â
up vote
2
down vote
up vote
2
down vote
Use unique
and strsplit
to find all unique values (A
, B
and C
in this case). Use grep
to search for the unique values, and return the values when there's a match or character(0)
otherwise. cbind
the resulting characters. Use apply
and ifelse
to replace character(0)
with NA
.
vals <- unique(unlist(sapply(a1, function(x) strsplit(x, '|', fixed = T))))
out <- NULL
for(i in vals)
out <- cbind(out, as.character((lapply(df$FOO, function(x) grep(x, i, value=T)))))
apply(out, 2, function(x) ifelse(x=="character(0)", NA, x))
[,1] [,2] [,3]
[1,] "A" "B" "C"
[2,] "A" "B" NA
[3,] NA "B" "C"
[4,] "A" NA NA
[5,] NA NA "C"
Use unique
and strsplit
to find all unique values (A
, B
and C
in this case). Use grep
to search for the unique values, and return the values when there's a match or character(0)
otherwise. cbind
the resulting characters. Use apply
and ifelse
to replace character(0)
with NA
.
vals <- unique(unlist(sapply(a1, function(x) strsplit(x, '|', fixed = T))))
out <- NULL
for(i in vals)
out <- cbind(out, as.character((lapply(df$FOO, function(x) grep(x, i, value=T)))))
apply(out, 2, function(x) ifelse(x=="character(0)", NA, x))
[,1] [,2] [,3]
[1,] "A" "B" "C"
[2,] "A" "B" NA
[3,] NA "B" "C"
[4,] "A" NA NA
[5,] NA NA "C"
edited Aug 17 at 5:59
answered Aug 16 at 8:38
milan
2,1441721
2,1441721
Let's say I have 80 letters (A
,B
,C
,D
...) then do I have to usegrep
for each one of the cases?
â Biocrazy
Aug 16 at 9:01
Changed it now so that you can easily apply it to many values.
â milan
Aug 16 at 9:06
add a comment |Â
Let's say I have 80 letters (A
,B
,C
,D
...) then do I have to usegrep
for each one of the cases?
â Biocrazy
Aug 16 at 9:01
Changed it now so that you can easily apply it to many values.
â milan
Aug 16 at 9:06
Let's say I have 80 letters (
A
, B
, C
, D
...) then do I have to use grep
for each one of the cases?â Biocrazy
Aug 16 at 9:01
Let's say I have 80 letters (
A
, B
, C
, D
...) then do I have to use grep
for each one of the cases?â Biocrazy
Aug 16 at 9:01
Changed it now so that you can easily apply it to many values.
â milan
Aug 16 at 9:06
Changed it now so that you can easily apply it to many values.
â milan
Aug 16 at 9:06
add a comment |Â
up vote
0
down vote
You can try a tidyverse
as well
library(tidyverse)
df %>%
rownames_to_column() %>%
separate_rows(FOO, sep="[|]") %>%
mutate(L=factor(FOO, labels = paste0("X",1:length(unique(FOO))))) %>%
spread(L, FOO) %>%
select(-1)
X1 X2 X3
1 A B C
2 A B <NA>
3 <NA> B C
4 A <NA> <NA>
5 <NA> <NA> C
It is also generally working e.g. df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C', 'B|D|F'))
. In addition you can set the levels e.g. B>C>A by yourself using levels = c("B", "C", "A")
in the factor function in the mutate step.
add a comment |Â
up vote
0
down vote
You can try a tidyverse
as well
library(tidyverse)
df %>%
rownames_to_column() %>%
separate_rows(FOO, sep="[|]") %>%
mutate(L=factor(FOO, labels = paste0("X",1:length(unique(FOO))))) %>%
spread(L, FOO) %>%
select(-1)
X1 X2 X3
1 A B C
2 A B <NA>
3 <NA> B C
4 A <NA> <NA>
5 <NA> <NA> C
It is also generally working e.g. df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C', 'B|D|F'))
. In addition you can set the levels e.g. B>C>A by yourself using levels = c("B", "C", "A")
in the factor function in the mutate step.
add a comment |Â
up vote
0
down vote
up vote
0
down vote
You can try a tidyverse
as well
library(tidyverse)
df %>%
rownames_to_column() %>%
separate_rows(FOO, sep="[|]") %>%
mutate(L=factor(FOO, labels = paste0("X",1:length(unique(FOO))))) %>%
spread(L, FOO) %>%
select(-1)
X1 X2 X3
1 A B C
2 A B <NA>
3 <NA> B C
4 A <NA> <NA>
5 <NA> <NA> C
It is also generally working e.g. df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C', 'B|D|F'))
. In addition you can set the levels e.g. B>C>A by yourself using levels = c("B", "C", "A")
in the factor function in the mutate step.
You can try a tidyverse
as well
library(tidyverse)
df %>%
rownames_to_column() %>%
separate_rows(FOO, sep="[|]") %>%
mutate(L=factor(FOO, labels = paste0("X",1:length(unique(FOO))))) %>%
spread(L, FOO) %>%
select(-1)
X1 X2 X3
1 A B C
2 A B <NA>
3 <NA> B C
4 A <NA> <NA>
5 <NA> <NA> C
It is also generally working e.g. df <- data.frame(FOO = c('A|B|C', 'A|B', 'B|C', 'A', 'C', 'B|D|F'))
. In addition you can set the levels e.g. B>C>A by yourself using levels = c("B", "C", "A")
in the factor function in the mutate step.
edited Aug 16 at 10:18
answered Aug 16 at 9:59
Jimbou
8,58311028
8,58311028
add a comment |Â
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f51872267%2fsplit-columns-in-dataframe-with-na%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Would the dataframe have values A, B, C... always ? I mean how do we decide which column a particular value should go?
â Ronak Shah
Aug 16 at 8:08
No, values might change but names remain the same. For example:
row1 = A|B|C
androw22 = B|D|F
.â Biocrazy
Aug 16 at 8:12
So if you have only these two rows how many columns would you have? 6 (A, B, C, D, E, F) or 5 (A, B, C, D, F) ?
â Ronak Shah
Aug 16 at 8:14
At the begining (in
df
) I have only one column with a changing number of rows. What it is inside the rows (A|B|C
) always changes. Sometimes the max length can be 7 (A|B|C|D|E|F|G
), others 1 (A
), others 2 (A|C
), and so on.â Biocrazy
Aug 16 at 8:18
1
One last thing I want to tell you: When you look at
df
you see your column is of typefactor
. That's often a very unwanted thing. So you could add,stringsAsFactors=F
into the data.frame definition.â Andre Elrico
Aug 16 at 8:54