Extract substring in R using grepl
I have a table with a string column formatted like this:
abcdWorkstart.csv
abcdWorkcomplete.csv
I would like to extract the last word in each filename. I think the pattern should begin with the word "Work" and end with ".csv". I wrote something using grepl, but it is not working:
grepl("Work*.csv", data$filename)
Basically, I want to extract whatever is between Work and .csv.
Desired outcome:
start
complete
Tags: r string dataframe substring
asked Aug 28 at 14:57 by ajax2000; edited Aug 28 at 15:18 by Andre Elrico
Please have a look at my edit, @ajax2000. It is always good practice to add the desired outcome to your question. This makes everything much easier, and people know exactly what you want. I encourage you to do this in your next question ;-). - Andre Elrico, Aug 28 at 15:20
4 Answers
Accepted answer (score 5)
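As an alternative, remove everything you don't want:
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
gsub("^.*Work|\\.csv$", "", x)
#[1] "start" "complete"
Please note: I have to use gsub rather than sub, because the pattern matches twice in each string. It first removes ^.*Work and then \.csv$.
Using [\s\S] or [\d\D] as an "any character" class does not work with sub/gsub (see https://regex101.com/r/wFgkgG/1), but it works with akrun's approach:
regmatches(v1, regexpr("(?<=Work)[\\s\\S]+(?=[.]csv)", v1, perl = TRUE))
The difference between the engines shows up with a string that contains newlines:
str1 <- "12\n.2\n12"
gsub("[^.]", "m", str1, perl = TRUE)   # [^.] also matches the newlines
gsub(".", "m", str1, perl = TRUE)      # PCRE: . does not match \n
gsub(".", "m", str1, perl = FALSE)     # default engine: . also matches \n
That is, . also matches \n when using R's default regex engine.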
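Since the question refers to a data frame column (data$filename), the same gsub call can be applied to that column directly. This is only a minimal sketch: the data frame below is a made-up stand-in for the asker's table, and the column name stage is an assumption chosen for illustration.

```r
# Hypothetical data frame mirroring the question's data$filename column.
data <- data.frame(
  filename = c("abcdWorkstart.csv", "abcdWorkcomplete.csv"),
  stringsAsFactors = FALSE
)

# Remove everything up to and including "Work", and the trailing ".csv".
# "stage" is an illustrative column name, not one from the question.
data$stage <- gsub("^.*Work|\\.csv$", "", data$filename)

data$stage
```

Because gsub is vectorised, no loop over the rows is needed.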
answered Aug 28 at 15:07 by Andre Elrico; edited Aug 28 at 15:57
Pretty much all the solutions work, but I think this one is more concise. Thanks. - ajax2000, Sep 5 at 13:04
Answer (score 9)
I think you need sub or gsub (substitute/extract) instead of grepl (which only tests whether a match exists). Note that when the pattern is not found, sub returns the entire string unmodified:
fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
out <- sub(".*Work(.*)\\.csv$", "\\1", fn)
out
# [1] "start" "complete" "abcdNothing.csv"
You can work around this by filtering out the unchanged ones:
out[ out != fn ]
# [1] "start" "complete"
Or by marking them as invalid with NA (or something else):
out[ out == fn ] <- NA
out
# [1] "start" "complete" NA
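To illustrate why the attempt in the question cannot work: grepl only returns TRUE or FALSE, and in "Work*.csv" the * applies to the k while the . matches any single character, so the pattern does not match these filenames at all. The sketch below uses grepl as a test and sub for the extraction; the names pat, matched and res are illustrative and not part of the question.

```r
fn <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv", "abcdNothing.csv")

# The pattern from the question: "Wor", zero or more "k", one arbitrary
# character, then "csv". None of the filenames has that shape.
grepl("Work*.csv", fn)

# A corrected pattern: ".*" for "anything", "\\." for a literal dot.
pat <- ".*Work(.*)\\.csv$"
matched <- grepl(pat, fn)

# Extract the capture group only where the pattern matched, NA elsewhere.
res <- ifelse(matched, sub(pat, "\\1", fn), NA_character_)
res
```

Testing with grepl first avoids comparing the output against the input to detect non-matches.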
answered Aug 28 at 15:02 by r2evans
Answer (score 7)
With str_extract from stringr. This uses positive lookarounds to match any character one or more times (.+) between "Work" and ".csv":
library(stringr)
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
str_extract(x, "(?<=Work).+(?=\\.csv)")
# [1] "start" "complete"
answered Aug 28 at 15:07 by avid_useR; edited Aug 28 at 15:13
Answer (score 4)
Here is an option using regmatches/regexpr from base R. A regex lookaround matches one or more characters that are not a . after the string 'Work', and regmatches extracts the match:
regmatches(v1, regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE))
#[1] "start" "complete"
Data:
v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
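Note that regmatches drops elements without a match, so the result above has two elements although v1 has three. If the result has to stay aligned with the input (for example, to store it as a data frame column), one possible approach is to fill a vector of NAs. This is a sketch only; the names m and out are illustrative.

```r
v1 <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv", "abcdNothing.csv")

# regexpr() returns -1 for elements without a match.
m <- regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE)

# Start from all-NA and fill in only the positions that matched.
out <- rep(NA_character_, length(v1))
out[m != -1] <- regmatches(v1, m)
out
```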
answered Aug 28 at 15:11 by akrun; edited Aug 28 at 15:22
To be more precise, you can use "(?<=Work).*(?=.csv)". - r2evans, Aug 28 at 15:18
@avid_useR But I am using regmatches/regexpr. - akrun, Aug 28 at 15:24
@avid_useR Okay, that is right. - akrun, Aug 28 at 15:25
@AndreElrico, doesn't [\s\S] match any character? Isn't it more concise to use . ? - r2evans, Aug 28 at 15:32
@r2evans I use both [.] and \. , though I find the former easier to type. - akrun, Aug 28 at 15:35