Extract substring in R using grepl

8 votes

I have a table with a string column formatted like this:

abcdWorkstart.csv
abcdWorkcomplete.csv

I would like to extract the last word in each filename. I think the beginning of the pattern would be the word "Work" and the end would be ".csv". I wrote something using grepl, but it is not working:

grepl("Work*.csv", data$filename)

Basically, I want to extract whatever is between Work and .csv.

desired outcome:

start
complete

edited Aug 28 at 15:18 by Andre Elrico

asked Aug 28 at 14:57 by ajax2000

Please have a look at my edit, @ajax2000. It's always good practice to add the desired outcome to your question. This makes everything so much easier, and people know exactly what you want. I encourage you to do this in your next question ;-).
– Andre Elrico
Aug 28 at 15:20

4 Answers

5 votes, accepted

Just as an alternative, remove everything you don't want.

x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")

gsub("^.*Work|\\.csv$", "", x)
#[1] "start" "complete"

Please note:
I have to use gsub rather than sub, because the pattern has two alternatives and both must be removed: first ^.*Work, then \.csv$.

For [\s\S] or [\d\D] (these do not work with sub/gsub by default):

https://regex101.com/r/wFgkgG/1

They work with akrun's approach:

regmatches(v1, regexpr("(?<=Work)[\\s\\S]+(?=[.]csv)", v1, perl = TRUE))

str1<-
'12
.2
12'

gsub("[^.]","m",str1,perl=TRUE)
gsub(".","m",str1,perl=TRUE)
gsub(".","m",str1,perl=FALSE)

. also matches \n when using R's default regex engine (perl = FALSE).
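The remark about . and newlines can be checked with a small sketch. The input string s and the variable names are made up for illustration; the (?s) flag is standard PCRE syntax for letting . match newlines, and it is not part of the original answer.

```r
# Hypothetical two-line input: "a", a newline, then "b"
s <- "a\nb"

# Default engine (perl = FALSE): . also matches the newline
default_engine <- gsub(".", "m", s)

# PCRE (perl = TRUE): . skips the newline unless told otherwise
perl_engine <- gsub(".", "m", s, perl = TRUE)

# PCRE with the (?s) "dotall" flag: . matches the newline again
perl_dotall <- gsub("(?s).", "m", s, perl = TRUE)

print(c(default_engine, perl_engine, perl_dotall))
```

With multi-line strings, the same pattern can therefore behave differently depending on the perl argument, which is worth keeping in mind when moving a pattern between sub/gsub and regexpr(..., perl = TRUE).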

edited Aug 28 at 15:57

answered Aug 28 at 15:07 by Andre Elrico

Pretty much all the solutions work, but I think this is more concise. Thanks.
– ajax2000
Sep 5 at 13:04

9 votes

I think you need sub or gsub (substitute/extract) instead of grepl, which only reports whether a match exists. Note that when the pattern is not found, sub returns the entire string unmodified:

fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
out <- sub(".*Work(.*)\\.csv$", "\\1", fn)
out
# [1] "start" "complete" "abcdNothing.csv"

You can work around this by filtering out the unchanged ones:

out[ out != fn ]
# [1] "start" "complete"

Or marking them invalid with NA (or something else):

out[ out == fn ] <- NA
out
# [1] "start" "complete" NA
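As a possible variation on the NA marking above, grepl can be used as a guard so that sub is only trusted where the pattern actually matched. This is a minimal sketch, not part of the original answer; the names pattern and hit are made up for illustration.

```r
fn <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv", "abcdNothing.csv")

# Same pattern as above, stored once so both calls agree
pattern <- ".*Work(.*)\\.csv$"

# grepl answers "does it match?"; sub does the extraction
hit <- grepl(pattern, fn)
out <- ifelse(hit, sub(pattern, "\\1", fn), NA_character_)

print(out)
```

This avoids comparing the output against the input, which could misfire if an extracted value ever happened to equal the original string.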

answered Aug 28 at 15:02 by r2evans

7 votes

With str_extract from stringr. This uses a positive lookbehind and a positive lookahead to match one or more characters (.+) between "Work" and ".csv":

x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")

library(stringr)
str_extract(x, "(?<=Work).+(?=\\.csv)")
# [1] "start" "complete"

edited Aug 28 at 15:13

answered Aug 28 at 15:07 by avid_useR

4 votes

Here is an option using regmatches/regexpr from base R. A lookbehind anchors the match right after 'Work', [^.]+ matches one or more characters that are not a ., and a lookahead requires .csv to follow; regmatches then extracts the matched text:

regmatches(v1, regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE))
#[1] "start" "complete"

data

v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
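Note that regmatches drops the elements that did not match, so the result above is shorter than v1. If the result has to stay aligned with the input (for example, to store it as a new column), one possible approach is sketched below; the names m and res are made up for illustration.

```r
v1 <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv", "abcdNothing.csv")

# regexpr returns -1 for elements without a match
m <- regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE)

# Start with NA everywhere, then fill in only the matching positions
res <- rep(NA_character_, length(v1))
res[m != -1] <- regmatches(v1, m)

print(res)
```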

edited Aug 28 at 15:22

answered Aug 28 at 15:11 by akrun

To be more precise, you can use "(?<=Work).*(?=.csv)".
– r2evans
Aug 28 at 15:18

@avid_useR But, I am using regmatches/regexpr
– akrun
Aug 28 at 15:24

@avid_useR Okay, that is right
– akrun
Aug 28 at 15:25

@AndreElrico, doesn't [\s\S] match any character? Isn't it more concise to use .?
– r2evans
Aug 28 at 15:32

@r2evans I use both [.] and \., though I find the former easier to type.
– akrun
Aug 28 at 15:35
