Extract substring in R using grepl

I have a table with a string column formatted like this:



abcdWorkstart.csv
abcdWorkcomplete.csv


I would like to extract the last word in each filename. I think the starting pattern would be the word "Work" and the ending pattern would be ".csv". I wrote something using grepl, but it is not working.



grepl("Work*.csv", data$filename)


Basically, I want to extract whatever is between Work and .csv.



desired outcome:



start
complete






  • Please have a look at my edit, @ajax2000. It's always good practice to add the desired outcome to your question. This makes everything much easier, and people know exactly what you want. I encourage you to do this in your next question ;-).
    – Andre Elrico
    Aug 28 at 15:20














asked Aug 28 at 14:57 by ajax2000; edited Aug 28 at 15:18 by Andre Elrico



















4 Answers

Accepted answer (5 votes):










As an alternative, remove everything you don't want:



x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")

gsub("^.*Work|\\.csv$", "", x)
#[1] "start" "complete"


Please note:
gsub is required here rather than sub, because the pattern has to match twice per string: first ^.*Work, then \.csv$.
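To make the sub versus gsub point concrete, here is a small sketch that reuses the same x. The variable names pattern, first_only and all_matches are only illustrative. sub replaces just the first match of the alternation, so the extension survives, while gsub replaces every match.

```r
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")

# Two alternatives: everything up to "Work", or a trailing ".csv".
pattern <- "^.*Work|\\.csv$"

# sub() stops after the first match, so only the prefix is removed.
first_only <- sub(pattern, "", x)

# gsub() keeps going and also removes the extension.
all_matches <- gsub(pattern, "", x)

print(first_only)   # the ".csv" suffix is still present
print(all_matches)  # only the part between "Work" and ".csv" remains
```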




A note on [\s\S] or [\d\D] as a way to match any character: this does not work with sub/gsub in their default mode.



https://regex101.com/r/wFgkgG/1



It works with akrun's approach, which sets perl = TRUE:

regmatches(v1, regexpr("(?<=Work)[\\s\\S]+(?=[.]csv)", v1, perl = TRUE))



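str1 <-
'12
.2
12'

gsub("[^.]", "m", str1, perl = TRUE)   # replaces every character except the literal dot
gsub(".", "m", str1, perl = TRUE)      # PCRE: "." does not match the newline
gsub(".", "m", str1, perl = FALSE)     # default engine: "." also matches the newline

With R's default engine (perl = FALSE), . also matches \n.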
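As a hedged illustration of the engine difference described above: with perl = TRUE the dot skips newlines unless the (?s) flag is set, whereas an explicit class such as [\s\S] matches them under PCRE. The input string txt and the result names are made up for this example.

```r
txt <- "ab\ncd"

# Default (TRE) engine: "." also matches the newline.
tre_dot <- gsub(".", "m", txt)

# PCRE engine: "." leaves the newline alone ...
pcre_dot <- gsub(".", "m", txt, perl = TRUE)

# ... unless the (?s) "dotall" flag is switched on,
pcre_dotall <- gsub("(?s).", "m", txt, perl = TRUE)

# or an explicit "any character" class is used.
pcre_class <- gsub("[\\s\\S]", "m", txt, perl = TRUE)

print(c(tre_dot, pcre_dot, pcre_dotall, pcre_class))
```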






  • Pretty much all the solutions work, but I think this is more concise. Thanks.
    – ajax2000
    Sep 5 at 13:04

















Answer (9 votes):













I think you need sub or gsub (substitute/extract) instead of grepl, which only reports whether a match exists. Note that when the pattern is not found, sub returns the entire string unmodified:



fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
out <- sub(".*Work(.*)\\.csv$", "\\1", fn)
out
# [1] "start" "complete" "abcdNothing.csv"


You can work around this by filtering out the unchanged ones:



out[ out != fn ]
# [1] "start" "complete"


Or marking them invalid with NA (or something else):



out[ out == fn ] <- NA
out
# [1] "start" "complete" NA
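The two points above can be sketched together. grepl only returns TRUE or FALSE, and the pattern from the question, read as a regex rather than a file glob, does not match the sample names at all. The helper extract_between is a hypothetical name introduced here, not part of base R; it is one possible way to wrap the NA-marking step.

```r
fn <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv", "abcdNothing.csv")

# grepl() only answers "does the pattern match?"; it never returns substrings.
# As a regex, "Work*.csv" means "Wor", zero or more "k", any one character,
# then "csv", so it does not match these file names.
glob_like <- grepl("Work*.csv", fn)

# Hypothetical helper: text between "Work" and ".csv", or NA when absent.
extract_between <- function(x, pattern = ".*Work(.*)\\.csv$") {
  hit <- grepl(pattern, x)                  # which elements match at all
  out <- rep(NA_character_, length(x))      # start with all NA
  out[hit] <- sub(pattern, "\\1", x[hit])   # fill in the captured group
  out
}

result <- extract_between(fn)
print(glob_like)
print(result)
```

On the table from the question, the same helper could be applied to the column directly, for example extract_between(data$filename).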





Answer (7 votes):













With str_extract from stringr. This uses a positive lookbehind and a positive lookahead to match one or more characters (.+) between "Work" and ".csv":

x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")

library(stringr)
str_extract(x, "(?<=Work).+(?=\\.csv)")
# [1] "start" "complete"
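One property worth noting, sketched here under the assumption that stringr is installed: str_extract returns NA for elements that do not match, so the result stays aligned with the input. The third file name is borrowed from the other answers, and str_match with a capture group is shown as a possible alternative to lookarounds.

```r
library(stringr)

x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv", "abcdNothing.csv")

# Lookaround version: non-matching elements come back as NA.
via_extract <- str_extract(x, "(?<=Work).+(?=\\.csv)")

# Capture-group version: column 1 is the full match, column 2 is group 1.
via_match <- str_match(x, "Work(.+)\\.csv$")[, 2]

print(via_extract)
print(via_match)
```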





Answer (4 votes):













Here is an option using regmatches/regexpr from base R. A regex lookaround matches one or more characters that are not a . after the string 'Work' and before '.csv'; regmatches then extracts the match:

regmatches(v1, regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE))
#[1] "start" "complete"

Data:

v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
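Note that regmatches drops elements without a match, which is why the output above has two entries for three inputs. A minimal sketch of one way to keep the result aligned with v1 follows; the names m, matched_only and aligned are illustrative.

```r
v1 <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv", "abcdNothing.csv")

# regexpr() returns the match position per element, or -1 when nothing matches.
m <- regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE)

# regmatches() silently drops the non-matching elements ...
matched_only <- regmatches(v1, m)

# ... so fill a vector of NAs to keep one result per input.
aligned <- rep(NA_character_, length(v1))
aligned[m != -1] <- matched_only

print(matched_only)
print(aligned)
```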





  • To be more precise, you can use "(?<=Work).*(?=\\.csv)".
    – r2evans
    Aug 28 at 15:18

  • @avid_useR But, I am using regmatches/regexpr
    – akrun
    Aug 28 at 15:24

  • @avid_useR Okay, that is right
    – akrun
    Aug 28 at 15:25

  • @AndreElrico, doesn't [\s\S] match any character? Isn't it more concise to use .?
    – r2evans
    Aug 28 at 15:32

  • @r2evans I use both [.] and \\., though I find the former easier to type.
    – akrun
    Aug 28 at 15:35










Accepted gsub answer: answered Aug 28 at 15:07 by Andre Elrico (edited Aug 28 at 15:57)











sub answer: answered Aug 28 at 15:02 by r2evans




















str_extract answer: answered Aug 28 at 15:07 by avid_useR (edited Aug 28 at 15:13)




















regmatches answer: answered Aug 28 at 15:11 by akrun (edited Aug 28 at 15:22)






