Python - Extract Code from Text using Regex









I am a Python beginner looking for help with an extraction problem.

I have a bunch of text files and need to extract every occurrence of an expression ("C" followed by exactly 9 numerical digits) and write the matches to a file, together with the filename of the text file they came from. Each occurrence of the expression I want to catch starts at the beginning of a new line and ends with a "\n".

    sample_text = """Some random text here
    and here
    and here
    C123456789
    some random text here
    C987654321
    and here
    and here"""

What the output should look like (in the output file):

    My_desired_output_file = "filename,C123456789,C987654321"

My code so far:

    min_file_size = 5

    def list_textfiles(directory, min_file_size): # Creates a list of all files stored in DIRECTORY ending on '.txt'
        textfiles = []
        for root, dirs, files in os.walk(directory):
            for name in files:
                filename = os.path.join(root, name)
                if os.stat(filename).st_size > min_file_size:
                    textfiles.append(filename)

    for filename in list_textfiles(temp_directory, min_file_size):
        string = str(filename)
        text = infile.read()
        regex = ???
        with open(filename, 'w', encoding="utf-8") as outfile:
            outfile.write(regex)






python regex text-extraction






asked yesterday by Dominik Scheld (edited yesterday)











  • You can always test your regex with an online testing tool such as regex101.com
    – jjj
    yesterday

  • If you don't mind having the filename in front of every match: grep -r -E '^C[0-9]{9}$' --exclude out.txt > out.txt
    – mikuszefski
    yesterday

  • Just add -h to avoid printing the filename in grep
    – tripleee
    23 hours ago
















3 Answers
Your regex is '^C[0-9]{9}$':

    ^       start of line
    C       exact match
    [0-9]   any digit
    {9}     9 times
    $       end of line
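
To see the pattern at work in Python, here is a minimal sketch. It assumes the whole text is searched at once, in which case the re.MULTILINE flag is needed so that ^ and $ match at every line boundary. The names CODE_PATTERN and codes, and the two extra non-matching lines in the sample, are illustrative additions and not part of the answer.

```python
import re

# "C" followed by exactly nine digits, alone on its line.
# re.MULTILINE makes ^ and $ match at each line boundary instead of only
# at the start and end of the whole string.
CODE_PATTERN = re.compile(r"^C[0-9]{9}$", re.MULTILINE)

sample_text = """Some random text here
and here
C123456789
some random text here
C987654321
C12345678
XC123456789
and here"""

# "C12345678" has only eight digits and "XC123456789" does not start
# with "C", so neither line is returned.
codes = CODE_PATTERN.findall(sample_text)
print(codes)  # ['C123456789', 'C987654321']
```

Without the flag, ^ would only match at the very start of the text, so searching the sample above in one call would find nothing.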






answered yesterday by sudonym











  • thanks a lot! will this regex store each occurrence of the expression in a list so that I can write it to an outfile?
    – Dominik Scheld
    yesterday

  • come on, either include your code attempts or google it
    – sudonym
    yesterday
















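    import re

    # "C" followed by exactly nine digits, with nothing else on the line
    regex = re.compile(r'^C\d{9}$')
    matches = []
    with open('file.txt', 'r') as file:
        for line in file:
            line = line.strip()  # drop the trailing newline and surrounding whitespace
            if regex.match(line):
                matches.append(line)

You can then write this list to a file as needed.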
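
As one possible way to do that write, the sketch below appends a single line in the filename,match1,match2 layout the question asks for. The helper name append_record, the output file name out.csv, and the choice of the csv module are assumptions made for illustration; they are not part of the answer.

```python
import csv
import os
import tempfile

def append_record(out_path, source_name, matches):
    """Append one line of the form source_name,match1,match2,... to out_path."""
    # newline="" is what the csv documentation recommends for files handed
    # to csv.writer; lineterminator="\n" keeps the line ending predictable.
    with open(out_path, "a", encoding="utf-8", newline="") as outfile:
        writer = csv.writer(outfile, lineterminator="\n")
        writer.writerow([source_name, *matches])

# Usage with a throwaway output file (hypothetical names).
out_path = os.path.join(tempfile.mkdtemp(), "out.csv")
append_record(out_path, "file.txt", ["C123456789", "C987654321"])

with open(out_path, encoding="utf-8") as f:
    written = f.read()
print(written.strip())  # file.txt,C123456789,C987654321
```

Opening in append mode means the helper can be called once per input file without overwriting earlier lines.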







answered yesterday by Alex




















How about:

    import re

    sample_text = """Some random text here
    and here
    and here
    C123456789
    some random text here
    C987654321
    and here
    and here"""

    k = re.findall(r'(C\d{9})', sample_text)

    print(k)

This will return all occurrences of that pattern. You could also iterate over the lines of each text file and store the matches you find. Something like:

Updated:

    import glob
    import os
    import re

    search = {}
    os.chdir('/FolderWithTxTs')
    for file in glob.glob("*.txt"):
        with open(file, 'r') as f:
            data = [re.findall(r'(C\d{9})', i) for i in f]
            search.update({f.name: data})

    print(search)

This would return a dictionary with file names as keys and the matches found in each file as values (one list per line, empty where a line has no match).
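
One detail worth checking in the per-file version: calling re.findall on each line produces one list per line, so lines without a match contribute empty lists. The sketch below contrasts that with a single re.findall over the whole text. Here io.StringIO stands in for an open file, and the variable names are hypothetical.

```python
import io
import re

pattern = re.compile(r"C\d{9}")

# io.StringIO behaves like an open text file, so no files on disk are needed.
fake_file = io.StringIO("and here\nC123456789\nsome text\nC987654321\n")

# Per-line findall: one list per line, empty for lines without a match.
per_line = [pattern.findall(line) for line in fake_file]

# Whole-text findall: rewind, read everything, get a single flat list.
fake_file.seek(0)
flat = pattern.findall(fake_file.read())

print(per_line)  # [[], ['C123456789'], [], ['C987654321']]
print(flat)      # ['C123456789', 'C987654321']
```

The flat form is usually the easier one to join into a single comma-separated output line.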







answered yesterday by Prayson Daniel (edited 21 hours ago)







  • {9,9} is equivalent to {9}
    – tripleee
    23 hours ago

  • I would suggest the following: data = re.findall(r'C\d{9}', f.read())... Or data = [re.match(r'C\d{9}', i) for i in f]... You also lose the information of the filename if you just add the occurrences to a flat list.
    – Sven Krüger
    23 hours ago

  • Thanks for your comments. I will update the code ASAP
    – Prayson Daniel
    21 hours ago
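
Putting the thread's suggestions together, a rough sketch of the whole task might look like this: walk a directory as in the question, apply the anchored pattern to each .txt file, and build one filename,code,... line per file so the filename is not lost. The function names collect_codes and format_line, the demo file a.txt, and the temporary directory are assumptions for illustration only.

```python
import os
import re
import tempfile

CODE_PATTERN = re.compile(r"^C[0-9]{9}$", re.MULTILINE)

def collect_codes(directory, min_file_size=5):
    """Return {path: [codes]} for every .txt file larger than min_file_size bytes."""
    results = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            if name.endswith(".txt") and os.stat(path).st_size > min_file_size:
                with open(path, encoding="utf-8") as infile:
                    results[path] = CODE_PATTERN.findall(infile.read())
    return results

def format_line(path, codes):
    """Build one output line: the bare filename followed by its codes."""
    return ",".join([os.path.basename(path), *codes])

# Demo on a throwaway directory containing a single text file.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "a.txt"), "w", encoding="utf-8") as f:
    f.write("some text\nC123456789\nmore text\nC987654321\n")

lines = [format_line(path, codes) for path, codes in collect_codes(workdir).items()]
print(lines)  # ['a.txt,C123456789,C987654321']
```

Keeping the results in a dictionary keyed by path preserves the link between each match and its source file until the output lines are built.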











