Python - Extract Code from Text using Regex

I am a Python beginner looking for help with an extraction problem.

I have a bunch of text files and need to extract every occurrence of an expression ("C" followed by exactly 9 numerical digits) and write them to a file, together with the filename of the text file. Each occurrence I want to catch starts at the beginning of a new line and ends with a "\n".

sample_text = """Some random text here 
and here
and here
C123456789
some random text here
C987654321
and here
and here"""

What the output should look like (in the output file)

My_desired_output_file = "filename,C123456789,C987654321"

My code so far:

import os

min_file_size = 5

def list_textfiles(directory, min_file_size):  # Creates a list of all files stored in DIRECTORY ending on '.txt'
    textfiles = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            filename = os.path.join(root, name)
            if os.stat(filename).st_size > min_file_size:
                textfiles.append(filename)
    return textfiles

for filename in list_textfiles(temp_directory, min_file_size):
    string = str(filename)
    text = infile.read()
    regex = ???
    with open(filename, 'w', encoding="utf-8") as outfile:
        outfile.write(regex)

edited yesterday

asked yesterday

Dominik Scheld

You can always test your regex with an online testing tool such as regex101.com
– jjj
yesterday

If you don't mind having the filename in front of every match: grep -r -E '^C[0-9]{9}$' --exclude out.txt > out.txt
– mikuszefski
yesterday

Just add -h to avoid printing the filename in grep
– tripleee
23 hours ago

python regex text-extraction

3 Answers
Your regex is '^C[0-9]{9}$'

^ start of line
C exact match
[0-9] any digit
{9} 9 times
$ end of line
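
As a minimal sketch of how this pattern could be applied in Python (the name CODE_PATTERN is made up for illustration, and the re.MULTILINE flag is an addition so that ^ and $ match at every line boundary of a multi-line string):

```python
import re

# re.MULTILINE lets ^ and $ match at the start and end of each line,
# not only at the start and end of the whole string.
CODE_PATTERN = re.compile(r'^C[0-9]{9}$', re.MULTILINE)

sample_text = """Some random text here
and here
and here
C123456789
some random text here
C987654321
and here
and here"""

# With no capture groups, findall returns the full matches as a list of strings.
codes = CODE_PATTERN.findall(sample_text)
print(codes)  # ['C123456789', 'C987654321']
```

Without the flag, ^ and $ would only anchor to the whole string, so nothing in sample_text would match.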

answered yesterday

sudonym

Thanks a lot! Will this regex store each occurrence of the expression in a list so that I can write it to an outfile?
– Dominik Scheld
yesterday

come on, either include your code attempts or google it
– sudonym
yesterday

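import re

# Match a whole line: "C" followed by exactly 9 digits.
regex = re.compile(r'^C\d{9}$')
matches = []
with open('file.txt', 'r') as file:
    for line in file:
        line = line.strip()  # drop the trailing newline before matching
        if regex.match(line):
            matches.append(line)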

You can then write this list to a file as needed.
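
One possible way to do that last step, sketched under the assumption that the output should follow the "filename,C123456789,C987654321" layout from the question. The helper names format_row and append_row are hypothetical:

```python
import re

CODE_PATTERN = re.compile(r'^C\d{9}$')

def format_row(source_name, lines):
    """Return 'source_name,code1,code2,...' for the lines that are codes."""
    stripped = (line.strip() for line in lines)
    matches = [line for line in stripped if CODE_PATTERN.match(line)]
    return ",".join([source_name] + matches)

def append_row(out_path, row):
    """Append one row to the output file (hypothetical helper)."""
    with open(out_path, "a", encoding="utf-8") as outfile:
        outfile.write(row + "\n")

row = format_row("file.txt", ["and here\n", "C123456789\n", "some text\n", "C987654321\n"])
print(row)  # file.txt,C123456789,C987654321
```

Opening the output in append mode ("a") keeps rows from earlier files. Opening it with "w" inside the loop would overwrite them.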

answered yesterday

Alex

How about:

import re

sample_text = """Some random text here 
and here
and here
C123456789
some random text here
C987654321
and here
and here"""

k = re.findall(r'(C\d{9})', sample_text)

print(k)

This will return all occurrences of that pattern. You can also read each file line by line and store the matches per file. Something like:

Updated:

import glob
import os
import re

search = {}
os.chdir('/FolderWithTxTs')
for file in glob.glob("*.txt"):
    with open(file, 'r') as f:
        data = [re.findall(r'(C\d{9})', i) for i in f]
        search.update({f.name: data})

print(search)

This would return a dictionary with file names as keys and, for each file, one list of matches per line (empty for lines without a match).
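
For the output format asked for in the question, a flat list of codes per file may be easier to work with. The following is only a sketch: collect_codes, write_report and report.csv are made-up names, and it reads each file whole and uses re.MULTILINE instead of looping over lines.

```python
import os
import re
import tempfile

CODE_PATTERN = re.compile(r'^C\d{9}$', re.MULTILINE)

def collect_codes(directory):
    """Map each .txt path under directory to the flat list of codes found in it."""
    found = {}
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            if not name.endswith(".txt"):
                continue
            path = os.path.join(root, name)
            with open(path, "r", encoding="utf-8") as infile:
                found[path] = CODE_PATTERN.findall(infile.read())
    return found

def write_report(found, out_path):
    """Write one 'filename,code,code,...' line per input file."""
    with open(out_path, "w", encoding="utf-8") as outfile:
        for path, codes in found.items():
            outfile.write(",".join([os.path.basename(path)] + codes) + "\n")

# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.txt"), "w", encoding="utf-8") as f:
        f.write("Some random text here\nC123456789\nand here\nC987654321\n")
    found = collect_codes(tmp)
    report_path = os.path.join(tmp, "report.csv")
    write_report(found, report_path)
    with open(report_path, encoding="utf-8") as f:
        report = f.read()

print(report)  # a.txt,C123456789,C987654321
```

Writing the report to a separate file, rather than back to the input filename, avoids overwriting the text files being searched.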

edited 21 hours ago

answered yesterday

Prayson Daniel

{9,9} is equivalent to {9}
– tripleee
23 hours ago

I would suggest the following: data = re.findall(r'C\d{9}', f.read())... Or data = [re.match(r'C\d{9}', i) for i in f]... You also lose the filename information if you just add the occurrences to a flat list.
– Sven Krüger
23 hours ago

Thanks for your comments. I will update the code ASAP
– Prayson Daniel
21 hours ago
