Python - Extract Code from Text using Regex

I am a Python beginner and am looking for help with an extraction problem.
I have a bunch of text files and need to extract every occurrence of an expression ("C" followed by exactly 9 numerical digits) and write them to a file, together with the filename of the text file. Each occurrence of the expression I want to catch starts at the beginning of a new line and ends with a newline ("\n").
sample_text = """Some random text here
and here
and here
C123456789
some random text here
C987654321
and here
and here"""
What the output should look like (in the output file)
My_desired_output_file = "filename,C123456789,C987654321"
My code so far:

import os

min_file_size = 5

def list_textfiles(directory, min_file_size):
    # Creates a list of all files stored in DIRECTORY ending on '.txt'
    textfiles = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            filename = os.path.join(root, name)
            if os.stat(filename).st_size > min_file_size:
                textfiles.append(filename)
    return textfiles

for filename in list_textfiles(temp_directory, min_file_size):
    string = str(filename)
    text = infile.read()
    regex = ???
    with open(filename, 'w', encoding="utf-8") as outfile:
        outfile.write(regex)
python regex text-extraction
You can always test your regex with an online testing tool such as regex101.com
– jjj
yesterday
If you don't mind having the filename in front of every match:
grep -r -E '^C[0-9]{9}$' --exclude out.txt > out.txt
– mikuszefski
yesterday
Just add -h to avoid printing the filename in grep
– tripleee
23 hours ago
edited yesterday
asked yesterday


Dominik Scheld
3 Answers
Your regex is '^C[0-9]{9}$':
^ start of line
C exact match
[0-9] any digit
{9} 9 times
$ end of line
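As a minimal sketch of how this pattern could be applied to a whole text at once: by default ^ and $ only match at the very start and end of the string, so the re.MULTILINE flag is needed when the text spans many lines. The variable names below are illustrative only.

```python
import re

sample_text = """Some random text here
and here
C123456789
some random text here
C987654321
and here"""

# re.MULTILINE makes ^ and $ match at the start and end of every line,
# not only at the start and end of the whole string.
pattern = re.compile(r'^C[0-9]{9}$', re.MULTILINE)

# With no capturing groups, findall returns the matched strings as a list.
codes = pattern.findall(sample_text)
print(codes)
```

Because findall returns a plain list of strings, the result can be joined or written to a file directly.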
answered yesterday
sudonym
Thanks a lot! Will this regex store each occurrence of the expression in a list so that I can write it to an outfile?
– Dominik Scheld
yesterday
Come on, either include your code attempts or google it.
– sudonym
yesterday
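import re

# "C" followed by exactly 9 digits, filling the whole (stripped) line
regex = re.compile(r'^C\d{9}$')
matches = []
with open('file.txt', 'r') as file:
    for line in file:
        line = line.strip()  # drop the trailing newline before matching
        if regex.match(line):
            matches.append(line)

You can then write this list to a file as needed.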
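One way that last step might look, assuming the layout asked for in the question (the file name first, then each match, comma-separated on one line). The helper name write_matches and the output file name are hypothetical.

```python
import os
import tempfile

def write_matches(out_path, source_name, matches):
    """Append one line of the form 'source_name,match1,match2' to out_path."""
    with open(out_path, 'a', encoding='utf-8') as outfile:
        outfile.write(','.join([source_name] + list(matches)) + '\n')

# Demonstration in a temporary directory so that no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'out.csv')
    write_matches(out_path, 'file.txt', ['C123456789', 'C987654321'])
    with open(out_path, encoding='utf-8') as check:
        written = check.read()

print(written)
```

Opening the output in append mode means the helper can be called once per input file without overwriting earlier lines.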
answered yesterday
Alex
How about:

import re

sample_text = """Some random text here
and here
and here
C123456789
some random text here
C987654321
and here
and here"""

k = re.findall(r'(C\d{9})', sample_text)
print(k)

This will return all occurrences of that pattern. You could also go through your text files and store the matches found in each one. Something like:

Updated:

import glob
import os
import re

search = {}
os.chdir('/FolderWithTxTs')
for file in glob.glob("*.txt"):
    with open(file, 'r') as f:
        # read the whole file so that each file gets one flat list of matches
        data = re.findall(r'(C\d{9})', f.read())
        search.update({f.name: data})
print(search)

This would return a dictionary with file names as keys and a list of found matches as values.
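Putting the pieces together for the original question, here is a sketch that walks a directory tree (as the asker's os.walk loop does) instead of changing into a single folder. The function name collect_codes and the demo file names are hypothetical, and the anchored pattern assumes each code sits alone on its line.

```python
import os
import re
import tempfile

# Anchored variant: the code must fill a whole line.
CODE_RE = re.compile(r'^C\d{9}$', re.MULTILINE)

def collect_codes(directory):
    """Map each .txt file path under directory to the list of codes found in it."""
    found = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if not name.endswith('.txt'):
                continue
            path = os.path.join(root, name)
            with open(path, 'r', encoding='utf-8') as infile:
                found[path] = CODE_RE.findall(infile.read())
    return found

# Small demonstration with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'a.txt'), 'w', encoding='utf-8') as f:
        f.write('intro\nC123456789\nmore text\nC987654321\n')
    with open(os.path.join(tmp, 'b.txt'), 'w', encoding='utf-8') as f:
        f.write('nothing to see\nC12345\n')
    result = {os.path.basename(p): codes
              for p, codes in collect_codes(tmp).items()}

print(result)
```

Keeping the file name as the dictionary key preserves the link between each match and its source file, which is what the requested output format needs.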
edited 21 hours ago
answered yesterday


Prayson Daniel
{9,9} is equivalent to {9}
– tripleee
23 hours ago
I would suggest the following: data = re.findall(r'C\d{9}', f.read()) ... Or data = [re.match(r'C\d{9}', i) for i in f] ... You also lose the information of the filename if you just add the occurrences to a flat list.
– Sven Krüger
23 hours ago
Thanks for your comments. I will update the code ASAP.
– Prayson Daniel
21 hours ago
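The two variants suggested in the comment above behave differently, which a short sketch may make clearer: re.findall on the whole text returns a flat list of strings, while re.match applied per line returns a match object or None for every line, so the None entries have to be filtered out afterwards. An in-memory io.StringIO stands in for the open file here.

```python
import io
import re

text = "intro\nC123456789\nother\nC987654321\n"

# Variant 1: search the whole text at once; the result is a list of strings.
flat = re.findall(r'C\d{9}', text)

# Variant 2: test each line separately. re.match only looks at the start of
# the string and yields either a match object or None.
per_line = [re.match(r'C\d{9}', line) for line in io.StringIO(text)]
codes = [m.group(0) for m in per_line if m is not None]

print(flat)
print(codes)
```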