Python - Extract Code from Text using Regex
I am a Python beginner looking for help with an extraction problem.
I have a bunch of text files and need to extract every occurrence of a particular expression ("C" followed by exactly 9 digits) and write the matches to an output file, together with the name of the text file they came from. Each occurrence I want to catch starts at the beginning of a line and ends with a newline ("\n").
sample_text = """Some random text here
and here
and here
C123456789
some random text here
C987654321
and here
and here"""
What the output should look like (in the output file):
My_desired_output_file = "filename,C123456789,C987654321"
My code so far:
min_file_size = 5

def list_textfiles(directory, min_file_size):
    # Creates a list of all files stored in DIRECTORY ending on '.txt'
    textfiles = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            filename = os.path.join(root, name)
            if os.stat(filename).st_size > min_file_size:
                textfiles.append(filename)

for filename in list_textfiles(temp_directory, min_file_size):
    string = str(filename)
    text = infile.read()
    regex = ???
    with open(filename, 'w', encoding="utf-8") as outfile:
        outfile.write(regex)
python regex text-extraction
edited yesterday
asked yesterday
Dominik Scheld
You can always test your regex with an online testing tool such as regex101.com
– jjj
yesterday
If you don't mind having the filename in front of every match: grep -r -E '^C[0-9]{9}$' --exclude out.txt > out.txt
– mikuszefski
yesterday
Just add -h to avoid printing the filename in grep
– tripleee
23 hours ago
3 Answers
Your regex is '^C[0-9]{9}$':
^ start of line
C exact match
[0-9] any digit
{9} 9 times
$ end of line
answered yesterday
sudonym
Thanks a lot! Will this regex store each occurrence of the expression in a list so that I can write it to an outfile?
– Dominik Scheld
yesterday
Come on, either include your code attempts or google it.
– sudonym
yesterday
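On the follow-up question above: a regex by itself stores nothing, but `re.findall` returns the matches as a list. A minimal sketch, assuming the whole file has already been read into one string (`sample_text` and `codes` are illustrative names, and the two non-matching lines were added here to show what the anchors exclude). `re.MULTILINE` is needed so that `^` and `$` anchor at every line rather than only at the start and end of the string.

```python
import re

sample_text = """Some random text here
C123456789
see C111111111 in the middle of a line
C1234567890
C987654321
and here"""

# re.MULTILINE makes ^ and $ match at the start and end of each line.
pattern = re.compile(r'^C[0-9]{9}$', re.MULTILINE)

# findall returns a plain list of the matched strings.
codes = pattern.findall(sample_text)
print(codes)  # ['C123456789', 'C987654321']
```

The code in the middle of a line and the one with ten digits are skipped, because the pattern requires the line to consist of "C" and exactly nine digits.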
import re

# Raw string; $ ensures nothing follows the ninth digit.
regex = re.compile(r'^C\d{9}$')
matches = []
with open('file.txt', 'r') as file:
    for line in file:
        line = line.strip()  # drop the trailing newline
        if regex.match(line):
            matches.append(line)
You can then write this list to a file as needed.
answered yesterday
Alex
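One way to write the collected list in the format the question asks for (`filename,C123456789,C987654321`) is to join the filename and the matches with commas. This is only a sketch: the `format_row` helper and the `out.txt` name are assumptions made for illustration, not part of the answer above, and the file is written to a temporary directory so the example leaves nothing behind.

```python
import os
import tempfile

def format_row(filename, matches):
    """Return 'filename,match1,match2,...' as a single line (hypothetical helper)."""
    return ",".join([filename, *matches])

matches = ["C123456789", "C987654321"]
row = format_row("file.txt", matches)

# Write the row, then read it back to show what ends up in the file.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "out.txt")
    with open(out_path, "w", encoding="utf-8") as outfile:
        outfile.write(row + "\n")
    with open(out_path, encoding="utf-8") as infile:
        written = infile.read()

print(written, end="")  # file.txt,C123456789,C987654321
```

Plain joining breaks if a filename itself contains a comma; in that case the standard `csv` module is the safer choice.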
How about:
import re
sample_text = """Some random text here
and here
and here
C123456789
some random text here
C987654321
and here
and here"""
k = re.findall(r'(C\d{9})', sample_text)
print(k)
This returns all occurrences of that pattern. You can also read the text of each file and store the matches per file, something like:
Updated:
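import glob
import os
import re

search = {}  # maps each file name to the list of codes found in it
os.chdir('/FolderWithTxTs')
for file in glob.glob("*.txt"):
    with open(file, 'r') as f:
        # re.MULTILINE lets ^ and $ anchor at every line of the file
        data = re.findall(r'^C\d{9}$', f.read(), flags=re.MULTILINE)
    search.update({f.name: data})
print(search)
This would return a dictionary with file names as keys and lists of found matches as values.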
edited 21 hours ago
answered yesterday
Prayson Daniel
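Putting the pieces of this thread together, the whole task (walk a directory, skip small files, collect the codes per file, write one line per file) might look roughly like the sketch below. The function names `find_codes`, `collect_codes` and `write_report` are hypothetical, and the `.txt` filter and the size threshold are taken from the question's description rather than from any answer.

```python
import csv
import os
import re

# "C" plus exactly nine digits, filling a whole line.
CODE_RE = re.compile(r'^C\d{9}$', re.MULTILINE)

def find_codes(text):
    """Return every code that occupies a full line of text."""
    return CODE_RE.findall(text)

def collect_codes(directory, min_file_size=5):
    """Map each .txt file under directory to the list of codes found in it."""
    results = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            if name.endswith('.txt') and os.stat(path).st_size > min_file_size:
                with open(path, encoding='utf-8') as infile:
                    results[path] = find_codes(infile.read())
    return results

def write_report(results, out_path):
    """Write one CSV row per file: the filename followed by its codes."""
    with open(out_path, 'w', encoding='utf-8', newline='') as outfile:
        writer = csv.writer(outfile)
        for path, codes in sorted(results.items()):
            writer.writerow([path, *codes])
```

A call such as `write_report(collect_codes('/FolderWithTxTs'), 'report.csv')` would then produce rows of the form `filename,C123456789,C987654321`. Giving the report a name that does not end in `.txt`, or keeping it outside the scanned folder, avoids it being picked up on the next run.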
{9,9} is equivalent to {9}
– tripleee
23 hours ago
I would suggest the following: data = re.findall(r'C\d{9}', f.read()) ... Or data = [re.match(r'C\d{9}', i) for i in f] ... You also lose the information of the filename if you just add the occurrences to a flat list.
– Sven Krüger
23 hours ago
Thanks for your comments. I will update the code ASAP
– Prayson Daniel
21 hours ago
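The two points raised in these comments can be checked directly. A small sketch (the `lines` list is made-up sample data): `{9,9}` and `{9}` find the same matches, and `re.match` returns a match object or `None` for each line, so the `None` entries have to be filtered out before the result is a list of codes.

```python
import re

lines = ["Some random text here", "C123456789", "and here", "C987654321"]
text = "\n".join(lines)

# {9,9} (at least 9, at most 9) finds the same matches as {9}.
same = re.findall(r'C\d{9,9}', text) == re.findall(r'C\d{9}', text)

# re.match anchors at the start of each string and returns a match object or None.
per_line = [re.match(r'C\d{9}', line) for line in lines]
codes = [m.group(0) for m in per_line if m is not None]

print(same)   # True
print(codes)  # ['C123456789', 'C987654321']
```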