How can I find duplicates in the first column, then remove the corresponding whole lines?
I have an xlsx file (a 110725×9 matrix) that I saved as tab-delimited text, because I don't know whether Unix tools can work on xlsx files directly. Duplicate rows are always consecutive.
For example, take the text file below. Rows 3–4, 7–8 and 17–18 have the same first column. I'd like to always remove the upper duplicate line, never the lower one.
2009,37214611872 2009 135 20 17,1 17,4 19,2 21,8 24,1
2009,37237442922 2009 135 22 16,5 14,5 12,6 11,2 10,5
2009,37260273973 2009 136 0 7,7 7,2 7,1 7,3 7,5
2009,37260273973 2009 136 0 7,7 7,2 7,0 7,2 7,4
2009,37488584475 2009 136 20 14,6 15,1 16,4 18,3 20,1
2009,37511415525 2009 136 22 15,9 14,6 12,8 10,9 9,4
2009,37534246575 2009 137 0 8,2 6,9 6,2 6,2 6,4
2009,37534246575 2009 137 0 8,1 6,8 6,1 6,0 6,3
2009,37557077626 2009 137 2 6,8 6,7 6,5 6,3 6,2
2009,37579908676 2009 137 4 5,8 5,6 5,4 5,4 5,7
2009,37602739726 2009 137 6 6,3 6,1 5,9 5,8 5,8
2009,37625570776 2009 137 8 4,5 5,2 6,0 6,6 7,2
2009,37648401826 2009 137 10 9,6 9,0 8,4 8,4 9,1
2009,37671232877 2009 137 12 11,4 11,7 12,4 13,4 14,4
2009,37694063927 2009 137 14 12,4 13,1 14,2 15,4 16,7
2009,37785388128 2009 137 22 15,5 14,0 12,2 10,3 8,7
2009,37808219178 2009 138 0 6,3 5,8 5,5 5,5 5,8
2009,37808219178 2009 138 0 6,2 5,7 5,4 5,4 5,7
So the output should be:
2009,37214611872 2009 135 20 17,1 17,4 19,2 21,8 24,1
2009,37237442922 2009 135 22 16,5 14,5 12,6 11,2 10,5
2009,37260273973 2009 136 0 7,7 7,2 7,0 7,2 7,4
2009,37488584475 2009 136 20 14,6 15,1 16,4 18,3 20,1
2009,37511415525 2009 136 22 15,9 14,6 12,8 10,9 9,4
2009,37534246575 2009 137 0 8,1 6,8 6,1 6,0 6,3
2009,37557077626 2009 137 2 6,8 6,7 6,5 6,3 6,2
2009,37579908676 2009 137 4 5,8 5,6 5,4 5,4 5,7
2009,37602739726 2009 137 6 6,3 6,1 5,9 5,8 5,8
2009,37625570776 2009 137 8 4,5 5,2 6,0 6,6 7,2
2009,37648401826 2009 137 10 9,6 9,0 8,4 8,4 9,1
2009,37671232877 2009 137 12 11,4 11,7 12,4 13,4 14,4
2009,37694063927 2009 137 14 12,4 13,1 14,2 15,4 16,7
2009,37785388128 2009 137 22 15,5 14,0 12,2 10,3 8,7
2009,37808219178 2009 138 0 6,2 5,7 5,4 5,4 5,7
How can I do that without sorting?
Tags: command-line, text-processing, duplicate, uniq
edited Aug 22 at 6:16 by muru
asked Aug 22 at 4:19 by Suat Yazıcı
Just out of curiosity, why do you not want to use sort for this? The whole thing can be accomplished in a single line: sort input-file | uniq -w 16 > output-file. Again, this is only curiosity. – Terrance, Aug 22 at 13:29
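As a side note on that comment (an editorial sketch, not part of the original thread): -w is a GNU uniq extension, and 16 happens to be the width of the first column in this data.

# -w 16 makes uniq compare only the first 16 characters of each line,
# i.e. the first column (e.g. "2009,37260273973" is 16 characters).
# Caveat: uniq keeps the FIRST line of each run of duplicates, so this
# keeps the upper duplicate, while the question asks to keep the lower.
sort input-file | uniq -w 16 > output-file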
1 Answer

accepted, 8 votes
To remove duplicates based on a single column, you can use awk:

awk '!seen[$1]++' input-file > output-file

You can see an explanation of this idiom in this Unix & Linux post.
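For readers unfamiliar with the idiom, !seen[$1]++ expands to roughly the following (an illustrative sketch, not the author's wording):

awk '{
    # seen[$1] counts how many times this first field has occurred so far
    # (awk arrays spring into existence with a zero/empty value).
    # Print the line only on the first occurrence of its first field,
    # then increment the count.
    if (seen[$1] == 0) print
    seen[$1]++
}' input-file > output-file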
Removing the older lines is more complicated. Given that duplicates always come together, you can do:

awk 'prev && ($1 != prev) { print seen[prev] } { seen[$1] = $0; prev = $1 } END { print seen[$1] }' input-file > output-file
Here, in the middle block, seen[$1] = $0 saves the current line ($0) in the seen array, indexed by the first field ($1), and then saves the first field in the prev variable. That prev is used in the first block when processing the next line.

In the first block, then, we check whether prev is set (only true from the second line onwards) and whether it differs from the current first field (prev was set while processing the previous line). If it differs, we have moved past a group of duplicates and can print the saved line for the previous group. At the END, we do the same for the last group.
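For reference, here is the same command with the blocks laid out and commented (a restatement, functionally identical to the one-liner above):

awk '
    # First field changed: the previous group is complete, so emit the
    # last line that was saved for it (i.e. the lower duplicate).
    prev && ($1 != prev) { print seen[prev] }

    # Remember the current line, keyed by its first field, and record
    # the field for the comparison on the next input line.
    { seen[$1] = $0; prev = $1 }

    # Emit the last line of the final group.
    END { print seen[$1] }
' input-file > output-file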
edited Aug 22 at 8:15 · answered Aug 22 at 6:26 by muru

This code removed the lower duplicates. I need to remove the upper duplicates, not the lower ones. – Suat Yazıcı, Aug 22 at 6:47
@SuatYazıcı see update. – muru, Aug 22 at 7:13
If removing lower duplicates is easier, a possible "solution" for removing upper duplicates is tac | awk [remove lower duplicates] | tac :-) – egmont, Aug 22 at 7:14
@muru awk: line 1: syntax error at or near } – Suat Yazıcı, Aug 22 at 7:37
@egmont As far as I can see, the tac command is like cat in reverse, but I haven't managed to get it to work yet. – Suat Yazıcı, Aug 22 at 7:38
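Spelled out, the pipeline egmont suggests would look like this (a sketch; tac is part of GNU coreutils):

# tac reverses the line order, so the dedup step keeps what was
# originally the LAST line of each duplicate group; the second tac
# restores the original order.
tac input-file | awk '!seen[$1]++' | tac > output-file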