How can I find duplicates in the first column, then remove the corresponding whole lines?
I have an xlsx file (a 110725×9 matrix) that I saved as tab-delimited text, because I don't know whether Unix tools can work on xlsx files directly. Duplicate rows are always consecutive.
For example, take the text file below. Rows 3–4, 7–8 and 17–18 have the same first column. I'd like to always remove the upper duplicate line, never the lower one.
2009,37214611872 2009 135 20 17,1 17,4 19,2 21,8 24,1
2009,37237442922 2009 135 22 16,5 14,5 12,6 11,2 10,5
2009,37260273973 2009 136 0 7,7 7,2 7,1 7,3 7,5
2009,37260273973 2009 136 0 7,7 7,2 7,0 7,2 7,4
2009,37488584475 2009 136 20 14,6 15,1 16,4 18,3 20,1
2009,37511415525 2009 136 22 15,9 14,6 12,8 10,9 9,4
2009,37534246575 2009 137 0 8,2 6,9 6,2 6,2 6,4
2009,37534246575 2009 137 0 8,1 6,8 6,1 6,0 6,3
2009,37557077626 2009 137 2 6,8 6,7 6,5 6,3 6,2
2009,37579908676 2009 137 4 5,8 5,6 5,4 5,4 5,7
2009,37602739726 2009 137 6 6,3 6,1 5,9 5,8 5,8
2009,37625570776 2009 137 8 4,5 5,2 6,0 6,6 7,2
2009,37648401826 2009 137 10 9,6 9,0 8,4 8,4 9,1
2009,37671232877 2009 137 12 11,4 11,7 12,4 13,4 14,4
2009,37694063927 2009 137 14 12,4 13,1 14,2 15,4 16,7
2009,37785388128 2009 137 22 15,5 14,0 12,2 10,3 8,7
2009,37808219178 2009 138 0 6,3 5,8 5,5 5,5 5,8
2009,37808219178 2009 138 0 6,2 5,7 5,4 5,4 5,7
So the output should be:
2009,37214611872 2009 135 20 17,1 17,4 19,2 21,8 24,1
2009,37237442922 2009 135 22 16,5 14,5 12,6 11,2 10,5
2009,37260273973 2009 136 0 7,7 7,2 7,0 7,2 7,4
2009,37488584475 2009 136 20 14,6 15,1 16,4 18,3 20,1
2009,37511415525 2009 136 22 15,9 14,6 12,8 10,9 9,4
2009,37534246575 2009 137 0 8,1 6,8 6,1 6,0 6,3
2009,37557077626 2009 137 2 6,8 6,7 6,5 6,3 6,2
2009,37579908676 2009 137 4 5,8 5,6 5,4 5,4 5,7
2009,37602739726 2009 137 6 6,3 6,1 5,9 5,8 5,8
2009,37625570776 2009 137 8 4,5 5,2 6,0 6,6 7,2
2009,37648401826 2009 137 10 9,6 9,0 8,4 8,4 9,1
2009,37671232877 2009 137 12 11,4 11,7 12,4 13,4 14,4
2009,37694063927 2009 137 14 12,4 13,1 14,2 15,4 16,7
2009,37785388128 2009 137 22 15,5 14,0 12,2 10,3 8,7
2009,37808219178 2009 138 0 6,2 5,7 5,4 5,4 5,7
How can I do that without sorting?
Tags: command-line, text-processing, duplicate, uniq
edited Aug 22 at 6:16 by muru
asked Aug 22 at 4:19 by Suat Yazıcı
Just out of curiosity, why do you not want to use sort for this? The whole thing can be accomplished in a single line: sort input-file | uniq -w 16 > output-file. Again, this is only curiosity. – Terrance, Aug 22 at 13:29
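As a side note on that comment (an editorial sketch, not part of the original thread): -w is a GNU uniq extension, and 16 happens to be the width of the first column in this data.

# -w 16 makes uniq compare only the first 16 characters of each line,
# i.e. the first column (e.g. "2009,37260273973" is 16 characters).
# Caveat: uniq keeps the FIRST line of each run of duplicates, so this
# keeps the upper duplicate, while the question asks to keep the lower.
sort input-file | uniq -w 16 > output-file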
1 Answer

accepted, 8 votes
To remove duplicates based on a single column, you can use awk:

awk '!seen[$1]++' input-file > output-file

You can see an explanation of this idiom in this Unix & Linux post.
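For readers unfamiliar with the idiom, !seen[$1]++ expands to roughly the following (an illustrative sketch, not the author's wording):

awk '{
    # seen[$1] counts how many times this first field has occurred so far
    # (awk arrays spring into existence with a zero/empty value).
    # Print the line only on the first occurrence of its first field,
    # then increment the count.
    if (seen[$1] == 0) print
    seen[$1]++
}' input-file > output-file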
Removing the older lines is more complicated. Given that duplicates always come together, you can do:

awk 'prev && ($1 != prev) { print seen[prev] } { seen[$1] = $0; prev = $1 } END { print seen[$1] }' input-file > output-file
Here, in the middle block, seen[$1] = $0 saves the current line ($0) in the seen array, indexed by the first field ($1), and then saves the first field in the prev variable. That prev is used in the first block when processing the next line.

In the first block, then, we check whether prev is set (only true from the second line onwards) and whether it differs from the current first field (prev was set while processing the previous line). If it differs, we have moved past a group of duplicates and can print the saved line for the previous group. At the END, we do the same for the last group.
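For reference, here is the same command with the blocks laid out and commented (a restatement, functionally identical to the one-liner above):

awk '
    # First field changed: the previous group is complete, so emit the
    # last line that was saved for it (i.e. the lower duplicate).
    prev && ($1 != prev) { print seen[prev] }

    # Remember the current line, keyed by its first field, and record
    # the field for the comparison on the next input line.
    { seen[$1] = $0; prev = $1 }

    # Emit the last line of the final group.
    END { print seen[$1] }
' input-file > output-file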
edited Aug 22 at 8:15 · answered Aug 22 at 6:26 by muru

This code removed the lower duplicates. I need to remove the upper duplicates, not the lower ones. – Suat Yazıcı, Aug 22 at 6:47
@SuatYazıcı see update. – muru, Aug 22 at 7:13
If removing lower duplicates is easier, a possible "solution" for removing upper duplicates is tac | awk [remove lower duplicates] | tac :-) – egmont, Aug 22 at 7:14
@muru awk: line 1: syntax error at or near } – Suat Yazıcı, Aug 22 at 7:37
@egmont As far as I can see, the tac command is like cat in reverse, but I haven't managed to get it to work yet. – Suat Yazıcı, Aug 22 at 7:38
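Spelled out, the pipeline egmont suggests would look like this (a sketch; tac is part of GNU coreutils):

# tac reverses the line order, so the dedup step keeps what was
# originally the LAST line of each duplicate group; the second tac
# restores the original order.
tac input-file | awk '!seen[$1]++' | tac > output-file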