Piping Multiple Numbers into Sedgrep piping into sed, replacing inline; but I want sed to print the filename...

Does しかたない imply disappointment?

Are one-line email responses considered disrespectful?

What does @ mean in a hostname in DNS configuration?

How do I fight with Heavy Armor as a Wizard with Tenser's Transformation?

Do the speed limit reductions due to pollution also apply to electric cars in France?

What am I doing wrong in this use of split?

Is there a Tezos equivalent of ERC?

Can an attached stirge deal no more than 10 HP of damage before it detaches?

Will the duration of traveling to Ceres using the same tech developed for going to Mars be proportional to the distance to go to Mars or not?

Identical projects by students at two different colleges: still plagiarism?

Protagonist constantly has to have long words explained to her. Will this get tedious?

In a post apocalypse world, with no power and few survivors, would Satnav still work?

Is there any danger of my neighbor having my wife's signature?

Expression for "unconsciously using words (or accents) used by a person you often talk with or listen to"?

Is there a way to pause a running process on Linux systems and resume later?

How to know you are over-explaining and oversimplifying a subject?

Buying a "Used" Router

How many copper coins fit inside a cubic foot?

Why is it that Bernie Sanders always called a "socialist"?

I am a loser when it comes to jobs, what possibilities do I have?

Is it possible to detect 100% of SQLi with a simple regex?

When distributing a Linux kernel driver as source code, what's the difference between Proprietary and GPL license?

When does a person lose diplomatic status?

Boss asked me to sign a resignation paper without a date on it along with my new contract



Piping Multiple Numbers into Sed


grep piping into sed, replacing inline; but I want sed to print the filename and changed line. Is it possible?Piping content with multiple spacesHow to split long sed expression into multiple lines?Piping multiple lines within bash scriptPiping sed/grep output does not workPiping multiple inputs into Ghostscriptpiping grep regex into sed commandPiping PIDs into kill doesn't workPiping find results into another commandPython command history when piping into tee













1















Is it possible to pipe a list of numbers into sed?



The pipeline I currently have looks something like:



grep -nP 'foo' file_full.txt | sort | awk -F'[:;t]' '{print $1,$3,$9,$13}' 


The output of this:



2374 213 MID=212 GO=1
2462 213 MID=477 GO=137
2394 233 MID=232 GO=1
2464 233 MID=668 GO=1070
2185 24 MID=23 GO=1
2465 24 MID=752 GO=1083
2146 48 MID=354 GO=1010
1893 48 MID=47 GO=1
2219 58 MID=57 GO=1
2463 58 MID=595 GO=1057


My main goal here is to look at $2 for identical values then compare their $4 or GO values. The line with the larger GO value needs to be deleted.



When I add: sed 's/GO=/& /' | sort -k2,2 -k5n | awk 'a[$2]++ {sub(/GO= /,"GO="); print $1}'to the previous pipeline I get:



2462
2464
2465
2146
2463


Which is a list of line numbers I wish to remove from file_full.txt



I know sed -i '2462d;2464d;2465d;2146d;2463d' file_full.txt would work but I am unsure how to pipe each of those numbers from above into one sed command.



Am I missing something?





Raw data:



#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  i0      i1      i2      i3      i4      i5      i6      i7      i8      i9      i10     i11     i12     i13     i14
1 1 . A T 1000 PASS MID=0;S=0.0324764;DOM=0.5;PO=1;GO=1;MT=0;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 2 . A T 1000 PASS MID=1;S=0.0125739;DOM=0.5;PO=1;GO=1;MT=5;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 3 . A T 1000 PASS MID=2;S=-0.0693919;DOM=0.5;PO=1;GO=1;MT=9;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 4 . A T 1000 PASS MID=3;S=0.0611535;DOM=0.5;PO=1;GO=1;MT=12;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 5 . A T 1000 PASS MID=4;S=-0.0791182;DOM=0.5;PO=1;GO=1;MT=16;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 6 . A T 1000 PASS MID=5;S=0.0463103;DOM=0.5;PO=1;GO=1;MT=21;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 7 . A T 1000 PASS MID=6;S=0.0509527;DOM=0.5;PO=1;GO=1;MT=25;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 8 . A T 1000 PASS MID=7;S=-0.0134404;DOM=0.5;PO=1;GO=1;MT=28;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 9 . A T 1000 PASS MID=8;S=-0.00478324;DOM=0.5;PO=1;GO=1;MT=32;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 10 . A T 1000 PASS MID=9;S=0.03588;DOM=0.5;PO=1;GO=1;MT=36;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 11 . A T 1000 PASS MID=10;S=-0.028843;DOM=0.5;PO=1;GO=1;MT=41;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 12 . A T 1000 PASS MID=11;S=-0.0832497;DOM=0.5;PO=1;GO=1;MT=45;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 13 . A T 1000 PASS MID=12;S=0.0389281;DOM=0.5;PO=1;GO=1;MT=48;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 14 . A T 1000 PASS MID=13;S=0.0362106;DOM=0.5;PO=1;GO=1;MT=53;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 15 . A T 1000 PASS MID=14;S=0.0375309;DOM=0.5;PO=1;GO=1;MT=57;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 16 . A T 1000 PASS MID=15;S=0.0112808;DOM=0.5;PO=1;GO=1;MT=60;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 17 . A T 1000 PASS MID=16;S=0.0243286;DOM=0.5;PO=1;GO=1;MT=65;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 18 . A T 1000 PASS MID=17;S=0.0596463;DOM=0.5;PO=1;GO=1;MT=69;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 24 . A T 1000 PASS MID=23;S=-0.0086571;DOM=0.5;PO=1;GO=1;MT=92;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 58 . A T 1000 PASS MID=57;S=-0.0926969;DOM=0.5;PO=1;GO=1;MT=229;AC=198;DP=1000;MULTIALLELIC GT 1|1 0|1 1|1 1|1 0|1 1|1
1 213 . A T 1000 PASS MID=212;S=-0.0925562;DOM=0.5;PO=1;GO=1;MT=848;AC=196;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1
1 233 . A T 1000 PASS MID=232;S=-0.0868037;DOM=0.5;PO=1;GO=1;MT=929;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1
1 213 . A T 1000 PASS MID=477;S=0.0600971;DOM=0.5;PO=1;GO=1037;MT=849;AC=4;DP=1000;MULTIALLELIC GT 0|0 0|0 0|0 0|0 0|0 0|0
1 58 . A T 1000 PASS MID=595;S=0.0450203;DOM=0.5;PO=1;GO=1057;MT=228;AC=2;DP=1000;MULTIALLELIC GT 0|0 1|0 0|0 0|0 1|0 0|0
1 233 . A T 1000 PASS MID=668;S=-0.0447337;DOM=0.5;PO=1;GO=1070;MT=928;AC=1;DP=1000;MULTIALLELIC GT 0|0 0|0 0|0 0|0 0|0 0|0
1 24 . A T 1000 PASS MID=752;S=-0.104791;DOM=0.5;PO=1;GO=1083;MT=93;AC=1;DP=1000;MULTIALLELIC GT 0|0 0|0 0|0 0|0 0|0 0|0









share|improve this question




















  • 2





    You may not want to read from the file while sed is trying to edit the file...

    – Jeff Schaller
    1 hour ago











  • Added actual awk and sed commands to my question, hopefully it is makes more sense.

    – AMB
    1 hour ago






  • 1





    @AMB: It makes more sense now thanks

    – Jesse_b
    1 hour ago






  • 1





    Please edit your question and show us the original data. There may well be easier ways to get to this point or even to do the whole thing in one go.

    – terdon
    1 hour ago






  • 2





    Ah! A vcf! I thought these looked familiar! While this is completely on topic here, you might want to consider asking on Bioinformatics in future since the're probably nifty ways of doing this with domain-specific tools like bcftools etc. Also, do you want the entire line of the vcf file? What output are you expecting?

    – terdon
    44 mins ago


















1















Is it possible to pipe a list of numbers into sed?



The pipeline I currently have looks something like:



grep -nP 'foo' file_full.txt | sort | awk -F'[:;t]' '{print $1,$3,$9,$13}' 


The output of this:



2374 213 MID=212 GO=1
2462 213 MID=477 GO=137
2394 233 MID=232 GO=1
2464 233 MID=668 GO=1070
2185 24 MID=23 GO=1
2465 24 MID=752 GO=1083
2146 48 MID=354 GO=1010
1893 48 MID=47 GO=1
2219 58 MID=57 GO=1
2463 58 MID=595 GO=1057


My main goal here is to look at $2 for identical values then compare their $4 or GO values. The line with the larger GO value needs to be deleted.



When I add: sed 's/GO=/& /' | sort -k2,2 -k5n | awk 'a[$2]++ {sub(/GO= /,"GO="); print $1}'to the previous pipeline I get:



2462
2464
2465
2146
2463


Which is a list of line numbers I wish to remove from file_full.txt



I know sed -i '2462d;2464d;2465d;2146d;2463d' file_full.txt would work but I am unsure how to pipe each of those numbers from above into one sed command.



Am I missing something?





Raw data:



#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  i0      i1      i2      i3      i4      i5      i6      i7      i8      i9      i10     i11     i12     i13     i14
1 1 . A T 1000 PASS MID=0;S=0.0324764;DOM=0.5;PO=1;GO=1;MT=0;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 2 . A T 1000 PASS MID=1;S=0.0125739;DOM=0.5;PO=1;GO=1;MT=5;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 3 . A T 1000 PASS MID=2;S=-0.0693919;DOM=0.5;PO=1;GO=1;MT=9;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 4 . A T 1000 PASS MID=3;S=0.0611535;DOM=0.5;PO=1;GO=1;MT=12;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 5 . A T 1000 PASS MID=4;S=-0.0791182;DOM=0.5;PO=1;GO=1;MT=16;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 6 . A T 1000 PASS MID=5;S=0.0463103;DOM=0.5;PO=1;GO=1;MT=21;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 7 . A T 1000 PASS MID=6;S=0.0509527;DOM=0.5;PO=1;GO=1;MT=25;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 8 . A T 1000 PASS MID=7;S=-0.0134404;DOM=0.5;PO=1;GO=1;MT=28;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 9 . A T 1000 PASS MID=8;S=-0.00478324;DOM=0.5;PO=1;GO=1;MT=32;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 10 . A T 1000 PASS MID=9;S=0.03588;DOM=0.5;PO=1;GO=1;MT=36;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 11 . A T 1000 PASS MID=10;S=-0.028843;DOM=0.5;PO=1;GO=1;MT=41;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 12 . A T 1000 PASS MID=11;S=-0.0832497;DOM=0.5;PO=1;GO=1;MT=45;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 13 . A T 1000 PASS MID=12;S=0.0389281;DOM=0.5;PO=1;GO=1;MT=48;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 14 . A T 1000 PASS MID=13;S=0.0362106;DOM=0.5;PO=1;GO=1;MT=53;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 15 . A T 1000 PASS MID=14;S=0.0375309;DOM=0.5;PO=1;GO=1;MT=57;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 16 . A T 1000 PASS MID=15;S=0.0112808;DOM=0.5;PO=1;GO=1;MT=60;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 17 . A T 1000 PASS MID=16;S=0.0243286;DOM=0.5;PO=1;GO=1;MT=65;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 18 . A T 1000 PASS MID=17;S=0.0596463;DOM=0.5;PO=1;GO=1;MT=69;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 24 . A T 1000 PASS MID=23;S=-0.0086571;DOM=0.5;PO=1;GO=1;MT=92;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 58 . A T 1000 PASS MID=57;S=-0.0926969;DOM=0.5;PO=1;GO=1;MT=229;AC=198;DP=1000;MULTIALLELIC GT 1|1 0|1 1|1 1|1 0|1 1|1
1 213 . A T 1000 PASS MID=212;S=-0.0925562;DOM=0.5;PO=1;GO=1;MT=848;AC=196;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1
1 233 . A T 1000 PASS MID=232;S=-0.0868037;DOM=0.5;PO=1;GO=1;MT=929;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1
1 213 . A T 1000 PASS MID=477;S=0.0600971;DOM=0.5;PO=1;GO=1037;MT=849;AC=4;DP=1000;MULTIALLELIC GT 0|0 0|0 0|0 0|0 0|0 0|0
1 58 . A T 1000 PASS MID=595;S=0.0450203;DOM=0.5;PO=1;GO=1057;MT=228;AC=2;DP=1000;MULTIALLELIC GT 0|0 1|0 0|0 0|0 1|0 0|0
1 233 . A T 1000 PASS MID=668;S=-0.0447337;DOM=0.5;PO=1;GO=1070;MT=928;AC=1;DP=1000;MULTIALLELIC GT 0|0 0|0 0|0 0|0 0|0 0|0
1 24 . A T 1000 PASS MID=752;S=-0.104791;DOM=0.5;PO=1;GO=1083;MT=93;AC=1;DP=1000;MULTIALLELIC GT 0|0 0|0 0|0 0|0 0|0 0|0









share|improve this question




















  • 2





    You may not want to read from the file while sed is trying to edit the file...

    – Jeff Schaller
    1 hour ago











  • Added actual awk and sed commands to my question, hopefully it is makes more sense.

    – AMB
    1 hour ago






  • 1





    @AMB: It makes more sense now thanks

    – Jesse_b
    1 hour ago






  • 1





    Please edit your question and show us the original data. There may well be easier ways to get to this point or even to do the whole thing in one go.

    – terdon
    1 hour ago






  • 2





    Ah! A vcf! I thought these looked familiar! While this is completely on topic here, you might want to consider asking on Bioinformatics in future since the're probably nifty ways of doing this with domain-specific tools like bcftools etc. Also, do you want the entire line of the vcf file? What output are you expecting?

    – terdon
    44 mins ago
















1












1








1


1






Is it possible to pipe a list of numbers into sed?



The pipeline I currently have looks something like:



grep -nP 'foo' file_full.txt | sort | awk -F'[:;t]' '{print $1,$3,$9,$13}' 


The output of this:



2374 213 MID=212 GO=1
2462 213 MID=477 GO=137
2394 233 MID=232 GO=1
2464 233 MID=668 GO=1070
2185 24 MID=23 GO=1
2465 24 MID=752 GO=1083
2146 48 MID=354 GO=1010
1893 48 MID=47 GO=1
2219 58 MID=57 GO=1
2463 58 MID=595 GO=1057


My main goal here is to look at $2 for identical values then compare their $4 or GO values. The line with the larger GO value needs to be deleted.



When I add: sed 's/GO=/& /' | sort -k2,2 -k5n | awk 'a[$2]++ {sub(/GO= /,"GO="); print $1}'to the previous pipeline I get:



2462
2464
2465
2146
2463


Which is a list of line numbers I wish to remove from file_full.txt



I know sed -i '2462d;2464d;2465d;2146d;2463d' file_full.txt would work but I am unsure how to pipe each of those numbers from above into one sed command.



Am I missing something?





Raw data:



#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  i0      i1      i2      i3      i4      i5      i6      i7      i8      i9      i10     i11     i12     i13     i14
1 1 . A T 1000 PASS MID=0;S=0.0324764;DOM=0.5;PO=1;GO=1;MT=0;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 2 . A T 1000 PASS MID=1;S=0.0125739;DOM=0.5;PO=1;GO=1;MT=5;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 3 . A T 1000 PASS MID=2;S=-0.0693919;DOM=0.5;PO=1;GO=1;MT=9;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 4 . A T 1000 PASS MID=3;S=0.0611535;DOM=0.5;PO=1;GO=1;MT=12;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 5 . A T 1000 PASS MID=4;S=-0.0791182;DOM=0.5;PO=1;GO=1;MT=16;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 6 . A T 1000 PASS MID=5;S=0.0463103;DOM=0.5;PO=1;GO=1;MT=21;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 7 . A T 1000 PASS MID=6;S=0.0509527;DOM=0.5;PO=1;GO=1;MT=25;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 8 . A T 1000 PASS MID=7;S=-0.0134404;DOM=0.5;PO=1;GO=1;MT=28;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 9 . A T 1000 PASS MID=8;S=-0.00478324;DOM=0.5;PO=1;GO=1;MT=32;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 10 . A T 1000 PASS MID=9;S=0.03588;DOM=0.5;PO=1;GO=1;MT=36;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 11 . A T 1000 PASS MID=10;S=-0.028843;DOM=0.5;PO=1;GO=1;MT=41;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 12 . A T 1000 PASS MID=11;S=-0.0832497;DOM=0.5;PO=1;GO=1;MT=45;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 13 . A T 1000 PASS MID=12;S=0.0389281;DOM=0.5;PO=1;GO=1;MT=48;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 14 . A T 1000 PASS MID=13;S=0.0362106;DOM=0.5;PO=1;GO=1;MT=53;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 15 . A T 1000 PASS MID=14;S=0.0375309;DOM=0.5;PO=1;GO=1;MT=57;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 16 . A T 1000 PASS MID=15;S=0.0112808;DOM=0.5;PO=1;GO=1;MT=60;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 17 . A T 1000 PASS MID=16;S=0.0243286;DOM=0.5;PO=1;GO=1;MT=65;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 18 . A T 1000 PASS MID=17;S=0.0596463;DOM=0.5;PO=1;GO=1;MT=69;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 24 . A T 1000 PASS MID=23;S=-0.0086571;DOM=0.5;PO=1;GO=1;MT=92;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 58 . A T 1000 PASS MID=57;S=-0.0926969;DOM=0.5;PO=1;GO=1;MT=229;AC=198;DP=1000;MULTIALLELIC GT 1|1 0|1 1|1 1|1 0|1 1|1
1 213 . A T 1000 PASS MID=212;S=-0.0925562;DOM=0.5;PO=1;GO=1;MT=848;AC=196;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1
1 233 . A T 1000 PASS MID=232;S=-0.0868037;DOM=0.5;PO=1;GO=1;MT=929;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1
1 213 . A T 1000 PASS MID=477;S=0.0600971;DOM=0.5;PO=1;GO=1037;MT=849;AC=4;DP=1000;MULTIALLELIC GT 0|0 0|0 0|0 0|0 0|0 0|0
1 58 . A T 1000 PASS MID=595;S=0.0450203;DOM=0.5;PO=1;GO=1057;MT=228;AC=2;DP=1000;MULTIALLELIC GT 0|0 1|0 0|0 0|0 1|0 0|0
1 233 . A T 1000 PASS MID=668;S=-0.0447337;DOM=0.5;PO=1;GO=1070;MT=928;AC=1;DP=1000;MULTIALLELIC GT 0|0 0|0 0|0 0|0 0|0 0|0
1 24 . A T 1000 PASS MID=752;S=-0.104791;DOM=0.5;PO=1;GO=1083;MT=93;AC=1;DP=1000;MULTIALLELIC GT 0|0 0|0 0|0 0|0 0|0 0|0









share|improve this question
















Is it possible to pipe a list of numbers into sed?



The pipeline I currently have looks something like:



grep -nP 'foo' file_full.txt | sort | awk -F'[:;t]' '{print $1,$3,$9,$13}' 


The output of this:



2374 213 MID=212 GO=1
2462 213 MID=477 GO=137
2394 233 MID=232 GO=1
2464 233 MID=668 GO=1070
2185 24 MID=23 GO=1
2465 24 MID=752 GO=1083
2146 48 MID=354 GO=1010
1893 48 MID=47 GO=1
2219 58 MID=57 GO=1
2463 58 MID=595 GO=1057


My main goal here is to look at $2 for identical values then compare their $4 or GO values. The line with the larger GO value needs to be deleted.



When I add: sed 's/GO=/& /' | sort -k2,2 -k5n | awk 'a[$2]++ {sub(/GO= /,"GO="); print $1}'to the previous pipeline I get:



2462
2464
2465
2146
2463


Which is a list of line numbers I wish to remove from file_full.txt



I know sed -i '2462d;2464d;2465d;2146d;2463d' file_full.txt would work but I am unsure how to pipe each of those numbers from above into one sed command.



Am I missing something?





Raw data:



#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  i0      i1      i2      i3      i4      i5      i6      i7      i8      i9      i10     i11     i12     i13     i14
1 1 . A T 1000 PASS MID=0;S=0.0324764;DOM=0.5;PO=1;GO=1;MT=0;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 2 . A T 1000 PASS MID=1;S=0.0125739;DOM=0.5;PO=1;GO=1;MT=5;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 3 . A T 1000 PASS MID=2;S=-0.0693919;DOM=0.5;PO=1;GO=1;MT=9;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 4 . A T 1000 PASS MID=3;S=0.0611535;DOM=0.5;PO=1;GO=1;MT=12;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 5 . A T 1000 PASS MID=4;S=-0.0791182;DOM=0.5;PO=1;GO=1;MT=16;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 6 . A T 1000 PASS MID=5;S=0.0463103;DOM=0.5;PO=1;GO=1;MT=21;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 7 . A T 1000 PASS MID=6;S=0.0509527;DOM=0.5;PO=1;GO=1;MT=25;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 8 . A T 1000 PASS MID=7;S=-0.0134404;DOM=0.5;PO=1;GO=1;MT=28;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 9 . A T 1000 PASS MID=8;S=-0.00478324;DOM=0.5;PO=1;GO=1;MT=32;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 10 . A T 1000 PASS MID=9;S=0.03588;DOM=0.5;PO=1;GO=1;MT=36;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 11 . A T 1000 PASS MID=10;S=-0.028843;DOM=0.5;PO=1;GO=1;MT=41;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 12 . A T 1000 PASS MID=11;S=-0.0832497;DOM=0.5;PO=1;GO=1;MT=45;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 13 . A T 1000 PASS MID=12;S=0.0389281;DOM=0.5;PO=1;GO=1;MT=48;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 14 . A T 1000 PASS MID=13;S=0.0362106;DOM=0.5;PO=1;GO=1;MT=53;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 15 . A T 1000 PASS MID=14;S=0.0375309;DOM=0.5;PO=1;GO=1;MT=57;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 16 . A T 1000 PASS MID=15;S=0.0112808;DOM=0.5;PO=1;GO=1;MT=60;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 17 . A T 1000 PASS MID=16;S=0.0243286;DOM=0.5;PO=1;GO=1;MT=65;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 18 . A T 1000 PASS MID=17;S=0.0596463;DOM=0.5;PO=1;GO=1;MT=69;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 24 . A T 1000 PASS MID=23;S=-0.0086571;DOM=0.5;PO=1;GO=1;MT=92;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 58 . A T 1000 PASS MID=57;S=-0.0926969;DOM=0.5;PO=1;GO=1;MT=229;AC=198;DP=1000;MULTIALLELIC GT 1|1 0|1 1|1 1|1 0|1 1|1
1 213 . A T 1000 PASS MID=212;S=-0.0925562;DOM=0.5;PO=1;GO=1;MT=848;AC=196;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1
1 233 . A T 1000 PASS MID=232;S=-0.0868037;DOM=0.5;PO=1;GO=1;MT=929;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1
1 213 . A T 1000 PASS MID=477;S=0.0600971;DOM=0.5;PO=1;GO=1037;MT=849;AC=4;DP=1000;MULTIALLELIC GT 0|0 0|0 0|0 0|0 0|0 0|0
1 58 . A T 1000 PASS MID=595;S=0.0450203;DOM=0.5;PO=1;GO=1057;MT=228;AC=2;DP=1000;MULTIALLELIC GT 0|0 1|0 0|0 0|0 1|0 0|0
1 233 . A T 1000 PASS MID=668;S=-0.0447337;DOM=0.5;PO=1;GO=1070;MT=928;AC=1;DP=1000;MULTIALLELIC GT 0|0 0|0 0|0 0|0 0|0 0|0
1 24 . A T 1000 PASS MID=752;S=-0.104791;DOM=0.5;PO=1;GO=1083;MT=93;AC=1;DP=1000;MULTIALLELIC GT 0|0 0|0 0|0 0|0 0|0 0|0






sed pipe






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited 48 mins ago







AMB

















asked 1 hour ago









AMBAMB

204




204








  • 2





    You may not want to read from the file while sed is trying to edit the file...

    – Jeff Schaller
    1 hour ago











  • Added actual awk and sed commands to my question, hopefully it is makes more sense.

    – AMB
    1 hour ago






  • 1





    @AMB: It makes more sense now thanks

    – Jesse_b
    1 hour ago






  • 1





    Please edit your question and show us the original data. There may well be easier ways to get to this point or even to do the whole thing in one go.

    – terdon
    1 hour ago






  • 2





    Ah! A vcf! I thought these looked familiar! While this is completely on topic here, you might want to consider asking on Bioinformatics in future since the're probably nifty ways of doing this with domain-specific tools like bcftools etc. Also, do you want the entire line of the vcf file? What output are you expecting?

    – terdon
    44 mins ago
















  • 2





    You may not want to read from the file while sed is trying to edit the file...

    – Jeff Schaller
    1 hour ago











  • Added actual awk and sed commands to my question, hopefully it is makes more sense.

    – AMB
    1 hour ago






  • 1





    @AMB: It makes more sense now thanks

    – Jesse_b
    1 hour ago






  • 1





    Please edit your question and show us the original data. There may well be easier ways to get to this point or even to do the whole thing in one go.

    – terdon
    1 hour ago






  • 2





    Ah! A vcf! I thought these looked familiar! While this is completely on topic here, you might want to consider asking on Bioinformatics in future since the're probably nifty ways of doing this with domain-specific tools like bcftools etc. Also, do you want the entire line of the vcf file? What output are you expecting?

    – terdon
    44 mins ago










2




2





You may not want to read from the file while sed is trying to edit the file...

– Jeff Schaller
1 hour ago





You may not want to read from the file while sed is trying to edit the file...

– Jeff Schaller
1 hour ago













Added actual awk and sed commands to my question, hopefully it is makes more sense.

– AMB
1 hour ago





Added actual awk and sed commands to my question, hopefully it is makes more sense.

– AMB
1 hour ago




1




1





@AMB: It makes more sense now thanks

– Jesse_b
1 hour ago





@AMB: It makes more sense now thanks

– Jesse_b
1 hour ago




1




1





Please edit your question and show us the original data. There may well be easier ways to get to this point or even to do the whole thing in one go.

– terdon
1 hour ago





Please edit your question and show us the original data. There may well be easier ways to get to this point or even to do the whole thing in one go.

– terdon
1 hour ago




2




2





Ah! A vcf! I thought these looked familiar! While this is completely on topic here, you might want to consider asking on Bioinformatics in future since the're probably nifty ways of doing this with domain-specific tools like bcftools etc. Also, do you want the entire line of the vcf file? What output are you expecting?

– terdon
44 mins ago







Ah! A vcf! I thought these looked familiar! While this is completely on topic here, you might want to consider asking on Bioinformatics in future since the're probably nifty ways of doing this with domain-specific tools like bcftools etc. Also, do you want the entire line of the vcf file? What output are you expecting?

– terdon
44 mins ago












2 Answers
2






active

oldest

votes


















3














You haven't shown the output you expect, but if I understood you correctly, you're looking for something like this (file has the data from your question):



$ sort -t= -k3 -rn file | awk '{a[$2]=$0}END{for(i in a){print a[i]}}'
2185 24 MID=23 GO=1
1893 48 MID=47 GO=1
2219 58 MID=57 GO=1
2374 213 MID=212 GO=1
2394 233 MID=232 GO=1


The idea is to first sort the input on the value of the GO. The -t= sets sort's field separator to =, which makes the number after GO the 3rd field. We sort on that, in reverse numerical order so that the larger numbers come first. Then, the awk will save each line as a value in the array a whose keys are the second fields. Since the file is sorted by the GO value, this means we'll always keep the greatest value for each $2. Then, at the end of the file we print the array.





Alternatively, you can do the whole thing directly from the original file:



$ awk -F'[t=;]' '/^[^#]/{
if(!a[$1$2] || a[$1$2]>$17){
line[$1$2]=$0;
a[$1$2]=$17
}
}
END{
for(i in a){
print line[i]
}
}' file.vcf
1 1 . A T 1000 PASS MID=0;S=0.0324764;DOM=0.5;PO=1;GO=1;MT=0;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 2 . A T 1000 PASS MID=1;S=0.0125739;DOM=0.5;PO=1;GO=1;MT=5;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 3 . A T 1000 PASS MID=2;S=-0.0693919;DOM=0.5;PO=1;GO=1;MT=9;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 4 . A T 1000 PASS MID=3;S=0.0611535;DOM=0.5;PO=1;GO=1;MT=12;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 5 . A T 1000 PASS MID=4;S=-0.0791182;DOM=0.5;PO=1;GO=1;MT=16;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 6 . A T 1000 PASS MID=5;S=0.0463103;DOM=0.5;PO=1;GO=1;MT=21;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 7 . A T 1000 PASS MID=6;S=0.0509527;DOM=0.5;PO=1;GO=1;MT=25;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 8 . A T 1000 PASS MID=7;S=-0.0134404;DOM=0.5;PO=1;GO=1;MT=28;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 9 . A T 1000 PASS MID=8;S=-0.00478324;DOM=0.5;PO=1;GO=1;MT=32;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 10 . A T 1000 PASS MID=9;S=0.03588;DOM=0.5;PO=1;GO=1;MT=36;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 11 . A T 1000 PASS MID=10;S=-0.028843;DOM=0.5;PO=1;GO=1;MT=41;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 12 . A T 1000 PASS MID=11;S=-0.0832497;DOM=0.5;PO=1;GO=1;MT=45;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 13 . A T 1000 PASS MID=12;S=0.0389281;DOM=0.5;PO=1;GO=1;MT=48;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 14 . A T 1000 PASS MID=13;S=0.0362106;DOM=0.5;PO=1;GO=1;MT=53;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 15 . A T 1000 PASS MID=14;S=0.0375309;DOM=0.5;PO=1;GO=1;MT=57;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 16 . A T 1000 PASS MID=15;S=0.0112808;DOM=0.5;PO=1;GO=1;MT=60;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 17 . A T 1000 PASS MID=16;S=0.0243286;DOM=0.5;PO=1;GO=1;MT=65;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 18 . A T 1000 PASS MID=17;S=0.0596463;DOM=0.5;PO=1;GO=1;MT=69;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 24 . A T 1000 PASS MID=23;S=-0.0086571;DOM=0.5;PO=1;GO=1;MT=92;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1
1 58 . A T 1000 PASS MID=57;S=-0.0926969;DOM=0.5;PO=1;GO=1;MT=229;AC=198;DP=1000;MULTIALLELIC GT 1|1 0|1 1|1 1|1 0|1 1|1
1 213 . A T 1000 PASS MID=212;S=-0.0925562;DOM=0.5;PO=1;GO=1;MT=848;AC=196;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1
1 233 . A T 1000 PASS MID=232;S=-0.0868037;DOM=0.5;PO=1;GO=1;MT=929;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1


Here the trick is the -F'[t=;]' which sets awk's input field separator to any of t, =, or ;. That makes the GO value field 17 (assuming all of your lines have the same structure, if they don't, please ask on Bioinformatics since that would be better handled with other tools). The rest means:





  • /^[^#]/{ } : do this on lines that don't start with a #.


  • if(a[$1$2]<$17){ : if the value stored in the array a for the first and second fields (chromosome and position) is smaller than the 17th field of this line


  • line[$1$2]=$0; : save this line as the value for the combination of fields 1 and 2 in the array line.


  • a[$1$2]=$17 : save the 17th field as the value for the combination of first and second field in the array a.


  • END{for(i in a){print a[i]}} : just as as above, print the lines.


Now, this will require you to store one entire line for each unique value of the second field. That can be a problem if your file is very large.






share|improve this answer

































    3














    It's a pity we can't see your original data. Given the result of you pipeline in a file called file, you would get the right result though



    $ sort -t ' ' -k2,2 -k4.4n file | sort -u -k2,2
    2374 213 MID=212 GO=1
    2394 233 MID=232 GO=1
    2185 24 MID=23 GO=1
    1893 48 MID=47 GO=1
    2219 58 MID=57 GO=1


    These are the lines in the data with the smallest GO value for each group (where the second field defines the group).



    The first sort sorts the data in such a way that each group's GO values are ordered smallest to largest. That -k4.4n should have a dot in it, not a comma. It specifies the actual value after the = in the fourth field as the secondary sort key.



    The second sort uses only the group number in the second field and does a unique sort. This has the effect of only leaving the first of each unique group in the output.





    Using the raw data in the question:



    sed 's/./;./' file |
    sort -t ';' -k1,1 -k6.4,6n |
    sort -u -t ';' -k1,1 |
    sed 's/;././' |
    sort -k1,1n -k2,2n


    The first two sort calls do the same sort of operation as in the earlier part of this answer.



    The first sed replaces the dot in column 3 with ;.. This is done to let us use ; as the field delimiter properly in the two sort calls.
    The second sed calls restores the original dots.



    The final sort gets the data sorted on chromosomes and position, as from the start.



    This results in



    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  i0      i1      i2      i3      i4      i5      i6      i7      i8      i9      i10     i11     i12     i13     i14
    1 1 . A T 1000 PASS MID=0;S=0.0324764;DOM=0.5;PO=1;GO=1;MT=0;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
    1 2 . A T 1000 PASS MID=1;S=0.0125739;DOM=0.5;PO=1;GO=1;MT=5;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
    1 3 . A T 1000 PASS MID=2;S=-0.0693919;DOM=0.5;PO=1;GO=1;MT=9;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
    1 4 . A T 1000 PASS MID=3;S=0.0611535;DOM=0.5;PO=1;GO=1;MT=12;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
    1 5 . A T 1000 PASS MID=4;S=-0.0791182;DOM=0.5;PO=1;GO=1;MT=16;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
    1 6 . A T 1000 PASS MID=5;S=0.0463103;DOM=0.5;PO=1;GO=1;MT=21;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
    1 7 . A T 1000 PASS MID=6;S=0.0509527;DOM=0.5;PO=1;GO=1;MT=25;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
    1 8 . A T 1000 PASS MID=7;S=-0.0134404;DOM=0.5;PO=1;GO=1;MT=28;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
    1 9 . A T 1000 PASS MID=8;S=-0.00478324;DOM=0.5;PO=1;GO=1;MT=32;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
    1 10 . A T 1000 PASS MID=9;S=0.03588;DOM=0.5;PO=1;GO=1;MT=36;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
    1 11 . A T 1000 PASS MID=10;S=-0.028843;DOM=0.5;PO=1;GO=1;MT=41;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
    1 12 . A T 1000 PASS MID=11;S=-0.0832497;DOM=0.5;PO=1;GO=1;MT=45;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
    1 13 . A T 1000 PASS MID=12;S=0.0389281;DOM=0.5;PO=1;GO=1;MT=48;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
    1 14 . A T 1000 PASS MID=13;S=0.0362106;DOM=0.5;PO=1;GO=1;MT=53;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
    1 15 . A T 1000 PASS MID=14;S=0.0375309;DOM=0.5;PO=1;GO=1;MT=57;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
    1 16 . A T 1000 PASS MID=15;S=0.0112808;DOM=0.5;PO=1;GO=1;MT=60;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
    1 17 . A T 1000 PASS MID=16;S=0.0243286;DOM=0.5;PO=1;GO=1;MT=65;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
    1 18 . A T 1000 PASS MID=17;S=0.0596463;DOM=0.5;PO=1;GO=1;MT=69;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
    1 24 . A T 1000 PASS MID=23;S=-0.0086571;DOM=0.5;PO=1;GO=1;MT=92;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1
    1 58 . A T 1000 PASS MID=57;S=-0.0926969;DOM=0.5;PO=1;GO=1;MT=229;AC=198;DP=1000;MULTIALLELIC GT 1|1 0|1 1|1 1|1 0|1 1|1
    1 213 . A T 1000 PASS MID=212;S=-0.0925562;DOM=0.5;PO=1;GO=1;MT=848;AC=196;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1
    1 233 . A T 1000 PASS MID=232;S=-0.0868037;DOM=0.5;PO=1;GO=1;MT=929;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1




    Extracting line numbers that you'd like to delete in a pipeline is often the wrong way to go about a task like this. Note that each part of a pipeline runs concurrently with every other part of the same pipeline. This means that you can't start overwriting or modifying a file in one part that you are, at the same, time reading from in another part.



    Also note that when extracting data through the stages of a pipeline, you lose the data that you're not passing on. This makes it even harder to allow the one pipeline to modify the original data (because it's lost along the way through the pipeline).






    share|improve this answer

























      Your Answer








      StackExchange.ready(function() {
      var channelOptions = {
      tags: "".split(" "),
      id: "106"
      };
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function() {
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled) {
      StackExchange.using("snippets", function() {
      createEditor();
      });
      }
      else {
      createEditor();
      }
      });

      function createEditor() {
      StackExchange.prepareEditor({
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: false,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      imageUploader: {
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      },
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      });


      }
      });














      draft saved

      draft discarded


















      StackExchange.ready(
      function () {
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f502415%2fpiping-multiple-numbers-into-sed%23new-answer', 'question_page');
      }
      );

      Post as a guest















      Required, but never shown

























      2 Answers
      2






      active

      oldest

      votes








      2 Answers
      2






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      3














      You haven't shown the output you expect, but if I understood you correctly, you're looking for something like this (file has the data from your question):



      $ sort -t= -k3 -rn file | awk '{a[$2]=$0}END{for(i in a){print a[i]}}'
      2185 24 MID=23 GO=1
      1893 48 MID=47 GO=1
      2219 58 MID=57 GO=1
      2374 213 MID=212 GO=1
      2394 233 MID=232 GO=1


      The idea is to first sort the input on the value of the GO. The -t= sets sort's field separator to =, which makes the number after GO the 3rd field. We sort on that, in reverse numerical order so that the larger numbers come first. Then, the awk will save each line as a value in the array a whose keys are the second fields. Since the file is sorted by the GO value, this means we'll always keep the greatest value for each $2. Then, at the end of the file we print the array.





      Alternatively, you can do the whole thing directly from the original file:



      $ awk -F'[t=;]' '/^[^#]/{
      if(!a[$1$2] || a[$1$2]>$17){
      line[$1$2]=$0;
      a[$1$2]=$17
      }
      }
      END{
      for(i in a){
      print line[i]
      }
      }' file.vcf
      1 1 . A T 1000 PASS MID=0;S=0.0324764;DOM=0.5;PO=1;GO=1;MT=0;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
      1 2 . A T 1000 PASS MID=1;S=0.0125739;DOM=0.5;PO=1;GO=1;MT=5;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
      1 3 . A T 1000 PASS MID=2;S=-0.0693919;DOM=0.5;PO=1;GO=1;MT=9;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
      1 4 . A T 1000 PASS MID=3;S=0.0611535;DOM=0.5;PO=1;GO=1;MT=12;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
      1 5 . A T 1000 PASS MID=4;S=-0.0791182;DOM=0.5;PO=1;GO=1;MT=16;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
      1 6 . A T 1000 PASS MID=5;S=0.0463103;DOM=0.5;PO=1;GO=1;MT=21;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
      1 7 . A T 1000 PASS MID=6;S=0.0509527;DOM=0.5;PO=1;GO=1;MT=25;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
      1 8 . A T 1000 PASS MID=7;S=-0.0134404;DOM=0.5;PO=1;GO=1;MT=28;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
      1 9 . A T 1000 PASS MID=8;S=-0.00478324;DOM=0.5;PO=1;GO=1;MT=32;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
      1 10 . A T 1000 PASS MID=9;S=0.03588;DOM=0.5;PO=1;GO=1;MT=36;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
      1 11 . A T 1000 PASS MID=10;S=-0.028843;DOM=0.5;PO=1;GO=1;MT=41;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
      1 12 . A T 1000 PASS MID=11;S=-0.0832497;DOM=0.5;PO=1;GO=1;MT=45;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
      1 13 . A T 1000 PASS MID=12;S=0.0389281;DOM=0.5;PO=1;GO=1;MT=48;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
      1 14 . A T 1000 PASS MID=13;S=0.0362106;DOM=0.5;PO=1;GO=1;MT=53;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
      1 15 . A T 1000 PASS MID=14;S=0.0375309;DOM=0.5;PO=1;GO=1;MT=57;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
      1 16 . A T 1000 PASS MID=15;S=0.0112808;DOM=0.5;PO=1;GO=1;MT=60;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
      1 17 . A T 1000 PASS MID=16;S=0.0243286;DOM=0.5;PO=1;GO=1;MT=65;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
      1 18 . A T 1000 PASS MID=17;S=0.0596463;DOM=0.5;PO=1;GO=1;MT=69;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
      1 24 . A T 1000 PASS MID=23;S=-0.0086571;DOM=0.5;PO=1;GO=1;MT=92;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1
      1 58 . A T 1000 PASS MID=57;S=-0.0926969;DOM=0.5;PO=1;GO=1;MT=229;AC=198;DP=1000;MULTIALLELIC GT 1|1 0|1 1|1 1|1 0|1 1|1
      1 213 . A T 1000 PASS MID=212;S=-0.0925562;DOM=0.5;PO=1;GO=1;MT=848;AC=196;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1
      1 233 . A T 1000 PASS MID=232;S=-0.0868037;DOM=0.5;PO=1;GO=1;MT=929;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1


      Here the trick is the -F'[t=;]' which sets awk's input field separator to any of t, =, or ;. That makes the GO value field 17 (assuming all of your lines have the same structure, if they don't, please ask on Bioinformatics since that would be better handled with other tools). The rest means:





      • /^[^#]/{ } : do this on lines that don't start with a #.


      • if(a[$1$2]<$17){ : if the value stored in the array a for the first and second fields (chromosome and position) is smaller than the 17th field of this line


      • line[$1$2]=$0; : save this line as the value for the combination of fields 1 and 2 in the array line.


      • a[$1$2]=$17 : save the 17th field as the value for the combination of first and second field in the array a.


      • END{for(i in a){print a[i]}} : just as as above, print the lines.


      Now, this will require you to store one entire line for each unique value of the second field. That can be a problem if your file is very large.






      share|improve this answer






























        3














        You haven't shown the output you expect, but if I understood you correctly, you're looking for something like this (file has the data from your question):



        $ sort -t= -k3 -rn file | awk '{a[$2]=$0}END{for(i in a){print a[i]}}'
        2185 24 MID=23 GO=1
        1893 48 MID=47 GO=1
        2219 58 MID=57 GO=1
        2374 213 MID=212 GO=1
        2394 233 MID=232 GO=1


        The idea is to first sort the input on the value of the GO. The -t= sets sort's field separator to =, which makes the number after GO the 3rd field. We sort on that, in reverse numerical order so that the larger numbers come first. Then, the awk will save each line as a value in the array a whose keys are the second fields. Since the file is sorted by the GO value, this means we'll always keep the greatest value for each $2. Then, at the end of the file we print the array.





        Alternatively, you can do the whole thing directly from the original file:



        $ awk -F'[t=;]' '/^[^#]/{
        if(!a[$1$2] || a[$1$2]>$17){
        line[$1$2]=$0;
        a[$1$2]=$17
        }
        }
        END{
        for(i in a){
        print line[i]
        }
        }' file.vcf
        1 1 . A T 1000 PASS MID=0;S=0.0324764;DOM=0.5;PO=1;GO=1;MT=0;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
        1 2 . A T 1000 PASS MID=1;S=0.0125739;DOM=0.5;PO=1;GO=1;MT=5;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
        1 3 . A T 1000 PASS MID=2;S=-0.0693919;DOM=0.5;PO=1;GO=1;MT=9;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
        1 4 . A T 1000 PASS MID=3;S=0.0611535;DOM=0.5;PO=1;GO=1;MT=12;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
        1 5 . A T 1000 PASS MID=4;S=-0.0791182;DOM=0.5;PO=1;GO=1;MT=16;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
        1 6 . A T 1000 PASS MID=5;S=0.0463103;DOM=0.5;PO=1;GO=1;MT=21;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
        1 7 . A T 1000 PASS MID=6;S=0.0509527;DOM=0.5;PO=1;GO=1;MT=25;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
        1 8 . A T 1000 PASS MID=7;S=-0.0134404;DOM=0.5;PO=1;GO=1;MT=28;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
        1 9 . A T 1000 PASS MID=8;S=-0.00478324;DOM=0.5;PO=1;GO=1;MT=32;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
        1 10 . A T 1000 PASS MID=9;S=0.03588;DOM=0.5;PO=1;GO=1;MT=36;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
        1 11 . A T 1000 PASS MID=10;S=-0.028843;DOM=0.5;PO=1;GO=1;MT=41;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
        1 12 . A T 1000 PASS MID=11;S=-0.0832497;DOM=0.5;PO=1;GO=1;MT=45;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
        1 13 . A T 1000 PASS MID=12;S=0.0389281;DOM=0.5;PO=1;GO=1;MT=48;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
        1 14 . A T 1000 PASS MID=13;S=0.0362106;DOM=0.5;PO=1;GO=1;MT=53;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
        1 15 . A T 1000 PASS MID=14;S=0.0375309;DOM=0.5;PO=1;GO=1;MT=57;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
        1 16 . A T 1000 PASS MID=15;S=0.0112808;DOM=0.5;PO=1;GO=1;MT=60;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
        1 17 . A T 1000 PASS MID=16;S=0.0243286;DOM=0.5;PO=1;GO=1;MT=65;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
        1 18 . A T 1000 PASS MID=17;S=0.0596463;DOM=0.5;PO=1;GO=1;MT=69;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
        1 24 . A T 1000 PASS MID=23;S=-0.0086571;DOM=0.5;PO=1;GO=1;MT=92;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1
        1 58 . A T 1000 PASS MID=57;S=-0.0926969;DOM=0.5;PO=1;GO=1;MT=229;AC=198;DP=1000;MULTIALLELIC GT 1|1 0|1 1|1 1|1 0|1 1|1
        1 213 . A T 1000 PASS MID=212;S=-0.0925562;DOM=0.5;PO=1;GO=1;MT=848;AC=196;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1
        1 233 . A T 1000 PASS MID=232;S=-0.0868037;DOM=0.5;PO=1;GO=1;MT=929;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1


        Here the trick is the -F'[t=;]' which sets awk's input field separator to any of t, =, or ;. That makes the GO value field 17 (assuming all of your lines have the same structure, if they don't, please ask on Bioinformatics since that would be better handled with other tools). The rest means:





        • /^[^#]/{ } : do this on lines that don't start with a #.


        • if(a[$1$2]<$17){ : if the value stored in the array a for the first and second fields (chromosome and position) is smaller than the 17th field of this line


        • line[$1$2]=$0; : save this line as the value for the combination of fields 1 and 2 in the array line.


        • a[$1$2]=$17 : save the 17th field as the value for the combination of first and second field in the array a.


        • END{for(i in a){print a[i]}} : just as as above, print the lines.


        Now, this will require you to store one entire line for each unique value of the second field. That can be a problem if your file is very large.






        share|improve this answer




























          3












          3








          3







          You haven't shown the output you expect, but if I understood you correctly, you're looking for something like this (file has the data from your question):



          $ sort -t= -k3 -rn file | awk '{a[$2]=$0}END{for(i in a){print a[i]}}'
          2185 24 MID=23 GO=1
          1893 48 MID=47 GO=1
          2219 58 MID=57 GO=1
          2374 213 MID=212 GO=1
          2394 233 MID=232 GO=1


          The idea is to first sort the input on the value of the GO. The -t= sets sort's field separator to =, which makes the number after GO the 3rd field. We sort on that, in reverse numerical order so that the larger numbers come first. Then, the awk will save each line as a value in the array a whose keys are the second fields. Since the file is sorted by the GO value, this means we'll always keep the greatest value for each $2. Then, at the end of the file we print the array.





          Alternatively, you can do the whole thing directly from the original file:



          $ awk -F'[t=;]' '/^[^#]/{
          if(!a[$1$2] || a[$1$2]>$17){
          line[$1$2]=$0;
          a[$1$2]=$17
          }
          }
          END{
          for(i in a){
          print line[i]
          }
          }' file.vcf
          1 1 . A T 1000 PASS MID=0;S=0.0324764;DOM=0.5;PO=1;GO=1;MT=0;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 2 . A T 1000 PASS MID=1;S=0.0125739;DOM=0.5;PO=1;GO=1;MT=5;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 3 . A T 1000 PASS MID=2;S=-0.0693919;DOM=0.5;PO=1;GO=1;MT=9;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 4 . A T 1000 PASS MID=3;S=0.0611535;DOM=0.5;PO=1;GO=1;MT=12;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 5 . A T 1000 PASS MID=4;S=-0.0791182;DOM=0.5;PO=1;GO=1;MT=16;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 6 . A T 1000 PASS MID=5;S=0.0463103;DOM=0.5;PO=1;GO=1;MT=21;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 7 . A T 1000 PASS MID=6;S=0.0509527;DOM=0.5;PO=1;GO=1;MT=25;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 8 . A T 1000 PASS MID=7;S=-0.0134404;DOM=0.5;PO=1;GO=1;MT=28;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 9 . A T 1000 PASS MID=8;S=-0.00478324;DOM=0.5;PO=1;GO=1;MT=32;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 10 . A T 1000 PASS MID=9;S=0.03588;DOM=0.5;PO=1;GO=1;MT=36;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 11 . A T 1000 PASS MID=10;S=-0.028843;DOM=0.5;PO=1;GO=1;MT=41;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 12 . A T 1000 PASS MID=11;S=-0.0832497;DOM=0.5;PO=1;GO=1;MT=45;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 13 . A T 1000 PASS MID=12;S=0.0389281;DOM=0.5;PO=1;GO=1;MT=48;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 14 . A T 1000 PASS MID=13;S=0.0362106;DOM=0.5;PO=1;GO=1;MT=53;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 15 . A T 1000 PASS MID=14;S=0.0375309;DOM=0.5;PO=1;GO=1;MT=57;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 16 . A T 1000 PASS MID=15;S=0.0112808;DOM=0.5;PO=1;GO=1;MT=60;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 17 . A T 1000 PASS MID=16;S=0.0243286;DOM=0.5;PO=1;GO=1;MT=65;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 18 . A T 1000 PASS MID=17;S=0.0596463;DOM=0.5;PO=1;GO=1;MT=69;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 24 . A T 1000 PASS MID=23;S=-0.0086571;DOM=0.5;PO=1;GO=1;MT=92;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 58 . A T 1000 PASS MID=57;S=-0.0926969;DOM=0.5;PO=1;GO=1;MT=229;AC=198;DP=1000;MULTIALLELIC GT 1|1 0|1 1|1 1|1 0|1 1|1
          1 213 . A T 1000 PASS MID=212;S=-0.0925562;DOM=0.5;PO=1;GO=1;MT=848;AC=196;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1
          1 233 . A T 1000 PASS MID=232;S=-0.0868037;DOM=0.5;PO=1;GO=1;MT=929;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1


          Here the trick is the -F'[t=;]' which sets awk's input field separator to any of t, =, or ;. That makes the GO value field 17 (assuming all of your lines have the same structure, if they don't, please ask on Bioinformatics since that would be better handled with other tools). The rest means:





          • /^[^#]/{ } : do this on lines that don't start with a #.


          • if(a[$1$2]<$17){ : if the value stored in the array a for the first and second fields (chromosome and position) is smaller than the 17th field of this line


          • line[$1$2]=$0; : save this line as the value for the combination of fields 1 and 2 in the array line.


          • a[$1$2]=$17 : save the 17th field as the value for the combination of first and second field in the array a.


          • END{for(i in a){print a[i]}} : just as as above, print the lines.


          Now, this will require you to store one entire line for each unique value of the second field. That can be a problem if your file is very large.






          share|improve this answer















          You haven't shown the output you expect, but if I understood you correctly, you're looking for something like this (file has the data from your question):



          $ sort -t= -k3 -rn file | awk '{a[$2]=$0}END{for(i in a){print a[i]}}'
          2185 24 MID=23 GO=1
          1893 48 MID=47 GO=1
          2219 58 MID=57 GO=1
          2374 213 MID=212 GO=1
          2394 233 MID=232 GO=1


          The idea is to first sort the input on the value of the GO. The -t= sets sort's field separator to =, which makes the number after GO the 3rd field. We sort on that, in reverse numerical order so that the larger numbers come first. Then, the awk will save each line as a value in the array a whose keys are the second fields. Since the file is sorted by the GO value, this means we'll always keep the greatest value for each $2. Then, at the end of the file we print the array.





          Alternatively, you can do the whole thing directly from the original file:



          $ awk -F'[t=;]' '/^[^#]/{
          if(!a[$1$2] || a[$1$2]>$17){
          line[$1$2]=$0;
          a[$1$2]=$17
          }
          }
          END{
          for(i in a){
          print line[i]
          }
          }' file.vcf
          1 1 . A T 1000 PASS MID=0;S=0.0324764;DOM=0.5;PO=1;GO=1;MT=0;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 2 . A T 1000 PASS MID=1;S=0.0125739;DOM=0.5;PO=1;GO=1;MT=5;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 3 . A T 1000 PASS MID=2;S=-0.0693919;DOM=0.5;PO=1;GO=1;MT=9;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 4 . A T 1000 PASS MID=3;S=0.0611535;DOM=0.5;PO=1;GO=1;MT=12;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 5 . A T 1000 PASS MID=4;S=-0.0791182;DOM=0.5;PO=1;GO=1;MT=16;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 6 . A T 1000 PASS MID=5;S=0.0463103;DOM=0.5;PO=1;GO=1;MT=21;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 7 . A T 1000 PASS MID=6;S=0.0509527;DOM=0.5;PO=1;GO=1;MT=25;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 8 . A T 1000 PASS MID=7;S=-0.0134404;DOM=0.5;PO=1;GO=1;MT=28;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 9 . A T 1000 PASS MID=8;S=-0.00478324;DOM=0.5;PO=1;GO=1;MT=32;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 10 . A T 1000 PASS MID=9;S=0.03588;DOM=0.5;PO=1;GO=1;MT=36;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 11 . A T 1000 PASS MID=10;S=-0.028843;DOM=0.5;PO=1;GO=1;MT=41;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 12 . A T 1000 PASS MID=11;S=-0.0832497;DOM=0.5;PO=1;GO=1;MT=45;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 13 . A T 1000 PASS MID=12;S=0.0389281;DOM=0.5;PO=1;GO=1;MT=48;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 14 . A T 1000 PASS MID=13;S=0.0362106;DOM=0.5;PO=1;GO=1;MT=53;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 15 . A T 1000 PASS MID=14;S=0.0375309;DOM=0.5;PO=1;GO=1;MT=57;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 16 . A T 1000 PASS MID=15;S=0.0112808;DOM=0.5;PO=1;GO=1;MT=60;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 17 . A T 1000 PASS MID=16;S=0.0243286;DOM=0.5;PO=1;GO=1;MT=65;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 18 . A T 1000 PASS MID=17;S=0.0596463;DOM=0.5;PO=1;GO=1;MT=69;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 24 . A T 1000 PASS MID=23;S=-0.0086571;DOM=0.5;PO=1;GO=1;MT=92;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1
          1 58 . A T 1000 PASS MID=57;S=-0.0926969;DOM=0.5;PO=1;GO=1;MT=229;AC=198;DP=1000;MULTIALLELIC GT 1|1 0|1 1|1 1|1 0|1 1|1
          1 213 . A T 1000 PASS MID=212;S=-0.0925562;DOM=0.5;PO=1;GO=1;MT=848;AC=196;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1
          1 233 . A T 1000 PASS MID=232;S=-0.0868037;DOM=0.5;PO=1;GO=1;MT=929;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1


          Here the trick is the -F'[t=;]' which sets awk's input field separator to any of t, =, or ;. That makes the GO value field 17 (assuming all of your lines have the same structure, if they don't, please ask on Bioinformatics since that would be better handled with other tools). The rest means:





          • /^[^#]/{ } : do this on lines that don't start with a #.


          • if(a[$1$2]<$17){ : if the value stored in the array a for the first and second fields (chromosome and position) is smaller than the 17th field of this line


          • line[$1$2]=$0; : save this line as the value for the combination of fields 1 and 2 in the array line.


          • a[$1$2]=$17 : save the 17th field as the value for the combination of first and second field in the array a.


          • END{for(i in a){print a[i]}} : just as as above, print the lines.


          Now, this will require you to store one entire line for each unique value of the second field. That can be a problem if your file is very large.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited 21 mins ago

























          answered 49 mins ago









          terdonterdon

          131k32257436




          131k32257436

























              3














              It's a pity we can't see your original data. Given the result of you pipeline in a file called file, you would get the right result though



              $ sort -t ' ' -k2,2 -k4.4n file | sort -u -k2,2
              2374 213 MID=212 GO=1
              2394 233 MID=232 GO=1
              2185 24 MID=23 GO=1
              1893 48 MID=47 GO=1
              2219 58 MID=57 GO=1


              These are the lines in the data with the smallest GO value for each group (where the second field defines the group).



              The first sort sorts the data in such a way that each group's GO values are ordered smallest to largest. That -k4.4n should have a dot in it, not a comma. It specifies the actual value after the = in the fourth field as the secondary sort key.



              The second sort uses only the group number in the second field and does a unique sort. This has the effect of only leaving the first of each unique group in the output.





              Using the raw data in the question:



              sed 's/./;./' file |
              sort -t ';' -k1,1 -k6.4,6n |
              sort -u -t ';' -k1,1 |
              sed 's/;././' |
              sort -k1,1n -k2,2n


              The first two sort calls do the same sort of operation as in the earlier part of this answer.



              The first sed replaces the dot in column 3 with ;.. This is done to let us use ; as the field delimiter properly in the two sort calls.
              The second sed calls restores the original dots.



              The final sort gets the data sorted on chromosomes and position, as from the start.



              This results in



              #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  i0      i1      i2      i3      i4      i5      i6      i7      i8      i9      i10     i11     i12     i13     i14
              1 1 . A T 1000 PASS MID=0;S=0.0324764;DOM=0.5;PO=1;GO=1;MT=0;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
              1 2 . A T 1000 PASS MID=1;S=0.0125739;DOM=0.5;PO=1;GO=1;MT=5;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
              1 3 . A T 1000 PASS MID=2;S=-0.0693919;DOM=0.5;PO=1;GO=1;MT=9;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
              1 4 . A T 1000 PASS MID=3;S=0.0611535;DOM=0.5;PO=1;GO=1;MT=12;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
              1 5 . A T 1000 PASS MID=4;S=-0.0791182;DOM=0.5;PO=1;GO=1;MT=16;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
              1 6 . A T 1000 PASS MID=5;S=0.0463103;DOM=0.5;PO=1;GO=1;MT=21;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
              1 7 . A T 1000 PASS MID=6;S=0.0509527;DOM=0.5;PO=1;GO=1;MT=25;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
              1 8 . A T 1000 PASS MID=7;S=-0.0134404;DOM=0.5;PO=1;GO=1;MT=28;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
              1 9 . A T 1000 PASS MID=8;S=-0.00478324;DOM=0.5;PO=1;GO=1;MT=32;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
              1 10 . A T 1000 PASS MID=9;S=0.03588;DOM=0.5;PO=1;GO=1;MT=36;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
              1 11 . A T 1000 PASS MID=10;S=-0.028843;DOM=0.5;PO=1;GO=1;MT=41;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
              1 12 . A T 1000 PASS MID=11;S=-0.0832497;DOM=0.5;PO=1;GO=1;MT=45;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
              1 13 . A T 1000 PASS MID=12;S=0.0389281;DOM=0.5;PO=1;GO=1;MT=48;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
              1 14 . A T 1000 PASS MID=13;S=0.0362106;DOM=0.5;PO=1;GO=1;MT=53;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
              1 15 . A T 1000 PASS MID=14;S=0.0375309;DOM=0.5;PO=1;GO=1;MT=57;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
              1 16 . A T 1000 PASS MID=15;S=0.0112808;DOM=0.5;PO=1;GO=1;MT=60;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
              1 17 . A T 1000 PASS MID=16;S=0.0243286;DOM=0.5;PO=1;GO=1;MT=65;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
              1 18 . A T 1000 PASS MID=17;S=0.0596463;DOM=0.5;PO=1;GO=1;MT=69;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
              1 24 . A T 1000 PASS MID=23;S=-0.0086571;DOM=0.5;PO=1;GO=1;MT=92;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1
              1 58 . A T 1000 PASS MID=57;S=-0.0926969;DOM=0.5;PO=1;GO=1;MT=229;AC=198;DP=1000;MULTIALLELIC GT 1|1 0|1 1|1 1|1 0|1 1|1
              1 213 . A T 1000 PASS MID=212;S=-0.0925562;DOM=0.5;PO=1;GO=1;MT=848;AC=196;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1
              1 233 . A T 1000 PASS MID=232;S=-0.0868037;DOM=0.5;PO=1;GO=1;MT=929;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1




              Extracting line numbers that you'd like to delete in a pipeline is often the wrong way to go about a task like this. Note that each part of a pipeline runs concurrently with every other part of the same pipeline. This means that you can't start overwriting or modifying a file in one part that you are, at the same, time reading from in another part.



              Also note that when extracting data through the stages of a pipeline, you lose the data that you're not passing on. This makes it even harder to allow the one pipeline to modify the original data (because it's lost along the way through the pipeline).






              share|improve this answer






























                3














                It's a pity we can't see your original data. Given the result of you pipeline in a file called file, you would get the right result though



                $ sort -t ' ' -k2,2 -k4.4n file | sort -u -k2,2
                2374 213 MID=212 GO=1
                2394 233 MID=232 GO=1
                2185 24 MID=23 GO=1
                1893 48 MID=47 GO=1
                2219 58 MID=57 GO=1


                These are the lines in the data with the smallest GO value for each group (where the second field defines the group).



                The first sort sorts the data in such a way that each group's GO values are ordered smallest to largest. That -k4.4n should have a dot in it, not a comma. It specifies the actual value after the = in the fourth field as the secondary sort key.



                The second sort uses only the group number in the second field and does a unique sort. This has the effect of only leaving the first of each unique group in the output.





                Using the raw data in the question:



                sed 's/./;./' file |
                sort -t ';' -k1,1 -k6.4,6n |
                sort -u -t ';' -k1,1 |
                sed 's/;././' |
                sort -k1,1n -k2,2n


                The first two sort calls do the same sort of operation as in the earlier part of this answer.



                The first sed replaces the dot in column 3 with ;.. This is done to let us use ; as the field delimiter properly in the two sort calls.
                The second sed calls restores the original dots.



                The final sort gets the data sorted on chromosomes and position, as from the start.



                This results in



                #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  i0      i1      i2      i3      i4      i5      i6      i7      i8      i9      i10     i11     i12     i13     i14
                1 1 . A T 1000 PASS MID=0;S=0.0324764;DOM=0.5;PO=1;GO=1;MT=0;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                1 2 . A T 1000 PASS MID=1;S=0.0125739;DOM=0.5;PO=1;GO=1;MT=5;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                1 3 . A T 1000 PASS MID=2;S=-0.0693919;DOM=0.5;PO=1;GO=1;MT=9;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                1 4 . A T 1000 PASS MID=3;S=0.0611535;DOM=0.5;PO=1;GO=1;MT=12;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                1 5 . A T 1000 PASS MID=4;S=-0.0791182;DOM=0.5;PO=1;GO=1;MT=16;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                1 6 . A T 1000 PASS MID=5;S=0.0463103;DOM=0.5;PO=1;GO=1;MT=21;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                1 7 . A T 1000 PASS MID=6;S=0.0509527;DOM=0.5;PO=1;GO=1;MT=25;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                1 8 . A T 1000 PASS MID=7;S=-0.0134404;DOM=0.5;PO=1;GO=1;MT=28;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                1 9 . A T 1000 PASS MID=8;S=-0.00478324;DOM=0.5;PO=1;GO=1;MT=32;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                1 10 . A T 1000 PASS MID=9;S=0.03588;DOM=0.5;PO=1;GO=1;MT=36;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                1 11 . A T 1000 PASS MID=10;S=-0.028843;DOM=0.5;PO=1;GO=1;MT=41;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                1 12 . A T 1000 PASS MID=11;S=-0.0832497;DOM=0.5;PO=1;GO=1;MT=45;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                1 13 . A T 1000 PASS MID=12;S=0.0389281;DOM=0.5;PO=1;GO=1;MT=48;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                1 14 . A T 1000 PASS MID=13;S=0.0362106;DOM=0.5;PO=1;GO=1;MT=53;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                1 15 . A T 1000 PASS MID=14;S=0.0375309;DOM=0.5;PO=1;GO=1;MT=57;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                1 16 . A T 1000 PASS MID=15;S=0.0112808;DOM=0.5;PO=1;GO=1;MT=60;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                1 17 . A T 1000 PASS MID=16;S=0.0243286;DOM=0.5;PO=1;GO=1;MT=65;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                1 18 . A T 1000 PASS MID=17;S=0.0596463;DOM=0.5;PO=1;GO=1;MT=69;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                1 24 . A T 1000 PASS MID=23;S=-0.0086571;DOM=0.5;PO=1;GO=1;MT=92;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                1 58 . A T 1000 PASS MID=57;S=-0.0926969;DOM=0.5;PO=1;GO=1;MT=229;AC=198;DP=1000;MULTIALLELIC GT 1|1 0|1 1|1 1|1 0|1 1|1
                1 213 . A T 1000 PASS MID=212;S=-0.0925562;DOM=0.5;PO=1;GO=1;MT=848;AC=196;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1
                1 233 . A T 1000 PASS MID=232;S=-0.0868037;DOM=0.5;PO=1;GO=1;MT=929;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1




                Extracting line numbers that you'd like to delete in a pipeline is often the wrong way to go about a task like this. Note that each part of a pipeline runs concurrently with every other part of the same pipeline. This means that you can't start overwriting or modifying a file in one part that you are, at the same, time reading from in another part.



                Also note that when extracting data through the stages of a pipeline, you lose the data that you're not passing on. This makes it even harder to allow the one pipeline to modify the original data (because it's lost along the way through the pipeline).






                share|improve this answer




























                  3












                  3








                  3







                  It's a pity we can't see your original data. Given the result of you pipeline in a file called file, you would get the right result though



                  $ sort -t ' ' -k2,2 -k4.4n file | sort -u -k2,2
                  2374 213 MID=212 GO=1
                  2394 233 MID=232 GO=1
                  2185 24 MID=23 GO=1
                  1893 48 MID=47 GO=1
                  2219 58 MID=57 GO=1


                  These are the lines in the data with the smallest GO value for each group (where the second field defines the group).



                  The first sort sorts the data in such a way that each group's GO values are ordered smallest to largest. That -k4.4n should have a dot in it, not a comma. It specifies the actual value after the = in the fourth field as the secondary sort key.



                  The second sort uses only the group number in the second field and does a unique sort. This has the effect of only leaving the first of each unique group in the output.





                  Using the raw data in the question:



                  sed 's/./;./' file |
                  sort -t ';' -k1,1 -k6.4,6n |
                  sort -u -t ';' -k1,1 |
                  sed 's/;././' |
                  sort -k1,1n -k2,2n


                  The first two sort calls do the same sort of operation as in the earlier part of this answer.



                  The first sed replaces the dot in column 3 with ;.. This is done to let us use ; as the field delimiter properly in the two sort calls.
                  The second sed calls restores the original dots.



                  The final sort gets the data sorted on chromosomes and position, as from the start.



                  This results in



                  #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  i0      i1      i2      i3      i4      i5      i6      i7      i8      i9      i10     i11     i12     i13     i14
                  1 1 . A T 1000 PASS MID=0;S=0.0324764;DOM=0.5;PO=1;GO=1;MT=0;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 2 . A T 1000 PASS MID=1;S=0.0125739;DOM=0.5;PO=1;GO=1;MT=5;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 3 . A T 1000 PASS MID=2;S=-0.0693919;DOM=0.5;PO=1;GO=1;MT=9;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 4 . A T 1000 PASS MID=3;S=0.0611535;DOM=0.5;PO=1;GO=1;MT=12;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 5 . A T 1000 PASS MID=4;S=-0.0791182;DOM=0.5;PO=1;GO=1;MT=16;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 6 . A T 1000 PASS MID=5;S=0.0463103;DOM=0.5;PO=1;GO=1;MT=21;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 7 . A T 1000 PASS MID=6;S=0.0509527;DOM=0.5;PO=1;GO=1;MT=25;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 8 . A T 1000 PASS MID=7;S=-0.0134404;DOM=0.5;PO=1;GO=1;MT=28;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 9 . A T 1000 PASS MID=8;S=-0.00478324;DOM=0.5;PO=1;GO=1;MT=32;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 10 . A T 1000 PASS MID=9;S=0.03588;DOM=0.5;PO=1;GO=1;MT=36;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 11 . A T 1000 PASS MID=10;S=-0.028843;DOM=0.5;PO=1;GO=1;MT=41;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 12 . A T 1000 PASS MID=11;S=-0.0832497;DOM=0.5;PO=1;GO=1;MT=45;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 13 . A T 1000 PASS MID=12;S=0.0389281;DOM=0.5;PO=1;GO=1;MT=48;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 14 . A T 1000 PASS MID=13;S=0.0362106;DOM=0.5;PO=1;GO=1;MT=53;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 15 . A T 1000 PASS MID=14;S=0.0375309;DOM=0.5;PO=1;GO=1;MT=57;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 16 . A T 1000 PASS MID=15;S=0.0112808;DOM=0.5;PO=1;GO=1;MT=60;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 17 . A T 1000 PASS MID=16;S=0.0243286;DOM=0.5;PO=1;GO=1;MT=65;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 18 . A T 1000 PASS MID=17;S=0.0596463;DOM=0.5;PO=1;GO=1;MT=69;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 24 . A T 1000 PASS MID=23;S=-0.0086571;DOM=0.5;PO=1;GO=1;MT=92;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 58 . A T 1000 PASS MID=57;S=-0.0926969;DOM=0.5;PO=1;GO=1;MT=229;AC=198;DP=1000;MULTIALLELIC GT 1|1 0|1 1|1 1|1 0|1 1|1
                  1 213 . A T 1000 PASS MID=212;S=-0.0925562;DOM=0.5;PO=1;GO=1;MT=848;AC=196;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1
                  1 233 . A T 1000 PASS MID=232;S=-0.0868037;DOM=0.5;PO=1;GO=1;MT=929;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1




                  Extracting line numbers that you'd like to delete in a pipeline is often the wrong way to go about a task like this. Note that each part of a pipeline runs concurrently with every other part of the same pipeline. This means that you can't start overwriting or modifying a file in one part that you are, at the same, time reading from in another part.



                  Also note that when extracting data through the stages of a pipeline, you lose the data that you're not passing on. This makes it even harder to allow the one pipeline to modify the original data (because it's lost along the way through the pipeline).






                  share|improve this answer















                  It's a pity we can't see your original data. Given the result of you pipeline in a file called file, you would get the right result though



                  $ sort -t ' ' -k2,2 -k4.4n file | sort -u -k2,2
                  2374 213 MID=212 GO=1
                  2394 233 MID=232 GO=1
                  2185 24 MID=23 GO=1
                  1893 48 MID=47 GO=1
                  2219 58 MID=57 GO=1


                  These are the lines in the data with the smallest GO value for each group (where the second field defines the group).



                  The first sort sorts the data in such a way that each group's GO values are ordered smallest to largest. That -k4.4n should have a dot in it, not a comma. It specifies the actual value after the = in the fourth field as the secondary sort key.



                  The second sort uses only the group number in the second field and does a unique sort. This has the effect of only leaving the first of each unique group in the output.





                  Using the raw data in the question:



                  sed 's/./;./' file |
                  sort -t ';' -k1,1 -k6.4,6n |
                  sort -u -t ';' -k1,1 |
                  sed 's/;././' |
                  sort -k1,1n -k2,2n


                  The first two sort calls do the same sort of operation as in the earlier part of this answer.



                  The first sed replaces the dot in column 3 with ;.. This is done to let us use ; as the field delimiter properly in the two sort calls.
                  The second sed calls restores the original dots.



                  The final sort gets the data sorted on chromosomes and position, as from the start.



                  This results in



                  #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  i0      i1      i2      i3      i4      i5      i6      i7      i8      i9      i10     i11     i12     i13     i14
                  1 1 . A T 1000 PASS MID=0;S=0.0324764;DOM=0.5;PO=1;GO=1;MT=0;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 2 . A T 1000 PASS MID=1;S=0.0125739;DOM=0.5;PO=1;GO=1;MT=5;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 3 . A T 1000 PASS MID=2;S=-0.0693919;DOM=0.5;PO=1;GO=1;MT=9;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 4 . A T 1000 PASS MID=3;S=0.0611535;DOM=0.5;PO=1;GO=1;MT=12;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 5 . A T 1000 PASS MID=4;S=-0.0791182;DOM=0.5;PO=1;GO=1;MT=16;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 6 . A T 1000 PASS MID=5;S=0.0463103;DOM=0.5;PO=1;GO=1;MT=21;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 7 . A T 1000 PASS MID=6;S=0.0509527;DOM=0.5;PO=1;GO=1;MT=25;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 8 . A T 1000 PASS MID=7;S=-0.0134404;DOM=0.5;PO=1;GO=1;MT=28;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 9 . A T 1000 PASS MID=8;S=-0.00478324;DOM=0.5;PO=1;GO=1;MT=32;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 10 . A T 1000 PASS MID=9;S=0.03588;DOM=0.5;PO=1;GO=1;MT=36;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 11 . A T 1000 PASS MID=10;S=-0.028843;DOM=0.5;PO=1;GO=1;MT=41;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 12 . A T 1000 PASS MID=11;S=-0.0832497;DOM=0.5;PO=1;GO=1;MT=45;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 13 . A T 1000 PASS MID=12;S=0.0389281;DOM=0.5;PO=1;GO=1;MT=48;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 14 . A T 1000 PASS MID=13;S=0.0362106;DOM=0.5;PO=1;GO=1;MT=53;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 15 . A T 1000 PASS MID=14;S=0.0375309;DOM=0.5;PO=1;GO=1;MT=57;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 16 . A T 1000 PASS MID=15;S=0.0112808;DOM=0.5;PO=1;GO=1;MT=60;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 17 . A T 1000 PASS MID=16;S=0.0243286;DOM=0.5;PO=1;GO=1;MT=65;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 18 . A T 1000 PASS MID=17;S=0.0596463;DOM=0.5;PO=1;GO=1;MT=69;AC=200;DP=1000 GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 24 . A T 1000 PASS MID=23;S=-0.0086571;DOM=0.5;PO=1;GO=1;MT=92;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1 1|1
                  1 58 . A T 1000 PASS MID=57;S=-0.0926969;DOM=0.5;PO=1;GO=1;MT=229;AC=198;DP=1000;MULTIALLELIC GT 1|1 0|1 1|1 1|1 0|1 1|1
                  1 213 . A T 1000 PASS MID=212;S=-0.0925562;DOM=0.5;PO=1;GO=1;MT=848;AC=196;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1
                  1 233 . A T 1000 PASS MID=232;S=-0.0868037;DOM=0.5;PO=1;GO=1;MT=929;AC=199;DP=1000;MULTIALLELIC GT 1|1 1|1 1|1 1|1 1|1 1|1




                  Extracting line numbers that you'd like to delete in a pipeline is often the wrong way to go about a task like this. Note that each part of a pipeline runs concurrently with every other part of the same pipeline. This means that you can't start overwriting or modifying a file in one part that you are, at the same, time reading from in another part.



                  Also note that when extracting data through the stages of a pipeline, you lose the data that you're not passing on. This makes it even harder to allow the one pipeline to modify the original data (because it's lost along the way through the pipeline).







                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited 19 mins ago

























                  answered 50 mins ago









                  KusalanandaKusalananda

                  132k17250410




                  132k17250410






























                      draft saved

                      draft discarded




















































                      Thanks for contributing an answer to Unix & Linux Stack Exchange!


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid



                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.


                      To learn more, see our tips on writing great answers.




                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function () {
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f502415%2fpiping-multiple-numbers-into-sed%23new-answer', 'question_page');
                      }
                      );

                      Post as a guest















                      Required, but never shown





















































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown

































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown







                      Popular posts from this blog

                      Szabolcs (Ungheria) Altri progetti | Menu di navigazione48°10′14.56″N 21°29′33.14″E /...

                      Discografia di Klaus Schulze Indice Album in studio | Album dal vivo | Singoli | Antologie | Colonne...

                      How to make inet_server_addr() return localhost in spite of ::1/128RETURN NEXT in Postgres FunctionConnect to...