Identify non-coding regions from a genome annotation


I have this GTF file, and I use the command below on a Linux machine to extract the coding regions of the genome:

awk '{if($3=="transcript" && $20=="\"protein_coding\";"){print $0}}' gencode.gtf

How could I do the inverse and keep only the non-coding regions?







annotation genome gtf text-processing interval






edited 20 mins ago by Daniel Standage
asked 7 hours ago by Feresh Teh












  • Do you want all non-coding regions of the genome, or do you want all non-coding transcripts? These are two very different things.
    – terdon, 1 hour ago




























3 Answers
If you want the non-coding regions of a protein-coding transcript, it sounds like you are looking for the UTRs.

UTRs have their own feature type in the GTF file, so you can do this:

$ awk -v FS="\t" '$3=="UTR"' gencode.gtf

If the GTF file is compressed, use this instead:

$ zcat gencode.gtf.gz | awk -v FS="\t" '$3=="UTR"'

BTW: Why are you using such an old release of GENCODE? The current version is v29.















answered 6 hours ago by finswimmer (edited 6 hours ago)
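When a feature filter such as the UTR command returns nothing, one quick check is to tabulate what the third column of the file actually contains. The following is a minimal sketch, assuming a tab-separated GTF with `#` header lines; the file name sample.gtf and its contents are made up for illustration and stand in for the real annotation.

```shell
# Write a tiny, made-up GTF sample so the example is self-contained.
printf '##description: toy example\n' > sample.gtf
printf 'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n' >> sample.gtf
printf 'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1";\n' >> sample.gtf
printf 'chr1\tHAVANA\tUTR\t100\t199\t.\t+\t.\tgene_id "G1";\n' >> sample.gtf
printf 'chr1\tHAVANA\tCDS\t200\t800\t.\t+\t.\tgene_id "G1";\n' >> sample.gtf
printf 'chr1\tHAVANA\tUTR\t801\t900\t.\t+\t.\tgene_id "G1";\n' >> sample.gtf

# Count how often each feature type (column 3) occurs, skipping header lines.
feature_counts=$(awk -F'\t' '!/^#/ {n[$3]++} END {for (f in n) print f "\t" n[f]}' sample.gtf | sort)
echo "$feature_counts"
```

If the feature name you filter on does not show up in this table for your own file, the filter cannot match anything, whatever the field separator.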












  • Sorry, what I literally need is the non-coding regions of the human genome; I only mentioned the coding parts in order to ask my question here.
    – Feresh Teh, 6 hours ago

  • Sorry, I tried that but my output is empty.
    – Feresh Teh, 6 hours ago

  • As @Wouter tells you, the non-coding regions of a genome are the complement of the coding regions. Coding regions have their own feature type in the GTF file. You can get them with $ awk -v FS="\t" '$3=="CDS"' gencode.gtf. Reading the manual for bedtools complement is your task.
    – finswimmer, 6 hours ago

  • Sorry, but your commands return nothing; I get an empty file.
    – Feresh Teh, 1 hour ago

  • The GTF file the OP has linked to includes non-coding transcripts (LINCs, pseudogenes, tRNAs etc.). I am guessing this is what they're after.
    – terdon, 49 mins ago
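The complement idea from the comments above can be sketched without extra tools. This is an illustrative stand-in written in awk, not bedtools itself. It assumes a tab-separated GTF and a hypothetical two-column table of chromosome lengths (chrom.sizes); all file names and sample rows are made up. It merges the CDS intervals per chromosome and reports the gaps between them as 1-based, inclusive coordinates.

```shell
# Made-up sample annotation: two overlapping CDS features and one separate CDS.
printf 'chr1\tHAVANA\tCDS\t200\t400\t.\t+\t0\tgene_id "G1";\n' > cds_sample.gtf
printf 'chr1\tHAVANA\tCDS\t350\t500\t.\t+\t0\tgene_id "G1";\n' >> cds_sample.gtf
printf 'chr1\tHAVANA\tCDS\t700\t800\t.\t-\t0\tgene_id "G2";\n' >> cds_sample.gtf
# Hypothetical chromosome-length table (name, length).
printf 'chr1\t1000\n' > chrom.sizes

# 1) Keep CDS rows as chrom/start/end and sort by chromosome, then by start.
awk -F'\t' -v OFS='\t' '$3=="CDS" {print $1, $4, $5}' cds_sample.gtf \
  | sort -k1,1 -k2,2n > cds.sorted.tsv

# 2) Walk the sorted intervals. The variable covered holds the furthest coding
#    base seen so far on the current chromosome; a start beyond covered+1 is a gap.
noncoding=$(awk -F'\t' -v OFS='\t' '
  NR==FNR { len[$1] = $2; next }            # first file: chromosome lengths
  $1 != chrom {                             # new chromosome: close the previous one
    if (chrom != "" && covered < len[chrom]) print chrom, covered + 1, len[chrom]
    chrom = $1; covered = 0
  }
  $2 > covered + 1 { print chrom, covered + 1, $2 - 1 }
  $3 > covered     { covered = $3 }
  END { if (chrom != "" && covered < len[chrom]) print chrom, covered + 1, len[chrom] }
' chrom.sizes cds.sorted.tsv)
echo "$noncoding"
```

Chromosomes that carry no CDS at all never appear in the sorted interval file, so this sketch does not report them; a dedicated interval tool handles that case from the chromosome table.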


















This isn't a problem that's easily solved with awk. You aren't extracting a feature that's annotated in the GTF file; instead, you want the empty space between annotated features.

A few years ago I wrote a program called LocusPocus for a similar task. It uses a gene annotation to break down a genome into gene loci and intergenic regions, and it handles overlapping annotations and other weirdness pretty robustly. The output will include both coding regions and non-coding regions, but you can identify the intergenic spaces as those with iLocus_type equal to iiLocus or fiLocus.

Note: the --delta parameter will extend each gene/transcript by 500 bp by default.

Caveat: the program only accepts GFF3 input by default. Hopefully it won't be too hard to convert your GTF to GFF3.

Another caveat: eventual interpretation of these data will depend on what features are annotated in the genome and which annotations you include vs. ignore. Do you want your non-coding regions to include non-coding genes, or should these be treated separately? Some non-coding regions will be full of transposable elements and other repetitive DNA, while others will have enhancers, promoters, or other regulatory elements. It's important to tread carefully before you jump to any conclusions.















answered 23 mins ago by Daniel Standage (edited 13 mins ago)























If you want all transcripts from that GTF file whose type isn't "protein coding", you can use almost the same command; just change the == ("is") to != ("isn't"):

awk '{if($3=="transcript" && $20!="\"protein_coding\";"){print $0}}' gencode.gtf

Or, a simpler version:

awk '$3=="transcript" && $20!="\"protein_coding\";"' gencode.gtf

Note that this will not include any of the havana transcripts in the file, but I am assuming that's what you want since that's what your original command did.

Specifically, the command will return the following types of transcript (the numbers on the left are the number of such transcripts in the file):

awk '$3=="transcript" && $20!="\"protein_coding\";"{print $20}' gencode.gtf | sort | uniq -c | sort -nk1
1 "translated_processed_pseudogene";
2 "Mt_rRNA";
3 "IG_J_pseudogene";
3 "TR_D_gene";
4 "TR_J_pseudogene";
5 "TR_C_gene";
10 "IG_C_pseudogene";
18 "IG_C_gene";
18 "IG_J_gene";
22 "Mt_tRNA";
25 "3prime_overlapping_ncrna";
27 "TR_V_pseudogene";
37 "IG_D_gene";
58 "non_stop_decay";
59 "polymorphic_pseudogene";
74 "TR_J_gene";
97 "TR_V_gene";
144 "IG_V_gene";
182 "unitary_pseudogene";
196 "IG_V_pseudogene";
330 "sense_overlapping";
387 "pseudogene";
442 "transcribed_processed_pseudogene";
531 "rRNA";
802 "sense_intronic";
860 "transcribed_unprocessed_pseudogene";
1529 "snoRNA";
1923 "snRNA";
2050 "misc_RNA";
2549 "unprocessed_pseudogene";
3116 "miRNA";
9710 "antisense";
10623 "processed_pseudogene";
11780 "lincRNA";
13052 "nonsense_mediated_decay";
25955 "retained_intron";
28082 "processed_transcript";

You might also want to remove that "translated_processed_pseudogene", since it is actually translated into protein and is therefore technically coding:

awk '$3=="transcript" &&
     $20!="\"protein_coding\";" &&
     $20!="\"translated_processed_pseudogene\";"' gencode.gtf
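Filtering on field 20 assumes the attributes always appear in the same order on every line. A hedged alternative is to match the transcript_type attribute by name inside the ninth tab-separated column. The sketch below assumes the GENCODE-style spelling transcript_type "..."; and uses made-up sample rows and file names; other annotation sources may name the attribute differently.

```shell
# Made-up sample with one coding and two non-coding transcripts.
printf 'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; transcript_type "protein_coding";\n' > tx_sample.gtf
printf 'chr1\tHAVANA\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; transcript_type "retained_intron";\n' >> tx_sample.gtf
printf 'chr2\tENSEMBL\ttranscript\t50\t150\t.\t-\t.\tgene_id "G2"; transcript_id "T3"; transcript_type "miRNA";\n' >> tx_sample.gtf

# Keep transcript rows whose attribute column does not declare protein_coding.
awk -F'\t' '$3=="transcript" && $9 !~ /transcript_type "protein_coding";/' tx_sample.gtf > noncoding_tx.gtf

# Tabulate the surviving transcript types by pulling the value out of column 9.
type_counts=$(awk -F'\t' '{
  if (match($9, /transcript_type "[^"]+"/)) {
    value = substr($9, RSTART + 17, RLENGTH - 18)   # strip the key, the space, and both quotes
    n[value]++
  }
} END { for (t in n) print n[t] "\t" t }' noncoding_tx.gtf | sort -k2,2)
echo "$type_counts"
```

Because the match is by attribute name, the filter keeps working if a line carries extra or reordered attributes, at the cost of a slightly longer pattern.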


















                1












                $begingroup$

                If you want all transcripts from that gtf file whose type isn't "protein coding", you can use almost the same command, just change the == ("is") to != ("isn't"):



                awk '{if($3=="transcript" && $20!=""protein_coding";"){print $0}}' gencode.gtf 


                Or, a simpler version:



                awk '$3=="transcript" && $20!=""protein_coding";"' gencode.gtf 


                Note that this will not include any of the havana transcripts in the file, but I am assuming that's what you want since that's what your original command did.



                Specifically, the command will return the following types of transcript (the numbers on the left are the number of such transcripts in the file):



                awk '$3=="transcript" && $20!=""protein_coding";"{print $20}' gencode.gtf  | sort | uniq -c | sort -nk1
                1 "translated_processed_pseudogene";
                2 "Mt_rRNA";
                3 "IG_J_pseudogene";
                3 "TR_D_gene";
                4 "TR_J_pseudogene";
                5 "TR_C_gene";
                10 "IG_C_pseudogene";
                18 "IG_C_gene";
                18 "IG_J_gene";
                22 "Mt_tRNA";
                25 "3prime_overlapping_ncrna";
                27 "TR_V_pseudogene";
                37 "IG_D_gene";
                58 "non_stop_decay";
                59 "polymorphic_pseudogene";
                74 "TR_J_gene";
                97 "TR_V_gene";
                144 "IG_V_gene";
                182 "unitary_pseudogene";
                196 "IG_V_pseudogene";
                330 "sense_overlapping";
                387 "pseudogene";
                442 "transcribed_processed_pseudogene";
                531 "rRNA";
                802 "sense_intronic";
                860 "transcribed_unprocessed_pseudogene";
                1529 "snoRNA";
                1923 "snRNA";
                2050 "misc_RNA";
                2549 "unprocessed_pseudogene";
                3116 "miRNA";
                9710 "antisense";
                10623 "processed_pseudogene";
                11780 "lincRNA";
                13052 "nonsense_mediated_decay";
                25955 "retained_intron";
                28082 "processed_transcript";


                You might also want to exclude the single "translated_processed_pseudogene" transcript, since it is actually translated into protein and is therefore technically coding:



                awk '$3=="transcript" &&
                     $20!="\"protein_coding\";" &&
                     $20!="\"translated_processed_pseudogene\";"' gencode.gtf
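
If the list of types to drop grows, chaining more && conditions gets unwieldy. The following is one possible sketch, not part of the original commands: it takes the types to exclude as arguments and looks each transcript's type up in an awk array. It assumes the attribute key is "transcript_type" (the GENCODE convention), and the function name exclude_transcript_types is hypothetical.

```shell
# Sketch only: exclude_transcript_types is a hypothetical helper name, and the
# attribute key "transcript_type" is an assumption about the GTF being used.
exclude_transcript_types() {
    gtf=$1
    shift
    # Remaining arguments are the types to drop, passed to awk as one string.
    awk -F'\t' -v types="$*" '
        BEGIN {
            n = split(types, t, " ")
            for (i = 1; i <= n; i++) skip[t[i]] = 1
        }
        $3 == "transcript" && match($9, /transcript_type "[^"]+"/) {
            # Strip the 17-character prefix and the closing quote.
            type = substr($9, RSTART + 17, RLENGTH - 18)
            if (!(type in skip)) print
        }
    ' "$gtf"
}

# Usage:
# exclude_transcript_types gencode.gtf protein_coding translated_processed_pseudogene
```

Unlike a plain negative match, this version only prints transcript lines that carry a transcript_type attribute, so untagged lines are silently dropped.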

















                  edited 46 mins ago

























                  answered 53 mins ago









                  terdon



















































