About

Genome data stored in the gEVE database version 1

| Species | Nomenclature | Genome Database | Genome ID | # of EVEs | # of EVEs (Met)* | HMM profile: gag | HMM profile: pro | HMM profile: pol (LINE)** | HMM profile: env | HMM profile: others |
|---|---|---|---|---|---|---|---|---|---|---|
| Human | Homo sapiens | GRCh38, Dec 2013 | Hsap38 | 33,966 | 31,292 | 1,782 | 1,482 | 29,120 (21,087) | 1,731 | 11 |
| Chimpanzee | Pan troglodytes | CSAC 2.1.4/panTro4, Feb 2011 | Ptro214 | 30,099 | 28,136 | 1,813 | 1,125 | 25,572 (19,043) | 1,719 | 10 |
| Gorilla | Gorilla gorilla gorilla | gorGor3.1/gorGor3, May 2011 | Ggor31 | 26,335 | 24,409 | 1,456 | 1,034 | 22,462 (16,140) | 1,486 | 8 |
| Orangutan | Pongo pygmaeus abelii | PPYG2, Sep 2007 | Pabe2 | 28,315 | 26,716 | 1,214 | 846 | 24,919 (19,492) | 1,400 | 14 |
| Baboon | Papio anubis | Panu_2.0, Jun 2012 | Panu2 | 27,230 | 25,192 | 2,101 | 1,240 | 22,125 (15,476) | 1,962 | 5 |
| Macaque | Macaca mulatta | MMUL 1.0, Feb 2006 | Mmul1 | 26,941 | 25,043 | 1,980 | 1,130 | 21,968 (15,745) | 2,020 | 7 |
| Marmoset | Callithrix jacchus | C_jacchus3.2.1, Jan 2010 | Cjac321 | 21,802 | 20,614 | 992 | 406 | 19,575 (16,070) | 888 | 3 |
| Mouse | Mus musculus | GRCm38.p1, Jan 2012 | Mmus38 | 61,184 | 58,805 | 7,494 | 5,602 | 46,784 (29,122) | 3,075 | 16 |
| Rat | Rattus norvegicus | Rnor_5.0, Mar 2012 | Rnor50 | 34,861 | 32,525 | 2,570 | 1,491 | 29,258 (21,517) | 1,771 | 6 |
| Rabbit | Oryctolagus cuniculus | oryCun2, Nov 2009 | Ocun2 | 13,214 | 12,909 | 438 | 237 | 12,275 (10,473) | 292 | 2 |
| Cow | Bos taurus UMD3.1 | UMD3.1, Dec 2009 | BtauUMD31 | 105,654 | 104,674 | 1,023 | 673 | 103,402 (98,952) | 648 | 1 |
| Cow | Bos taurus 4.6.1 | Btau_4.6.1, Nov 2011 | Btau461 | 98,016 | 97,150 | 860 | 641 | 96,065 (92,153) | 585 | 0 |
| Dog | Canis lupus familiaris | CanFam3.1, Sep 2011 | Cfam31 | 11,393 | 11,011 | 399 | 135 | 10,815 (10,019) | 78 | 0 |
| Cat | Felis catus | Felis_catus_6.2, Sep 2011 | Fcat62 | 11,132 | 10,625 | 694 | 203 | 9,898 (8,505) | 391 | 1 |
| Horse | Equus caballus | Equ Cab 2, Sep 2007 | Ecab2 | 14,391 | 13,972 | 190 | 142 | 13,904 (12,554) | 167 | 0 |
| Sheep | Ovis aries | Oar_v3.1, 2012/09/24 | Oari31 | 61,093 | 60,184 | 1,099 | 517 | 58,940 (55,274) | 628 | 1 |
| Pig | Sus scrofa | Sscrofa10.2, Aug 2011 | Sscr102 | 15,210 | 14,761 | 456 | 155 | 14,350 (13,207) | 285 | 9 |
| Goat | Capra hircus | CHIR_1.0, Jan 2013 | Chir1 | 37,003 | 36,060 | 1,106 | 508 | 34,797 (31,146) | 653 | 0 |
| Opossum | Monodelphis domestica | monDom5, Oct 2006 | Mdom5 | 77,190 | 73,029 | 2,546 | 2,723 | 71,821 (46,874) | 1,134 | 0 |
| Platypus | Ornithorhynchus anatinus | OANA5, Dec 2005 | Oana5 | 1,742 | 1,365 | 2 | 1 | 1,732 (1,658) | 7 | 0 |

*Number of EVE sequences containing a methionine that could serve as the initiation codon. (Some EVE ORFs obtained in this study do not have an ATG codon.)
**Numbers in parentheses are pol genes thought to be derived from LINEs: those annotated as “LINE” by RepeatMasker and/or matching “YP_073558.1” or “NP_048132.1” in a BLASTP search against the NCBI Viral Genome Database.

HMM profile distributions for each species

hmm_cnt-1.xlsx

Comparison of EVE identification methods

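| Genome ID | Species | ReTe ∩ ReMa ∩ Blat | ReTe ∩ ReMa | ReTe ∩ Blat | ReMa ∩ Blat | ReTe | ReMa | Blat | Other |
|---|---|---|---|---|---|---|---|---|---|
| Hsap38 | Human | 9,959 | 1 | 143 | 22,790 | 573 | 3 | 496 | 1 |
| Ptro214 | Chimpanzee | 8,121 | 8 | 99 | 20,981 | 459 | 9 | 421 | 1 |
| Ggor31 | Gorilla | 6,998 | 3 | 111 | 18,211 | 362 | 9 | 640 | 1 |
| Pabe2 | Orangutan | 6,130 | 2 | 242 | 21,135 | 379 | 18 | 408 | 1 |
| Panu2 | Baboon | 8,876 | 1 | 153 | 17,233 | 492 | 3 | 471 | 1 |
| Mmul1 | Macaque | 8,289 | 3 | 132 | 17,683 | 440 | 6 | 387 | 1 |
| Cjac321 | Marmoset | 3,949 | 3 | 159 | 16,928 | 305 | 6 | 452 | 0 |
| Mmus38 | Mouse | 25,031 | 14 | 407 | 34,260 | 573 | 26 | 873 | 0 |
| Rnor50 | Rat | 9,691 | 0 | 258 | 23,890 | 404 | 4 | 614 | 0 |
| Ocun2 | Rabbit | 1,875 | 0 | 105 | 10,676 | 88 | 3 | 467 | 0 |
| BtauUMD31 | Cow | 8,183 | 7 | 193 | 95,632 | 618 | 270 | 751 | 0 |
| Btau461 | Cow | 7,242 | 9 | 186 | 89,134 | 566 | 263 | 616 | 0 |
| Cfam31 | Dog | 1,075 | 0 | 92 | 9,948 | 62 | 1 | 215 | 0 |
| Fcat62 | Cat | 1,545 | 1 | 674 | 8,259 | 105 | 1 | 547 | 0 |
| Ecab2 | Horse | 1,213 | 1 | 41 | 12,602 | 129 | 2 | 403 | 0 |
| Oari31 | Sheep | 1,789 | 5 | 84 | 57,672 | 207 | 158 | 1,178 | 0 |
| Sscr102 | Pig | 1,505 | 0 | 45 | 13,138 | 104 | 1 | 417 | 0 |
| Chir1 | Goat | 4,026 | 5 | 274 | 31,105 | 431 | 158 | 1,004 | 0 |
| Mdom5 | Opossum | 17,972 | 1,159 | 335 | 54,040 | 903 | 1,581 | 1,200 | 0 |
| Oana5 | Platypus | 39 | 0 | 2 | 1,601 | 0 | 10 | 90 | 0 |
| SUM | | 133,508 | 1,222 | 3,735 | 576,918 | 7,200 | 2,532 | 11,650 | 6 |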

*ReTe, RetroTector; ReMa, RepeatMasker; Blat, BLAT
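The columns of this comparison appear to be mutually exclusive categories: for Hsap38, 9,959 + 1 + 143 + 22,790 + 573 + 3 + 496 + 1 = 33,966, which equals the number of EVEs listed for that genome in the genome data table. The sketch below shows one way such exclusive counts could be derived from per-method sets of EVE identifiers. The function name, the dictionary layout, and the example IDs are hypothetical and are not taken from the gEVE pipeline.

```python
from collections import Counter

# Method labels follow the table footnote; their order fixes the category names.
METHODS = ("ReTe", "ReMa", "Blat")

def count_method_categories(ids_by_method, all_ids):
    """Count IDs per exclusive category, e.g. 'ReTe ∩ Blat' or 'Other'."""
    counts = Counter()
    for eve_id in all_ids:
        # Collect every method whose ID set contains this EVE.
        found = [m for m in METHODS if eve_id in ids_by_method.get(m, set())]
        counts[" ∩ ".join(found) if found else "Other"] += 1
    return counts

# Hypothetical IDs, for illustration only.
example = {
    "ReTe": {"eve1", "eve2", "eve3"},
    "ReMa": {"eve1", "eve2", "eve4"},
    "Blat": {"eve1", "eve4", "eve5"},
}
counts = count_method_categories(
    example, ["eve1", "eve2", "eve3", "eve4", "eve5", "eve6"]
)
```

Because every ID falls into exactly one category, the category counts always sum to the number of IDs supplied.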

Methods

A schematic workflow of a four-step procedure for identifying EVE ORFs in 20 mammalian genomes

 

[Figure: gEVE_pipeline]

STEP1

  1. Twenty mammalian genome sequences were analyzed with two widely used programs to identify EVE candidate regions.
    • RetroTector version 1.0.1
    • RepeatMasker version 4.03 with RMBlast + Repbase (20140423)

STEP2

  1. All six reading frames (three on each strand) of each EVE candidate sequence, extended by 100 bp at both ends, were scanned for open reading frames (ORFs) encoding more than 80 aa.
    • In this database, an EVE ORF does not necessarily start with an ATG codon.
  2. The EVE candidate ORFs were then searched with HMMER3 against the HMM profiles of viral proteins listed below.
    • Pfam (39 profiles)
      • gag (8)
        • Gag_p24, Gag_p30, zf-CCHC, zf-CCHC_2, zf-CCHC_3, zf-CCHC_4, zf-CCHC_5, zf-CCHC_6
      • pro (23)
        • Asp, Asp_protease, Asp_protease_2, Cytomega_UL84, DCD, DUF1758, DUF570, G-patch, G-patch_2, Herpes_ORF11, Herpes_U55, Herpes_UL82_83, Peptidase_A2B, Peptidase_A2E, Peptidase_A3, RVP, RVP_2, Spuma_A9PTase, TAXi_C, TAXi_N, Zn_protease, dUTPase, gag-asp_proteas
      • pol (6)
        • IN_DBD_C, Integrase_Zn, RNase_H, RVT_1, RVT_thumb, rve
      • env (2)
        • GP41, TLV_coat
    • GypsyDB (304 profiles)
      • Gag (51)
        • GAG_412_mdg1, GAG_alpharetroviridae, GAG_athila, GAG_betaretroviridae, GAG_cer2_3, GAG_crm, GAG_reina, GAG_del, GAG_galadriel, GAG_deltaretroviridae, GAG_gammaretroviridae, GAG_lentiviridae, GAG_micropia_mdg3, GAG_osvaldo, GAG_spumaretroviridae, GAG_tat, GAG_TF, GAG_csrn1, GAG_1731, GAG_17_6, GAG_a_clade, GAG_b_clade, GAG_bel, GAG_codi_c, GAG_codi_d, GAG_codi_I, GAG_codi_II, GAG_copia, GAG_epsilonretroviridae, GAG_galea, GAG_gmr1, GAG_gypsy, GAG_hydra, GAG_maggy, GAG_oryco, GAG_pao, GAG_pCretro, GAG_pseudovirus, GAG_pyggy, GAG_pyret, GAG_retrofit, GAG_sinbad, GAG_sire, GAG_suzu, GAG_tas, GAG_tork, GAG_v_clade, GAGCOAT_caulimovirus, GAGCOAT_cavemovirus, GAGCOAT_soymovirus, GAGCOAT_badnavirus
      • Protease (62)
        • AP_DTG_ILG_template, AP_micropia_mdg3, AP_alpharetroviridae, AP_athila, AP_betaretroviridae, AP_cer2_3, AP_bel, AP_crm, AP_reina, AP_del, AP_galadriel, AP_cog3577, AP_cog5550, AP_ddi, AP_deltaretroviridae, AP_gammaretroviridae, AP_lentiviridae, AP_nix1, AP_osvaldo, AP_pepsins_A1a, AP_pepsins_A1b, AP_saspase, AP_retropepsins, AP_spumaretroviridae, AP_tat, AP_ty1copia, AP_csrn1, AP_1731, AP_17_6, AP_412_mdg1, AP_a_clade, AP_b_clade, AP_caulimoviridae_dom2, AP_caulimovirus, AP_badnavirus, AP_cavemovirus, AP_codi_c, AP_codi_d, AP_codi_I, AP_codi_II, AP_copia, AP_epsilonretroviridae, AP_galea, AP_gmr1, AP_gypsy, AP_hydra, AP_maggy, AP_oryco, AP_pao, AP_pseudovirus, AP_pyggy, AP_pyret, AP_pCretro, AP_retrofit, AP_sinbad, AP_sire, AP_soymovirus, AP_suzu, AP_tas, AP_TF, AP_tork, AP_v_clade
      • Reverse_Transcriptase (51)
        • RT_412_mdg1, RT_alpharetroviridae, RT_athila, RT_betaretroviridae, RT_cer2_3, RT_crm, RT_reina, RT_del, RT_galadriel, RT_TF, RT_deltaretroviridae, RT_gammaretroviridae, RT_lentiviridae, RT_micropia_mdg3, RT_osvaldo, RT_spumaretroviridae, RT_tat, RT_csrn1, RT_1731, RT_17_6, RT_a_clade, RT_b_clade, RT_badnavirus, RT_bel, RT_caulimovirus, RT_cavemovirus, RT_codi_c, RT_codi_d, RT_codi_I, RT_codi_II, RT_copia, RT_epsilonretroviridae, RT_galea, RT_gmr1, RT_gypsy, RT_hydra, RT_maggy, RT_oryco, RT_pao, RT_pCretro, RT_pseudovirus, RT_pyggy, RT_pyret, RT_retrofit, RT_sinbad, RT_sire, RT_soymovirus, RT_suzu, RT_tas, RT_tork, RT_v_clade
      • Ribonuclease_H (51)
        • RNaseH_412_mdg1, RNaseH_alpharetroviridae, RNaseH_athila, RNaseH_betaretroviridae, RNaseH_cer2_3, RNaseH_crm, RNaseH_reina, RNaseH_del, RNaseH_galadriel, RNaseH_TF, RNaseH_deltaretroviridae, RNaseH_gammaretroviridae, RNaseH_lentiviridae, RNaseH_micropia_mdg3, RNaseH_osvaldo, RNaseH_spumaretroviridae, RNaseH_tat, RNaseH_csrn1, RNaseH_1731, RNaseH_17_6, RNaseH_a_clade, RNaseH_b_clade, RNaseH_badnavirus, RNaseH_bel, RNaseH_caulimovirus, RNaseH_cavemovirus, RNaseH_codi_c, RNaseH_codi_d, RNaseH_codi_I, RNaseH_codi_II, RNaseH_copia, RNaseH_epsilonretroviridae, RNaseH_galea, RNaseH_gmr1, RNaseH_gypsy, RNaseH_hydra, RNaseH_maggy, RNaseH_oryco, RNaseH_pao, RNaseH_pCretro, RNaseH_pseudovirus, RNaseH_pyggy, RNaseH_pyret, RNaseH_retrofit, RNaseH_sinbad, RNaseH_sire, RNaseH_soymovirus, RNaseH_suzu, RNaseH_tas, RNaseH_tork, RNaseH_v_clade
      • Integrase (48)
        • INT_412_mdg1, INT_alpharetroviridae, INT_athila, INT_betaretroviridae, INT_cer2_3, INT_crm, INT_reina, INT_del, INT_galadriel, INT_TF, INT_deltaretroviridae, INT_gammaretroviridae, INT_lentiviridae, INT_micropia_mdg3, INT_osvaldo, INT_spumaretroviridae, INT_tat, INT_csrn1, GIN1, INT_1731, INT_17_6, INT_a_clade, INT_b_clade, INT_bel, INT_codi_c, INT_codi_d, INT_codi_I, INT_codi_II, INT_copia, INT_epsilonretroviridae, INT_galea, INT_gmr1, INT_gypsy, INT_hydra, INT_maggy, INT_oryco, INT_pao, INT_pCretro, INT_pseudovirus, INT_pyggy, INT_pyret, INT_retrofit, INT_sinbad, INT_sire, INT_suzu, INT_tas, INT_tork, INT_v_clade
      • Chromodomain (3)
        • CHR_all, CHR_shadow, CHR_retroelement
      • Env (13)
        • ENV_alpharetroviridae, ENV_athila, ENV_B-type_betaretroviridae, ENV_D-type_betaretroviridae, ENV_deltaretroviridae, ENV_gammaretroviridae, ENV_errantiviridae, ENV_lentiviridae, ENV_retroviridae, ENV_spumaretroviridae, ENV_epsilonretroviridae, ENV_sire, ENV_tas
      • dUTPase (3)
        • DUT_betaretroviridae, DUT_lentiviridae, DUT_caulimoviruses
      • Accessory_proteins (22)
        • BEL1_spumaretroviridae, BEL2_spumaretroviridae, NEF_retroviridae, ORFQ_retroviridae, VIF_retroviridae, VIF_Q_retroviridae, ORFW_retroviridae, ORFX_betaretroviridae, REV_retroviridae, REX_retroviridae, ROF_retroviridae, SORF_retroviridae, TAT_retroviridae, TAX_retroviridae, TOF_retroviridae, VPR_VPX_retroviridae, VPR_retroviridae, VPX_retroviridae, ORF2_badnavirus, ORF6_badnavirus, ORFB_soymovirus, ORFC_soymovirus
      • Aphid_transmission_factor (1)
        • ATF_caulimovirus
      • Movement_protein (4)
        • MOV_badnavirus, MOV_caulimovirus, MOV_soymovirus, MOV_cavemovirus
      • Transactivator/viroplasmin_protein (3)
        • TAV_caulimovirus, TAV_cavemovirus, TAV_soymovirus
      • Virion_associated_protein (2)
        • VAP_caulimovirus, VAP_badnavirus
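The ORF scan in item 1 of STEP2 can be illustrated with a minimal sketch. It assumes that an ORF is any stop-free stretch between two stop codons (or a sequence end), which is one reading of the statement that EVE ORFs need not start with ATG. The actual gEVE implementation is not described here, so the function names and the `has_met` flag are hypothetical. The input is assumed to be a candidate region that has already been extended by 100 bp at both ends.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    # Codons with ambiguous bases (e.g. N) become 'X'; a trailing partial codon is dropped.
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame_orfs(seq, min_aa_exclusive=80):
    """Return stop-free stretches longer than min_aa_exclusive from all six frames."""
    seq = seq.upper()
    orfs = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            protein = translate(strand_seq[frame:])
            for segment in protein.split("*"):
                if len(segment) > min_aa_exclusive:
                    orfs.append({
                        "strand": strand,
                        "frame": frame,
                        "protein": segment,
                        # True if a methionine could serve as an initiation codon.
                        "has_met": "M" in segment,
                    })
    return orfs
```

Calling `six_frame_orfs(region_sequence)` on an extended candidate region returns one record per ORF; in this sketch, the `has_met` flag corresponds to the distinction drawn by the "# of EVEs (Met)" column in the genome data table.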

STEP3

  1. The 20 genome sequences were searched with BLAT using the following amino acid sequences as queries:
    • All EVE sequences identified in the 20 genomes (the sequences obtained above)
      • 774,172 sequences (726,934 sequences without duplicates)
    • All viral protein sequences stored in the NCBI Viral Genome Database (viral.1.protein.faa, version 07/10, 2014)
      • 200,396 sequences (178,484 sequences without duplicates)
    • Protein sequences of known EVEs

STEP4

  1. All BLAT hit sequences were also searched with HMMER3 as in STEP2, and all sequences identified by RetroTector, RepeatMasker, and BLAT were merged.

Annotation

We annotated each EVE sequence simply by its best BLAST hit(s). The relationships among BLAST hits against all viral protein sequences stored in the NCBI Viral Genome Database (viral.1.protein.faa) are shown in the database.

Validation

A comparison between known EVE sequences and gEVE IDs is shown in Google Sheets.

Usage (examples)

RNA-seq analysis

We provide a Gene Transfer Format (GTF) file of EVE gene loci for each genome stored in the gEVE database (see the “Download” page). Combining these GTF files with NGS data makes it possible to examine the expression profiles of EVE genes. As an example, we examined RNA-seq data from human placenta (ID: ERR315374) stored in the Sequence Read Archive (SRA). The FASTQ sequences were obtained and mapped onto the human genome (GRCh38) using TopHat2. The expression levels of EVE sequences were computed using Cufflinks with the gEVE Hsap38 GTF file. The 10 EVE sequences with the highest FPKM values (i.e., the most highly expressed EVE sequences) are summarized in Table 3. We identified EVEs known to be expressed in human placenta (PEG10, suppressyn, syncytin-1, and syncytin-2) as well as novel EVE sequences. This result shows that NGS data analyses combined with our annotation data can uncover hidden functional EVE sequences in genomes.
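The final ranking step can be sketched as follows. The example assumes a tab-separated Cufflinks FPKM tracking file with `tracking_id` and `FPKM` columns; the function name and the sample records are hypothetical, and the values shown are not results from ERR315374.

```python
import csv
import io

def top_expressed(handle, n=10):
    """Return the n (tracking_id, FPKM) pairs with the highest FPKM."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = [(row["tracking_id"], float(row["FPKM"])) for row in reader]
    # Sort by FPKM in descending order and keep the first n entries.
    return sorted(rows, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical records standing in for a Cufflinks FPKM tracking file.
example = io.StringIO(
    "tracking_id\tFPKM\n"
    "eve_a\t12.5\n"
    "eve_b\t340.0\n"
    "eve_c\t0.0\n"
)
ranked = top_expressed(example, n=2)
```

With real output, the same function could be applied to an opened tracking file (for instance, `genes.fpkm_tracking` in the Cufflinks output directory) instead of the in-memory example.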