DNA Annotation Project

By Doug Drury

Introduction:

Throughout history, biologists have been restricted to working in the wet lab to advance their experiments, but this view of the biologist has changed dramatically in the last twenty years.  With the advent of more sophisticated computer technology and the ease of internet-aided communication, biologists have entered a new era of laboratory work.  This new era is the age of bioinformatics.  Bioinformatics is a merging of biology, computer science, and information technology (1).  The goal of bioinformatics is to create a globally unified community that can search for, acquire, and apply data obtained from around the world to further advance the biological sciences.

In this project, a nucleic acid sequence from a genomic library of the yellow fever mosquito, Aedes aegypti, will be annotated.  All significant biological information will be derived from the sequence using the most up-to-date bioinformatics tools found on the internet.  These programs are constantly being updated to provide the most relevant information possible.  This information will be used to identify the proteins possibly encoded, the exon-intron sites, and the promoter sites.




Methods and Materials:
 

BLASTn:


The first BLASTn search used the entire provided DNA sequence to see whether there were any significant hits.  These hits were noted for comparison with results from other programs.  The sequence was then placed in a GENSCAN search to find possible coding sequences.  The proposed coding sequences from GENSCAN were then placed in a BLASTn search to confirm the original findings.

 

GENSCAN:


Once the sequence was placed in a BLASTn queue, it was used in a GENSCAN search.  GENSCAN is a program used to predict the locations and exon-intron structures of genes in genomic sequences (4).  The parameters of the GENSCAN search were set to Organism: Vertebrate, and the options were set to print the predicted coding sequences and peptides.  The organism being studied is an insect, but its DNA was analyzed with the vertebrate settings because the only other organism options were plants.  The output from the GENSCAN search showed four predicted coding sequences.  To obtain further information about these sequences, each was used in a BLASTn search (GENSCAN output).

 

BLASTn/BLASTx:


The coding sequences retrieved from GENSCAN were individually placed in a BLASTn search.  After the BLASTn, a BLASTx was performed using the same coding sequences to confirm the results.  The BLAST searches on the GENSCAN-identified coding sequences showed that three of the four sequences produced a significant hit, and these hits confirmed the original findings of the BLASTn of the entire sequence.
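BLASTx differs from BLASTn in that the nucleotide query is translated in all six reading frames before being compared with a protein database.  The Python sketch below illustrates only that translation step, not the BLAST search itself.  It assumes the standard genetic code, and the function names are illustrative rather than taken from any BLAST software.

```python
# Conceptual sketch of six-frame translation (the first step of a BLASTx search).
# Assumes the standard genetic code; all names here are illustrative.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Build the codon -> amino acid lookup from the compact standard-code string.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return translations for frames +1..+3 and -1..-3 keyed by frame name."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

Calling six_frame_translation on a predicted coding sequence yields six peptide strings; a stop codon appears as "*", so a frame with a long stretch free of "*" is a candidate open reading frame.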

 

CpGPlot/CpGReport:


The next program used to identify coding regions of the sample DNA was the CpGPlot program found at the European Bioinformatics Institute.  This program scans the sequence for areas with a high concentration of G and C.  Detection of regions of genomic sequence that are rich in the CpG pattern is important because such regions are resistant to methylation and tend to be associated with genes that are frequently switched on (3).  Often CpG islands overlap the promoter and extend about 1000 base pairs downstream into the transcription unit.  Locating these islands helps to confirm the existence and location of possible coding sequences.
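The kind of scan described above can be sketched as a sliding window that reports G+C content and the ratio of observed to expected CpG dinucleotides.  This is a minimal illustration of the idea, not CpGPlot's actual implementation.  The window size and the thresholds (G+C above 50% and an observed/expected ratio above 0.6) are assumptions based on commonly used CpG island criteria, and the function names are hypothetical.

```python
# Minimal sliding-window sketch of CpG-rich region detection.
# Window size and thresholds are assumed values, not settings from this project.

def cpg_window_stats(window):
    """Return (percent G+C, observed/expected CpG ratio) for one window."""
    window = window.upper()
    n = len(window)
    c_count = window.count("C")
    g_count = window.count("G")
    cpg_count = window.count("CG")
    gc_percent = 100.0 * (c_count + g_count) / n if n else 0.0
    # Expected CpG count if C and G were placed independently in the window.
    expected = c_count * g_count / n if n else 0.0
    obs_exp = cpg_count / expected if expected else 0.0
    return gc_percent, obs_exp


def candidate_windows(seq, window=100, step=1, min_gc=50.0, min_obs_exp=0.6):
    """Return 1-based start positions of windows that pass both thresholds."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        gc_percent, obs_exp = cpg_window_stats(seq[start:start + window])
        if gc_percent > min_gc and obs_exp > min_obs_exp:
            hits.append(start + 1)
    return hits
```

Runs of adjacent passing windows would then be merged into candidate islands; that merging step and any minimum island length are left out of this sketch.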

The entire sample sequence was used in a CpGPlot search (the p represents the phosphate linkage) to find CpG islands that might be a sign of a highly expressed gene.  The options of the scan were altered to include the reverse sequence and the reverse-complement sequence.  The results were compared to the predicted gene locations provided by GENSCAN.


Results:

BLASTn:


When the whole DNA sample was placed in a BLASTn search, many significant hits resulted.  One of these hits was a reverse transcriptase from Aedes aegypti; another was a ferritin heavy chain-like protein, also from Aedes aegypti.  The BLAST showed that the section of the sample DNA from base 13236 to 14631 corresponded to a section of an Aedes aegypti DNA sequence that codes for transposable elements.  The section that the sample DNA matches is part of the coding region for a reverse transcriptase domain (Z86117).  This match produced a bit score of 2409 and an E value of 0.0.


The second-highest score corresponded to a ferritin subunit in Aedes aegypti.  This hit had an E value of 0.0 and a bit score of 1641.  The region where the sequences overlapped contained a ferritin subunit coding sequence (L37082).  This gene occurs on the sample sequence in approximately the 5000 to 6000 base pair range.
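The E values of 0.0 follow from the size of the bit scores.  In the standard BLAST statistics, the expected number of chance hits is E = m * n * 2^(-S'), where S' is the bit score, m is the query length, and n is the total database length.  The sketch below evaluates that relation; the query and database lengths used in the example are hypothetical placeholders, not values from this project.

```python
# Sketch of the bit score to E value relation: E = m * n * 2^(-S').
# The lengths passed in the usage example are hypothetical placeholders.

def expected_e_value(bit_score, query_length, database_length):
    """Expected number of chance alignments scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# A bit score in the thousands drives E below the smallest representable
# float, so the value underflows to zero and is reported as 0.0.
print(expected_e_value(2409, 15_000, 10**10))
print(expected_e_value(1641, 15_000, 10**10))
```

For comparison, a bit score of 50 with the same relation gives a small but nonzero E value, which is why only very strong matches are reported as exactly 0.0.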

Figure 1: Results from the whole sequence BLASTn


GENSCAN:
 

The GENSCAN output showed 4 possible coding sequences.  Each sequence was placed in a BLASTn to search for similar sequences.

When the first sequence presented by GENSCAN was placed in a BLASTn search, the results were not significant.  The sequence was then used in a BLASTx search.  The results from the BLASTx showed significant hits to proteins from Anopheles gambiae.  The functions of these proteins were unknown and thus could not aid in identifying the unknown sequence.  No valuable data were obtained from the BLAST searches of sequence 1.  The GENSCAN results for the first identified coding sequence show that it lies on the negative strand of the DNA.  They also show that the coding sequence may contain four introns and a promoter.

Figure 2: Results for the BLASTn of sequence 1


Figure 3: Results for the BLASTx of sequence 1

 

Sequence 2 from the GENSCAN output was then placed in a BLASTn queue.  The results of this search showed significant hits to Aedes aegypti ferritin subunits, specifically a ferritin heavy chain-like protein.  The BLASTx of sequence 2 also produced significant hits to the Aedes aegypti ferritin heavy chain-like protein.

Figure 4: Results of the BLASTn for sequence 2

Figure 5: Results of the BLASTx for sequence 2

 

Sequence 3 of the GENSCAN output gave interesting results.  Only a few significant hits came back, and all of them involved an Abdominal-B protein, although these proteins belonged to a variety of different insects.  The BLASTx of sequence 3 also returned hits to Abdominal-B proteins, although these were less significant.


Figure 6: Results of the BLASTn from sequence 3

 
Figure 7: Results of the BLASTx of sequence 3

 

The BLASTn of the final sequence specified by GENSCAN produced very significant results.  These results corresponded to transposable elements found in Aedes aegypti.  The BLASTx of sequence 4 found significant hits to unnamed Anopheles gambiae proteins and to synthetic reverse transcriptases.

Figure 8: Results of the BLASTn for sequence 4

Figure 9: Results of the BLASTx for sequence 4



GENSCAN also produced an image that depicts the possible gene locations along the sample sequence.


Figure 10: Image of possible exons along the sample DNA sequence



CpGPlot/CpGReport:

The CpGPlot/CpGReport searches showed 12 possible CpG islands.  These islands were compared to the gene output from GENSCAN.  The CpGReport located the possible CpG islands in the sample sequence.  It also produced an output that showed the location and G+C percentage of each island.  This output can be seen here: CpGReport Output.
Figure 11: Output from CpGPlot


50..1490
1822..2253
2567..2766
4117..4418
7436..7699
8802..9450
11715..12048
12177..12408
12538..12824
12877..13147
13336..13866
14116..14364

Table 1:  List of CpG Island locations in base pairs
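Comparing island positions with a region of interest amounts to an interval-overlap check.  The sketch below uses the coordinates from Table 1 and, as an example region, the BLASTn reverse transcriptase match at bases 13236 to 14631 reported above.  The GENSCAN gene coordinates are not listed in this report, so they are not used here, and the function names are illustrative.

```python
# Interval-overlap sketch using the CpG island coordinates from Table 1.
# Coordinates are 1-based and inclusive; function names are illustrative.
CPG_ISLANDS = [
    (50, 1490), (1822, 2253), (2567, 2766), (4117, 4418),
    (7436, 7699), (8802, 9450), (11715, 12048), (12177, 12408),
    (12538, 12824), (12877, 13147), (13336, 13866), (14116, 14364),
]


def island_length(island):
    """Length in base pairs of an inclusive (start, end) interval."""
    start, end = island
    return end - start + 1


def overlapping_islands(region, islands=CPG_ISLANDS):
    """Return the islands that share at least one base with region."""
    region_start, region_end = region
    return [
        (start, end)
        for start, end in islands
        if start <= region_end and end >= region_start
    ]


# Example region: the BLASTn reverse transcriptase match (bases 13236-14631).
print(overlapping_islands((13236, 14631)))
# Prints [(13336, 13866), (14116, 14364)]
```

The same function could be applied to each predicted gene once its start and end coordinates are read from the GENSCAN output.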


Discussion:

The original BLASTn using this sequence resulted in significant hits to the copia-like transposable element ZebedeeI, the LINE-like element JAM1, and a ferritin heavy chain-like protein.  The GENSCAN search predicted four genes in the sequence.  The CpGPlot/CpGReport showed many CpG islands in the areas of the GENSCAN-predicted genes, although a few CpG islands do not correspond to any predicted gene.  These inconsistencies can be explained by the fact that CpGPlot analysis is a powerful tool for analyzing mammalian genomes but loses power the more distantly related the organism being studied is to mammals.

The first gene predicted by GENSCAN showed no significant hits using BLASTn, but the BLASTx found significant hits to unknown proteins in Anopheles gambiae.  These hits to unknown proteins do not aid in annotating the sequence.  The GENSCAN results, on the other hand, show that this gene has no predicted poly-A signal or terminal exon; therefore, it may be a pseudogene brought along by the retrovirus gene, or it could have been erroneously fragmented during library creation.

For the second gene predicted by GENSCAN, the BLASTn and BLASTx produced significant results pointing toward a ferritin heavy chain-like protein from Aedes aegypti.  This predicted gene would occur on the antisense strand of the provided sample DNA.  According to the GENSCAN output, this predicted gene has an initiation and termination exon, a promoter, and a poly-A signal.  All of these are needed for an expressible gene.

The third gene predicted by GENSCAN obtained a hit to an Abdominal-B protein.  This gene aids in the development of the posterior abdominal segments (2).  According to GENSCAN, this proposed gene also has an initiation and termination exon, a promoter, and a poly-A signal.  The Abdominal-B protein BLAST results came from an assortment of insects.  A reason for this could be that the Abdominal-B protein is a conserved protein that many insects use in development.  The presence of this gene could also have resulted from its being carried along by the reverse transcriptase.

The significant hits to the copia-like transposable element ZebedeeI and the LINE-like element JAM1 were repeated with the fourth gene predicted by GENSCAN.  Both BLASTn and BLASTx results pointed toward reverse transcriptase genes.  According to the GENSCAN results, this gene has an internal and termination exon along with a poly-A signal, but it lacks a promoter and an initiation exon.

This sample of DNA could possibly contain 3 genes: a ferritin heavy chain-like protein, an abdominal-B protein, and a reverse transcriptase.  The ferritin and reverse transcriptase genes seem to be the only areas of importance.  The abdominal-B protein is most likely an artifact brought along by the reverse transcriptase. 

 

The probable genes in this Aedes aegypti genomic library sequence are as follows.  The first is a ferritin heavy chain-like protein gene in the 5000 to 7000 base pair range on the antisense strand, with an initiation and termination exon, a promoter, and a poly-A signal.  The second possible gene is a reverse transcriptase that starts at approximately 1200 base pairs and continues to the end of the sequence.  This gene contains an internal and terminal exon and a poly-A signal; however, it is missing a promoter and an initiation exon.  This truncation of the sequence could have occurred during the library formation.

References:

[1]  National Center for Biotechnology Information. (2000). [Online]. Available: URL ncbi.nlm.nih.gov [December 4, 2003].

[2]  Homeobox Genes DataBase (200). [Online]. Available: URL http://www.iephb.nw.ru/labs/lab38/spirov/hox_pro/abd-b.html [December 4, 2003].

[3]  European Bioinformatics Institute. (2003).  [Online]. Available: URL http://www.ebi.ac.uk/ [December 4, 2003].

[4]  The New GENSCAN server at MIT (2003).  [Online]. Available: URL http://genes.mit.edu/GENSCAN.html [December 4, 2003].