I. BIO 325 Bioinformatics Lab Group Gene/Protein Assignments* for Sequence Data Sets

Instructor: Fornari

Fall 2020 (Genes and Disease)

Lab Grp# Author Name(s) Gene/Protein/Disease RefSeq # (protein)
1      
2      
3      
4      
5      
6      
7      
8      
9      
10      
11      
12      
13      
14      



*Specific Assignments: I. Biological & Biochemical Characterization of your Query Sequence (QS);
Construction of your Sequence Data Set (SDS)

1. Each assigned entry is a GENE and its PROTEIN product, and each one belongs to a Gene Family; your first task is to inform yourself as accurately as possible about the molecular, biochemical, and biological properties and features of your gene/protein. For example:

  • Perform a Gene Ontology search and analysis as you did in a previous Lab Exercise; find the Accession number, KEGG number, RefSeq numbers (for gene, RNA, and protein), and UniProt entries; basically follow that exercise and complete it for your selected gene, protein, and disease. Required: a completed Table of General Gene and Protein Information.
  • In which family or super-family does the gene and its protein reside? What characteristic(s) define the family? See: Protein Families and Phylodome (for a broader perspective on the role of domains in protein evolution, and the distribution of domains over lineages).
  • What is the size of the gene and its protein product? In other words, what are some of its physical properties? Can you identify characteristic domains or motifs?
  • What does it do in the cell? What is its function? What disease exactly is caused by mutation(s) in this gene? Remember: Structure to Function relationships!
  • Where in the cell does it function? Every protein functions within a cellular context (a place where a biological process occurs with its associated molecular functions).
  • On which chromosome in the human genome does this gene reside? Use the NCBI Gene ID page to assign its exact location, and provide a completed graphic display from EMBL or HGNC or UCSC Genome Browser (for example, NCBI Entrez > 'Gene' database in Entrez > 'Ensembl' link > Chromosome link at the 'Location' entry; copy and paste the image, then add an exact marker rectangle in red by inserting a 'shape').

An early introduction to MEGA 7.0 would be most helpful (associated paper in your I:/drive Labs folder), and we'll start early in the lab sessions with this program; also see the Online Manual and a relevant paper (Building Phylogenetic Trees from Molecular Data with MEGA)

Helpful sites: HGNC; NCBI Entrez; Ensembl; the Gene Ontology website; Pfam Home Page; InterPro Home; Bioinformatics Toolkit; NAR Database and Web Server.

How will you begin and sustain this particular task? How will you maintain the results of your research into this gene/protein and its family?

2. Start a collection of your gene/protein from various organisms to build an SDS or Sequence Data Set. One way to proceed is to search NCBI's Entrez with the name of your protein. You will get back many results, but click on the 'Proteins (protein sequences)' database link, and you should see a list of sequence references for your protein. Choose an appropriate one (from Homo sapiens) as the query sequence (QS) for a BLASTP, PSI-BLAST, and DELTA-BLAST search. Recall from class discussions how to adjust and use the various options in BLAST, especially the scoring matrices; in other words, have you estimated the degree of expected divergence in your gene/protein? Are you seeking closely related or more distantly diverged relationships? Don't overlook the important handout (GetTZs_checks.pdf) for alternative/extra ways of finding distantly-related protein sequences; and don't overlook one of the main goals of the project: to find the most distantly-related yet homologous protein sequence to your QS.

3. Compare and contrast the results of your BLAST searches, and examine the end of each BLAST output to see relevant data and information on the search (as discussed in class). Use a protein E-value cut-off of < 10^-3 and a sequence identity of > 25% for a protein sequence of 150 a.a. or more, or > 40% sequence identity for sequences of around 70 a.a. Select your sequence data set (SDS) by Downloading and Saving (in the appropriate format as discussed in class) only the sequences meeting these criteria. The exception is when your E-values show an abrupt change from the recommended low range to an unacceptably high range; if they show such a transition, then consider using the protein sequences up to this abrupt transition for your SDS. A main objective is to find one or more protein sequences in the 'Twilight Zone' (TZ), and then to use 2 or more methods to determine whether or not these TZ proteins are true homologs to your QS.
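
The selection criteria above can be sketched in a few lines of Python. This is only an illustration, not the procedure discussed in class: the `Hit` record and its field names are hypothetical, the hits would have to be typed in or parsed from your own BLAST output, and the 40% threshold is applied here to every sequence shorter than 150 a.a., which is an assumption (the text only gives values for "150 a.a. or more" and "around 70 a.a."). The abrupt E-value transition rule is not handled and still needs to be judged by eye.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 1e-3  # from the text: E-value < 10^-3


@dataclass
class Hit:
    """One BLAST hit (hypothetical record; field names are illustrative)."""
    accession: str
    evalue: float
    percent_identity: float
    length: int  # length of the protein sequence in amino acids


def identity_threshold(length: int) -> float:
    """Minimum percent identity for a sequence of the given length.

    > 25% for 150 a.a. or more (from the text); > 40% otherwise.
    Using 40% for everything under 150 a.a. is an assumption.
    """
    return 25.0 if length >= 150 else 40.0


def passes_criteria(hit: Hit) -> bool:
    """True if the hit meets both the E-value and the identity criteria."""
    return (hit.evalue < E_VALUE_CUTOFF
            and hit.percent_identity > identity_threshold(hit.length))


def select_sds(hits):
    """Keep only the hits that meet the criteria, in their original order."""
    return [h for h in hits if passes_criteria(h)]
```

For example, a hypothetical 70 a.a. hit with 30% identity would be rejected even with a very low E-value, while a 180 a.a. hit with 28% identity and an E-value of 1e-5 would be kept.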

4. You will most likely need to "clean up" your SDS by trimming the sequences to a uniform size, by eliminating fragments or redundant entries of the same sequence, and by distributing percent identities over a balanced range; also eliminate excess closely related sequences from the same organism, and make a good effort to find and use RefSeq sequences, as discussed in class.
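
Part of this clean-up can be automated. The sketch below, in plain Python, reads FASTA text, drops fragments, and drops exact duplicate sequences. The function names and the 50% length cut-off for calling something a "fragment" are assumptions made for illustration; trimming to a uniform size and balancing percent identities still require inspecting the alignment yourself.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def clean_sds(records, query_length, min_fraction=0.5):
    """Drop fragments and exact duplicate sequences.

    A record counts as a fragment if it is shorter than
    min_fraction * query_length (the 0.5 default is an assumption).
    The first occurrence of each distinct sequence is kept.
    """
    seen = set()
    kept = []
    for header, seq in records:
        if len(seq) < min_fraction * query_length:
            continue  # fragment
        if seq in seen:
            continue  # redundant entry of the same sequence
        seen.add(seq)
        kept.append((header, seq))
    return kept
```

Calling `clean_sds(parse_fasta(text), query_length=len(qs))` with your own FASTA text and query sequence returns the surviving records in their original order.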

II. Project Assignments, contd.; making MSAs

5. Create an MSA as discussed in class by using the pre-configured program packages in MEGA 7 or CLUSTAL Omega; you may use EBI's site for easy use of CLUSTAL programs along with JalView for viewing and editing the alignment. For possible comparisons, consider also using MAFFT, MUSCLE, or HMM at EBI.

6. An MSA created with the help of an alignment program that uses structures as input data is required. See these highly recommended NCBI sites as one starting point: NCBI Structure and the MMDB (NCBI's Molecular Modeling Database). An equally excellent alternative is the Chimera Program with its powerful functions (see the 'Sequence Comparison' and 'Sequence' entries in the Tools dropdown menu for the MultiAlign Viewer with MatchMaker, Match/Align). A powerful option is to use the ConSurf Server in conjunction with Chimera.

7. Analyze, adjust, and save your MSA in an appropriate file format for the Tree building or Structure comparing programs. Catalog your results, especially from any comparisons generated from using different algorithms within the same MSA program, or from using different programs (MAFFT or MUSCLE or HMM or one of your choosing). Use GUIDANCE to evaluate your MSAs, or re-align your SDS into an evaluated MSA.
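
When cataloging MSAs from different programs, a few simple numbers per alignment make the comparison easier to tabulate. The following Python sketch is one possible way to do this and is not a replacement for GUIDANCE; the function name and the choice of statistics (alignment length, gap fraction, count of fully conserved columns) are assumptions made for illustration.

```python
def msa_summary(aligned):
    """Summarize an MSA given as a list of equal-length strings ('-' = gap).

    Returns the number of columns, the fraction of all cells that are gaps,
    and the number of gap-free columns in which every residue is identical.
    """
    if not aligned:
        raise ValueError("The alignment is empty")
    n_columns = len(aligned[0])
    if any(len(row) != n_columns for row in aligned):
        raise ValueError("All rows of an MSA must have the same length")

    total_cells = n_columns * len(aligned)
    gap_cells = sum(row.count("-") for row in aligned)
    conserved = sum(
        1
        for column in zip(*aligned)  # iterate over alignment columns
        if "-" not in column and len(set(column)) == 1
    )
    return {
        "columns": n_columns,
        "gap_fraction": gap_cells / total_cells if total_cells else 0.0,
        "conserved_columns": conserved,
    }
```

Running `msa_summary` on the MAFFT, MUSCLE, and CLUSTAL Omega alignments of the same SDS gives three small records that can go side by side in your results catalog.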

8. Once again, when presented with opportunities to adjust parameters, especially the scoring matrices, have you estimated the degrees of divergence anticipated in your data set?
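
One rough way to estimate divergence is the mean pairwise percent identity of a draft alignment. The Python sketch below computes it and then suggests a BLOSUM matrix using the general rule of thumb that higher-numbered BLOSUM matrices suit closely related sequences and lower-numbered ones suit more divergent sequences. The function names and the 80% and 40% cut-offs are illustrative assumptions, not values from class; treat the suggestion as a starting point only.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent identity of two aligned sequences, ignoring gapped positions."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def mean_pairwise_identity(aligned):
    """Average percent identity over all pairs of aligned sequences."""
    values = [percent_identity(a, b) for a, b in combinations(aligned, 2)]
    if not values:
        raise ValueError("Need at least two sequences")
    return sum(values) / len(values)


def suggest_matrix(mean_identity):
    """Suggest a BLOSUM matrix (cut-offs are illustrative assumptions)."""
    if mean_identity >= 80.0:
        return "BLOSUM80"  # closely related sequences
    if mean_identity >= 40.0:
        return "BLOSUM62"  # general-purpose default
    return "BLOSUM45"      # distantly related sequences
```

For instance, `suggest_matrix(mean_pairwise_identity(rows))` on a data set dominated by Twilight Zone sequences would point toward BLOSUM45.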

III. Project Assignments, contd.; building Trees and Structure alignments

9. Create Trees from your MSAs with programs in MEGA and phylogeny.fr, or with some other type of program (FigTree); you may need to examine your outputs carefully and use some of the functions provided by the programs to adjust gap placements, etc. NOTE: a tree is only as good as the MSA that went into it! Also see TreeDomViewer and TreeDyn; for those comfortable with 'command-line' entries, an excellent software package based in the 'R' language is APE. A powerful suite of programs with a steep learning curve is Mesquite. One way to superimpose Trees to see their differences is by TreeJuxtaPoser with software at Bio-Soft.net.

10. For structural analyses, consider 3-D PSSM, MaTras, Chimera, or the highly recommended and upgraded sites at NCBI: NCBI Structure and the MMDB page (Molecular Modelling Database). Once again, save and catalog your data and results from any comparisons and analyses of discrepancies, unexpected outcomes, or new discoveries!

IV. Tree Evaluations and determination of statistical significance scores

11. What's a Bootstrap analysis? How would you evaluate the Bootstrap analysis output? Do all tree-building algorithms depend on a Bootstrap analysis for statistical validation? See also TreeFam at the Sanger Institute for further analysis of gene trees as compared to species trees; compare your tree to others that may have been published (TreeBASE).
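
To make the idea behind a Bootstrap analysis concrete, here is a minimal Python sketch. Each replicate is built by resampling the columns of the MSA with replacement, and the support value is the fraction of replicates in which a grouping reappears. As a simplifying assumption, the "grouping" tested here is only whether two sequences are the closest pair by p-distance; a real analysis (for example in MEGA) rebuilds a full tree for every replicate. All function names are illustrative.

```python
import random
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing residues between two aligned sequences,
    ignoring positions where either sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)


def resample_columns(aligned, rng):
    """One bootstrap replicate: draw columns with replacement."""
    n_columns = len(aligned[0])
    picks = [rng.randrange(n_columns) for _ in range(n_columns)]
    return ["".join(row[i] for i in picks) for row in aligned]


def closest_pair(aligned):
    """Indices (i, j), i < j, of the two sequences with the smallest
    p-distance. Ties go to the first pair in index order."""
    return min(
        combinations(range(len(aligned)), 2),
        key=lambda ij: p_distance(aligned[ij[0]], aligned[ij[1]]),
    )


def bootstrap_support(aligned, pair, n_replicates=100, seed=0):
    """Fraction of replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)
    hits = sum(
        closest_pair(resample_columns(aligned, rng)) == pair
        for _ in range(n_replicates)
    )
    return hits / n_replicates
```

A value near 1.0 means the grouping survives almost every resampling of the columns, while a low value means it depends on only a few columns of the alignment.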