FAQ - Training

Publications or Web Tools that might interest you:

NGS

Great introduction to genome assembly! "De novo genome assembly: what every biologist should know" :

http://www.nature.com/nmeth/journal/v9/n4/full/nmeth.1935.html

Everything you ever wanted to know about genome assembly, but were too afraid to ask :

http://www.cbcb.umd.edu/research/assembly_primer.shtml

A wiki database of tools for high-throughput sequencing analysis:

http://seqanswers.com/wiki/SEQanswers

How can I get an idea about the evolutionary origin and relationships of a protein? A colleague suggested psiBLAST but I would like to know if there is something better (and/or easier) available that can also make phylogenetic trees to illustrate important evolutionary distances.

You can try NCBI BLAST. From the blastp interface, you can choose either blastp or PSI-BLAST as the algorithm for comparing your sequence against any of the databases (the database can be restricted by organism or taxonomic group if you want to focus on specific species). Whichever algorithm you use, once you have the results you can select some of the sequences found by blastp or PSI-BLAST (from the first or second run).
Then, at the end of the form, you can ask for (1) a distance tree of the results or (2) the selected sequences or a multiple alignment.
The first option gives you a tree built with a distance method (only two methods are available through the interface).
The second option lets you use other software to run multiple comparisons and/or compute phylogenetic trees (with many more methods).
For this, SeaView (http://pbil.univ-lyon1.fr/software/seaview.html) is a very nice tool that is easy to install.
It can compute multiple alignments (with different algorithms) and build trees using different methods.
Here is a practical session proposed by Guy Perriere (from the LBBE, Lyon): http://lbbe.univ-lyon1.fr/-Perriere-Guy-.html

I also have a few proteins with no known motifs. How do I do all of the above with these?

If you want to compare them, you can use SeaView to compute a global multiple alignment (the aim is to compare the sequences over their whole length). If you want instead to identify specific regions shared by some of the sequences, a local multiple alignment is better suited (for instance with MEME, http://meme.sdsc.edu/meme/intro.html).

Blast

Should I use protein or DNA in these searches?

If you are interested in long evolutionary distances, using protein sequences is recommended.

How To: Submit multiple query sequences in a single BLAST search

http://www.ncbi.nlm.nih.gov/guide/howto/submit-mult-seq-blast/

Using BLAST to Teach “E-value-tionary” Concepts

http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001014

http://joyeuserrance.wordpress.com/2009/07/12/significativite-dun-alignement-pour-blast/

A quick guide to Blast:

http://www.embnet.org/files/WebFM/PPRPC_group/QuickGuides/guideBLAST.pdf

Protein Analysis

Using databases (primary or secondary sequences) and sequence alignment tools

Tools for predicting or analysing/visualizing the structure

Bioanalysis in general (Tutorials, web sites...)

Where to find tools? 

The Bioinformatics Links Directory features curated links to molecular resources, tools and databases. The links listed in this section are selected on the basis of recommendations from bioinformatics experts in the field. We also rely on input from our community of bioinformatics users for suggestions. Since 2003, we have also listed all links contained in the NAR Web Server issue:

http://bioinformatics.ca/links_directory/

Quick introduction to Bioinformatics (blast, phrad, using perl, phylip....) :

http://www.embnet.org/QuickGuides

FAQ about basic Unix commands

How to unpack the sequences of a multi-sequence file into individual FASTA-formatted files, using the EMBOSS command seqretsplit:

seqretsplit testseq.fas
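If EMBOSS is not installed, the same kind of splitting can be approximated with awk. This is only a rough sketch and not a replacement for seqretsplit: the function name split_fasta and the output directory are hypothetical, and it assumes that the first word of each header line (right after ">") is unique and safe to use as a file name.

```shell
# Sketch: split a multi-FASTA file into one file per record using awk.
# Assumption: the first word of each header is unique and usable as a file name.
split_fasta() {
    # $1 = multi-FASTA input file, $2 = output directory (hypothetical names)
    mkdir -p "$2"
    awk -v dir="$2" '
        /^>/ {
            if (out) close(out)          # finish the previous record
            name = substr($1, 2)         # first header word without ">"
            out = dir "/" name ".fasta"
        }
        out { print > out }              # copy header and sequence lines
    ' "$1"
}

# Small demonstration on a throwaway file
demo_dir=$(mktemp -d)
printf '>seq1 first\nACGT\nGGCC\n>seq2 second\nTTAA\n' > "$demo_dir/testseq.fas"
split_fasta "$demo_dir/testseq.fas" "$demo_dir/split"
ls "$demo_dir/split"
```

Each record ends up in its own file (here seq1.fasta and seq2.fasta), with the full header line preserved.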

How to write a simple loop that checks each sequence against the other ones. Use the shell's for loop like this:

for i in *.fasta
do
    water -asequence="$i" -bsequence=testseq.fas -auto
done

This runs the EMBOSS program "water" (more about this next week) once for each FASTA file in the current directory (*.fasta), aligning it against every sequence in testseq.fas. The -auto option sets all other parameters to their default values and stops the program from asking for any other input.
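The loop above also compares each sequence with its own copy in testseq.fas. If you would rather compare each pair of individual files exactly once, a nested loop can do it. The sketch below is a dry run that only prints the water commands; the function name pairwise_water_commands and the a_vs_b.water naming of the output files are assumptions made for illustration, and file names are assumed to contain no spaces.

```shell
# Sketch: build one water command per unordered pair of FASTA files.
# The commands are only printed (dry run); nothing is executed here.
pairwise_water_commands() {
    # "$@" = list of single-sequence FASTA files
    while [ "$#" -gt 1 ]; do
        a=$1
        shift                            # remaining files are the partners of $a
        for b in "$@"; do
            # The -outfile value is a hypothetical naming scheme
            printf 'water -asequence=%s -bsequence=%s -outfile=%s_vs_%s.water -auto\n' \
                "$a" "$b" "${a%.fasta}" "${b%.fasta}"
        done
    done
}

pairs=$(pairwise_water_commands a.fasta b.fasta c.fasta)
printf '%s\n' "$pairs"
```

Once the printed commands look right, they can be run by piping them to sh, for example: pairwise_water_commands *.fasta | sh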

How to format a sequence file:

ReadSeq is a program and library for conversion of biosequence data from one format to another, useful in various bioinformatics programs and services. It is written in Java, though an earlier version in C remains available. It can be used from this website (http://www.ebi.ac.uk/cgi-bin/readseq.cgi) or from your computer after installing it (http://readseq.sourceforge.net/).

readseq can also be used from the command line:

readseq inputFile -all -format=fmt -output=outputFile

fmt is a string that specifies the output format, for instance GenBank|gb, EMBL|em, Pearson|Fasta|fa, Clustal, MSF, XML, ...

How to get a FASTA file from a file produced by PHYLIP?

readseq sequences.phy -all -format=Pearson/Fasta -output=sequences.fasta
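To see what this conversion amounts to, here is a minimal awk sketch for the simplest case only. It is not how readseq works internally. It assumes a sequential PHYLIP file in which each sequence sits on a single line and the name is separated from the sequence by whitespace; interleaved files, or strict files where a 10-character name runs straight into the sequence, are not handled. The function name phylip_to_fasta is hypothetical.

```shell
# Sketch: convert a simple sequential PHYLIP file to FASTA with awk.
# Assumption: one sequence per line, name and sequence separated by whitespace.
phylip_to_fasta() {
    awk '
        NR == 1 { next }                 # skip the "number length" header line
        NF >= 2 {
            printf ">%s\n", $1           # sequence name becomes the FASTA header
            seq = ""
            for (i = 2; i <= NF; i++)    # join blocks, dropping spaces in the sequence
                seq = seq $i
            print seq
        }
    ' "$1"
}

# Small demonstration on a throwaway file
phy_file=$(mktemp)
printf ' 2 8\nhuman     ACGTACGT\nmouse     ACGA ACGT\n' > "$phy_file"
fasta_out=$(phylip_to_fasta "$phy_file")
printf '%s\n' "$fasta_out"
```

For real data, readseq remains the safer choice because it recognises the different PHYLIP variants.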

Training materials from the BTN (Bioinformatic Training Networks)