Applied Bioinformatics - Vanderbilt...
Applied Bioinformatics Course Overview & Introduction to Linux
Bing Zhang
Department of Biomedical Informatics
Vanderbilt University
What is bioinformatics
Applied Bioinformatics, Spring 2013 2
Bioinformatics = Bio + Data + informatics
- Bio: Hypotheses, Questions, Samples, Experiments
- Data: DNA, RNA, Protein, Metabolite, Phenotype; Sequence, Expression, Structure, Interaction
- informatics: Storage/retrieval, Visualization, Computational methods, Statistical methods
Genomic sequences
http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html
Human genome project (1990-2003)
First bacterial (H. influenzae)
First eukaryote (yeast)
First metazoan (C. elegans)
http://www.genomesonline.org
Completely Sequenced Genomes September 2012
Genome sequencing costs plunge
The Cancer Genome Atlas (TCGA)
- Mission (Bio)
  - To accelerate the understanding of the molecular basis of cancer through the application of genome analysis technologies.
- 2014 target (Data)
  - 25 tumor types x 500 cases each
  - Exome/whole genome sequencing
  - Copy number variation
  - Promoter methylation
  - mRNA expression
  - miRNA expression
[Figure panels: Glioblastoma; Ovarian Cancer (from the TCGA data portal)]
Why now?
Bio + Data + informatics
- Bio: Hypotheses, Questions, Samples, Experiments
- Data: DNA, RNA, Protein, Metabolite, Phenotype; Sequence, Expression, Structure, Interaction
- informatics: Storage/retrieval, Visualization, Computational methods, Statistical methods
Roles for different investigators in bioinformatics
- Algorithm developer
  - Statisticians
  - Mathematicians
  - Computer scientists
- Tool developer
  - Bioinformaticians
- Data provider/consumer
  - Biologists
Graph courtesy of http://www.incogen.com/
Comprehensive resource list
http://www.bioinformatics.ca/links_directory/
Sequence and structure databases
- GenBank: http://www.ncbi.nlm.nih.gov/genbank/
  - Annotated collection of all publicly available DNA sequences
  - 126,551,501,141 bases in 135,440,924 sequences as of April 2011
- UniProt: http://www.uniprot.org/
  - Comprehensive resource for protein sequences and functional information
  - 534,242 reviewed entries as of January 2012
- PDB: http://www.rcsb.org/
  - 3D structures of large biological molecules, including proteins and nucleic acids
  - 79,180 structures as of February 2012
- Pfam: http://pfam.sanger.ac.uk/
  - Collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
  - 13,672 families as of November 2011
Genome browsers
- UCSC genome browser
  - http://genome.ucsc.edu/cgi-bin/hgGateway
- Ensembl genome browser
  - http://www.ensembl.org/index.html
Gene-centric databases
- Entrez Gene
  - http://www.ncbi.nlm.nih.gov/gene
  - NCBI/NIH
  - All completely sequenced genomes
  - One gene per page
- Ensembl BioMart
  - http://www.ensembl.org/biomart/martview
  - EMBL-EBI and Sanger Institute
  - Vertebrates and other selected eukaryotic species
  - Batch information retrieval
Gene expression data
- Gene Expression Omnibus (GEO)
  - http://www.ncbi.nlm.nih.gov/geo/
- ArrayExpress
  - http://www.ebi.ac.uk/arrayexpress/
Pathway and network resources
- Gene Ontology (GO): http://www.geneontology.org/
- Pathway databases
  - KEGG: http://www.genome.jp/kegg/pathway.html
  - Reactome: http://www.reactome.org/
  - WikiPathways: http://www.wikipathways.org/
- Protein-protein interaction databases
  - DIP: http://dip.doe-mbi.ucla.edu/
  - MINT: http://mint.bio.uniroma2.it/mint/
  - BioGRID: http://www.thebiogrid.org/
  - HPRD: http://www.hprd.org
- Protein-DNA interaction database
  - Transfac: http://www.gene-regulation.com
Course content and grades
Applied Bioinformatics
IGP300B Bioregulation II, Spring 2013
(M/W/F, 10:00-10:55am, LH512)
Module director: Bing Zhang, Ph.D. ([email protected]; Department of Biomedical Informatics; 2525 West End Ave, Room 656; Phone: 936-0090)
Team members: William Bush, Ph.D., Qi Liu, Ph.D., Zhongming Zhao, Ph.D.
Date | Subject | Instructor | Homework (HW) / Project
2/15 | Course overview & Introduction to Linux | Zhang |
2/18 | Pairwise sequence alignment | Zhao |
2/20 | Multiple sequence alignment | Zhao |
2/22 | Inferring phylogenetic relationships | Zhao | HW I distribution (20 pts)
2/25 | Gene prediction | Bush |
2/27 | Gene regulatory elements and conservation | Bush | HW I due
3/1 | Assessing the impact of genetic variation | Bush | HW II distribution (20 pts)
3/4 | Supervised analysis of gene expression data | Zhang |
3/6 | Unsupervised analysis of gene expression data | Zhang | HW II due
3/8 | Functional interpretation of gene lists | Zhang |
3/11 | Next-Generation Sequencing data analysis | Liu | HW III distribution (20 pts)
3/13 | Project: Bioinformatics analysis and interpretation of genomic sequencing data | Zhang & Liu |
3/15 | | | HW III due
3/18 | | |
3/20 | Project presentation | | Project presentation (40 pts)
3/22 | | |

HW assignments will be graded by each instructor for their respective sections.
Final grade = sum of the three HW scores and the project presentation score (100 pts in total).
Grade scale: A: 85-100; B: 75-84; C: 65-74; D: 55-64; F: 0-54
Course materials and assignments
- Lecture slides are available at https://medschool.mc.vanderbilt.edu/iidea/admin/admin_login.php before each lecture
- Homework assignments are available at https://medschool.mc.vanderbilt.edu/iidea/admin/admin_login.php on the distribution dates (2/22, 3/1, 3/11)
- Homework assignments are due on paper at the beginning of class on the due dates (2/27, 3/6, 3/15). Late reports lose 10% per day.
- Start thinking about forming project teams (5-6 people per team)
- Instructor contact information
  - Dr. Bing Zhang: [email protected]
  - Dr. Zhongming Zhao: [email protected]
  - Dr. William Bush: [email protected]
  - Dr. Qi Liu: [email protected]
ACCRE
- Advanced Computing Center for Research & Education
  - http://www.accre.vanderbilt.edu/
  - The compute cluster currently consists of more than 500 Linux systems with quad- or hex-core processors
- Linux system
  - An operating system (OS), like Windows or Mac OS
  - Portable, multi-tasking, multi-user OS
  - High performance and free, making it ideal for high-performance computing clusters
Get an ACCRE account
- http://www.accre.vanderbilt.edu/?page_id=617
- Registration form
  - Name, VUNetID, Department (VU), School (VU), Email, Phone, Position
  - Group: IGP300b_ab (igp300b_ab)
  - Primary research area: bioinformatics
  - Primary application: Existing Application
  - Primary application name: R
  - Primary application type: Serial
  - Expected typical number of processors: NA
  - Expected typical number of concurrent running jobs: 1
  - Linux experience:
  - Expected compilers/languages: C, C++, R, perl, python
  - Expected external libraries: NA
  - BlueArc User: No
  - Other useful information: NA
Logging onto the cluster and changing the password
- Windows
  - Application: SSH (http://its.vanderbilt.edu/downloads)
  - Two steps: add profile -> edit profile
  - Host name: vmplogin.accre.vanderbilt.edu
  - Username: your_user_name
- Mac
  - Use Spotlight to find the application: Terminal
  - Command: ssh [email protected]
- Change password
  - rsh vmpsched
  - passwd
- Exit
  - exit
Logging onto the cluster and changing the password (using SSH in Windows)
Logging onto the cluster and changing the password (using Terminal in Mac)
You won't see any response while typing your password, which is fine.
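The login sequence can be sketched as a dry run. This is a minimal sketch, not the official ACCRE procedure: it assumes the standard ssh user@host form, assembled from the host name and the your_user_name placeholder given on the Windows slide, and it only prints the commands rather than opening a connection.

```shell
# Placeholder: replace with the user name issued with your ACCRE account.
user="your_user_name"
# Login host name as given on the slide.
host="vmplogin.accre.vanderbilt.edu"

# Assumption: the standard "ssh user@host" form, built from the two values above.
login_cmd="ssh ${user}@${host}"

# Print the steps instead of running them, so the sketch works without an account.
echo "1. Log in:           ${login_cmd}"
echo "2. On the cluster:   rsh vmpsched"
echo "3. Change password:  passwd   (typed characters are not echoed)"
echo "4. Leave:            exit"
```

To log in for real, type the printed command from step 1 into Terminal and follow the remaining steps at the cluster prompt.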
Hierarchical Filesystem
/
  bin/      chmod, cp, date, grep, mv, rm, vi
  usr/
    bin/    diff, find, gcc, id, make, perl, ssh
    lib/    libc.so, libgpfs.so, libjpeg.so, libstdc++.so
  home/
    igptest/
      bin/  myprog.sh, dothis.pl, dothat.py
      docs/
      src/  prog1.c, prog2.f77, prog3.cpp
    annie/
    cody/
  scratch/
  etc/
  tmp/

Example paths:
/home
/home/igptest
/home/igptest/src/prog3.cpp
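To make the idea of an absolute path concrete, here is a minimal sketch that rebuilds part of the igptest branch inside a temporary sandbox and splits the last example path into its parts. The sandbox and the use of basename and dirname are illustrative assumptions; they are standard utilities but do not appear on the slide.

```shell
# Work in a throwaway sandbox so nothing on the real system is touched.
root="$(mktemp -d)"

# Recreate part of the tree from the slide underneath the sandbox.
mkdir -p "$root/home/igptest/src" "$root/home/igptest/bin" "$root/home/igptest/docs"
touch "$root/home/igptest/src/prog3.cpp"

# An absolute path names every directory from the root (/) down to the file.
path="/home/igptest/src/prog3.cpp"
file_name="$(basename "$path")"     # last component of the path
parent_dir="$(dirname "$path")"     # everything before the last component

echo "file name:        $file_name"
echo "parent directory: $parent_dir"

# The file exists at the same path relative to the sandbox root.
ls "$root$path"

# Clean up the sandbox.
rm -rf "$root"
```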
Working with directories
- pwd (prints your present working directory)
- ls (lists directory contents)
- mkdir (makes a directory)
- cd (changes directories)
  - .. (parent directory)
  - . (current directory)
  - ~ or no parameter (home directory)
- rmdir (removes an empty directory)
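A short practice session with these commands might look like the sketch below. It runs inside a temporary sandbox created with mktemp, and the directory name project is a made-up example, so nothing in your real home directory changes.

```shell
# Start in a throwaway sandbox directory.
sandbox="$(mktemp -d)"
cd "$sandbox"

pwd                        # print the present working directory (the sandbox)
mkdir project              # make a directory (hypothetical name)
listing_before="$(ls)"     # list directory contents: now contains "project"
echo "$listing_before"

cd project                 # change into the new directory
cd ..                      # go back up to the parent directory
cd .                       # "." is the current directory, so this stays put

rmdir project              # remove the (empty) directory
listing_after="$(ls)"      # the sandbox is empty again

cd                         # no parameter: return to your home directory
pwd
rmdir "$sandbox"           # the sandbox itself is empty, so rmdir can remove it
```

Note that rmdir only succeeds on empty directories, which makes it a safe first choice while learning.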
Getting help
- man (displays the manual pages for a command)
  - man ls (displays the manual for the ls command)
  - space bar to show the next page
  - q to exit
- Options for ls
  - ls -a (do not ignore entries starting with .)
  - ls -l (use a long listing format)
  - ls -al (use a long listing format and do not ignore entries starting with .)
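The difference between the ls options is easiest to see on a directory that contains a hidden file. The sketch below assumes a temporary sandbox and two made-up file names; man is shown only as a comment because it opens an interactive pager.

```shell
# Throwaway sandbox with one visible and one hidden file (hypothetical names).
sandbox="$(mktemp -d)"
cd "$sandbox"
touch notes.txt .hidden_config

plain="$(ls)"        # default listing: entries starting with "." are skipped
all="$(ls -a)"       # -a also shows ".", "..", and .hidden_config

echo "ls:";     echo "$plain"
echo "ls -a:";  echo "$all"
echo "ls -l:";  ls -l      # long format: permissions, owner, size, date, name
echo "ls -al:"; ls -al     # long format, hidden entries included

# man ls   # opens the manual for ls; space bar for the next page, q to quit

# Clean up the sandbox.
cd /
rm -rf "$sandbox"
```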
Working with files
- more (displays the contents of a file)
  - space bar to show the next page
  - q to exit
- cp (copies files)
- mv (renames/moves files)
- rm (removes files)
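The file commands can be tried out safely in a sandbox, as in this minimal sketch. The file and directory names are invented for illustration, and cat stands in for more so the example runs without an interactive pager.

```shell
# Throwaway sandbox with one small text file (hypothetical names throughout).
sandbox="$(mktemp -d)"
cd "$sandbox"
printf 'example text\n' > sample.txt

cp sample.txt copy.txt        # copy: sample.txt and copy.txt now both exist
mv copy.txt renamed.txt       # rename: copy.txt becomes renamed.txt
mkdir archive
mv renamed.txt archive/       # move: renamed.txt now lives inside archive/

# "more archive/renamed.txt" would page through the file interactively;
# cat prints it in one go, which suits a non-interactive sketch.
contents="$(cat archive/renamed.txt)"
echo "$contents"

rm sample.txt                 # remove: there is no undo, so check the name first
remaining="$(ls)"             # only the archive directory is left
echo "$remaining"

# Clean up the sandbox (rm -r removes a directory and everything in it).
cd /
rm -rf "$sandbox"
```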
Editing files with nano
- cd (change to home directory)
- nano .bashrc (use nano to edit the file .bashrc, which contains commands that are executed each time a new shell starts)
- Add "setpkgs -a R" to the end of the file, typed with a plain hyphen (this lets you use the R environment for statistical computing that is installed on the ACCRE system)
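As an alternative to opening nano, the same line can be appended from the command line. This is a sketch of a different technique from the one on the slide, not a replacement for it: the BASHRC variable is a hypothetical name introduced here so the example writes to a scratch file by default, and the option is typed with a plain hyphen (-a).

```shell
# Assumption: BASHRC is a made-up variable for this sketch. By default it points
# at a scratch file; run with BASHRC="$HOME/.bashrc" to change the real file.
bashrc="${BASHRC:-$(mktemp)}"
line='setpkgs -a R'

# Append the line only if an identical line is not already present.
grep -qxF -- "$line" "$bashrc" || printf '%s\n' "$line" >> "$bashrc"

# Running the same command again changes nothing, so the line is never duplicated.
grep -qxF -- "$line" "$bashrc" || printf '%s\n' "$line" >> "$bashrc"

# Show the end of the file to confirm the edit.
tail -n 1 "$bashrc"
```

The change takes effect in shells started after the edit, so log out and back in before trying R.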
Copying files to/from a local computer
- Windows
  - Application: SSH (http://its.vanderbilt.edu/downloads)
- Mac
  - Application: Fugu (http://its.vanderbilt.edu/downloads)
  - Connect to: vmplogin.accre.vanderbilt.edu
  - Username: your_user_name
  - Don't change other items
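The slides use graphical clients, but the same transfers can also be done with the scp command that ships with OpenSSH. The sketch below is a command-line alternative, not the method shown in class: it only prints the commands (a dry run), your_user_name is the placeholder from the slide, and results.txt is a made-up local file name.

```shell
# Placeholder user name and the login host from the slide.
user="your_user_name"
host="vmplogin.accre.vanderbilt.edu"

# Upload: local file first, then user@host:destination (here the remote home directory).
upload_cmd="scp results.txt ${user}@${host}:~/"

# Download: remote file first, then the local destination ("." is the current directory).
download_cmd="scp ${user}@${host}:~/sample_file.txt ."

# Dry run: print the commands rather than contacting the cluster.
echo "Upload:   $upload_cmd"
echo "Download: $download_cmd"
```

Run the printed commands on your local computer, not on the cluster; scp asks for the same password as ssh.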
Copying files to/from a local computer (using SSH in Windows)
Copying files to/from a local computer (using Fugu in Mac)
Homework
- Get an ACCRE account
- Log onto the cluster and change your password
- Get familiar with the Linux commands introduced today
- Copy the file sample_file.txt from the directory /home/igptest to your home directory
- Add "setpkgs -a R" to the end of your .bashrc file