Applied Bioinformatics Course Overview & Introduction to Linux

Bing Zhang Department of Biomedical Informatics

Vanderbilt University

bing.zhang@vanderbilt.edu

What is bioinformatics?

Applied Bioinformatics, Spring 2013 2

Bio informatics

Data

§ Hypotheses § Questions § Samples § Experiments

§ DNA § RNA § Protein § Metabolite § Phenotype

§ Sequence § Expression § Structure §  Interaction

§ Storage/retrieval § Visualization § Computational methods § Statistical methods

Bioinformatics

Genomic sequences

http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html

Human genome project (1990-2003)

First bacterial (H. influenzae)

First eukaryote (yeast)

First metazoan (C. elegans)

http://www.genomesonline.org

Completely Sequenced Genomes September 2012

Genome sequencing costs plunge

The Cancer Genome Atlas (TCGA)

- Mission (Bio)
  - To accelerate the understanding of the molecular basis of cancer through the application of genome analysis technologies.
- 2014 target (Data)
  - 25 tumor types x 500 cases each
  - Exome/whole genome sequencing
  - Copy number variation
  - Promoter methylation
  - mRNA expression
  - miRNA expression

Glioblastoma

Ovarian Cancer

From TCGA data portal

Why now?

Bio informatics

Data

§ Hypotheses § Questions § Samples § Experiments

§ DNA § RNA § Protein § Metabolite § Phenotype

§ Sequence § Expression § Structure §  Interaction

§ Storage/retrieval § Visualization § Computational methods § Statistical methods

informatics

Roles for different investigators in bioinformatics

- Algorithm developer
  - Statisticians
  - Mathematicians
  - Computer scientists
- Tool developer
  - Bioinformaticians
- Data provider/consumer
  - Biologists

Graph courtesy of http://www.incogen.com/

Comprehensive resource list

http://www.bioinformatics.ca/links_directory/

Sequence and structure databases

- GenBank: http://www.ncbi.nlm.nih.gov/genbank/
  - Annotated collection of all publicly available DNA sequences
  - 126,551,501,141 bases in 135,440,924 sequences as of April 2011
- UniProt: http://www.uniprot.org/
  - Comprehensive resource for protein sequences and functional information
  - 534,242 reviewed entries as of January 2012
- PDB: http://www.rcsb.org/
  - 3D structures of large biological molecules, including proteins and nucleic acids
  - 79,180 structures as of February 2012
- Pfam: http://pfam.sanger.ac.uk/
  - Collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
  - 13,672 families as of November 2011

Genome browsers

- UCSC genome browser
  - http://genome.ucsc.edu/cgi-bin/hgGateway
- Ensembl genome browser
  - http://www.ensembl.org/index.html

Gene-centric databases

- Entrez Gene
  - http://www.ncbi.nlm.nih.gov/gene
  - NCBI/NIH
  - All completely sequenced genomes
  - One gene per page
- Ensembl BioMart
  - http://www.ensembl.org/biomart/martview
  - EMBL-EBI and Sanger Institute
  - Vertebrates and other selected eukaryotic species
  - Batch information retrieval

Gene expression data

- Gene Expression Omnibus (GEO)
  - http://www.ncbi.nlm.nih.gov/geo/
- ArrayExpress
  - http://www.ebi.ac.uk/arrayexpress/

Pathway and network resources

- Gene Ontology (GO): http://www.geneontology.org/
- Pathway databases
  - KEGG: http://www.genome.jp/kegg/pathway.html
  - Reactome: http://www.reactome.org/
  - WikiPathways: http://www.wikipathways.org/
- Protein-protein interaction databases
  - DIP: http://dip.doe-mbi.ucla.edu/
  - MINT: http://mint.bio.uniroma2.it/mint/
  - BioGRID: http://www.thebiogrid.org/
  - HPRD: http://www.hprd.org
- Protein-DNA interaction database
  - Transfac: http://www.gene-regulation.com

Course content and grades

Applied Bioinformatics

IGP300B Bioregulation II, Spring 2013

(M/W/F, 10:00-10:55am, LH512)

Module director: Bing Zhang, Ph.D. (bing.zhang@vanderbilt.edu; Department of Biomedical Informatics; 2525 West End Ave, Room 656; Phone: 936-0090)

Team members: William Bush, Ph.D., Qi Liu, Ph.D., Zhongming Zhao, Ph.D.

Date | Subject | Instructor | Homework (HW) / Project
2/15 | Course overview & Introduction to Linux | Zhang |
2/18 | Pairwise sequence alignment | Zhao |
2/20 | Multiple sequence alignment | Zhao |
2/22 | Inferring phylogenetic relationships | Zhao | HW I distribution (20 pts)
2/25 | Gene prediction | Bush |
2/27 | Gene regulatory elements and conservation | Bush | HW I due
3/1 | Assessing the impact of genetic variation | Bush | HW II distribution (20 pts)
3/4 | Supervised analysis of gene expression data | Zhang |
3/6 | Unsupervised analysis of gene expression data | Zhang | HW II due
3/8 | Functional interpretation of gene lists | Zhang |
3/11 | Next-Generation Sequencing data analysis | Liu | HW III distribution (20 pts)
3/13 | Project: Bioinformatics analysis and interpretation of genomic sequencing data | Zhang & Liu |
3/15 | | | HW III due
3/18 | | |
3/20 | Project presentation | | Project presentation (40 pts)
3/22 | | |

HW assignments will be graded by each instructor for their respective sections.

Final grade = sum of the three HW scores and the project presentation score (100 pts in total).

A: 85-100; B: 75-84; C: 65-74; D: 55-64; F: 0-54
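
The grading rule above is simple arithmetic and can be sketched in a few lines of shell. This is only an illustration, assuming whole-number scores; the function name `final_grade` and the sample scores are hypothetical and not part of the course material.

```shell
#!/usr/bin/env bash
# Letter grade from the three homework scores (max 20 pts each) and the
# project presentation score (max 40 pts), using the cutoffs listed above.
# Assumption: scores are whole numbers. The function name is hypothetical.
final_grade() {
  local total=$(( $1 + $2 + $3 + $4 ))
  local letter
  if   (( total >= 85 )); then letter=A
  elif (( total >= 75 )); then letter=B
  elif (( total >= 65 )); then letter=C
  elif (( total >= 55 )); then letter=D
  else                         letter=F
  fi
  echo "$total $letter"
}

# Hypothetical scores: HW I, HW II, HW III, project presentation
result=$(final_grade 18 17 20 35)
echo "$result"
```

Late deductions are not modeled here, since the slides do not say how the 10% per day is computed.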

Course materials and assignments

- Lecture slides available at https://medschool.mc.vanderbilt.edu/iidea/admin/admin_login.php before each lecture
- Homework assignments available at https://medschool.mc.vanderbilt.edu/iidea/admin/admin_login.php on the distribution date (2/22, 3/1, 3/11)
- Homework assignments are due on paper at the beginning of class on the due date (2/27, 3/6, 3/15). There will be a 10% per day deduction for late reports.
- Start thinking about forming project teams (5-6 people per team)
- Instructor contact information
  - Dr. Bing Zhang: bing.zhang@vanderbilt.edu
  - Dr. Zhongming Zhao: zhongming.zhao@vanderbilt.edu
  - Dr. William Bush: william.s.bush@vanderbilt.edu
  - Dr. Qi Liu: qi.liu@vanderbilt.edu

ACCRE

- Advanced Computing Center for Research & Education
  - http://www.accre.vanderbilt.edu/
  - The compute cluster currently consists of more than 500 Linux systems with quad or hex core processors
- Linux system
  - An operating system (OS) like Windows or Mac
  - Portable, multi-tasking, multi-user OS
  - High performance and free, making it ideal for high performance computing clusters

Get an ACCRE account

- http://www.accre.vanderbilt.edu/?page_id=617
- Registration form
  - Name, VUNetID, Department (VU), School (VU), Email, Phone, Position
  - Group: IGP300b_ab (igp300b_ab)
  - Primary research area: bioinformatics
  - Primary application: Existing Application
  - Primary application name: R
  - Primary application type: Serial
  - Expected typical number of processors: NA
  - Expected typical number of concurrent running jobs: 1
  - Linux experience:
  - Expected compilers/languages: C, C++, R, perl, python
  - Expected external libraries: NA
  - BlueArc User: No
  - Other useful information: NA

Logging onto the cluster and changing the password

- Windows
  - Application: SSH (http://its.vanderbilt.edu/downloads)
  - Two steps: add profile -> edit profile
  - Host name: vmplogin.accre.vanderbilt.edu
  - Username: your_user_name
- Mac
  - Use Spotlight to find the application: Terminal
  - Command: ssh your_user_name@vmplogin.accre.vanderbilt.edu
- Change password
  - rsh vmpsched
  - passwd
- Exit
  - exit

Logging onto the cluster and changing the password (using SSH on Windows)

Logging onto the cluster and changing the password (using Terminal on Mac)

You won’t see any response while typing the password, which is fine.

Hierarchical Filesystem

/
    bin/        chmod  cp  date  grep  mv  rm  vi
    usr/
        bin/    diff  find  gcc  id  make  perl  ssh
        lib/    libc.so  libgpfs.so  libjpeg.so  libstdc++.so
    home/
        igptest/
            bin/    myprog.sh  dothis.pl  dothat.py
            docs/
            src/    prog1.c  prog2.f77  prog3.cpp
        annie/
        cody/
    scratch/
    etc/
    tmp/

Example paths:

/home

/home/igptest

/home/igptest/src/prog3.cpp
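
The example paths above can be taken apart with the standard `dirname` and `basename` tools, which shows how each `/` separates one level of the hierarchy from the next. A minimal sketch: only the path itself comes from the slide, and the variable names are illustrative.

```shell
#!/usr/bin/env bash
# Path taken from the filesystem diagram above. The file does not need to
# exist, because dirname and basename only manipulate the path string.
path="/home/igptest/src/prog3.cpp"

parent=$(dirname "$path")       # directory that contains the file
name=$(basename "$path")        # last component of the path
home_dir=$(dirname "$parent")   # one level further up the tree

echo "file:        $name"
echo "directory:   $parent"
echo "one level up: $home_dir"
```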

Working with directories

- pwd (prints your present working directory)
- ls (lists directory contents)
- mkdir (makes a directory)
- cd (changes directories)
  - .. (parent directory)
  - . (current directory)
  - ~ or no parameter (home directory)
- rmdir (removes an empty directory)
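
The directory commands above can be tried in one short session. The sketch below runs inside a throwaway directory created with `mktemp -d` (a standard tool not covered in the slides), so nothing in your home directory is touched; the directory name `demo` is a hypothetical example.

```shell
#!/usr/bin/env bash
# Work inside a throwaway directory so nothing in your home directory changes.
# "demo" is a hypothetical directory name used only for illustration.
workdir=$(mktemp -d)
cd "$workdir"

mkdir demo            # make a directory
cd demo               # move into it
inside=$(pwd)         # record the present working directory
cd ..                 # go back to the parent directory
listing=$(ls)         # the listing now shows "demo"
rmdir demo            # remove it again (works only because it is empty)
after=$(ls)           # the listing is empty again

echo "was in: $inside"
echo "before: $listing"
echo "after:  ${after:-<empty>}"

cd ~                  # return to the home directory
rmdir "$workdir"      # clean up the throwaway directory
```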

Getting help

- man (displays manual pages for a command)
  - man ls (displays the manual for the ls command)
  - space bar to show the next page
  - q to exit
- Options for ls
  - ls -a (do not ignore entries starting with .)
  - ls -l (use a long listing format)
  - ls -al (use a long listing format and do not ignore entries starting with .)
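
The difference between the ls options is easiest to see with a hidden file present. A small sketch, assuming a throwaway directory made with `mktemp -d`; both file names are hypothetical.

```shell
#!/usr/bin/env bash
# Throwaway directory with one visible and one hidden file.
# Both file names are hypothetical examples.
workdir=$(mktemp -d)
touch "$workdir/notes.txt" "$workdir/.hidden_config"

plain=$(ls "$workdir")      # entries starting with . are ignored
all=$(ls -a "$workdir")     # also lists ., .., and .hidden_config

echo "ls:";     echo "$plain"
echo "ls -a:";  echo "$all"
echo "ls -al:"; ls -al "$workdir"   # long format: permissions, owner, size, date

# Clean up the throwaway files and directory
rm "$workdir/notes.txt" "$workdir/.hidden_config"
rmdir "$workdir"
```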

Working with files

- more (displays the contents of a file)
  - space bar to show the next page
  - q to exit
- cp (copies files)
- mv (renames/moves files)
- rm (removes files)
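
The file commands above can be chained into one short session. This is a sketch in a throwaway directory with hypothetical file names; it uses `cat` instead of `more` to print the file, because `more` is meant for interactive paging.

```shell
#!/usr/bin/env bash
# All file names here are hypothetical; the work happens in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

echo "first line" > original.txt   # create a small file to work with
cp original.txt copy.txt           # copy: both files now exist
mv copy.txt renamed.txt            # rename: copy.txt becomes renamed.txt
contents=$(cat renamed.txt)        # print the file (use more for long files)
rm original.txt                    # remove: only renamed.txt is left
remaining=$(ls)

echo "contents:  $contents"
echo "remaining: $remaining"

# Clean up
rm renamed.txt
cd /
rmdir "$workdir"
```

Note that `rm` deletes immediately and there is no trash folder to recover files from.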

Editing files with nano

- cd (change to home directory)
- nano .bashrc (use nano to edit the file .bashrc, which contains commands that are executed each time a new shell starts)
- Add “setpkgs -a R” to the end of the file (this lets you use the R environment for statistical computing that is installed on the ACCRE system)
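
The same line can also be appended without opening an editor. The sketch below is one possible approach, not the method shown in class: it adds the line only if it is not already present, so running it twice does not create a duplicate. As a safety assumption, a scratch file stands in for the real .bashrc; the function name `add_line_once` is hypothetical.

```shell
#!/usr/bin/env bash
# Assumption: a scratch file stands in for ~/.bashrc so the sketch is safe to run.
# On the cluster you would set rcfile="$HOME/.bashrc" instead.
rcfile=$(mktemp)
line='setpkgs -a R'

add_line_once() {
  # Append the line only if an identical line is not already in the file.
  grep -qxF -- "$line" "$rcfile" || echo "$line" >> "$rcfile"
}

add_line_once
add_line_once   # the second call changes nothing

count=$(grep -cxF -- "$line" "$rcfile")
echo "occurrences of '$line': $count"

rm "$rcfile"    # remove the scratch file
```

The change takes effect the next time a shell starts, for example at the next login.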

Copying files to/from a local computer

- Windows
  - Application: SSH (http://its.vanderbilt.edu/downloads)
- Mac
  - Application: Fugu (http://its.vanderbilt.edu/downloads)
  - Connect to: vmplogin.accre.vanderbilt.edu
  - Username: your_user_name
  - Don’t change other items

Copying files to/from a local computer (using SSH in Windows)

Copying files to/from a local computer (using Fugu in Mac)

Homework

- Get an ACCRE account
- Log onto the cluster and change your password
- Get familiar with the Linux commands introduced today
- Copy the file sample_file.txt under directory /home/igptest to your home directory
- Add “setpkgs -a R” to the end of your .bashrc file
