Page 1: Applied Bioinformatics - Vanderbilt Universitybioinfo.vanderbilt.edu/zhanglab/lectures/AB2013Lecture01.pdf · Applied Bioinformatics Course Overview & Introduction to Linux Bing Zhang

Applied Bioinformatics Course Overview & Introduction to Linux

Bing Zhang Department of Biomedical Informatics

Vanderbilt University

[email protected]


What is bioinformatics

Applied Bioinformatics, Spring 2013 2

Bioinformatics

- Bio: hypotheses, questions, samples, experiments
- Data: DNA, RNA, protein, metabolite, phenotype; sequence, expression, structure, interaction
- Informatics: storage/retrieval, visualization, computational methods, statistical methods


Genomic sequences


http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html

Human genome project (1990-2003)

First bacterial (H. influenzae)

First eukaryote (yeast)

First metazoan (C. elegans)

http://www.genomesonline.org

Completely Sequenced Genomes September 2012


Genome sequencing costs plunge


The Cancer Genome Atlas (TCGA)

- Mission (Bio)
  - To accelerate the understanding of the molecular basis of cancer through the application of genome analysis technologies.
- 2014 target (Data)
  - 25 tumor types x 500 cases each
  - Exome/whole genome sequencing
  - Copy number variation
  - Promoter methylation
  - mRNA expression
  - miRNA expression

Glioblastoma

Ovarian Cancer

From TCGA data portal


Why now?


- Bio: hypotheses, questions, samples, experiments
- Data: DNA, RNA, protein, metabolite, phenotype; sequence, expression, structure, interaction
- Informatics: storage/retrieval, visualization, computational methods, statistical methods


Roles for different investigators in bioinformatics

- Algorithm developer
  - Statisticians
  - Mathematicians
  - Computer scientists
- Tool developer
  - Bioinformaticians
- Data provider/consumer
  - Biologists

Graph courtesy of http://www.incogen.com/


Comprehensive resource list


http://www.bioinformatics.ca/links_directory/


Sequence and structure databases

- GenBank: http://www.ncbi.nlm.nih.gov/genbank/
  - Annotated collection of all publicly available DNA sequences
  - 126,551,501,141 bases in 135,440,924 sequences as of April 2011
- UniProt: http://www.uniprot.org/
  - Comprehensive resource for protein sequences and functional information
  - 534,242 reviewed entries as of January 2012
- PDB: http://www.rcsb.org/
  - 3D structures of large biological molecules, including proteins and nucleic acids
  - 79,180 structures as of February 2012
- Pfam: http://pfam.sanger.ac.uk/
  - Collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
  - 13,672 families as of November 2011


Genome browsers

- UCSC genome browser
  - http://genome.ucsc.edu/cgi-bin/hgGateway
- Ensembl genome browser
  - http://www.ensembl.org/index.html


Gene-centric databases

- Entrez Gene
  - http://www.ncbi.nlm.nih.gov/gene
  - NCBI/NIH
  - All completely sequenced genomes
  - One gene per page
- Ensembl BioMart
  - http://www.ensembl.org/biomart/martview
  - EMBL-EBI and Sanger Institute
  - Vertebrates and other selected eukaryotic species
  - Batch information retrieval


Gene expression data

- Gene Expression Omnibus (GEO)
  - http://www.ncbi.nlm.nih.gov/geo/
- ArrayExpress
  - http://www.ebi.ac.uk/arrayexpress/


Pathway and network resources

- Gene Ontology (GO): http://www.geneontology.org/
- Pathway databases
  - KEGG: http://www.genome.jp/kegg/pathway.html
  - Reactome: http://www.reactome.org/
  - WikiPathways: http://www.wikipathways.org/
- Protein-protein interaction databases
  - DIP: http://dip.doe-mbi.ucla.edu/
  - MINT: http://mint.bio.uniroma2.it/mint/
  - BioGRID: http://www.thebiogrid.org/
  - HPRD: http://www.hprd.org
- Protein-DNA interaction database
  - Transfac: http://www.gene-regulation.com


Course content and grades


Applied Bioinformatics

IGP300B Bioregulation II, Spring 2013

(M/W/F, 10:00-10:55am, LH512)

Module director: Bing Zhang, Ph.D. ([email protected]; Department of Biomedical Informatics; 2525 West End Ave, Room 656; Phone: 936-0090)

Team members: William Bush, Ph.D., Qi Liu, Ph.D., Zhongming Zhao, Ph.D.

Date | Subject | Instructor | Homework (HW) / Project
2/15 | Course overview & Introduction to Linux | Zhang |
2/18 | Pairwise sequence alignment | Zhao |
2/20 | Multiple sequence alignment | Zhao |
2/22 | Inferring phylogenetic relationships | Zhao | HW I distribution (20 pts)
2/25 | Gene prediction | Bush |
2/27 | Gene regulatory elements and conservation | Bush | HW I due
3/1 | Assessing the impact of genetic variation | Bush | HW II distribution (20 pts)
3/4 | Supervised analysis of gene expression data | Zhang |
3/6 | Unsupervised analysis of gene expression data | Zhang | HW II due
3/8 | Functional interpretation of gene lists | Zhang |
3/11 | Next-Generation Sequencing data analysis | Liu | HW III distribution (20 pts)
3/13 | Project: Bioinformatics analysis and interpretation of genomic sequencing data | Zhang & Liu |
3/15 | | | HW III due
3/18 | | |
3/20 | Project presentation | | Project presentation (40 pts)
3/22 | | |

HW assignments will be graded by each instructor for their respective sections.

Final Grade = sum of the three HW scores and the project presentation score (100 pts in total).

A: 85-100; B: 75-84; C: 65-74; D: 55-64; F: 0-54
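The grading rule above is simple arithmetic, so it can be checked with a few lines of shell. This is only an illustrative sketch: the four scores are made-up values, not real results, and the variable names are assumptions introduced here.

```shell
# Hypothetical scores for illustration (each HW is out of 20, the project out of 40).
hw1=18; hw2=17; hw3=20; project=35

# Final grade = sum of the three homework scores and the project presentation score.
total=$((hw1 + hw2 + hw3 + project))

# Map the total onto the letter ranges listed in the course outline.
if   [ "$total" -ge 85 ]; then grade=A
elif [ "$total" -ge 75 ]; then grade=B
elif [ "$total" -ge 65 ]; then grade=C
elif [ "$total" -ge 55 ]; then grade=D
else                           grade=F
fi

echo "$total points -> $grade"
```

With these example scores the total is 90, which falls in the A range.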


Course materials and assignments

- Lecture slides are available at https://medschool.mc.vanderbilt.edu/iidea/admin/admin_login.php before each lecture
- Homework assignments are available at https://medschool.mc.vanderbilt.edu/iidea/admin/admin_login.php on the distribution dates (2/22, 3/1, 3/11)
- Homework assignments are due on paper at the beginning of class on the due dates (2/27, 3/6, 3/15). There will be a 10% per day deduction for late reports.
- Start thinking about forming project teams (5-6 people per team)
- Instructor contact information
  - Dr. Bing Zhang: [email protected]
  - Dr. Zhongming Zhao: [email protected]
  - Dr. William Bush: [email protected]
  - Dr. Qi Liu: [email protected]


ACCRE

- Advanced Computing Center for Research & Education
  - http://www.accre.vanderbilt.edu/
  - The compute cluster currently consists of more than 500 Linux systems with quad- or hex-core processors
- Linux system
  - An operating system (OS) like Windows or Mac
  - Portable, multi-tasking, multi-user OS
  - High performance and free, making it ideal for high performance computing clusters


Get an ACCRE account

- http://www.accre.vanderbilt.edu/?page_id=617
- Registration form
  - Name, VUNetID, Department (VU), School (VU), Email, Phone, Position
  - Group: IGP300b_ab (igp300b_ab)
  - Primary research area: bioinformatics
  - Primary application: Existing Application
  - Primary application name: R
  - Primary application type: Serial
  - Expected typical number of processors: NA
  - Expected typical number of concurrent running jobs: 1
  - Linux experience:
  - Expected compilers/languages: C, C++, R, perl, python
  - Expected external libraries: NA
  - BlueArc User: No
  - Other useful information: NA


Logging onto the cluster and change password

- Windows
  - Application: SSH (http://its.vanderbilt.edu/downloads)
  - Two steps: add profile -> edit profile
  - Host name: vmplogin.accre.vanderbilt.edu
  - Username: your_user_name
- Mac
  - Spotlight to find the application: Terminal
  - Command: ssh [email protected]
- Change password
  - rsh vmpsched
  - passwd
- Exit
  - exit


Logging onto the cluster and change password (using SSH in Windows)


Logging onto the cluster and change password (using Terminal in Mac)


You won't see any response while typing your password, which is fine.


Hierarchical Filesystem

/
  bin/        chmod, cp, date, grep, mv, rm, vi
  usr/
    bin/      diff, find, gcc, id, make, perl, ssh
    lib/      libc.so, libgpfs.so, libjpeg.so, libstdc++.so
  home/
    igptest/
      bin/    myprog.sh, dothis.pl, dothat.py
      docs/
      src/    prog1.c, prog2.f77, prog3.cpp
    annie/
    cody/
  scratch/
  etc/
  tmp/

Example paths:

/home

/home/igptest

/home/igptest/src/prog3.cpp
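As a small illustration of how an absolute path is built up from the root, the sketch below splits the example path /home/igptest/src/prog3.cpp into its directory part and its file name. It uses the standard dirname and basename utilities, which work on the text of the path only, so the file does not need to exist on your machine. The variable names are assumptions made for this example.

```shell
# The example path from the slide; dirname/basename only inspect the string.
path=/home/igptest/src/prog3.cpp

parent=$(dirname "$path")        # everything up to the last "/"
file=$(basename "$path")         # the last component of the path
grandparent=$(dirname "$parent") # one more level up the hierarchy

echo "directory:   $parent"      # /home/igptest/src
echo "file name:   $file"        # prog3.cpp
echo "one level up: $grandparent" # /home/igptest
```

Each call to dirname moves one level closer to the root directory /.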


Working with directories

- pwd (prints your present working directory)
- ls (lists directory contents)
- mkdir (makes a directory)
- cd (changes directories)
  - .. (parent directory)
  - . (current directory)
  - ~ or no parameter (home directory)
- rmdir (removes an empty directory)
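The commands above can be tried together in a short session. This is a minimal sketch that works inside a temporary directory created with mktemp, so nothing in your home directory is changed; the directory name "project" is a made-up example.

```shell
# Work in a throwaway directory so the example is safe to run anywhere.
workdir=$(mktemp -d)
cd "$workdir"

pwd              # print the present working directory
mkdir project    # make a directory called "project"
ls               # list directory contents; "project" is now shown
cd project       # change into the new directory
cd ..            # ".." is the parent directory, so this goes back up
cd .             # "." is the current directory, so this changes nothing
rmdir project    # remove the directory; this works only because it is empty
cd ~             # return to the home directory (plain "cd" does the same)
```

If rmdir is given a directory that still contains files, it refuses and prints an error rather than deleting anything.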


Getting help

- man (displays the manual page for a command)
  - man ls (displays the manual for the ls command)
  - space bar to show the next page
  - q to exit
- Variations of ls
  - ls -a (do not ignore entries starting with .)
  - ls -l (use a long listing format)
  - ls -al (use a long listing format and do not ignore entries starting with .)
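To see what the ls options change, the sketch below creates one ordinary file and one hidden file (a name starting with a dot) in a temporary directory and lists them in each way. Both file names are made up for the example.

```shell
# Set up a scratch directory with one visible and one hidden file.
workdir=$(mktemp -d)
cd "$workdir"
touch notes.txt .hidden_settings

ls        # shows only notes.txt; names starting with "." are skipped
ls -a     # also shows ".", ".." and .hidden_settings
ls -l     # long format: permissions, owner, size and date for notes.txt
ls -al    # long format, including the hidden entries

# Capture the two listings so they can be compared.
plain=$(ls)
all=$(ls -a)
```

Running man ls afterwards shows the full list of options; it is left out of the block because it opens an interactive pager.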


Working with files

- more (displays the contents of a file)
  - space bar to show the next page
  - q to exit
- cp (copies files)
- mv (renames/moves files)
- rm (removes files)
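A short sketch of the file commands in sequence, again inside a temporary directory. The file and directory names are invented for the example, and cat is used in place of more so that the block runs without waiting for key presses.

```shell
# Create a scratch directory and a small two-line file to work with.
workdir=$(mktemp -d)
cd "$workdir"
printf 'line one\nline two\n' > original.txt

cp original.txt copy.txt     # copy: original.txt and copy.txt now both exist
mv copy.txt renamed.txt      # rename: copy.txt becomes renamed.txt
mkdir archive
mv renamed.txt archive/      # move: renamed.txt now lives inside archive/
cat archive/renamed.txt      # print the file; "more" would page through it instead
rm original.txt              # remove: original.txt is deleted for good
```

Note that rm does not move files to a trash folder, so a removed file cannot be restored.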


Editing files with nano

- cd (change to home directory)
- nano .bashrc (use nano to edit the file .bashrc, which contains commands that are executed each time a new shell session starts)
- Add "setpkgs -a R" to the end of the file, typed with a plain hyphen (this lets you use the R environment for statistical computing, which has been installed on the ACCRE system)
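The same line can also be appended from the command line instead of in nano. The sketch below is an illustration only: it writes to a scratch file (the rcfile variable is a stand-in introduced here, not part of the slides) so that it is safe to try, and it types the option with a plain hyphen. The grep check keeps the line from being added twice.

```shell
# Scratch stand-in for ~/.bashrc so the example does not touch your real file.
rcfile=$(mktemp)
line='setpkgs -a R'

# Append the line only if an identical line is not already in the file.
grep -qxF -- "$line" "$rcfile" || printf '%s\n' "$line" >> "$rcfile"

# Running the same command again adds nothing, because the line is now present.
grep -qxF -- "$line" "$rcfile" || printf '%s\n' "$line" >> "$rcfile"

tail -n 1 "$rcfile"                        # show the last line of the file
count=$(grep -cxF -- "$line" "$rcfile")    # number of matching lines
```

To apply this on the cluster you would point rcfile at ~/.bashrc; the change takes effect the next time you log in or start a new shell.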


Copying files to/from a local computer

- Windows
  - Application: SSH (http://its.vanderbilt.edu/downloads)
- Mac
  - Application: Fugu (http://its.vanderbilt.edu/downloads)
  - Connect to: vmplogin.accre.vanderbilt.edu
  - Username: your_user_name
  - Don't change other items


Copying files to/from a local computer (using SSH in Windows)


Copying files to/from a local computer (using Fugu in Mac)


Homework

- Get an ACCRE account
- Log onto the cluster and change your password
- Get familiar with the Linux commands introduced today
- Copy the file sample_file.txt from the directory /home/igptest to your home directory
- Add "setpkgs -a R" to the end of your .bashrc file
