RNA-Seq · sites.tufts.edu/cbi/files/2013/02/Unix_intro.pdf · 2013. 2. 4.

RNA-Seq Joshua Ainsley, PhD Postdoctoral Researcher Lab of Leon Reijmers Neuroscience Department Tufts University [email protected]


  • RNA-Seq

    Joshua Ainsley, PhD
    Postdoctoral Researcher

    Lab of Leon Reijmers
    Neuroscience Department

    Tufts University
    [email protected]

  • Day two

    Intro to Unix

    File formats

    RNA-Seq QC

  • Lecture outline

    • What is Unix?
    • Tufts Cluster
    • Unix Introduction
    • File formats
      – FASTA
      – FASTQ
      – SAM/BAM
      – GFF/GTF
      – BED

  • What is Unix?

    An operating system

  • Why does bioinformatics use Unix?

    Open source fits with academic ideals

    Best software development environment

    Programming languages already installed and configured - difficult on Windows

  • Why does bioinformatics use Unix?

    Software is free and easily available

    Shell tools – bioinformatics is mostly about processing text in some way

    Mac OS is based on Unix and (for the most part) works the same

  • Unix directory structure

  • Unix components

    Kernel
    The operating system. Allocates hardware resources in response to software and user requests.

    Shell
    The interface between the user and the kernel. We will use the Bash shell.

  • Files and Processes

    Everything in Unix is a file or a process

    A file is a destination for or a source of data (this includes directories, the screen, printers)

    A process is a program that is running

    A file stores the instructions for a process, and a process may interact with files

  • Interacting with Unix

    Text-based command line

    You type in commands and the OS assigns resources to utilize the appropriate process(es) and file(s)

    $ command -options targets

  • Tufts High-performance computing research cluster

    172 RedHat 6 systems
    8-16 cores/node
    16-128 GB RAM/node

    Access using ssh (secure shell)
    Terminal on Mac/Linux
    PuTTY on Windows

  • Why use a cluster?

    NGS data is big and getting bigger

    Your desktop or laptop isn’t powerful enough

    Run many simultaneous programs (jobs)

    Cloud computing from Amazon, Illumina (computer rental)


  • Login to Tufts Cluster

    cluster.uit.tufts.edu

    Windows: Putty.exe

    Mac: Terminal

    ssh [email protected]

  • Your first Unix commands

    $ ls
    lists the contents of a directory

    $ pwd
    shows your current directory

    $ whoami
    shows your user name

    $ touch file.txt
    creates an empty file, or updates the timestamp of an existing one (requires a target)

  • Command manuals

    $ man
    shows the manual for a target command

    $ man ls
    navigate with arrow keys, PgUp, PgDn
    press “q” to exit

    Some useful options for ls: -l -a -t -S

    Options can be combined: -lat

  • Editing files in Unix

    Requires a text editor. We’ll use nano.

    $ nano

  • Unix intro exercises


  • Important file formats

    FASTA – nucleotide sequence
    FASTQ – sequence + quality information
    SAM/BAM – alignments
    GFF/GTF – transcript information
    BED – misc. feature coordinates

  • FASTA

    >gi|212549564|ref|NM_015981.3| Homo sapiens calcium/calmodulin-dependent protein kinase II alpha (CAMK2A), transcript variant 1, mRNA
    GGTTGCCATGGGGACCTGGATGCTGACGAAGGCTCGCGAGGCTGTGAGCAGCCACAGTGCCCTGCTCAGAAGCCCCGGGCTCGTCAGTCAAACCGGTTCTCTGTTTGCACTCGGCAGCACGGGCAGGCAAGTGGTCCCTAGGTTCGGGAGCAGAGCAGCAGCGCCTCAGTCCTGGTCCCCCAGTCCCAAGCCTCACCTGCCTGCCCAGCGCCAGGATGGCCACCATCACCTGCACCCGCTTCACGGAAGAGTACCAGCTCTTCGAGGAATTGGGCAAGGGAGCCTTCTCGGTGGTGCGAAGGTGTGTGAAGGTGCTGGCTGGCCAGGAGTATGCTGCCAAGATCATCAACACAAAGAAGCTGTCAGCCAGAGACCATCAGAAGCTGGAGCGTGAAGCCCGCATCTGCCGCCTGCTGAAGCACCCCAACATCGTCCGACTACATGACAGCATCTCAGAGGAGGGACACCACTACCTGATCTTCGACCTGGTCACTGGTGGGGAACTGTTTGAAGATATCGTGGCCCGGGAGTATTACAGTGAGGCGGATGCCAGTCACTGTATCCAGCAGATCCTGGAGGCTGTGCTGCACTGCCACCAGATGGGGGTGGTGCACCGGGACCTGAAGCCTG
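A minimal sketch of how a FASTA record like the one above could be read in Python, assuming the usual convention that a header line starts with ">" and that the sequence may be wrapped over several lines. The function name `read_fasta` and the toy records are illustrative and not part of the lecture.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence lines are joined back together
    if header is not None:
        yield header, "".join(chunks)


# Toy input: hypothetical record names and made-up sequences.
example = [">seq1 first toy record", "GGTTGCC", "ATGGGGA", ">seq2", "ACGT"]
records = list(read_fasta(example))
for name, seq in records:
    print(name, len(seq))
```

To read a real file, pass an open file handle instead of the list, for example `read_fasta(open("transcripts.fa"))` (the file name is hypothetical).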

  • Tab completion

    Save time typing and reduce spelling errors by using tab completion.

    Type in enough letters to uniquely identify a command/file/path, and press tab.

    Unix will automatically fill in the rest.

    If pressing tab does nothing, what you have typed is not enough.

    Press tab twice to see a list of possible matches.

  • Text manipulation commands

    less - displays part of a file
    head - display beginning of file
    tail - display end of file
    sort - sorts a file
    cut - select columns of a file
    tr - replace or remove characters
    grep - searches a file
    sed - stream editor, edits a file line by line
    awk - programming language, very useful for advanced text manipulation

  • Handling text output

    >
    Redirects output to a target (overwrites)

    >>
    Appends output to a target

    |
    “Pipe” - Sends output to another program. Very useful for multi-step text manipulation.

  • FASTA exercises

    Minute cards

    Break

  • FASTQ

    Stores sequence information and quality scores associated with the sequence

    Quality represented as a Phred score

    Q = -10 × log10(P)

    where Q = quality and P = error probability

    Phred   Prob. Incorrect   Accuracy
    10      1 in 10           90%
    20      1 in 100          99%
    30      1 in 1000         99.9%
    40      1 in 10000        99.99%
    50      1 in 100000       99.999%
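The table above can be reproduced with a few lines of Python. This is a sketch that assumes the standard Phred definition, Q = -10 × log10(P); the function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Print one row per score used in the table: score, error probability, accuracy.
for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(q, p, f"{(1 - p) * 100:.3f}%")
```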

  • FASTQ

    To save space, Phred scores are represented as ASCII characters

  • Phred scales

    Sanger format
    Phred+33 from 0 to 40

    Illumina 1.0/Solexa
    Phred+64 from -5 to 40

    Illumina 1.3+
    Phred+64 from 3 to 40

    Illumina 1.8+
    Phred+33 from 0 to 41

    Look for “B” tails (Q2 in Phred+64) or “#” tails (Q2 in Phred+33) to tell the encodings apart.
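As a rough illustration of the offsets above, the sketch below turns a quality string back into numeric scores. `decode_quality` and `guess_offset` are hypothetical helper names, and the guess is only a heuristic: it relies on the fact that characters below ";" (ASCII 59) cannot occur in any of the +64 encodings listed, so a short string made up entirely of high-quality bases stays ambiguous.

```python
def decode_quality(qual, offset=33):
    """Convert an ASCII quality string to a list of integer scores."""
    return [ord(char) - offset for char in qual]


def guess_offset(qual):
    """Heuristic guess of the ASCII offset (33 or 64) for a quality string.

    Assumption: the lowest +64 character is ';' (Solexa score -5), so anything
    below it implies +33. Strings with only high characters default to 64,
    which can be wrong for very high-quality Phred+33 reads.
    """
    lowest = min(ord(char) for char in qual)
    return 33 if lowest < ord(";") else 64


# "#" and "B" both encode a score of 2, in Phred+33 and Phred+64 respectively.
print(decode_quality("#", offset=33), decode_quality("B", offset=64))
```

In practice the guess should be made over many reads, not a single one.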

  • FASTQ

    @42JV5AAXX_HWI-EAS229_1:6:87:886:1289

    CTACACCTTGAGCAAGAGGACCCTGCAATGTCCCTAGCTGCCAGCAGGCGGC

    +

    B?6B@@ABB@A;AB@@>B?@@@@?AA@A@@BBA5C>>?>?7;

    First line = unique identifier (starts with “@”)
    Second line = sequence
    Third line = spacer (may repeat identifier, starts with “+”)
    Fourth line = quality scores
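The four-line layout can be read with a short generator. This is a sketch under the assumption that every record occupies exactly four lines (no wrapped sequences); `read_fastq` and the toy record are illustrative.

```python
def read_fastq(lines):
    """Yield (identifier, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        # next(it, "") returns "" at end of input, which fails the checks below.
        seq = next(it, "").rstrip("\n")
        spacer = next(it, "").rstrip("\n")
        qual = next(it, "").rstrip("\n")
        if not header.startswith("@") or not spacer.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


# Toy record with a made-up identifier and sequence.
toy = ["@read1", "ACGT", "+", "IIII"]
reads = list(read_fastq(toy))
```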

  • FastQC

    FastQC is used to generate summary information about FASTQ sequences

    You will use this every time you receive RNA-Seq data

    Babraham Bioinformatics
    http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

  • FastQC – Basic Statistics

  • FastQC – Per base sequence quality

  • FastQC – Per base sequence content

  • FastQC – Sequence duplication levels

  • FastQC – Kmer content

  • What’s wrong here? #1

  • What’s wrong here? #1

    Random hexamer bias at start

    AT rich due to mitochondrial and polyA reads

    Solution: remove the reads before analysis

  • What’s wrong here? #2

  • What’s wrong here? #2

    Adapter sequences due to short insert

    Solution: trim the reads before mapping

  • What’s wrong here? #3

  • What’s wrong here? #3

    This sample is almost entirely adapter-adapter ligation products

    Solution: None. Data unusable.

    Add the ligase last!!

  • LSF (Load Sharing Facility)

    Resource allocator - how programs are assigned to nodes in the cluster

    You interact with the head node, along with everyone else

    Programs should not be run on the head node

  • Submitting jobs to the cluster

    bsub
    Submits a job to the cluster

    bjobs
    Displays information about current jobs

    bkill
    Stops a job

    bqueues
    Displays information about the queues (different nodes that can run programs)

  • Cluster modules

    Modules exist for specific software packages on the cluster

    Sets environment parameters to correctly run the software

    Path: tells the OS where to find programs

  • Using WinSCP

    WinSCP is an FTP/SFTP program for transferring files

    Useful for transferring files between the cluster and your computer

    Login credentials are the same as with PuTTY

  • WinSCP

    Create stored sessions on your personal computers

  • FastQC exercises

  • SAM (Sequence Alignment/Map)

    Standard format for alignment data
    Tab delimited text format
    Header lines and alignment lines

    All header lines start with “@”
    Header contains metadata about alignments
    One line per alignment

    BAM is a binary form of SAM
    http://samtools.sourceforge.net/SAM1.pdf

  • SAM header

    @HD VN:1.0 SO:coordinate

    @SQ SN:chr1 LN:249250621

    @SQ SN:chr10 LN:135534747

    @SQ SN:chr11 LN:135006516

    @SQ SN:chr12 LN:133851895

    @SQ SN:chr13 LN:115169878

    @SQ SN:chr14 LN:107349540

    @SQ SN:chr15 LN:102531392

    @SQ SN:chr16 LN:90354753

    @SQ SN:chr17 LN:81195210

    @SQ SN:chr18 LN:78077248

    @SQ SN:chr19 LN:59128983

    @SQ SN:chr2 LN:243199373

    @SQ SN:chr20 LN:63025520

    @SQ SN:chr21 LN:48129895

    @SQ SN:chr22 LN:51304566

    @SQ SN:chr3 LN:198022430

    @SQ SN:chr4 LN:191154276

    @SQ SN:chr5 LN:180915260

    @SQ SN:chr6 LN:171115067

    @SQ SN:chr7 LN:159138663

    @SQ SN:chr8 LN:146364022

    @SQ SN:chr9 LN:141213431

    @SQ SN:chrX LN:155270560

    @SQ SN:chrY LN:59373566

    @HD is the header line. Shows sort order

    @SQ are the sequence dictionary lines. Show what sequences the reads were aligned to.

    Other lines specified in SAM format document
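A small sketch of reading these header lines in Python, assuming tab-separated fields of the form TAG:VALUE as laid out in the SAM format document. The function name `parse_sam_header` is illustrative, and the example reuses three of the header lines shown above.

```python
def parse_sam_header(lines):
    """Return {record type: [{tag: value, ...}, ...]} for the '@' header lines."""
    header = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("@"):
            break  # header lines come first; stop at the first alignment line
        fields = line.split("\t")
        record_type = fields[0][1:]  # "HD", "SQ", ...
        if record_type == "CO":
            # Comment lines hold free text, not TAG:VALUE pairs.
            header.setdefault("CO", []).append("\t".join(fields[1:]))
            continue
        tags = dict(field.split(":", 1) for field in fields[1:])
        header.setdefault(record_type, []).append(tags)
    return header


example = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:249250621",
    "@SQ\tSN:chrY\tLN:59373566",
]
header = parse_sam_header(example)
# Sequence dictionary as {name: length}.
chrom_lengths = {sq["SN"]: int(sq["LN"]) for sq in header["SQ"]}
```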

  • SAM alignment section

    Col  Field  Type    Brief description
    1    QNAME  String  Query template NAME
    2    FLAG   Int     bitwise FLAG
    3    RNAME  String  Reference sequence NAME
    4    POS    Int     1-based leftmost mapping POSition
    5    MAPQ   Int     MAPping Quality (sometimes)
    6    CIGAR  String  CIGAR string
    7    RNEXT  String  Ref. name of the mate/next segment
    8    PNEXT  Int     Position of the mate/next segment
    9    TLEN   Int     observed Template LENgth
    10   SEQ    String  segment SEQuence
    11   QUAL   String  ASCII of Phred-scaled base QUALity+33

    42JV5AAXX_HWI-EAS229_1:6:87:886:1289 272 chr1 11320 1 76M * 0 0 TTGCTTACTGTATAGTGGTGGCACGCCGCCTGCTGGCAGCTAGGGACATTGCAGGGTCCT…62664(1666646648848668688888856488868666886.6468886

  • bitwise FLAG

    FLAG   Description
    1      Read is paired
    2      Both paired reads mapped
    4      Read unmapped
    8      Mate unmapped
    16     Read reverse strand
    32     Mate reverse strand
    64     First in pair
    128    Second in pair
    256    Not primary alignment
    512    Not passing quality controls
    1024   PCR or optical duplicate

    http://picard.sourceforge.net/explain-flags.html
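Because the FLAG is a sum of the bit values listed above, it can be decoded with a bitwise AND against each value. The sketch below uses the descriptions from the table; `explain_flag` is an illustrative name. The value 272 is the FLAG of the example alignment line in the SAM alignment section.

```python
FLAG_BITS = [
    (1, "Read is paired"),
    (2, "Both paired reads mapped"),
    (4, "Read unmapped"),
    (8, "Mate unmapped"),
    (16, "Read reverse strand"),
    (32, "Mate reverse strand"),
    (64, "First in pair"),
    (128, "Second in pair"),
    (256, "Not primary alignment"),
    (512, "Not passing quality controls"),
    (1024, "PCR or optical duplicate"),
]


def explain_flag(flag):
    """List the descriptions of every bit that is set in a SAM FLAG."""
    return [description for bit, description in FLAG_BITS if flag & bit]


# 272 = 256 + 16
print(explain_flag(272))
```

The same test works for filtering: `flag & 4` is non-zero exactly when the read is unmapped.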

  • CIGAR string

    Symbol  Description
    M       alignment match (can be a sequence match or mismatch)
    I       insertion to the reference
    D       deletion from the reference
    N       skipped region from the reference
    S       soft clipping (clipped sequences present in SEQ)
    H       hard clipping (clipped sequences NOT present in SEQ)
    P       padding (silent deletion from padded reference)
    =       sequence match
    X       sequence mismatch
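A CIGAR string is a run of length/symbol pairs, so it can be split with a regular expression. The sketch below also adds up how much of the reference and of the read an alignment covers, following the SAM format document: M, D, N, = and X advance along the reference, while M, I, S, = and X advance along the read. The function names and the spliced example are illustrative.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CIGAR_FULL = re.compile(r"(?:\d+[MIDNSHP=X])+")
CONSUMES_REFERENCE = frozenset("MDN=X")
CONSUMES_READ = frozenset("MIS=X")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, symbol) pairs; '*' means unavailable."""
    if cigar == "*":
        return []
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar):
    """Number of reference bases covered, including skipped (N) regions."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def read_length(cigar):
    """Number of read bases present in SEQ."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)


# A made-up spliced read: 30 bases, a 1000-base intron, then 46 bases.
print(parse_cigar("30M1000N46M"))
```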

  • SAM optional fields

    Aligner specific information added to reads
    Some fields are specified in SAM format

    Good for filtering reads

    For Tophat:
    AS:i:-1 XN:i:0 XM:i:1 XO:i:0
    XG:i:0 NM:i:1 MD:Z:25A50 YT:Z:UU NH:i:4 CC:Z:chr15 CP:i:102519634
    HI:i:0
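Each optional field has the form TAG:TYPE:VALUE, which makes filtering straightforward once the fields are in a dictionary. The sketch below parses the Tophat fields shown above and checks NH, which the SAM format document defines as the number of reported alignments for the read. `parse_optional_fields` is an illustrative name; only the integer ("i") and float ("f") types are converted, and everything else is kept as text.

```python
def parse_optional_fields(fields):
    """Turn ['NH:i:4', 'MD:Z:25A50', ...] into {'NH': 4, 'MD': '25A50', ...}."""
    tags = {}
    for field in fields:
        tag, type_code, value = field.split(":", 2)
        if type_code == "i":
            value = int(value)
        elif type_code == "f":
            value = float(value)
        tags[tag] = value
    return tags


# In a real SAM line these fields are tab-separated columns 12 and onward.
example = (
    "AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:25A50 "
    "YT:Z:UU NH:i:4 CC:Z:chr15 CP:i:102519634 HI:i:0"
).split()
tags = parse_optional_fields(example)
is_uniquely_mapped = tags["NH"] == 1  # this read has 4 reported alignments
```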

  • SAM/BAM exercises

    Break

  • GFF (General Feature Format)
    GTF (Gene Transfer Format)

    File formats to store gene structures

    1. seqname - chromosome
    2. source - The program that generated this feature
    3. feature - Feature name. "CDS", "start_codon", "stop_codon", "exon"
    4. start - Starting position. (1-based)
    5. end - The ending position of the feature (inclusive)
    6. score - A score between 0 and 1000, or “.”
    7. strand - '+', '-', or '.'
    8. frame - For coding exons, number between 0-2 that represents the reading frame of the first base. Otherwise, '.'

  • GFF (General Feature Format)

    GFF2
    9. group – All lines with the same group are linked together into a single item. Usually a single string.

    GFF3
    9. attributes – tag=value format

    ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
    ctg123 . TF_binding_site 1000 1012 . + . Parent=gene00001
    ctg123 . exon 1300 1500 . + . Parent=gene00001
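The GFF3 attributes column is a semicolon-separated list of tag=value pairs, and the Parent tag links features to the ID of the feature they belong to. The sketch below parses the three example lines above (with tabs between columns, as in a real file) and groups child features under their parent. It is simplified on purpose: URL-escaped characters and comma-separated multiple parents are not handled, and `parse_gff3_attributes` is an illustrative name.

```python
def parse_gff3_attributes(column9):
    """Parse 'ID=gene00001;Name=EDEN' into {'ID': 'gene00001', 'Name': 'EDEN'}."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, value = pair.split("=", 1)
        attributes[key] = value
    return attributes


lines = [
    "ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN",
    "ctg123\t.\tTF_binding_site\t1000\t1012\t.\t+\t.\tParent=gene00001",
    "ctg123\t.\texon\t1300\t1500\t.\t+\t.\tParent=gene00001",
]

# Map each parent ID to the feature types that point at it.
children = {}
for line in lines:
    columns = line.split("\t")
    attributes = parse_gff3_attributes(columns[8])
    if "Parent" in attributes:
        children.setdefault(attributes["Parent"], []).append(columns[2])
```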

  • GTF (Gene Transfer Format)

    9. attribute – gene_id “…”; transcript_id “…”

    gene_id: A globally unique identifier for the genomic source of the sequence.

    transcript_id: A globally unique identifier for the predicted transcript.

    Most commonly used format for recording gene structures for use in NGS applications.

  • GTF (Gene Transfer Format)

    Necessary to determine which reads map to which genes/transcripts

    Coordinates change between genome builds - make sure your GTF was created from the same genome build you used for alignment!

    UCSC Table Browser is the simplest place to obtain GTF files

  • GTF (Gene Transfer Format)

    chr1 hg19_knownGene exon 13221 14409 0.000000 + . gene_id "uc010nxr.1"; transcript_id "uc010nxr.1";

    chr1 hg19_knownGene start_codon 12190 12192 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";

    chr1 hg19_knownGene CDS 12190 12227 0.000000 + 0 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
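A sketch of parsing one GTF line in Python, assuming nine tab-separated columns and quoted attribute values as in the lines above. `parse_gtf_line` is an illustrative name, and unquoted attribute values (which some GTF files use for numbers) are not handled.

```python
import re

# Matches attribute pairs such as: gene_id "uc010nxr.1";
GTF_ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_line(line):
    """Split one GTF line into a dict with typed coordinates and attributes."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
    seqname, source, feature, start, end, score, strand, frame, attrs = columns
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # 1-based
        "end": int(end),      # inclusive
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": dict(GTF_ATTRIBUTE.findall(attrs)),
    }


line = (
    "chr1\thg19_knownGene\texon\t13221\t14409\t0.000000\t+\t.\t"
    'gene_id "uc010nxr.1"; transcript_id "uc010nxr.1";'
)
record = parse_gtf_line(line)
# Both ends are included, so the length is end - start + 1.
exon_length = record["end"] - record["start"] + 1
```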

  • How to get a GTF file

  • BED

    File format for defining genomic intervals

    Required fields:
    1. chrom – chromosome
    2. chromStart – starting position (0-based)
    3. chromEnd – ending position (not included in the feature)
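BED and GTF count positions differently, which is a common source of off-by-one errors. BED starts are 0-based and the end position is not part of the feature, while GTF starts are 1-based and the end is inclusive. The sketch below converts between the two; the function names are illustrative, and the coordinates are those of the exon from the GTF example.

```python
def gtf_to_bed_interval(start, end):
    """Convert a 1-based inclusive (GTF) interval to 0-based half-open (BED)."""
    return start - 1, end


def bed_to_gtf_interval(chrom_start, chrom_end):
    """Convert a 0-based half-open (BED) interval to 1-based inclusive (GTF)."""
    return chrom_start + 1, chrom_end


def bed_length(chrom_start, chrom_end):
    """Feature length in BED coordinates is simply end minus start."""
    return chrom_end - chrom_start


bed_start, bed_end = gtf_to_bed_interval(13221, 14409)
print(bed_start, bed_end, bed_length(bed_start, bed_end))
```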

  • BED

    Optional fields:
    4. name – Name of the BED line.
    5. score - A score between 0 and 1000, or “.”
    6. strand - Defines the strand - either '+' or '-'.
    7. thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays).
    8. thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).
    9. itemRgb - An RGB value for display.
    10. blockCount - The number of blocks (exons) in the BED line.
    11. blockSizes - A comma-separated list of the block sizes.
    12. blockStarts - A comma-separated list of block starts.

  • How to get a BED file from the UCSC Table browser

  • BED exercises

  • Questions? Minute cards!

    • What is Unix?
    • Tufts Cluster
    • Unix Introduction
    • File formats
      – FASTA
      – FASTQ
      – SAM/BAM
      – GFF/GTF
      – BED