RNA-Seqsites.tufts.edu/cbi/files/2013/02/Unix_intro.pdf · 2013. 2. 4. · Phred Prob. Incorrect...

RNA-Seq

Joshua Ainsley, PhD
Postdoctoral Researcher

Lab of Leon Reijmers
Neuroscience Department

Tufts [email protected]

Day two

Intro to Unix
File formats

RNA-Seq QC

Lecture outline

• What is Unix?
• Tufts Cluster
• Unix Introduction
• File formats
  – FASTA
  – FASTQ
  – SAM/BAM
  – GFF/GTF
  – BED

What is Unix?

An operating system

Why does bioinformatics use Unix?

Open source fits with academic ideals

Best software development environment

Programming languages already installed and configured - difficult on Windows

Why does bioinformatics use Unix?

Software is free and easily available

Shell tools – bioinformatics is mostly about processing text in some way

Mac OS is based on Unix and (for the most part) works the same

Unix directory structure

Unix components

Kernel
The operating system. Allocates hardware resources in response to software and user requests.

Shell
The interface between the user and the kernel. We will use the Bash shell.

Files and Processes

Everything in Unix is a file or a process

A file is a destination for or a source of data (this includes directories, the screen, printers)

A process is a program that is running

A file stores the instructions for a process, and a process may interact with files

Interacting with Unix

Text-based command line

You type in commands and the OS assigns resources to run the appropriate process(es) and file(s)

$ command -options targets

Tufts High-performance computing research cluster

172 RedHat 6 systems
8-16 cores/node
16-128 GB RAM/node

Access using ssh (secure shell)
Terminal on Mac/Linux
PuTTY on Windows

Why use a cluster?

NGS data is big and getting bigger

Your desktop/laptop isn’t powerful enough

Run many simultaneous programs (jobs)

Cloud computing from Amazon, Illumina (computer rental)

Lecture outline



Login to Tufts Cluster

cluster.uit.tufts.edu

Windows Putty.exe

Mac Terminal

ssh [email protected]

Your first Unix commands

$ ls
lists the contents of a directory

$ pwd
shows your current directory

$ whoami
shows your user name

$ touch file.txt
creates an empty file, or updates the timestamp of an existing one (requires a target)

Command manuals

$ man
shows the manual for a target command

$ man ls
navigate with arrow keys, PgUp, PgDn
press “q” to exit

Some useful options for ls:
-l -a -t -S

Options can be combined:
-lat

Editing files in Unix

Requires a text editor. We’ll use nano.

$ nano

Unix intro exercises

Lecture outline



Important file formats

FASTA – nucleotide sequence
FASTQ – sequence + quality information
SAM/BAM – alignments
GFF/GTF – transcript information
BED – misc. feature coordinates

FASTA

>gi|212549564|ref|NM_015981.3| Homo sapiens calcium/calmodulin-dependent protein kinase II alpha (CAMK2A), transcript variant 1, mRNA
GGTTGCCATGGGGACCTGGATGCTGACGAAGGCTCGCGAGGCTGTGAGCAGCCACAGTGCCCTGCTCAGAAGCCCCGGGCTCGTCAGTCAAACCGGTTCTCTGTTTGCACTCGGCAGCACGGGCAGGCAAGTGGTCCCTAGGTTCGGGAGCAGAGCAGCAGCGCCTCAGTCCTGGTCCCCCAGTCCCAAGCCTCACCTGCCTGCCCAGCGCCAGGATGGCCACCATCACCTGCACCCGCTTCACGGAAGAGTACCAGCTCTTCGAGGAATTGGGCAAGGGAGCCTTCTCGGTGGTGCGAAGGTGTGTGAAGGTGCTGGCTGGCCAGGAGTATGCTGCCAAGATCATCAACACAAAGAAGCTGTCAGCCAGAGACCATCAGAAGCTGGAGCGTGAAGCCCGCATCTGCCGCCTGCTGAAGCACCCCAACATCGTCCGACTACATGACAGCATCTCAGAGGAGGGACACCACTACCTGATCTTCGACCTGGTCACTGGTGGGGAACTGTTTGAAGATATCGTGGCCCGGGAGTATTACAGTGAGGCGGATGCCAGTCACTGTATCCAGCAGATCCTGGAGGCTGTGCTGCACTGCCACCAGATGGGGGTGGTGCACCGGGACCTGAAGCCTG
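A FASTA record is one header line starting with ">" followed by one or more sequence lines. As a minimal sketch of how that structure can be used, the following counts records and sums the sequence length of each one. The file contents and the record names recA and recB are made up for illustration.

```shell
# Hypothetical two-record FASTA; the second sequence is wrapped over two lines.
fasta=$(mktemp)
printf '>recA\nACGTAC\n>recB\nGGGG\nCC\n' > "$fasta"

# Count records: every record has exactly one header line starting with ">".
n_records=$(grep -c '^>' "$fasta")

# Sum the sequence length per record, joining wrapped lines.
lengths=$(awk '
    /^>/ { if (name != "") print name, len      # flush the previous record
           name = substr($1, 2); len = 0; next }
         { len += length($0) }                   # sequence line
    END  { if (name != "") print name, len }
' "$fasta")

echo "$n_records records"
echo "$lengths"
rm -f "$fasta"
```

Both records here come out with a length of 6, even though recB is split across two lines.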

Tab completion

Save time typing and reduce spelling errors by using tab completion.

Type in enough letters to uniquely identify a command/file/path, and press tab.

Unix will automatically fill in the rest.

If pressing tab does nothing, what you have typed is not enough.

Press tab twice to see a list of possible matches.

Text manipulation commands

less - displays part of a file
head - display beginning of file
tail - display end of file
sort - sorts a file
cut - select columns of a file
tr - replace or remove characters
grep - searches a file
sed - stream editor, edits a file line by line
awk - programming language, very useful for advanced text manipulation
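A few of these commands in action, as a small sketch. The file is a throwaway FASTA created on the fly, and its sequence names are invented for the example.

```shell
# Create a small hypothetical FASTA file to practise on (names are made up).
demo_fa=$(mktemp)
printf '>seq1 sample one\nACGTACGT\n>seq2 sample two\nGGCCTTAA\n' > "$demo_fa"

# grep: keep only the header lines (those starting with ">").
# cut: take the first space-delimited column.
# tr: delete the ">" character.
ids=$(grep '^>' "$demo_fa" | cut -d ' ' -f 1 | tr -d '>')

# head / tail: first line and last line of the file.
first_line=$(head -n 1 "$demo_fa")
last_line=$(tail -n 1 "$demo_fa")

# sort -r: the IDs in reverse alphabetical order.
sorted_ids=$(printf '%s\n' "$ids" | sort -r)

echo "$ids"
rm -f "$demo_fa"
```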

Handling text output

>
Redirects output to a target (overwrites)

>>
Appends output to a target

|
“Pipe” - Sends output to another program. Very useful for multi-step text manipulation.
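The difference between the three operators is easiest to see side by side. This sketch writes to a temporary file, so nothing in your own directory is touched.

```shell
out=$(mktemp)

# ">" creates or overwrites the target file.
echo "first" > "$out"
echo "second" > "$out"          # "first" is gone now

# ">>" appends to the end of the target instead.
echo "third" >> "$out"

# "|" feeds the output of one program into the next:
# print the file, count its lines, strip any padding spaces.
n_lines=$(cat "$out" | wc -l | tr -d ' ')

contents=$(cat "$out")
echo "$contents"
rm -f "$out"
```

The file ends up with two lines, "second" and "third", because the second ">" replaced what the first one wrote.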

FASTA exercises
Minute cards

Break

FASTQ

Stores sequence information and the quality scores associated with the sequence

Quality represented as a Phred score

Q = -10 × log10(P)

where Q = quality and P = error probability

Phred   Prob. Incorrect   Accuracy
10      1 in 10           90%
20      1 in 100          99%
30      1 in 1000         99.9%
40      1 in 10000        99.99%
50      1 in 100000       99.999%
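The table follows from the standard Phred definition, Q = -10 × log10(P), or equivalently P = 10^(-Q/10). A small sketch of both conversions using awk; the function names phred_to_prob and prob_to_phred are made up for this example.

```shell
# Error probability for a Phred score: P = 10^(-Q/10).
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Phred score for an error probability: Q = -10 * log10(P).
# awk only provides the natural log, so divide by log(10).
prob_to_phred() {
    awk -v p="$1" 'BEGIN { printf "%.0f\n", -10 * log(p) / log(10) }'
}

phred_to_prob 20       # prints 0.01, i.e. 1 in 100
prob_to_phred 0.001    # prints 30
```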

FASTQ

To save space, Phred scores are represented as ASCII characters

Phred scales

Sanger format
Phred+33 from 0 to 40

Illumina 1.0/Solexa
Phred+64 from -5 to 40

Illumina 1.3+
Phred+64 from 3 to 40

Illumina 1.8+
Phred+33 from 0 to 41

Look for “B” tails (Phred+64) or “#” tails (Phred+33): runs of these characters at the ends of reads mark low-quality bases.
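Decoding a quality character means taking its ASCII code and subtracting the offset of the encoding (33 or 64). A minimal sketch, with a helper name (qual_char_to_phred) invented for the example:

```shell
# Decode one quality character: Phred = ASCII code - offset.
# The offset (33 or 64) depends on which encoding the file uses.
qual_char_to_phred() {
    char=$1
    offset=$2
    # printf with a leading quote prints the ASCII code of the character.
    code=$(printf '%d' "'$char")
    echo $(( code - offset ))
}

qual_char_to_phred 'I' 33    # prints 40
qual_char_to_phred 'h' 64    # prints 40
qual_char_to_phred '#' 33    # prints 2
qual_char_to_phred 'B' 64    # prints 2
```

The last two calls show why "#" and "B" tails are equivalent: each decodes to a Phred score of 2 in its own encoding.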

FASTQ

@42JV5AAXX_HWI-EAS229_1:6:87:886:1289
CTACACCTTGAGCAAGAGGACCCTGCAATGTCCCTAGCTGCCAGCAGGCGGC
+
B?6B@@ABB@A;AB@@>B?@@@@?AA@A@@BBA5C>>?>?7;

First line = unique identifier (starts with “@”)
Second line = sequence
Third line = spacer (may repeat identifier, starts with “+”)
Fourth line = quality scores
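Because every record is exactly four lines, the line number modulo 4 identifies each part of a record. The sketch below uses that to count reads and to check that each quality string is as long as its sequence. The two reads are invented, and the second one is deliberately broken.

```shell
# Hypothetical two-read FASTQ file for illustration.
fastq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCCA\n+\nII#I\n' > "$fastq"

# Line number modulo 4 tells us which part of the record we are in.
report=$(awk '
    NR % 4 == 1 { id = substr($0, 2) }           # identifier without "@"
    NR % 4 == 2 { seq_len = length($0) }         # sequence
    NR % 4 == 0 {                                # quality string
        status = (length($0) == seq_len) ? "ok" : "length_mismatch"
        print id, seq_len, status
    }
' "$fastq")

# Number of reads = number of lines divided by four.
n_reads=$(( $(wc -l < "$fastq") / 4 ))

echo "$report"
echo "$n_reads reads"
rm -f "$fastq"
```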

FastQC

FastQC is used to generate summary information about FASTQ sequences

You will use this every time you receive RNA-Seq data

Babraham Bioinformatics
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

FastQC – Basic Statistics

FastQC – Per base sequence quality

FastQC – Per base sequence content

FastQC – Sequence duplication levels

FastQC – Kmer content

What’s wrong here? #1


Random hexamer bias at start

AT rich due to mitochondrial and polyA reads

Solution: remove the reads before analysis


Adapter sequences due to short insert

Solution: trim the reads before mapping


This sample is almost entirely adapter-adapter ligation products

Solution: None. Data unusable.
Add the ligase last!!

LSF (Load Sharing Facility)

Resource allocator - how programs are assigned to nodes in the cluster

You interact with the head node, along with everyone else

Programs should not be run on the head node

Submitting jobs to the cluster

bsub
Submits a job to the cluster

bjobs
Displays information about current jobs

bkill
Stops a job

bqueues
Displays information about the queues (different nodes that can run programs)
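A rough sketch of what a bsub submission can look like. It only prints the command line rather than running it, so it works on a machine without LSF. The queue name example_queue, the job name and the fastqc command are placeholders; check bqueues for the queues that actually exist on your cluster.

```shell
# Build (but do not run) a bsub command line. The queue name and the
# command passed in are placeholders, not real cluster settings.
make_bsub_cmd() {
    queue=$1
    job_name=$2
    shift 2
    # -q: queue, -J: job name, -o / -e: files for standard output / error.
    # LSF replaces %J with the job ID.
    echo "bsub -q $queue -J $job_name -o $job_name.%J.out -e $job_name.%J.err $*"
}

make_bsub_cmd example_queue qc_run fastqc sample.fastq
```

Once a real job has been submitted, bjobs lists it along with its job ID, and bkill followed by that ID stops it.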

Cluster modules

Modules exist for specific software packages on the cluster

Sets environment parameters to correctly run the software

Path: tells the OS where to find programs

Using WinSCP

WinSCP is an FTP/SFTP program for transferring files

Useful for transferring files between the cluster and your computer

Login credentials are the same as with PuTTY

WinSCP

Create stored sessions on your personal computers

FastQC exercises

SAM (Sequence Alignment/Map)

Standard format for alignment data
Tab delimited text format
Header lines and alignment lines

All header lines start with “@”
Header contains metadata about alignments
One line per alignment

BAM is a binary form of SAM
http://samtools.sourceforge.net/SAM1.pdf

SAM header

@HD VN:1.0 SO:coordinate

@SQ SN:chr1 LN:249250621

@SQ SN:chr10 LN:135534747

@SQ SN:chr11 LN:135006516

@SQ SN:chr12 LN:133851895

@SQ SN:chr13 LN:115169878

@SQ SN:chr14 LN:107349540

@SQ SN:chr15 LN:102531392

@SQ SN:chr16 LN:90354753

@SQ SN:chr17 LN:81195210

@SQ SN:chr18 LN:78077248

@SQ SN:chr19 LN:59128983

@SQ SN:chr2 LN:243199373

@SQ SN:chr20 LN:63025520

@SQ SN:chr21 LN:48129895

@SQ SN:chr22 LN:51304566

@SQ SN:chr3 LN:198022430

@SQ SN:chr4 LN:191154276

@SQ SN:chr5 LN:180915260

@SQ SN:chr6 LN:171115067

@SQ SN:chr7 LN:159138663

@SQ SN:chr8 LN:146364022

@SQ SN:chr9 LN:141213431

@SQ SN:chrX LN:155270560

@SQ SN:chrY LN:59373566

@HD is the header line. Shows sort order

@SQ are the sequence dictionary lines. Show what sequences the reads were aligned to.

Other lines specified in SAM format document

SAM alignment section

Col  Field  Type    Brief description
1    QNAME  String  Query template NAME
2    FLAG   Int     bitwise FLAG
3    RNAME  String  Reference sequence NAME
4    POS    Int     1-based leftmost mapping POSition
5    MAPQ   Int     MAPping Quality (sometimes)
6    CIGAR  String  CIGAR string
7    RNEXT  String  Ref. name of the mate/next segment
8    PNEXT  Int     Position of the mate/next segment
9    TLEN   Int     observed Template LENgth
10   SEQ    String  segment SEQuence
11   QUAL   String  ASCII of Phred-scaled base QUALity+33

42JV5AAXX_HWI-EAS229_1:6:87:886:1289 272 chr1 11320 1 76M * 0 0 TTGCTTACTGTATAGTGGTGGCACGCCGCCTGCTGGCAGCTAGGGACATTGCAGGGTCCT…62664(1666646648848668688888856488868666886.6468886
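Since SAM is tab-delimited text, the columns can be pulled out with awk. The sketch below skips header lines and keeps alignments with a MAPQ of at least 10. The threshold and the two reads (with shortened SEQ/QUAL columns) are arbitrary choices for illustration.

```shell
# Hypothetical SAM snippet: one header line and two alignment lines,
# with shortened SEQ/QUAL columns.
sam=$(mktemp)
{
    printf '@SQ\tSN:chr1\tLN:249250621\n'
    printf 'readA\t272\tchr1\t11320\t1\t8M\t*\t0\t0\tTTGCTTAC\t62664166\n'
    printf 'readB\t0\tchr1\t20500\t50\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n'
} > "$sam"

# Skip header lines (they start with "@"), then print
# QNAME, RNAME, POS and MAPQ for reads with MAPQ of at least 10.
good=$(awk -F '\t' '!/^@/ && $5 >= 10 { print $1, $3, $4, $5 }' "$sam")

# Count alignment lines: everything that is not a header line.
n_alignments=$(grep -vc '^@' "$sam")

echo "$good"
rm -f "$sam"
```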

bitwise FLAG

FLAG  Description
1     Read is paired
2     Both paired reads mapped
4     Read unmapped
8     Mate unmapped
16    Read reverse strand
32    Mate reverse strand
64    First in pair
128   Second in pair
256   Not primary alignment
512   Not passing quality controls
1024  PCR or optical duplicate

http://picard.sourceforge.net/explain-flags.html
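A FLAG value is the sum of the individual bits that are set, so each property can be tested with a bitwise AND. A minimal sketch that decodes a few of the bits; the function names and the short labels are invented for the example.

```shell
# Test an individual bit of a SAM FLAG with bitwise AND.
flag_has() {
    [ $(( $1 & $2 )) -ne 0 ]
}

# Describe a subset of the bits from the FLAG table.
describe_flag() {
    flag=$1
    desc=""
    flag_has "$flag" 4    && desc="$desc unmapped"
    flag_has "$flag" 16   && desc="$desc reverse_strand"
    flag_has "$flag" 256  && desc="$desc not_primary"
    flag_has "$flag" 1024 && desc="$desc duplicate"
    echo "${desc# }"      # drop the leading space
}

describe_flag 272    # 272 = 256 + 16
```

For the FLAG of 272 in the example alignment line, this prints "reverse_strand not_primary".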

CIGAR string

Symbol  Description
M       alignment match (can be a sequence match or mismatch)
I       insertion to the reference
D       deletion from the reference
N       skipped region from the reference
S       soft clipping (clipped sequences present in SEQ)
H       hard clipping (clipped sequences NOT present in SEQ)
P       padding (silent deletion from padded reference)
=       sequence match
X       sequence mismatch
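A CIGAR string is a series of length-plus-symbol pairs. Per the SAM specification, M, D, N, = and X consume the reference, while M, I, S, = and X consume the read. This sketch adds up both totals; the function name cigar_lengths and the second example string are made up.

```shell
# Sum CIGAR operation lengths. Operations M, D, N, = and X consume the
# reference; M, I, S, = and X consume the read sequence.
cigar_lengths() {
    awk -v cigar="$1" 'BEGIN {
        ref = 0; query = 0
        # Peel one <number><symbol> pair off the front per iteration.
        while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
            n  = substr(cigar, 1, RLENGTH - 1) + 0
            op = substr(cigar, RLENGTH, 1)
            if (index("MDN=X", op)) ref += n
            if (index("MIS=X", op)) query += n
            cigar = substr(cigar, RLENGTH + 1)
        }
        print ref, query
    }'
}

cigar_lengths 76M                # prints: 76 76
cigar_lengths 5S30M1000N40M1I    # prints: 1070 76
```

The second string describes a 76-base read that spans 1070 bases of the reference, because the N operation skips a 1000-base region such as an intron.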

SAM optional fields

Aligner specific information added to reads
Some fields are specified in SAM format

Good for filtering reads

For Tophat:
AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:25A50 YT:Z:UU NH:i:4 CC:Z:chr15 CP:i:102519634 HI:i:0
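Optional fields start at column 12 and have the form TAG:TYPE:VALUE, which makes them easy to pull out for filtering. A sketch that extracts one tag from an alignment line; the helper name get_tag and the alignment line itself are invented, and values that themselves contain ":" would be cut short by this simple split.

```shell
# Print the value of one optional tag (e.g. NH) from SAM alignment lines
# read on standard input. Optional fields start at column 12.
get_tag() {
    tag=$1
    awk -F '\t' -v tag="$tag" '{
        for (i = 12; i <= NF; i++) {
            split($i, parts, ":")
            if (parts[1] == tag) { print parts[3]; exit }
        }
    }'
}

# Hypothetical alignment line carrying three optional tags.
line=$(printf 'readA\t272\tchr1\t11320\t1\t8M\t*\t0\t0\tTTGCTTAC\t62664166\tNM:i:1\tMD:Z:3A4\tNH:i:4')

printf '%s\n' "$line" | get_tag NH    # prints 4
printf '%s\n' "$line" | get_tag MD    # prints 3A4
```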

SAM/BAM exercises
Break

GFF (General Feature Format)
GTF (Gene Transfer Format)

File formats to store gene structures
1. seqname - chromosome
2. source - The program that generated this feature
3. feature - Feature name. "CDS", "start_codon", "stop_codon", "exon"
4. start - Starting position. (1-based)
5. end - The ending position of the feature (inclusive)
6. score - A score between 0 and 1000, or “.”
7. strand - '+', '-', or '.'
8. frame - For coding exons, number between 0-2 that represents the reading frame of the first base. Otherwise, '.'

GFF (General Feature Format)

GFF2
9. group – All lines with the same group are linked together into a single item. Usually a single string.

GFF3
9. attributes – tag=value format

ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
ctg123 . TF_binding_site 1000 1012 . + . Parent=gene00001
ctg123 . exon 1300 1500 . + . Parent=gene00001

GTF (Gene Transfer Format)

9. attribute – gene_id “…”; transcript_id “…”
gene_id: A globally unique identifier for the genomic source of the sequence.
transcript_id: A globally unique identifier for the predicted transcript.

Most commonly used format for recording gene structures for use in NGS applications.

GTF (Gene Transfer Format)

Necessary to determine which reads map to which genes/transcripts

Coordinates change between genome builds - make sure your GTF was created from the same genome build you used for alignment!

UCSC Table Browser is the simplest place to obtain GTF files

GTF (Gene Transfer Format)

chr1 hg19_knownGene exon 13221 14409 0.000000 + . gene_id "uc010nxr.1"; transcript_id "uc010nxr.1";

chr1 hg19_knownGene start_codon 12190 12192 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";

chr1 hg19_knownGene CDS 12190 12227 0.000000 + 0 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
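Two things you will often want from a GTF line are the feature length and the transcript it belongs to. This sketch works on a single line modelled on the exon record above and assumes the columns are separated by tabs, as they are in a real GTF file.

```shell
# One GTF line modelled on the exon record above (tab-delimited).
gtf_line=$(printf 'chr1\thg19_knownGene\texon\t13221\t14409\t0.000000\t+\t.\tgene_id "uc010nxr.1"; transcript_id "uc010nxr.1";')

# Coordinates are 1-based and inclusive, so length = end - start + 1.
feature_len=$(printf '%s\n' "$gtf_line" | awk -F '\t' '{ print $5 - $4 + 1 }')

# Pull the quoted value that follows transcript_id in column 9.
transcript=$(printf '%s\n' "$gtf_line" |
    awk -F '\t' '{
        if (match($9, /transcript_id "[^"]*"/)) {
            value = substr($9, RSTART, RLENGTH)
            gsub(/transcript_id |"/, "", value)   # strip the key and quotes
            print value
        }
    }')

echo "$feature_len $transcript"
```

For this exon the result is a length of 1189 bases and the transcript ID uc010nxr.1.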

How to get a GTF file

BED

File format for defining genomic intervals

Required fields:
1. chrom – chromosome
2. chromStart – starting position (0-based)
3. chromEnd – ending position

BED

Optional fields:
4. name – Name of the BED line.
5. score - A score between 0 and 1000, or “.”
6. strand - Defines the strand - either '+' or '-'.
7. thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays).
8. thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).
9. itemRgb - An RGB value for display.
10. blockCount - The number of blocks (exons) in the BED line.
11. blockSizes - A comma-separated list of the block sizes.
12. blockStarts - A comma-separated list of block starts.
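GTF and BED count positions differently: GTF starts are 1-based with an inclusive end, while BED starts are 0-based and the end position is not part of the feature. Converting between them therefore means subtracting 1 from the start and leaving the end alone. A minimal sketch, in which the choice to reuse the feature type as the BED name and to set the score to 0 is arbitrary:

```shell
# Convert GTF-style coordinates (1-based, end inclusive) to BED-style
# coordinates (0-based start, end not included): subtract 1 from the
# start and leave the end unchanged.
gtf_to_bed() {
    # Output columns: chrom, chromStart, chromEnd, name, score, strand.
    awk -F '\t' -v OFS='\t' '{ print $1, $4 - 1, $5, $3, 0, $7 }'
}

printf 'chr1\thg19_knownGene\texon\t13221\t14409\t0.000000\t+\t.\tgene_id "uc010nxr.1";\n' |
    gtf_to_bed
```

The exon at 13221-14409 in GTF becomes 13220-14409 in BED. Its length is unchanged: 14409 - 13221 + 1 in GTF terms equals 14409 - 13220 in BED terms.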

How to get a BED file from the UCSC Table browser

BED exercises

Questions? Minute cards!


