RNA-Seq · sites.tufts.edu/cbi/files/2013/02/Unix_intro.pdf · 2013. 2. 4.
-
RNA-Seq
Joshua Ainsley, PhD
Postdoctoral Researcher
Lab of Leon Reijmers
Neuroscience Department
Tufts University
-
Day two
Intro to Unix
File formats
RNA-Seq QC
-
Lecture outline
• What is Unix?
• Tufts Cluster
• Unix Introduction
• File formats
– FASTA
– FASTQ
– SAM/BAM
– GFF/GTF
– BED
-
What is Unix?
An operating system
-
Why does bioinformatics use Unix?
Open source fits with academic ideals
Best software development environment
Programming languages already installed and configured - difficult on Windows
-
Why does bioinformatics use Unix?
Software is free and easily available
Shell tools – bioinformatics is mostly about processing text in some way
Mac OS is based on Unix and (for the most part) works the same
-
Unix directory structure
-
Unix components
Kernel: The operating system. Allocates hardware resources in response to software and user requests.
Shell: The interface between the user and the kernel. We will use the Bash shell.
-
Files and Processes
Everything in Unix is a file or a process
A file is a destination for or a source of data (this includes directories, the screen, printers)
A process is a program that is running
A file stores the instructions for a process, and a process may interact with files
-
Interacting with Unix
Text-based command line
You type in commands and the OS assigns resources to utilize the appropriate process(es) and file(s)
$ command -options targets
-
Tufts High-performance computing research cluster
172 RedHat 6 systems
8-16 cores/node
16-128 GB RAM/node
Access using ssh (secure shell)
Terminal on Mac/Linux
PuTTY on Windows
-
Why use a cluster?
NGS data is big and getting bigger
Your desktop or laptop isn’t powerful enough
Run many simultaneous programs (jobs)
Cloud computing from Amazon, Illumina (computer rental)
-
Login to Tufts Cluster
cluster.uit.tufts.edu
Windows: PuTTY (putty.exe)
Mac: Terminal
-
Your first Unix commands
$ ls
lists the contents of a directory
$ pwd
shows your current directory
$ whoami
shows your user name
$ touch file.txt
creates an empty file, or updates the timestamp of an existing one (requires a target)
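A minimal sketch of these first commands as a short session. The temporary directory (made with mktemp) and the variable name workdir are assumptions for illustration, so that nothing in your home directory changes.

```bash
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

pwd                 # prints the current directory (the temporary one)
whoami              # prints your user name
ls                  # prints nothing: the directory is still empty

touch file.txt      # creates an empty file named file.txt
ls                  # now prints: file.txt
```

Every line follows the same pattern: a command, then optional options, then optional targets.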
-
Command manuals
$ man
shows the manual for a target command
$ man ls
navigate with arrow keys, PgUp, PgDn
press “q” to exit
Some useful options for ls: -l -a -t -S
Options can be combined: -lat
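The ls options above can be tried on a small test directory. This is a sketch only; the file names and the variable demo are made up for the example.

```bash
demo=$(mktemp -d)
printf 'a\n' > "$demo/small.txt"
printf 'a much longer line of text\n' > "$demo/large.txt"
touch "$demo/.hidden"

ls "$demo"            # large.txt and small.txt; .hidden is not shown
ls -a "$demo"         # also lists hidden files (names starting with ".")
ls -S "$demo"         # sorted by size, largest first: large.txt, small.txt
ls -l -a -t "$demo"   # long listing, all files, newest first
ls -lat "$demo"       # the same three options written together
```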
-
Editing files in Unix
Requires a text editor. We’ll use nano.
$ nano
-
Unix intro exercises
-
Important file formats
FASTA – nucleotide sequence
FASTQ – sequence + quality information
SAM/BAM – alignments
GFF/GTF – transcript information
BED – misc. feature coordinates
-
FASTA
>gi|212549564|ref|NM_015981.3| Homo sapiens calcium/calmodulin-dependent protein kinase II alpha (CAMK2A), transcript variant 1, mRNA
GGTTGCCATGGGGACCTGGATGCTGACGAAGGCTCGCGAGGCTGTGAGCAGCCACAGTGCCCTGCTCAGAAGCCCCGGGCTCGTCAGTCAAACCGGTTCTCTGTTTGCACTCGGCAGCACGGGCAGGCAAGTGGTCCCTAGGTTCGGGAGCAGAGCAGCAGCGCCTCAGTCCTGGTCCCCCAGTCCCAAGCCTCACCTGCCTGCCCAGCGCCAGGATGGCCACCATCACCTGCACCCGCTTCACGGAAGAGTACCAGCTCTTCGAGGAATTGGGCAAGGGAGCCTTCTCGGTGGTGCGAAGGTGTGTGAAGGTGCTGGCTGGCCAGGAGTATGCTGCCAAGATCATCAACACAAAGAAGCTGTCAGCCAGAGACCATCAGAAGCTGGAGCGTGAAGCCCGCATCTGCCGCCTGCTGAAGCACCCCAACATCGTCCGACTACATGACAGCATCTCAGAGGAGGGACACCACTACCTGATCTTCGACCTGGTCACTGGTGGGGAACTGTTTGAAGATATCGTGGCCCGGGAGTATTACAGTGAGGCGGATGCCAGTCACTGTATCCAGCAGATCCTGGAGGCTGTGCTGCACTGCCACCAGATGGGGGTGGTGCACCGGGACCTGAAGCCTG
-
Tab completion
Save time typing and reduce spelling errors by using tab completion.
Type in enough letters to uniquely identify a command/file/path, and press tab.
Unix will automatically fill in the rest.
If pressing tab does nothing, what you have typed is not enough.
Press tab twice to see a list of possible matches.
-
Text manipulation commands
less - displays part of a file
head - display beginning of file
tail - display end of file
sort - sorts a file
cut - select columns of a file
tr - replace or remove characters
grep - searches a file
sed - stream editor, edits a file line by line
awk - programming language, very useful for advanced text manipulation
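A sketch of each command on a small tab-separated file. The three rows and the variable name table are invented for the example. less is left out because it is interactive.

```bash
table=$(mktemp)
printf 'chr2\t300\tgeneC\nchr1\t100\tgeneA\nchr1\t200\tgeneB\n' > "$table"

head -n 1 "$table"                  # first line of the file
tail -n 1 "$table"                  # last line of the file
sort -k1,1 -k2,2n "$table"          # by chromosome, then numerically by position
cut -f 3 "$table"                   # third tab-separated column only
grep 'chr1' "$table"                # lines that contain chr1
tr 'a-z' 'A-Z' < "$table"           # lower case to upper case
sed 's/^chr//' "$table"             # remove the leading "chr" from each line
awk -F '\t' '$2 > 150 {print $3}' "$table"   # names where column 2 exceeds 150
```

None of these commands change the file itself; they only print a modified copy to the screen.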
-
Handling text output
>
Redirects output to a target (overwrites)
>>
Appends output to a target
|
“Pipe” - Sends output to another program. Very useful for multi-step text manipulation.
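A short sketch of the three operators, building a tiny FASTA file and then counting its records. The records and the variable names are hypothetical.

```bash
out=$(mktemp)

echo ">seq1 hypothetical record" >  "$out"   # > creates or overwrites the file
echo "GGTTGCCATG"                >> "$out"   # >> appends to it
echo ">seq2 hypothetical record" >> "$out"
echo "ACGTAC"                    >> "$out"

# | sends the output of one program into the next:
# keep only the header lines, then count them.
n_records=$(grep '^>' "$out" | wc -l | tr -d ' ')
echo "$n_records"                            # 2

echo "overwritten" > "$out"                  # > replaces the previous contents
n_lines=$(wc -l < "$out" | tr -d ' ')
echo "$n_lines"                              # 1
```

Using > on an existing file deletes its old contents without asking, so use >> when you mean to add to a file.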
-
FASTA exercises
Minute cards
Break
-
FASTQ
Stores sequence information and quality scores associated with the sequence
Quality represented as a Phred score
Q = -10 log10(P)
where Q = quality and P = error probability
Phred   Prob. Incorrect   Accuracy
10      1 in 10           90%
20      1 in 100          99%
30      1 in 1000         99.9%
40      1 in 10000        99.99%
50      1 in 100000       99.999%
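The table follows from the standard Phred definition Q = -10 log10(P), or equivalently P = 10^(-Q/10). A small sketch that reproduces its rows with awk; the function names phred_to_prob and prob_to_phred are made up for this example.

```bash
# Error probability from a Phred score: P = 10^(-Q/10)
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Phred score from an error probability: Q = -10 * log10(P)
# awk only has the natural log, so log10(x) = log(x) / log(10).
prob_to_phred() {
    awk -v p="$1" 'BEGIN { printf "%.0f\n", -10 * log(p) / log(10) }'
}

phred_to_prob 20      # 0.01  (1 in 100)
phred_to_prob 30      # 0.001 (1 in 1000)
prob_to_phred 0.001   # 30
```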
-
FASTQ
To save space, Phred scores are represented as ASCII characters
-
Phred scales
Sanger format: Phred+33 from 0 to 40
Illumina 1.0/Solexa: Phred+64 from -5 to 40
Illumina 1.3+: Phred+64 from 3 to 40
Illumina 1.8+: Phred+33 from 0 to 41
Look for “B” tails or “#” tails.
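To see why the offset matters, here is a sketch that turns one quality character back into a score: take its ASCII code and subtract 33 or 64. The function name qual_to_phred is an assumption for illustration.

```bash
# Phred score of one quality character: ASCII code minus the offset (33 or 64).
qual_to_phred() {
    char=$1
    offset=$2
    code=$(printf '%d' "'$char")   # a leading ' makes printf print the character code
    echo $(( code - offset ))
}

qual_to_phred 'I' 33   # 40
qual_to_phred 'h' 64   # 40
qual_to_phred 'B' 64   # 2
qual_to_phred '#' 33   # 2
```

Both “B” under Phred+64 and “#” under Phred+33 decode to a score of 2, so a run of either character at the end of a read hints at which scale the file uses.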
-
FASTQ
@42JV5AAXX_HWI-EAS229_1:6:87:886:1289
CTACACCTTGAGCAAGAGGACCCTGCAATGTCCCTAGCTGCCAGCAGGCGGC
+
B?6B@@ABB@A;AB@@>B?@@@@?AA@A@@BBA5C>>?>?7;
First line = unique identifier (starts with “@”)
Second line = sequence
Third line = spacer (may repeat identifier, starts with “+”)
Fourth line = quality scores
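Because every record is four lines, simple line arithmetic is enough for basic checks. This sketch assumes the sequence is not wrapped over several lines; the two reads and the variable names are hypothetical.

```bash
fastq=$(mktemp)
cat > "$fastq" <<'EOF'
@read1 hypothetical
ACGTACGT
+
IIIIHHGG
@read2 hypothetical
TTGCA
+
IIB##
EOF

# Number of reads = number of lines / 4
n_lines=$(wc -l < "$fastq" | tr -d ' ')
n_reads=$(( n_lines / 4 ))

# The second line of every group of four is the sequence
seqs=$(awk 'NR % 4 == 2' "$fastq")

# Count records whose quality string length differs from the sequence length
bad=$(awk 'NR % 4 == 2 { len = length($0) }
           NR % 4 == 0 && length($0) != len { n++ }
           END { print n + 0 }' "$fastq")

echo "reads: $n_reads, mismatched records: $bad"
```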
-
FastQC
FastQC is used to generate summary information about FASTQ sequences
You will use this every time you receive RNA-Seq data
Babraham Bioinformatics
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
-
FastQC – Basic Statistics
-
FastQC – Per base sequence quality
-
FastQC – Per base sequence content
-
FastQC – Sequence duplication levels
-
FastQC – Kmer content
-
What’s wrong here? #1
-
What’s wrong here? #1
Random hexamer bias at start
AT rich due to mitochondrial and polyA reads
Solution: remove the reads before analysis
-
What’s wrong here? #2
-
What’s wrong here? #2
Adapter sequences due to short insert
Solution: trim the reads before mapping
-
What’s wrong here? #3
-
What’s wrong here? #3
This sample is almost entirely adapter-adapter ligation products
Solution: None. Data unusable.
Add the ligase last!!
-
LSF (Load Sharing Facility)
Resource allocator - how programs are assigned to nodes in the cluster
You interact with the head node, along with everyone else
Programs should not be run on the head node
-
Submitting jobs to the cluster
bsub
Submits a job to the cluster
bjobs
Displays information about current jobs
bkill
Stops a job
bqueues
Displays information about the queues (different nodes that can run programs)
-
Cluster modules
Modules exist for specific software packages on the cluster
Sets environment parameters to correctly run the software
Path: tells the OS where to find programs
-
Using WinSCP
WinSCP is an FTP/SFTP program for transferring files
Useful for transferring files between the cluster and your computer
Login credentials are the same as with PuTTY
-
WinSCP
Create stored sessions on your personal computers
-
FastQC exercises
-
SAM (Sequence Alignment/Map)
Standard format for alignment data
Tab delimited text format
Header lines and alignment lines
All header lines start with “@”
Header contains metadata about alignments
One line per alignment
BAM is a binary form of SAM
http://samtools.sourceforge.net/SAM1.pdf
-
SAM header
@HD VN:1.0 SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr18 LN:78077248
@SQ SN:chr19 LN:59128983
@SQ SN:chr2 LN:243199373
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr22 LN:51304566
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566
@HD is the header line. Shows sort order
@SQ are the sequence dictionary lines. Show what sequences the reads were aligned to.
Other lines specified in SAM format document
-
SAM alignment section
Col  Field  Type    Brief description
1    QNAME  String  Query template NAME
2    FLAG   Int     bitwise FLAG
3    RNAME  String  Reference sequence NAME
4    POS    Int     1-based leftmost mapping POSition
5    MAPQ   Int     MAPping Quality (sometimes)
6    CIGAR  String  CIGAR string
7    RNEXT  String  Ref. name of the mate/next segment
8    PNEXT  Int     Position of the mate/next segment
9    TLEN   Int     observed Template LENgth
10   SEQ    String  segment SEQuence
11   QUAL   String  ASCII of Phred-scaled base QUALity+33
42JV5AAXX_HWI-EAS229_1:6:87:886:1289 272 chr1 11320 1 76M * 0 0 TTGCTTACTGTATAGTGGTGGCACGCCGCCTGCTGGCAGCTAGGGACATTGCAGGGTCCT…62664(1666646648848668688888856488868666886.6468886
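Since SAM is tab-delimited text, the header and the alignment columns can be pulled apart with the text tools from earlier. A sketch on a tiny made-up file; the reads are hypothetical, and SEQ and QUAL are given as "*", which the format allows when they are not stored.

```bash
sam=$(mktemp)
{
    printf '@HD\tVN:1.0\tSO:coordinate\n'
    printf '@SQ\tSN:chr1\tLN:249250621\n'
    printf 'read1\t272\tchr1\t11320\t1\t76M\t*\t0\t0\t*\t*\tNH:i:4\n'
    printf 'read2\t0\tchr1\t11400\t50\t76M\t*\t0\t0\t*\t*\tNH:i:1\n'
} > "$sam"

# Header lines start with "@"
n_header=$(grep -c '^@' "$sam")

# QNAME, FLAG, RNAME, POS and MAPQ (columns 1-5) of the alignment lines
grep -v '^@' "$sam" | cut -f 1-5

# Names of reads with a mapping quality of at least 10
good=$(awk -F '\t' '!/^@/ && $5 >= 10 { print $1 }' "$sam")
echo "$good"    # read2
```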
-
bitwise FLAG
FLAG  Description
1     Read is paired
2     Both paired reads mapped
4     Read unmapped
8     Mate unmapped
16    Read reverse strand
32    Mate reverse strand
64    First in pair
128   Second in pair
256   Not primary alignment
512   Not passing quality controls
1024  PCR or optical duplicate
http://picard.sourceforge.net/explain-flags.html
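A FLAG is the sum of the values that apply, so each value can be tested with a bitwise AND. A sketch that decodes the FLAG of 272 from the example alignment; the function name explain_flag is made up, and the descriptions are the ones from the table.

```bash
# Print every table entry whose bit is set in the given FLAG.
explain_flag() {
    flag=$1
    while IFS=: read -r bit meaning; do
        if [ $(( flag & bit )) -ne 0 ]; then
            echo "$bit $meaning"
        fi
    done <<'EOF'
1:read is paired
2:both paired reads mapped
4:read unmapped
8:mate unmapped
16:read reverse strand
32:mate reverse strand
64:first in pair
128:second in pair
256:not primary alignment
512:not passing quality controls
1024:PCR or optical duplicate
EOF
}

explain_flag 272
# 16 read reverse strand
# 256 not primary alignment
```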
-
CIGAR string
Symbol  Description
M       alignment match (can be a sequence match or mismatch)
I       insertion to the reference
D       deletion from the reference
N       skipped region from the reference
S       soft clipping (clipped sequences present in SEQ)
H       hard clipping (clipped sequences NOT present in SEQ)
P       padding (silent deletion from padded reference)
=       sequence match
X       sequence mismatch
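One common use of the CIGAR string is to work out how much of the reference an alignment covers. Only M, D, N, = and X consume reference bases, so their lengths are summed. A sketch; the function name ref_span and the example strings other than 76M are hypothetical.

```bash
# Sum the lengths of the CIGAR operations that consume the reference.
ref_span() {
    echo "$1" | awk '{
        span = 0
        # Peel off one "<number><operation>" pair at a time
        while (match($0, /^[0-9]+[MIDNSHP=X]/)) {
            len = substr($0, 1, RLENGTH - 1) + 0
            op  = substr($0, RLENGTH, 1)
            if (op ~ /[MDN=X]/) span += len
            $0 = substr($0, RLENGTH + 1)
        }
        print span
    }'
}

ref_span 76M              # 76
ref_span 10S30M1000N36M   # 1066 (30 + 1000 + 36; the soft clip is not counted)
```

The second example is typical of a spliced RNA-Seq read, where N marks the intron that the read skips.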
-
SAM optional fields
Aligner specific information added to reads
Some fields are specified in SAM format
Good for filtering reads
For Tophat:
AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:25A50 YT:Z:UU NH:i:4 CC:Z:chr15 CP:i:102519634 HI:i:0
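As an example of filtering on an optional field, this sketch keeps only alignments tagged NH:i:1. NH is the SAM tag for the number of reported alignments of a read, so NH:i:1 marks reads reported at a single location. The file contents and the function name unique_only are hypothetical.

```bash
sam=$(mktemp)
{
    printf '@HD\tVN:1.0\tSO:coordinate\n'
    printf 'read1\t272\tchr1\t11320\t1\t76M\t*\t0\t0\t*\t*\tAS:i:-1\tNH:i:4\n'
    printf 'read2\t0\tchr1\t11400\t50\t76M\t*\t0\t0\t*\t*\tAS:i:0\tNH:i:1\n'
} > "$sam"

# Optional fields start at column 12 and can appear in any order,
# so check each of them instead of relying on a fixed position.
unique_only() {
    awk -F '\t' '!/^@/ {
        for (i = 12; i <= NF; i++)
            if ($i == "NH:i:1") { print; next }
    }' "$1"
}

unique_only "$sam" | cut -f 1    # read2
```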
-
SAM/BAM exercises
Break
-
GFF (General Feature Format)
GTF (Gene Transfer Format)
File formats to store gene structures
1. seqname - chromosome
2. source - The program that generated this feature
3. feature - Feature name. "CDS", "start_codon", "stop_codon", "exon"
4. start - Starting position. (1-based)
5. end - The ending position of the feature (inclusive)
6. score - A score between 0 and 1000, or “.”
7. strand - '+', '-', or '.'
8. frame - For coding exons, number between 0-2 that represents the reading frame of the first base. Otherwise, '.'
-
GFF (General Feature Format)
GFF2
9. group – All lines with the same group are linked together into a single item. Usually a single string.
GFF3
9. attributes – tag=value format
ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
ctg123 . TF_binding_site 1000 1012 . + . Parent=gene00001
ctg123 . exon 1300 1500 . + . Parent=gene00001
-
GTF (Gene Transfer Format)
9. attribute – gene_id “…”; transcript_id “…”
gene_id: A globally unique identifier for the genomic source of the sequence.
transcript_id: A globally unique identifier for the predicted transcript.
Most commonly used format for recording gene structures for use in NGS applications.
-
GTF (Gene Transfer Format)
Necessary to determine which reads map to which genes/transcripts
Coordinates change between genome builds - make sure your GTF was created from the same genome build you used for alignment!
UCSC Table Browser is the simplest place to obtain GTF files
-
GTF (Gene Transfer Format)
chr1 hg19_knownGene exon 13221 14409 0.000000 + . gene_id "uc010nxr.1"; transcript_id "uc010nxr.1";
chr1 hg19_knownGene start_codon 12190 12192 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1 hg19_knownGene CDS 12190 12227 0.000000 + 0 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
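A sketch of reading these three lines with shell tools, assuming the columns are separated by tabs as they are in a real GTF file. The variable names are made up for the example.

```bash
gtf=$(mktemp)
{
    printf 'chr1\thg19_knownGene\texon\t13221\t14409\t0.000000\t+\t.\tgene_id "uc010nxr.1"; transcript_id "uc010nxr.1";\n'
    printf 'chr1\thg19_knownGene\tstart_codon\t12190\t12192\t0.000000\t+\t.\tgene_id "uc010nxq.1"; transcript_id "uc010nxq.1";\n'
    printf 'chr1\thg19_knownGene\tCDS\t12190\t12227\t0.000000\t+\t0\tgene_id "uc010nxq.1"; transcript_id "uc010nxq.1";\n'
} > "$gtf"

# Column 9 holds the attributes; pull out the value of gene_id
# and list each gene once.
gene_ids=$(cut -f 9 "$gtf" | sed 's/.*gene_id "\([^"]*\)".*/\1/' | sort -u)
echo "$gene_ids"

# Length of each exon: end - start + 1, because both ends are included.
exon_len=$(awk -F '\t' '$3 == "exon" { print $5 - $4 + 1 }' "$gtf")
echo "$exon_len"    # 1189
```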
-
How to get a GTF file
-
BED
File format for defining genomic intervals
Required fields:
1. chrom – chromosome
2. chromStart – starting position (0-based)
3. chromEnd – ending position (not included in the feature)
-
BED
Optional fields:
4. name – Name of the BED line.
5. score - A score between 0 and 1000, or “.”
6. strand - Defines the strand - either '+' or '-'.
7. thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays).
8. thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).
9. itemRgb - An RGB value for display.
10. blockCount - The number of blocks (exons) in the BED line.
11. blockSizes - A comma-separated list of the block sizes.
12. blockStarts - A comma-separated list of block starts.
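GTF and BED count positions differently: GTF starts are 1-based with the end included, while BED starts are 0-based with the end not included. Converting between them therefore only changes the start. A sketch, where the function name gtf_to_bed and the choice to use the feature type as the BED name with a score of 0 are assumptions for illustration.

```bash
# Write chrom, start, end, name, score, strand from a tab-separated GTF file.
# BED start = GTF start - 1; the end value stays the same.
gtf_to_bed() {
    awk -F '\t' -v OFS='\t' '{ print $1, $4 - 1, $5, $3, 0, $7 }' "$1"
}

gtf=$(mktemp)
printf 'chr1\thg19_knownGene\texon\t13221\t14409\t0.000000\t+\t.\tgene_id "uc010nxr.1"; transcript_id "uc010nxr.1";\n' > "$gtf"

bed_line=$(gtf_to_bed "$gtf")
echo "$bed_line"
```

In BED the feature length is simply end minus start (14409 - 13220 = 1189), the same length that GTF gives as end - start + 1.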
-
How to get a BED file from the UCSC Table browser
-
BED exercises
-
Questions? Minute cards!
• What is Unix?
• Tufts Cluster
• Unix Introduction
• File formats
– FASTA
– FASTQ
– SAM/BAM
– GFF/GTF
– BED