6 Month Allelic Series RNAseq QC
1
2
QC summary
QC was performed on all 192 samples, focusing on identifying failed or outlier samples. Four samples are recommended for omission from the final analysis dataset based on evidence of RNA degradation, PCA analysis, and model-based gene outlier detection. Those four samples can be found on slide 19.
Additionally, two correctable issues were identified. First, one flowcell's worth of samples was run an additional time to bring read depth up to the 100 million required. This re-run was inadvertently run as 75-mers instead of 50-mers, so those samples are a mix of read lengths. Second, for a subset of cortex samples (Q92 and Q140) there appears to be a very small but detectable amount of liver tissue. The overall dilution is 500-1000x, but given the extraordinary sensitivity of RNAseq this is still measurable. We have recommended a simple filter to remove those liver transcripts, based on the fact that they have a recognizable correlation pattern (listed on slide 29), but other methods may be more sensitive.
3
How does CHDI QC RNAseq data in general?
• Mostly we’re looking for outliers
• Also showing the overall experiment worked
• When we find outliers, we try to determine the cause
– That helps show it is an outlier and not part of the biology
• Methods
– Principal Components Analysis
– RNA degradation plots
– Paired end insert size
– Read lengths
– Read mapping efficiency
– Repetitive sequences and their origin
– Highly expressed genes
– # gene outliers
[Figure: PCA plot; x-axis Comp.1 (17.40%), y-axis Comp.2 (6.77%)]
4
PCA whole dataset
Color by Tissue: cortex, liver, striatum
Shape by Sex: F, M
• Not surprisingly, tissues cluster
• Strong sex effect in liver
• Cortex is tightly clustered
5
PCA striatum
Color by Q Length: Q111, Q140, Q175, Q20, Q80, Q92
Shape by Genotype: HET
• Q lengths cluster, good sign the design worked
• Q92, Q111, Q140, Q175 uniquely cluster
• They even stagger in Q length order
• Couple potential outliers (in red outline)
30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped
35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped
6
PCA cortex
Color by Q Length: Q111, Q140, Q175, Q20, Q80, Q92
• Only Q175 stands outside the main cluster
• Possible Q175 outliers, but hard to be certain
20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped
21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped
7
PCA liver
Color by Q Length: Q111, Q140, Q175, Q20, Q80, Q92
Shape by Sex: F, M
• Strong sex clustering will need to be accounted for
• No strong Q clusters (sex masking?)
• One potential outlier
450_Liver_Q175_HET_M_L8.LB1_1.clipped
8
Duplication in brain (representative examples)
[Figure: per-file duplication-level histograms for representative striatum and cortex samples (Q175, read files _1 and _2); x-axis: Duplication Level (1 to 10+), y-axis: Percentage]
• Duplication is consistent and hovers between 13-24%
• No red flags
• Higher in striatum than cortex generally
• Origin of the majority of the duplicated sequences is mitochondrial
9
Liver duplication (representative examples)
[Figure: per-file duplication-level histograms for representative liver samples (Q111 and Q20, read files _1 and _2); x-axis: Duplication Level (1 to 10+), y-axis: Percentage]
• Liver duplication is much higher, 40-50%
• Major duplicated sequences are all mouse major urinary protein genes (Mup1-21), which encode pheromone-binding proteins
• Hurts our true read depth, but nothing terrible
• Should keep in mind for future liver work
10
5’ -> 3’ degradation charts (representative examples)
[Figure: per-sample 5’ -> 3’ coverage curves for representative striatum and cortex samples; x-axis ticks: Bin1 to Bin86, y-axis: 0 to 1]
Color by TranscriptBin: 1-499, 500-999, 1000-1999, 2000-2999, 3000-3999, 4000-4999, 5000+
Displays the likelihood of getting full length transcripts for various mRNA lengths
• Very high quality samples in general
• Most samples show >70% of all mRNA molecules are >80% complete
• Liver on average more degraded
• Some samples have degradation in the longer mRNA species (one marked in red)
11
Suspect samples by 5’ -> 3’ degradation
454_Liver_Q175_HET_M_L3._1.clipped
456_Liver_Q175_HET_M_L7.LB4_1.clipped
20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped
845_Liver_Q92_HET_F_L6.LB25_1.clipped
522_Liver_Q111_HET_F_L8.LB13_1.clipped
452_Liver_Q175_HET_M_L1.LB2_1.clipped
776_Liver_Q140_HET_F_L8.LB13_1.clipped
716_Liver_Q80_HET_F_L7.LB6_1.clipped
843_Liver_Q92_HET_F_L6.LB23_1.clipped
21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped
12
GC content per read has a red flag
[Figure: per-file histograms of GC count per read for representative striatum and cortex samples; x-axis: GC# (0 to 60, up to 80 for some files), y-axis: Count]
8 of the samples have a “shoulder” in the GC# chart
This is usually a really bad thing
• Suggests non-mouse or non-biological sequence
13
Those same samples flag for read length as well
[Figure: per-file read-length histograms for representative striatum and cortex samples; x-axis: Sequence Length (0 to 60, up to 80 for some files), y-axis: Count]
Those same samples have a mix of 50mer reads and 75mer reads
That’s very odd
At this point we asked our sequencing lab for clarification on what happened
14
Our sequencing partner found the cause
264401 20921_1_450L_cortex_Q175_HET_M 20130523 V02604 VIRT 2
264416 23535_1_528L_cortex_Q111_HET_F 20130523 V02604 VIRT 3
264418 28243_1_644L_cortex_Q20_HET_M 20130523 V02604 VIRT 4
264397 35624_1_844L_striatum_Q92_WT_F 20130523 V02604 VIRT 1
264447 35631_1_845L_cortex_Q92_HET_F 20130604 V02761 VIRT 1
264448 35657_1_847L_cortex_Q92_HET_F 20130604 V02761 VIRT 2
264451 454_Liver_Q175_HET_M 20130604 V02761 VIRT 3
264455 462_Liver_Q175_HET_F 20130604 V02761 VIRT 4
The 8 suspect samples
For these 8 samples, the initial run didn’t reach the full 100 million reads. When that happens, the lab runs the samples again and merges the runs into a “virtual run” with the full read depth we paid for. That’s all good. The strange thing this time was that the run they added our 8 samples to (they add them to ongoing flow cells) happened to be a 75mer run. Again, that is usually no big problem: they clip off 25 bases in their processing and everything is compatible. This specific time they forgot to trim, so we saw the ugly intermediate state. What this means is that the data for those 8 samples are fine. The reads are longer, but they are still good reads from our samples.
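The clipping step described above can be sketched roughly as follows. This is only an illustration, assuming standard four-field FASTQ records and assuming the extra bases are removed from the 3' end of each read; the function name `trim_fastq_records` and the example reads are hypothetical and are not the sequencing lab's actual pipeline.

```python
from typing import Iterable, Iterator, Tuple

# One FASTQ record: header, sequence, separator line, quality string.
FastqRecord = Tuple[str, str, str, str]


def trim_fastq_records(
    records: Iterable[FastqRecord], max_len: int = 50
) -> Iterator[FastqRecord]:
    """Yield records with sequence and quality clipped to at most max_len bases."""
    for header, seq, sep, qual in records:
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Assumption: the extra bases are clipped from the 3' end of the read.
        yield header, seq[:max_len], sep, qual[:max_len]


# Hypothetical records: one 75-mer from the re-run, one 50-mer from the first run.
reads = [
    ("@read_75mer", "A" * 75, "+", "I" * 75),
    ("@read_50mer", "C" * 50, "+", "I" * 50),
]
trimmed = list(trim_fastq_records(reads))
print([len(seq) for _, seq, _, _ in trimmed])
```

Reads that are already 50 bases long pass through unchanged, so the same step could be applied to the merged "virtual run" without first separating the two read lengths.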
15
Mitochondrial rate in brain
8-9% of the reads are mtRNA, nothing surprising there
16
Mitochondrial rate in liver
6-7% of reads are mtRNA, again in line with expectations
17
Other QC parameters that looked great
• Insert sizes: All right around 175 as expected
• Sense/antisense sequence ratio: 1:1 as expected
• Sequence coverage
– 40% of mouse transcriptome detected in brain
– About 30% of mouse transcriptome detected in liver
• Mapped read rate in the upper 90s
– 98% for brain, 96% for liver
• 95-97% of our reads are mapped to known genes
– 3-5% intergenic regions
18
Model based outlier detection
[Figure: bar chart of outlier gene counts per sample; y-axis 0 to 8000]
Method by which we look for the number of genes that are outliers after accounting for our modeled effects
• 2 samples stand out, and an additional 4-6 are suspect, but probably OK (Q92 Het males)
30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped
35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped
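The slides do not say which model was used, so the following is only a rough sketch of the general idea of counting outlier genes per sample after removing modeled effects. It assumes, purely for illustration, that the modeled effect is a per-group median and that a gene is an outlier in a sample when its residual exceeds a robust z-score cutoff based on the median absolute deviation. The function name `count_gene_outliers`, the cutoff of 5, and the toy data are all hypothetical.

```python
from collections import defaultdict
from statistics import median
from typing import Dict


def count_gene_outliers(
    expr: Dict[str, Dict[str, float]],  # gene -> sample -> log expression
    group: Dict[str, str],              # sample -> modeled group label
    z_cut: float = 5.0,
) -> Dict[str, int]:
    """Count, per sample, the genes whose residual is a robust outlier."""
    counts = {sample: 0 for sample in group}
    for gene, values in expr.items():
        # Stand-in for the modeled effect: the median of each group.
        by_group = defaultdict(list)
        for sample, value in values.items():
            by_group[group[sample]].append(value)
        centre = {g: median(v) for g, v in by_group.items()}
        resid = {s: v - centre[group[s]] for s, v in values.items()}
        # Robust spread of the residuals (median absolute deviation).
        med = median(resid.values())
        mad = median(abs(r - med) for r in resid.values())
        if mad == 0:
            continue  # no usable spread estimate for this gene
        for sample, r in resid.items():
            if abs(r - med) / (1.4826 * mad) > z_cut:
                counts[sample] += 1
    return counts


# Toy data for illustration only; not values from the dataset.
group = {"a1": "A", "a2": "A", "a3": "A", "a4": "A",
         "b1": "B", "b2": "B", "b3": "B", "b4": "B"}
expr = {
    "gene1": {"a1": 5.0, "a2": 5.1, "a3": 4.9, "a4": 5.0,
              "b1": 7.0, "b2": 7.1, "b3": 6.9, "b4": 7.0},
    "gene2": {"a1": 3.0, "a2": 3.1, "a3": 2.9, "a4": 9.0,
              "b1": 3.0, "b2": 3.1, "b3": 2.9, "b4": 3.0},
}
counts = count_gene_outliers(expr, group)
print(counts)
```

A sample whose count is far above the rest would be the kind of sample that "stands out" in the chart; the real analysis may well use a different model and cutoff.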
19
Integrating the sample QC to choose omissions
A very simple way to determine what to throw out is to look for multiple strikes against a sample
5’ -> 3’ charts
454_Liver_Q175_HET_M_L3._1.clipped
456_Liver_Q175_HET_M_L7.LB4_1.clipped
20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped
845_Liver_Q92_HET_F_L6.LB25_1.clipped
522_Liver_Q111_HET_F_L8.LB13_1.clipped
452_Liver_Q175_HET_M_L1.LB2_1.clipped
776_Liver_Q140_HET_F_L8.LB13_1.clipped
716_Liver_Q80_HET_F_L7.LB6_1.clipped
843_Liver_Q92_HET_F_L6.LB23_1.clipped
21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped
PCA outliers
30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped
35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped
20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped
21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped
450_Liver_Q175_HET_M_L8.LB1_1.clipped
Model based outliers
30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped
35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped
20
Final list of proposed samples for omission
30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped
35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped
20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped
21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped
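The "multiple strikes" rule can be sketched as a simple tally over the three QC lists from the slides. The sample names are copied from the lists above; the reading of "multiple strikes" as "flagged by at least two checks" is an assumption, and the variable names are illustrative.

```python
from collections import Counter

# Samples flagged by each QC check, as listed on the slides.
degradation_suspects = [
    "454_Liver_Q175_HET_M_L3._1.clipped",
    "456_Liver_Q175_HET_M_L7.LB4_1.clipped",
    "20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped",
    "845_Liver_Q92_HET_F_L6.LB25_1.clipped",
    "522_Liver_Q111_HET_F_L8.LB13_1.clipped",
    "452_Liver_Q175_HET_M_L1.LB2_1.clipped",
    "776_Liver_Q140_HET_F_L8.LB13_1.clipped",
    "716_Liver_Q80_HET_F_L7.LB6_1.clipped",
    "843_Liver_Q92_HET_F_L6.LB23_1.clipped",
    "21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped",
]
pca_outliers = [
    "30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped",
    "35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped",
    "20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped",
    "21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped",
    "450_Liver_Q175_HET_M_L8.LB1_1.clipped",
]
model_outliers = [
    "30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped",
    "35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped",
]

strikes = Counter()
for flagged in (degradation_suspects, pca_outliers, model_outliers):
    strikes.update(set(flagged))  # each check counts at most once per sample

# Assumption: "multiple strikes" means flagged by at least two checks.
omit = sorted(sample for sample, n in strikes.items() if n >= 2)
for sample in omit:
    print(sample, strikes[sample])
```

Under that assumption, the tally selects the same four samples proposed for omission: the two cortex samples are flagged by degradation and PCA, and the two striatum samples by PCA and the model-based check.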
21
Liver contamination in cortex Q140 and Q92?
While the sequencing lab was looking into the 75mer issue, I ran cortex through some basic statistical modeling (omitting the samples mentioned previously).
I found changes, but the pattern and biology were all wrong.
[Figure: expression of Alb (ENSMUSG00000029368); y-axis ticks 0 to 8]
22
Color by Q Length: Q111, Q140, Q175, Q20, Q80, Q92
Y-axis: Logged FPKM
Every single change is an increase
Completely off in Q175, Q111, Q80, and Q20
On (but not that strongly) in Q140 and Q92
It makes no sense for Q111 to be skipped and for Q175 to go back to normal
Albumin is the top hit? Isn’t albumin liver specific?
23
Some of the other changed genes are suspicious
• Albumin
• ApoC3, C1
• Mup3, 10, 18, 19
• FABP1
• Urate oxidase
All reasonably solid liver markers
DAVID functional annotation also suggests the altered genes are liver related (p < 10^-5)
24
[Figure: log-scale scatter plot of per-gene values; x-axis: 33030_768L_cortex_Q140_HET_M_L4.LB9_1.clipped, y-axis: 768_Liver_Q140_HET_M_L3.LB9_1.clipped]
A subset of genes with good correlation between liver and cortex but shifted from the 1:1 axis
25
[Figure: log-scale scatter plot of per-gene values; x-axis: 33030_768L_cortex_Q140_HET_M_L4.LB9_1.clipped, y-axis: 768_Liver_Q140_HET_M_L3.LB9_1.clipped]
Same chart with the “significant” genes in red
26
[Figure: log-scale scatter plot of per-gene values; x-axis: 23353_514L_cortex_Q111_HET_M_L1.LB9_1.clipped, y-axis: 514_Liver_Q111_HET_M_L8.LB9_1.clipped]
Same chart and shading in Q111; notice the lack of linear correlation
27
What we suspect happened
• The basic problem is that liver specific transcripts should not have correlated expression in cortex
• A very small amount of liver contamination has occurred. The shift is 500 to 1000 times lower than normal liver expression
• What this means is only the absolute highest liver expressed genes are detected at all
• The challenge is uniquely identifying the affected genes
Cortex Striatum Liver
Albumin 173 0.01 40979
FPKMs of albumin, which should not exist in brain
28
[Figure: log-scale scatter plot of per-gene values; x-axis: 33056_770L_cortex_Q140_HET_M_L7.LB10_1.clipped, y-axis: 770_Liver_Q140_HET_M_L2.LB10_1.clipped]
Liver filter created as
• Liver mean count > 2000
• Mean ratio of liver to cortex > 500
• Cortex count > 0
Not a bad first approximation
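The three filter criteria can be sketched in code as below. This is a minimal illustration, not the code used for the analysis: it assumes per-gene count lists for liver and cortex samples, it reads "mean ratio of liver to cortex" as the ratio of the two means, and it reads "cortex count > 0" as a cortex mean above zero. The function name `liver_contaminant_genes` and the example counts are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def liver_contaminant_genes(
    liver_counts: Dict[str, List[float]],
    cortex_counts: Dict[str, List[float]],
    min_liver_mean: float = 2000.0,
    min_ratio: float = 500.0,
) -> List[str]:
    """Return the genes that meet all three liver-filter criteria."""
    flagged = []
    for gene, liver in liver_counts.items():
        cortex = cortex_counts.get(gene)
        if not cortex:
            continue  # gene not measured in cortex
        liver_mean = mean(liver)
        cortex_mean = mean(cortex)
        if cortex_mean <= 0:  # criterion: cortex count > 0
            continue
        if liver_mean <= min_liver_mean:  # criterion: liver mean count > 2000
            continue
        # Assumption: "mean ratio" is taken as the ratio of the two means.
        if liver_mean / cortex_mean > min_ratio:
            flagged.append(gene)
    return flagged


# Hypothetical counts for illustration only; not values from the dataset.
liver = {
    "Alb": [900000, 1100000],
    "Actb": [50000, 60000],
    "Mup3": [400000, 420000],
    "Apoc3": [1500, 1700],
}
cortex = {
    "Alb": [1000, 1200],
    "Actb": [70000, 65000],
    "Mup3": [0, 0],
    "Apoc3": [2, 3],
}
print(liver_contaminant_genes(liver, cortex))
```

In this toy input only the first gene passes all three criteria: the second fails the ratio test, the third has no cortex signal, and the fourth falls below the liver mean threshold. The flagged genes would then be dropped from the cortex analysis for the affected Q140 and Q92 sets.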
29
Effect of filtering out the liver specific genes from the cortex data
[Figure: bar chart of hits pre-filter and post-filter for Q80, Q92, Q111, Q140, and Q175; y-axis 0 to 90]
30
Summary of QC
• All but 4 of the 192 samples can move forward to the analysis
• A filter to clear out highly expressed liver genes is needed for the cortex Q140 and Q92 sets
• Striatum PCA plots show that CAG length is the single largest global element of variance!