Qbi Centre for Brain genomics (Informatics side)

The Queensland Brain Institute | QBI’s Centre for Brain Genomics The informatics side of things 11/1/22 [Sprengben [why not get a friend]]

description

An overview of QBI’s production informatics framework, with an emphasis on which services are provided and how the resulting data are made available: from interactive quality control to integration with external data in the genome browser.

Transcript of Qbi Centre for Brain genomics (Informatics side)

Page 1: Qbi Centre for Brain genomics (Informatics side)

The Queensland Brain Institute |

QBI’s Centre for Brain Genomics

The informatics side of things

April 11, 2023

[Sprengben [why not get a friend]]

Page 2: Qbi Centre for Brain genomics (Informatics side)

Objectives of QBI’s Centre for Brain Genomics

On-time delivery

Reliable data production

Convincing data

Easy delivery

Perkel JM. Coding your way out of a problem. Nat Methods. 2011 Jun PMID: 21716280.

Page 3: Qbi Centre for Brain genomics (Informatics side)

Bird’s-eye view of the facility’s workflow

Page 4: Qbi Centre for Brain genomics (Informatics side)

Detailed workflow

[Workflow diagram; labels: projects, flowcell, cBot, HiSeq, CASAVA, raw sequence reads, HiSeq cluster, 30 different programs]

Page 5: Qbi Centre for Brain genomics (Informatics side)

Overview of Production Informatics framework

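[Framework diagram; labels grouped by kind]

Hosts: //clusterstorage, //cluster-vm

Directories: Run/, Data/

Commands: MakeFastq.sh, trigger.sh armed, trigger.sh html

Outputs: Unaligned/, bwa/, reCaAl/, variant/, Summary.html

Tools: Apache, IGV, R, UCSC

Modes: Automatic, Manual

Stages: Processing, Evaluation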

Page 6: Qbi Centre for Brain genomics (Informatics side)

Trigger.sh

• Keeping data separate from scripts

• Automating verification, quality control and summary HTML generation

• Rerunning pipeline from every point
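
One way to make a pipeline restartable from any point is to keep an ordered list of steps and skip everything before the requested one. The sketch below is an illustration only, not the actual Trigger.sh: the step order and the `run_step` and `run_from` helpers are assumptions, and `run_step` merely prints what it would do.

```shell
#!/usr/bin/env bash
# Hypothetical step names and order; the real pipeline's tasks may differ.
STEPS="fastQC bwa reCalAln variant"

# Placeholder for the real work of one step (assumption: one call per step).
run_step() {
    echo "running $1"
}

# Run every step from the named starting point to the end of the list.
run_from() {
    start="$1"
    started=0
    for step in $STEPS; do
        if [ "$step" = "$start" ]; then
            started=1
        fi
        if [ "$started" -eq 1 ]; then
            run_step "$step"
        fi
    done
    # Report failure if the requested step does not exist.
    [ "$started" -eq 1 ]
}

run_from reCalAln
```

Calling `run_from bwa` instead would repeat mapping and everything after it, while leaving the earlier quality-control step untouched.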

Page 7: Qbi Centre for Brain genomics (Informatics side)

Flexible generic names: header

# Programs
BWA="/clusterdata/hiseq_apps/bin/$MODE/bwa"
SAMTOOLS="/clusterdata/hiseq_apps/bin/$MODE/samtools"
IGVTOOLS="/clusterdata/hiseq_apps/bin/$MODE/igvtools/IGVTools/igvtools.jar"

# Task names
TASKFASTQC="fastQC"
TASKBWA="bwa"
TASKRCA="reCalAln"

# File abbreviations
READONE="read1"
READTWO="read2"
FASTQ="fastq.gz"
ALN="aln" # aligned
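
The point of the generic names is that file names can be derived from one another rather than hard-coded. A minimal sketch of how this might look with bash pattern substitution follows; the input name `s_1_read1.fastq.gz` and the `sample` and `aligned` variables are made up for illustration.

```shell
#!/usr/bin/env bash
# File abbreviations as defined in the header on the slide.
READONE="read1"
READTWO="read2"
FASTQ="fastq.gz"
ALN="aln"

# Hypothetical input file name following the s_<lane>_<read> pattern.
n="s_1_${READONE}.${FASTQ}"

# ${var/pattern/replacement} replaces the first match of pattern in var.
mate="${n/$READONE/$READTWO}"       # name of the paired read file
sample="${n/_$READONE.$FASTQ/}"     # sample prefix shared by both reads
aligned="${sample}.${ALN}.bam"      # hypothetical name for the aligned output

echo "$n -> $mate, $sample, $aligned"
```

If the read suffix or file extension ever changes, only the header variables need to be edited.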

Page 8: Qbi Centre for Brain genomics (Informatics side)

Config.txt

#********************
# Tasks
#********************
mappingBWA="1"
recalibrateQualScore="1"

#********************
# Paths
#********************
FASTA="/clusterdata/resources/hg19/hg19.fasta"
SEQREG="chr1:229994688-230071581"
DBSNP="/clusterdata/resources/hg19/snpdb132.vcf"

#********************
# PARAMETER
#********************
LIBRARY="QBI"
ADDPARAMBWA="--force single"

Specifies what to do, e.g. mapping and recalibration

Specifies where to find resources

Customizes standard scripts for this project
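
Because the config consists of plain shell assignments, a driver script can read it by sourcing the file. The sketch below assumes that a task is switched on when its variable equals "1"; how tasks are switched off is not shown on the slide, so the empty value here is a guess, and the temporary file stands in for a project's real config.txt.

```shell
#!/usr/bin/env bash
# Write a small stand-in config so the sketch is self-contained.
CONFIG="$(mktemp)"
cat > "$CONFIG" <<'EOF'
mappingBWA="1"
recalibrateQualScore=""
FASTA="/clusterdata/resources/hg19/hg19.fasta"
EOF

# Sourcing the file turns every assignment into a shell variable.
. "$CONFIG"

# Collect the tasks that are switched on (assumption: "1" means on).
TASKS=""
if [ "$mappingBWA" = "1" ]; then TASKS="$TASKS bwa"; fi
if [ "$recalibrateQualScore" = "1" ]; then TASKS="$TASKS reCalAln"; fi

echo "tasks:$TASKS"
echo "reference: $FASTA"
rm -f "$CONFIG"
```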

Page 9: Qbi Centre for Brain genomics (Informatics side)

Call

• trigger.sh config.txt armed
• trigger.sh config.txt html

s_1_read1.fastq, s_1_read2.fastq, s_2_read1.fastq, s_2_read2.fastq

s_3_read1.fastq, s_3_read2.fastq, s_4_read1.fastq, s_4_read2.fastq

s_1.bam, s_2.bam

s_1.ashrr.bam, s_2.ashrr.bam

s_3.bam, s_4.bam

s_3.ashrr.bam, s_4.ashrr.bam

Sub1_s_1.out, Sub1_s_2.out, Sub2_s_3.out, Sub2_s_4.out
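
The two calls suggest a single entry point that takes a config file and a mode. A minimal sketch of such a dispatcher is given below. What each mode really does is an assumption (here "armed" is taken to mean that jobs are submitted and "html" that the summary page is rebuilt), and the `trigger` function only prints a message.

```shell
#!/usr/bin/env bash
# Hypothetical dispatcher mirroring the call "trigger.sh <config> <mode>".
trigger() {
    config="$1"
    mode="$2"
    case "$mode" in
        armed) echo "submitting jobs using $config" ;;
        html)  echo "building Summary.html using $config" ;;
        *)     echo "usage: trigger.sh <config> armed|html" >&2
               return 1 ;;
    esac
}

trigger config.txt armed
trigger config.txt html
```

Rejecting unknown modes with a usage message keeps a mistyped call from silently doing nothing.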

Page 10: Qbi Centre for Brain genomics (Informatics side)

Summary.html

Project Cards

Sequence statistics

Data Visualization

Download

Run check points

Mapping stats

Interesting Regions

Page 11: Qbi Centre for Brain genomics (Informatics side)

Scaffold of pbsScripts.sh: Error catching

# QCVARIABLES, loosing reads, unmapped read,no such file,file not found,bwa.sh: line

>>>>>>>>>> Errors
QC_PASS .. 0 have We are loosing reads/184
QC_PASS .. 0 have for unmapped read/184
QC_PASS .. 0 have no such file/184
QC_PASS .. 0 have file not found/184
QC_PASS .. 0 have bwa.sh: line/184

Code example for setting up what errors to look out for

Output in Summary.html
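
A report of this shape could be produced by counting, for each error message, how many job logs contain it. The sketch below is not the facility's script: the `qc_errors` helper, the `QC_FAIL` label, and the demo log files are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Count how many log files contain a given message and print a QC line.
# Assumption: an error pattern passes QC only when no log file contains it.
qc_errors() {
    logdir="$1"
    shift
    total=$(ls "$logdir"/*.out | wc -l | tr -d ' ')
    for pattern in "$@"; do
        # -F treats the pattern as a fixed string, -l lists matching files.
        hits=$(grep -l -F -- "$pattern" "$logdir"/*.out | wc -l | tr -d ' ')
        if [ "$hits" -eq 0 ]; then status="QC_PASS"; else status="QC_FAIL"; fi
        echo "$status .. $hits have $pattern/$total"
    done
}

# Demo with two made-up log files.
LOGS="$(mktemp -d)"
echo "all good" > "$LOGS/Sub1_s_1.out"
echo "bwa.sh: line 12: command not found" > "$LOGS/Sub1_s_2.out"
qc_errors "$LOGS" "no such file" "bwa.sh: line"
rm -rf "$LOGS"
```

The patterns are matched literally, so the text must be spelled exactly as it appears in the logs.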

Page 12: Qbi Centre for Brain genomics (Informatics side)

Scaffold of pbsScripts.sh: checkpoints

>>>>>>>>>> CheckPoints
QC_PASS .. 184 have mapping/184
QC_PASS .. 184 have sorting and bam-conversion/184
QC_PASS .. 184 have mark duplicates/184
QC_PASS .. 184 have statistics/184
QC_PASS .. 184 have coverage track/184

echo "********* mapping"
$BWA aln -t $THREADS $FASTA $f > $OUT/${n/$FASTQ/sai}
$BWA aln -t $THREADS $FASTA ${f/$READONE/$READTWO} > $OUT/${n/$READONE.$FASTQ/$READTWO.sai}

Code example for setting up checkpoints in the pbsScript.sh

Output in Summary.html
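
Checkpoints work the other way round from error patterns: a marker echoed by the job script should appear in every log. The following sketch assumes the marker format `********* <name>` from the echo line on the slide; the `qc_checkpoints` helper, the `QC_FAIL` label, and the demo logs are hypothetical.

```shell
#!/usr/bin/env bash
# Print one QC line per checkpoint.
# Assumption: a checkpoint passes only when every log file contains its marker.
qc_checkpoints() {
    logdir="$1"
    shift
    total=$(ls "$logdir"/*.out | wc -l | tr -d ' ')
    for name in "$@"; do
        hits=$(grep -l -F -- "********* $name" "$logdir"/*.out | wc -l | tr -d ' ')
        if [ "$hits" -eq "$total" ]; then status="QC_PASS"; else status="QC_FAIL"; fi
        echo "$status .. $hits have $name/$total"
    done
}

# Demo: the second job stopped before reaching the statistics step.
LOGS="$(mktemp -d)"
printf '%s\n' "********* mapping" "********* statistics" > "$LOGS/Sub1_s_1.out"
printf '%s\n' "********* mapping" > "$LOGS/Sub1_s_2.out"
qc_checkpoints "$LOGS" "mapping" "statistics"
rm -rf "$LOGS"
```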

Page 13: Qbi Centre for Brain genomics (Informatics side)

Availability: tailored to skills

1. Website
2. RStudio
3. Command line

Page 14: Qbi Centre for Brain genomics (Informatics side)

The big picture

[Diagram; box labels: Documentation (Project Server), Application Backup/Version Control, Data Warehousing, RStudio, Project Cards, Software, Processed Data, Raw Data, External Genomic Resources (genomes, annotation, etc.), Custom Scripts, Visualization (IGV, Genome Browser), Statistical Analysis, Quality Control, Hypothesis Generation, Data Processing and Analysis, HiSeq Output, Rsync, Version Control, Cluster, Galaxy, Project Server, Content, BWA, GATK, samtools, etc.]

Locations: //cluster-vm, //clusterstorage, //groupshare, //ethan, //project

• 7 project cards, 10 projects, 6 HiSeq runs
• 40 wiki pages, 250 tasks, 551 h logged
• 160 commits, 35 external programs
• 41 custom scripts (4197 lines of code)
• 5 TB raw data, 750 GB processed data
• 57 GB external data

Covering all aspects of: design*, set-up*, maintenance*, usage (*except cluster)

Page 15: Qbi Centre for Brain genomics (Informatics side)

Three things to remember

• Reliable data production
  – All projects have a similar structure and are processed in the same way

• Convincing data
  – All steps are tightly quality controlled and the QC report is accessible

• Easy delivery
  – We tailored data availability to skill levels (webpage, RStudio, console)

• On-time delivery
  – Production informatics has priority on the cluster

Page 16: Qbi Centre for Brain genomics (Informatics side)

Next week

• NGS Discussion group:

Methylation analysis

Kevin Dudley and Danay Baker-Andresen