EE381V: Genomic Signal Processing and Data Science
Basic Information
• Instructor: Haris Vikalo
  – Contact information: [email protected], (512) 232-7922
  – Office hours: POB 3.110, Tue/Thu 2:00pm-3:00pm
• Teaching Assistant: Somsubhra Barik
  – Contact information: [email protected]
  – Office hours: TBD
• Electronic course site: Canvas
  – http://canvas.utexas.edu/
  – distribution of homework assignments, solutions, and class slides/notes
  – should be able to access it if you have a UT EID and are registered
• Course website: http://users.ece.utexas.edu/~hvikalo/ee381v.html
  – class notes (mirrored from Canvas) and suggested reading
  – final project guidelines
Basic Information Cont’d
• Textbook: none
  – class notes, reading assignments will be distributed via course website, Canvas
• Suggested reading:
  – R. Durbin et al., Biological Sequence Analysis: Probabilistic Models of Proteins, Cambridge University Press, 1998.
  – N. C. Jones and P. A. Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
  – M. Schena, Microarray Analysis, Wiley, 2003.
• Grading (tentative):
  – homeworks (30%), midterm (30%), final project (40%)
• Homeworks and exams:
  – 4-5 assignments (theory component + programming component)
  – midterm (take-home)
Prerequisites and Target Audience
• Final project: either expository (survey) or innovative (research)
  – up to two students can collaborate on a project
  – required written documents: (1) proposal and (2) final report
  – a list of possible projects will be provided shortly
• Prerequisites:
  – an undergraduate course in probability
  – programming experience (MATLAB or Python)
  – no biology background required
• Target audience:
  – students specializing in signal processing / machine learning / algorithms / information theory who want to learn of applications in biology/genomics and get exposure to real data
  – students specializing in computational biology who want to strengthen their knowledge of basic signal processing / machine learning / information theory
Course Description
• Course description: an exploration of signal processing and data science problems encountered in the analysis of high-throughput genomics data
  – applications to diagnostics (e.g., viral strain recognition), studies of complex diseases (e.g., cancer), studies of the immune system, phenotype prediction, non-invasive prenatal testing
• Topics include:
  – DNA sequencing and sequence alignment
  – base calling in high-throughput sequencing systems
  – reference-guided and reference-free (de novo) genome assembly
  – genotyping and single individual haplotyping (haplotype assembly)
  – RNA sequencing and ChIP-Seq
  – DNA microarrays and quantitative polymerase chain reaction systems
  – modeling and inference for genetic regulatory networks
  – population haplotyping; phylogeny
  – future sequencing technologies
Goals for the Term
• Signal processing and “big data” challenges in genomics
  – formulating problems, presenting solutions
• Duality: computation and biology
  – provide a biology/technology background to motivate a computational task
  – overview relevant computational techniques, derive solutions, analysis
• Foundations and frontiers
  – well-defined conventional problems and general methodologies
  – contemporary challenges, future research directions, etc.
• Major themes:
  – enabling biotechnologies: modeling, algorithms, analysis of performance
  – cellular systems: computational methods for inferring their structure and understanding how they function
Theme #1: Enabling Technologies
[Figure: three example instruments. DNA sequencing: ABI Prism® 310 Genetic Analyzer. DNA microarrays: Affymetrix GeneChip®. DNA amplification (QPCR systems): Roche LightCycler®.]
Theme #1: Enabling Technologies Cont’d
• Detection and quantification of molecules: high precision (quantitative polymerase chain reaction, QPCR) or high throughput (DNA microarrays)
• QPCR: high precision (quantifies small # of DNA molecules)
  – K. Mullis and F. Faloona, “Specific synthesis of DNA in vitro via a polymerase-catalyzed chain reaction,” Methods Enzymol (1987).
  – in vitro replication (amplification) of DNA molecules
  – applications to diagnostics (viral and bacterial detection), cancer markers identification, genetic fingerprinting (as in forensics), etc.
• DNA microarrays: high throughput (screens 10,000s of molecules)
  – M. Schena, D. Shalon, R. W. Davis, P. O. Brown, “Quantitative monitoring of gene expression patterns with a complementary DNA microarray,” Science (1995).
  – massively parallel biosensor arrays
  – used for studies of genetic diseases, drug discovery, genotyping (the specific genome of an individual), genetic pathway discovery, etc.
Theme #1: Enabling Technologies Cont’d
• QPCR, DNA microarrays: detect/quantify DNA molecules of known structure
• DNA sequencing systems: identify unknown structure
• High-throughput sequencing is revolutionizing research and medicine
  – routine sequencing tasks generating massive amounts of data
  – computationally challenging “big data” problems
Sanger sequencing: 1977–1990s
2nd generation sequencing: since 2007
3rd generation sequencing: since 2010
Theme #1: Enabling Technologies Cont’d
• Dramatic improvement in affordability
Theme #2: Cellular Systems
[Figure: DNA double helix, labeled “10 bases = 3.4 nm” and “2 nm”]
• Information flow in a cell (traditional view: Central Dogma):
    DNA → (Transcription) → RNA → (Translation) → Protein
• Information (signal) is carried by molecules.
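As a rough illustration of the DNA → RNA → protein flow, the sketch below transcribes a coding-strand DNA string into RNA and translates it with the standard genetic code. The function names `transcribe` and `translate` and the example sequence are illustrative assumptions, not course material, and the toy ignores regulation, splicing, and reading-frame detection.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order U, C, A, G.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def transcribe(dna):
    """Coding-strand DNA -> mRNA (simplification: just replace T with U)."""
    return dna.upper().replace("T", "U")

def translate(rna):
    """Read codons from position 0 and stop at the first stop codon ('*')."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical example sequence: start codon, one codon, stop codon.
rna = transcribe("ATGGCCTGA")
protein = translate(rna)
print(rna, protein)
```

Here the sequence (DNA, RNA) plays the role of the signal, and the codon table acts as a fixed decoding map.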
Theme #2: Cellular Systems
• The previously mentioned biotechnologies interrupt the information flow and so provide insight into cellular structure and functions
[Figure labels: Sequences, Mechanisms]
• Moreover, studying the temporal changes in the information flow gives insight into regulation mechanisms, biological network structure, etc.
Theme #2: Cellular Systems
[Figure: overview of computational tasks, grouped by the labels SEQUENCES and INTERACTIONS]
• SEQUENCES: DNA sequencing and genome assembly, gene finding, regulatory motif discovery, comparative genomics, evolutionary theory, database lookup (example sequence shown: ACATGCTATACGTGATAAAGAGGATATATATCATAT ATATGATTT)
• INTERACTIONS: gene expression analysis, cluster discovery, regulatory networks inference, emerging network properties, protein network analysis
Signal Processing and Data Science Tasks
• Data science tasks on sequencing data can be categorized as follows:
• To complete those tasks, we rely on a variety of tools:
  – statistical signal processing and machine learning
  – combinatorial algorithms
  – information theory
Example Application #1: Sequence Assembly
• Sequencing: determining the order of nucleotides in a target DNA string
• Shotgun sequencing: assemble the target from overlapping short reads
  – de novo: no side information, only the reads are available
  – reference-guided: rely on a pre-existing reference sequence
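To make the shotgun setting concrete, here is a minimal sketch that samples short reads from a known target string. It assumes an idealized model (error-free reads, fixed read length, start positions uniform at random); the names `shotgun_reads` and `coverage` and the example target are illustrative, not from the course.

```python
import random

def shotgun_reads(target, read_len, num_reads, seed=0):
    """Sample (start, read) pairs uniformly at random, with replacement.

    Assumption: error-free reads of fixed length; real reads are noisy
    and their lengths vary.
    """
    rng = random.Random(seed)
    last_start = len(target) - read_len
    starts = [rng.randint(0, last_start) for _ in range(num_reads)]
    return [(s, target[s:s + read_len]) for s in starts]

def coverage(target_len, read_len, num_reads):
    """Average number of reads covering a position of the target."""
    return num_reads * read_len / target_len

target = "ACATGCTATACGTGATAAAGAGGATATATATCATAT"
reads = shotgun_reads(target, read_len=8, num_reads=20)
print(reads[:3], coverage(len(target), 8, 20))
```

In the de novo setting the assembler sees only the read strings, not the start positions; in the reference-guided setting those positions are estimated by mapping each read to the reference.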
Example Application #1: Sequence Assembly
• Reference-guided assembly relies on mapping the reads onto a reference; sequence alignment/mapping is a fundamental first step
  – dynamic programming solutions (Viterbi, forward-backward algorithms)
  – estimation in Hidden Markov Models (EM algorithm)
  – data compression concepts (Burrows-Wheeler transform)
• Reference-free (de novo) assembly
  – greedy merging + extension of the overlapping fragments
  – finding Eulerian path in the de Bruijn graph
  – conditions for error-free reconstruction
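The de Bruijn approach to de novo assembly can be sketched as follows: nodes are (k-1)-mers, each distinct k-mer seen in the reads is an edge, and an Eulerian path (found here with Hierholzer's algorithm) spells a candidate reconstruction. This is a toy under strong assumptions: error-free reads, every k-mer of the target observed, and each k-mer used once. The function name `de_bruijn_assemble` and the example reads are hypothetical.

```python
from collections import defaultdict

def de_bruijn_assemble(reads, k):
    """Spell a string from an Eulerian path in the de Bruijn graph of the reads."""
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    graph = defaultdict(list)
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for kmer in sorted(kmers):
        u, v = kmer[:-1], kmer[1:]          # edge from prefix to suffix
        graph[u].append(v)
        out_deg[u] += 1
        in_deg[v] += 1
    nodes = set(out_deg) | set(in_deg)
    # An Eulerian path starts at a node with one more outgoing than incoming edge;
    # if there is none (Eulerian cycle), start anywhere.
    starts = [n for n in nodes if out_deg[n] - in_deg[n] == 1]
    start = starts[0] if starts else min(nodes)
    # Hierholzer's algorithm, iterative version.
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(kmers) + 1:
        raise ValueError("no Eulerian path uses every k-mer")
    return path[0] + "".join(node[-1] for node in path[1:])

# Hypothetical reads covering the target "ACGTTGCA".
reads = ["ACGTTG", "CGTTGC", "GTTGCA"]
assembled = de_bruijn_assemble(reads, k=4)
print(assembled)
```

In this example every (k-1)-mer of the target is distinct, so the Eulerian path is unique. When (k-1)-mers repeat, several Eulerian paths can exist and the target is not identifiable from the k-mers alone, which is the kind of situation the conditions for error-free reconstruction address.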
Example Application #2: Haplotype Assembly
• In many applications, there are multiple target sequences of interest that cannot be separated prior to sequencing
  – haplotype assembly, viral quasispecies reconstruction, bacterial communities, immune cell repertoire
• The simplest one: haplotype assembly for diploids
  – reconstruct variable parts of chromosome pairs
Example Application #2: Haplotype Assembly
• Shotgun sequencing for haplotype assembly:
• Data model: short reads obtained by sampling (with replacement) from a complementary pair of binary strings
  – the task is to reconstruct the pair of strings
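The data model above can be simulated with a short sketch. It assumes contiguous reads of fixed length, a strand chosen uniformly at random for each read, and independent allele flips with probability `error_rate`; these modeling choices and the name `sample_haplotype_reads` are illustrative assumptions rather than the course's exact model.

```python
import random

def sample_haplotype_reads(h, num_reads, read_len, error_rate=0.0, seed=0):
    """Sample reads from a haplotype h (list of 0/1) and its complement.

    Each read is a list of length len(h); None marks sites the read does
    not cover (an erasure), and covered sites hold a possibly flipped allele.
    """
    rng = random.Random(seed)
    n = len(h)
    complement = [1 - a for a in h]
    reads = []
    for _ in range(num_reads):
        source = rng.choice((h, complement))   # which chromosome the read comes from
        start = rng.randint(0, n - read_len)
        read = [None] * n
        for j in range(start, start + read_len):
            allele = source[j]
            if rng.random() < error_rate:      # sequencing error flips the allele
                allele = 1 - allele
            read[j] = allele
        reads.append(read)
    return reads

h = [1, 0, 0, 1, 1, 0, 1, 0]
reads = sample_haplotype_reads(h, num_reads=6, read_len=3)
for r in reads:
    print("".join("-" if a is None else str(a) for a in r))
```

The reconstruction task is to recover `h` (and hence its complement) from the read matrix without knowing which strand each read came from.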
Example Application #2: Haplotype Assembly
• Methods for solving the haplotype assembly problem
  – (correlation) clustering
  – communication-theoretic techniques: decoding noisy codewords transmitted over a binary erasure channel
  – low-rank sparse matrix completion/factorization
• Analysis of fundamental limits of performance (accuracy, data redundancy)
  – information-theoretic tools
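One simple way to see the clustering and rank-one factorization views together is an alternating heuristic: with alleles coded as +1/-1, each read is (up to errors) either the haplotype or its negation on the sites it covers. The sketch below alternates between assigning reads to a strand and re-estimating the haplotype by majority vote. It is a swapped-in illustrative heuristic, not the specific algorithms covered in the course, and the name `assemble_haplotype` and the toy reads are hypothetical.

```python
def assemble_haplotype(reads, n, max_iters=20):
    """Alternating heuristic for diploid haplotype assembly.

    reads: list of dicts {site index: allele}, alleles coded +1 / -1.
    Returns h in {+1, -1, 0}^n (0 where no read gives evidence); the
    answer is only defined up to a global sign flip (h vs. its complement).
    """
    h = [0] * n
    for j, a in reads[0].items():            # initialize from the first read
        h[j] = a
    for _ in range(max_iters):
        # Step 1: assign each read to h (+1), its complement (-1), or neither (0).
        assign = []
        for r in reads:
            corr = sum(a * h[j] for j, a in r.items())
            assign.append((corr > 0) - (corr < 0))
        # Step 2: re-estimate each site by a sign-corrected majority vote.
        votes = [0] * n
        for u, r in zip(assign, reads):
            for j, a in r.items():
                votes[j] += u * a
        new_h = [(v > 0) - (v < 0) for v in votes]
        if new_h == h:
            break
        h = new_h
    return h

# Toy error-free reads from truth and its complement, each covering 3 sites.
truth = [1, -1, -1, 1, 1, -1]
reads = [
    {0: 1, 1: -1, 2: -1},    # from truth
    {1: 1, 2: 1, 3: -1},     # from the complement
    {2: -1, 3: 1, 4: 1},     # from truth
    {3: -1, 4: -1, 5: 1},    # from the complement
]
estimate = assemble_haplotype(reads, n=6)
print(estimate)
```

The heuristic relies on overlapping reads to propagate phase information from site to site; if the reads do not connect two groups of sites, their relative phase cannot be determined from the data.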
Recent Special Issues in EE/CS Community
• Recent IEEE special issues (can be accessed via IEEE Xplore):
  – IEEE Signal Processing Magazine, Special Issue on Signal Processing in Genomics and Proteomics, vol. 29, no. 1, January 2012.
  – IEEE Transactions on Information Theory, Special Issue on Molecular Biology and Neuroscience, vol. 56, no. 2, February 2010.
  – IEEE Journal of Selected Topics in Signal Processing, Special Issue on Genomic and Proteomic Signal Processing, vol. 2, no. 3, June 2008.
  – IEEE Signal Processing Magazine, Special Issue on Signal Processing in Genomics, vol. 24, no. 1, January 2007.
  – IEEE Trans. on Signal Processing, Special Issue on Genomic Signal Processing, vol. 54, no. 6, June 2006.