Interactive Scientific Image Analysis using Spark
Transcript of Interactive Scientific Image Analysis using Spark
SUMMIT EAST
Interactive Scientific Image Analysis and Analytics using Spark
Kevin Mader
Spark East, NYC, 19 March 2015

Outline
Background: Our Technique (why we have big data)
  X-Ray Tomographic Microscopy
  Imaging in 2015
  The Problem(s)
The Tools
  Spark Imaging Layer
  3D Imaging
  Hyperspectral Imaging
  Interactive Analysis / Streaming
The Science
  Genome Scale Studies
  Large Datasets
Outlook / Developments

Synchrotron-based X-Ray Tomographic Microscopy
The only technique which can do all:
peer deep into large samples
achieve < 1 μm isotropic spatial resolution
with 1.8 mm field of view
achieve > 10 Hz temporal resolution
8 GB/s of images
[1] Mokso et al., J. Phys. D, 46(49), 2013
Courtesy of M. Pistone at U. Bristol

Image Science in 2015: More and faster
X-Ray
Swiss Light Source (SRXTM) images at (>1000 fps) → 8 GB/s, diffraction patterns (cSAXS) at 30 GB/s
Nanoscopium (Soleil), 10 TB/day, 10-500 GB file sizes, very heterogeneous data
Optical
Light-sheet microscopy (see talk of Jeremy Freeman) produces images → 500 MB/s
High-speed confocal images at (>200 fps) → 78 Mb/s
Geospatial
New satellite projects (Skybox, etc.) will measure hundreds of terabytes to petabytes of images a year
Personal
GoPro 4 Black - 60 MB/s (3840 x 2160 x 30 fps) for $600
fps1000 - 400 MB/s (640 x 480 x 840 fps) for $400

How much is a TB, really?
If you looked at one 1000 x 1000 sized image every second, it would take you 139 hours to browse through a terabyte of data.
Year   Time to 1 TB   Manpower to keep up   Salary Costs / Month
2000   4096 min       2 people              25 kCHF
2008   1092 min       8 people              95 kCHF
2014   32 min         260 people            3255 kCHF
2016   2 min          3906 people           48828 kCHF
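The 139-hour figure can be checked with a few lines of arithmetic. The slide does not state the pixel depth, so this sketch assumes 16-bit greyscale pixels (2 bytes each) and a decimal terabyte; the object and function names are illustrative only.

```scala
object TerabyteSketch {
  // Assumption (not stated on the slide): 16-bit greyscale pixels, 2 bytes each.
  val bytesPerPixel: Long = 2L
  val imageBytes: Long = 1000L * 1000L * bytesPerPixel // one 1000 x 1000 image
  val terabyte: Long = 1000L * 1000L * 1000L * 1000L   // decimal TB

  // Number of images in one terabyte.
  val imagesPerTB: Long = terabyte / imageBytes

  // Viewing one image per second, expressed in hours.
  val browseHours: Double = imagesPerTB / 3600.0

  // People needed to keep up if one TB is acquired every `acquisitionMinutes`.
  def peopleToKeepUp(acquisitionMinutes: Double): Double =
    browseHours * 60.0 / acquisitionMinutes
}
```

Under these assumptions `peopleToKeepUp` comes out close to the manpower column for the 2000, 2008 and 2014 rows; the 2016 row differs, presumably because the acquisition time in the table is rounded.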

Computing has changed: Parallel
Moore's Law
$\text{Transistors} \propto 2^{T/(18\ \text{months})}$
Based on data from https://gist.github.com/humberto-ortiz/de4b3a621602b78bf90d
There are now many more transistors inside a single computer but the processing speed hasn't increased. How can this be?
Multiple Core
Many machines have multiple cores for each processor which can perform tasks independently
Multiple CPUs
More than one chip is commonly present
New modalities
GPUs provide many cores which operate at slow speed
Parallel Code is important

Cloud Computing Costs
The figure shows the range of cloud costs (determined by peak usage) compared to a local workstation, with utilization shown as the average number of hours the computer is used each week.
The figure shows the cost of a cloud-based solution as a percentage of the cost of buying a single machine. The values below 1 show the percentage as a number. The panels distinguish the average time to replacement for the machines in months.

The Problem
There is a flood of new data
What took an entire PhD 3-4 years ago can now be measured in a weekend, or even several seconds. Analysis tools have not kept up, are difficult to customize, and usually highly specific.
Optimized Data-Structures do not fit
Data-structures that were fast and efficient for computers with 640 kB of memory do not make sense anymore
Single-core computing is too slow
CPUs are not getting that much faster but there are a lot more of them. Iterating through a huge array takes almost as long on 2014 hardware as 2006 hardware

Exploratory Image Processing Priorities
Correctness
The most important job for any piece of analysis is to be correct.
A powerful testing framework is essential
Avoid repetition of code which leads to inconsistencies
Use compilers to find mistakes rather than users
Easily understood, changed, and used
Almost all image processing tasks require a number of people to evaluate and implement them and are almost always moving targets
Flexible, modular structure that enables replacing specific pieces
Fast
The last of the major priorities is speed, which covers scalability, raw performance, and development time.
Long waits for processing discourage exploration
Manual access to data on separate disks is a huge speed barrier
Real-time image processing requires millisecond latencies
Implementing new ideas can be done quickly

The Framework First
Rather than building an analysis as quickly as possible and then trying to hack it to scale up to large datasets
choose the framework first
then start making the necessary tools.
Google, Amazon, Yahoo, and many other companies have made huge in-roads into these problems
The real need is a fast, flexible framework for robustly, scalably performing complicated analyses, a sort of Excel for big imaging data.
Apache Spark and Hadoop 2
The two frameworks provide a free out-of-the-box solution for
scaling to >10000 computers
storing and processing exabytes of data
fault tolerance
2/3rds of computers can crash and a request still accurately finishes
hardware and software platform independence (Mac, Windows, Linux)

Spark -> Microscopy?
These frameworks are really cool and Spark has a big vocabulary, but flatMap, filter, aggregate, join, groupBy, and fold still do not sound like anything I want to do to an image.
I want to
filter out noise, segment, choose regions of interest
contour, component label
measure, count, and analyze
…
Spark Image Layer
Developed at 4Quant, ETH Zurich, and Paul Scherrer Institut
The Spark Image Layer is a Domain Specific Language for Microscopy for Spark.
It converts common imaging tasks into coarse-grained Spark operations

Spark Image Layer
We have developed a number of commands for SIL handling standard image processing tasks
Fully extensible with

Use case: Hyperspectral Imaging
Hyperspectral imaging is a rapidly growing area with the potential for massive datasets and a severe deficit of usable tools.
The scale of the data is large and standard image processing tools are ill-suited for handling them, although the ideas used in image processing are equally applicable to hyperspectral data (filtering, thresholding, segmentation, …) and distributed, parallel approaches make even more sense on such massive datasets

Flexibility through Types
Developing in Scala brings additional flexibility through types [1]. With microscopy, the standard formats are 2-, 3- and even 4- or more dimensional arrays or matrices which can be iterated through quickly using CPU and GPU code. While this is still possible in Scala, there is a great deal more flexibility for data types, allowing anything to be stored as an image and then processed as long as basic functions make sense.
[1] Fighting Bit Rot with Types (Experience Report: Scala Collections), M. Odersky, FSTTCS 2009, December 2009
What is an image?
A collection of positions and values, maybe more (not an array of double). Arrays are efficient for storing in computer memory, but often a poor way of expressing scientific ideas and analyses.
Filter Noise?
combine information from nearby pixels
Find objects
determine groups of pixels which are very similar to desired result

Making Coding Simpler with Types
trait BasicMathSupport[T] extends Serializable {
  def plus(a: T, b: T): T
  def times(a: T, b: T): T
  def scale(a: T, b: Double): T
  def negate(a: T): T = scale(a, -1)
  def invert(a: T): T
  def abs(a: T): T
  def minus(a: T, b: T): T = plus(a, negate(b))
  def divide(a: T, b: T): T = times(a, invert(b))
  def compare(a: T, b: T): Int
}

Continuing with Types
Simple filter implementation
def SimpleFilter[T](inImage: Image[T])(implicit wst: BasicMathSupport[T]) = {
  val width: Double = 1
  // Gaussian weight applied through the type class, so any supported T works
  val kernel = (pos: D3int, value: T) =>
    wst.scale(value, math.exp(-math.pow(pos.mag / width, 2)))
  // average two contributions
  val kernelReduce = (ptA: T, ptB: T) => wst.scale(wst.plus(ptA, ptB), 0.5)
  runFilter(inImage, kernel, kernelReduce)
}
Spectra as well supported types
implicit val SpectraBMS = new BasicMathSupport[Array[Double]] {
  def plus(a: Array[Double], b: Array[Double]) = a.zip(b).map { case (x, y) => x + y }
  ...
  def scale(a: Array[Double], b: Double) = a.map(_ * b)
}
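To show how the type class lets one piece of code serve both scalar pixels and spectra, here is a minimal self-contained sketch. The trait is copied from the slide; the two instances, the `mean` helper and the rule used for `compare` on spectra (total intensity) are illustrative assumptions, not part of the Spark Image Layer.

```scala
object BasicMathSupportSketch {
  // Type class as shown on the slide.
  trait BasicMathSupport[T] extends Serializable {
    def plus(a: T, b: T): T
    def times(a: T, b: T): T
    def scale(a: T, b: Double): T
    def negate(a: T): T = scale(a, -1)
    def invert(a: T): T
    def abs(a: T): T
    def minus(a: T, b: T): T = plus(a, negate(b))
    def divide(a: T, b: T): T = times(a, invert(b))
    def compare(a: T, b: T): Int
  }

  // Instance for ordinary scalar pixel values.
  implicit val doubleBMS: BasicMathSupport[Double] = new BasicMathSupport[Double] {
    def plus(a: Double, b: Double): Double = a + b
    def times(a: Double, b: Double): Double = a * b
    def scale(a: Double, b: Double): Double = a * b
    def invert(a: Double): Double = 1.0 / a
    def abs(a: Double): Double = math.abs(a)
    def compare(a: Double, b: Double): Int = java.lang.Double.compare(a, b)
  }

  // Instance for spectra: every operation is applied channel by channel.
  implicit val spectraBMS: BasicMathSupport[Array[Double]] =
    new BasicMathSupport[Array[Double]] {
      private def zipWith(a: Array[Double], b: Array[Double])(f: (Double, Double) => Double) =
        a.zip(b).map { case (x, y) => f(x, y) }
      def plus(a: Array[Double], b: Array[Double]): Array[Double] = zipWith(a, b)(_ + _)
      def times(a: Array[Double], b: Array[Double]): Array[Double] = zipWith(a, b)(_ * _)
      def scale(a: Array[Double], b: Double): Array[Double] = a.map(_ * b)
      def invert(a: Array[Double]): Array[Double] = a.map(v => 1.0 / v)
      def abs(a: Array[Double]): Array[Double] = a.map(v => math.abs(v))
      // Assumption: spectra are ordered by their total intensity.
      def compare(a: Array[Double], b: Array[Double]): Int =
        java.lang.Double.compare(a.sum, b.sum)
    }

  // Generic code written once against the type class.
  def mean[T](values: Seq[T])(implicit bms: BasicMathSupport[T]): T =
    bms.scale(values.reduce((a, b) => bms.plus(a, b)), 1.0 / values.size)
}
```

The same `mean` call then works on `Seq[Double]` and on `Seq[Array[Double]]` without any change to the calling code.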

Interactive Analysis
Combining many different components together inside of the Spark Shell, IPython or Zeppelin makes it easier to assemble workflows

Scientific Cases: Genome-scale Imaging
We want to understand the relationship between genetic background and bone structure
With existing tools, analysis is possible and a number of publications have been made, even ones that show differences between strains of mice
But
n < 12
time-consuming (years between measurement and publication)
not flexible or reproducible
not cloud-based

Genome-Scale Imaging
Genetic studies require hundreds to thousands of samples; in this case the difference between 717 and 1200 samples is the difference between finding the links and finding nothing.
2008 approach - 120 years
Hand Identification -> 30 s/object
30-40k objects per sample
One sample in 6.25 weeks
2014 approach - 1.5 years
ImageJ macro for segmentation (2-4 hours/sample)
Python script for shape analysis (3 hours/sample)
Paraview macro for network and connectivity (2 hours/sample)
Python script to pool results (3-4 hours)
MySQL database storing results (5 minutes/query)

Genetic Studies using Spark Image Layer
Analysis could be completed in several months (instead of 120 years; could now be completed in days in the cloud)
Data can be freely explored and analyzed
val bones = sc.loadImages("work/f2_bones/*/bone.tif")
Segment hard and soft tissues
val hardTissue = bones.threshold(OTSU)
val softTissue = hardTissue.invert
Label cells
val cells = hardTissue.componentLabel.
  filter(c => c.size > 100 && c.size < 1000)
Export results
cells.shapeAnalysis.WriteOutput("lacuna.csv")
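The threshold, label and size-filter steps above can be sketched on a tiny in-memory image with plain Scala collections. This is not the Spark Image Layer implementation: the names are hypothetical, a fixed cutoff stands in for the Otsu threshold, and components are 4-connected by assumption.

```scala
object SegmentSketch {
  type Pos = (Int, Int)

  // Threshold: keep the positions whose value exceeds a fixed cutoff
  // (assumption: a fixed value replaces the Otsu threshold from the slide).
  def threshold(img: Map[Pos, Double], cutoff: Double): Set[Pos] =
    img.collect { case (pos, v) if v > cutoff => pos }.toSet

  // Component labelling: group foreground positions into 4-connected sets
  // by repeatedly flood-filling from an unvisited seed.
  def componentLabel(fg: Set[Pos]): List[Set[Pos]] = {
    @annotation.tailrec
    def grow(frontier: Set[Pos], seen: Set[Pos]): Set[Pos] =
      if (frontier.isEmpty) seen
      else {
        val next = frontier.flatMap { case (x, y) =>
          Set((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
        }.filter(p => fg(p) && !seen(p))
        grow(next, seen ++ next)
      }

    @annotation.tailrec
    def loop(remaining: Set[Pos], acc: List[Set[Pos]]): List[Set[Pos]] =
      if (remaining.isEmpty) acc.reverse
      else {
        val seed = remaining.head
        val comp = grow(Set(seed), Set(seed))
        loop(remaining -- comp, comp :: acc)
      }

    loop(fg, Nil)
  }

  // Size filter, mirroring filter(c => c.size > min && c.size < max).
  def sizeFilter(comps: List[Set[Pos]], min: Int, max: Int): List[Set[Pos]] =
    comps.filter(c => c.size > min && c.size < max)
}
```

In a distributed setting the same three steps would run per image and the labelled components would be collected for shape analysis.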

Parallel Tools for Image and Quantitative Analysis
val cells = sqlContext.csvFile("work/f2_bones/*/cells.csv")
val avgVol = sqlContext.sql("select SAMPLE, AVG(VOLUME) FROM cells GROUP BY SAMPLE")
Collaborators / Competitors can verify results and extend on analyses
Combine Images with Results
avgVol.filter(_._2 > 1000).map(sampleToPath).joinByKey(bones)
See immediately in datasets of terabytes which image had the largest cells
New hypotheses and analyses can be done in seconds / minutes
Task                    Single Core Time   Spark Time (40 cores)
Load and Preprocess     360 minutes        10 minutes
Single Column Average   4.6 s              400 ms
1 K-means Iteration     2 minutes          1 s
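For readers less familiar with SQL, the per-sample average and the follow-up filter above amount to a group-by followed by a mean. A minimal sketch with plain Scala collections, assuming a hypothetical `Cell` record with only the two columns used in the query:

```scala
object GroupAverageSketch {
  // Hypothetical record holding the two columns used by the query.
  final case class Cell(sample: String, volume: Double)

  // Equivalent of: select SAMPLE, AVG(VOLUME) FROM cells GROUP BY SAMPLE
  def averageVolume(cells: Seq[Cell]): Map[String, Double] =
    cells.groupBy(_.sample).map { case (sample, group) =>
      sample -> (group.map(_.volume).sum / group.size)
    }

  // Equivalent of: avgVol.filter(_._2 > cutoff), keeping only the sample names.
  def largeCellSamples(cells: Seq[Cell], cutoff: Double): Set[String] =
    averageVolume(cells).filter { case (_, avg) => avg > cutoff }.keySet
}
```

The surviving sample names are what the slide then maps back to image paths and joins with the images.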

Science Problems: Full Brain Imaging
Collaboration with A. Astolfo and A. Patera
Measure a full mouse brain (1 cm³) with cellular resolution (1 μm)
10 x 10 x 10 scans at 2560 x 2560 x 2160 → 14 TVoxels
0.000004% of the entire dataset
14 TVoxels = 56 TB
Each scan needs to be registered and aligned together
There are no computers with 56 TB of memory
Even multithreaded approaches are not feasible and require many logistics
Analysis of the stitched data is also of interest (segmentation, vessel analysis, distribution and network connectivity)
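The voxel count and memory figure above follow from simple multiplication. The slide does not give the storage per voxel; the 56 TB total is consistent with 4 bytes per voxel (for example 32-bit floating point), which is the assumption used here.

```latex
\begin{align*}
N_{\text{voxels}} &= (10 \times 10 \times 10) \times (2560 \times 2560 \times 2160) \\
                  &= 10^{3} \times 1.4156 \times 10^{10}
                   \approx 1.4 \times 10^{13} = 14\ \text{TVoxels} \\
\text{Memory}     &\approx 14 \times 10^{12}\ \text{voxels} \times 4\ \text{bytes/voxel}
                   = 56 \times 10^{12}\ \text{bytes} = 56\ \text{TB}
\end{align*}
```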

Science Problems: Big Stitching
Images : RDD[((x, y, z), Img[Double])] = [(x, Img), …]
dispField = Images.
  cartesian(Images).map {
    case ((xA, ImA), (xB, ImB)) =>
      xcorr(ImA, ImB, in = xB - xA)
  }

From Matching to Stitching
From the updated information provided by the cross correlations and by applying appropriate smoothing criteria (if necessary).
The stitching itself, rather than rewriting the original data, can be done in a lazy fashion as certain regions of the image are read.
This also ensures the original data is left unaltered and all analysis is reversible.
def getView(tPos, tSize) =
  stImgs.
    // keep only the tiles that overlap the requested region
    filter { case (x, img) => abs(x - tPos) < img.size }.
    map { case (x, img) =>
      val oImg = new Image(tSize)
      oImg.copy(img, x, tPos)
    }.addImages(AVG)

Viewing Regions
getView(Pos(26.5, 13), Size(2, 2))

Real-time with Spark Streaming: Webcam
In the biological imaging community, the open source tools of ImageJ2 and Fiji are widely accepted and have a large number of readily available plugins and tools.
We can integrate the functionality directly into Spark and perform operations on much larger datasets than a single machine could have in memory. Additionally, these analyses can be performed on streaming data.

Streaming Analysis
Real-time Webcam Processing
val wr = new WebcamReceiver()
val ssc = sc.toStreaming(strTime)
val imgList = ssc.receiverStream(wr)
Filter images
val filtImgs = allImgs.mapValues(_.run("Median...", "radius=3"))
Create a background image
val totImgs = inImages.count()
val bgImage = inImages.reduce(_ add _).multiply(1.0 / totImgs)

Identify Outliers in Streams
Remove the background image and find the mean value
val eventImages = filtImgs.
  transform { inImages =>
    val corImage = inImages.map {
      case (inTime, inImage) =>
        val corImage = inImage.subtract(bgImage)
        (corImage.getImageStatistics().mean, (inTime, corImage))
    }
    corImage
  }
Show the outliers
eventImages.filter(iv => Math.abs(iv._1) > 20).
  foreachRDD(showResultsStr("outlier", _))
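The background and outlier logic above can be tried without a webcam or a Spark cluster. The following sketch uses plain Scala collections, represents an image as a flat `Vector[Double]`, and uses hypothetical names throughout; only the idea (average the frames, subtract the background, flag frames whose mean deviation exceeds a cutoff) is taken from the slides.

```scala
object OutlierSketch {
  // Assumption: an image is a flat vector of pixel intensities.
  type Img = Vector[Double]

  // Background image: the pixel-wise mean of all frames.
  def background(frames: Seq[Img]): Img =
    frames
      .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      .map(_ / frames.size)

  // Mean value of a frame after subtracting the background.
  def meanDeviation(frame: Img, bg: Img): Double = {
    val diff = frame.zip(bg).map { case (v, b) => v - b }
    diff.sum / diff.size
  }

  // Timestamps of the frames whose mean deviation exceeds the cutoff.
  def outliers(frames: Seq[(Long, Img)], bg: Img, cutoff: Double): Seq[Long] =
    frames.collect {
      case (t, img) if math.abs(meanDeviation(img, bg)) > cutoff => t
    }
}
```

With a cutoff of 20, as on the slide, only frames that differ strongly from the background on average are reported.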

Streaming Demo with Webcam

As a scientist (not a data-scientist)
Apache Spark is a brilliant platform, and utilizing GraphX, MLLib, and other packages there are unlimited possibilities
Scala can be a beautiful but not easy language
Python is an easier language
Both suffer from
Non-obvious workflows
Scripts depending on scripts depending on scripts (can be very fragile)
Although all analyses can be expressed as a workflow, this is often difficult to see from the code
Non-technical persons have little ability to understand or make minor adjustments to analysis
Parameters require recompiling to change
or GUIs need to be placed on top

A basic image filtering operation
Thanks to Spark, it is cached, in memory, approximate, cloud-ready
Thanks to Map-Reduce, it is fault-tolerant, parallel, distributed
Thanks to Java, it is hardware agnostic
But it is also not really so readable
def spread_voxels(pvec: ((Int, Int), Double), windSize: Int = 1) = {
  val wind = (-windSize to windSize)
  val pos = pvec._1
  val scalevalue = pvec._2 / (wind.length * wind.length)
  for (x <- wind; y <- wind)
    yield ((pos._1 + x, pos._2 + y), scalevalue)
}
val filtImg = roiImg.
  flatMap(cvec => spread_voxels(cvec)).
  filter(roiFun).reduceByKey(_ + _)
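The filter above can be run on a laptop by swapping the RDD for a plain Scala collection. In this sketch `groupBy` followed by a sum stands in for Spark's `reduceByKey(_ + _)`; the camel-case names and the `boxFilter` wrapper are illustrative, not part of the original code.

```scala
object BoxFilterSketch {
  type Voxel = ((Int, Int), Double)

  // Spread each voxel's value evenly over its (2 * windSize + 1)^2 neighbourhood.
  def spreadVoxels(pvec: Voxel, windSize: Int = 1): Seq[Voxel] = {
    val wind = -windSize to windSize
    val (px, py) = pvec._1
    val scaleValue = pvec._2 / (wind.length * wind.length)
    for (x <- wind; y <- wind) yield ((px + x, py + y), scaleValue)
  }

  // Mean (box) filter: spread, keep contributions inside the region of
  // interest, then sum the contributions that land on the same position.
  def boxFilter(img: Seq[Voxel], inRoi: Voxel => Boolean): Map[(Int, Int), Double] =
    img
      .flatMap(v => spreadVoxels(v))
      .filter(inRoi)
      .groupBy(_._1)
      .map { case (pos, contributions) => pos -> contributions.map(_._2).sum }
}
```

On a uniform 3 x 3 image the centre keeps its value, while edge and corner positions come out lower because fewer neighbours contribute to them.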

Little blocks for big data
Here we use a KNIME-based workflow and our Spark Imaging Layer extensions to create a workflow without any Scala or programming knowledge and with an easily visible flow from one block to the next without any performance overhead of using other tools.

Reality Check
Spark is not performant → dedicated, optimized CPU and GPU codes will perform slightly to much, much better when evaluated by pixels per second per processing power unit
these codes will be wildly outperformed by dedicated hardware / FPGA solutions
Serialization overhead and network congestion are not negligible for large datasets
But
Scala / Python in Spark is substantially easier to write and test
Highly optimized codes are very inflexible
Human time is 400x more expensive than AWS time
Mistakes due to poor testing can be fatal
Spark scales smoothly to enormous datasets
GPUs rarely have more than a few gigabytes
Writing code that pages to disk is painful
Spark is hardware agnostic (no drivers or vendor lock-in)

We have a cool tool, but what does this mean for me?
A spinoff - 4Quant: From images to insight
Cloud Image Processing
Use our distributed version of ImageJ in the cloud to analyze thousands of remote datasets using your own, ours, or community provided processing routines
Custom Analysis Solutions
Custom-tailored software to solve your problems
One Stop Shop
Measurement, analysis, and statistical analysis
Education / Training
Consulting
Advice on imaging techniques, analysis possibilities
Development of new analysis tools and workflows
Education
Workshops on Image Analysis
Courses / Training
Quantitative Big Imaging

Acknowledgements
AIT at PSI and Scientific Computer at ETH
TOMCAT Group
We are interested in partnerships and collaborations
Learn more at
4Quant: From Images to Statistics - http://www.4quant.com
X-Ray Imaging Group at ETH Zurich - http://bit.ly/1gD8wKb
Quantitative Big Imaging Course at ETH Zurich

Feature Vectors
A pairing between spatial information (position) and some other kind of information (value).
$\vec{x} \rightarrow \vec{f}$
We are used to seeing images in a grid format where the position indicates the row and column in the grid and the intensity (absorption, reflection, tip deflection, etc.) is shown as a different color
The alternative form for this image is as a list of positions and a corresponding value
$I = (\vec{x}, \vec{f})$
x   y   Intensity
1   1   12
2   1   68
3   1   81
4   1   89
5   1   87
1   2   40
This representation can be called the feature vector and in this case it only has Intensity
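Converting between the two forms is mechanical. A minimal sketch, assuming the grid is stored row by row and positions are 1-based as in the table; the `Feature` case class and `toFeatures` function are hypothetical names.

```scala
object FeatureVectorSketch {
  // One entry of the list form: a position and its value.
  final case class Feature(x: Int, y: Int, intensity: Double)

  // Convert a grid stored as rows (grid(rowIndex)(columnIndex)) into the
  // list form, with 1-based x (column) and y (row) positions.
  def toFeatures(grid: Vector[Vector[Double]]): Vector[Feature] =
    for {
      (row, y) <- grid.zipWithIndex
      (value, x) <- row.zipWithIndex
    } yield Feature(x + 1, y + 1, value)
}
```

Once the image is a list of `Feature` values, adding another measured quantity is just a matter of adding a field.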

Why Feature Vectors
If we use feature vectors to describe our image, we no longer have to worry about how the images will be displayed, and can focus on the segmentation / thresholding problem from a classification rather than an image-processing standpoint.
Example
So we have an image of a cell and we want to identify the membrane (the ring) from the nucleus (the point in the middle).
A simple threshold doesn't work because we identify the point in the middle as well. We could try to use morphological tricks to get rid of the point in the middle, or we could better tune our segmentation to the ring structure.

Adding a new feature
In this case we add a very simple feature to the image, the distance from the center of the image (distance).
x     y     Intensity   Distance
-10   -10   0.9350683   14.14214
-10   -9    0.7957197   13.45362
-10   -8    0.6045178   12.80625
-10   -7    0.3876575   12.20656
-10   -6    0.1692429   11.66190
We now have a more complicated image, which we can't as easily visualize, but we can incorporate these two pieces of information together.

Applying two criteria
Now instead of trying to find the intensity for the ring, we can combine density and distance to identify it
iff (5 < Distance < 10 & 0.5 < Intensity < 1.0)
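The distance feature and the combined criterion can be written directly against the feature-vector form. This sketch assumes the image centre is at the origin, as in the table above, and reads the intensity condition as an upper bound of 1.0; the names are illustrative.

```scala
object RingSketch {
  final case class Feature(x: Int, y: Int, intensity: Double) {
    // Added feature: Euclidean distance from the image centre (assumed at 0, 0).
    def distance: Double = math.hypot(x.toDouble, y.toDouble)
  }

  // Ring (membrane) criterion: 5 < Distance < 10 and 0.5 < Intensity < 1.0.
  def isRing(f: Feature): Boolean =
    f.distance > 5 && f.distance < 10 && f.intensity > 0.5 && f.intensity < 1.0

  // Segment the ring by filtering the list of feature vectors.
  def ring(features: Seq[Feature]): Seq[Feature] = features.filter(isRing)
}
```

A bright point at the centre has the right intensity but a distance near zero, so it is rejected without any morphological post-processing.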

Common Features
The distance, while illustrative, is not a commonly used feature; more common are various filters applied to the image
Gaussian Filter (information on the values of the surrounding pixels)
Sobel / Canny Edge Detection (information on edges in the vicinity)
Entropy (information on variability in vicinity)
x   y    Intensity   Sobel   Gaussian
1   1    0.94        0.32    0.53
1   10   0.48        0.50    0.45
1   11   0.50        0.50    0.46
1   12   0.48        0.64    0.46
1   13   0.43        0.78    0.45
1   14   0.33        0.94    0.42

Analyzing the feature vector
The distributions of the features appear very different and can thus likely be used for identifying different parts of the images.
Combine this with our a priori information (called supervised analysis)

Using Machine Learning
Now that the images are stored as feature vectors, they can be easily analyzed with standard Machine Learning tools. It is also much easier to combine with training information.
x     y   Absorb      Scatter     Training
700   4   0.3706262   0.9683849   0.0100140
704   4   0.3694059   0.9648784   0.0100140
692   8   0.3706371   0.9047878   0.0183156
696   8   0.3712537   0.9341989   0.0334994
700   8   0.3666887   0.9826912   0.0453049
704   8   0.3686623   0.8728824   0.0453049
Want to predict Training from x, y, Absorb, and Scatter → MLLib: Logistic Regression, Random Forest, K-Nearest Neighbors, …

Beyond Image Processing
For many datasets, processing, segmentation, and morphological analysis is all the information needed to be extracted. For many systems like bone tissue, cellular tissues, cellular materials and many others, the structure is just the beginning and the most interesting results come from the application of physical, chemical, or biological rules inside of these structures.
$\sum_j \vec{F}_{ij} = m \ddot{\vec{x}}_i$
Such systems can be easily represented by a graph, and analyzed using GraphX in a distributed, fault tolerant manner.

Hadoop Filesystem (HDFS not HDF5)
Bottleneck is the filesystem connection; many nodes (10+) reading in parallel brings even a GPFS-based InfiniBand system to a crawl
One of the central tenets of MapReduce™ is data-centric computation → instead of data to computation, move the computation to the data.
Use fast local storage for storing everything redundantly → less transfer and fault-tolerance
Largest file size: 512 yottabytes, Yahoo has 14 petabyte filesystem in use