Parallel Wavelet Radiosity


Parallel Wavelet Radiosity

Xavier Cavin, Laurent Alonso and Jean-Claude Paul
{Xavier.Cavin,Laurent.Alonso,Jean-Claude.Paul}@loria.fr

LORIA - INRIA Lorraine, Campus scientifique, BP 239,
54506 Vandœuvre-lès-Nancy Cedex, France

Abstract: This paper presents parallel versions of a wavelet radiosity algorithm. Wavelet radiosity is based on a general framework of projection methods and wavelet theory. The resulting algorithm has a cost proportional to $O(n)$, versus the $O(n^2)$ complexity of the classical radiosity algorithms. However, designing a parallel wavelet radiosity is challenging because of its irregular and dynamic nature. Since explicit message passing approaches fail to deal with such applications, we have experimented with various parallel implementations on a hardware ccNUMA architecture, the SGI Origin2000. Our experiments show that load balancing is a crucial performance issue to handle the dynamic distribution of work and communication, while we make all reasonable efforts to exploit data locality efficiently. Our best results yield a speed-up of 24 with 36 processors, even when dealing with extremely complex models.

1 Introduction

Radiosity methods have been proven to be an efficient means to simulate the inter-reflections of light in Lambertian (diffuse) environments. The radiosity, the power per unit area [$W \cdot m^{-2}$] on a given surface, is governed by an integral equation which can be solved by projecting the unknown radiosity function onto a set of basis functions with limited supports, resulting in a set of $n^2$ interactions. The wavelet theory, introduced by [5] and [9], has been shown to reduce the required interactions to $O(n)$. Unfortunately, wavelet radiosity still requires too much computation time when dealing with extremely complex models (several millions of polygons with physical properties), even on modern workstations [3].

Designing a parallel wavelet radiosity algorithm involves dealing with complex issues, such as:

– the geometrical domain being simulated is typically non-uniform, which has implications for both load balancing and communication;

– work and communication change dynamically across the computation of the solution because of the hierarchical nature of wavelets.

Most previous works on parallel radiosity are related to classical radiosity algorithms. Although the most recent of them [8] obtains quite good results on a SGI Origin2000 and yields radiosity computations for very large models, the implemented algorithm suffers from having an $O(n^2)$ complexity. Only a few papers address the design problem of parallel hierarchical algorithms. Zareski implemented in [12] a parallel version of the hierarchical radiosity algorithm on a network of workstations using a master-slave architecture, in which each slave performs surface-element interactions for a separate subset of elements in the scene. Speed-up with this fine-grained approach is limited by both the master processing bottleneck and the overhead of inter-process communication, resulting in longer execution times as the number of processors increases. Aiming to use a coarse-grained parallelism in a similar master-slave approach, Funkhouser [4] designed partitioning and scheduling algorithms that allow multiple hierarchical radiosity solvers to work on the same radiosity solution in parallel.


In this way, the author achieved significant speed-ups with a very large model, but the scalability of the parallelism has to stay moderate (less than 8 slave workstations). A more general study of hierarchical algorithms has been addressed in [11]. Experiments on the 48-processor Stanford DASH machine (a Cache Coherent Shared Address Space Multiprocessor) have yielded very good performance, although it was on a very small model (94 input polygons). When dealing with much larger environments (typically in the order of hundreds of thousands of polygons), the amount of communication changes the dimension of the problem.

In summary, all related works failed to provide a parallel response to the following combined goals:

– a radiosity algorithm with a good asymptotic complexity,
– dealing with extremely large models,
– and effective scalability to a large number of processors.

The parallel solution presented here addresses these three problems. Load balancing algorithms implemented on the SGI Origin2000 achieve significant speed-ups for the wavelet radiosity algorithm, running on 32 processors, for models composed of hundreds of thousands of polygons of arbitrary geometry.

The organization of the paper is as follows. In Section 2, we present the mathematical framework of the wavelet radiosity method and a very efficient sequential implementation of this algorithm. Then, we present our parallel implementation strategy on a SGI Origin2000 in Section 3. We then describe our experiments in Section 4, and results and discussion in Section 5. Finally, we conclude and present future works in Section 6.

2 Wavelet Radiosity

2.1 The Radiosity Equation

Given some physical assumptions, the radiosity equation can be formulated as follows. Let $\mathcal{M}$ denote the collection of all surfaces in an environment, which we assume to form an enclosure for simplicity. Let $\mathcal{H}$ be a space of real-valued functions defined on $\mathcal{M} \times \Lambda$; that is, over all surface points and all wavelengths ($\lambda$). Given the surface emission function $B_e(x, \lambda)$, which specifies the origin and directional distribution of emitted light, we wish to determine the surface radiosity function $B(x, \lambda)$ that satisfies:

$$B(x, \lambda) = B_e(x, \lambda) + \rho(x, \lambda) \int_{\mathcal{M}} G(x', x) \, B(x', \lambda) \, dx' \qquad (1)$$

where:

– $\lambda$ is the wavelength at which functions are computed,
– $\rho(x, \lambda)$ is the local reflectance function of the surface (i.e., we suppose that the surfaces are ideally diffuse),
– $G(x', x) = \frac{\cos\theta \, \cos\theta'}{\pi r^2} \, V(x', x)$ is a geometry term consisting of the cosines made by the local surface normals with the vector connecting the two surface points $x$ and $x'$, of the distance $r$ between these two points, and of the visibility function $V$ whose value is in $\{0, 1\}$ according to whether the line between the two points $x$ and $x'$ is respectively obstructed or un-obstructed.

Using the operator notation, we can express the equation to be solved as $B = B_e + \mathcal{K} B$, which is a linear equation of the second kind, or more compactly as $\mathcal{L} B = B_e$.


Projection methods, whose role is to recast infinite-dimensional problems in finite dimensions, can be used to solve the radiosity equation. The idea is to construct an approximate solution from a subspace $\mathcal{H}_n \subset \mathcal{H}$, where the parameter $n$ denotes the dimension of the subspace. In any case, each element of the function space $\mathcal{H}_n$ is a linear combination of a finite number of basis functions $\{N_1, \dots, N_n\}$.

Given this space of basis functions $\mathcal{H}_n = \mathrm{span}\{N_1, \dots, N_n\}$, we seek an approximation $B_n$ in the space $\mathcal{H}_n$ that is close to $B$. This is equivalent to determining $n$ unknown coefficients $\{\alpha_1, \dots, \alpha_n\}$ such that $B \approx \sum_{i=1}^{n} \alpha_i N_i$.

Projection methods select such approximations from $\mathcal{H}_n$ by imposing a finite number of conditions on the residual error, which is defined by $r_n \equiv \mathcal{L} B_n - B_e$.

Specifically, we attempt to find $B_n$ such that $B_n$ simultaneously satisfies $n$ linear constraints, that is $\Phi_i(r_n) = 0$, for $i = 1, \dots, n$, where the $\Phi_i$ are linear functionals. Any collection of $n$ linearly independent functionals defines an approximation $B_n$ by “pinning down” the residual error with sufficiently many constraints to uniquely determine the coefficients. Note that functionals and basis functions together define a projection operator [1]. This way, we have:

$$\Phi_i\!\left( \mathcal{L} \sum_{j=1}^{n} \alpha_j N_j - B_e \right) = 0, \qquad i = 1, \dots, n, \qquad (2)$$

which is a system of $n$ equations for the unknown coefficients $\{\alpha_1, \dots, \alpha_n\}$. The quality of the approximation, as well as the computations required to obtain it, depend on the approximation properties of the space $\mathcal{H}_n$, the choice of functionals, and finally the numerical techniques which are used to compute the matrix coefficients and to solve the linear system. There are other sources of error due to imprecise geometry or boundary conditions [2].
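For instance, with the Galerkin choice of functionals (assumed here only for illustration, and consistent with the Galerkin method mentioned in the next subsection), each functional is an inner product against a basis function, $\Phi_i(f) = \langle N_i, f \rangle$, and (2) reduces to an ordinary linear system:

$$\sum_{j=1}^{n} \langle N_i, \mathcal{L} N_j \rangle \, \alpha_j = \langle N_i, B_e \rangle, \qquad i = 1, \dots, n,$$

whose matrix form is exactly the kind of system studied in the next subsection.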

2.2 Radiosity and Wavelet Theory

The wavelet radiosity algorithm is a special case of the formulation in which the matrix form of (2) is:

$$
\begin{pmatrix}
K(\psi_1, \psi_1) & K(\psi_1, \psi_2) & \cdots & K(\psi_1, \psi_n) \\
K(\psi_2, \psi_1) & K(\psi_2, \psi_2) & \cdots & K(\psi_2, \psi_n) \\
\vdots & \vdots & \ddots & \vdots \\
K(\psi_n, \psi_1) & K(\psi_n, \psi_2) & \cdots & K(\psi_n, \psi_n)
\end{pmatrix}
\begin{pmatrix}
\alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_n
\end{pmatrix}
=
\begin{pmatrix}
\langle B_e, \psi_1 \rangle \\ \langle B_e, \psi_2 \rangle \\ \vdots \\ \langle B_e, \psi_n \rangle
\end{pmatrix}
\qquad (3)
$$

where the $\psi_i$, $\psi_j$ are wavelets. As shown in [9], wavelets can be used as bases for function spaces, so a linear operator can be expressed in terms of them. If this operator satisfies certain smoothness conditions - as the radiosity operator does - the resulting matrix is approximately sparse and the system can be solved asymptotically faster if only finite precision is required for the answer.

More precisely, the linear system in (3) has entries which are the coefficients of the kernel function with respect to some basis. In classical radiosity, the Galerkin method gives rise to a system relating all of these basis functions with each other, resulting in a system of size $n^2$. However, such bases possess properties which derive directly from the kernel itself. Using wavelets as basis functions, the resulting matrix system is approximately sparse if the kernel is smooth. By ignoring the entries whose value is below some threshold, the resulting linear system has only $O(n)$ remaining entries, leading to fast resolution algorithms. To realize an algorithm of $O(n)$ complexity, an oracle function is needed to help enumerate the important entries in the matrix system.
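The role of the oracle can be illustrated by a classical hierarchical refinement loop. The following is a minimal C++ sketch, not the Candela implementation: the names (Element, oracleAccepts, subdivide) and the crude area-based error estimate are invented for the illustration. It only shows how an oracle decides at which level an emitter-receiver interaction is represented, so that only the important entries of the matrix in (3) are kept.

// Hedged sketch (not the Candela code): an oracle limits the interactions
// that are actually represented, which is what yields the claimed O(n) behaviour.
#include <utility>
#include <vector>

struct Element {                          // a surface element in the hierarchy
    double area;                          // stand-in geometric data
    std::vector<Element> children;
    bool isLeaf() const { return children.empty(); }
};

// Hypothetical oracle: accept the interaction at this level when a crude
// error estimate (here simply the product of the areas) is below the threshold.
static bool oracleAccepts(const Element& e, const Element& r, double eps) {
    return e.area * r.area < eps;
}

static void subdivide(Element& x) {       // lazily create four equal children
    for (int i = 0; i < 4; ++i)
        x.children.push_back(Element{x.area / 4.0, {}});
}

// Recursively choose the level at which the emitter-receiver transfer is kept.
// Each accepted pair corresponds to one retained entry of the matrix in (3).
static void refine(Element& emitter, Element& receiver, double eps,
                   std::vector<std::pair<Element*, Element*>>& kept) {
    if (oracleAccepts(emitter, receiver, eps)) {
        kept.push_back({&emitter, &receiver});
        return;
    }
    if (receiver.isLeaf()) subdivide(receiver);
    for (Element& child : receiver.children)
        refine(emitter, child, eps, kept);
}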


2.3 Sequential Implementation: The Candela Project

The Candela project was designed to provide a flexible architecture for testing and implementing new radiosity and radiance algorithms [7]. It was at the same time intended to be able to deal with real data (complex geometrical surfaces, accurate light source models, real spectrum modeling) and to compute physically correct results. Candela is based on the Open Inventor library in order to get the required flexibility, both on inputs and algorithmical combinations.

Several sequential wavelet algorithms and acceleration techniques have already been implemented. This part is detailed in [3]. For the sake of completeness, we give here the main characteristics of the existing algorithms:

– the scene model is described using the Open Inventor file format:
  – polygonal surfaces ($\mathcal{M}$) can be of arbitrary geometry;
  – the spatial part of an emitter $B_e$ is described using a dedicated representation;
  – the spectral distribution of a light source $B_e$ or of a diffuse coefficient $\rho$ of a surface can be coded in several function bases,
– the available bases of functions ($\psi_i$) are the Haar, $\mathcal{M}_2$ and $\mathcal{M}_3$ bases,
– the spectra ($\lambda$) are computed exactly and, when the result is stored on a surface, it is projected in a basis of functions which can be defined by the user,
– the visibility $V(x', x)$ can be accelerated using the hardware or a BSP (i.e., a Binary Space Partition),
– the kernel coefficients $K(\psi_i, \psi_j)$ are computed by Gaussian quadrature (a small illustrative sketch follows this list),
– both geometry based and energy based oracles exist,
– two solvers are implemented, one implements a gathering iterative scheme and the other one a progressive shooting scheme. The links can be stored or recomputed each time they are needed.
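As an illustration of the quadrature step mentioned in the list above, the following C++ sketch estimates a kernel coefficient $K(\psi_i, \psi_j)$ by a 2-point tensor-product Gauss-Legendre rule over the unit squares parameterising the two elements. The kernel and the basis functions are passed as callables; the names and the fixed quadrature order are assumptions made for the illustration, not the Candela code.

// Hedged sketch: tensor-product Gauss-Legendre estimate of
//   K(psi_i, psi_j) = integral of psi_i(u) k(u, v) psi_j(v) du dv
// over the parametric domains of two elements (illustrative only).
#include <array>
#include <cmath>
#include <functional>

using Func2D   = std::function<double(double, double)>;                    // psi(u1, u2)
using Kernel4D = std::function<double(double, double, double, double)>;    // k(u1, u2, v1, v2)

double kernelCoefficient(const Func2D& psiI, const Func2D& psiJ, const Kernel4D& k) {
    // 2-point Gauss-Legendre nodes and weights mapped to [0, 1].
    const std::array<double, 2> x = {0.5 - 0.5 / std::sqrt(3.0), 0.5 + 0.5 / std::sqrt(3.0)};
    const std::array<double, 2> w = {0.5, 0.5};

    double sum = 0.0;
    for (int a = 0; a < 2; ++a)          // receiver parameter u1
      for (int b = 0; b < 2; ++b)        // receiver parameter u2
        for (int c = 0; c < 2; ++c)      // emitter parameter  v1
          for (int d = 0; d < 2; ++d)    // emitter parameter  v2
            sum += w[a] * w[b] * w[c] * w[d]
                 * psiI(x[a], x[b]) * k(x[a], x[b], x[c], x[d]) * psiJ(x[c], x[d]);
    return sum;
}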

Our sequential experiments have shown that the wavelet coding of the radiosity function is very important in order to obtain correct results, but that the link storage can quickly become a great bottleneck because of its huge memory requirement. The effect is even more annoying with the progressive shooting solver: in the first steps of this algorithm, a lot of energy is exchanged between surfaces, which leads to many stored links. Most of the time, these links are over-refined with respect to the energy exchanged in the next steps.

This over-cost disappears if the links are not stored, of course at the cost of longer computation times. However, this is currently the only way to deal with large scenes. In this case, the progressive shooting algorithm converges faster than the gathering algorithm, at least during the first iterations, allowing to get results earlier. In conclusion, progressive shooting wavelet radiosity without link storage appears to be a very efficient sequential algorithm for complex scenes. It will be described in detail in Section 4.

Nevertheless, despite these choices of algorithms and accelerating techniques, computation times and memory requirements still remain too high for effective use on a single workstation, especially when the simulated scenes are very large (the common case for architectural simulations). The alternate way to bypass the limitations of single workstations is to turn towards parallelism. This will be the subject of the remainder of the paper.

3 Parallel Issues

3.1 General Considerations

Implementing some parallel algorithm can be done on two main kinds of architectures: clusters of workstations or shared memory supercomputers. Using the first approach, the parallel program has to be based on the message passing paradigm. This can be very attractive, because it can be applied both on a supercomputer or on a network of workstations (a very common resource in today's research centers). For the other approach, communication is implicitly managed through shared variables, making algorithm implementation much easier. Recent shared memory supercomputers (like the SGI Origin2000) are built on the scalable Distributed Shared Memory (DSM) architecture: the shared memory is distributed among processors and can be addressed transparently; its efficiency heavily depends on the caching of data accessed in a remote memory. Singh et al. show in [10] that a shared address space is best suited for dynamic and irregular algorithms like the hierarchical radiosity.

As pointed out in [11], effective speed-up over the best sequential algorithm can be constrained by six kinds of overhead:

– inherently sequential code: this includes initialisation and termination phases;
– redundant work: this is a part of the work that has to be done by each process, while it would only be done once in the sequential algorithm;
– overhead of parallelism management: the distribution of work to processes can introduce extra costs;
– overhead of synchronization: this includes synchronization barriers and mutual exclusion locks;
– imbalanced distribution of work among processes: the load balancing problem;
– inter-process communication and overhead of communication: this concerns both data locality and false sharing.

As far as scientific applications are concerned, the two key, but conflicting, aspects to address are load balancing [11] and data locality [8].

3.2 Parallelism and Radiosity

When designing a parallel program, an important aspect to determine is the level of parallelism (i.e., which parts of the sequential program might be parallelized). Let us now focus on specific issues involved in the parallelization of radiosity algorithms by re-examining the terms of equation (1):

– the first two terms $B_e(x, \lambda)$ and $\rho(x, \lambda)$ represent the inputs of the algorithm: they are composed of a spatial and a spectral distribution. They do not lead to any particular problem for parallelism;

– the third term $B(x, \lambda)$ is the radiosity function, that is the solution we aim to compute. We have chosen to code this function using wavelet bases: this gives very efficient sequential algorithms but has two drawbacks in parallelism. Contrarily to classical radiosity implementations, we are not able to allocate all the required memory at the beginning of the algorithm, but we need to make dynamic allocations. Moreover, it is very difficult to forecast how long an interaction between two surfaces will take to compute;

– the fourth term $G(x', x)$, which is mainly composed of the visibility function $V(x', x)$, is the first global term of the equation. In sequential implementations, many accelerating techniques may be used to compute the visibility, which represents a huge part of the overall computation [3]. Some of them rely on hardware use and so cannot be applied in parallel because of the “limitation” of our Origin2000, which has no graphics card. Other ones rely on the building and the use of the well-known BSP structure, which has been chosen for its efficiency. It is important to use the latter techniques in parallel implementations if we want to obtain efficient algorithms, though the storage of the BSP could become problematic as scene size grows. We expected that even if this structure is too large to fit in the processor memory cache, the natural data locality of the visibility requests would not generate too many memory problems. Indeed, when an energy transfer is done between two surfaces, we only have to test the visibility between points of these two surfaces and we can hope that the same parts of the BSP will be traversed. We could obtain a similar property if we can have each process dealing with neighbouring surfaces across its successive computations;

– the last part is the integral equation in itself: it codes the interactions between the different surfaces of the scene. Many sequential and parallel algorithms have been devoted to solving this complex integral equation. As we mentioned in Section 2, we have chosen to focus on a very efficient sequential algorithm allowing to deal with large real world scenes: the progressive shooting wavelet radiosity [3]. More precisely, we consider the version which does not store links between surfaces, both because of memory problems and because it would lead to really tricky problems in parallel algorithms.

The parallelization work is done on the solving part of the algorithm and will be the subject of the next paragraph.

3.3 Parallelization of the Progressive Shooting Wavelet Radiosity

The progressive shooting wavelet radiosity algorithm sequentially handles the most energetic emitter (either a direct light source or a reflecting surface) and propagates its energy to all scene receivers (surfaces), decomposing them both into smaller surface elements when needed. Unfortunately, the amount of work that will be required for a given emitter-receiver interaction is highly unpredictable and even depends on the resolution stage. These dynamic and unpredictable characteristics of the algorithm make it challenging to parallelize in a well load balanced way.

Several granularities can be considered in order to get an efficient task decomposition of the problem:

– the finest granularity can be found inside an emitter-receiver interaction. In [11], a task can either be a surface-element or even an element-element interaction. Implementation with distributed task queues and clever task stealing allows a good control over load balancing while preserving enough data locality when dealing with a small number of surfaces;

– at the opposite, a very coarse granularity would be to define a task as the set of interactions between one (or several) emitters and all its associated receivers. This allows several emitters (one per process) to propagate their energy to all the scene receivers in parallel. A given receiving surface can thus be shot to by two emitters at the same time, and mutual exclusion is needed for accessing it;

– an intermediate (rather small) granularity is to consider a task as a standard surface-surface interaction. The way these tasks are distributed to processes leads to different constraints and algorithms. If emitters are processed one after the other, as in the sequential algorithm, energy propagation from an emitter to all its receiving surfaces is parallelized among the processes without restrictive access, but is bounded by a synchronization barrier before dealing with the next emitter. If we see the problem as a global pool of tasks that processes have to execute, mutual exclusion is needed for accessing surfaces and a load balanced distribution algorithm has to be found.

We have chosen to test the intermediate surface-surface granularity, which seemed to be best suited to our problem. We hoped it would be fine enough in the case of large scenes. A minimal sketch of this task decomposition follows.
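To make the chosen granularity concrete, one unit of work can be represented simply as an emitter-receiver pair. The following C++ sketch, with purely illustrative names that do not come from Candela, enumerates the tasks generated by one shooting step:

// Hedged sketch: the surface-surface granularity yields one task per receiver
// for the emitter currently propagating its energy.
#include <vector>

struct Surface;                          // opaque here; carries geometry and wavelet coefficients

struct Task {
    Surface* emitter;                    // the surface currently shooting its energy
    Surface* receiver;                   // one of the scene surfaces receiving it
};

std::vector<Task> tasksForShootingStep(Surface* emitter, const std::vector<Surface*>& scene) {
    std::vector<Task> tasks;
    tasks.reserve(scene.size());
    for (Surface* receiver : scene)
        if (receiver != emitter)         // a surface does not shoot onto itself
            tasks.push_back({emitter, receiver});
    return tasks;
}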

4 Experimentation

We focus in this section on the implemented parallelizations of the progressive shooting wavelet radiosity (with the different considerations of the previous sections) and on the different tests which we have performed.


We have parallelized the solver phase of this algorithm. Due to the differences between the emission of a light source on a surface and the redistribution of the energy between two surfaces, this phase is partitioned for efficiency into two phases. We present below the pseudo-code of these two sequential algorithms, for clarity in the following subsections.

Direct Illumination Algorithm:
begin
  /* Get all the light sources. */
  /* Get all the scene surfaces. */
  foreach light do
    foreach surface do
      call illuminate(light, surface);
    od
  od
  foreach surface do
    call pushPull(surface);
  od
where
  proc illuminate(light, surface) ≡
    call illuminateRecursive(light, surface).
end

Inter-reflections Algorithm:
begin
  /* Get all the scene surfaces. */
  sortedList := ∅; /* A sorted list of energetic surfaces. */
  foreach surface do
    if hasEnergyToShoot(surface)
    then call insertSorted(surface, sortedList); fi
  od
  while not converged() do
    shooter := removeFirst(sortedList);
    foreach surface do
      call shoot(shooter, surface);
    od
  od
  foreach surface do
    call pushPull(surface);
  od
where
  proc shoot(shooter, surface) ≡
    call pushPull(shooter);
    call remove(surface, sortedList);
    call shootRecursive(shooter, surface);
    call setEnergyToShoot(shooter, 0);
    call insertSorted(surface, sortedList).
end

4.1 Parallelization of Direct Illumination

The direct illumination algorithm is the easiest one to parallelize. For instance, the number of energy propagations for high-level emitter-all receivers interactions (i.e., the number of light sources in the scene) is known in advance.

We chose to implement it with the emitter-surface granularity for its better compromise between ease of programming and expected efficiency. The coarser granularity is not interesting in this case, because the relatively small number of light sources compared to the number of processes would lead to severe load imbalance.


On the other hand, exploiting the finest granularity does not seem necessary in the case of large scenes. We anticipated that the intermediate granularity would be enough to get good load balancing and implemented two algorithms based on it.

Inner Loop. The first algorithm is the parallelization of the inner of the two loops of the sequential algorithm: emitters are shot one after the other, and each shoot is done by all the processes in parallel. A given receiver cannot be shot to by two emitters, so it does not require locking access. However, synchronization barriers are needed at the end of each shoot before processing the next one.

Single Queue. To remove the potentially costly synchronization barriers, we can have the processes start shooting the next emitter instead of waiting for the others to complete their job. This can be done in a dynamic tasking approach where the algorithm consists of a pool of emitter-receiver interaction tasks. In this case, restrictive access to receivers is needed to prevent two emitters from shooting to the same receiver. The way tasks are distributed to processes can lead to different algorithms. We have chosen a simple approach with a single centralized task queue in which interactions are sorted by emitters and then by receivers. This gives us an extension of the inner loop algorithm. A sketch of this scheme is given below.
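A minimal sketch of the “single queue” scheme, assuming standard C++ threads and mutexes (the actual implementation in Candela differs, and all names are invented): a single centralized queue holds the (emitter, receiver) tasks in emitter-then-receiver order, and a per-receiver lock enforces the restrictive access described above.

// Hedged sketch of the "single queue" dynamic tasking scheme (not the Candela code).
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

struct Surface { std::mutex lock; /* geometry, wavelet coefficients, ... */ };
struct Task    { Surface* emitter; Surface* receiver; };

std::deque<Task> taskQueue;        // tasks sorted by emitter, then by receiver
std::mutex       queueLock;

// Stand-in for the sequential emitter-receiver interaction code.
void shootRecursive(Surface&, Surface&) {}

void worker() {
    for (;;) {
        Task task;
        {   // pop the next task from the single centralized queue
            std::lock_guard<std::mutex> g(queueLock);
            if (taskQueue.empty()) return;
            task = taskQueue.front();
            taskQueue.pop_front();
        }
        // Restrictive access: no two emitters may shoot to the same receiver at once.
        std::lock_guard<std::mutex> r(task.receiver->lock);
        shootRecursive(*task.emitter, *task.receiver);
    }
}

void runWorkers(int nbProcesses) {
    std::vector<std::thread> pool;
    for (int i = 0; i < nbProcesses; ++i) pool.emplace_back(worker);
    for (std::thread& t : pool) t.join();
}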

4.2 Parallelization of the Inter-reflections Algorithm

This algorithm differs from the direct illumination algorithm in that it has to manage a sorted list of the emitters. This makes dynamic tasking algorithms like the “single queue” very hard to design. Another difference is that we have to make a “copy” of the emitting surface before processing it, either to have it treated by several processes or to allow it to receive energy from another emitter while it sends energy itself.

After taking into account these considerations, we implemented the two loop-level parallel inter-reflections algorithms.

Inner Loop. Here also, the inner loop is the most direct way to get a parallel algorithm. Its main advantage is that the sorted list can be dealt with as in the sequential version. The drawbacks are the same as in the direct illumination algorithm.

In this case, copying an emitter can be very time consuming, especially when this surface is already strongly subdivided; moreover, the copy operation has to be done in a critical section, because it uses the Open Inventor graph. So we had to implement a “lazy copy” mechanism that allows to use a “light” copy of an emitter, and to rebuild its children only when needed, as sketched below.
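One possible shape for the “lazy copy” is sketched below: only the top-level emitter data is duplicated inside the critical section protecting the Open Inventor graph, and the subdivided children are rebuilt on first access. The class and method names are invented for the illustration; the real mechanism in Candela is more involved.

// Hedged sketch of a "lazy copy" of an emitter (illustrative names only).
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

struct Element {                                       // one node of the emitter hierarchy
    std::vector<std::unique_ptr<Element>> children;
    // wavelet coefficients, geometry, ...
};

std::mutex inventorLock;                               // the Open Inventor graph is not thread-safe

class LazyEmitterCopy {
public:
    explicit LazyEmitterCopy(const Element& original) : original_(original) {
        std::lock_guard<std::mutex> g(inventorLock);   // only the cheap top-level copy is locked
        top_ = std::make_unique<Element>();            // copy top-level data, not the children
    }

    // Children are rebuilt only when a process actually needs to descend into them.
    Element& child(std::size_t i) {                    // assumes i is a valid child index
        rebuildChildrenIfNeeded();
        return *top_->children[i];
    }

private:
    void rebuildChildrenIfNeeded() {
        std::lock_guard<std::mutex> g(inventorLock);
        if (!top_->children.empty()) return;           // already rebuilt
        for (std::size_t i = 0; i < original_.children.size(); ++i)
            top_->children.push_back(std::make_unique<Element>());  // deep copy elided in this sketch
    }

    const Element& original_;
    std::unique_ptr<Element> top_;
};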

Outer Loop. The outer loop is in fact the implementation of the coarser granularity consideration. It seemed interesting to consider it, because we have here a larger number of tasks (emitter-all receivers interactions) in comparison with the number of processes.

The convergence speed of the algorithm heavily depends on the order by which emitters propagate their energy. This can be problematic if emitters are shot in parallel, because after $k$ parallel shoots, much less energy will have been exchanged than after $k$ shoots done sequentially. If parallel execution of the inner loop proves to be sufficiently load balanced, on the average, the first algorithm is much more efficient (no locking on the receiver is required) because of its faster convergence speed.

4.3 Technical Constraints

The Candela libraries are written in C++ and consist of approximately 360 classes. Even if it can seem challenging to parallelize C++ algorithms inside such a large platform, it is really important not to drift apart from the sequential code, in order to be able to make comparisons and to take advantage of its latest enhancements.


Furthermore, Candela is built over the Silicon Graphics Open Inventor library and intensively uses the graph structure and its associated nodes. Hence, we have absolutely no control over the storage and manipulation of the data structures. It is even worse in parallel, because the Open Inventor library is not provided thread-safe: that is to say, it can be very problematic if two or more processes manipulate the graph at the same time.

All this has led to several implementation problems, for instance:

– we had to overload the memory allocation C functions (“malloc”, “free”, “realloc”, “calloc”) to avoid thread-safe memory manipulation locks: keep in mind that the wavelet radiosity algorithm implies many memory allocations and that Open Inventor provides no control over the storage of its nodes;

– another problem was to determine variables which could potentially be modified by several processes at the same time (typically static class variables) and to build a set of preprocessor macros which allow to transparently transform them into an array of variables (one per process);

this gave us a general purpose C/C++ framework for “assisted” parallelization which could be reused for other projects. A minimal sketch of the second workaround is given below.
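The second workaround can be illustrated with a preprocessor macro of the following flavour. This is a minimal sketch: MAX_PROCS, currentProcessId() and the macro name are assumptions made for the illustration, and the real Candela macros are more elaborate.

// Hedged sketch: turning a file- or class-level static variable into one slot per process.
#include <cstddef>

constexpr std::size_t MAX_PROCS = 64;

// Stand-in: the real code would map each worker thread or process to an index.
std::size_t currentProcessId() { return 0; }

// Declares an array of per-process slots and an accessor that hides the indexing,
// so existing code can keep using "name()" where it previously used "name".
#define PARALLEL_STATIC(type, name)                                            \
    static type name##_perProc[MAX_PROCS];                                     \
    static type& name() { return name##_perProc[currentProcessId()]; }

// Usage: a variable that used to be "static int sharedCounter;" becomes:
PARALLEL_STATIC(int, sharedCounter)
// and is read or written through "sharedCounter()" instead of "sharedCounter".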

4.4 Protocol Considerations

Test Scenes. We performed our experiments on three test scenes coming from real world applications, but with different characteristics, in order to analyze our implemented algorithms:

– Stanislas Square Opera in Nancy. This test scene comes from an evaluation project of potential new lighting designs. The geometrical model was created from architectural drawings. The direct illumination is computed using accurate light and reflectance models. In this paper, we only consider the first floor in order to be able to compute a solution on a single processor in a reasonable time (less than one day).

– Cloister in Quito. It is also a lighting design project, but it was chosen because the effects of indirect illumination (inter-reflections) are more visible. It serves as a life-size test.

– Soda Hall. The Soda Hall building has become a reference test scene. It is suitable for virtual reality environments and interactive walk-throughs. We consider both a single room of this building (with high precision parameters) for speed-up measures, and one complete floor with furniture as another life-size test.

Table 1 gives their numerical characteristics, and associated images can be found in Figure 2 of the color plate.

Test scene                    Stanislas Square   Quito Cloister   Soda Hall Room   Soda Hall Floor
Number of initial surfaces           …                  …                …                 …
Number of light sources              …                  …                …                 …
Number of final meshes               …                  …                …                 …
Time [processors]                    … [1]              … [4]            … [1]             –
Time [processors]                    … [36]             … [24]           … [36]            … [16]

Table 1. The test scenes


Measures. The experiments were run on a ccNUMA Silicon Graphics Origin2000 [6] with 64 processors organized in 32 nodes. Each node consists of two R10000 processors with 32 KBytes of first level cache (L1) of data on the chip, 4 MBytes of external second level cache (L2) of data and instructions, and 256 MBytes of local memory, for a total of 8 GBytes of physical memory. Its hardware performance counters, combined with the software tool perfex, allow performance measures of the behavior of the parallel program. We have chosen to study:

– Speed-up. This is defined as the ratio of the best sequential time over the parallel time obtained with $p$ processors.

– Memory overhead. This is the fraction of time spent in memory accesses over the total execution time.

– L1 cache hit rate. This is the fraction of data accesses which are satisfied from a cache line already resident in the primary data cache.

– L2 cache hit rate. This is the fraction of data accesses which are satisfied from a cache line already resident in the secondary data cache.

5 Results and Discussion

5.1 Performance Evaluation

Table 1 gives the surface complexity of each scene and the computational time needed to simulate each scene, first with a small number of processors and then with a large number of processors. Each result shows good speed-up. More precise results appear in Figure 1 for the Stanislas Square Opera and the room of the Soda Hall, and are examined in the next subsection.

Let us recall, for the correctness of these figures, that we work with polygons which can be concave or convex for greater efficiency. The number of surfaces which would appear if we decided to triangulate them, to work only with triangles and/or parallelograms, would be much greater. For instance, for the Stanislas Square, a straightforward triangulation gives twice as many surfaces, and a triangulation which provides well shaped triangles gives ten times the original number of surfaces.

Moreover, we want to recall that, due to the absence of a graphics card on the Origin2000, we have parallelized a version of the algorithm which only uses the BSP decomposition in order to accelerate the visibility and does not use any hardware graphics acceleration. When using the sequential version [3], this choice decreases the overall performance of the general simulation by a factor of 2. This degradation is important enough to put the parallel speed-up in perspective.

However, even with such accelerating techniques, huge scenes like the cloister of Quito (see Figure 2.d) can hardly be simulated on a sequential computer. In parallel, our test shows that this scene can be simulated with 24 processors in 2 hours 41 minutes. This sort of result is very promising for lighting design and other real-world applications.

5.2 Discussion

Load balancing and data locality are two key but conflicting goals to reach in order to obtain good parallel performance. We show here their impact on our different algorithms.

Load balancing. Figure 1.a shows that all the methods achieve a speed-up in CPU times when used in parallel. The speed-up is very high (actually supra-linear) when we use no visibility for the direct illumination. In the other cases, it is approximately equal to a constant fraction of $p$, where $p$ is the number of processes.

In the case of a direct illumination without visibility, a supra-linear speed-up is not so surprising. Indeed, in this case, when we want to illuminate a surface we need only to use the data of that surface; this leads to excellent data locality and very similar computation times for the illumination of each surface. In this case, the Origin2000 is very efficient.


If we look at the results for the two different methods used for the Place Stanislas illumination, problems begin to appear. Of course, the algorithm that does not stop after each light gives better speed-up than the other one, as expected, but the speed-up results are not very good. In fact, what seems to happen is the following:

– in this simulation, each light illuminates only a part of the building and only a few interactions are very time consuming. Therefore, when some process finishes the illumination of the last surface seen by the first light, some processes are still stuck processing one of the few interactions which are very time consuming. The synchronization of the processors, after a light source has handled all the surfaces it sees, is therefore very costly in the “inner loop” method;

– but the situation is even worse, because in this simulation two neighbouring lights often share a group of very time consuming interactions. Indeed, the lighting designers have positioned the lights very close to the facade, so that most of the energy received by a single surface comes with high probability from those two light sources. This has a large impact on the “single queue” method. When some processes begin to compute the interaction between the second light and the building, three or four time consuming interactions between the first light and the sets of surfaces are still being processed. The first processors which attempt to shoot on one of these surfaces (the surfaces which have not yet been totally illuminated by the previous illumination) will be locked. This is problematic because these interactions are with high probability also very time consuming interactions for the second light.

We think that this situation appears because we have used a coarse granularity for this simulation (a trick to artificially reduce the granularity problem would be to add a preprocessing step that would subdivide large surfaces into smaller ones). This is slightly annoying because the same problem appears in the general shoot phase when the residual energy to shoot becomes too small.

Decreasing the granularity of the algorithm to tackle this problem is feasible for the direct illumination but becomes much trickier for interactions between two surfaces. Another solution seems more reasonable: when a processor attempts to shoot on a surface which is currently the receiver handled by another processor, a locking situation appears. We can remove this lock if the processor starts to shoot on a lure, a disjoint copy of this surface; of course, we will afterwards need to retrieve the energy received by the lure and to re-project it on the real surface. This idea is sketched below.
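The “lure” idea, which the previous paragraph only proposes as an improvement, could take the following shape (all names invented; this is not an existing implementation): when the receiver is already locked by another processor, the energy is shot onto a disjoint stand-in surface, which is merged back into the real receiver once it becomes available.

// Hedged sketch of the proposed "lure" mechanism (illustrative names only).
#include <memory>
#include <mutex>

struct Surface {
    std::mutex lock;
    // wavelet coefficients, geometry, ...
};

// Stand-ins for the real operations:
std::unique_ptr<Surface> makeLure(const Surface&) { return std::make_unique<Surface>(); }
void shootRecursive(Surface&, Surface&) { /* sequential emitter-receiver interaction */ }
void reproject(Surface&, Surface&)      { /* add the lure's received energy to the target */ }

void shootWithLure(Surface& emitter, Surface& receiver) {
    if (receiver.lock.try_lock()) {                    // receiver free: shoot directly
        shootRecursive(emitter, receiver);
        receiver.lock.unlock();
    } else {                                           // receiver busy: shoot onto a disjoint lure
        std::unique_ptr<Surface> lure = makeLure(receiver);
        shootRecursive(emitter, *lure);
        std::lock_guard<std::mutex> g(receiver.lock);  // merge once the receiver becomes available
        reproject(*lure, receiver);
    }
}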

For the simulation of the Soda Hall room, Figure 1.b exhibits the same behavior we just noted for the Place Stanislas illumination in the shooting phase. The same problem appears during the first shoots: the shoot of an emitter on the walls or on the floor takes much more time than on the other, small surfaces. Also, after some shoots, the amount of energy left to emit becomes very small. In this case, the time needed for distributing the work among processes is larger than the time needed to do the shoot. The granularity of the parallelization must then be increased.

Data locality. When we designed our parallel algorithms, we did not focus on data locality problems. First, we had expected that:

– the used granularity would naturally increase the data locality. Indeed, when one processor deals with an emitter-receiver interaction, it computes this interaction at each needed level. Moreover, in order to compute the form factor of the kernel, it needs to compute the visibility between lists of points of the two surfaces. We expected this would lead to the same portions of the BSP structure being traversed;


– the type of architecture used by the Origin2000, a ccNUMA, is well known to decrease the importance of this problem compared to explicit message passing.

Our tests on a room of the Soda Hall and on the Stanislas Square show that:

– the ratio of time spent accessing the memory decreases slightly with the number of processors used (see Figures 1.c and 1.d);

– the hit rates are high, above 90%, for both the L1 cache and the L2 cache (see Figures 1.e and 1.f).

The first behavior of the algorithm can be explained by the fact that the simulations of these two scenes use a lot of memory. It is not so surprising, in this case, because it can be better to distribute the data memory over all processors, each of which will only access a part of the total memory.

The cache hit rates have surprisingly high values, which seems in total contradiction with the results presented by [8]. This seems to prove that our hypothesis is valid and that data locality is not the main problem with wavelet radiosity.

However, we want to moderate this conclusion since two other factors may have influenced this behavior:

– first, in order to do meaningful tests, we have limited ourselves to simulations which could be completed on a single processor in less than one day. Thus, though some of our scenes are somewhat big, their complexity is not tremendous;

– second, we use scenes coming from real-world applications. This means that the surfaces appearing in the scene description are not listed in a random order. On the contrary, neighbouring surfaces (as geometrical entities) often appear close to one another in the scene description. This locality of the scene description clearly impacts the algorithm data locality.

The two preceding paragraphs show that the progressive wavelet radiosity algorithm can “efficiently” be parallelized, even if reaching a good load balancing between processes remains difficult. Some problems are annoying for small scenes; fortunately, when the scene and the amount of work to be done become larger, the importance of these problems seems to decrease. This remark is important because these scenes cannot reasonably be simulated on a single sequential computer: in those cases, parallel computing appears to be a worthwhile solution.

6 Conclusion

We have experimented in this paper with various parallel implementations of the wavelet radiosity method on a hardware ccNUMA architecture, the SGI Origin2000. Experiments showed that straightforward parallelization provides good speed-up - even if this application is highly irregular and dynamic - and is a reasonable means to solve extremely large models with a large number of processors.

Even if the application is written in C++ and if the granularity of the parallelization remains a little coarse in this implementation, we obtain an excellent data locality - which confirms the good properties of the ccNUMA architecture - and a reasonable load balancing. However, on some scenes, load balancing can likely be enhanced using some specific improvements (shoot on a lure if needed, etc.).

These results are interesting, first for the lighting designers who can expect to simulate some real world projects. Moreover, we anticipate that these results are probably generalizable to numerical applications with linear projection operators, in the case where the space $\mathcal{H}$ is some space $\mathcal{H}_n$ in a multi-resolution analysis.


Acknowledgments

The authors would like to thank François Cuny, Slimane Merzouk and Christophe Winkler from the ISA team of the LORIA (http://www.loria.fr) for their work on Candela, Didier Bur and Bertrand Courtois from the school of Architecture of Nancy (http://www.crai.archi.fr) for their work on the models, Marc Albouy from Electricité de France, who provided the Stanislas Square and the Quito models, and Carlo Sequin, who provided the Soda Hall. We also thank Silicon Graphics for taking the Centre Charles Hermite (http://cch.loria.fr) as a beta-test site of the Origin2000, Luc Renambot for his help on performance measures, and last but not least, Alain Filbois for his great technical support on the computer.

References

1. James Arvo. The Role of Functional Analysis in Global Illumination. In P. M. Hanrahan and W. Purgathofer, editors, Rendering Techniques '95 (Proceedings of the Sixth Eurographics Workshop on Rendering), pages 115-126, New York, NY, 1995. Springer-Verlag.

2. James Arvo, Kenneth Torrance, and Brian Smits. A Framework for the Analysis of Error in Global Illumination Algorithms. In Computer Graphics Proceedings, Annual Conference Series, 1994 (ACM SIGGRAPH '94 Proceedings), pages 75-84, 1994.

3. François Cuny, Christophe Winkler, and Laurent Alonso. Wavelet Algorithms for Complex Models. Technical Report, LORIA, INRIA Lorraine, 1998.

4. Thomas A. Funkhouser. Coarse-Grained Parallelism for Hierarchical Radiosity Using Group Iterative Methods. In Computer Graphics Proceedings, Annual Conference Series, 1996 (ACM SIGGRAPH '96 Proceedings), pages 343-352, 1996.

5. Steven J. Gortler, Peter Schröder, Michael F. Cohen, and Pat Hanrahan. Wavelet Radiosity. In Computer Graphics Proceedings, Annual Conference Series, 1993 (ACM SIGGRAPH '93 Proceedings), pages 221-230, 1993.

6. James Laudon and Daniel Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 241-251, Denver, June 1997. ACM Press.

7. Slimane Merzouk. Architecture logicielle et algorithmes pour la résolution de l'équation de radiance. PhD thesis, Institut National Polytechnique de Lorraine, 1997.

8. Luc Renambot, Bruno Arnaldi, Thierry Priol, and Xavier Pueyo. Towards Efficient Parallel Radiosity for DSM-Based Parallel Computers Using Virtual Interfaces. In Proceedings of the Third Parallel Rendering Symposium (PRS '97), Phoenix, AZ, October 1997. IEEE Computer Society.

9. Peter Schröder, Steven J. Gortler, Michael F. Cohen, and Pat Hanrahan. Wavelet Projections for Radiosity. Computer Graphics Forum, 13(2):141-151, June 1994.

10. Jaswinder P. Singh, Anoop Gupta, and Marc Levoy. Parallel Visualization Algorithms: Performance and Architectural Implications. IEEE Computer, 27(7):45-55, July 1994.

11. Jaswinder Pal Singh, Chris Holt, Takashi Totsuka, Anoop Gupta, and John Hennessy. Load Balancing and Data Locality in Adaptive Hierarchical N-body Methods: Barnes-Hut, Fast Multipole, and Radiosity. Journal of Parallel and Distributed Computing, 27(2):118, June 1995.

12. David Zareski. Parallel Decomposition of View-Independent Global Illumination Algorithms. M.Sc. thesis, Ithaca, NY, 1995.

Fig. 1. Experimentation results. (a) Stanislas Square illumination: speed-up (Ideal, Inner Loop, Single Queued, No Visibility). (b) Soda Hall room: speed-up (Ideal, 30 Shoots). (c) Stanislas Square illumination: time accessing memory / total time (Inner Loop, Single Queued, No Visibility). (d) Soda Hall room: time accessing memory / total time. (e) Stanislas Square illumination: L1 and L2 cache hit rates (Inner Loop, Single Queued, No Visibility). (f) Soda Hall room: L1 and L2 cache hit rates. All quantities are plotted against the number of processors.

Fig. 2. Images generated with our parallel algorithms: (a) Stanislas Square Opera, (b) Soda Hall room, (c) Soda Hall floor, (d) Cloister in Quito.