
Distributed Processing of Future
Radio Astronomical Observations

Ger van Diepen
ASTRON, Dwingeloo
ATNF, Sydney

ADASS2007; GvD 24-9-2007

Contents

Introduction
Data Distribution
Architecture
Performance issues
Current status and future work


Data Volume in future telescopes

LOFAR:
37 stations (666 baselines), growing to 63 stations (1953 baselines)
128 subbands of 256 channels (32768 channels)
666 * 32768 * 4 * 8 bytes/sec = 700 MByte/sec
5 hour observation gives 12 TBytes

ASKAP (spectral line observation):
45 stations (990 baselines)
32 beams, 16384 channels each
990 * 32 * 16384 * 4 * 8 bytes / 10 sec = 1.6 GByte/sec
12 hour observation gives 72 TBytes
One day of observing > the entire world radio archive

ASKAP continuum: 280 GBytes (64 channels)

MeerKAT similar to ASKAP


Key Issues

                  Traditionally               Future
Size              Few GBytes                  Several TBytes
Processing time   weeks-months                < 1 day
Mode              Interactively               Automatic pipeline
Archived?         Always                      Some data
Where             Desktop                     Dedicated machine
IO                Many passes through data    Only few passes possible
Package used      AIPS, Miriad, Casa, ...     ?


Data Distribution

Visibility data need to be stored in a distributed way
Limited use for parallel IO
Too much data to share across the network

Bring processes to the data,
NOT data to the processes


Data Distribution

Distribution must be efficient for all purposes (flagging, calibration, imaging, deconvolution)
Process locally where possible and exchange as little data as possible
Loss of a data partition should not be too painful

Spectral partitioning seems the best candidate
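As an illustration of spectral partitioning, the channel axis could be split into contiguous ranges, one per worker, so each worker flags, calibrates and images its own channels with minimal network traffic. A sketch; `spectral_partitions` is a made-up helper:

```python
# Hypothetical sketch of spectral partitioning: divide the observed
# channels into contiguous sub-bands, one per worker node.

def spectral_partitions(n_channels, n_workers):
    """Return (start, end) channel ranges, one per worker, as even as possible."""
    base, extra = divmod(n_channels, n_workers)
    parts, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        parts.append((start, start + size))
        start += size
    return parts

# 32768 LOFAR channels over 16 workers -> 16 ranges of 2048 channels each
print(spectral_partitions(32768, 16)[0])   # (0, 2048)
```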


Architecture

Connection types:
Socket
MPI
Memory
DB


Data Processing

A series of steps has to be performed on the data
(solve, subtract, correct, image, ...)

The master gets steps from a control process (e.g. Python)
If possible, a step is sent directly to the appropriate workers
Some steps (e.g. solve) need iteration:
Substeps are sent to workers
Replies are received and forwarded to other workers
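The dispatch described above might be sketched as follows. `Master` and `Worker` here are hypothetical stand-ins; a real deployment would use the socket/MPI/memory/DB connections from the architecture slide as transport:

```python
# Hypothetical sketch of the master-worker control flow: the master
# receives steps from a control process and forwards each one to every
# worker holding a partition of the data.

class Worker:
    """Holds one data partition; records the steps sent by the master."""
    def __init__(self):
        self.log = []
    def send(self, step):
        self.log.append(step)

class Master:
    def __init__(self, workers):
        self.workers = workers
    def run(self, steps):
        for step in steps:
            for w in self.workers:
                w.send(step)

workers = [Worker() for _ in range(4)]
Master(workers).run(["subtract", "correct", "image"])
print(workers[0].log)   # ['subtract', 'correct', 'image']
```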


Calibration Processing

Solving non-linearly:

do {
    1: get normal equations
    2: send equations to solver
    3: get solution
    4: send solution
} while (!converged)
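A runnable sketch of this loop, with the least-squares content replaced by a toy scalar problem (fitting the mean of the data); only the control flow follows the slide, and all class names are illustrative:

```python
# Toy sketch of the solve loop: workers build normal equations from their
# own partition, the solver merges them and returns a solution update,
# which is sent back to the workers until convergence.

class Worker:
    def __init__(self, data):
        self.data = data
    def normal_equations(self, solution):
        # (weight, residual) for this partition -- small and cheap to ship
        return (len(self.data), sum(d - solution for d in self.data))

class Solver:
    def solve(self, equations):
        weight = sum(w for w, _ in equations)
        update = sum(r for _, r in equations) / weight
        return update, abs(update) < 1e-9

def run_solve(workers, solver, solution=0.0, max_iter=20):
    for _ in range(max_iter):
        equations = [w.normal_equations(solution) for w in workers]  # step 1
        update, converged = solver.solve(equations)                  # steps 2+3
        solution += update                                           # step 4
        if converged:
            break
    return solution

workers = [Worker([1.0, 2.0]), Worker([3.0, 4.0])]
print(run_solve(workers, Solver()))   # 2.5 (the global mean)
```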


Performance: IO

Even with distributed IO it takes 24 minutes to read 72 TBytes once
IO should be asynchronous to avoid an idle CPU

Which storage to use is a deployment decision:
Local disks (RAID)
SAN or NAS
Sufficient IO bandwidth to all machines is needed

Calibration and imaging are used repeatedly, so the data will be accessed multiple times
BUT operate on chunks of data (work domains) to keep data in memory while performing many steps on them
Possibly store in multiple resolutions
Tiling for efficient IO if there are different access patterns
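Asynchronous IO with double buffering might look like the sketch below: while the CPU processes the current work domain, a background thread prefetches the next chunk, so the CPU never waits for the disk unless IO is genuinely slower. All names are illustrative:

```python
# Sketch of asynchronous, double-buffered chunk processing.
from concurrent.futures import ThreadPoolExecutor

def process_chunks(read_chunk, process, n_chunks):
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(read_chunk, 0)               # prefetch first chunk
        for i in range(n_chunks):
            chunk = pending.result()                     # waits only if IO is slower
            if i + 1 < n_chunks:
                pending = io.submit(read_chunk, i + 1)   # prefetch next chunk
            results.append(process(chunk))               # CPU works meanwhile
    return results

# toy usage: "reading" chunk i just returns i, processing doubles it
print(process_chunks(lambda i: i, lambda c: 2 * c, 4))   # [0, 2, 4, 6]
```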


Performance: Network

Process locally where possible

Send as little data as possible
(normal equations are small matrices)

Overlay operations,
e.g. form the normal equations for the next work domain while the solver solves the current one


Performance: CPU

Parallelisation (OpenMP, ...)
Vectorisation (SSE instructions)
Keep data in the CPU cache as much as possible, so smallish data arrays
Optimal layout of data structures
Keep intermediate results if they do not change
Reduce the number of operations by reducing the resolution


Current status

The basic framework has been implemented and is used in LOFAR and CONRAD calibration and imaging
Can be deployed on a cluster or supercomputer (or desktop)
Tested on a SUN cluster, Cray XT3, IBM PC cluster, MacBook
A Resource DB describes the cluster layout and the data partitioning.
Hence the master can derive which processor should process which part of the data.


Parallel processed image (Tim Cornwell)

Runs on ATNF's Sun cluster "minicp", 8 nodes
Each node = 2 * dual-core Opterons, 1 TB, 12 GB
Also on the CRAY XT3 at WASP (Perth, WA)
Data simulated using AIPS++
Imaged using the CONRAD synthesis software
New software using casacore
Running under OpenMPI

Long integration continuum image:
8 hours integration
128 channels over 300 MHz
Single beam

Used 1, 2, 4, 8, 16 processing nodes for the calculation of residual images
Scales well
Must scale up a hundredfold, or more...


Future work

More work needed on robustness:
Discard a partition when a processor or disk fails
Move to another processor if possible (e.g. if replicated)
Store data in multiple resolutions?

Use master-worker in flagging, deconvolution
A worker can use accelerators like GPGPU, FPGA, Cell (maybe through RapidMind)
A worker can be a master itself to make use of a BG/L in a PC cluster


Future work

Extend to image processing (few TBytes):
Source finding
Analysis
Display
VO access?


Thank you

Joint work with people at ASTRON, ATNF, and KAT
More detail in the next talk about LOFAR calibration
See the poster about CONRAD software

Ger van Diepen
[email protected]