Post on 15-Jan-2016
Flexible and Efficient Control of Data Transfers for
Loosely Coupled Components
Joe Shang-Chieh Wu
http://meou.us
Department of Computer Science, University of Maryland, USA
What & How
• Obtain more accurate results by coupling existing (parallel) physical simulation components
• Different time and space scales for data produced in shared or overlapped regions
• Runtime decisions for which time-stamped data objects should be exchanged
• Performance might be a concern
Roadmap
• Approximate Match [Grid 2004]
• Collective Buffering [IPDPS 2007]
• Distributed App Match + Eager Transfer [under submission]
• Conclusion
Matching is OUTSIDE components
• Separate matching (coupling) information from the participating components
  – Maintainability: components can be developed/upgraded individually
  – Flexibility: change participants/components easily
  – Functionality: support variable-sized time interval numerical algorithms or visualizations
Basic Operation
[Figure: An importer component requests the array for T = 2.5; approximate match returns the exporter component's array for T = 3, chosen among exported distributed arrays time-stamped T = 1, 2, 3, 4. Labeled layers: Runtime-based Approximate Match Library, Distributed Array Transfer Library. Arrays are distributed among multiple processes.]
Separate code from matching
Exporter App0
    define region R1
    define region R4
    define region R5
    ...
    Do t = 1, N, Step0
        ... // computation jobs
        export(R1,t)
        export(R4,t)
        export(R5,t)
    EndDo

Importer App1
    define region R2
    ...
    Do t = 1, M, Step1
        import(R2,t)
        ... // computation jobs
    EndDo

Configuration file
    #
    App0 cluster0 /bin/App0 2 ...
    App1 cluster1 /bin/App1 4 ...
    App2 cluster2 /bin/App2 16 ...
    App4 cluster4 /bin/App4 4
    #
    App0.R1 App4.R0 REGL 0.05
    App0.R1 App2.R0 REG 0.1
    App0.R4 App1.R2 REGU 0.5
    #
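To make the role of the configuration file concrete, here is a minimal Python sketch that reads a file laid out like the one on this slide. The field meanings (cluster, executable path, process count for the first section; exporter region, importer region, policy, precision for the second) are assumptions inferred from the slide, and the names `Program`, `Connection`, and `parse_config` are hypothetical, not part of the library. The elided trailing arguments ("...") are left out of the sample rather than guessed.

```python
from dataclasses import dataclass, field


@dataclass
class Program:
    name: str
    cluster: str        # assumed meaning: where the program runs
    executable: str     # assumed meaning: path of the binary
    nprocs: int         # assumed meaning: number of processes
    extra: list = field(default_factory=list)  # any further tokens, kept verbatim


@dataclass
class Connection:
    exporter: str       # e.g. "App0.R1"
    importer: str       # e.g. "App4.R0"
    policy: str         # e.g. "REGL"
    precision: float


def parse_config(text):
    """Split the text into '#'-delimited sections and parse the first two."""
    sections, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "#":
            if current is not None:
                sections.append(current)
            current = []
        elif current is not None:
            current.append(line.split())
    programs = [Program(f[0], f[1], f[2], int(f[3]), f[4:]) for f in sections[0]]
    connections = [Connection(f[0], f[1], f[2], float(f[3])) for f in sections[1]]
    return programs, connections


SAMPLE = """
#
App0 cluster0 /bin/App0 2
App1 cluster1 /bin/App1 4
App2 cluster2 /bin/App2 16
App4 cluster4 /bin/App4 4
#
App0.R1 App4.R0 REGL 0.05
App0.R1 App2.R0 REG 0.1
App0.R4 App1.R2 REGU 0.5
#
"""

programs, connections = parse_config(SAMPLE)
for c in connections:
    print(c.exporter, "->", c.importer, c.policy, c.precision)
```

Because the connections live in this file and not in the component code, changing a participant or a policy only requires editing one line of the file.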
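Connection-Wise Approximate Match
[Figure: a connection entry annotated with its fields: Source, Sink, Policy, Precision]
• Example: App0.R4 (source), App1.R2 (sink), REGU (policy), 0.5 (precision)
• Find t’ in App0, s.t. (a) t <= t’ <= t + 0.5 (b) minimize t’ – t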
Dissection of Execution Time
• Execution time is composed of
  – Computation time (Tcomp)
  – Buffering time (Tbuf)
  – Matched data transfer time (Ttran)
• Tbuf matters when exporter components (data sources) run more slowly
• Ttran matters when importer components (data sinks) run more slowly
Collective Buffering (when exporters run more slowly)
• Fastest export process sends runtime match results to slower processes in the same program
• Unnecessary memory copies can be avoided in slower processes
• Optimal State: only required exported data are buffered
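The idea in the bullets above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the library's implementation: the class `ExportProcess` and its methods are hypothetical, the "memory copy" is modeled as a list copy, and import requests are assumed to arrive in increasing time order, so any timestamp at or below the latest resolved one that was not matched will never be needed.

```python
class ExportProcess:
    """One (slow) process of an exporter program, buffering its piece of each export."""

    def __init__(self):
        self.buffer = {}                      # timestamp -> copied data
        self.matched = set()                  # timestamps known to be required
        self.resolved_upto = float("-inf")    # timestamps <= this are already decided
        self.copies = 0                       # number of memory copies performed

    def learn_match(self, matched_t, resolved_upto):
        """Record a match result forwarded by the fastest process."""
        self.matched.add(matched_t)
        self.resolved_upto = max(self.resolved_upto, resolved_upto)
        # Drop already-buffered timestamps that are now known to be unneeded.
        stale = [t for t in self.buffer
                 if t <= self.resolved_upto and t not in self.matched]
        for t in stale:
            del self.buffer[t]

    def export(self, t, data):
        """Buffer data for timestamp t unless it is known to be unneeded."""
        if t <= self.resolved_upto and t not in self.matched:
            return False                      # skip the memory copy entirely
        self.buffer[t] = list(data)           # the memory copy
        self.copies += 1
        return True


# The fastest process has already matched the request for T = 2.5 to T = 3
# and tells the slow process before it reaches those exports.
slow = ExportProcess()
slow.learn_match(matched_t=3, resolved_upto=3)
results = [slow.export(t, [0.0] * 4) for t in (1, 2, 3, 4)]
print(results, slow.copies)
```

In this run the slow process skips the copies for T = 1 and T = 2, buffers the matched T = 3, and still buffers T = 4 because no later request has been resolved yet.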
Collective Buffering Result
[Figure: Data exporting time for the slowest process, with regions labeled Copy All, Copy Some, Only Copy Required, and Optimal State]
Eager Transfer + Distributed Match (when importer runs more slowly)
• Bandwidth and latency both contribute to matched data transfer time
• Eager transfer, transferring predicted data in advance, solves bandwidth issue
• Distributed approximate match, running on both exporter and importer, solves latency issue
[Figure: comparison of Original, ET Only, and ET+DM]
Conclusion
• Runtime-based approximate match is a way to couple components that run on different time scales
• Performance can be improved
  – When exporter runs more slowly, avoid unnecessary memory copies
  – When importer runs more slowly, transfer predicted data and meta-data in advance
The End
Questions? (http://meou.us)
On-Demand Approach
• Import component makes request
• Perform approximate match on export component, then transfer matched data
• Needs data transfer time (T3 – T2) and 2 one-way delays (T2 – T1)
Eager Transfer Only
• Get permission to push predicted data
• Transfer predicted data in advance
• Import component makes request
• Perform approx match on export component
• Needs 2 one-way delays (T16 – T15)
Eager Transfer With Distributed Match
• …
• Transfer predicted data + meta-data in advance
• The import component's request becomes a local operation
• Only local operation time (T26 – T25) is needed, independent of the one-way delay
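The waiting time seen by the importer under the three approaches can be compared with a small cost model. This is a sketch that only restates the bullets of the three slides; the function names and the numbers in the example are hypothetical, and it assumes the eagerly pushed data is exactly the data the request ends up matching (a correct prediction).

```python
def on_demand_wait(transfer_time, one_way_delay):
    """On-demand: request travels to the exporter, then matched data travels back."""
    return transfer_time + 2 * one_way_delay


def eager_transfer_wait(one_way_delay):
    """Eager transfer only: data is already local, but the match still needs a round trip."""
    return 2 * one_way_delay


def eager_distributed_wait(local_match_time):
    """Eager transfer + distributed match: the request is resolved locally."""
    return local_match_time


# Hypothetical timings in milliseconds.
transfer_ms, delay_ms, local_ms = 40, 5, 1
print("on-demand:", on_demand_wait(transfer_ms, delay_ms))
print("ET only:  ", eager_transfer_wait(delay_ms))
print("ET + DM:  ", eager_distributed_wait(local_ms))
```

With these made-up numbers, eager transfer removes the bandwidth term and distributed match removes the latency term, which is the ordering the slides describe.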
All Together
Supported matching policies
<importer request, exporter matched, desired precision> = <x, f(x), p>
• LUB: minimum f(x) with f(x) ≥ x
• GLB: maximum f(x) with f(x) ≤ x
• REG: f(x) minimizes |f(x)-x| with |f(x)-x| ≤ p
• REGU: f(x) minimizes f(x)-x with 0 ≤ f(x)-x ≤ p
• REGL: f(x) minimizes x-f(x) with 0 ≤ x-f(x) ≤ p
• FASTR: any f(x) with |f(x)-x| ≤ p
• FASTU: any f(x) with 0 ≤ f(x)-x ≤ p
• FASTL: any f(x) with 0 ≤ x-f(x) ≤ p
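The eight policies can be written as one small Python function over a list of exported timestamps. This is a sketch of the definitions above, not the library's code: the function name `approximate_match` is hypothetical, returning `None` when nothing qualifies is an assumption, and for the FAST policies "any" is implemented as "the first acceptable timestamp in the list".

```python
def approximate_match(x, exported, policy, p=None):
    """Return the matched timestamp f(x) from `exported`, or None if none qualifies."""
    if policy == "LUB":
        return min((t for t in exported if t >= x), default=None)
    if policy == "GLB":
        return max((t for t in exported if t <= x), default=None)

    if policy in ("REG", "FASTR"):
        candidates = [t for t in exported if abs(t - x) <= p]
        distance = lambda t: abs(t - x)
    elif policy in ("REGU", "FASTU"):
        candidates = [t for t in exported if 0 <= t - x <= p]
        distance = lambda t: t - x
    elif policy in ("REGL", "FASTL"):
        candidates = [t for t in exported if 0 <= x - t <= p]
        distance = lambda t: x - t
    else:
        raise ValueError("unknown policy: " + policy)

    if not candidates:
        return None
    if policy.startswith("FAST"):
        return candidates[0]              # any acceptable timestamp will do
    return min(candidates, key=distance)  # the closest acceptable timestamp


exported = [1, 2, 3, 4]
print(approximate_match(2.5, exported, "REGU", 0.5))
print(approximate_match(2.5, exported, "REGL", 0.5))
print(approximate_match(2.5, exported, "REG", 0.1))
```

For a request at x = 2.5 against exports at T = 1, 2, 3, 4, REGU with p = 0.5 selects 3, REGL with p = 0.5 selects 2, and REG with p = 0.1 finds no match.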