Proceedings of the 2013 ACM International Conference on ...


ICS'13

International Conference on Supercomputing

Sponsored by:

ACM SIGARCH

Supported by:

Microsoft Research, Microsoft, NVIDIA, IBM, Intel, Hewlett Packard, Data Direct Networks


Table of Contents

ICS 2013 Organization List xi

ICS 2013 Sponsor & Supporters xiv

Session 1: Keynote Address

• Business Meets Supercomputing 1

Bob Blainey (IBM Canada)

Session 2: DSLs and Semantic Based Compilation 1

• Abstractions to Separate Concerns in Semi-Regular Grids 3

Andrew Stone, Michelle Mills Strout (Colorado State University)

• A Stencil Compiler for Short-Vector SIMD Architectures 13

Tom Henretty (The Ohio State University), Richard Veras, Franz Franchetti, Louis-Noel Pouchet (University of California, Los Angeles), J. Ramanujam (Louisiana State University), P. Sadayappan (The Ohio State University)

• Exploiting Domain Knowledge to Optimize Parallel Computational Mechanics Codes 25

Chenyang Liu, M. Hasan Jamal, Milind Kulkarni, Arun Prakash, Vijay Pai (Purdue University)

Session 3: Tools and Performance Debugging

• TEAPOT: A Toolset for Evaluating Performance, Power and Image Quality on Mobile Graphics Systems 37

Jose-Maria Arnau, Joan-Manuel Parcerisa (Universitat Politecnica de Catalunya), Polychronis Xekalakis (Intel Corporation)

• Scaling Data Race Detection for Partitioned Global Address Space Programs 47

Chang-Seo Park, Koushik Sen (University of California, Berkeley), Costin Iancu (Lawrence Berkeley National Laboratories)

• Elastic and Scalable Tracing and Accurate Replay of Non-Deterministic Events 59

Xing Wu, Frank Mueller (North Carolina State University)

• A New Approach for Performance Analysis of OpenMP Programs 69

Xu Liu, John Mellor-Crummey, Michael Fagan (Rice University)

Session 4: Memory and Storage

• Conservative Row Activation to Improve Memory Power Efficiency 81

Kun Fang, Zhichun Zhu (University of Illinois at Chicago)

• Active Disk Meets Flash: A Case for Intelligent SSDs 91

Sangyeun Cho (Samsung Electronics Co. & University of Pittsburgh), Chanik Park (Samsung Electronics Co.), Hyunok Oh (Hanyang University), Sungchan Kim (Chonbuk National University), Youngmin Yi (University of Seoul), Gregory R. Ganger (Carnegie Mellon University)

• Design of a Large-Scale Storage-Class RRAM System 103

Myoungsoo Jung (The Pennsylvania State University & Lawrence Berkeley National Laboratory), John Shalf (Lawrence Berkeley National Laboratory), Mahmut Kandemir (The Pennsylvania State University)

• Memorage: Emerging Persistent RAM Based Malleable Main Memory and Storage Architecture 115

Ju-Young Jung (University of Pittsburgh), Sangyeun Cho (Samsung Electronics Co. & University of Pittsburgh)

Session 5: Keynote Address

• Function, Latency, Bandwidth, Power: Towards a Better Computer 127

Steve Teig (Tabula, Inc.)


Session 6: Communication and Heterogeneous Systems

• Improving Communication in PGAS Environments: Static and Dynamic Coalescing in UPC 129

Michail Alvanos (Barcelona Supercomputer Center), Montse Farreras (Universitat Politecnica de Catalunya), Ettore Tiotto (IBM Toronto Laboratory), Jose Nelson Amaral (University of Alberta), Xavier Martorell (Universitat Politecnica de Catalunya)

• Bandwidth-optimal All-to-All Exchanges in Fat Tree Networks 139

Bogdan Prisacari, German Rodriguez, Cyriel Minkenberg (IBM Research), Torsten Hoefler (ETH Zurich)

• An Automatic Input-Sensitive Approach for Heterogeneous Task Partitioning 149

Klaus Kofler, Ivan Grasso, Biagio Cosenza, Thomas Fahringer (University of Innsbruck)

• libWater: Heterogeneous Distributed Computing Made Easy 161

Ivan Grasso, Simone Pellegrini, Biagio Cosenza, Thomas Fahringer (University of Innsbruck)

Session 7: Architecture 1

• Exploring Hardware Overprovisioning in Power-Constrained, High Performance Computing 173

Tapasya Patki, David K. Lowenthal (The University of Arizona), Barry Rountree, Martin Schulz, Bronis de Supinski (Lawrence Livermore National Laboratory)

• The Power 775 Architecture at Scale 183

Ramakrishnan Rajamony, Mark W. Stephenson, Evan Speight (IBM Research)

• Bubble Coloring: Avoiding Routing- and Protocol-induced Deadlocks with Minimal Virtual Channel Requirement 193

Ruisheng Wang, Lizhong Chen, Timothy Mark Pinkston (University of Southern California)

• Evaluating On-Die Interconnects for a 4 TB/s Router 203

Keith D. Underwood, Eric Borch (Intel Federal), John Sizer, Timothy Stremcha, Michael Strom (Open-Silicon, Inc.)

Session 8: Algorithms

• Improving Numerical Accuracy for Non-Negative Matrix Multiplication on GPUs Using Recursive Algorithms 213

Matthew Badin (University of California Irvine), Paolo D'Alberto (FastMMW), Lubomir Bic, Michael Dillencourt, Alexandra Nicolau (University of California Irvine)

• Toward a Scalable Multi-GPU Eigensolver via Compute-Intensive Kernels and Efficient Communication 223

Azzam Haidar, Mark Gates, Stanimire Tomov (University of Tennessee), Jack Dongarra (Oak Ridge National Lab)

• High Quality Real-Time Image-to-Mesh Conversion for Finite Element Simulations 233

Panagiotis Foteinos (College of William and Mary & Old Dominion University), Nikos Chrisochoides (Old Dominion University)

Session 9: Architecture 2

• Tuning the Continual Flow Pipeline Architecture 243

Komal Jothi, Haitham Akkary (American University of Beirut)

• Towards More Efficient Execution: A Decoupled Access-Execute Approach 253

Konstantinos Koukos, David Black-Schaffer, Vasileios Spiliopoulos, Stefanos Kaxiras (Uppsala University)

• Quantifying Performance Bottleneck Cost Through Differential Analysis 263

Souad Koliaï, Zakaria Bendifallah, Mathieu Tribalat, Cédric Valensi, Jean-Thomas Acquaviva (Exascale Computing Research), William Jalby (University of Versailles)


Session 10: Irregular Algorithms

• Efficient Sparse Matrix-Vector Multiplication on x86-Based Many-Core Processors 273

Xing Liu (Georgia Institute of Technology), Mikhail Smelyanskiy (Intel Corporation), Edmond Chow (Georgia Institute of Technology), Pradeep Dubey (Intel Corporation)

• Expressing Graph Algorithms Using Generalized Active Messages 283

Nicholas Edmonds, Jeremiah Willcock, Andrew Lumsdaine (Indiana University)

• HykSort: A New Variant of Hypercube Quicksort on Distributed Memory Architectures 293

Hari Sundar, Dhairya Malhotra, George Biros (University of Texas at Austin)

Session 11: Memory

• Diagnosis and Optimization of Application Prefetching Performance 303

Gabriel Marin (University of Tennessee), Collin McCurdy, Jeffrey S. Vetter (Oak Ridge National Laboratory)

• Address-Aware Fences 313

Changhui Lin (University of California, Riverside), Vijay Nagarajan (University of Edinburgh), Rajiv Gupta (University of California, Riverside)

• Prefetching and Cache Management Using Task Lifetimes 325

Vassilis Papaefstathiou, Manolis G. H. Katevenis (FORTH-ICS), Dimitrios S. Nikolopoulos (Queen's University of Belfast), Dionisios Pnevmatikatos (FORTH-ICS)

Session 12: Keynote Address

• The Role of Computer Designers in Reverse-Engineering the Brain 335

James E. Smith (Independent)

Session 13: Runtime Techniques

• Holistic Run-time Parallelism Management for Time and Energy Efficiency 337

Srinath Sridharan, Gagan Gupta, Gurindar S. Sohi (University of Wisconsin-Madison)

• G-Charm: An Adaptive Runtime System for Message-Driven Parallel Applications on Hybrid Systems 349

R Vasudevan, Sathish Vadhiyar (Indian Institute of Science), Laxmikant V. Kale (University of Illinois at Urbana-Champaign)

• Implementing OmpSs Support for Regions of Data in Architectures with Multiple Address Spaces 359

Javier Bueno, Xavier Martorell (Universitat Politecnica de Catalunya), Rosa M. Badia (Barcelona Supercomputing Center, Artificial Intelligence Research Institute, & Spanish National Research Council), Eduard Ayguade, Jesus Labarta (Universitat Politecnica de Catalunya)

• Automatically Adapting Programs for Mixed-Precision Floating-Point Computation 369

Michael O. Lam, Jeffrey K. Hollingsworth (University of Maryland), Bronis R. de Supinski, Matthew P. Legendre (Lawrence Livermore National Laboratory)

Session 14: Order in the House

• CMP Off-chip Bandwidth Scheduling Guided by Instruction Criticality 379

Pablo Prieto, Valentin Puente, Jose Angel Gregorio (University of Cantabria)

• Massively Parallel Loading 389

Wolfgang Frings (Forschungszentrum Juelich), Dong H. Ahn, Matthew LeGendre, Todd Gamblin, Bronis R. de Supinski (Lawrence Livermore National Laboratory), Felix Wolf (RWTH Aachen University)

• MIC-RO: Enabling Efficient Remote Offload on Heterogeneous Many Integrated Core (MIC) Clusters with InfiniBand 399

Khaled Hamidouche, Sreeram Potluri, Hari Subramoni, Krishna Kandalla, Dhabaleswar K. Panda (The Ohio State University)


Session 15: GPUs

• Efficient Scheduling of Recursive Control Flow on GPUs 409

Xin Huo (The Ohio State University), Sriram Krishnamoorthy (Pacific Northwest National Laboratory), Gagan Agrawal (The Ohio State University)

• SemCache: Semantics-aware Caching for Efficient GPU Offloading 421

Nabeel AlSaber, Milind Kulkarni (Purdue University)

• Exploiting Uniform Vector Instructions for GPGPU Performance, Energy Efficiency, and Opportunistic Reliability Enhancement 433

Ping Xiang (North Carolina State University), Yi Yang (NEC Laboratories America), Mike Mantor, Norm Rubin, Lisa R. Hsu (AMD Inc.), Huiyang Zhou (North Carolina State University)

• Scaling Large-Data Computations on Multi-GPU Accelerators 443

Amit Sabne, Putt Sakdhnagool, Rudolf Eigenmann (Purdue University)

Session 16: Posters

• Hybrid Approach for Data-flow Analysis of MPI Programs 455

Sriram Aananthakrishnan (University of Utah), Greg Bronevetsky (Lawrence Livermore National Laboratory), Ganesh Gopalakrishnan (University of Utah)

• Improving Performance of All-to-All Communication Through Loop Scheduling in PGAS Environments 457

Michail Alvanos (Barcelona Supercomputer Center), Gabriel Tanase (IBM T.J. Watson Research Center), Montse Farreras (Universitat Politecnica de Catalunya), Ettore Tiotto (IBM Toronto Laboratory), Jose Nelson Amaral (University of Alberta), Xavier Martorell (Universitat Politecnica de Catalunya)

• CUPL: A Compile-time Uncoalesced Memory Access Pattern Locator for CUDA 459

Madhur Amilkanthwar, Shankar Balachandran (Indian Institute of Technology)

• Imbalance Optimization in Scientific Workflows 461

Weiwei Chen, Ewa Deelman (University of Southern California), Rizos Sakellariou (University of Manchester)

• FASTER Run-time Reconfiguration Management 463

Catalin Bogdan Ciobanu (Chalmers University of Technology), Dionisios N. Pnevmatikatos, Kyprianos D. Papadimitriou (Foundation for Research and Technology - Hellas), Georgi N. Gaydadjiev (Chalmers University of Technology)

• MAD7: A Memory Architecture Simulator Targeted at Design Space Exploration 465

Hadrien A. Clarke (Institute of Systems, Information Technologies and Nanotechnologies & Kyushu University), Antoine Trouve (Institute of Systems, Information Technologies and Nanotechnologies), Kazuaki J. Murakami (Institute of Systems, Information Technologies and Nanotechnologies & Kyushu University)

• A Decomposition Method with Minimal Communication Volume for Parallelization of Multi-dimensional FFTs 467

Truong Vinh Truong Duy (Japan Advanced Institute of Science and Technology & The University of Tokyo), Taisuke Ozaki (Japan Advanced Institute of Science and Technology)

• A Massively Parallel Domain Decomposition Method for Large-Scale DFT Electronic Structure Calculations 469

Truong Vinh Truong Duy (Japan Advanced Institute of Science and Technology & The University of Tokyo), Taisuke Ozaki (Japan Advanced Institute of Science and Technology)

• Multi-Layered Unstructured Mesh Generation 471

Panagiotis Foteinos (College of William and Mary & Old Dominion University), Daming Feng, Andrey Chernikov, Nikos Chrisochoides (Old Dominion University)

• Network-on-Chip for a Partially Reconfigurable FPGA System 473

Justin A. Hogan, Raymond J. Weber, Brock J. LaMeres, Todd Kaiser (Montana State University)

• Exploiting Data Parallelism in the yConvex Hypergraph Algorithm for Image Representation Using GPGPUs 475

Saurabh Jha, Tejaswi Agarwal, B. Rajesh Kanna (VIT University)


• The ARMv8 Simulator 477

Tao Jiang, Lele Zhang (Chinese Academy of Sciences & University of Chinese Academy of Sciences), Rui Hou (Chinese Academy of Sciences), Yi Zhang (Chinese Academy of Sciences & University of Chinese Academy of Sciences), Qianlong Zhang (Chinese Academy of Sciences), Lin Chai, Jing Han (Chinese Academy of Sciences & University of Chinese Academy of Sciences), Wuxiang Zhang (Chinese Academy of Sciences), Cong Wang (Chinese Academy of Sciences & University of Chinese Academy of Sciences), Lixin Zhang (Chinese Academy of Sciences)

• Imogen: A Parallel 3D Fluid and MHD Code for GPUs 479

Erik Keever, James N. Imamura (University of Oregon)

• SMIO: I/O Similarity Aware Virtual Machine Management in Virtual Desktop Environments 481

Min Li, Sushil Mantri (Virginia Tech), Pin Zhou (IBM Almaden Research Center), Ali R. Butt (Virginia Tech)

• Inspector/Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions 483

David Ozog, Sameer Shende, Allen Malony (University of Oregon), Jeff R. Hammond, James Dinan, Pavan Balaji (Argonne National Laboratory)

• Improving Performance of OpenSHMEM Reference Library by Portable PE Mapping Technique 485

Swaroop Pophale, Tony Curtis, Barbara Chapman (University of Houston)

• Using Platform-independent Data Locality Analysis to Predict Cache Performance on Abstract Hardware Platforms 487

Sonish Shrestha (University of Texas at El Paso)

• Towards Shared Memory Consistency Models for GPUs 489

Tyler Sorensen, Ganesh Gopalakrishnan (University of Utah), Vinod Grover (NVIDIA)

• Exploiting Reuse Information to Reduce Refresh Energy in On-Chip eDRAM Caches 491

Alejandro Valero, Julio Sahuquillo, Salvador Petit, Jose Duato (Universitat Politecnica de Valencia)

• V-OpenCL: A Method to Use Remote GPGPU 493

Cong Wang, Tao Jiang, Rui Hou (Chinese Academy of Sciences)

• Power Efficiency in a Partially Reconfigurable Multiprocessor System 495

Raymond J. Weber, Justin A. Hogan, Brock J. LaMeres, Todd Kaiser (Montana State University)

Author Index 497
