CCNI HPC2 Activities


Transcript of CCNI HPC2 Activities

Slide 1

CCNI HPC2 Activities

NYS High Performance Computation Consortium funded by NYSTAR at $1M/year for 3 years
Goal is to provide NY State users support in the application of HPC technologies in:
- Research and discovery
- Product development
- Improved engineering and manufacturing processes
The HPC2 is a distributed activity - participants: Rensselaer, Stony Brook/Brookhaven, SUNY Buffalo, NYSERNET

HPC2 Activities
NY State Industrial Partners: Xerox, Corning, ITT Fluid Technologies (Goulds Pumps), GlobalFoundries

Modeling Two-phase Flows
Objectives
- Demonstrate end-to-end solution of two-phase flow problems; couple with a structural mechanics boundary condition.
- Provide an interfaced, efficient and reliable software suite for guiding design.
Tools
- Simmetrix SimAppS graphical interface: mesh generation and problem definition
- PHASTA: two-phase level set flow solver
- PhParAdapt: solution transfer and mesh adaptation driver
- Kitware ParaView: visualization
Systems: CCNI BG/L, CCNI Opteron cluster

Modeling Two-phase Flows: 3D Example Simulation
[Animation] Fluid ejected into air. Ran on 4,000 CCNI BG/L cores.

Two-phase Automated Mesh Adaptation

Six iterations of mesh adaptation on a two-phase simulation. Ran autonomously on 128 cores of the CCNI Opteron cluster for approximately 4 hours.

Modeling Two-phase Flows: Software Support for Fluid-Structure Interactions
Initial work interfaces the simulations through serial file formats for displacement and pressure data.
- The structural mechanics simulation runs in serial; the PHASTA simulation runs in parallel.
- Distribute serial displacement data to the partitioned PHASTA mesh; aggregate partitioned PHASTA nodal pressure data into a serial input file (see the sketch below).
- Modifications to the automated mesh adaptation Perl script.
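A minimal sketch of that serial-to-parallel exchange, assuming a hypothetical map from locally owned PHASTA nodes to global (serial) node ids, hypothetical file names, and that each serial node is owned by exactly one rank; this illustrates the data flow only, not the project's actual code.

```cpp
// Minimal sketch (not the project's code) of the serial<->parallel FSI data exchange.
#include <mpi.h>
#include <cstdio>
#include <vector>

// localToGlobal[i] = global (serial) node id of the i-th locally owned PHASTA node.
void exchangeFsiData(int numSerialNodes,
                     const std::vector<int>& localToGlobal,
                     std::vector<double>& localDisp,           // out: per-node displacement
                     const std::vector<double>& localPressure) // in: per-node pressure
{
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // 1. Distribute serial displacement data to the partitioned PHASTA mesh:
  //    rank 0 reads the serial file, then every rank picks out its owned nodes.
  std::vector<double> serialDisp(numSerialNodes, 0.0);
  if (rank == 0) {
    std::FILE* f = std::fopen("displacement.dat", "r");   // hypothetical file name
    for (int i = 0; i < numSerialNodes && f; ++i)
      std::fscanf(f, "%lf", &serialDisp[i]);
    if (f) std::fclose(f);
  }
  MPI_Bcast(serialDisp.data(), numSerialNodes, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  localDisp.resize(localToGlobal.size());
  for (std::size_t i = 0; i < localToGlobal.size(); ++i)
    localDisp[i] = serialDisp[localToGlobal[i]];

  // 2. Aggregate partitioned PHASTA nodal pressure data into one serial file:
  //    each rank scatters its owned values into a global array, rank 0 writes it.
  std::vector<double> contrib(numSerialNodes, 0.0), serialPressure(numSerialNodes, 0.0);
  for (std::size_t i = 0; i < localToGlobal.size(); ++i)
    contrib[localToGlobal[i]] = localPressure[i];
  MPI_Reduce(contrib.data(), serialPressure.data(), numSerialNodes,
             MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0) {
    std::FILE* f = std::fopen("pressure.dat", "w");        // hypothetical file name
    for (int i = 0; i < numSerialNodes && f; ++i)
      std::fprintf(f, "%g\n", serialPressure[i]);
    if (f) std::fclose(f);
  }
}
```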

Structural Mechanics Mesh of Input Face / PHASTA Partitioned Mesh of Input Face [figure labels]

Modeling Free Surface Flows
Objectives
- Demonstrate capability of available computational tools/resources for parallel simulation of highly viscous sheet flows.
- Solve a model sheet flow problem relevant to the actual process/geometry.
- Develop and define processes for high-fidelity twin-screw extruder parallel CFD simulation.
Investigated Tools (to date): ACUSIM AcuConsole and AcuSolve, Simmetrix MeshSim, Kitware ParaView
Systems: CCNI Opteron cluster

High Aspect Ratio Sheet
- Aspect ratio: 500:1
- Element count: 1.85 million
- 7 minutes on 512 cores; 300 minutes on 8 cores
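For reference, those timings imply the following parallel efficiency relative to the 8-core run (a derived figure, not stated on the slide):

```latex
E \;=\; \frac{8 \times 300\ \text{min}}{512 \times 7\ \text{min}} \;\approx\; 0.67
\qquad \text{(about a } 300/7 \approx 43\times \text{ speedup on } 64\times \text{ the cores)}
```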


Parallel 3D Sheet Flow Simulation
- Mesh generation in the Simmetrix SimAppS graphical interface.
- Gaps that are ~1/180 of the large feature dimension.

Screw Extruder: Simulation Based Design Tools
Conceptual Rendering of Single Screw Extruder Assembly*
Single Screw Extruder CAD**
* http://en.wikipedia.org/wiki/Plastics_extrusion
** https://sites.google.com/site/oscarsalazarcespedescaddesign/project03

Modeling Pump Flows
Objectives
- Apply HPC systems and software to set up and run 3D pump flow simulations in hours instead of days.
- Provide automated mesh generation for fluid geometries with rotating components.
Tools: ACUSIM Suite, PHASTA, ANSYS CFX, FMDB, Simmetrix MeshSim, Kitware ParaView
Systems: CCNI Opteron cluster

Modeling Pump Flows: Graphical Interfaces
AcuConsole interface: problem definition, mesh generation, runtime monitor, and data visualization

Software install: security modules & wrappers
Remote user support: VNC

Modeling Pump Flows Critical Mesh Regions

Modeling Pump Flows Critical Mesh Regions

Mesh Generation Tools
- Simmetrix provided a customized mesh generation and problem definition GUI after iterating with the industrial partner.
- Supports automated identification of pump geometric model features and application of attributes.
- Problem definition with support for exporting data for multiple CFD analysis tools.
- Reduced mesh generation time frees engineers to focus on simulation and design optimizations, yielding improved products.
- The industrial partner wanted to use the mesh generation capabilities demonstrated with the ACUSIM Suite with a CFD code that had a cavitation model.

Scientific Computation Research Center

Scientific Computation Research Center
Goal: Develop simulation technologies that allow practitioners to evaluate systems of interest.
To meet this goal we:
- Develop adaptive methods for reliable simulations
- Develop methods to do all computation on massively parallel computers
- Develop multiscale computational methods
- Develop interoperable technologies that speed simulation system development
- Partner on the construction of simulation systems for specific applications in multiple areas

SCOREC Software Components
Software available at http://www.scorec.rpi.edu/software.php (some tools not yet linked; email [email protected] with any questions)
Simulation Model and Data Management
- Geometric model interface to interrogate CAD models
- Parallel mesh topological representation
- Representation of tensor fields
- Relationship manager
Parallel Control
- Neighborhood-aware message packing - IPComMan
- Iterative mesh partition improvement with multiple criteria - ParMA
- Processor mesh entity reordering to improve cache performance

SCOREC Software Components (Continued)
Adaptive Meshing
- Adaptive mesh modification
- Mesh curving
Adaptive Control
- Support for executing parallel adaptive unstructured mesh flow simulations with PHASTA
- Adaptive multimodel simulation infrastructure
Analysis
- Parallel Hierarchic Adaptive Stabilized Transient Analysis software for compressible or incompressible, laminar or turbulent, steady or unsteady flows on 3D unstructured meshes (with U. Colorado)
- Parallel hierarchic multiscale modeling of soft tissues

Interoperable Technologies for Advanced Petascale Simulations (ITAPS)
[Diagram] Common interfaces (Mesh, Geometry, Relations, Field) are unified by petascale integrated tools, which build on component tools: MeshAdapt, interpolation kernels, swapping, dynamic services, geometry/mesh services, AMR front tracking, shape optimization, solution adaptive loop, solution transfer, petascale mesh generation, smoothing, front tracking.

PHASTA Scalability (Jansen, Shephard, Sahni, Zhou)

Excellent strong scaling
- Implicit time integration
- Employs the partitioned mesh for system formulation and solution
- Specific number of ALL-REDUCE communications also required

105M vertex mesh (CCNI Blue Gene/L):

  #Proc.   El./core   t (sec)   scale
  512      204,800    2120      1
  1,024    102,400    1052      1.01
  2,048    51,200     529       1.00
  4,096    25,600     267       0.99
  8,192    12,800     131       1.02
  16,384   6,400      64.5      1.03
  32,768   3,200      35.6      0.93

1 billion element anisotropic mesh on Intrepid Blue Gene/P:

  # of cores   Rgn imb   Vtx imb   Time (s)   Scaling
  16k          2.03%     7.13%     222.03     1
  32k          1.72%     8.11%     112.43     0.987
  64k          1.6%      11.18%    57.09      0.972
  128k         5.49%     17.85%    31.35      0.885
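The scale column is consistent with the usual strong-scaling factor relative to the smallest run (base core-seconds divided by the larger run's core-seconds); as a worked check against the last BG/L row:

```latex
s(N) \;=\; \frac{N_{\text{base}}\, t_{\text{base}}}{N\, t(N)},
\qquad
s(32{,}768) \;=\; \frac{512 \times 2120}{32{,}768 \times 35.6} \;\approx\; 0.93
```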

Strong Scaling: 5B Mesh up to 288k Cores
Without ParMA partition improvement the strong scaling factor is 0.88 (time is 70.5 secs).
Can yield 43 CPU-years of savings for production runs!

AAA 5B elements: full-system scale on Jugene (IBM BG/P system)

Parallel Adaptive Analysis
Requires functional support for:
- Mesh distribution
- Mesh-level inter-processor communications
- Parallel mesh modification
- Dynamic load balancing
Have parallel implementations for each, focusing on increasing scalability

Parallel Mesh Adaptation to 2.2 Billion Elements
- Initial mesh: uniform, 17 million mesh regions
- Adapted mesh: 160 air bubbles, 2.2 billion mesh regions
- Multiple predictive load balance steps used to make the adaptation possible
- Larger meshes possible (not out of memory)

Initial and adapted mesh (zoom of a bubble), colored by magnitude of the mesh size field
Mesh size field of air bubbles distributing in a tube (segment of the model; 64 bubbles total)

Initial Scaling Studies of Parallel MeshAdapt
Test strong scaling: uniform refinement on Ranger, 4.3M to 2.2B elements

Nonuniform field-driven refinement (with mesh optimization) on Ranger, 4.2M to 730M elements (time for dynamic load balancing not included)

Nonuniform field-driven refinement (with mesh optimization operations) on Blue Gene/P, 4.2M to 730M elements (time for dynamic load balancing not included)

  # of Parts   Time (s)   Scaling
  2048         21.5       1.0
  4096         11.2       0.96
  8192         5.67       0.95
  16384        2.73       0.99

  # of Parts   Time (s)   Scaling
  2048         110.6      1.0
  4096         57.4       0.96
  8192         35.4       0.79

  # of Parts   Time (s)   Scaling
  4096         173        1.0
  8192         105        0.82
  16384        66.1       0.65
  32768        36.1       0.60

Tightly coupled
- Advantage: computationally efficient
- Disadvantage: more complex code development
- Example: explicit solution of cannon blasts

Loosely coupled
- Advantage: ability to use existing analysis codes
- Disadvantage: overhead of multiple structures and data conversion
- Example: implicit high-order active flow control modeling

Snapshots at t = 0.0, t = 2e-4, and t = 5e-4 [figure labels]

Adaptive Loop Construction

Adaptive Loop Driver (C++): coordinates API calls to execute the solve-adapt loop
phSolver (Fortran 90): flow solver scalable to 288k cores of BG/P; Field API
phParAdapt (C++): invokes parallel mesh adaptation; uses SCOREC FMDB and MeshAdapt, Simmetrix MeshSim and MeshSimAdapt
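A minimal sketch of a solve-adapt driver of this shape; every function name below is a hypothetical stand-in, not the actual phSolver/phParAdapt entry points.

```cpp
// Illustrative solve-adapt loop driver; all functions are hypothetical stubs.
#include <mpi.h>

struct MeshAndFields;                       // in-memory mesh plus solution fields
MeshAndFields* loadInitialCase();           // read the starting mesh and fields
void runFlowSolve(MeshAndFields* state);    // wrap the Fortran flow solver
bool errorAboveTolerance(const MeshAndFields* state);  // error / size-field check
void adaptMeshAndTransfer(MeshAndFields* state);       // mesh adapt + solution transfer
void rebalance(MeshAndFields* state);       // dynamic load balancing
void writeCheckpoint(const MeshAndFields* state, int cycle);

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  MeshAndFields* state = loadInitialCase();
  const int maxAdaptCycles = 6;             // e.g. the six adaptation iterations cited earlier
  for (int cycle = 0; cycle < maxAdaptCycles; ++cycle) {
    runFlowSolve(state);                    // advance the flow solution on the current mesh
    writeCheckpoint(state, cycle);
    if (!errorAboveTolerance(state))        // stop once the error indicator is satisfied
      break;
    adaptMeshAndTransfer(state);            // refine/coarsen and transfer fields in memory
    rebalance(state);                       // restore load balance after adaptation
  }
  MPI_Finalize();
  return 0;
}
```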

File-Free Parallel-Adaptive Loop
[Diagram labels: Adaptive Loop Driver, phSolver, phParAdapt, Compact Mesh and Solution Data, Mesh Data Base, Solution Fields, Field API, Control, Field Data]

IPComMan
- General-purpose communication package built on top of MPI
- Architecture-independent, neighborhood-based inter-processor communications
Neighborhood in parallel applications:
- Subset of processors exchanging messages during a specific communication round
- Bounded by a constant, typically under 40, independent of the total number of processors
Several useful features of the library:
- Automatic message packing
- Management of sends and receives with non-blocking MPI functions
- Asynchronous behavior unless otherwise specified
- Support of dynamically changing neighborhoods during communication steps

IPComMan Implementation
Buffer memory management:
- Assemble messages in pre-allocated buffers for each destination
- Send each package out when its buffer size is reached
- Provide memory allocation for both sending and receiving buffers
- Deal with constant or arbitrary message sizes
Processor-neighborhood-domain concept:
- Support efficient communication to processor neighbors based on knowledge of neighborhoods
- No collective-call verifications if neighbors are fixed
- If new neighbors are encountered, perform a collective call to verify the correctness of communication
Communication paradigm:
- No need to verify and send the number of packages to neighbors; it is wrapped in the last buffer
- If there is nothing to send to a neighbor, a constant is sent notifying that the communication is done
- No message-order rule, thus saving communication time by processing the first available buffer
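A minimal sketch of the neighborhood-based, non-blocking exchange pattern that IPComMan automates, written against plain MPI; this is not IPComMan's actual interface, and it assumes a symmetric neighborhood in which each rank receives exactly one packed message from each of its neighbors.

```cpp
// Illustrative neighborhood exchange with non-blocking MPI calls; this shows the
// communication pattern described above, not IPComMan's API.
#include <mpi.h>
#include <map>
#include <vector>

// Each rank sends one packed buffer of doubles to each of its neighbors and
// receives one buffer back; 'outgoing' maps neighbor rank -> packed message.
std::map<int, std::vector<double>>
neighborhoodExchange(const std::map<int, std::vector<double>>& outgoing,
                     MPI_Comm comm) {
  const int tag = 99;
  std::vector<MPI_Request> sendReqs;

  // Post a non-blocking send of the packed buffer for every neighbor.
  for (const auto& kv : outgoing) {
    sendReqs.emplace_back();
    MPI_Isend(kv.second.data(), static_cast<int>(kv.second.size()), MPI_DOUBLE,
              kv.first, tag, comm, &sendReqs.back());
  }

  // Receive exactly one message from each neighbor (fixed-neighborhood case),
  // processing whichever buffer arrives first rather than imposing an order.
  std::map<int, std::vector<double>> incoming;
  for (std::size_t i = 0; i < outgoing.size(); ++i) {
    MPI_Status status;
    MPI_Probe(MPI_ANY_SOURCE, tag, comm, &status);
    int count = 0;
    MPI_Get_count(&status, MPI_DOUBLE, &count);
    std::vector<double> buf(count);
    MPI_Recv(buf.data(), count, MPI_DOUBLE, status.MPI_SOURCE, tag, comm,
             MPI_STATUS_IGNORE);
    incoming[status.MPI_SOURCE] = std::move(buf);
  }

  MPI_Waitall(static_cast<int>(sendReqs.size()), sendReqs.data(),
              MPI_STATUSES_IGNORE);
  return incoming;
}
```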

IPComMan vs. MPI Implementation
Tiling patterns to test message flow control in a pseudo-unstructured neighborhood environment on 1024 cores:
- N/4 processors have 2 neighbors, N/8 have 3 neighbors, N/4 have 4 neighbors, 3N/16 have 5 neighbors, N/16 have 9 neighbors, N/16 have 14 neighbors, N/16 have 36 neighbors
- Sending and receiving 8-byte messages without buffering

Predictive Load Balancing
- Mesh modification before load balancing can lead to memory problems
- Predictive load balancing performs a weighted dynamic load balance
- The mesh metric field at any point P is decomposed into three unit directions (e1, e2, e3) and the desired length (h1, h2, h3) in each corresponding direction
- The volume of the desired element (tetrahedron): h1*h2*h3/6
- Estimate the number of elements to be generated (see the formula below)
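The slide's own formula for that estimate was an image; one plausible form, under the standard assumption that the number of new elements in a region is its volume divided by the desired element volume from the metric field, is:

```latex
n_{\text{est}} \;=\; \sum_{e \in \text{mesh}} \int_{\Omega_e}
  \frac{6}{h_1(P)\, h_2(P)\, h_3(P)} \, d\Omega
```

The per-region estimates then serve as the element weights for the dynamic load balance mentioned above.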

ParMA - Partition Improvement Procedures
- Incremental redistribution of mesh entities to improve overall balance
- Partitioning using Mesh Adjacencies (ParMA)
- Designed to improve balance for multiple entity types
- Uses mesh adjacencies directly to determine the best candidates for movement
- Current implementation based on neighborhood diffusion

Table: Region and vertex imbalance for an 8.8 million region uniform mesh on a bifurcation pipe model partitioned to different numbers of parts

Selection of vertices to be migrated: ones bounding a small number of elements
Only vertices with one remote copy are considered, to avoid the possibility of creating nasty part boundaries (see the sketch below)
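A minimal sketch of that selection rule against a simplified, assumed vertex record; the Vertex fields here are illustrative stand-ins, not the FMDB/ParMA data structures.

```cpp
// Illustrative vertex-selection pass for diffusive partition improvement.
#include <vector>

struct Vertex {
  int numBoundedElements;    // elements on this part that the vertex bounds
  int numRemoteCopies;       // copies of this vertex on other parts
  int lightestNeighborPart;  // adjacent part with the smallest load
};

// Pick migration candidates: vertices that bound few local elements and have
// exactly one remote copy, so moving them keeps part boundaries simple.
std::vector<const Vertex*>
selectCandidates(const std::vector<Vertex>& boundaryVerts, int maxBounded) {
  std::vector<const Vertex*> candidates;
  for (const Vertex& v : boundaryVerts) {
    if (v.numRemoteCopies == 1 && v.numBoundedElements <= maxBounded)
      candidates.push_back(&v);  // later migrated toward v.lightestNeighborPart
  }
  return candidates;
}
```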

Vertex imbalance: from 14.3% to 5%
Region imbalance: from 2.1% to 5%

Mesh Curving for COMPASS Analyses

Mesh close-up before and after correcting invalid mesh regions (marked in yellow)
Mesh curving applied to 8-cavity cryomodule simulations
- 2.97 million curved regions
- 1,583 invalid elements corrected, which leads to a stable simulation that executes 30% faster

Moving Mesh Adaptation
- FETD for short-range wakefield calculations
- Adaptively refined meshes have 1 to 1.5 million curved regions
- A uniformly refined mesh using a small mesh size has 6 million curved regions

Electric fields on the three refined curved meshes

Patient-Specific Vascular Surgical Planning
- Initial mesh has 7.1 million regions
- Initial mesh is isotropic outside the boundary layer
- Adapted mesh: 42.8 million regions (7.1M -> 10.8M -> 21.2M -> 33.0M -> 42.8M)
- Boundary layer based mesh adaptation
- Mesh is anisotropic

Multiscale Simulations for Collagen Indentation
Multiscale simulation linking a microscale network model to a macroscale finite element continuum model. Collaborating with experimentalists at the University of Minnesota.

Macroscale Model / Microscale Model [figure labels]

Concurrent Multiscale: Atomistic-to-Continuum
Nano-indentation of a thin film. Concurrent model configuration at the 60th load step (3 Å indentation displacement). Colors represent the sub-domains in which various models are used.

Nano-void subjected to hydrostatic tension. Finite element discretization of the problem domain and dislocation structures.

Fab-Aware High-Performance Chip Design
[Diagram labels: size scale (circuits, devices, atoms/carriers); design, manufacture, use/performance; parallel computing methods; simulation automation components; device simulation; super-resolution lithography tools; reactive ion etching; variation-aware circuit design; 1st-principles CMOS modeling; modeling/simulation development; technology development; mechanics of damage nucleation in devices]

First-Principles Modeling for Nanoelectronic CMOS (Nayak)

[Diagram labels: E, Fermi level, N, U, Poisson, Schrödinger]
Input to the circuit level from atomic-level physics:
- As Si CMOS devices shrink, nanoelectronic effects emerge.
- Fermi-function based analysis gives way to quantum energy-level analysis.
- The Poisson and Schrödinger equations are reconciled iteratively, allowing for current predictions (see the sketch below).
- Carrier dynamics respond to strain in increasingly complex ways, from mobility changes to tunneling effects.
New functionalities might be exploited:
- Single-electron transistors
- Graphene semiconductors
- Carbon nanotube conductors
- Spintronics: encoding information into the spin of charge carriers

[Speaker notes] This is similar to what was in my thesis, but at a larger computational scale. Using PHASTA for the level set/geometry tracker. Writing the transport code now in parallel. The kinetics solver is being re-written/parallelized, quite novel. Inputs are structures; outputs are structures and times. Rate expressions can be thought of as model parameters.
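A minimal sketch of that self-consistent Poisson-Schrödinger iteration; the two solver functions are hypothetical stubs, and this illustrates only the loop structure, not the project's actual code.

```cpp
// Illustrative self-consistent Poisson-Schrodinger loop with stubbed solvers.
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// Solve Poisson's equation for the potential U(x) given a charge density N(x).
std::vector<double> solvePoisson(const std::vector<double>& chargeDensity);
// Solve Schrodinger's equation in the potential U(x) and return the carrier
// density N(x) implied by the occupied energy levels.
std::vector<double> solveSchrodinger(const std::vector<double>& potential);

// Iterate until the potential stops changing, i.e. the two equations agree.
std::vector<double> selfConsistentPotential(const std::vector<double>& initialDensity,
                                            double tol = 1e-8, int maxIter = 200) {
  std::vector<double> potential = solvePoisson(initialDensity);
  for (int it = 0; it < maxIter; ++it) {
    std::vector<double> density = solveSchrodinger(potential);  // U -> N
    std::vector<double> updated = solvePoisson(density);        // N -> U
    double maxDiff = 0.0;                                       // convergence check
    for (std::size_t i = 0; i < potential.size(); ++i)
      maxDiff = std::max(maxDiff, std::fabs(updated[i] - potential[i]));
    potential = std::move(updated);  // (a damped update is common in practice)
    if (maxDiff < tol)
      break;                         // Poisson and Schrodinger reconciled
  }
  return potential;
}
```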

Super-Resolution Lithography Analysis (Oberai)
Motivation:
- Reducing feature size has made modeling of the underlying physics critical.
- In projective lithography, simple biases are not adequate.
- In holographic lithography, near-field phenomena are predominant.
- The modeling approach must be based on Maxwell's equations.
Goal: Develop unified computational algorithms for the design and analysis of super-resolution lithographic processes that model the underlying physics with high fidelity.

Projective Lithography / Holographic Lithography [figure labels]


Virtual Nanofabrication: Reactive-Ion Etching Simulation (Bloomfield)
- To handle SRAM-scale systems, we expect much larger computational systems, e.g., 10^5 to 10^6 surface elements.
- Transport tracking scales O(n^2) with the number of surface elements n.
- Parallelizes well: every view factor can be computed completely independently of every other view factor, giving almost linear speed-up (see the sketch after the following slide).
- The computational complexity of the chemistry solver depends on the particular chemical mechanisms associated with the etch recipe; these tend to be O(n^2).
Cut-away view of a reactive ion etch simulation of an aspect-ratio-1.4 via into a dielectric substrate with 7% porosity, and complete selectivity with respect to the underlying etch stop. A generic ion-radical etch model was used. ~10^3 surface elements. [Bloomfield et al., SISPAD 2003, IEEE.]

[Speaker notes] This is the tiny version of what we want to do. SRAM-scale means multiple etch holes at the same time, some trenches as well as these round holes, with neighboring holes interacting. View factors are element-to-element coefficients, so there are n^2 view factors; they are expensive to compute, but each view factor can be computed independently. Chemistry is the R in the equation on the next slide; it can be arbitrarily complex and tends to be a rational expression of fluxes (the etas) of high-energy incoming ions and highly reactive neutral species.

Stress-Induced Dislocation Formation in Silicon Devices (Picu)
- At 90 nm and below, devices have come to rely on the increased carrier mobility produced by strained silicon.
- As devices scale down, the relative importance of scattering centers increases.
- Can we have our cake and eat it too? How much strain can be built into a given device before processing variations and thermo-mechanical load during use cause critical dislocation shedding?

Continuum FEM calculations automatically identify critical high-stress regions. A local atomistic problem is constructed and an MD simulation is run, looking for criticality. Results feed back to the continuum.
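A minimal sketch of the embarrassingly parallel view-factor computation described on the reactive-ion etching slide above; the per-pair geometry kernel is a hypothetical function, not the actual simulator code.

```cpp
// Illustrative parallel view-factor fill: every (i, j) pair is independent, so
// the O(n^2) work distributes across MPI ranks with no communication until the
// final aggregation. viewFactor() is a hypothetical geometric kernel.
#include <mpi.h>
#include <vector>

struct SurfaceElement { double cx, cy, cz, nx, ny, nz, area; };

// Hypothetical kernel: fraction of flux leaving element i that reaches element j.
double viewFactor(const SurfaceElement& i, const SurfaceElement& j);

std::vector<double> computeViewFactors(const std::vector<SurfaceElement>& elems,
                                       MPI_Comm comm) {
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  const std::size_t n = elems.size();
  std::vector<double> local(n * n, 0.0);

  // Cyclically assign rows of the n x n matrix to ranks; each entry is
  // computed completely independently of every other entry.
  for (std::size_t i = static_cast<std::size_t>(rank); i < n; i += size)
    for (std::size_t j = 0; j < n; ++j)
      if (i != j)
        local[i * n + j] = viewFactor(elems[i], elems[j]);

  // Combine the per-rank partial matrices (each entry is nonzero on one rank).
  std::vector<double> global(n * n, 0.0);
  MPI_Allreduce(local.data(), global.data(), static_cast<int>(n * n),
                MPI_DOUBLE, MPI_SUM, comm);
  return global;
}
```

At the 10^5 to 10^6 element scale targeted above, the matrix would stay distributed across ranks rather than being replicated as in this small sketch.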

Advanced Meshing Tools for Nanoelectronic Design (Shephard)
- Advanced meshing tools and expertise exist at RPI and an associated spin-off.
- Leverage these tools to support CCNI projects such as the advanced device modeling.
- Local refinement and adaptivity can help carry the computational resources further: more bang for the buck.
