High Performance Computing & Big Data
Technology Compass 2016/17
More Than 35 Years of Experience
in Scientific Computing
1980 marked the beginning of a decade where numerous startups
were created, some of which later transformed into big players in
the IT market. Technical innovations brought dramatic changes
to the nascent computer market. In Tübingen, close to one of
Germany’s prime and oldest universities, transtec was founded.
In the early days, transtec focused on reselling DEC computers
and peripherals, delivering high-performance workstations to
university institutes and research facilities. In 1987, SUN/Sparc
and storage solutions broadened the portfolio, enhanced by IBM/
RS6000 products in 1991. These were the typical workstations
and server systems for high performance computing then, used
by the majority of researchers worldwide. Today, transtec is the largest Germany-based, Europe-wide provider of comprehensive HPC solutions. Many of our HPC clusters have entered the TOP500 list of the world's fastest computing systems.
Given this background and history, it is fair to say that transtec looks back on more than 35 years of experience in scientific computing; our track record shows more than 1,000 HPC installations. With this experience, we know exactly what our customers demand and how to meet those demands. High performance and ease of management – this is what customers require today. HPC systems are certainly required to deliver peak performance, as their name indicates, but that is not enough: they must also be easy to handle. Unwieldy design and operational complexity must be avoided, or at least hidden from administrators and particularly from users of HPC systems.
High Performance Computing has changed in the course of the
last 10 years. Whereas traditional topics like simulation, large-
SMP systems or MPI parallel computing still are important cor-
nerstones of HPC, new areas of applicability like Business Intelli-
gence/Analytics, new technologies like in-memory or distributed
databases or new algorithms like MapReduce have entered the
arena which is now commonly called HPC & Big Data.
A working knowledge of Hadoop or R nowadays belongs to the basic toolkit of any serious HPC player.
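The MapReduce idea mentioned above can be illustrated with a minimal sketch in pure Python (purely for illustration – real deployments use Hadoop or Spark, and all names here are hypothetical): a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values of each key
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data analytics", "big compute big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

The point of the pattern is that map and reduce work on independent pieces of data, so a framework like Hadoop can distribute them across thousands of nodes.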
Dynamic deployment scenarios are becoming more and more important for realizing private (or even public) HPC cloud scenarios. The OpenStack ecosystem, object storage concepts like Ceph, application containers with Docker – all of these were once bleeding-edge technologies, but have meanwhile become state of the art.
Therefore, this brochure is now called "Technology Compass HPC & Big Data". It covers all of the currently discussed core technologies mentioned above. Besides those, updates on more traditional topics, like the Intel Omni-Path architecture, are of course covered as well. No matter which ingredients an HPC solution requires – transtec masterfully combines excellent, well-chosen, proven components into a fine-tuned, customer-specific, and thoroughly designed HPC solution.
Your decision for a transtec HPC solution means you opt for the most intensive customer care and the best service in HPC. Our experts will be glad to bring in their expertise and support to assist you at any stage, from HPC design to daily cluster operations to HPC Cloud Services.
Last but not least, transtec HPC Cloud Services provide custom-
ers with the possibility to have their jobs run on dynamically pro-
vided nodes in a dedicated datacenter, professionally managed
and individually customizable. Numerous standard applications
like ANSYS, LS-Dyna, OpenFOAM, as well as lots of codes like
Gromacs, NAMD, VMD, and others are pre-installed, integrated
into an enterprise-ready cloud management environment, and
ready to run.
Have fun reading the
transtec HPC Compass 2016/17!
High Performance Computing for Everyone
Interview with Dr. Oliver Tennert
Director Technology Management
and HPC Solutions
Dr. Tennert, what’s the news in HPC?
Besides becoming more and more of a commodity, the HPC market has seen two major changes within the last couple of years. On the one hand, HPC has reached over into adjacent areas that would not have been recognized as part of HPC five years ago. Meanwhile, every serious HPC company tries to find its position within the area of "big data" – often vaguely understood, but sometimes put more precisely as "big data analytics".
Superficially, it is all about analyzing large amounts of data, which has been part of HPC from the beginning. But the areas of application are multiplying: "big data analytics" is a vital application of business intelligence/business analytics. Along with this, new methods, new concepts, and new technologies emerge.
The other change has to do with the way HPC environments are deployed and managed. Dynamic provisioning of resources for private-cloud scenarios is becoming more and more important, to which OpenStack has contributed a lot. In the near future, dynamic deployment will be as much of a commodity as efficient, but static, deployment is today.
What’s the deal with “dynamic deployment” anyway?
Dynamic deployment is the core provisioning technology in cloud environments, public and private ones alike. Virtual machines with different OSes, web servers, compute nodes, or any other kind of resource are only deployed when they are actually needed and requested.
Dynamic deployment makes sure that the available hardware capacity is prepared and deployed for exactly the purpose for which it is needed. The advantage: less idle capacity and better utilization than in static environments, where, for example, inactive Windows servers block hardware capacity while Linux-based compute nodes are needed at that point in time.
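The utilization argument can be made concrete with a toy model (the numbers and the 50/50 split are illustrative assumptions, not figures from this brochure): a statically partitioned cluster strands capacity whenever demand shifts, while a dynamically provisioned pool can repurpose any idle server.

```python
# Toy model: 100 servers serving a mixed Windows/Linux workload.
# Static partitioning reserves fixed shares per OS; dynamic
# deployment re-provisions any idle server for whatever is requested.

TOTAL = 100

def static_utilization(demand_win, demand_lin, reserved_win=50):
    # Each workload can only use its reserved partition
    reserved_lin = TOTAL - reserved_win
    used = min(demand_win, reserved_win) + min(demand_lin, reserved_lin)
    return used / TOTAL

def dynamic_utilization(demand_win, demand_lin):
    # Any server can be deployed for any workload
    return min(demand_win + demand_lin, TOTAL) / TOTAL

# Peak Linux demand while most Windows servers sit idle:
print(static_utilization(10, 80))   # 0.6 -> 30 Linux jobs wait, 40 servers idle
print(dynamic_utilization(10, 80))  # 0.9
```

The gap between the two numbers is exactly the "idle Windows servers block capacity that Linux compute nodes need" situation described above.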
Everyone is discussing OpenStack at the moment. What do you
think of the hype and the development of this platform? And
what is its importance for transtec?
OpenStack was indeed already being discussed at a time when it still had little business relevance for an HPC solutions provider. But this is the case with all innovations, be they hardware- or software-related.
The fact that OpenStack is a joint development effort of several big IT players has added to the demand for private clouds in the market on the one hand, and to its relevance as the software solution for building them on the other.
We at transtec have several customers who either ask us to
deploy a completely new OpenStack environment, or would
like us to migrate their existing one to OpenStack.
OpenStack has the reputation of being a very complex beast,
given the many components. Implementing it can take a while.
Is there any way to simplify this?
Indeed, life can be rough if you do it wrong. But there is no need to: no one installs Linux from scratch nowadays, apart from playing around and learning; instead, everyone uses professional distributions with commercial support. There are special OpenStack distributions, too, with commercial support: especially within HPC, Bright Cluster Manager is often the distribution of choice, but long-established players like Red Hat also offer simple deployment frameworks.
Our customers can rest assured that they get an individually
optimized OpenStack environment, which is not only de-
signed by us, but also installed and maintained throughout.
It has always been our job to hide any complexity underneath
a layer of “ease of use, ease of management”.
Who do you want to address with this topic?
Everyone who is interested in a highly flexible, yet effective utilization of their existing hardware capacity. We address customers from academia as well as SMB customers who, like big enterprises, have to turn towards new concepts and new technologies.
As said before: in our opinion, this way of resource provision-
ing will become mainstream soon.
Let’s turn to the other big topic: What is the connection between
HPC and Big Data Analytics, and what does the market have to
say about this?
As already mentioned, the HPC market is reaching out towards big data analytics. "HPC & Big Data" are mentioned in the same breath at key events like ISC or Supercomputing.
This is not without reason, of course: as different as the technologies in use or the goals to be reached may be, there is one common denominator, and that is a high demand for computational capacity. An HPC solution provider like transtec, who is capable of designing and deploying a one-petaflop HPC cluster, can naturally also manage the provisioning of, say, a ten-petabyte Hadoop cluster for big data analytics for a customer.
But the target group may be a different one: classical HPC
mostly takes place within the value chain of the customer. The
users are developers. Important areas for big data analytics,
however, can be found in business controlling, i.e. BI/BA, but
also in companies’ operations departments, when operation-
al decisions have to be made depending on certain vital state
data, in real time. This is what “internet of things” is all about.
Where are the technical differences between HPC and Big Data
Analytics, and what do both have in common?
Both areas share the "scale-out" approach. It is not the single server that gets more and more powerful; rather, acceleration is achieved via parallelization and massive load balancing across several servers.
In HPC, parallel computational jobs that run using the MPI
programming interface are state of the art. Simulations run on
many compute nodes at once, thus cutting down wall-clock
time. Within the context of big data analytics, distributed and
above all in-memory databases are becoming more and more
important. Database requests can thus be handled by several
servers, and with a low latency.
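How requests get spread over several database servers can be sketched with simple hash-based sharding (a hypothetical three-node setup of my own; real distributed databases layer replication and rebalancing on top of this):

```python
import hashlib

SERVERS = ["db-node-0", "db-node-1", "db-node-2"]  # hypothetical nodes

def shard_for(key: str) -> str:
    # Use a stable hash so every client routes the same key to the
    # same server; Python's built-in hash() is randomized per process.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SERVERS[digest % len(SERVERS)]

# The same key always lands on the same node...
print(shard_for("customer:42") == shard_for("customer:42"))  # True

# ...while different keys spread across all nodes:
assignments = {shard_for(f"customer:{i}") for i in range(100)}
print(sorted(assignments))
```

Because each node only ever sees its own slice of the keys, requests for different keys are served in parallel by different servers – the low-latency scale-out effect described above.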
How do companies handle the backup and the archiving of
these huge amounts of data?
Given these enormous amounts of data, standard backup strategies or archiving concepts indeed no longer apply. Instead, for big data analytics, data are stored redundantly as a matter of principle, to become independent from hardware failures. Storage concepts like Hadoop or object stores like Ceph offer this redundancy in data storage as well.
Moreover, when it comes to big data analytics, the basic premise is that it really does not matter much if a tiny fraction of these huge amounts of data goes missing for some reason. The goal of big data analytics is not to get a precise analytical result out of these often unstructured data, but rather the quickest possible extraction of relevant information and a sensible interpretation, whereby statistical methods are of great importance.
For which companies is it important to do this very quick extrac-
tion of information?
Holiday booking portals that use booking systems have to get an immediate overview of the hotel rooms available in a given town, sorted by price, in order to provide customers with this information. Energy or finance companies that trade their assets on exchanges have to know their current portfolio's value within a fraction of a second to do the right pricing.
Search engine providers – there are more than just Google – investigative authorities, credit card clearing houses that have to react quickly to attempted fraud: the list of companies that could make use of big data analytics grows along with the technical progress.
Table of Contents

High Performance Computing
  Performance Turns Into Productivity
  Flexible Deployment With xCAT
  Service and Customer Care From A to Z

OpenStack for HPC Cloud Environments
  What is OpenStack?
  Object Storage with Ceph
  Application Containers with Docker

Advanced Cluster Management Made Easy
  Easy-to-use, Complete and Scalable
  Cloud Bursting With Bright

Intelligent HPC Workload Management
  Moab HPC Suite
  IBM Platform LSF
  Slurm
  IBM Platform LSF 8

Intel Cluster Ready
  A Quality Standard for HPC Clusters
  Intel Cluster Ready Builds HPC Momentum
  The transtec Benchmarking Center

Remote Visualization and Workflow Optimization
  NICE EnginFrame: A Technical Computing Portal
  Remote Visualization
  Desktop Cloud Visualization
  Cloud Computing With transtec and NICE
  NVIDIA Grid Boards
  Virtual GPU Technology

Big Data Analytics with Hadoop
  Apache Hadoop
  HDFS Architecture

NVIDIA GPU Computing – The CUDA Architecture
  What is GPU Computing?
  Kepler GK110/210 GPU Computing Architecture
  A Quick Refresher on CUDA

Intel Xeon Phi Coprocessor
  The Architecture
  An Outlook on Knights Landing and Omni-Path

Vector Supercomputing
  The NEC SX Architecture
  History of the SX Series

Parallel File Systems – The New Standard for HPC Storage
  Parallel NFS
  What's New in NFS 4.1?
  Panasas HPC Storage
  BeeGFS
  General Parallel File System (GPFS)
  What's New in GPFS Version 3.5
  Lustre

Intel Omni-Path Interconnect
  The Architecture
  Host Fabric Interface (HFI)
  Switch Fabric Optimizations
  Fabric Software Components

Numascale
  NumaConnect Background
  Technology
  Redefining Scalable OpenMP and MPI
  Big Data
  Cache Coherence
  NUMA and ccNUMA

Allinea Forge for Parallel Software Development
  Application Performance and Development Tools
  What's in Allinea Forge?

Glossary
High Performance Computing – Performance Turns Into Productivity

High Performance Computing (HPC) has been with us from the very beginning of the computer era. High-performance computers were built to solve numerous problems which the "human computers" could not handle. The term HPC just hadn't been coined yet. More importantly, some of the early principles have changed fundamentally.
Engineering | Life Sciences | Automotive | Price Modelling | Aerospace | CAE | Data Analytics
Performance Turns Into Productivity

HPC systems in the early days were much different from those
we see today. First, we saw enormous mainframes from large
computer manufacturers, including a proprietary operating
system and job management system. Second, at universities
and research institutes, workstations made inroads and scien-
tists carried out calculations on their dedicated Unix or VMS
workstations. In either case, if you needed more computing
power, you scaled up, i.e. you bought a bigger machine. Today
the term High-Performance Computing has gained a fundamen-
tally new meaning. HPC is now perceived as a way to tackle
complex mathematical, scientific or engineering problems. The integration of industry-standard, "off-the-shelf" server hardware into HPC clusters facilitates the construction of computer networks of a power that no single system could ever achieve.
The new paradigm for parallelization is scaling out.
Computer-supported simulation of realistic processes (so-called Computer-Aided Engineering – CAE) has established itself as a third key pillar in the field of science and research, alongside theory and experimentation. It is nowadays inconceivable that an aircraft manufacturer or a Formula One racing team would operate
without using simulation software. And scientific calculations,
such as in the fields of astrophysics, medicine, pharmaceuticals
and bio-informatics, will to a large extent be dependent on super-
computers in the future. Software manufacturers long ago rec-
ognized the benefit of high-performance computers based on
powerful standard servers and ported their programs to them
accordingly.
The main advantage of scale-out supercomputers is just that: they are infinitely scalable, at least in principle. Since they are based on standard hardware components, such a supercomputer can be given more power whenever its computational capacity is no longer sufficient, simply by adding nodes of the same kind. A cumbersome switch to a different technology can be avoided in most cases.
“transtec HPC solutions are meant to provide customers with
unparalleled ease-of-management and ease-of-use. Apart from
that, deciding for a transtec HPC solution means deciding for the
most intensive customer care and the best service imaginable.”
Dr. Oliver Tennert | Director Technology Management & HPC Solutions
Variations on the Theme: MPP and SMP

Parallel computations exist in two major variants today. Applications running in parallel on multiple compute nodes are frequently so-called Massively Parallel Processing (MPP) applications. MPP indicates that the individual processes can each
utilize exclusive memory areas. This means that such jobs are
predestined to be computed in parallel, distributed across the
nodes in a cluster. The individual processes can thus utilize the
separate units of the respective node – especially the RAM, the
CPU power and the disk I/O.
Communication between the individual processes is implement-
ed in a standardized way through the MPI software interface
(Message Passing Interface), which abstracts the underlying
network connections between the nodes from the processes.
However, the MPI standard merely requires source-code compatibility, not binary compatibility, so an off-the-shelf application usually needs specific versions of MPI libraries in order to run. Examples of MPI implementations are Open MPI, MPICH2, MVAPICH2, Intel MPI or – for Windows clusters – MS-MPI.
If the individual processes engage in a large amount of commu-
nication, the response time of the network (latency) becomes im-
portant. Latency in a Gigabit Ethernet or a 10GE network is typi-
cally around 10 µs. High-speed interconnects such as InfiniBand,
reduce latency by a factor of 10 down to as low as 1 µs. Therefore,
high-speed interconnects can greatly speed up total processing.
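The effect of interconnect latency can be estimated with a crude back-of-the-envelope model (the job sizes below are made-up, illustrative numbers; only the ~10 µs vs. ~1 µs latency figures come from the text): for a communication-heavy job, total runtime is roughly compute time plus messages times per-message latency.

```python
def wall_clock(compute_s, n_messages, latency_us):
    # Crude model: total time = pure compute + per-message latency cost
    return compute_s + n_messages * latency_us * 1e-6

# A hypothetical job with 60 s of pure compute that exchanges
# 10 million small messages between its processes:
gige = wall_clock(60, 10_000_000, 10)  # ~10 us latency (Gigabit Ethernet)
ib   = wall_clock(60, 10_000_000, 1)   # ~1 us latency (InfiniBand)
print(f"GigE: {gige:.0f} s, InfiniBand: {ib:.0f} s")
```

For such a chatty job, the tenfold latency reduction cuts the communication share of the runtime by the same factor, which is why high-speed interconnects pay off most for tightly coupled MPI codes.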
The other frequently used variant is called SMP applications.
SMP, in this HPC context, stands for Shared Memory Processing.
It involves the use of shared memory areas, the specific imple-
mentation of which is dependent on the choice of the underlying
operating system. Consequently, SMP jobs generally only run on
a single node, where they can in turn be multi-threaded and thus
be parallelized across the number of CPUs per node. For many
HPC applications, both the MPP and SMP variant can be chosen.
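The SMP idea – several workers sharing one address space within a single node – can be sketched with Python's standard thread pool (a stand-in only: real SMP codes typically use OpenMP or native threads, and CPython's GIL limits true CPU parallelism; the example just shows the shared-memory access pattern):

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 1_000_001))  # one shared in-memory data set

def partial_sum(chunk):
    # Each worker reads its slice of the *shared* list directly --
    # no message passing is needed, unlike in the MPP/MPI model.
    return sum(chunk)

n_workers = 4
chunk_size = len(data) // n_workers
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    total = sum(pool.map(partial_sum, chunks))

print(total == sum(data))  # True
```

The contrast with MPP is exactly this: here all threads see the same memory, so the job cannot span more than one node.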
Many applications are not inherently suitable for parallel execution. In such a case, there is no communication between the individual compute nodes, and therefore no need for a high-speed network between them; nevertheless, multiple independent computing jobs can be run simultaneously on each individual node – as many at a time as there are CPUs – with further jobs following sequentially.
In order to ensure optimum computing performance for these
applications, it must be examined how many CPUs and cores de-
liver the optimum performance.
Typical examples of this sequential type of workload are found in the fields of data analysis or Monte-Carlo simulations.
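Monte-Carlo runs are a textbook example of such independent jobs: each task below estimates π from its own random sample and never communicates with the others (a hypothetical sketch of my own; on a real cluster each task would simply be submitted as a separate batch job).

```python
import random

def estimate_pi(n_samples, seed):
    # One independent "job": draw random points in the unit square and
    # count how many fall inside the quarter circle. No communication
    # with any other job is needed.
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / n_samples

# Ten independent tasks; averaging their results refines the estimate.
results = [estimate_pi(100_000, seed) for seed in range(10)]
pi_estimate = sum(results) / len(results)
print(round(pi_estimate, 2))  # close to 3.14
```

Since the tasks never talk to each other, throughput scales simply with the number of available CPU cores – which is why such workloads need no high-speed interconnect.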
xCAT as a Powerful and Flexible Deployment Tool
xCAT (Extreme Cluster Administration Tool) is an open source
toolkit for the deployment and low-level administration of HPC
cluster environments, small as well as large ones.
xCAT provides simple commands for hardware control, node discovery, the collection of MAC addresses, and node deployment with local disks (diskful) or without (diskless). The cluster configuration is stored in a relational database. Node groups for different operating system images can be defined. Also, user-specific scripts can be executed automatically at installation time.
xCAT Provides the Following Low-Level Administrative Features

- Remote console support
- Parallel remote shell and remote copy commands
- Plugins for various monitoring tools like Ganglia or Nagios
- Hardware control commands for node discovery, collecting MAC addresses, and remote power switching and resetting of nodes
- Automatic configuration of syslog, remote shell, DNS, DHCP, and NTP within the cluster
- Extensive documentation and man pages
For cluster monitoring, we install and configure the open source
tool Ganglia or the even more powerful open source solution Na-
gios, according to the customer’s preferences and requirements.
Local Installation or Diskless Installation

We offer a diskful or a diskless installation of the cluster nodes. In a diskless installation, the operating system is hosted partially within main memory; larger parts may or may not be included via NFS or other means. This approach allows for deploying large numbers of nodes very efficiently, so the cluster is up and running within a very short timeframe. Updating the cluster can also be done very efficiently: only the boot image has to be updated and the nodes rebooted. After this, the nodes run either a new kernel or even a new operating system. Moreover, with this approach, partitioning the cluster can also be done very efficiently, either for testing purposes or for allocating different cluster partitions to different users or applications.
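To illustrate what sits behind such a diskless setup, here is a generic, hypothetical PXE/NFS example (not xCAT's actual generated configuration; paths and addresses are invented): the node pulls its kernel and initrd over the network and mounts its root filesystem read-only via NFS.

```
# /etc/exports on the provisioning server:
# export the read-only root image to the compute-node subnet
/install/netboot/compute/rootimg  10.0.0.0/24(ro,no_root_squash,async)

# pxelinux.cfg/default served over TFTP:
# kernel and initrd from the network, root filesystem via NFS
DEFAULT compute
LABEL compute
  KERNEL vmlinuz
  APPEND initrd=initrd.img root=/dev/nfs nfsroot=10.0.0.1:/install/netboot/compute/rootimg ro ip=dhcp
```

Because every node boots the same image, an update means regenerating one root image and rebooting – exactly the efficiency gain described above.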
Development Tools, Middleware, and Applications
Depending on the application, optimization strategy, or underlying architecture, different compilers lead to code of very different performance. Moreover, different, mainly commercial, applications require different MPI implementations. And even when the code is self-developed, developers often prefer one MPI implementation over another.

According to the customer's wishes, we install various compilers and MPI middleware, as well as job management systems like ParaStation, Grid Engine, Torque/Maui, or the very powerful Moab HPC Suite for high-level cluster management.
transtec HPC as a Service

You will get a range of applications like LS-Dyna, ANSYS, Gromacs, NAMD, etc. from all kinds of areas, pre-installed, integrated into an enterprise-ready cloud and workload management system, and ready to run. Do you miss your application?
> Ask us: [email protected]
transtec Platform as a Service

You will be provided with dynamically deployed compute nodes for running your individual code. The operating system will be pre-installed according to your requirements. Common Linux distributions like Red Hat, CentOS, or SLES are the standard. Do you need another distribution?
> Ask us: [email protected]
transtec Hosting as a Service

You will be provided with hosting space inside a professionally managed and secured datacenter where you can have your machines hosted, managed, and maintained according to your requirements. Thus, you can build up your own private cloud. What range of hosting and maintenance services do you need?
> Tell us: [email protected]
HPC @ transtec: Services and Customer Care from A to Z

transtec AG has over 30 years of experience in scientific computing and is one of the earliest manufacturers of HPC clusters. For nearly a decade, transtec has delivered highly customized High Performance Computing clusters based on standard components to academic and industry customers across Europe, with all the high quality standards and the customer-centric approach that transtec is well known for. Every transtec HPC solution is more than just a rack full of hardware – it is a comprehensive solution with everything the HPC user, owner, and operator need. In the early stages of any customer's HPC project, transtec experts provide extensive and detailed consulting; customers benefit from their expertise and experience. Consulting is followed by benchmarking of different systems with either specifically crafted customer code or generally accepted benchmarking routines; this aids customers in sizing and devising the optimal, detailed HPC configuration. Each and every piece of HPC hardware that leaves our factory undergoes a burn-in procedure of 24 hours or
more if necessary. We make sure that any hardware shipped meets
our and our customers’ quality requirements. transtec HPC solutions
are turnkey solutions. By default, a transtec HPC cluster has every-
thing installed and configured – from hardware and operating sys-
tem to important middleware components like cluster management
or developer tools and the customer’s production applications.
Onsite delivery means onsite integration into the customer’s pro-
duction environment, be it establishing network connectivity to the
corporate network, or setting up software and configuration parts.
transtec HPC clusters are ready-to-run systems – we deliver, you turn the key, the system delivers high performance. Every HPC project entails a transfer to production: IT operation processes and policies apply to the new HPC system. IT personnel is trained hands-on and introduced to the hardware components and software, including all operational aspects of configuration management.
transtec services do not stop when the implementation project ends. Beyond transfer to production, transtec takes care.
offers a variety of support and service options, tailored to the cus-
tomer’s needs. When you are in need of a new installation, a major
reconfiguration or an update of your solution – transtec is able to
support your staff and, if you lack the resources for maintaining the
cluster yourself, maintain the HPC solution for you.
From Professional Services to Managed Services for daily opera-
tions and required service levels, transtec will be your complete HPC
service and solution provider. transtec’s high standards of perfor-
mance, reliability and dependability assure your productivity and
complete satisfaction. transtec’s offerings of HPC Managed Services
offer customers the possibility of having the complete management
and administration of the HPC cluster managed by transtec service
specialists, in an ITIL-compliant way. Moreover, transtec's HPC on Demand services provide customers with access to HPC resources whenever they need them – for example, because they do not have the possibility of owning and running an HPC cluster themselves, due to lacking infrastructure, know-how, or admin staff.
transtec HPC Cloud Services

Last but not least, transtec's services portfolio evolves as customers' demands change. Starting this year, transtec is able to provide HPC Cloud Services. transtec uses a dedicated datacenter to provide computing power to customers who need more capacity than they own, which is why this workflow model is sometimes called computing-on-demand. With these dynamically provided resources, customers have the possibility of running their jobs on HPC nodes in a dedicated datacenter, professionally managed, secured, and individually customizable. Numerous standard applications like ANSYS, LS-Dyna, OpenFOAM, as well as lots of codes like Gromacs, NAMD, VMD, and others are pre-installed, integrated into an enterprise-ready cloud and workload management environment, and ready to run.
Alternatively, for customers who lack the space or the cooling and power infrastructure to host their own HPC equipment, transtec also provides Hosting Services: the equipment is professionally hosted, maintained, and managed. Customers can thus build up their own private cloud!
Are you interested in any of transtec’s broad range of HPC related
services? Write us an email to [email protected]. We’ll be happy to
hear from you!
High Performance Computing

Services and Customer Care from A to Z
Are you an end user?

Then doing your work as productively as possible is of utmost importance to you! You are interested in an IT environment running smoothly, with a fully functional infrastructure, in competent and professional support by internal administrators and external service providers, and in a service desk with guaranteed availability during working hours. HPC cloud services ensure you are able to run your simulation jobs even if computational requirements exceed the local capacities.
– support services
– professional services
– managed services
– cloud services
Are you an administrator?

Then you are responsible for providing a well-performing IT infrastructure and for supporting the end users when problems occur. In situations where problem management fully exploits your capacity, or in cases where you just don't have the time or perhaps lack experience in performing certain tasks, you might be glad to be assisted by specialized IT experts.
– support services
– professional services
Are you an R & D manager?

Then your job is to ensure that developers, researchers, and other end users are able to do their job. A partner at hand that provides you with a full range of operations support – either on a completely flexible, on-demand basis or in a fully managed way based on ITIL best practices – is heartily welcome.
– professional services
– managed services
Services Continuum (Plan – Build – Run)

– Consulting Services (Paid to Advise): Performance Analysis, Technology Assessment, Infrastructure Design
– Professional Services (Paid to Do): Installation, Integration, Migration, Upgrades, Configuration Changes
– Support Services (Paid to Support)
– Managed Services (Paid to Operate): Service Desk, Event Monitoring and Notification, Release Management, Capacity Management
– Cloud Services (Paid to Use): Hosting, Infrastructure (IaaS), HPC (HPCaaS), Platform (PaaS)
Are you a CIO?

Then your main concern is a smoothly running IT operations environment with a maximum of efficiency and effectiveness. Managed Services to ensure business continuity are a main part of your considerations regarding business objectives. HPC cloud services are a vital part of your capacity planning to ensure the availability of computational resources to your internal clients, the R & D departments.

– managed services
– cloud services
Are you a CEO?

Profitability, time-to-result, business continuity, and ensuring future viability are topics that you continuously have to think about. You are happy to have a partner at hand who gives you insight into future developments, provides you with technology consulting, and is able to help you ensure business continuity and fall-back scenarios, as well as scalability readiness, by means of IT capacity outsourcing solutions.

– cloud services
– consulting services
OpenStack for HPC Cloud Environments

OpenStack is a set of software tools for building and managing cloud computing platforms for public and private clouds. Backed by some of the biggest companies in software development and hosting, as well as thousands of individual community members, many think that OpenStack is the future of cloud computing. OpenStack is managed by the OpenStack Foundation, a non-profit organization which oversees both development and community-building around the project.
What is OpenStack?

Introduction to OpenStack

OpenStack lets users deploy virtual machines and other instances which handle different tasks for managing a cloud environment on the fly. It makes horizontal scaling easy, which means that tasks which benefit from running concurrently can easily serve more or fewer users on the fly by just spinning up more instances.
For example, a mobile application which needs to communicate
with a remote server might be able to divide the work of commu-
nicating with each user across many different instances, all
communicating with one another but scaling quickly and easily
as the application gains more users.
And most importantly, OpenStack is open source software, which
means that anyone who chooses to can access the source code,
make any changes or modifications they need, and freely share
these changes back out to the community at large. It also means
that OpenStack has the benefit of thousands of developers all
over the world working in tandem to develop the strongest, most
robust, and most secure product that they can.
How is OpenStack used in a cloud environment?

The cloud is all about providing computing for end users in a remote environment, where the actual software runs as a service on reliable and scalable servers rather than on each end user's computer. Cloud computing can refer to a lot of different things,
but typically the industry talks about running different items “as
a service” – software, platforms, and infrastructure. OpenStack
falls into the latter category and is considered Infrastructure as
a Service (IaaS). Providing infrastructure means that OpenStack
makes it easy for users to quickly add new instances, upon which
other cloud components can run. Typically, the infrastructure
then runs a “platform” upon which a developer can create soft-
ware applications which are delivered to the end users.
What are the components of OpenStack?

OpenStack is made up of many different moving parts. Because of its open nature, anyone can add additional components to OpenStack to help it to meet their needs. But the OpenStack community has collaboratively identified nine key components that are a part of the "core" of OpenStack, which are distributed as a part of any OpenStack system and officially maintained by the OpenStack community.
– Nova is the primary computing engine behind OpenStack. It
is used for deploying and managing large numbers of virtual
machines and other instances to handle computing tasks.
– Swift is a storage system for objects and files. Rather than the traditional idea of referring to files by their location on a disk drive, developers can instead refer to a unique identifier referring to the file or piece of information and let OpenStack decide where to store this information. This makes scaling easy, as developers don't have to worry about the capacity of a single system behind the software. It also allows the system, rather than the developer, to worry about how best to make sure that data is backed up in case of the failure of a machine or network connection.
– Cinder is a block storage component, which is more analo-
gous to the traditional notion of a computer being able to
access specific locations on a disk drive. This more tradi-
tional way of accessing files might be important in scenarios
in which data access speed is the most important consider-
ation.
– Neutron provides the networking capability for OpenStack.
It helps to ensure that each of the components of an Open-
Stack deployment can communicate with one another
quickly and efficiently.
– Horizon is the dashboard behind OpenStack. It is the only graphical interface to OpenStack, so for users wanting to give OpenStack a try, this may be the first component they actually "see." Developers can access all of the components of OpenStack individually through an application programming interface (API), but the dashboard gives system administrators a look at what is going on in the cloud and lets them manage it as needed.
– Keystone provides identity services for OpenStack. It is
essentially a central list of all of the users of the OpenStack
cloud, mapped against all of the services provided by the
cloud which they have permission to use. It provides multiple
means of access, meaning developers can easily map their
existing user access methods against Keystone.
– Glance provides image services to OpenStack. In this case,
“images” refers to images (or virtual copies) of hard disks.
Glance allows these images to be used as templates when
deploying new virtual machine instances.
– Ceilometer provides telemetry services, which allow the
cloud to provide billing services to individual users of the
cloud. It also keeps a verifiable count of each user’s system
usage of each of the various components of an OpenStack
cloud. Think metering and usage reporting.
– Heat is the orchestration component of OpenStack, which
allows developers to store the requirements of a cloud appli-
cation in a file that defines what resources are necessary for
that application. In this way, it helps to manage the infra-
structure needed for a cloud service to run.
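As a rough sketch of how Heat captures an application's resource requirements in a file, a minimal HOT template might look like the following. This fragment is illustrative only and not taken from this document; the image, flavor, and network names are assumptions that would have to exist in the target cloud.

```yaml
heat_template_version: 2015-10-15

description: Illustrative stack - a single Nova server

resources:
  web_server:
    type: OS::Nova::Server
    properties:
      image: cirros            # a Glance image name (assumed to exist)
      flavor: m1.small         # a Nova flavor (assumed to exist)
      networks:
        - network: private     # a Neutron network (assumed to exist)
```

Handing such a file to Heat (for example with `openstack stack create -t server.yaml mystack`) lets Heat call Nova, Glance, and Neutron on the user's behalf to realize the declared resources.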
Object Storage with Ceph

Why "Ceph"?
“Ceph” is an odd name for a file system and breaks the typical
acronym trend that most follow. The name is a reference to the
mascot at UCSC (Ceph’s origin), which happens to be “Sammy,”
the banana slug, a shell-less mollusk in the cephalopods class.
Cephalopods, with their multiple tentacles, provide a great meta-
phor for a distributed file system.
Ceph goals

Developing a distributed file system is a complex endeavor, but it's immensely valuable if the right problems are solved. Ceph's goals can be simply defined as:
– Easy scalability to multi-petabyte capacity
– High performance over varying workloads (input/output operations per second [IOPS] and bandwidth)
– Strong reliability
Unfortunately, these goals can compete with one another (for
example, scalability can reduce or inhibit performance or impact
reliability). Ceph has developed some very interesting concepts
(such as dynamic metadata partitioning and data distribution
and replication), which this article explores shortly. Ceph’s design
also incorporates fault-tolerance features to protect against
single points of failure, with the assumption that storage failures
on a large scale (petabytes of storage) will be the norm rather
than the exception. Finally, its design does not assume particular
workloads but includes the ability to adapt to changing distrib-
uted workloads to provide the best performance. It does all of
this with the goal of POSIX compatibility, allowing it to be trans-
parently deployed for existing applications that rely on POSIX
semantics (through Ceph-proposed enhancements). Finally, Ceph
is open source distributed storage and part of the mainline Linux
kernel (2.6.34).
[Figure: OpenStack Overview – users and administrators consume Infrastructure as a Service through the API (REST), CLI (python client), and UI (Horizon). Services include Compute (Nova, Ironic), Network (Neutron), Image Management (Glance), Storage (Swift, Cinder, Manila), Data Processing (Sahara), Key Management (Barbican), Database Management (Trove), Message Queue (Zaqar), DNS Management (Designate), Deployment (TripleO), Telemetry (Ceilometer), Orchestration (Heat), and Identity (Keystone), with the Common Library (Oslo) and a shared message queue/database(s) underneath.]
Ceph architecture

Now, let's explore the Ceph architecture and its core elements at a high level. I then dig down another level to identify some of the key aspects of Ceph to provide a more detailed exploration.
The Ceph ecosystem can be broadly divided into four segments
(see Figure 1): clients (users of the data), metadata servers (which
cache and synchronize the distributed metadata), an object
storage cluster (which stores both data and metadata as objects
and implements other key responsibilities), and finally the cluster
monitors (which implement the monitoring functions).
As Figure 1 shows, clients perform metadata operations (to iden-
tify the location of data) using the metadata servers. The meta-
data servers manage the location of data and also where to store
new data. Note that metadata is stored in the storage cluster (as
indicated by “Metadata I/O”).
[Figure 1: Conceptual architecture of the Ceph ecosystem – clients perform metadata ops against the metadata server cluster and file (object) I/O against the object storage cluster; metadata I/O flows between the metadata servers and the object storage cluster, and the cluster monitors watch over the ecosystem.]
Actual file I/O occurs between the client and object storage
cluster. In this way, higher-level POSIX functions (such as open,
close, and rename) are managed through the metadata servers,
whereas POSIX functions (such as read and write) are managed
directly through the object storage cluster.
Another perspective of the architecture is provided in Figure 2. A
set of servers access the Ceph ecosystem through a client inter-
face, which understands the relationship between metadata
servers and object-level storage. The distributed storage system
can be viewed in a few layers, including a format for the storage
devices (the Extent and B-tree-based Object File System [EBOFS]
or an alternative) and an overriding management layer designed
to manage data replication, failure detection, and recovery and
subsequent data migration called Reliable Autonomic Distrib-
uted Object Storage (RADOS). Finally, monitors are used to iden-
tify component failures, including subsequent notification.
[Figure 2: Simplified layered view of the Ceph ecosystem – servers access storage through the Ceph client interface and the metadata servers; object storage daemons sit on top of BTRFS/EBOFS/... on the storage devices.]
Ceph components

With the conceptual architecture of Ceph under your belt, you can dig down another level to see the major components imple-
mented within the Ceph ecosystem. One of the key differences
between Ceph and traditional file systems is that rather than
focusing the intelligence in the file system itself, the intelligence
is distributed around the ecosystem.
Figure 3 shows a simple Ceph ecosystem. The Ceph Client is the
user of the Ceph file system. The Ceph Metadata Daemon provides
the metadata services, while the Ceph Object Storage Daemon
provides the actual storage (for both data and metadata). Finally,
the Ceph Monitor provides cluster management. Note that
there can be many Ceph clients, many object storage endpoints,
numerous metadata servers (depending on the capacity of the
file system), and at least a redundant pair of monitors. So, how is
this file system distributed?
[Figure 3: Simple Ceph ecosystem – applications/users access Ceph through an (enhanced) POSIX interface and the VFS in the Linux kernel; the Ceph client talks to cmds (the Ceph metadata daemon, with a DRAM cache), cosd (the Ceph object storage daemon, running on BTRFS/EBOFS/... over the disks), and cmon (the Ceph monitor).]
Kernel or user space

Early versions of Ceph utilized Filesystem in Userspace (FUSE), which pushes the file system into user space and can greatly simplify its development. But today, Ceph has been integrated into the mainline kernel, making it faster, because user space context switches are no longer necessary for file system I/O.
Ceph client

As Linux presents a common interface to the file systems (through the virtual file system switch [VFS]), the user's perspective of Ceph is transparent. The administrator's perspective will
certainly differ, given the potential for many servers encom-
passing the storage system (see the Resources section for infor-
mation on creating a Ceph cluster). From the users’ point of view,
they have access to a large storage system and are not aware of
the underlying metadata servers, monitors, and individual object
storage devices that aggregate into a massive storage pool.
Users simply see a mount point, from which standard file I/O can
be performed.
The Ceph file system–or at least the client interface–is imple-
mented in the Linux kernel. Note that in the vast majority of
file systems, all of the control and intelligence is implemented
within the kernel’s file system source itself. But with Ceph, the
file system’s intelligence is distributed across the nodes, which
simplifies the client interface but also provides Ceph with the
ability to massively scale (even dynamically).
Rather than rely on allocation lists (metadata to map blocks on
a disk to a given file), Ceph uses an interesting alternative. A file
from the Linux perspective is assigned an inode number (INO)
from the metadata server, which is a unique identifier for the
file. The file is then carved into some number of objects (based
on the size of the file). Using the INO and the object number
(ONO), each object is assigned an object ID (OID). Using a simple
hash over the OID, each object is assigned to a placement group.
The placement group (identified as a PGID) is a conceptual
container for objects. Finally, the mapping of the placement
group to object storage devices is a pseudo-random mapping
using an algorithm called Controlled Replication Under Scalable
Hashing (CRUSH). In this way, mapping of placement groups (and
replicas) to storage devices does not rely on any metadata but
instead on a pseudo-random mapping function. This behavior is
ideal, because it minimizes the overhead of storage and simpli-
fies the distribution and lookup of data.
The final component for allocation is the cluster map. The cluster
map is an efficient representation of the devices representing
the storage cluster. With a PGID and the cluster map, you can
locate any object.
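The allocation scheme just described – file to objects, objects to placement groups, placement groups to devices – can be sketched in a few lines of Python. This is a toy illustration only: the stable hash standing in for the real CRUSH algorithm, and all names, sizes, and counts, are assumptions made for the example.

```python
import hashlib

def stable_hash(s: str) -> int:
    # Deterministic hash (Python's built-in hash() is randomized per process)
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

def object_id(ino: int, ono: int) -> str:
    # Each object is named by the file's inode number (INO) plus the object number (ONO)
    return f"{ino:x}.{ono:08x}"

def placement_group(oid: str, num_pgs: int) -> int:
    # A simple hash over the OID assigns the object to a placement group (PGID)
    return stable_hash(oid) % num_pgs

def osds_for_pg(pgid: int, osds: list, replicas: int = 2) -> list:
    # Stand-in for CRUSH: pseudo-randomly rank storage devices for this PG.
    # Any client holding the cluster map (here: just the OSD list) computes
    # the same mapping, so no per-object allocation metadata is needed.
    return sorted(osds, key=lambda osd: stable_hash(f"{pgid}:{osd}"))[:replicas]

# A 100 MB file carved into 4 MB objects:
ino, object_size, file_size = 0x1234, 4 << 20, 100 << 20
num_objects = (file_size + object_size - 1) // object_size
oids = [object_id(ino, ono) for ono in range(num_objects)]
pgid = placement_group(oids[0], num_pgs=256)
print(num_objects, pgid, osds_for_pg(pgid, ["osd.%d" % i for i in range(8)]))
```

Note how locating an object requires no lookup table: given only the OID, the PG count, and the cluster map, every participant derives the same placement.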
The Ceph metadata server

The job of the metadata server (cmds) is to manage the file system's namespace. Although both metadata and data are
stored in the object storage cluster, they are managed separately
to support scalability. In fact, metadata is further split among
a cluster of metadata servers that can adaptively replicate and
distribute the namespace to avoid hot spots. As shown in Figure
4, the metadata servers manage portions of the namespace and
can overlap (for redundancy and also for performance). The
mapping of metadata servers to namespace is performed in
Ceph using dynamic subtree partitioning, which allows Ceph to
adapt to changing workloads (migrating namespaces between
metadata servers) while preserving locality for performance.
But because each metadata server simply manages the
namespace for the population of clients, its primary applica-
tion is an intelligent metadata cache (because actual metadata
is eventually stored within the object storage cluster). Metadata
to write is cached in a short-term journal, which eventually is
pushed to physical storage. This behavior allows the meta-
data server to serve recent metadata back to clients (which is
common in metadata operations). The journal is also useful for
failure recovery: if the metadata server fails, its journal can be
replayed to ensure that metadata is safely stored on disk.
Metadata servers manage the inode space, converting file names
to metadata. The metadata server transforms the file name into
an inode, file size, and striping data (layout) that the Ceph client
uses for file I/O.
Ceph monitors

Ceph includes monitors that implement management of the cluster map, but some elements of fault management are implemented in the object store itself. When object storage devices fail
or new devices are added, monitors detect and maintain a valid
cluster map. This function is performed in a distributed fashion
where map updates are communicated with existing traffic.
Ceph uses Paxos, which is a family of algorithms for distributed
consensus.
Ceph object storage

Similar to traditional object storage, Ceph storage nodes include
not only storage but also intelligence. Traditional drives are
simple targets that only respond to commands from initiators.
But object storage devices are intelligent devices that act as both
targets and initiators to support communication and collabora-
tion with other object storage devices.
[Figure 4: Partitioning of the Ceph namespace across metadata servers MDS₀–MDS₄]
From a storage perspective, Ceph object storage devices perform
the mapping of objects to blocks (a task traditionally done at
the file system layer in the client). This behavior allows the local
entity to best decide how to store an object. Early versions of
Ceph implemented a custom low-level file system on the local
storage called EBOFS. This system implemented a nonstandard
interface to the underlying storage tuned for object seman-
tics and other features (such as asynchronous notification of
commits to disk). Today, the B-tree file system (BTRFS) can be
used at the storage nodes, which already implements some of
the necessary features (such as embedded integrity).
Because the Ceph clients implement CRUSH and do not have
knowledge of the block mapping of files on the disks, the under-
lying storage devices can safely manage the mapping of objects
to blocks. This allows the storage nodes to replicate data (when
a device is found to have failed). Distributing the failure recovery
also allows the storage system to scale, because failure detec-
tion and recovery are distributed across the ecosystem. Ceph
calls this RADOS (see Figure 3).
Other features of interest

As if the dynamic and adaptive nature of the file system weren't
enough, Ceph also implements some interesting features visible
to the user. Users can create snapshots, for example, in Ceph on
any subdirectory (including all of the contents). It’s also possible
to perform file and capacity accounting at the subdirectory level,
which reports the storage size and number of files for a given
subdirectory (and all of its nested contents).
Ceph status and future

Although Ceph is now integrated into the mainline Linux kernel,
it’s properly noted there as experimental. File systems in this state
are useful to evaluate but are not yet ready for production envi-
ronments. But given Ceph’s adoption into the Linux kernel and
the motivation by its originators to continue its development,
it should be available soon to solve your massive storage needs.
Application Containers with Docker

Transforming Business Through Software

Gone are the days of private datacenters running off-the-shelf software and giant monolithic code bases that you updated once a year. Everything has changed. Whether it is moving to the cloud, migrating between clouds, modernizing legacy applications, or building new apps and data structures, the desired result is always the same – speed. How fast you can move defines your success as a company.
Software is the critical IP that defines your company even
if the actual product you are selling may be a t-shirt, a car,
or compounding interest. Software is how you engage your
customers, reach new users, understand their data, promote
your product or service and process their order.
To do this well, today’s software is going bespoke. Small pieces
of software that are designed for a very specific job are called
microservices. The design goal of microservices is to have each
service built with all of the necessary components to “run” a
specific job with just the right type of underlying infrastructure
resources. Then, these services are loosely coupled together so
they can be changed at anytime, without having to worry about
the service that comes before or after it.
This methodology, while great for continuous improvement,
poses many challenges in reaching the end state. First it creates
a new, ever-expanding matrix of services, dependencies and
infrastructure making it difficult to manage. Additionally it does
not account for the vast amounts of existing legacy applications
in the landscape, the heterogeneity up and down the application
stack and the processes required to ensure this works in practice.
The Docker Journey and the Power of AND

In 2013, Docker entered the landscape with application containers to build, ship and run applications anywhere. Docker was able to take software and its dependencies and package them up into a lightweight container. Much like physical shipping containers today, software containers are simply a standard unit of software that looks the same on the outside regardless of what code and dependencies are included on the inside. This enabled developers and sysadmins to transport them across infrastructures and various environments without requiring any modifications and regardless of the varying configurations in the different environments. The Docker journey begins here.
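As a concrete (and entirely hypothetical) illustration of this packaging idea, a minimal Dockerfile pins the runtime and carries the application's dependencies inside the image, so the same container runs unchanged from development through production. The base image and file names here are assumptions, not from this document.

```dockerfile
# Hypothetical Python service - the base image pins OS and runtime version
FROM python:3.5-slim

# Dependencies travel inside the image, not on the host
COPY requirements.txt /app/requirements.txt
RUN pip install -r /app/requirements.txt

# The application code itself
COPY . /app
WORKDIR /app

# A standard interface on the outside, whatever code on the inside
CMD ["python", "server.py"]
```

Built once (for example with `docker build -t myapp .`), the resulting image can be pushed to any registry and run on any Docker host.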
Agility: The speed and simplicity of Docker was an instant hit with developers and is what led to the meteoric rise of the open source project. Developers are now able to very simply package up any software and its dependencies into a container. Developers are able to use any language, version and tooling, because they are all packaged in a container, which "standardizes" all that heterogeneity without sacrifice.
Portability: By the very nature of the Docker technology, these same developers have realized their application containers are now portable – in ways not previously possible. They can ship their applications from development to test and production, and the code will work as designed every time. Differences in the environment do not affect what is inside the container, nor do developers need to change their application to make it work in production. This is also a boon for IT operations teams, as they can now move applications across datacenters or clouds to avoid vendor lock-in.
Control: As these applications move along the lifecycle to production, new questions around security, manageability and scale need to be answered. Docker "standardizes" your environment while maintaining the heterogeneity your business requires. Docker provides the ability to set the appropriate level of control and flexibility to maintain your service levels, performance and regulatory compliance. IT operations teams can provision, secure, monitor and scale the infrastructure and applications to maintain peak service levels. No two applications or businesses are alike, and Docker allows you to decide how to control your application environment.
At the core of the Docker journey is the power of AND. Docker
is the only solution to provide agility, portability and control for
developers and IT operations team across all stages of the appli-
cation lifecycle. From these core tenets, Containers as a Service
(CaaS) emerges as the construct by which these new applications
are built better and faster.
Docker Containers as a Service (CaaS)

What is Containers as a Service (CaaS)? It is an IT-managed and secured application environment of infrastructure and content where developers can, in a self-service manner, build and deploy applications.
[Diagram: CaaS workflow – developers BUILD in development environments, SHIP through secure content & collaboration (the registry), and IT operations RUN: deploy, manage, scale.]
In the CaaS diagram above, the development and IT operations teams collaborate through the registry. This is a service in which a library of secure and signed images can be maintained. From the registry, developers on the left are able to pull and build software at their own pace and then push the content back to the registry once it passes integration testing, to save the latest version. Depending on the internal processes, the deployment step is either automated with tools or performed manually.
The IT operations team on the right in the diagram above manages the different vendor contracts for the production infrastructure, such as compute, networking and storage. These teams are responsible for provisioning the compute resources needed for the application and use the Docker Universal Control Plane to monitor the clusters and applications over time. They can then move the apps from one cloud to another, or scale a service up or down to maintain peak performance.
Key Characteristics and Considerations

Docker CaaS provides a framework for organizations to unify the variety of systems, languages and tools in their environment and to apply the level of control, security or freedom required for their business. As a Docker-native solution with full support of the Docker API, Docker CaaS can seamlessly take an application from local development to production without changing the code, streamlining the deployment cycle.
The following characteristics form the minimum requirements for any organization's application environment. In this paradigm, development and IT operations teams are empowered to use the best tools for their respective jobs without worrying about breaking systems, each other's workflows, or lock-in.
1. The needs of developers and operations.
Many tools specifically address the functional needs of only
one team; however, CaaS breaks the cycle for continuous
improvement. To truly gain acceleration in the development
to production timeline, you need to address both users along
a continuum. Docker provides unique capabilities for each
team as well as a consistent API across the entire platform
for a seamless transition from one team to the next.
2. All stages in the application lifecycle.
From continuous integration to delivery and devops, these
practices are about eliminating the waterfall development
methodology and the lagging innovation cycles with it. By
providing tools for both the developer and IT operations,
Docker is able to seamlessly support an application from
build, test, stage to production.
3. Any language.
Developer agility means the freedom to build with whatever
language, version and tooling required for the features they
are building at that time. Also, the ability to run multiple
versions of a language at the same time provides a greater
level of flexibility. Docker allows your team to focus on
building the app instead of thinking of how to build an app
that works in Docker.
4. Any operating system.
The vast majority of organizations have more than one oper-
ating system. Some tools just work better in Linux while
others in Windows. Application platforms need to account for and support this diversity; otherwise they are solving only
part of the problem. Originally created for the Linux commu-
nity, Docker and Microsoft are bringing forward Windows
Server support to address the millions of enterprise applica-
tions in existence today and future applications.
5. Any infrastructure.
When it comes to infrastructure, organizations want
choice, backup and leverage. Whether that means you have
multiple private data centers, a hybrid cloud or multiple
cloud providers, the critical component is the ability to
move workloads from one environment to another, without
causing application issues. The Docker technology architec-
ture abstracts the infrastructure away from the application
allowing the application containers to be run anywhere and
portable across any other infrastructure.
6. Open APIs, pluggable architecture and ecosystem.
A platform isn’t really a platform if it is an island to itself.
Implementing new technologies is often not possible if you
need to re-tool your existing environment first. A funda-
mental guiding principle of Docker is a platform that is
open. Being open means APIs and plugins to make it easy for
you to leverage your existing investments and to fit Docker
into your environment and processes. This openness invites
a rich ecosystem to flourish and provide you with more
flexibility and choice in adding specialized capabilities to
your CaaS.
Numerous as they are, these characteristics are critical, because
the new bespoke application paradigms only invite greater
heterogeneity into your technical architecture. The Docker CaaS
platform is fundamentally designed to support that diversity,
while providing the appropriate controls to manage it at any scale.
OpenStack for HPC
Cloud Environments
Application Containers with Docker
Docker CaaS
The Docker CaaS platform is made possible by a suite of
integrated software solutions with a flexible deployment model
to meet the needs of your business.
• On-Premises/VPC: For organizations that need to keep their
IP within their own network, Docker Trusted Registry and Docker
Universal Control Plane can be deployed on-premises or in a VPC
and connected to your existing infrastructure and systems such
as storage, Active Directory/LDAP, monitoring and logging
solutions. Trusted Registry stores and manages images on your
storage infrastructure while also enforcing role-based access
control to the images. Universal Control Plane provides
visibility across your Docker environment, including Swarm
clusters, Trusted Registry repositories, containers and
multi-container applications.
• In the Cloud: For organizations that readily use SaaS
solutions, Docker Hub and Tutum by Docker provide a registry
service and control plane hosted and managed by Docker. Hub is
a cloud registry service for storing and managing your images
and user permissions. Tutum by Docker provisions and manages the
deployment clusters, and monitors and manages the deployed
applications. Connect to the cloud infrastructure of your choice
or bring your own physical nodes to deploy your application.
Your Docker CaaS can be designed to provide a centralized point
of control and management or allow for decentralized manage-
ment to empower individual application teams. The flexibility
allows you to create a model that is right for your business,
just as you choose your own infrastructure and processes. CaaS
extends that choice to how you build, ship and run applications.
Many IT initiatives are enabled, and in fact accelerated, by CaaS
due to its unifying nature across the environment. Each
organization has its own terms for these initiatives, but they
range from containerization, which may involve the “lift and
shift” of existing apps or the adoption of microservices, to
continuous integration, continuous delivery and DevOps, and
various flavors of cloud adoption, migration, hybrid and
multi-cloud.
In each scenario, Docker CaaS brings the agility, portability and
control to enable the adoption of those use cases across the
organization.
Diagram: CaaS Workflow – Build (Docker Toolbox), Ship (Docker Hub, Docker Trusted Registry), Run (Tutum by Docker, Docker Universal Control Plane)
Architecture of Docker
Docker is an open-source project that automates the deploy-
ment of applications inside software containers, by providing
an additional layer of abstraction and automation of operat-
ing-system-level virtualization on Linux. Docker uses the re-
source isolation features of the Linux kernel such as cgroups
and kernel namespaces, and a union-capable filesystem such as
AUFS and others to allow independent “containers” to run with-
in a single Linux instance, avoiding the overhead of starting and
maintaining virtual machines.
The Linux kernel’s support for namespaces mostly isolates an
application’s view of the operating environment, including pro-
cess trees, network, user IDs and mounted file systems, while
the kernel’s cgroups provide resource limiting, including the
CPU, memory, block I/O and network. Since version 0.9, Docker
includes the libcontainer library as its own way to directly use
virtualization facilities provided by the Linux kernel, in addi-
tion to using abstracted virtualization interfaces via libvirt, LXC
(Linux Containers) and systemd-nspawn.
As actions are performed on a Docker base image, union filesystem
layers are created and documented, such that each layer fully
describes how to recreate the action. This strategy enables Docker’s
lightweight images, as only layer updates need to be propagated
(compared to full VMs, for example).
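The copy-on-write layering idea can be sketched in a few lines of plain Python (a toy model for illustration, not Docker's actual implementation): each layer records only the files an action changed, and a read walks the layers from the top down, so an update shadows the original without shipping the whole image again.

```python
# Toy model of union-filesystem layering (plain Python for illustration,
# not Docker's actual implementation): each layer stores only the files
# an action changed; a read walks the layers from the top down, so an
# updated file shadows the original without copying the whole image.

class LayeredImage:
    def __init__(self):
        self.layers = []  # bottom-to-top list of {path: content} dicts

    def commit(self, changes):
        """Record one action as a new layer holding only its changes."""
        self.layers.append(dict(changes))

    def read(self, path):
        for layer in reversed(self.layers):  # topmost layer wins
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

image = LayeredImage()
image.commit({"/etc/os-release": "debian"})    # base image layer
image.commit({"/app/run.py": "print('v1')"})   # application layer
image.commit({"/app/run.py": "print('v2')"})   # update: only this layer ships

print(image.read("/app/run.py"))   # the newest layer shadows the original
```

Only the third, tiny layer needs to be propagated when the app is updated, which is exactly why Docker images stay lightweight compared to full VM images.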
Overview
Docker implements a high-level API to provide lightweight
containers that run processes in isolation.
Building on top of facilities provided by the Linux kernel (primar-
ily cgroups and namespaces), a Docker container, unlike a vir-
tual machine, does not require or include a separate operating
system. Instead, it relies on the kernel’s functionality and uses
resource isolation (CPU, memory, block I/O, network, etc.) and
separate namespaces to isolate the application’s view of the
operating system. Docker accesses the Linux kernel’s virtualiza-
tion features either directly using the libcontainer library, which
is available as of Docker 0.9, or indirectly via libvirt, LXC (Linux
Containers) or systemd-nspawn.
By using containers, resources can be isolated, services restrict-
ed, and processes provisioned to have an almost completely
private view of the operating system with their own process ID
space, file system structure, and network interfaces. Multiple
containers share the same kernel, but each container can be con-
strained to only use a defined amount of resources such as CPU,
memory and I/O.
Using Docker to create and manage containers may simplify
the creation of highly distributed systems by allowing multiple
applications, worker tasks and other processes to run autono-
mously on a single physical machine or across multiple virtual
machines. This allows the deployment of nodes to be performed
as the resources become available or when more nodes are need-
ed, allowing a platform as a service (PaaS)-style of deployment
and scaling for systems like Apache Cassandra, MongoDB or Riak.
Docker also simplifies the creation and operation of task or work-
load queues and other distributed systems.
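The PaaS-style scaling described above boils down to matching the number of worker containers to the backlog. A minimal, hypothetical sketch (function name and numbers are invented for illustration):

```python
import math

# Hypothetical sketch of PaaS-style scaling (names and numbers invented):
# size the pool of worker containers to the job backlog, capped by the
# resources currently available.

def workers_needed(queue_length, jobs_per_worker, max_workers):
    """Scale the worker count with the backlog, up to available capacity."""
    if queue_length == 0:
        return 0
    return min(max_workers, math.ceil(queue_length / jobs_per_worker))

print(workers_needed(95, 10, 8))  # backlog wants 10 workers, capacity caps at 8
print(workers_needed(25, 10, 8))  # 3 workers are enough
```

Because containers start in seconds, such a loop can react to load far faster than one that boots full virtual machines.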
Integration
Docker can be integrated into various infrastructure tools, in-
cluding Amazon Web Services, Ansible, CFEngine, Chef, Google
Cloud Platform, IBM Bluemix, Jelastic, Jenkins, Microsoft Azure,
OpenStack Nova, OpenSVC, HPE Helion Stackato, Puppet, Salt,
Vagrant, and VMware vSphere Integrated Containers.
The Cloud Foundry Diego project integrates Docker into the
Cloud Foundry PaaS.
The GearD project aims to integrate Docker into Red Hat’s
OpenShift Origin PaaS.
Diagram: Docker can use different interfaces (libcontainer, libvirt, LXC, systemd-nspawn) to access virtualization features of the Linux kernel (cgroups, namespaces, SELinux, AppArmor, Netfilter, Netlink, capabilities).
Advanced Cluster Management Made Easy
High Performance Computing (HPC) clusters serve a crucial role at research-intensive
organizations in industries such as aerospace, meteorology, pharmaceuticals, and oil
and gas exploration.
The thing they have in common is a requirement for large-scale computations. Bright’s
solution for HPC puts the power of supercomputing within the reach of just about any
organization.
Engineering | Life Sciences | Automotive | Price Modelling | Aerospace | CAE | Data Analytics
The Bright Advantage
Bright Cluster Manager for HPC makes it easy to build and
operate HPC clusters using the server and networking equipment
of your choice, tying them together into a comprehensive,
easy-to-manage solution.
Bright Cluster Manager for HPC lets customers deploy complete
clusters over bare metal and manage them effectively. It pro-
vides single-pane-of-glass management for the hardware, the
operating system, HPC software, and users. With Bright Cluster
Manager for HPC, system administrators can get clusters up and
running quickly and keep them running reliably throughout their
life cycle – all with the ease and elegance of a fully featured, en-
terprise-grade cluster manager.
Bright Cluster Manager was designed by a group of highly expe-
rienced cluster specialists who perceived the need for a funda-
mental approach to the technical challenges posed by cluster
management. The result is a scalable and flexible architecture,
forming the basis for a cluster management solution that is ex-
tremely easy to install and use, yet is suitable for the largest and
most complex clusters.
Scale clusters to thousands of nodes
Bright Cluster Manager was designed to scale to thousands of
nodes. It is not dependent on third-party (open source) software
that was not designed for this level of scalability. Many advanced
features for handling scalability and complexity are included:
Management Daemon with Low CPU Overhead
The cluster management daemon (CMDaemon) runs on every
node in the cluster, yet has a very low CPU load – it will not cause
any noticeable slow-down of the applications running on your
cluster.
Because the CMDaemon provides all cluster management func-
tionality, the only additional daemon required is the workload
manager (or queuing system) daemon. For example, no daemons
for monitoring tools such as Ganglia or Nagios are required.
This minimizes the overall load of cluster management software
on your cluster.
The cluster installer takes you through the installation process and offers advanced options such as “Express” and “Remote”.
Multiple Load-Balancing Provisioning Nodes
Node provisioning – the distribution of software from the head
node to the regular nodes – can be a significant bottleneck in
large clusters. With Bright Cluster Manager, provisioning capabil-
ity can be off-loaded to regular nodes, ensuring scalability to a
virtually unlimited number of nodes.
When a node requests its software image from the head node,
the head node checks which of the nodes with provisioning ca-
pability has the lowest load and instructs the node to download
or update its image from that provisioning node.
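The selection step can be sketched as follows (a simplified illustration; node names and the load metric are invented, and the actual algorithm may weigh additional factors):

```python
# Simplified illustration of load-balanced provisioning (node names and
# load values invented; the real implementation may weigh more factors):
# the head node answers each image request with the least-loaded
# provisioning node.

def pick_provisioning_node(loads):
    """Return the name of the provisioning node with the lowest load."""
    return min(loads, key=loads.get)

loads = {"head": 0.9, "node001": 0.2, "node002": 0.6}
print(pick_provisioning_node(loads))  # node001 serves the next image request
```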
Synchronized Cluster Management Daemons
The CMDaemons are synchronized in time to execute tasks in
exact unison. This minimizes the effect of operating system jit-
ter (OS jitter) on parallel applications. OS jitter is the “noise” or
“jitter” caused by daemon processes and asynchronous events
such as interrupts. OS jitter cannot be totally prevented, but for
parallel applications it is important that any OS jitter occurs at
the same moment in time on every compute node.
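The effect of synchronization can be illustrated with a small sketch (the interval value is invented): if every daemon defers its housekeeping to the next multiple of a shared interval, the jitter-producing activity coincides across all nodes.

```python
import math

# Sketch of the synchronization idea (interval invented for illustration):
# if every CMDaemon defers housekeeping to the next multiple of a shared
# interval, the resulting OS jitter hits all compute nodes at the same time.

def next_synchronized_slot(now, interval=30.0):
    """Next wall-clock time that is an exact multiple of the interval."""
    return math.ceil(now / interval) * interval

# Daemons that reach this point at slightly different times still agree:
print(next_synchronized_slot(1000.4))  # 1020.0
print(next_synchronized_slot(1012.7))  # 1020.0
```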
Built-In Redundancy
Bright Cluster Manager supports redundant head nodes and re-
dundant provisioning nodes that can take over from each other
in case of failure. Both automatic and manual failover configura-
tions are supported.
Other redundancy features include support for hardware and
software RAID on all types of nodes in the cluster.
Diskless Nodes
Bright Cluster Manager supports diskless nodes, which can be
useful to increase overall Mean Time Between Failure (MTBF),
particularly on large clusters. Nodes with only InfiniBand and no
Ethernet connection are possible too.
Bright Cluster Manager 7 for HPC
For organizations that need to easily deploy and manage HPC
clusters, Bright Cluster Manager for HPC lets customers deploy
complete clusters over bare metal and manage them effectively.
It provides single-pane-of-glass management for the hardware,
the operating system, the HPC software, and users.
With Bright Cluster Manager for HPC, system administrators can
quickly get clusters up and running and keep them running relia-
bly throughout their life cycle – all with the ease and elegance of
a fully featured, enterprise-grade cluster manager.
Features
• Simple Deployment Process – Just answer a few questions
about your cluster and Bright takes care of the rest. It installs
all of the software you need and creates the necessary
configuration files for you, so you can relax and get your new
cluster up and running right – first time, every time.
Head Nodes & Regular Nodes
A cluster can have different types of nodes, but it will always
have at least one head node and one or more “regular” nodes.
A regular node is a node which is controlled by the head node
and which receives its software image from the head node or
from a dedicated provisioning node.
Regular nodes come in different flavors. Some examples include:
• Failover Node – A node that can take over all functionalities
of the head node when the head node becomes unavailable.
• Compute Node – A node which is primarily used for computations.
• Login Node – A node which provides login access to users of
the cluster. It is often also used for compilation and job
submission.
• I/O Node – A node which provides access to disk storage.
• Provisioning Node – A node from which other regular nodes can
download their software image.
• Workload Management Node – A node that runs the central
workload manager, also known as the queuing system.
• Subnet Management Node – A node that runs an InfiniBand
subnet manager.
More types of nodes can easily be defined and configured with
Bright Cluster Manager. In simple clusters there is only one
type of regular node – the compute node – and the head node
fulfills all cluster management roles described in the regular
node types mentioned above.
All types of nodes (head nodes and regular nodes) run the same
CMDaemon. However, depending on the roles that have been
assigned to it, the CMDaemon fulfills different tasks. A node
might, for example, hold a ‘Login Role’, a ‘Storage Role’ and a
‘PBS Client Role’ at the same time.
Elements of the Architecture
The architecture of Bright Cluster Manager is implemented by
the following key elements:
• The Cluster Management Daemon (CMDaemon)
• The Cluster Management Shell (CMSH)
• The Cluster Management GUI (CMGUI)
Connection Diagram
A Bright cluster is managed by the CMDaemon on the head node,
which communicates with the CMDaemons on the other nodes
over encrypted connections.
The CMDaemon on the head node exposes an API which can also
be accessed from outside the cluster. This is used by the cluster
management Shell and the cluster management GUI, but it can
also be used by other applications if they are programmed to ac-
cess the API. The API documentation is available on request.
The diagram below shows a cluster with one head node, three
regular nodes and the CMDaemon running on each node.
• Can Install on Bare Metal – With Bright Cluster Manager there
is nothing to pre-install. You can start with bare-metal
servers, and we will install and configure everything you need.
Go from pallet to production in less time than ever.
• Comprehensive Metrics and Monitoring – Bright Cluster Manager
really shines when it comes to showing you what’s going on in
your cluster. Its beautiful graphical user interface lets you
monitor resources and services across the entire cluster in
real time.
• Powerful User Interfaces – With Bright Cluster Manager you
don’t just get one great user interface, we give you two. That
way, you can choose to provision, monitor, and manage your HPC
clusters with a traditional command-line interface or our
beautiful, intuitive graphical user interface.
• Optimize the Use of IT Resources by HPC Applications – Bright
ensures HPC applications get the resources they require
according to the policies of your organization. Your cluster
will prioritize workloads according to your business priorities.
• HPC Tools and Libraries Included – Every copy of Bright
Cluster Manager comes with a complete set of HPC tools and
libraries, so you will be ready to develop, debug, and deploy
your HPC code right away.
“The building blocks for transtec HPC solutions must be chosen
according to our goals of ease of management and ease of use.
With Bright Cluster Manager, we are happy to have the technology
leader at hand, meeting these requirements, and our customers
value that.”
Jörg Walter | HPC Solution Engineer
Monitor your entire cluster at a glance with Bright Cluster Manager
Quickly build and deploy an HPC cluster
When you need to get an HPC cluster up and running, don’t waste
time cobbling together a solution from various open source
tools. Bright’s solution comes with everything you need to set up
a complete cluster from bare metal and manage it with a beauti-
ful, powerful user interface.
Whether your cluster is on-premise or in the cloud, Bright is a
virtual one-stop-shop for your HPC project.
Build a cluster in the cloud
Bright’s dynamic cloud provisioning capability lets you build an
entire cluster in the cloud or expand your physical cluster into
the cloud for extra capacity. Bright allocates the compute re-
sources in Amazon Web Services automatically, and on demand.
Build a hybrid HPC/Hadoop cluster
Does your organization need HPC clusters for technical comput-
ing and Hadoop clusters for Big Data? The Bright Cluster Man-
ager for Apache Hadoop add-on enables you to easily build and
manage both types of clusters from a single pane of glass. Your
system administrators can monitor all cluster operations, man-
age users, and repurpose servers with ease.
Driven by research
Bright has been building enterprise-grade cluster management
software for more than a decade. Our solutions are deployed in
thousands of locations around the globe. Why reinvent the wheel
when you can rely on our experts to help you implement the
ideal cluster management solution for your organization?
Clusters in the cloud
Bright Cluster Manager can provision and manage clusters that
are running on virtual servers inside the AWS cloud as if they
were local machines. Use this feature to build an entire cluster
in AWS from scratch or extend a physical cluster into the cloud
when you need extra capacity.
Use native services for storage in AWS – Bright Cluster Manager
lets you take advantage of inexpensive, secure, durable, flexible
and simple storage services for data use, archiving, and backup
in the AWS cloud.
Make smart use of AWS for HPC – Bright Cluster Manager can
save you money by instantiating compute resources in AWS only
when they are needed. It uses built-in intelligence to create in-
stances only after the data is ready for processing and the back-
log in on-site workloads requires it.
You decide which metrics to monitor. Just drag a component into the display area and Bright Cluster Manager instantly creates a graph of the data you need.
High performance meets efficiency
Initially, massively parallel systems constitute a challenge to
both administrators and users. They are complex beasts. Anyone
building HPC clusters will need to tame the beast, master the
complexity and present users and administrators with an easy-
to-use, easy-to-manage system landscape.
Leading HPC solution providers such as transtec achieve this
goal. They hide the complexity of HPC under the hood and match
high performance with efficiency and ease-of-use for both users
and administrators. The “P” in “HPC” gains a double meaning:
“Performance” plus “Productivity”.
Cluster and workload management software like Moab HPC
Suite, Bright Cluster Manager or Intel Fabric Suite provide the
means to master and hide the inherent complexity of HPC sys-
tems. For administrators and users, HPC clusters are presented
as single, large machines, with many different tuning
parameters. The software also provides a unified view of
existing clusters whenever the customer adds unified management
as a requirement at any point after the first installation. Thus,
daily routine tasks such as job management, user management,
queue partitioning and management, can be performed easily
with either graphical or web-based tools, without any advanced
scripting skills or technical expertise required from the adminis-
trator or user.
Bright Cluster Manager unleashes and manages the unlimited power of the cloud
Create a complete cluster in Amazon EC2, or easily extend your
onsite cluster into the cloud. Bright Cluster Manager provides a
“single pane of glass” to both your on-premise and cloud resourc-
es, enabling you to dynamically add capacity and manage cloud
nodes as part of your onsite cluster.
Cloud utilization can be achieved in just a few mouse clicks, with-
out the need for expert knowledge of Linux or cloud computing.
With Bright Cluster Manager, every cluster is cloud-ready, at no
extra cost. The same powerful cluster provisioning, monitoring,
scheduling and management capabilities that Bright Cluster
Manager provides to onsite clusters extend into the cloud.
The Bright advantage for cloud utilization
• Ease of use: The intuitive GUI virtually eliminates the user
learning curve; there is no need to understand Linux or EC2 to
manage the system. Alternatively, the cluster management shell
provides powerful scripting capabilities to automate tasks.
• Complete management solution: Installation/initialization,
provisioning, monitoring, scheduling and management in one
integrated environment.
• Integrated workload management: A wide selection of workload
managers is included and automatically configured with local,
cloud and mixed queues.
• Single pane of glass; complete visibility and control: Cloud
compute nodes are managed as elements of the on-site cluster,
visible from a single console with drill-downs and historic data.
• Efficient data management via data-aware scheduling:
Automatically ensures data is in position at the start of
computation and delivers results back when complete.
• Secure, automatic gateway: Mirrored LDAP and DNS services
inside the Amazon VPC connect local and cloud-based nodes for
secure communication over VPN.
• Cost savings: More efficient use of cloud resources, support
for spot instances, and minimal user intervention.
Two cloud scenarios
Bright Cluster Manager supports two cloud utilization scenarios:
“Cluster-on-Demand” and “Cluster Extension”.
Scenario 1: Cluster-on-Demand
The Cluster-on-Demand scenario is ideal if you do not have a clus-
ter onsite, or need to set up a totally separate cluster. With just a
few mouse clicks you can instantly create a complete cluster in
the public cloud, for any duration of time.
Scenario 2: Cluster Extension
The Cluster Extension scenario is ideal if you have a cluster onsite
but you need more compute power, including GPUs. With Bright
Cluster Manager, you can instantly add EC2-based resources to
your onsite cluster, for any duration of time. Extending into the
cloud is as easy as adding nodes to an onsite cluster – there are
only a few additional, one-time steps after providing the public
cloud account information to Bright Cluster Manager.
The Bright approach to managing and monitoring a cluster in
the cloud provides complete uniformity, as cloud nodes are
managed and monitored the same way as local nodes:
• Load-balanced provisioning
• Software image management
• Integrated workload management
• Interfaces – GUI, shell, user portal
• Monitoring and health checking
• Compilers, debuggers, MPI libraries, mathematical libraries
and environment modules
Cloud Bursting With Bright
Bright Cluster Manager also provides additional features that are
unique to the cloud.
Amazon spot instance support – Bright Cluster Manager enables
users to take advantage of the cost savings offered by Amazon’s
spot instances. Users can specify the use of spot instances, and
Bright will automatically schedule as available, reducing the cost
to compute without the need to monitor spot prices and sched-
ule manually.
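The underlying decision can be sketched simply (prices are invented for illustration; the real scheduler also handles interruption and rescheduling of spot capacity):

```python
# Illustrative sketch (prices invented) of the spot decision made on the
# user's behalf: launch on spot capacity only while the current spot
# price is at or below the user's maximum bid; otherwise wait.

def should_launch_spot(current_spot_price, max_bid):
    """True when launching a spot instance stays within the user's bid."""
    return current_spot_price <= max_bid

print(should_launch_spot(0.12, 0.20))  # True: price below bid, launch now
print(should_launch_spot(0.35, 0.20))  # False: hold off until the price drops
```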
Hardware Virtual Machine (HVM) – Bright Cluster Manager auto-
matically initializes all Amazon instance types, including Cluster
Compute and Cluster GPU instances that rely on HVM virtualization.
Amazon VPC – Bright supports Amazon VPC setups, which allow
compute nodes in EC2 to be placed in an isolated network,
thereby separating them from the outside world. It is even possible to
route part of a local corporate IP network to a VPC subnet in EC2,
so that local nodes and nodes in EC2 can communicate easily.
Data-aware scheduling – Data-aware scheduling ensures that
input data is automatically transferred to the cloud and made
accessible just prior to the job starting, and that the output data
is automatically transferred back. There is no need to wait for the
data to load prior to submitting jobs (delaying entry into the job
queue), nor any risk of starting the job before the data transfer is
complete (crashing the job). Users submit their jobs, and Bright’s
data-aware scheduling does the rest.
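The gating logic behind data-aware scheduling can be sketched as follows (job and file names are invented; this is a simplified model of the behavior described above):

```python
# Sketch of data-aware gating (job and input names invented): a job
# becomes eligible only once all of its staged inputs have arrived, so
# it neither idles behind a manual copy nor starts before its data.

def eligible_jobs(jobs, completed_transfers):
    """Return the jobs whose staged input files are all present."""
    return [name for name, inputs in jobs.items()
            if set(inputs) <= completed_transfers]

jobs = {"simA": ["mesh.dat", "params.cfg"], "simB": ["genome.fa"]}
done = {"mesh.dat", "params.cfg"}
print(eligible_jobs(jobs, done))  # ['simA'] -- simB still waits for its input
```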
Bright Cluster Manager provides the choice:
“To Cloud, or Not to Cloud”
Not all workloads are suitable for the cloud. The ideal situation
for most organizations is to have the ability to choose between
onsite clusters for jobs that require low latency communication,
complex I/O or sensitive data; and cloud clusters for many other
types of jobs.
Bright Cluster Manager delivers the best of both worlds: a pow-
erful management solution for local and cloud clusters, with
the ability to easily extend local clusters into the cloud without
compromising provisioning, monitoring or managing the cloud
resources.
One or more cloud nodes can be configured under the Cloud Settings tab.
Intelligent HPC Workload Management
While all HPC systems face challenges in workload demand, workflow constraints,
resource complexity, and system scale, enterprise HPC systems face more stringent
challenges and expectations.
Enterprise HPC systems must meet mission-critical and priority HPC workload demands
for commercial businesses, business-oriented research, and academic organizations.
High Throughput Computing | CAD | Big Data Analytics | Simulation | Aerospace | Automotive
Solutions to Enterprise High-Performance Computing Challenges
The Enterprise has requirements for dynamic scheduling, provi-
sioning and management of multi-step/multi-application servic-
es across HPC, cloud, and big data environments. Their workloads
directly impact revenue, product delivery, and organizational ob-
jectives.
Enterprise HPC systems must process intensive simulation and
data analysis more rapidly, accurately and cost-effectively to
accelerate insights. By improving workflow and eliminating job
delays and failures, the business can more efficiently leverage
data to make data-driven decisions and achieve a competitive
advantage. The Enterprise is also seeking to improve resource
utilization and manage efficiency across multiple heterogene-
ous systems. User productivity must increase to speed the time
to discovery, making it easier to access and use HPC resources
and expand to other resources as workloads demand.
Moab HPC Suite
Moab HPC Suite accelerates insights by unifying data center re-
sources, optimizing the analysis process, and guaranteeing ser-
vices to the business. Moab 8.1 for HPC systems and HPC cloud
continually meets enterprise priorities through increased pro-
ductivity, automated workload uptime, and consistent SLAs. It
uses the battle-tested and patented Moab intelligence engine
to automatically balance the complex, mission-critical workload
priorities of enterprise HPC systems. Enterprise customers ben-
efit from a single integrated product that brings together key
Enterprise HPC capabilities, implementation services, and 24/7
support.
It is imperative to automate workload workflows to speed time
to discovery. The following new use cases play a key role in im-
proving overall system performance:
Moab is the “brain” of an HPC system, intelligently optimizing workload throughput while balancing service levels and priorities.
• Elastic Computing – unifies data center resources by helping
the business manage resource expansion, bursting to
private/public clouds and other data center resources using
OpenStack as a common platform
• Performance and Scale – optimizes the analysis process by
dramatically increasing TORQUE and Moab throughput and
scalability across the board
• Tighter Cooperation between Moab and TORQUE – harmonizes
these key components to reduce overhead and improve
communication between the HPC scheduler and the resource
manager(s)
• More Parallelization – further decouples Moab from TORQUE’s
network communication so that Moab is less dependent on
TORQUE’s responsiveness, resulting in a 2x speed improvement
and cutting the average scheduling iteration in half
• Accounting Improvements – allows toggling between multiple
modes of accounting with varying levels of enforcement,
providing greater flexibility beyond strict allocation options;
additionally, up-to-date accounting balances provide real-time
insight into usage tracking
• Viewpoint Admin Portal – guarantees services to the business
by letting admins simplify administrative reporting, workload
status tracking, and job resource viewing
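The elastic-computing trigger can be illustrated with a small sketch (thresholds and capacities are invented; actual Moab/OpenStack bursting also drives the provisioning workflow itself):

```python
import math

# Illustrative sketch of an elastic-computing trigger (thresholds and
# capacities invented): burst to cloud nodes only the overflow that
# local capacity cannot absorb.

def nodes_to_burst(queued_jobs, free_local_nodes, jobs_per_node, max_cloud_nodes):
    """Cloud nodes needed for the backlog that local nodes can't take."""
    overflow = queued_jobs - free_local_nodes * jobs_per_node
    if overflow <= 0:
        return 0  # local capacity suffices; no cloud spend
    return min(max_cloud_nodes, math.ceil(overflow / jobs_per_node))

print(nodes_to_burst(120, 4, 16, 10))  # 64 jobs fit locally; burst 4 nodes
print(nodes_to_burst(50, 4, 16, 10))   # everything fits locally: 0
```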
Accelerate Insights
Moab HPC Suite accelerates insights by increasing overall
system, user and administrator productivity, achieving more
accurate results that are delivered faster and at a lower cost
from HPC resources. Moab provides the scalability, 90-99 per-
cent utilization, and simple job submission required to maximize
productivity. This ultimately speeds the time to discovery, aiding
the enterprise in achieving a competitive advantage due to its
HPC system. Enterprise use cases and capabilities include the
following:
• OpenStack integration to offer virtual and physical resource
provisioning for IaaS and PaaS
• Performance boost to achieve a 3x improvement in overall
optimization performance
• Advanced workflow data staging to enable improved cluster
utilization, multiple transfer methods, and new transfer types
“With Moab HPC Suite, we can meet very demanding customer
requirements with regard to unified management of heterogeneous
cluster environments and grid management, and provide customers
with flexible and powerful configuration and reporting options.
Our customers value that highly.”
Sven Grützmacher | HPC Solution Engineer
• Advanced power management with clock-frequency control and
additional power-state options, reducing energy costs by 15-30
percent
• Workload-optimized allocation policies and provisioning,
including topology-based allocation, to get more results out of
existing heterogeneous resources and reduce costs
• Unified workload management across heterogeneous clusters,
managing them as one cluster to maximize resource availability
and administration efficiency
• Optimized, intelligent scheduling that packs workloads and
backfills around priority jobs and reservations while balancing
SLAs, to efficiently use all available resources
• Optimized scheduling and management of accelerators, both
Intel Xeon Phi and GPGPUs, to maximize their utilization and
effectiveness
• Simplified job submission and management with advanced job
arrays and templates
• Showback or chargeback for pay-for-use, so actual resource
usage is tracked with flexible chargeback rates and reporting
by user, department, cost center, or cluster
• Multi-cluster grid capabilities that manage and share workload
across multiple remote clusters to meet growing workload demand
or surges
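The backfill idea mentioned above can be sketched in simplified form (job data invented; real backfill also accounts for node counts and walltime estimates):

```python
# Simplified sketch of backfill (job data invented): queued jobs are
# slotted into the idle gap before a priority reservation, but only if
# they can finish before it starts; oversized jobs keep waiting.

def backfill(waiting_jobs, gap_until_reservation):
    """Greedily pick queued (name, runtime) jobs that fit in the gap."""
    scheduled, remaining = [], gap_until_reservation
    for name, runtime in waiting_jobs:
        if runtime <= remaining:
            scheduled.append(name)
            remaining -= runtime
    return scheduled

queue = [("big", 8), ("small1", 2), ("small2", 3), ("small3", 4)]
print(backfill(queue, 6))  # ['small1', 'small2'] run in the 6-unit gap
```

The point of backfill is that the gap before the reservation does work instead of sitting idle, without delaying the priority job.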
Guarantee Services to the Business
Job and resource failures in enterprise HPC systems lead to de-
layed results, missed organizational opportunities, and missed
objectives. Moab HPC Suite intelligently automates workload
and resource uptime in the HPC system to ensure that workloads
complete successfully and reliably, avoid failures, and guarantee
services are delivered to the business.
The Enterprise benefits from these features:
• Intelligent resource placement prevents job failures with
granular resource modeling, meeting workload requirements and
avoiding at-risk resources
• Auto-response to failures and events, with configurable
actions for pre-failure conditions, amber alerts, or other
metrics and monitors
• Workload-aware future maintenance scheduling that helps
maintain a stable HPC system without disrupting workload
productivity
• Real-world expertise for fast time-to-value and system
uptime, with implementation, training and 24/7 remote support
services included
Auto SLA Enforcement

Moab HPC Suite uses the powerful Moab intelligence engine to
optimally schedule and dynamically adjust workloads to consistently
meet service level agreements (SLAs), guarantees, and business
priorities. This automatically ensures that the right workloads
are completed at the optimal times, balancing the complex mix
of departments, priorities, and SLAs involved. Moab provides the
following benefits:
- Usage accounting and budget enforcement that schedules resources and reports on usage in line with resource-sharing agreements and precise budgets (includes usage limits, usage reports, automatic budget management, and dynamic fairshare policies)
- SLA and priority policies that make sure the highest-priority workloads are processed first (includes Quality of Service and hierarchical priority weighting)
- Continuous plus future scheduling that ensures priorities and guarantees are proactively met as conditions and workloads change (e.g. future reservations, pre-emption)
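The dynamic fairshare idea behind such policies can be illustrated with a toy priority function: a user's decayed historical usage is compared against their share target, and priority falls off as usage exceeds the target. This is a generic sketch in the spirit of the 2^(-usage/share) curve used by some schedulers, not Moab's actual formula; the function name and constants are hypothetical.

```python
def fairshare_priority(share_target, decayed_usage, total_usage, base=1000):
    """Toy fairshare factor: users below their target share of decayed
    historical usage are boosted, users above it are penalized."""
    actual = decayed_usage / total_usage if total_usage else 0.0
    # 2^(-actual/target): priority halves each time usage doubles past target
    return base * (2 ** (-actual / share_target))

# Two departments with equal 50% share targets and the same total usage:
# the department that has consumed less recently ranks higher.
p_light = fairshare_priority(0.5, decayed_usage=10, total_usage=100)
p_heavy = fairshare_priority(0.5, decayed_usage=90, total_usage=100)
assert p_light > p_heavy
```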
transtec HPC solutions are designed for maximum flexibility
and ease of management. We not only offer our customers
the most powerful and flexible cluster management solu-
tion out there, but also provide them with customized setup
and site-specific configuration. Whether a customer needs
a dynamic Linux-Windows dual-boot solution, unified
management of different clusters at different sites, or the
fine-tuning of the Moab scheduler to implement fine-grained
policy configuration – transtec not only gives you
the framework at hand, but also helps you adapt the system
according to your special needs. Needless to say, when cus-
tomers are in need of special trainings, transtec will be there
to provide customers, administrators, or users with specially
adjusted Educational Services.
Many years of experience in High Performance Computing
have enabled us to develop efficient concepts for installing
and deploying HPC clusters. For this, we leverage well-proven
third-party tools, which we assemble into a total solution and
adapt and configure according to the customer's requirements.
We manage everything from the complete installation and
configuration of the operating system, necessary middle-
ware components like job and cluster management systems
up to the customer’s applications.
IBM Platform LSF

User-friendly, topology-aware workload management

Platform HPC includes a robust workload scheduling capability,
which is based on Platform LSF - the industry’s most powerful,
comprehensive, policy driven workload management solution for
engineering and scientific distributed computing environments.
By scheduling workloads intelligently according to policy,
Platform HPC improves end user productivity with minimal sys-
tem administrative effort. In addition, it allows HPC user teams to
easily access and share all computing resources, while reducing
time between simulation iterations.
GPU scheduling – Platform HPC provides the capability to sched-
ule jobs to GPUs as well as CPUs. This is particularly advantageous
in heterogeneous hardware environments as it means that admin-
istrators can configure Platform HPC so that only those jobs that
can benefit from running on GPUs are allocated to those resourc-
es. This frees up CPU-based resources to run other jobs. Using the
unified management interface, administrators can monitor the
GPU performance as well as detect ECC errors.
Unified management interface

Competing cluster management tools either do not have a web-
based interface or require multiple interfaces for managing dif-
ferent functional areas. In comparison, Platform HPC includes a
single unified interface through which all administrative tasks
can be performed, including node management, job management,
and job and cluster monitoring and reporting. Using the uni-
fied management interface, even cluster administrators with
very little Linux experience can competently manage a state of
the art HPC cluster.
Job management – While command line savvy users can contin-
ue using the remote terminal capability, the unified web portal
makes it easy to submit, monitor, and manage jobs. As changes
are made to the cluster configuration, Platform HPC automatical-
ly re-configures key components, ensuring that jobs are allocat-
ed to the appropriate resources.
The web portal is customizable and provides job data manage-
ment, remote visualization and interactive job support.
Workload/system correlation – Administrators can correlate
workload information with system load, so that they can make
timely decisions and proactively manage compute resources
against business demand. When it’s time for capacity planning,
the management interface can be used to run detailed reports
and analyses which quantify user needs and remove the guesswork
from capacity expansion.
Simplified cluster management – The unified management con-
sole is used to administer all aspects of the cluster environment.
It enables administrators to easily install, manage and monitor
their cluster. It also provides an interactive environment to easi-
ly package software as kits for application deployment as well
as pre-integrated commercial application support. One of the
key features of the interface is an operational dashboard that
provides comprehensive administrative reports. As the image il-
lustrates, Platform HPC enables administrators to monitor and
report on key performance metrics such as cluster capacity, avail-
able memory and CPU utilization. This enables administrators to
easily identify and troubleshoot issues. The easy to use interface
saves the cluster administrator time, and means that they do not
need to become an expert in the administration of open-source
software components. It also reduces the possibility of errors and
time lost due to incorrect configuration. Cluster administrators en-
joy the best of both worlds – easy access to a powerful, web-based
cluster manager without the need to learn and separately ad-
minister all the tools that comprise the HPC cluster environment.
Resource monitoring
Slurm
The Simple Linux Utility for Resource Management (Slurm) is an
open source, fault-tolerant, and highly scalable cluster manage-
ment and job scheduling system for large and small Linux clus-
ters. Slurm requires no kernel modifications for its operation and
is relatively self-contained. As a cluster workload manager, Slurm
has three key functions. First, it allocates exclusive and/or non-ex-
clusive access to resources (compute nodes) to users for some
duration of time so they can perform work. Second, it provides a
framework for starting, executing, and monitoring work (normally
a parallel job) on the set of allocated nodes. Finally, it arbitrates
contention for resources by managing a queue of pending work.
Optional plugins can be used for accounting, advanced reser-
vation, gang scheduling (time sharing for parallel jobs), backfill
scheduling, topology optimized resource selection, resource limits
by user or bank account, and sophisticated multifactor job prioriti-
zation algorithms.
Architecture

Slurm has a centralized manager, slurmctld, to monitor resources
and work. There may also be a backup manager to assume those
responsibilities in the event of failure. Each compute server (node)
has a slurmd daemon, which can be compared to a remote shell:
it waits for work, executes that work, returns status, and waits
for more work. The slurmd daemons provide fault-tolerant hier-
archical communications. There is an optional slurmdbd (Slurm
DataBase Daemon) which can be used to record accounting in-
formation for multiple Slurm-managed clusters in a single data-
base. User tools include srun to initiate jobs, scancel to terminate
queued or running jobs, sinfo to report system status, squeue to
report the status of jobs, and sacct to get information about jobs
and job steps that are running or have completed. The smap and
sview commands graphically report system and job status, including
network topology.
There is an administrative tool scontrol available to monitor and/
or modify configuration and state information on the cluster. The
administrative tool used to manage the database is sacctmgr.
It can be used to identify the clusters, valid users, valid bank ac-
counts, etc. APIs are available for all functions.
Slurm has a general-purpose plugin mechanism available to eas-
ily support various infrastructures. This permits a wide variety of
Slurm configurations using a building block approach. These pl-
ugins presently include:
- Accounting Storage: primarily used to store historical data about jobs. When used with SlurmDBD (Slurm Database Daemon), it can also supply a limits-based system along with historical system status.
- Account Gather Energy: gathers energy consumption data per job or node in the system. This plugin is integrated with the Accounting Storage and Job Account Gather plugins.
[Figure: Slurm components – controller daemons (primary and backup slurmctld), the optional slurmdbd with its database serving multiple clusters, a slurmd daemon on each compute node, and user commands (partial list) such as srun, sinfo, squeue, scancel, sacct, and scontrol]
- Authentication of communications: provides an authentication mechanism between the various components of Slurm.
- Checkpoint: interface to various checkpoint mechanisms.
- Cryptography (digital signature generation): mechanism used to generate a digital signature, which is used to validate that a job step is authorized to execute on specific nodes. This is distinct from the plugin used for authentication, since the job step request is sent from the user's srun command rather than directly from the slurmctld daemon, which generates the job step credential and its digital signature.
- Generic Resources: interface to control generic resources such as graphics processing units (GPUs) and Intel Many Integrated Core (MIC) processors.
- Job Submit: custom plugin allowing site-specific control over job requirements at submission and update.
- Job Accounting Gather: gathers job step resource utilization data.
- Job Completion Logging: logs a job's termination data. This is typically a subset of the data stored by an Accounting Storage plugin.
- Launchers: controls the mechanism used by the srun command to launch tasks.
- MPI: provides hooks for the various MPI implementations. For example, this can set MPI-specific environment variables.
- Preempt: determines which jobs can preempt other jobs and the preemption mechanism to be used.
- Priority: assigns priorities to jobs upon submission and on an ongoing basis (e.g. as they age).
- Process tracking (for signaling): provides a mechanism for identifying the processes associated with each job; used for job accounting and signaling.
- Scheduler: determines how and when Slurm schedules jobs.
- Node selection: determines the resources used for a job allocation.
- Switch or interconnect: interfaces with a switch or interconnect. For most systems (Ethernet or InfiniBand) this is not needed.
- Task Affinity: provides a mechanism to bind a job and its individual tasks to specific processors.
- Network Topology: optimizes resource selection based upon the network topology; used for both job allocations and advanced reservations.
The entities managed by these Slurm daemons, shown in Figure 2,
include nodes, the compute resource in Slurm, partitions, which
group nodes into logical sets, jobs, or allocations of resources
assigned to a user for a specified amount of time, and job steps,
which are sets of (possibly parallel) tasks within a job. The parti-
tions can be considered job queues, each of which has an assort-
ment of constraints such as job size limit, job time limit, users
permitted to use it, etc. Priority-ordered jobs are allocated nodes
within a partition until the resources (nodes, processors, memory,
etc.) within that partition are exhausted. Once a job is assigned a
set of nodes, the user is able to initiate parallel work in the form of
job steps in any configuration within the allocation. For instance,
a single job step may be started that utilizes all nodes allocated
to the job, or several job steps may independently use a portion
of the allocation. Slurm provides resource management for the
processors allocated to a job, so that multiple job steps can be
simultaneously submitted and queued until there are available
resources within the job’s allocation.
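The relationship between a job's allocation and its steps described above can be sketched as follows: steps queue inside the job and start whenever enough of the job's own processors become free. This is an illustrative model of the behavior, not Slurm's implementation; the class and names are hypothetical.

```python
from collections import deque

class Allocation:
    """Sketch of job steps sharing a job's allocation: a step runs when
    enough of the job's own cores are free, otherwise it queues until
    an earlier step releases cores."""
    def __init__(self, cores):
        self.free = cores
        self.pending = deque()   # submitted but waiting steps
        self.running = []        # (name, cores) of running steps

    def submit_step(self, name, cores):
        self.pending.append((name, cores))
        self._try_start()

    def finish_step(self, name):
        for step in self.running:
            if step[0] == name:
                self.running.remove(step)
                self.free += step[1]   # release the step's cores
                break
        self._try_start()

    def _try_start(self):
        # start queued steps in order while they fit the free cores
        while self.pending and self.pending[0][1] <= self.free:
            step = self.pending.popleft()
            self.free -= step[1]
            self.running.append(step)

alloc = Allocation(cores=8)
alloc.submit_step("step0", 6)   # starts immediately (8 cores free)
alloc.submit_step("step1", 4)   # queued: only 2 cores left in the allocation
alloc.finish_step("step0")      # step0's cores are released, step1 starts
print([s[0] for s in alloc.running])  # ['step1']
```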
IBM Platform LSF 8

Written with Platform LSF administrators in mind, this brief pro-
vides a short explanation of significant changes in Platform’s lat-
est release of Platform LSF, with a specific emphasis on schedul-
ing and workload management features.
About IBM Platform LSF 8

Platform LSF is the most powerful workload manager for de-
manding, distributed high performance computing environ-
ments. It provides a complete set of workload management ca-
pabilities, all designed to work together to reduce cycle times
and maximize productivity in mission-critical environments.
This latest Platform LSF release delivers improvements in perfor-
mance and scalability while introducing new features that sim-
plify administration and boost user productivity. This includes:
- Guaranteed resources – aligns business SLAs with infrastructure configuration for simplified administration and configuration
- Live reconfiguration – provides simplified administration and enables agility
- Delegation of administrative rights – empowers line-of-business owners to take control of their own projects
- Fairshare and pre-emptive scheduling enhancements – fine-tune key production policies
Platform LSF 8 Features

Guaranteed Resources Ensure Deadlines are Met
In Platform LSF 8, resource-based scheduling has been extended
to guarantee resource availability to groups of jobs. Resources
can be slots, entire hosts or user-defined shared resources such
as software licenses.
As an example, a business unit might guarantee that it has access
to specific types of resources within ten minutes of a job being
submitted, even while sharing resources between departments.
This facility ensures that lower priority jobs using the needed re-
sources can be pre-empted in order to meet the SLAs of higher
priority jobs. Because jobs can be automatically attached to an
SLA class via access controls, administrators can enable these
guarantees without requiring that end-users change their job
submission procedures, making it easy to implement this capa-
bility in existing environments.
Live Cluster Reconfiguration
Platform LSF 8 incorporates a new live reconfiguration capabil-
ity, allowing changes to be made to clusters without the need
to re-start LSF daemons. This is useful to customers who need
to add hosts, adjust sharing policies or re-assign users between
groups “on the fly”, without impacting cluster availability or run-
ning jobs.
Changes to the cluster configuration can be made via the bconf
command line utility, or via new API calls. This functionality can
also be integrated via a web-based interface using Platform Ap-
plication Center. All configuration modifications are logged for
a complete audit history, and changes are propagated almost
instantaneously. The majority of reconfiguration operations are
completed in under half a second.
With Live Reconfiguration, down-time is reduced, and adminis-
trators are free to make needed adjustments quickly rather than
wait for scheduled maintenance periods or non-peak hours. In
cases where users are members of multiple groups, controls can
be put in place so that a group administrator can only control
jobs associated with their designated group rather than impact-
ing jobs related to another group submitted by the same user.
Delegation of Administrative Rights
With Platform LSF 8, the concept of group administrators has
been extended to enable project managers and line of business
managers to dynamically modify group membership and fair-
share resource allocation policies within their group. The ability
to make these changes dynamically to a running cluster is made
possible by the Live Reconfiguration feature.
These capabilities can be delegated selectively depending on the
group and site policy. Different group administrators can man-
age jobs, control sharing policies or adjust group membership.
More Flexible Fairshare Scheduling Policies
To enable better resource sharing flexibility with Platform LSF 8,
the algorithms used to tune dynamically calculated user priori-
ties can be adjusted at the queue level. These algorithms can vary
based on department, application or project team preferences.
The Fairshare parameters ENABLE_HIST_RUN_TIME and HIST_
HOURS enable administrators to control the degree to which LSF
considers prior resource usage when determining user priority.
The flexibility of Platform LSF 8 has also been improved by al-
lowing a similar “decay rate” to apply to currently running jobs
(RUN_TIME_DECAY), either system-wide or at the queue level.
This is most useful for customers with long-running jobs, where
setting this parameter results in a more accurate view of real re-
source use for the fairshare scheduling to consider.
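The effect of such a decay parameter can be illustrated with a toy exponential weighting of historical CPU-hours: recent usage then counts more heavily against a user's priority than equally large usage further in the past. The function and constants below are a hypothetical sketch, not LSF's actual formula.

```python
import math

def decayed_usage(samples, now, decay_hours):
    """Exponentially weight past CPU-hour samples so recent usage counts
    more; decay_hours plays the role of a HIST_HOURS-style decay constant.
    samples: iterable of (timestamp_hours, cpu_hours)."""
    return sum(cpu_hours * math.exp(-(now - t) / decay_hours)
               for t, cpu_hours in samples)

# Identical totals, but user B's 100 CPU-hours are only 8 hours old,
# so they weigh more against B's current priority than A's 2-day-old usage.
a = decayed_usage([(0, 100.0)], now=48, decay_hours=24)   # ~13.5
b = decayed_usage([(40, 100.0)], now=48, decay_hours=24)  # ~71.7
assert b > a
```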
Performance & Scalability Enhancements
Platform LSF has been extended to support an unparalleled
scale of up to 100,000 cores and 1.5 million queued jobs for
very high throughput EDA workloads. Even higher scalability is
possible for more traditional HPC workloads. Specific areas of im-
provement include the time required to start the master-batch
daemon (MBD), bjobs query performance, job submission and
job dispatching as well as impressive performance gains result-
ing from the new Bulk Job Submission feature. In addition, on
very large clusters with large numbers of user groups employing
fairshare scheduling, the memory footprint of the master batch
scheduler in LSF has been reduced by approximately 70% and
scheduler cycle time has been reduced by 25%, resulting in better
performance and scalability.
More Sophisticated Host-based Resource Usage for Parallel Jobs
Platform LSF 8 provides several improvements to how resource
use is tracked and reported with parallel jobs. Accurate tracking
of how parallel jobs use resources such as CPUs, memory and
swap, is important for ease of management, optimal scheduling
and accurate reporting and workload analysis. With Platform LSF
8 administrators can track resource usage on a per-host basis and
an aggregated basis (across all hosts), ensuring that resource use
is reported accurately. Additional details such as the running
PIDs and PGIDs of distributed parallel jobs are also provided,
simplifying manual cleanup (if necessary) and the development of
scripts for managing parallel jobs. These improvements in resource usage reporting
are reflected in LSF commands including bjobs, bhist and bacct.
Improved Ease of Administration for Mixed Windows and
Linux Clusters
The lspasswd command in Platform LSF enables Windows LSF
users to advise LSF of changes to their Windows level passwords.
With Platform LSF 8, password synchronization between envi-
ronments has become much easier to manage because the Win-
dows passwords can now be adjusted directly from Linux hosts
using the lspasswd command. This allows Linux users to conveniently
synchronize passwords on Windows hosts without needing to
explicitly log in to those hosts.
Bulk Job Submission
When submitting large numbers of jobs with different resource
requirements or job level settings, Bulk Job Submission allows
for jobs to be submitted in bulk by referencing a single file con-
taining job details.
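The mechanism can be sketched as a parser that turns one file of per-job details into a list of submission requests. The file format below is purely illustrative; it is not LSF's actual bulk-submission syntax, and the function name is hypothetical.

```python
def parse_bulk_file(text):
    """Parse a toy bulk-submission file: one job per non-comment line,
    'CORES MEM_GB COMMAND...' (illustrative format, not LSF's)."""
    jobs = []
    for line in text.strip().splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cores, mem, *cmd = line.split()
        jobs.append({"cores": int(cores), "mem_gb": int(mem),
                     "cmd": " ".join(cmd)})
    return jobs

batch = """
# cores mem command
4 8  ./solver --case 1
8 16 ./solver --case 2
"""
print(len(parse_bulk_file(batch)))  # 2
```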
Intel Cluster Ready – A Quality Standard for HPC Clusters

Intel Cluster Ready is designed to create predictable expectations for users and
providers of HPC clusters, primarily targeting customers in the commercial and
industrial sectors. These are not the experimental "test-bed" clusters used for
computer science and computer engineering research, nor the high-end "capability"
clusters, closely tailored to specific computing requirements, that power
high-energy physics at the national labs and other specialized research organizations.
Intel Cluster Ready seeks to advance HPC clusters used as computing resources in pro-
duction environments by providing cluster owners with a high degree of confidence
that the clusters they deploy will run the applications their scientific and engineering
staff rely upon to do their jobs. It achieves this by providing cluster hardware, software,
and system providers with a precisely defined basis for their products to meet their
customers’ production cluster requirements.
What are the Objectives of ICR?

The primary objective of Intel Cluster Ready is to make clusters
easier to specify, easier to buy, easier to deploy, and easier to
develop applications for. A key feature of ICR
is the concept of “application mobility”, which is defined as the
ability of a registered Intel Cluster Ready application – more cor-
rectly, the same binary – to run correctly on any certified Intel
Cluster Ready cluster. Clearly, application mobility is important
for users, software providers, hardware providers, and system
providers.
- Users want to know the cluster they choose will reliably run the applications they rely on today and will rely on tomorrow
- Application providers want to satisfy the needs of their customers by providing applications that reliably run on their customers' cluster hardware and cluster stacks
- Cluster stack providers want to satisfy the needs of their customers by providing a cluster stack that supports their customers' applications and cluster hardware
- Hardware providers want to satisfy the needs of their customers by providing hardware components that support their customers' applications and cluster stacks
- System providers want to satisfy the needs of their customers by providing complete cluster implementations that reliably run their customers' applications
Without application mobility, each group above must either try
to support all combinations, which they have neither the time
nor resources to do, or pick the “winning combination(s)” that
best supports their needs, and risk making the wrong choice.
The Intel Cluster Ready definition of application portability
supports all of these needs by going beyond pure portability
(re-compiling and linking a unique binary for each platform) to
application binary mobility (running the same binary on multiple
platforms), by more precisely defining the target system.
[Figure 1: The ICR stack – registered applications (e.g. CFD, crash, climate, QCD, bio) on a single solution platform: Intel-selected Linux cluster tools with Intel MPI Library and Intel MKL runtime editions, fabrics (Gigabit Ethernet, 10Gbit Ethernet, InfiniBand/OFED), and certified Intel Xeon processor platforms from Intel, OEMs, and platform integrators, with optional value add by the individual platform integrator; Intel development tools (C++, Intel Trace Analyzer and Collector, MKL, etc.) alongside]
A further aspect of application mobility is to ensure that regis-
tered Intel Cluster Ready applications do not need special pro-
gramming or alternate binaries for different message fabrics.
Intel Cluster Ready accomplishes this by providing an MPI imple-
mentation supporting multiple fabrics at runtime; through this,
registered Intel Cluster Ready applications obey the “message
layer independence property”. Stepping back, the unifying con-
cept of Intel Cluster Ready is “one-to-many,” that is
- One application will run on many clusters
- One cluster will run many applications
How is one-to-many accomplished? Looking at Figure 1, you see
the abstract Intel Cluster Ready “stack” components that al-
ways exist in every cluster, i.e., one or more applications, a clus-
ter software stack, one or more fabrics, and finally the underly-
ing cluster hardware. The remainder of that picture (to the right)
shows the components in greater detail.
Applications, on the top of the stack, rely upon the various APIs,
utilities, and file system structure presented by the underly-
ing software stack. Registered Intel Cluster Ready applications
are always able to rely upon the APIs, utilities, and file system
structure specified by the Intel Cluster Ready Specification; if an
application requires software outside this “required” set, then
Intel Cluster Ready requires the application to provide that soft-
ware as a part of its installation.
To ensure that this additional per-application software doesn’t
conflict with the cluster stack or other applications, Intel
Cluster Ready also requires the additional software to be in-
stalled in application-private trees, so the application knows
how to find that software while not interfering with other
applications. While this may well cause duplicate software to
be installed, the reliability provided by the duplication far out-
weighs the cost of the duplicated files.
A prime example supporting this comparison is the removal of a
common file (library, utility, or other) that is unknowingly need-
ed by some other application – such errors can be insidious to
repair even when they cause an outright application failure.
Cluster platforms, at the bottom of the stack, provide the APIs,
utilities, and file system structure relied upon by registered ap-
plications. Certified Intel Cluster Ready platforms ensure the
APIs, utilities, and file system structure are complete per the Intel
Cluster Ready Specification; certified clusters are able to provide
them by various means as they deem appropriate. Because of the
clearly defined responsibilities ensuring the presence of all soft-
ware required by registered applications, system providers have
a high confidence that the certified clusters they build are able
to run any certified applications their customers rely on. In ad-
dition to meeting the Intel Cluster Ready requirements, certified
clusters can also provide their added value, that is, other features
and capabilities that increase the value of their products.
How Does Intel Cluster Ready Accomplish its Objectives?

At its heart, Intel Cluster Ready is a definition of the cluster as a
parallel application platform, as well as a tool to certify an ac-
tual cluster to the definition. Let’s look at each of these in more
detail, to understand their motivations and benefits.
A definition of the cluster as a parallel application platform
The Intel Cluster Ready Specification is very much written as the
requirements for, not the implementation of, a platform upon
which parallel applications, more specifically MPI applications,
can be built and run. As such, the specification doesn’t care
whether the cluster is diskful or diskless, fully distributed or
single system image (SSI), built from “Enterprise” distributions
or community distributions, fully open source or not. Perhaps
more importantly, with one exception, the specification doesn’t
have any requirements on how the cluster is built; that one
“The Intel Cluster Checker allows us to certify that our transtec
HPC clusters are compliant with an independent high quality
standard. Our customers can rest assured: their applications run as
they expect.”
Marcus Wiedemann | HPC Solution Architect
[Figure: Intel Cluster Checker – a cluster definition and configuration XML file drives the Cluster Checker engine, which runs test modules and parallel ops checks across the compute nodes and reports pass/fail results and diagnostics to STDOUT and a logfile]
exception is that compute nodes must be built with automated
tools, so that new, repaired, or replaced nodes can be rebuilt
identically to the existing nodes without any manual interac-
tion, other than possibly initiating the build process.
Some items the specification does care about include:
- The ability to run both 32- and 64-bit applications, including MPI applications and X clients, on any of the compute nodes
- Consistency among the compute nodes' configuration, capability, and performance
- Identical accessibility of libraries and tools across the cluster
- Identical access by each compute node to permanent and temporary storage, as well as users' data
- Identical access to each compute node from the head node
- An MPI implementation that provides fabric independence
- Network booting support and a remotely accessible console on all nodes
The specification also requires that the runtimes for specific In-
tel software products are installed on every certified cluster:
- Intel Math Kernel Library
- Intel MPI Library Runtime Environment
- Intel Threading Building Blocks
This requirement serves two purposes. First and foremost, main-
line Linux distributions do not necessarily provide a sufficient
software stack for building an HPC cluster – such specialization is
beyond their mission. Secondly, the requirement ensures that
programs built with this software will always work on certified
clusters and enjoy simpler installations. As these runtimes are
directly available from the web, the requirement does not cause
additional costs to certified clusters. It is also very important
to note that this does not require certified applications to use
these libraries nor does it preclude alternate libraries, e.g., other
MPI implementations, from being present on certified clusters.
Quite clearly, an application that requires, e.g., an alternate MPI,
must also provide the runtimes for that MPI as a part of its in-
stallation.
A tool to certify an actual cluster to the definition
The Intel Cluster Checker, included with every certified Intel
Cluster Ready implementation, is used in four modes in the life
of a cluster:
- To certify a system provider's prototype cluster as a valid implementation of the specification
- To verify to the owner that the just-delivered cluster is a "true copy" of the certified prototype
- To ensure the cluster remains fully functional, reducing service calls not related to the applications or the hardware
- To help software and system providers diagnose and correct actual problems in their code or hardware
While these are critical capabilities, in all fairness, this greatly
understates what Intel Cluster Checker can do. The tool verifies
not only that the cluster is correctly configured but also that it
is performing as expected. To do this, per-node and cluster-wide
static and dynamic tests are made of the hardware and software.
The static checks ensure the systems are configured consist-
ently and appropriately. As one example, the tool will ensure
the systems are all running the same BIOS versions as well as
having identical configurations among key BIOS settings. This
type of problem – differing BIOS versions or settings – can be
the root cause of subtle problems such as differing memory con-
figurations that manifest themselves as differing memory band-
widths only to be seen at the application level as slower than ex-
pected overall performance. As is well-known, parallel program
performance can be very much governed by the performance
of the slowest components, not the fastest. In another static
check, the Intel Cluster Checker will ensure that the expected
tools, libraries, and files are present on each node, identically
located on all nodes, as well as identically implemented on all
nodes. This ensures that each node has the minimal software
stack specified by the specification, as well as identical soft-
ware stack among the compute nodes.
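As an illustration, the cross-node comparison behind such static checks can be sketched in a few lines of Python. This is not Intel Cluster Checker code, and the node inventory below is invented for the example; a real tool would gather this data over SSH or a parallel shell:

```python
import hashlib

def stack_fingerprint(node_info):
    """Hash the sorted (item, version) pairs describing a node's configuration."""
    digest = hashlib.sha256()
    for item, version in sorted(node_info.items()):
        digest.update(f"{item}={version}\n".encode())
    return digest.hexdigest()

def find_inconsistent_nodes(cluster):
    """Return nodes whose fingerprint differs from the first node's."""
    reference = stack_fingerprint(cluster["node01"])
    return [name for name, info in cluster.items()
            if stack_fingerprint(info) != reference]

# Illustrative inventory: node03 runs a different BIOS version.
cluster = {
    "node01": {"BIOS": "1.2b", "/usr/lib64/libmpi.so": "4.1"},
    "node02": {"BIOS": "1.2b", "/usr/lib64/libmpi.so": "4.1"},
    "node03": {"BIOS": "1.1a", "/usr/lib64/libmpi.so": "4.1"},
}
print(find_inconsistent_nodes(cluster))  # → ['node03']
```

Any divergence, whether a BIOS version or a misplaced library, changes the fingerprint and singles out the offending node.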
A typical dynamic check ensures consistent system perfor-
mance, e.g., via the STREAM benchmark. This particular test en-
sures processor and memory performance is consistent across
compute nodes, which, like the BIOS setting example above, can
be the root cause of overall slower application performance. An
additional check with STREAM can be made if the user config-
ures an expectation of benchmark performance; this check will
ensure that performance is not only consistent across the clus-
ter, but also meets expectations.
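The logic of such a consistency-plus-expectation check can be sketched as follows; the bandwidth figures and the 5% tolerance are illustrative assumptions, not Intel Cluster Checker values:

```python
from statistics import median

def check_stream(results, expected_min=None, tolerance=0.05):
    """Flag nodes whose STREAM triad bandwidth (GB/s) deviates from the
    cluster median by more than `tolerance`, or misses `expected_min`."""
    mid = median(results.values())
    problems = {}
    for node, bw in results.items():
        if abs(bw - mid) / mid > tolerance:
            problems[node] = f"inconsistent: {bw} vs median {mid}"
        elif expected_min is not None and bw < expected_min:
            problems[node] = f"below expectation: {bw} < {expected_min}"
    return problems

# Illustrative numbers; node04's memory is mis-configured.
results = {"node01": 81.2, "node02": 80.9, "node03": 81.0, "node04": 41.3}
print(check_stream(results, expected_min=75.0))  # flags node04 only
```

The consistency test needs no configuration at all, while the expectation test only fires once the user supplies a baseline, mirroring the two modes described above.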
Going beyond processor performance, the Intel MPI Bench-
marks are used to ensure the network fabric(s) are performing
properly and, with a configuration that describes expected
performance levels, up to the cluster provider’s performance
expectations. Network inconsistencies due to poorly performing Ethernet NICs, InfiniBand HCAs, faulty switches, and loose or faulty cables can be identified. Finally, the Intel Cluster Checker
is extensible, enabling additional tests to be added supporting
additional features and capabilities.
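A fabric check of this kind boils down to scanning measured point-to-point results for outliers. The sketch below assumes the bandwidth numbers have already been collected (e.g., from an Intel MPI Benchmarks PingPong sweep); the threshold and figures are illustrative, not part of any specification:

```python
def suspect_links(pairwise_bw, threshold):
    """Return node pairs whose measured point-to-point bandwidth (MB/s)
    falls below `threshold` -- candidates for loose cables or bad ports."""
    return [(a, b) for (a, b), bw in pairwise_bw.items() if bw < threshold]

# Illustrative point-to-point results from a PingPong sweep.
pairwise_bw = {
    ("node01", "node02"): 6100.0,
    ("node01", "node03"): 6050.0,
    ("node02", "node03"): 1200.0,  # far below the fabric's nominal rate
}
print(suspect_links(pairwise_bw, threshold=5000.0))  # → [('node02', 'node03')]
```

A link that appears in several flagged pairs points at a common port, cable, or switch.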
This extensibility enables the Intel Cluster Checker to not only support the minimal requirements of the Intel Cluster Ready Specification, but the full cluster as delivered to the customer.

Conforming hardware and software
The preceding primarily concerned the builders of certified clusters and the developers of registered applications. For end users who want to purchase a certified cluster to run registered applications, the ability to identify registered applications and certified clusters is most important: it reduces the effort to evaluate, acquire, and deploy the clusters that run their applications, and to keep that computing resource operating properly at full performance, directly increasing their productivity.

[Figure: the Intel Cluster Ready program. The Specification (covering configuration, software, interconnect, server platform, and Intel processor) underpins software tools, cluster platforms, reference designs, ISV enabling, demand creation, and support & training.]

Intel, XEON and certain other trademarks and logos appearing in this brochure are trademarks or registered trademarks of Intel Corporation.

Intel Cluster Ready Builds HPC Momentum
With the Intel Cluster Ready (ICR) program, Intel Corporation set out to create a win-win scenario for the major constituencies in the high-performance computing (HPC) cluster market. Hardware vendors and independent software vendors (ISVs) stand to win by being able to assure both buyers and users that their products will work well together straight out of the box.
System administrators stand to win by being able to meet corporate demands to push HPC competitive advantages deeper into their organizations while satisfying end users’ demands for reliable HPC cycles, all without increasing IT staff. End users stand to win by being able to get their work done faster, with less downtime, on certified cluster platforms. Last but not least, with ICR, Intel has positioned itself to win by expanding the total addressable market (TAM) and reducing time to market for the company’s microprocessors, chip sets, and platforms.

The Worst of Times
For a number of years, clusters were largely confined to government and academic sites, where contingents of graduate students and midlevel employees were available to help program and maintain the unwieldy early systems. Commercial firms lacked this low-cost labor supply and mistrusted the favored cluster operating system, open source Linux, on the grounds that no single party could be held accountable if something went wrong with it. Today, cluster penetration in the HPC market is deep and wide, extending from systems with a handful of processors to some of the world’s largest supercomputers, and from under $25,000 to tens or hundreds of millions of dollars in price.
Clusters increasingly pervade every HPC vertical market:
biosciences, computer-aided engineering, chemical engineer-
ing, digital content creation, economic/financial services,
electronic design automation, geosciences/geo-engineering,
mechanical design, defense, government labs, academia, and
weather/climate.
But IDC studies have consistently shown that clusters remain
difficult to specify, deploy, and manage, especially for new and
less experienced HPC users. This should come as no surprise,
given that a cluster is a set of independent computers linked to-
gether by software and networking technologies from multiple
vendors. Clusters originated as do-it-yourself HPC systems. In
the late 1990s users began employing inexpensive hardware to
cobble together scientific computing systems based on the “Be-
owulf cluster” concept first developed by Thomas Sterling and
Donald Becker at NASA. From their Beowulf origins, clusters have
evolved and matured substantially, but the system manage-
ment issues that plagued their early years remain in force today.
The Need for Standard Cluster Solutions
The escalating complexity of HPC clusters poses a dilemma for
many large IT departments that cannot afford to scale up their
HPC-knowledgeable staff to meet the fast-growing end-user de-
mand for technical computing resources. Cluster management
is even more problematic for smaller organizations and busi-
ness units that often have no dedicated, HPC-knowledgeable
staff to begin with.
The ICR program aims to address burgeoning cluster complexity
by making available a standard solution (aka reference archi-
tecture) for Intel-based systems that hardware vendors can use
to certify their configurations and that ISVs and other software
vendors can use to test and register their applications, system
software, and HPC management software. The chief goal of this
voluntary compliance program is to ensure fundamental hard-
ware-software integration and interoperability so that system
administrators and end users can confidently purchase and de-
ploy HPC clusters, and get their work done, even in cases where
no HPC-knowledgeable staff are available to help.
The ICR program wants to prevent end users from having to
become, in effect, their own systems integrators. In smaller or-
ganizations, the ICR program is designed to allow overworked
IT departments with limited or no HPC expertise to support HPC
user requirements more readily. For larger organizations with
dedicated HPC staff, ICR creates confidence that required user
applications will work, eases the problem of system adminis-
tration, and allows HPC cluster systems to be scaled up in size
without scaling support staff. ICR can help drive HPC cluster re-
sources deeper into larger organizations and free up IT staff to
focus on mainstream enterprise applications (e.g., payroll, sales,
HR, and CRM).
The program is a three-way collaboration among hardware ven-
dors, software vendors, and Intel. In this triple alliance, Intel pro-
vides the specification for the cluster architecture implementa-
tion, and then vendors certify the hardware configurations and
register software applications as compliant with the specifica-
tion. The ICR program’s promise to system administrators and
end users is that registered applications will run out of the box
on certified hardware configurations.
ICR solutions are compliant with the standard platform archi-
tecture, which starts with 64-bit Intel Xeon processors in an In-
tel-certified cluster hardware platform. Layered on top of this
foundation are the interconnect fabric (Gigabit Ethernet, Infini-
Band) and the software stack: Intel-selected Linux cluster tools,
an Intel MPI runtime library, and the Intel Math Kernel Library.
Intel runtime components are available and verified as part of
the certification (e.g., Intel tool runtimes) but are not required
to be used by applications. The inclusion of these Intel runtime
components does not exclude any other components a systems
vendor or ISV might want to use. At the top of the stack are In-
tel-registered ISV applications.
At the heart of the program is the Intel Cluster Checker, a vali-
dation tool that verifies that a cluster is specification compliant
and operational before ISV applications are ever loaded. After
the cluster is up and running, the Cluster Checker can function
as a fault isolation tool in wellness mode. Certification needs
to happen only once for each distinct hardware platform, while
verification – which determines whether a valid copy of the
specification is operating – can be performed by the Cluster
Checker at any time.
Cluster Checker is an evolving tool that is designed to accept
new test modules. It is a productized tool that ICR members ship
with their systems. Cluster Checker originally was designed for
homogeneous clusters but can now also be applied to clusters
with specialized nodes, such as all-storage sub-clusters. Cluster
Checker can isolate a wide range of problems, including network
or communication problems.
transtec offers their customers a new and fascinating way to
evaluate transtec’s HPC solutions in real world scenarios. With
the transtec Benchmarking Center, solutions can be explored in detail with the actual applications the customers will later run on them. Intel Cluster Ready makes this feasible by simplifying system maintenance and allowing clean systems to be set up easily, as often as needed.
As High-Performance Computing (HPC) systems are utilized for
numerical simulations, more and more advanced clustering
technologies are being deployed. Because of its performance,
price/performance and energy efficiency advantages, clusters
now dominate all segments of the HPC market and continue
to gain acceptance. HPC computer systems have become far
more widespread and pervasive in government, industry, and
academia. However, rarely does the client have the possibility
to test their actual application on the system they are planning
to acquire.
The transtec Benchmarking Center
transtec HPC solutions get used by a wide variety of clients.
Among those are most of the large users of compute power at
German and other European universities and research centers
as well as governmental users like the German army’s compute
center and clients from the high tech, the automotive and oth-
er sectors. transtec HPC solutions have demonstrated their
value in more than 500 installations. Most of transtec’s cluster
systems are based on SUSE Linux Enterprise Server, Red Hat
Enterprise Linux, CentOS, or Scientific Linux. With xCAT for ef-
ficient cluster deployment, and Moab HPC Suite by Adaptive
Computing for high-level cluster management, transtec is able
to efficiently deploy and ship easy-to-use HPC cluster solutions
with enterprise-class management features. Moab has proven
to provide easy-to-use workload and job management for small
systems as well as the largest cluster installations worldwide.
However, when selling clusters to governmental customers as
well as other large enterprises, it is often required that the cli-
ent can choose from a range of competing offers. Often there is a fixed budget available, and competing solutions are compared based on their performance on certain custom benchmark codes.
So, in 2007 transtec decided to add another layer to their already
wide array of competence in HPC – ranging from cluster deploy-
ment and management, the latest CPU, board and network tech-
nology to HPC storage systems. The systems are assembled in transtec’s HPC Lab, where transtec uses Intel Cluster Ready to facilitate testing, verification, and documentation throughout the actual build process, including final testing.
At the benchmarking center transtec can now offer a set of small
clusters with the “newest and hottest technology” through
Intel Cluster Ready. A standard installation infrastructure
gives transtec a quick and easy way to set systems up accord-
ing to their customers’ choice of operating system, compilers,
workload management suite, and so on. With Intel Cluster Ready
there are prepared standard set-ups available with verified per-
formance at standard benchmarks while the system stability is
guaranteed by our own test suite and the Intel Cluster Checker.
The Intel Cluster Ready program is designed to provide a com-
mon standard for HPC clusters, helping organizations design
and build seamless, compatible and consistent cluster config-
urations. Integrating the standards and tools provided by this
program can help significantly simplify the deployment and
management of HPC clusters.
Remote Visualization and Workflow Optimization
It’s human nature to want to ‘see’ the results from simulations, tests, and analyses. Up
until recently, this has meant ‘fat’ workstations on many user desktops. This approach
provides CPU power when the user wants it – but as dataset size increases, there can be
delays in downloading the results.
Also, sharing the results with colleagues means gathering around the workstation – not
always possible in this globalized, collaborative workplace.
Aerospace, Automotive and Manufacturing
The complex factors involved in CAE range from compute-intensive data analysis to worldwide collaboration between designers, engineers, OEMs and suppliers. To accommodate these
factors, Cloud (both internal and external) and Grid Computing
solutions are increasingly seen as a logical step to optimize IT
infrastructure usage for CAE. It is no surprise then, that automo-
tive and aerospace manufacturers were the early adopters of
internal Cloud and Grid portals.
Manufacturers can now develop more “virtual products” and
simulate all types of designs, fluid flows, and crash simulations.
Such virtualized products and more streamlined collaboration
environments are revolutionizing the manufacturing process.
With NICE EnginFrame in their CAE environment engineers can
take the process even further by connecting design and simula-
tion groups in “collaborative environments” to get even greater
benefits from “virtual products”. Thanks to EnginFrame, CAE engineers can have a simple, intuitive collaborative environment that takes care of issues related to:
� Access & Security - Where an organization must give access
to external and internal entities such as designers, engi-
neers and suppliers.
� Distributed collaboration - Simple and secure connection of
design and simulation groups distributed worldwide.
� Time spent on IT tasks – By eliminating time and resources spent
using cryptic job submission commands or acquiring knowl-
edge of underlying compute infrastructures, engineers can
spend more time concentrating on their core design tasks.
EnginFrame’s web based interface can be used to access the
compute resources required for CAE processes. This means ac-
cess to job submission & monitoring tasks, input & output data
associated with industry standard CAE/CFD applications for
Fluid-Dynamics, Structural Analysis, Electro Design and Design
Collaboration (like Abaqus, Ansys, Fluent, MSC Nastran, PAM-
Crash, LS-Dyna, Radioss) without cryptic job submission com-
mands or knowledge of underlying compute infrastructures.
EnginFrame has a long history of usage in some of the most
prestigious manufacturing organizations worldwide including
Aerospace companies like AIRBUS, Alenia Space, CIRA, Galileo
Avionica, Hamilton Sundstrand, Magellan Aerospace, MTU and
Automotive companies like Audi, ARRK, Brawn GP, Bridgestone,
Bosch, Delphi, Elasis, Ferrari, FIAT, GDX Automotive, Jaguar-Lan-
dRover, Lear, Magneti Marelli, McLaren, P+Z, RedBull, Swagelok,
Suzuki, Toyota, TRW.
Life Sciences and Healthcare
NICE solutions are deployed in the Life Sciences sector at companies like BioLab, Partners HealthCare, Pharsight and the M.D. Anderson Cancer Center; also in leading research projects like DEISA or LitBio, in order to allow easy and transparent use of computing resources without any insight into the HPC infrastructure.
The Life Science and Healthcare sectors have some very strict requirements when choosing IT solutions like EnginFrame, for instance:
� Security - To meet the strict security & privacy requirements
of the biomedical and pharmaceutical industry any solution
needs to take account of multiple layers of security and au-
thentication.
� Industry specific software - ranging from the simplest custom
tool to the more general purpose free and open middlewares.
EnginFrame’s modular architecture allows different Grid middlewares and software (including leading Life Science applications like Schroedinger Glide, EPFL RAxML, the BLAST family, Taverna, and R) to be exploited. Users can compose elementary services into complex applications and “virtual experiments”, and run, monitor, and build workflows via a standard Web browser.
EnginFrame also has highly tunable resource sharing and fine-
grained access control where multiple authentication systems
(like Active Directory, Kerberos/LDAP) can be exploited simulta-
neously.
Growing complexity in globalized teams
HPC systems and Enterprise Grids deliver unprecedented time-
to-market and performance advantages to many research and
corporate customers, struggling every day with compute and
data intensive processes.
This often generates or transforms massive amounts of jobs and data that need to be handled and archived efficiently to deliver timely information to users distributed in multiple locations,
with different security concerns.
Poor usability of such complex systems often negatively im-
pacts users’ productivity, and ad-hoc data management often
increases information entropy and dissipates knowledge and
intellectual property.
The solution
From solving distributed computing issues for our customers, we understand that a modern, user-friendly web front-end to HPC and Grids can drastically improve engineering productivity, if properly designed to address the specific challenges of the Technical Computing market.
NICE EnginFrame overcomes many common issues in the areas
of usability, data management, security and integration, open-
ing the way to a broader, more effective use of the Technical
Computing resources.
The key components of the solution are:
� A flexible and modular Java-based kernel, with clear separa-
tion between customizations and core services
� Powerful data management features, reflecting the typical
needs of engineering applications
� Comprehensive security options and a fine grained authori-
zation system
� Scheduler abstraction layer to adapt to different workload
and resource managers
� Responsive and competent Support services
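The scheduler abstraction layer mentioned above can be pictured as a thin adapter interface. The classes below are a hypothetical sketch, not EnginFrame’s actual API; they only illustrate how a single portal-side call can map to different workload managers:

```python
from abc import ABC, abstractmethod

class Scheduler(ABC):
    """Common interface the portal codes against."""
    @abstractmethod
    def submit_command(self, script, cores, queue):
        ...

class TorqueScheduler(Scheduler):
    def submit_command(self, script, cores, queue):
        # Torque/PBS resource syntax: one node, `cores` processors per node.
        return f"qsub -l nodes=1:ppn={cores} -q {queue} {script}"

class LsfScheduler(Scheduler):
    def submit_command(self, script, cores, queue):
        # LSF equivalent of the same request.
        return f"bsub -n {cores} -q {queue} {script}"

def submit(scheduler: Scheduler, script, cores=8, queue="normal"):
    """The portal only ever sees the abstract interface."""
    return scheduler.submit_command(script, cores, queue)

print(submit(TorqueScheduler(), "run_fluent.sh"))
# → qsub -l nodes=1:ppn=8 -q normal run_fluent.sh
print(submit(LsfScheduler(), "run_fluent.sh"))
# → bsub -n 8 -q normal run_fluent.sh
```

Swapping the workload manager then means swapping one adapter, while services built on top remain unchanged.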
End users can typically enjoy the following improvements:
� User-friendly, intuitive access to computing resources, using
a standard Web browser
� Application-centric job submission
� Organized access to job information and data
� Increased mobility and reduced client requirements
On the other hand, the Technical Computing Portal delivers significant added value for system administrators and IT:
� Reduced training needs to enable users to access the
resources
� Centralized configuration and immediate deployment of
services
� Comprehensive authorization to access services and
information
� Reduced support calls and submission errors
Coupled with our Remote Visualization solutions, our custom-
ers quickly deploy end-to-end engineering processes on their
Intranet, Extranet or Internet.
Remote Visualization
Increasing dataset complexity (millions of polygons, interacting components, MRI/PET overlays) means that when the time comes to upgrade and replace the workstations, the next generation of
hardware needs more memory, more graphics processing, more
disk, and more CPU cores. This makes the workstation expen-
sive, in need of cooling, and noisy.
Innovation in the field of remote 3D processing now allows companies to address these issues by moving applications away from the desktop into the data center. Instead of pushing data to the
application, the application can be moved near the data.
Instead of mass workstation upgrades, remote visualization
allows incremental provisioning, on-demand allocation, better
management and efficient distribution of interactive sessions
and licenses. Racked workstations or blades typically have low-
er maintenance, cooling, replacement costs, and they can ex-
tend workstation (or laptop) life as “thin clients”.
The solution
Leveraging their expertise in distributed computing and web-
based application portals, NICE offers an integrated solution to
access, load balance and manage applications and desktop ses-
sions running within a visualization farm. The farm can include
both Linux and Windows resources, running on heterogeneous
hardware.
The core of the solution is the EnginFrame visualization plug-in, which delivers web-based services to access and manage applications and desktops published in the farm. This solution has been
integrated with:
� NICE Desktop Cloud Visualization (DCV)
� HP Remote Graphics Software (RGS)
� RealVNC
� TurboVNC and VirtualGL
� Nomachine NX
Coupled with these third party remote visualization engines
(which specialize in delivering high frame-rates for 3D graphics),
the NICE offering for Remote Visualization solves the issues of
user authentication, dynamic session allocation, session man-
agement and data transfers.
End users can enjoy the following improvements:
� Intuitive, application-centric web interface to start, control
and re-connect to a session
� Single sign-on for batch and interactive applications
� All data transfers from and to the remote visualization farm
are handled by EnginFrame
� Built-in collaboration, to share sessions with other users
� The load and usage of the visualization cluster is monitored
in the browser
The solution also delivers significant added-value for the sys-
tem administrators:
� No need of SSH / SCP / FTP on the client machine
� Easy integration into identity services, Single Sign-On (SSO),
Enterprise portals
� Automated data life cycle management
� Built-in user session sharing, to facilitate support
� Interactive sessions are load balanced by a scheduler (LSF,
Moab or Torque) to achieve optimal performance and re-
source usage
� Better control and use of application licenses
� Monitor, control and manage users’ idle sessions
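The dynamic session-allocation idea behind such a farm can be sketched as a simple capacity-aware placement function. The host records and field names below are invented for the example and do not reflect EnginFrame’s internals:

```python
def allocate_session(hosts, gpu_needed=True):
    """Pick the least-loaded visualization host that still has a free
    session slot (and a GPU, if the application needs one)."""
    candidates = [h for h in hosts
                  if h["sessions"] < h["max_sessions"]
                  and (h["has_gpu"] or not gpu_needed)]
    if not candidates:
        raise RuntimeError("no capacity in the visualization farm")
    # Lowest occupancy ratio wins.
    return min(candidates, key=lambda h: h["sessions"] / h["max_sessions"])["name"]

hosts = [
    {"name": "vis01", "sessions": 3, "max_sessions": 4, "has_gpu": True},
    {"name": "vis02", "sessions": 1, "max_sessions": 4, "has_gpu": True},
    {"name": "vis03", "sessions": 0, "max_sessions": 2, "has_gpu": False},
]
print(allocate_session(hosts))  # → vis02
```

In production this decision is delegated to the scheduler (LSF, Moab or Torque), which applies the same principle with far richer policies.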
Desktop Cloud Visualization
NICE Desktop Cloud Visualization (DCV) is an advanced technology that enables Technical Computing users to remotely access 2D/3D interactive applications over a standard network.
Engineers and scientists are immediately empowered by taking
full advantage of high-end graphics cards, fast I/O performance
and large memory nodes hosted in “Public or Private 3D Cloud”,
rather than waiting for the next upgrade of their workstations. The DCV protocol adapts to heterogeneous networking infrastructures like LAN, WAN and VPN, to deal with bandwidth and latency constraints. All applications run natively on the remote machines, which can be virtualized and share the same physical GPU.
In a typical visualization scenario, a software application sends
a stream of graphics commands to a graphics adapter through
an input/output (I/O) interface. The graphics adapter renders
the data into pixels and outputs them to the local display as
a video signal. When using NICE DCV, the scene geometry and
graphics state are rendered on a central server, and pixels are
sent to one or more remote displays.
This approach requires the server to be equipped with one or
more GPUs, which are used for the OpenGL rendering, while the
client software can run on “thin” devices.
NICE DCV architecture consists of:
� a DCV Server, equipped with one or more GPUs, used for OpenGL rendering
� one or more DCV end stations, running on “thin clients”, only used for visualization
� heterogeneous networking infrastructures (like LAN, WAN and VPN), optimized to balance quality vs. frame rate
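Balancing quality against frame rate essentially means fitting each encoded frame into the per-frame bandwidth budget. The sketch below is a simplified illustration of that trade-off, not the actual DCV protocol; the bits-per-pixel table is a rough assumption for the example:

```python
def choose_quality(bandwidth_kbps, target_fps, frame_pixels):
    """Pick a JPEG-style quality level so that an encoded frame fits the
    per-frame bandwidth budget. The bits-per-pixel table is illustrative."""
    bits_per_pixel = {90: 2.0, 75: 1.0, 50: 0.5, 30: 0.25}
    budget_bits = bandwidth_kbps * 1000 / target_fps
    for quality in sorted(bits_per_pixel, reverse=True):
        if frame_pixels * bits_per_pixel[quality] <= budget_bits:
            return quality
    return min(bits_per_pixel)  # congested link: fall back to lowest quality

# Example: 50 Mbit/s link, 15 fps target, 1920x1080 frame.
print(choose_quality(bandwidth_kbps=50_000, target_fps=15,
                     frame_pixels=1920 * 1080))  # → 75
```

On a fast LAN the same logic selects the highest quality, which is why the protocol can serve both LAN and high-latency WAN users from one farm.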
NICE DCV Highlights:
� enables high-performance remote access to interactive 2D/3D software applications over low-bandwidth/high-latency networks
� supports multiple heterogeneous OSes (Windows, Linux)
� enables GPU sharing
� supports 3D acceleration for OpenGL applications running
on virtual machines
� Supports multiple user collaboration via session sharing
� Enables attractive Return-on-Investment through resource
sharing and consolidation to data centers (GPU, memory, CPU)
� Keeps the data secure in the data center, reducing data transfers and saving time
� Enables right sizing of system allocation based on user’s
dynamic needs
� Facilitates application deployment: all applications, updates
and patches are instantly available to everyone, without any
changes to original code
“The amount of data resulting from e.g. simulations in CAE or other
engineering environments can be in the Gigabyte range. It is obvious
that remote post-processing is one of the most urgent topics to be
tackled. NICE EnginFrame provides exactly that, and our customers
are impressed that such great technology enhances their workflow so
significantly.”
Robin Kienecker | HPC Sales Specialist
Business Benefits
The business benefits of adopting NICE DCV can be summarized into four categories:
Productivity
� Increase business efficiency
� Improve team performance by ensuring real-time collaboration with colleagues and partners, anywhere
� Reduce IT management costs by consolidating workstation resources to a single point-of-management
� Save money and time on application deployment
� Let users work from anywhere with an Internet connection

Business Continuity
� Move graphics processing and data to the datacenter, not the laptop/desktop
� Cloud-based platform support enables you to scale the visualization solution “on-demand” to extend business, grow new revenue, and manage costs

Data Security
� Guarantee secure and auditable use of remote resources (applications, data, infrastructure, licenses)
� Allow real-time collaboration with partners while protecting Intellectual Property and resources
� Restrict access by class of user, service, application, and resource

Training Effectiveness
� Enable multiple users to follow application procedures alongside an instructor in real time
� Enable collaboration and session sharing among remote users (employees, partners, and affiliates)
NICE DCV is perfectly integrated into EnginFrame Views, lever-
aging 2D/3D capabilities over the web, including the ability to
share an interactive session with other users for collaborative
working.
Cloud Computing With transtec and NICE
Cloud Computing is a style of computing in which dynamically
scalable and often virtualized resources are provided as a service
over the Internet. Users need not have knowledge of, expertise in,
or control over the technology infrastructure in the “cloud” that
supports them. The concept incorporates technologies that have
the common theme of reliance on the Internet for satisfying the
computing needs of the users. Cloud Computing services usually
provide applications online that are accessed from a web brows-
er, while the software and data are stored on the servers.
Companies or individuals engaging in Cloud Computing do not
own the physical infrastructure hosting the software platform
in question. Instead, they avoid capital expenditure by renting
usage from a third-party provider (except for the case of ‘Private
Cloud’ - see below). They consume resources as a service, paying only for the resources they use. Many Cloud Computing offerings have adopted the utility computing model, which
is analogous to how traditional utilities like electricity are con-
sumed, while others are billed on a subscription basis.
The main advantage offered by Cloud solutions is the reduction of infrastructure and maintenance costs.
By not owning the hardware and software, Cloud users avoid
capital expenditure by renting usage from a third-party provider.
Customers pay for only the resources they use.
The advantage for the provider of the Cloud, is that sharing com-
puting power among multiple tenants improves utilization rates,
as servers are not left idle, which can reduce costs and increase
efficiency.
Public cloud
Public cloud or external cloud describes cloud computing in the
traditional mainstream sense, whereby resources are dynam-
ically provisioned on a fine-grained, self-service basis over the
Internet, via web applications/web services, from an off-site
third-party provider who shares resources and bills on a fine-
grained utility computing basis.
Private cloud
Private cloud and internal cloud describe offerings deploying
cloud computing on private networks. These solutions aim to
“deliver some benefits of cloud computing without the pitfalls”,
capitalizing on data security, corporate governance, and relia-
bility concerns. On the other hand, users still have to buy, deploy,
manage, and maintain them, and as such do not benefit from
lower up-front capital costs and less hands-on management.
Architecture
The majority of cloud computing infrastructure, today, consists
of reliable services delivered through data centers and built on
servers with different levels of virtualization technologies. The
services are accessible anywhere that has access to networking
infrastructure. The Cloud appears as a single point of access
for all the computing needs of consumers. The offerings need to meet customers’ quality-of-service requirements and typically offer service level agreements, while at the same time moving beyond the limitations of traditional in-house infrastructure.
The architecture of the computing platform proposed by NICE
(fig. 1) differs from the others in some interesting ways:
� you can deploy it on an existing IT infrastructure, because it
is completely decoupled from the hardware infrastructure
� it has a high level of modularity and configurability, resulting
in being easily customizable for the user’s needs
� based on the NICE EnginFrame technology, it is easy to build
graphical web-based interfaces to provide several applica-
tions, as web services, without needing to code or compile
source programs
� it utilises the existing methodology in place for authentica-
tion and authorization
Further, because the NICE Cloud solution is built on advanced
IT technologies, including virtualization and workload manage-
ment, the execution platform is dynamically able to allocate,
monitor, and configure a new environment as needed by the
application, inside the Cloud infrastructure. The NICE platform
offers these important properties:
� Incremental Scalability: the quantity of computing and
storage resources, provisioned to the applications, changes
dynamically depending on the workload
� Reliability and Fault-Tolerance: because hardware resources are virtualized across multiple redundant hosts, the platform can adjust the resources applications need without disruption, even in the event of failures or crashes
� Service Level Agreement: the use of advanced systems for
the dynamic allocation of the resources, allows the guaran-
tee of service level, agreed across applications and services;
� Accountability: the continuous monitoring of the resources
used by each application (and user) allows the setup of
services that users can access in a pay-per-use mode, or by
subscribing to a specific contract. In the case of an Enterprise
Cloud, this feature allows costs to be shared among the cost
centers of the company.
transtec has long-term experience in engineering environments,
especially in the CAD/CAE sector. This allows us to provide
customers from this area with solutions that greatly enhance
their workflow and minimize time-to-result.
This, together with transtec’s offerings of all kinds of services,
allows our customers to fully focus on their productive work,
and have us do the environmental optimizations.
Remote Visualization
NVIDIA Grid Boards
Graphics accelerated virtual desktops and applications
NVIDIA GRID for large corporations offers the ability to offload
graphics processing from the CPU to the GPU in virtualised envi-
ronments, allowing the data center manager to deliver true PC
graphics-rich experiences to more users for the first time.
Benefits of NVIDIA GRID for IT:
� Leverage industry-leading remote visualization solutions like
NICE DCV
� Add your most graphics-intensive users to your virtual solu-
tion
� Improve the productivity of all users
Benefits of NVIDIA GRID for users:
� Highly responsive windows and rich multimedia experiences
� Access to all critical applications, including the most 3D-intensive
� Access from anywhere, on any device
NVIDIA GRID BoardsNVIDIA’s Kepler architecture-based GRID boards are specifically
designed to enable rich graphics in virtualised environments.
GPU Virtualisation
GRID boards feature NVIDIA Kepler-based GPUs that, for the first
time, allow hardware virtualisation of the GPU. This means mul-
tiple users can share a single GPU, improving user density while
providing true PC performance and compatibility.
Low-Latency Remote DisplayNVIDIA’s patented low-latency remote display technology great-
ly improves the user experience by reducing the lag that users
feel when interacting with their virtual machine.
With this technology, the virtual desktop screen is pushed
directly to the remoting protocol.
H.264 Encoding
The Kepler GPU includes a high-performance H.264 encoding
engine capable of encoding simultaneous streams with supe-
rior quality. This provides a giant leap forward in cloud server
efficiency by offloading the CPU from encoding functions and
allowing the encode function to scale with the number of GPUs
in a server.
Maximum User Density
GRID boards have an optimised multi-GPU design that helps to
maximise user density. The GRID K1 board features 4 GPUs and
16 GB of graphics memory, allowing it to support up to 100 users
on a single board.
Power Efficiency
GRID boards are designed to provide data center-class power efficiency, including the revolutionary new streaming multiprocessor, called “SMX”. The result is an innovative, proven solution that delivers revolutionary performance-per-watt for the enterprise data centre.
24/7 Reliability
GRID boards are designed, built, and tested by NVIDIA for 24/7
operation. Working closely with leading server vendors ensures
GRID cards perform optimally and reliably for the life of the sys-
tem.
Widest Range of Virtualisation Solutions
GRID cards enable GPU-capable virtualisation solutions like
XenServer or Linux KVM, delivering the flexibility to choose from
a wide range of proven solutions.
Technical Specifications

                          GRID K1 Board            GRID K2 Board
Number of GPUs            4 x entry Kepler GPUs    2 x high-end Kepler GPUs
Total NVIDIA CUDA cores   768                      3072
Total memory size         16 GB DDR3               8 GB GDDR5
Max power                 130 W                    225 W
Board length              10.5”                    10.5”
Board height              4.4”                     4.4”
Board width               Dual slot                Dual slot
Display IO                None                     None
Aux power                 6-pin connector          8-pin connector
PCIe                      x16                      x16
PCIe generation           Gen3 (Gen2 compatible)   Gen3 (Gen2 compatible)
Cooling solution          Passive                  Passive
Virtual GPU Technology
NVIDIA GRID vGPU brings the full benefit of NVIDIA hardware-ac-
celerated graphics to virtualized solutions. This technology pro-
vides exceptional graphics performance for virtual desktops
equivalent to local PCs when sharing a GPU among multiple users.
NVIDIA GRID vGPU is the industry’s most advanced technology
for sharing true GPU hardware acceleration between multiple vir-
tual desktops – without compromising the graphics experience.
Application features and compatibility are exactly the same as
they would be at the desk.
With GRID vGPU technology, the graphics commands of each vir-
tual machine are passed directly to the GPU, without translation
by the hypervisor. This allows the GPU hardware to be time-sliced
to deliver the ultimate in shared virtualised graphics performance.
vGPU Profiles Mean Customised, Dedicated Graphics Memory
Take advantage of vGPU Manager to assign just the right amount
of memory to meet the specific needs of each user. Every virtual
desktop has dedicated graphics memory, just like they would
at their desk, so they always have the resources they need to
launch and use their applications.
vGPU Manager enables up to eight users to share each physical
GPU, assigning the graphics resources of the available GPUs to
virtual machines in a balanced approach. Each NVIDIA GRID K1
graphics card has up to four GPUs, allowing 32 users to share a
single graphics card.
NVIDIA GRID     Virtual GPU  Graphics  Max Displays  Max Resolution  Max Users Per   Use Case
Graphics Board  Profile      Memory    Per User      Per Display     Graphics Board
GRID K2         K260Q        2,048 MB  4             2560×1600        4              Designer/Power User
                K240Q        1,024 MB  2             2560×1600        8              Designer/Power User
                K220Q        512 MB    2             2560×1600       16              Designer/Power User
                K200         256 MB    2             1900×1200       16              Knowledge Worker
GRID K1         K140Q        1,024 MB  2             2560×1600       16              Power User
                K120Q        512 MB    2             2560×1600       32              Power User
                K100         256 MB    2             1900×1200       32              Knowledge Worker

Up to 8 users supported per physical GPU, depending on vGPU profile.
Big Data Analytics with Hadoop
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run
on commodity hardware. It has many similarities with existing distributed file systems.
However, the differences from other distributed file systems are significant. HDFS is
highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS pro-
vides high throughput access to application data and is suitable for applications that
have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access
to file system data. HDFS was originally built as infrastructure for the Apache Nutch
web search engine project. HDFS is now an Apache Hadoop subproject.
Apache Hadoop
Overview
Apache Hadoop is an open-source software framework written
in Java for distributed storage and distributed processing of very
large data sets on computer clusters built from commodity hard-
ware. All the modules in Hadoop are designed with a fundamen-
tal assumption that hardware failures are common and should
be automatically handled by the framework.
The core of Apache Hadoop consists of a storage part, known as
Hadoop Distributed File System (HDFS), and a processing part
called MapReduce. Hadoop splits files into large blocks and dis-
tributes them across nodes in a cluster. To process data, Hadoop
transfers packaged code for nodes to process in parallel based
on the data that needs to be processed. This approach takes ad-
vantage of data locality – nodes manipulating the data they have
access to – to allow the dataset to be processed faster and more
efficiently than it would be in a more conventional supercomput-
er architecture that relies on a parallel file system where compu-
tation and data are distributed via high-speed networking.
The base Apache Hadoop framework is composed of the follow-
ing modules:
� Hadoop Common – contains libraries and utilities needed by
other Hadoop modules;
� Hadoop Distributed File System (HDFS) – a distributed
file-system that stores data on commodity machines,
providing very high aggregate bandwidth across the cluster;
� Hadoop YARN – a resource-management platform responsi-
ble for managing computing resources in clusters and using
them for scheduling of users’ applications; and
� Hadoop MapReduce – an implementation of the MapReduce
programming model for large scale data processing.
The term Hadoop has come to refer not just to the base mod-
ules above, but also to the ecosystem, or collection of additional
software packages that can be installed on top of or alongside
Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache
Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala,
Apache Flume, Apache Sqoop, Apache Oozie, Apache Storm.
Apache Hadoop’s MapReduce and HDFS components were in-
spired by Google papers on their MapReduce and Google File
System.
The Hadoop framework itself is mostly written in the Java pro-
gramming language, with some native code in C and command
line utilities written as shell scripts. Though MapReduce Java
code is common, any programming language can be used with
“Hadoop Streaming” to implement the “map” and “reduce” parts
of the user’s program. Other projects in the Hadoop ecosystem
expose richer user interfaces.
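The Streaming interface mentioned above can be illustrated with the canonical word-count job. The sketch below is illustrative only — the whitespace tokenizer and the in-process demo are our assumptions — but the tab-separated key/value convention is the one Hadoop Streaming uses between phases:

```python
from itertools import groupby

def mapper(lines):
    """Map step: emit one tab-separated '<word>\t1' record per word,
    the key/value format Hadoop Streaming expects on stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(records):
    """Reduce step: Streaming hands the reducer its input sorted by key,
    so records for the same word arrive consecutively and can be summed."""
    pairs = (rec.split("\t") for rec in records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Tiny in-process demo; under Streaming each phase would instead read
    # stdin and write stdout, with the framework sorting in between.
    for out in reducer(sorted(mapper(["to be or not to be"]))):
        print(out)
```

Under Streaming the two functions would be submitted as separate executables, wired together via the hadoop-streaming JAR's mapper and reducer options.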
Architecture
Hadoop consists of the Hadoop Common package, which pro-
vides filesystem and OS level abstractions, a MapReduce engine
(either MapReduce/MR1 or YARN/MR2) and the Hadoop Distrib-
uted File System (HDFS). The Hadoop Common package contains
the necessary Java ARchive (JAR) files and scripts needed to start
Hadoop.
For effective scheduling of work, every Hadoop-compatible file
system should provide location awareness: the name of the rack
(more precisely, of the network switch) where a worker node is.
Hadoop applications can use this information to execute code
on the node where the data is, and, failing that, on the same rack/
switch to reduce backbone traffic. HDFS uses this method when
replicating data for data redundancy across multiple racks. This
approach reduces the impact of a rack power outage or switch
failure; if one of these hardware failures occurs, the data will re-
main available.
A small Hadoop cluster includes a single master and multiple
worker nodes. The master node consists of a Job Tracker, Task
Tracker, NameNode, and DataNode. A slave or worker node acts
as both a DataNode and TaskTracker, though it is possible to have
data-only worker nodes and compute-only worker nodes. These
are normally used only in nonstandard applications.
Hadoop requires Java Runtime Environment (JRE) 1.6 or higher.
The standard startup and shutdown scripts require that Secure
Shell (ssh) be set up between nodes in the cluster.
In a larger cluster, HDFS nodes are managed through a dedicated
NameNode server to host the file system index, and a second-
ary NameNode that can generate snapshots of the namenode’s
memory structures, thereby preventing file-system corruption
and loss of data. Similarly, a standalone JobTracker server can
manage job scheduling across nodes. When Hadoop MapReduce
is used with an alternate file system, the NameNode, secondary
NameNode, and DataNode architecture of HDFS are replaced by
the file-system-specific equivalents.
[Figure: A multi-node Hadoop cluster — the master node runs the jobtracker (MapReduce layer) and namenode (HDFS layer); each slave node runs a tasktracker and a datanode.]
Hadoop distributed file system (HDFS)
The Hadoop distributed file system (HDFS) is a distributed, scal-
able, and portable file-system written in Java for the Hadoop
framework. A Hadoop cluster has nominally a single namenode
plus a cluster of datanodes, although redundancy options are
available for the namenode due to its criticality. Each datanode
serves up blocks of data over the network using a block protocol
specific to HDFS. The file system uses TCP/IP sockets for commu-
nication. Clients use remote procedure call (RPC) to communi-
cate between each other.
HDFS stores large files (typically in the range of gigabytes to
terabytes) across multiple machines. It achieves reliability by
replicating the data across multiple hosts, and hence theoreti-
cally does not require RAID storage on hosts (but to increase I/O
performance some RAID configurations are still useful). With the
default replication value, 3, data is stored on three nodes: two on
the same rack, and one on a different rack. Data nodes can talk to
each other to rebalance data, to move copies around, and to keep
the replication of data high. HDFS is not fully POSIX-compliant,
because the requirements for a POSIX file-system differ from the
target goals for a Hadoop application. The trade-off of not having
a fully POSIX-compliant file-system is increased performance for
data throughput and support for non-POSIX operations such as
Append.
HDFS added high-availability capabilities, announced for release
2.0 in May 2012, letting the main metadata server (the NameNode)
fail over manually to a backup. The project has also started
developing automatic fail-over.
The HDFS file system includes a so-called secondary namenode,
a misleading name that some might incorrectly interpret as a
backup namenode for when the primary namenode goes offline.
In fact, the secondary namenode regularly connects with the
primary namenode and builds snapshots of the primary nameno-
de’s directory information, which the system then saves to local
or remote directories. These checkpointed images can be used
to restart a failed primary namenode without having to replay
the entire journal of file-system actions, then to edit the log to
create an up-to-date directory structure. Because the namenode
is the single point for storage and management of metadata, it
can become a bottleneck for supporting a huge number of files,
especially a large number of small files. HDFS Federation, a new
addition, aims to tackle this problem to a certain extent by allow-
ing multiple namespaces served by separate namenodes.
An advantage of using HDFS is data awareness between the job
tracker and task tracker. The job tracker schedules map or reduce
jobs to task trackers with an awareness of the data location. For
example: if node A contains data (x,y,z) and node B contains data
(a,b,c), the job tracker schedules node B to perform map or reduce
tasks on (a,b,c) and node A would be scheduled to perform map
or reduce tasks on (x,y,z). This reduces the amount of traffic that
goes over the network and prevents unnecessary data transfer.
When Hadoop is used with other file systems, this advantage is
not always available. This can have a significant impact on job-
completion times, which has been demonstrated when running
data-intensive jobs.
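The node-A/node-B example can be written out as a toy placement routine. This is purely illustrative — a real job tracker also weighs rack distance and slot availability — and the node and task names are hypothetical:

```python
def schedule_by_locality(block_locations, tasks):
    """Toy data-aware scheduler: assign each task to a node that holds
    its input block locally; None when no node holds the block."""
    assignment = {}
    for task, block in tasks.items():
        holders = [node for node, blocks in block_locations.items()
                   if block in blocks]
        assignment[task] = holders[0] if holders else None
    return assignment

# The example from the text: node A holds (x, y, z), node B holds (a, b, c).
locations = {"A": {"x", "y", "z"}, "B": {"a", "b", "c"}}
tasks = {"task1": "a", "task2": "x"}
```

With these inputs, task1 lands on node B and task2 on node A, so no input block ever crosses the network.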
HDFS was designed for mostly immutable files and may not be
suitable for systems requiring concurrent write-operations.
HDFS can be mounted directly with a Filesystem in Userspace
(FUSE) virtual file system on Linux and some other Unix systems.
File access can be achieved through the native Java application
programming interface (API), the Thrift API to generate a client in
the language of the users’ choosing (C++, Java, Python, PHP, Ruby,
Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml), the com-
mand-line interface, browsed through the HDFS-UI Web applica-
tion (webapp) over HTTP, or via 3rd-party network client libraries.
JobTracker and TaskTracker: the MapReduce Engine
Above the file systems comes the MapReduce Engine, which con-
sists of one JobTracker, to which client applications submit Ma-
pReduce jobs. The JobTracker pushes work out to available Task-
Tracker nodes in the cluster, striving to keep the work as close to
the data as possible. With a rack-aware file system, the JobTracker
knows which node contains the data, and which other machines
are nearby. If the work cannot be hosted on the actual node
where the data resides, priority is given to nodes in the same
rack. This reduces network traffic on the main backbone net-
work. If a TaskTracker fails or times out, that part of the job is re-
scheduled. The TaskTracker on each node spawns a separate Java
Virtual Machine process to prevent the TaskTracker itself from
failing if the running job crashes its JVM. A heartbeat is sent from
the TaskTracker to the JobTracker every few minutes to check its
status. The Job Tracker and TaskTracker status and information is
exposed by Jetty and can be viewed from a web browser.
Known limitations of this approach are:
� The allocation of work to TaskTrackers is very simple. Every
TaskTracker has a number of available slots (such as “4 slots”).
Every active map or reduce task takes up one slot. The Job
Tracker allocates work to the tracker nearest to the data with
an available slot. There is no consideration of the current
system load of the allocated machine, and hence its actual
availability.
� If one TaskTracker is very slow, it can delay the entire Ma-
pReduce job – especially towards the end of a job, where
everything can end up waiting for the slowest task. With
speculative execution enabled, however, a single task can be
executed on multiple slave nodes.
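The slot-based allocation described in the first limitation can be sketched as follows. The tracker names and the distance callable are hypothetical; the point of the sketch is what is missing — the current system load of the chosen machine plays no role:

```python
def allocate_task(block, trackers, distance):
    """Slot-based allocation: pick the tracker with a free slot that is
    nearest to the block's data, then consume one of its slots.
    Machine load is deliberately not considered, mirroring the text."""
    candidates = [t for t, info in trackers.items() if info["free_slots"] > 0]
    if not candidates:
        return None  # every slot is occupied; the task must wait
    chosen = min(candidates, key=lambda t: distance(t, block))
    trackers[chosen]["free_slots"] -= 1
    return chosen
```

A natural distance convention would be 0 for the node holding the block, 1 for the same rack, and 2 for off-rack, matching the rack-aware preference described above.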
Scheduling
By default Hadoop uses FIFO scheduling, and optionally 5 sched-
uling priorities to schedule jobs from a work queue. In version
0.19 the job scheduler was refactored out of the JobTracker, while
adding the ability to use an alternate scheduler (such as the Fair
scheduler or the Capacity scheduler, described next).
Fair scheduler
The fair scheduler was developed by Facebook. The goal of the
fair scheduler is to provide fast response times for small jobs and
QoS for production jobs. The fair scheduler has three basic con-
cepts.
1. Jobs are grouped into pools.
2. Each pool is assigned a guaranteed minimum share.
3. Excess capacity is split between jobs.
By default, jobs that are uncategorized go into a default pool.
Pools have to specify the minimum number of map slots, reduce
slots, and a limit on the number of running jobs.
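The three fair-scheduler concepts can be condensed into a single allocation function. This is a simplified sketch under two assumptions of ours: the guaranteed minimums fit within total capacity, and map slots, reduce slots, and running-job limits are folded into one slot count:

```python
def fair_shares(total_slots, pools):
    """Give each pool its guaranteed minimum (capped by its demand),
    then hand out the excess capacity one slot at a time among pools
    that still have unmet demand."""
    shares = {name: min(p["min_share"], p["demand"]) for name, p in pools.items()}
    left = total_slots - sum(shares.values())
    active = [n for n in pools if shares[n] < pools[n]["demand"]]
    while left > 0 and active:
        for name in list(active):
            if left == 0:
                break
            shares[name] += 1
            left -= 1
            if shares[name] >= pools[name]["demand"]:
                active.remove(name)
    return shares
```

For example, with 10 slots, a production pool (minimum 4, demand 6) and an ad-hoc pool (minimum 2, demand 8) end up with 6 and 4 slots respectively: minimums first, then the excess split evenly.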
Capacity scheduler
The capacity scheduler was developed by Yahoo. The capacity
scheduler supports several features that are similar to the fair
scheduler.
� Queues are allocated a fraction of the total resource capacity.
� Free resources are allocated to queues beyond their total
capacity.
� Within a queue, a job with a high level of priority has access
to the queue’s resources.
There is no preemption once a job is running.
Other applications
The HDFS file system is not restricted to MapReduce jobs. It can
be used for other applications, many of which are under devel-
opment at Apache. The list includes the HBase database, the
Apache Mahout machine learning system, and the Apache Hive
Data Warehouse system. Hadoop can in theory be used for any
sort of work that is batch-oriented rather than real-time, is very
data-intensive, and benefits from parallel processing of data.
It can also be used to complement a real-time system, such as
lambda architecture.
As of October 2009, commercial applications of Hadoop included:
� Log and/or clickstream analysis of various kinds
� Marketing analytics
� Machine learning and/or sophisticated data mining
� Image processing
� Processing of XML messages
�Web crawling and/or text processing
� General archiving, including of relational/tabular data, e.g.
for compliance
HDFS Architecture
Assumptions and Goals
Hardware Failure
Hardware failure is the norm rather than the exception. An HDFS
instance may consist of hundreds or thousands of server ma-
chines, each storing part of the file system’s data. The fact that
there are a huge number of components and that each compo-
nent has a non-trivial probability of failure means that some
component of HDFS is always non-functional. Therefore, detec-
tion of faults and quick, automatic recovery from them is a core
architectural goal of HDFS.
Streaming Data Access
Applications that run on HDFS need streaming access to their
data sets. They are not general purpose applications that typi-
cally run on general purpose file systems. HDFS is designed more
for batch processing rather than interactive use by users. The
emphasis is on high throughput of data access rather than low
latency of data access. POSIX imposes many hard requirements
that are not needed for applications that are targeted for HDFS.
POSIX semantics in a few key areas have been traded to increase
data throughput rates.
Large Data Sets
Applications that run on HDFS have large data sets. A typical file
in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to
support large files. It should provide high aggregate data band-
width and scale to hundreds of nodes in a single cluster. It should
support tens of millions of files in a single instance.
Simple Coherency Model
HDFS applications need a write-once-read-many access model
for files. A file once created, written, and closed need not be
changed. This assumption simplifies data coherency issues and
enables high throughput data access. A MapReduce application
or a web crawler application fits perfectly with this model. There
is a plan to support appending-writes to files in the future.
“Moving Computation is Cheaper than Moving Data”
A computation requested by an application is much more effi-
cient if it is executed near the data it operates on. This is espe-
cially true when the size of the data set is huge. This minimizes
network congestion and increases the overall throughput of the
system. The assumption is that it is often better to migrate the
computation closer to where the data is located rather than mov-
ing the data to where the application is running. HDFS provides
interfaces for applications to move themselves closer to where
the data is located.
Portability Across Heterogeneous Hardware and
Software Platforms
HDFS has been designed to be easily portable from one platform
to another. This facilitates widespread adoption of HDFS as a
platform of choice for a large set of applications.
NameNode and DataNodes
HDFS has a master/slave architecture. An HDFS cluster con-
sists of a single NameNode, a master server that manages the
file system namespace and regulates access to files by clients.
In addition, there are a number of DataNodes, usually one per
node in the cluster, which manage storage attached to the nodes
that they run on. HDFS exposes a file system namespace and al-
lows user data to be stored in files. Internally, a file is split into one
or more blocks and these blocks are stored in a set of DataNodes.
The NameNode executes file system namespace operations like
opening, closing, and renaming files and directories. It also de-
termines the mapping of blocks to DataNodes. The DataNodes
are responsible for serving read and write requests from the file
system’s clients. The DataNodes also perform block creation, de-
letion, and replication upon instruction from the NameNode.
The NameNode and DataNode are pieces of software designed
to run on commodity machines. These machines typically run
a GNU/Linux operating system (OS). HDFS is built using the Java
language; any machine that supports Java can run the NameNo-
de or the DataNode software. Usage of the highly portable Java
language means that HDFS can be deployed on a wide range of
machines. A typical deployment has a dedicated machine that
runs only the NameNode software. Each of the other machines in
the cluster runs one instance of the DataNode software. The ar-
chitecture does not preclude running multiple DataNodes on the
same machine but in a real deployment that is rarely the case.
[Figure: HDFS Architecture — clients send metadata operations to the NameNode (metadata: name, replicas, ...; e.g. /home/foo/data, 3, ...) and read or write blocks directly to DataNodes in Rack 1 and Rack 2; blocks are replicated across racks.]
The existence of a single NameNode in a cluster greatly simplifies
the architecture of the system. The NameNode is the arbitrator
and repository for all HDFS metadata. The system is designed in
such a way that user data never flows through the NameNode.
The File System Namespace
HDFS supports a traditional hierarchical file organization. A user
or an application can create directories and store files inside
these directories. The file system namespace hierarchy is similar
to most other existing file systems; one can create and remove
files, move a file from one directory to another, or rename a file.
HDFS does not yet implement user quotas. HDFS does not sup-
port hard links or soft links. However, the HDFS architecture does
not preclude implementing these features.
The NameNode maintains the file system namespace. Any change
to the file system namespace or its properties is recorded by the
NameNode. An application can specify the number of replicas of
a file that should be maintained by HDFS. The number of copies
of a file is called the replication factor of that file. This informa-
tion is stored by the NameNode.
Data Replication
HDFS is designed to reliably store very large files across machines
in a large cluster. It stores each file as a sequence of blocks; all
blocks in a file except the last block are the same size. The blocks
of a file are replicated for fault tolerance. The block size and rep-
lication factor are configurable per file. An application can spec-
ify the number of replicas of a file. The replication factor can be
specified at file creation time and can be changed later. Files in
HDFS are write-once and have strictly one writer at any time.
The NameNode makes all decisions regarding replication of
blocks. It periodically receives a Heartbeat and a Blockreport
from each of the DataNodes in the cluster. Receipt of a Heartbeat
implies that the DataNode is functioning properly. A Blockreport
contains a list of all blocks on a DataNode.
Replica Placement: The First Baby Steps
The placement of replicas is critical to HDFS reliability and per-
formance. Optimizing replica placement distinguishes HDFS
from most other distributed file systems. This is a feature that
needs lots of tuning and experience. The purpose of a rack-aware
replica placement policy is to improve data reliability, availabil-
ity, and network bandwidth utilization. The current implemen-
tation for the replica placement policy is a first effort in this di-
rection. The short-term goals of implementing this policy are to
validate it on production systems, learn more about its behavior,
and build a foundation to test and research more sophisticated
policies.
Large HDFS instances run on a cluster of computers that com-
monly spread across many racks. Communication between two
nodes in different racks has to go through switches. In most
cases, network bandwidth between machines in the same rack is
greater than network bandwidth between machines in different
racks.
The NameNode determines the rack id each DataNode belongs
to via the process outlined in Hadoop Rack Awareness. A simple
but non-optimal policy is to place replicas on unique racks. This
prevents losing data when an entire rack fails and allows use of
bandwidth from multiple racks when reading data. This policy
evenly distributes replicas in the cluster which makes it easy to
[Figure: Block Replication — NameNode metadata (filename, numReplicas, block-ids, ...): /users/sameerp/data/part-0, r:2, {1,3}, ...; /users/sameerp/data/part-1, r:3, {2,4,5}, ...; blocks spread across the DataNodes.]
balance load on component failure. However, this policy increas-
es the cost of writes because a write needs to transfer blocks to
multiple racks.
For the common case, when the replication factor is three,
HDFS’s placement policy is to put one replica on one node in the
local rack, another on a node in a different (remote) rack, and the
last on a different node in the same remote rack. This policy cuts
the inter-rack write traffic which generally improves write per-
formance. The chance of rack failure is far less than that of node
failure; this policy does not impact data reliability and availabil-
ity guarantees. However, it does reduce the aggregate network
bandwidth used when reading data since a block is placed in
only two unique racks rather than three. With this policy, the rep-
licas of a file do not evenly distribute across the racks. One third
of replicas are on one node, two thirds of replicas are on one rack,
and the other third are evenly distributed across the remaining
racks. This policy improves write performance without compro-
mising data reliability or read performance.
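The default three-replica policy just described can be sketched in a few lines. The sketch assumes every remote rack offers at least two nodes; a real NameNode applies further constraints when choosing individual nodes:

```python
import random

def place_replicas(racks, writer_rack):
    """Default policy for replication factor 3: one replica on a node in
    the writer's local rack, and two replicas on two distinct nodes of a
    single remote rack, cutting inter-rack write traffic."""
    local_node = random.choice(racks[writer_rack])
    remote_rack = random.choice([r for r in racks if r != writer_rack])
    remote_nodes = random.sample(racks[remote_rack], 2)  # two distinct nodes
    return [(writer_rack, local_node)] + [(remote_rack, n) for n in remote_nodes]
```

Note that the result touches only two racks rather than three, which is exactly why aggregate read bandwidth is slightly reduced while write performance improves.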
The current, default replica placement policy described here is a
work in progress.
Replica Selection
To minimize global bandwidth consumption and read latency,
HDFS tries to satisfy a read request from a replica that is clos-
est to the reader. If there exists a replica on the same rack as the
reader node, then that replica is preferred to satisfy the read re-
quest. If an HDFS cluster spans multiple data centers, then a
replica that is resident in the local data center is preferred over
any remote replica.
Safemode
On startup, the NameNode enters a special state called Safem-
ode. Replication of data blocks does not occur when the NameNo-
de is in the Safemode state. The NameNode receives Heartbeat
and Blockreport messages from the DataNodes. A Blockreport
contains the list of data blocks that a DataNode is hosting. Each
block has a specified minimum number of replicas. A block is con-
sidered safely replicated when the minimum number of replicas
of that data block has checked in with the NameNode. After a
configurable percentage of safely replicated data blocks checks
in with the NameNode (plus an additional 30 seconds), the Na-
meNode exits the Safemode state. It then determines the list of
data blocks (if any) that still have fewer than the specified num-
ber of replicas. The NameNode then replicates these blocks to
other DataNodes.
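The Safemode exit criterion and the follow-up re-replication check can be expressed compactly. The 30-second grace period is omitted, and the block names and replica counts are illustrative:

```python
def can_exit_safemode(reported_replicas, min_replicas, threshold):
    """Leave Safemode once the fraction of blocks whose reported replica
    count has reached `min_replicas` meets the configured threshold."""
    if not reported_replicas:
        return True
    safe = sum(1 for n in reported_replicas.values() if n >= min_replicas)
    return safe / len(reported_replicas) >= threshold

def under_replicated(reported_replicas, target_factor):
    """Blocks the NameNode must replicate to other DataNodes after
    Safemode ends, because they have fewer copies than specified."""
    return [b for b, n in reported_replicas.items() if n < target_factor]
```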
The Persistence of File System Metadata
The HDFS namespace is stored by the NameNode. The NameNode
uses a transaction log called the EditLog to persistently record
every change that occurs to file system metadata. For example,
creating a new file in HDFS causes the NameNode to insert a re-
cord into the EditLog indicating this. Similarly, changing the rep-
lication factor of a file causes a new record to be inserted into the
EditLog. The NameNode uses a file in its local host OS file system
to store the EditLog. The entire file system namespace, includ-
ing the mapping of blocks to files and file system properties, is
stored in a file called the FsImage. The FsImage is stored as a file
in the NameNode’s local file system too.
The NameNode keeps an image of the entire file system
namespace and file Blockmap in memory. This key metadata item
is designed to be compact, such that a NameNode with 4 GB of
RAM is plenty to support a huge number of files and directories.
When the NameNode starts up, it reads the FsImage and EditLog
from disk, applies all the transactions from the EditLog to the in-
memory representation of the FsImage, and flushes out this new
version into a new FsImage on disk. It can then truncate the old
EditLog because its transactions have been applied to the persis-
tent FsImage. This process is called a checkpoint. In the current
implementation, a checkpoint only occurs when the NameNode
starts up. Work is in progress to support periodic checkpointing
in the near future.
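The checkpoint described above can be modelled in a few lines. This is a toy model of our own in which the namespace is reduced to a path-to-replication-factor map; the real FsImage also holds the block mapping and file system properties:

```python
def checkpoint(fsimage, editlog):
    """Replay every EditLog transaction over the FsImage and return the
    new image; afterwards the old EditLog can safely be truncated."""
    namespace = dict(fsimage)  # in-memory copy of the persisted image
    for op, path, arg in editlog:
        if op == "create":
            namespace[path] = arg            # arg: replication factor
        elif op == "set_replication":
            namespace[path] = arg
        elif op == "delete":
            namespace.pop(path, None)
    return namespace
```

Restarting a failed NameNode from such a checkpointed image avoids replaying the entire journal of file-system actions, as the text explains for the secondary namenode's snapshots.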
The DataNode stores HDFS data in files in its local file system.
The DataNode has no knowledge about HDFS files. It stores each
block of HDFS data in a separate file in its local file system. The
DataNode does not create all files in the same directory. Instead,
it uses a heuristic to determine the optimal number of files per
directory and creates subdirectories appropriately. It is not op-
timal to create all local files in the same directory because the
local file system might not be able to efficiently support a huge
number of files in a single directory. When a DataNode starts up,
it scans through its local file system, generates a list of all HDFS
data blocks that correspond to each of these local files and sends
this report to the NameNode: this is the Blockreport.
The Communication Protocols
All HDFS communication protocols are layered on top of the TCP/
IP protocol. A client establishes a connection to a configurable
TCP port on the NameNode machine. It talks the ClientProtocol
with the NameNode. The DataNodes talk to the NameNode using
the DataNode Protocol. A Remote Procedure Call (RPC) abstrac-
tion wraps both the Client Protocol and the DataNode Protocol.
By design, the NameNode never initiates any RPCs. Instead, it
only responds to RPC requests issued by DataNodes or clients.
Robustness
The primary objective of HDFS is to store data reliably even in
the presence of failures. The three common types of failures are
NameNode failures, DataNode failures and network partitions.
Data Disk Failure, Heartbeats and Re-Replication
Each DataNode sends a Heartbeat message to the NameNode pe-
riodically. A network partition can cause a subset of DataNodes
to lose connectivity with the NameNode. The NameNode detects
this condition by the absence of a Heartbeat message. The Na-
meNode marks DataNodes without recent Heartbeats as dead
and does not forward any new IO requests to them. Any data that
was registered to a dead DataNode is not available to HDFS any
more. DataNode death may cause the replication factor of some
blocks to fall below their specified value. The NameNode con-
stantly tracks which blocks need to be replicated and initiates
replication whenever necessary. The necessity for re-replication
may arise due to many reasons: a DataNode may become unavail-
able, a replica may become corrupted, a hard disk on a DataNode
may fail, or the replication factor of a file may be increased.
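The detection-and-re-replication logic can be sketched as follows. The timeout value and the data layout are illustrative assumptions, not HDFS's actual configuration:

```python
# Sketch of heartbeat-based failure detection: DataNodes without a recent
# heartbeat are marked dead, and blocks whose live replica count falls
# below the target are queued for re-replication. The timeout and the
# data layout here are illustrative assumptions.

DEAD_AFTER = 600  # seconds without a heartbeat (assumed value)

def dead_nodes(last_heartbeat, now):
    return {n for n, t in last_heartbeat.items() if now - t > DEAD_AFTER}

def under_replicated(block_locations, target, dead):
    """Return blocks with fewer live replicas than the target factor."""
    return [b for b, nodes in block_locations.items()
            if len(set(nodes) - dead) < target]

beats = {"dn1": 1000, "dn2": 990, "dn3": 250, "dn4": 980}
dead = dead_nodes(beats, now=1000)            # dn3 missed its heartbeats
blocks = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn1", "dn2", "dn4"]}
todo = under_replicated(blocks, target=3, dead=dead)
```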
Cluster Rebalancing
The HDFS architecture is compatible with data rebalancing
schemes. A scheme might automatically move data from one
DataNode to another if the free space on a DataNode falls below
a certain threshold. In the event of a sudden high demand for a
particular file, a scheme might dynamically create additional rep-
licas and rebalance other data in the cluster. These types of data
rebalancing schemes are not yet implemented.
Data Integrity
It is possible that a block of data fetched from a DataNode ar-
rives corrupted. This corruption can occur because of faults in a
storage device, network faults, or buggy software. The HDFS cli-
ent software implements checksum checking on the contents
of HDFS files. When a client creates an HDFS file, it computes a
checksum of each block of the file and stores these checksums in
a separate hidden file in the same HDFS namespace. When a cli-
ent retrieves file contents it verifies that the data it received from
each DataNode matches the checksum stored in the associated
checksum file. If not, then the client can opt to retrieve that block
from another DataNode that has a replica of that block.
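The checksum scheme can be sketched like this; CRC32, the tiny block size, and the helper names are illustrative stand-ins for what HDFS keeps in its hidden checksum file:

```python
# Sketch of HDFS-style client-side block checksumming: compute a checksum
# per block at write time, store the checksums separately, and verify
# every block on read. CRC32 and the 4-byte block size are illustrative;
# a real HDFS block is tens of megabytes.
import zlib

BLOCK = 4  # tiny block size for the example

def block_checksums(data):
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    return [zlib.crc32(b) for b in blocks]

def verify(data, stored):
    """Return indices of blocks whose checksum no longer matches."""
    return [i for i, c in enumerate(block_checksums(data)) if c != stored[i]]

payload = b"hello hdfs!!"
sums = block_checksums(payload)              # stored alongside the file
corrupted = b"hXllo hdfs!!"                  # bit-rot in block 0
bad = verify(corrupted, sums)
```

A client that finds a mismatching block would, as the text describes, fetch that block again from a DataNode holding another replica.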
Metadata Disk Failure
The FsImage and the EditLog are central data structures of HDFS.
A corruption of these files can cause the HDFS instance to be
non-functional. For this reason, the NameNode can be config-
ured to support maintaining multiple copies of the FsImage and
EditLog. Any update to either the FsImage or EditLog causes each
of the FsImages and EditLogs to get updated synchronously. This
synchronous updating of multiple copies of the FsImage and Ed-
itLog may degrade the rate of namespace transactions per sec-
ond that a NameNode can support. However, this degradation
is acceptable because even though HDFS applications are very
data intensive in nature, they are not metadata intensive. When
a NameNode restarts, it selects the latest consistent FsImage and
EditLog to use.
The NameNode machine is a single point of failure for an HDFS
cluster. If the NameNode machine fails, manual intervention is
necessary. Currently, automatic restart and failover of the Na-
meNode software to another machine is not supported.
Snapshots
Snapshots support storing a copy of data at a particular instant
of time. One usage of the snapshot feature may be to roll back
a corrupted HDFS instance to a previously known good point in
time. HDFS does not currently support snapshots but will in a fu-
ture release.
Data Organization
Data Blocks
HDFS is designed to support very large files. Applications that are
compatible with HDFS are those that deal with large data sets.
These applications write their data only once but they read it one
or more times and require these reads to be satisfied at stream-
ing speeds. HDFS supports write-once-read-many semantics on
files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file
is chopped up into 64 MB chunks, and if possible, each chunk will
reside on a different DataNode.
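Chopping a file into block-sized chunks is straightforward to sketch; the round-robin placement below is a deliberate simplification of HDFS's real placement policy:

```python
# Sketch of splitting a file into fixed-size HDFS blocks and assigning
# each chunk to a different DataNode where possible. Round-robin
# placement is an illustrative simplification.

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical HDFS block size

def split_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering the file."""
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]

def place(blocks, datanodes):
    return {i: datanodes[i % len(datanodes)] for i in range(len(blocks))}

blocks = split_blocks(200 * 1024 * 1024)     # a 200 MB file
layout = place(blocks, ["dn1", "dn2", "dn3", "dn4"])
```

A 200 MB file thus yields three full 64 MB blocks plus one 8 MB tail block, each placed on a different DataNode.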
Staging
A client request to create a file does not reach the NameNode im-
mediately. In fact, initially the HDFS client caches the file data
into a temporary local file. Application writes are transparently
redirected to this temporary local file. When the local file accu-
mulates data worth over one HDFS block size, the client contacts
the NameNode. The NameNode inserts the file name into the file
system hierarchy and allocates a data block for it. The NameNode
responds to the client request with the identity of the DataNode
and the destination data block. Then the client flushes the block
of data from the local temporary file to the specified DataNode.
When a file is closed, the remaining un-flushed data in the tempo-
rary local file is transferred to the DataNode. The client then tells
the NameNode that the file is closed. At this point, the NameNode
commits the file creation operation into a persistent store. If the
NameNode dies before the file is closed, the file is lost.
The above approach has been adopted after careful consider-
ation of target applications that run on HDFS. These applications
need streaming writes to files. If a client writes to a remote file
directly without any client side buffering, the network speed
and the congestion in the network impacts throughput consider-
ably. This approach is not without precedent. Earlier distributed
file systems, e.g. AFS, have used client side caching to improve
performance. A POSIX requirement has been relaxed to achieve
higher performance of data uploads.
Replication Pipelining
When a client is writing data to an HDFS file, its data is first writ-
ten to a local file as explained in the previous section. Suppose
the HDFS file has a replication factor of three. When the local
file accumulates a full block of user data, the client retrieves a
list of DataNodes from the NameNode. This list contains the
DataNodes that will host a replica of that block. The client then
flushes the data block to the first DataNode. The first DataNode
starts receiving the data in small portions (4 KB), writes each
portion to its local repository and transfers that portion to the
second DataNode in the list. The second DataNode, in turn starts
receiving each portion of the data block, writes that portion to its
repository and then flushes that portion to the third DataNode.
Finally, the third DataNode writes the data to its local reposi-
tory. Thus, a DataNode can be receiving data from the previous
one in the pipeline and at the same time forwarding data to the
next one in the pipeline. Thus, the data is pipelined from one
DataNode to the next.
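The pipeline can be modeled in a few lines; this is a sequential simulation of the forwarding scheme, not the concurrent implementation:

```python
# Sequential simulation of the replication pipeline: the client streams a
# block in small portions, and each DataNode writes every portion locally
# before forwarding it to the next node. The 4 KB portion size matches
# the text; the node model itself is illustrative.

PORTION = 4 * 1024  # 4 KB portions

def pipeline_write(block, nodes):
    """Stream `block` through `nodes`; return each node's local copy."""
    stored = {n: bytearray() for n in nodes}
    for off in range(0, len(block), PORTION):
        portion = block[off:off + PORTION]
        for n in nodes:                # each node writes, then forwards
            stored[n] += portion
    return stored

data = bytes(10_000)                   # just under three portions
copies = pipeline_write(data, ["dn1", "dn2", "dn3"])
```

In the real system the per-portion forwarding overlaps in time, which is exactly what lets a DataNode receive from its predecessor while sending to its successor.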
Space Reclamation
File Deletes and Undeletes
When a file is deleted by a user or an application, it is not im-
mediately removed from HDFS. Instead, HDFS first renames it
to a file in the /trash directory. The file can be restored quickly
as long as it remains in /trash. A file remains in /trash for a con-
figurable amount of time. After the expiry of its life in /trash, the
NameNode deletes the file from the HDFS namespace. The dele-
tion of a file causes the blocks associated with the file to be freed.
Note that there could be an appreciable time delay between the
time a file is deleted by a user and the time of the corresponding
increase in free space in HDFS.
A user can undelete a file after deleting it as long as it remains
in the /trash directory: to restore a deleted file, the user simply
navigates the /trash directory and retrieves it. The /trash
directory contains only the latest copy
of the file that was deleted. The /trash directory is just like any
other directory with one special feature: HDFS applies specified
policies to automatically delete files from this directory. The cur-
rent default policy is to delete files from /trash that are more
than 6 hours old. In the future, this policy will be configurable
through a well defined interface.
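The expiry policy sketches naturally as a timestamped rename plus a periodic sweep; the data model here is illustrative:

```python
# Sketch of the /trash expiry policy: deletion renames the file into
# /trash with a timestamp, and a sweep permanently removes entries older
# than the retention window. The 6-hour default matches the text.

RETENTION = 6 * 3600  # seconds

def delete(namespace, trash, path, now):
    trash["/trash" + path] = now       # remember when it entered /trash
    namespace.pop(path, None)

def sweep(trash, now):
    """Permanently drop entries that have outlived the retention window."""
    for p in [p for p, t in trash.items() if now - t > RETENTION]:
        del trash[p]

ns, trash = {"/a.txt": b"data"}, {}
delete(ns, trash, "/a.txt", now=0)
sweep(trash, now=5 * 3600)             # still restorable
still_there = "/trash/a.txt" in trash
sweep(trash, now=7 * 3600)             # past the 6-hour window
gone = "/trash/a.txt" not in trash
```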
Decrease Replication Factor
When the replication factor of a file is reduced, the NameNode
selects excess replicas that can be deleted. The next Heartbeat
transfers this information to the DataNode. The DataNode then
removes the corresponding blocks and the corresponding free
space appears in the cluster. Once again, there might be a time
delay between the completion of the setReplication API call and
the appearance of free space in the cluster.
Accessibility
HDFS can be accessed from applications in many different ways.
Natively, HDFS provides a Java API for applications to use. A C lan-
guage wrapper for this Java API is also available. In addition, an
HTTP browser can also be used to browse the files of an HDFS
instance. Work is in progress to expose HDFS through the Web-
DAV protocol.
FS Shell
HDFS allows user data to be organized in the form of files and directories. It provides a command-line interface called FS shell that lets
a user interact with the data in HDFS. The syntax of this command set is similar to other shells (e.g. bash, csh) that users are already
familiar with. Here are some sample action/command pairs:

Action                                                 Command
Create a directory named /foodir                       bin/hadoop dfs -mkdir /foodir
Remove a directory named /foodir                       bin/hadoop dfs -rmr /foodir
View the contents of a file named /foodir/myfile.txt   bin/hadoop dfs -cat /foodir/myfile.txt

FS shell is targeted for applications that need a scripting language to interact with the stored data.

DFSAdmin
The DFSAdmin command set is used for administering an HDFS cluster.
These are commands that are used only by an HDFS administrator. Here are some sample action/command pairs:

Action                                                 Command
Put the cluster in Safemode                            bin/hadoop dfsadmin -safemode enter
Generate a list of DataNodes                           bin/hadoop dfsadmin -report
Recommission or decommission DataNode(s)               bin/hadoop dfsadmin -refreshNodes
Browser Interface
A typical HDFS install configures a web server to expose the HDFS namespace through a configurable TCP port.
This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.
NVIDIA GPU Computing –The CUDA Architecture
The CUDA parallel computing platform is now widely deployed with 1000s of GPU-
accelerated applications and 1000s of published research papers, and a complete range
of CUDA tools and ecosystem solutions is available to developers.
High Throughput Computing | CAD | Big Data Analytics | Simulation | Aerospace | Automotive
What is GPU Accelerated Computing?
GPU-accelerated computing is the use of a graphics processing
unit (GPU) together with a CPU to accelerate scientific, analytics,
engineering, consumer, and enterprise applications. Pioneered
in 2007 by NVIDIA, GPU accelerators now power energy-efficient
datacenters in government labs, universities, enterprises, and
small-and-medium businesses around the world. GPUs are accel-
erating applications in platforms ranging from cars, to mobile
phones and tablets, to drones and robots.
How GPUs Accelerate Applications
GPU-accelerated computing offers unprecedented application
performance by offloading compute-intensive portions of the
application to the GPU, while the remainder of the code still runs
on the CPU. From a user’s perspective, applications simply run
significantly faster.
CPU versus GPU
A simple way to understand the difference between a CPU and
GPU is to compare how they process tasks. A CPU consists of a
few cores optimized for sequential serial processing while a GPU
has a massively parallel architecture consisting of thousands
of smaller, more efficient cores designed for handling multiple
tasks simultaneously.
“We are very proud to be one of the leading providers of Tesla sys-
tems who are able to combine the overwhelming power of NVIDIA
Tesla systems with the fully engineered and thoroughly tested
transtec hardware to a total Tesla-based solution.”
Bernd Zell | Senior HPC System Engineer
CPU: multiple cores – GPU: thousands of cores. GPUs have
thousands of cores to process parallel workloads efficiently.
Kepler GK110/210 GPU Computing Architecture
As the demand for high performance parallel computing increas-
es across many areas of science, medicine, engineering, and fi-
nance, NVIDIA continues to innovate and meet that demand with
extraordinarily powerful GPU computing architectures.
NVIDIA’s GPUs have already redefined and accelerated High Per-
formance Computing (HPC) capabilities in areas such as seismic
processing, biochemistry simulations, weather and climate mod-
eling, signal processing, computational finance, computer aided
engineering, computational fluid dynamics, and data analysis.
NVIDIA’s Kepler GK110/210 GPUs are designed to help solve the
world’s most difficult computing problems.
SMX Processing Core Architecture
Each of the Kepler GK110/210 SMX units features 192 single-preci-
sion CUDA cores, and each core has fully pipelined floating-point
and integer arithmetic logic units. Kepler retains the full IEEE
754-2008 compliant single and double-precision arithmetic in-
troduced in Fermi, including the fused multiply-add (FMA) oper-
ation. One of the design goals for the Kepler GK110/210 SMX was
to significantly increase the GPU’s delivered double precision
performance, since double precision arithmetic is at the heart of
many HPC applications. Kepler GK110/210’s SMX also retains the
special function units (SFUs) for fast approximate transcenden-
tal operations as in previous-generation GPUs, providing 8x the
number of SFUs of the Fermi GF110 SM.
Similar to GK104 SMX units, the cores within the new GK110/210
SMX units use the primary GPU clock rather than the 2x shader
clock. Recall the 2x shader clock was introduced in the G80 Tes-
la architecture GPU and used in all subsequent Tesla and Fermi
architecture GPUs.
Running execution units at a higher clock rate allows a chip
to achieve a given target throughput with fewer copies of the
execution units, which is essentially an area optimization,
but the clocking logic for the faster cores is more power-hungry.
For Kepler, the priority was performance per watt. While NVIDIA
made many optimizations that benefitted both area and power,
they chose to optimize for power even at the expense of some
added area cost, with a larger number of processing cores run-
ning at the lower, less power-hungry GPU clock.
Quad Warp Scheduler
The SMX schedules threads in groups of 32 parallel threads
called warps. Each SMX features four warp schedulers and eight
instruction dispatch units, allowing four warps to be issued and
executed concurrently.
Kepler’s quad warp scheduler selects four warps, and two inde-
pendent instructions per warp can be dispatched each cycle. Un-
like Fermi, which did not permit double precision instructions to
be paired with other instructions, Kepler GK110/210 allows dou-
ble precision instructions to be paired with other instructions.
New ISA Encoding: 255 Registers per Thread
The number of registers that can be accessed by a thread has
been quadrupled in GK110, allowing each thread access to up to
255 registers. Codes that exhibit high register pressure or spill-
ing behavior in Fermi may see substantial speedups as a result
of the increased available per-thread register count. A compel-
ling example can be seen in the QUDA library for performing lat-
tice QCD (quantum chromodynamics) calculations using CUDA.
QUDA fp64-based algorithms see performance increases up to
5.3x due to the ability to use many more registers per thread and
experiencing fewer spills to local memory.
GK210 further improves this, doubling the overall register file ca-
pacity per SMX as compared to GK110. In doing so, it allows appli-
cations to more readily make use of higher numbers of registers
per thread without sacrificing the number of threads that can
fit concurrently per SMX. For example, a CUDA kernel using 128
registers per thread on GK110 is limited to 512 out of a possible 2048
concurrent threads per SMX, limiting the available parallelism.
Each Kepler SMX contains 4 Warp Schedulers, each with dual Instruction Dispatch Units. A single Warp Scheduler Unit is shown above.
Each Kepler SMX contains 4 Warp Schedulers, each with dual Instruction Dispatch Units. A single Warp Scheduler Unit is shown above.
GK210 doubles the concurrency automatically in this case, which
can help to cover arithmetic and memory latencies, improving
overall efficiency.
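The arithmetic behind these occupancy figures is easy to check, assuming the per-SMX register-file sizes implied by the text (64K registers on GK110, doubled on GK210):

```python
# Occupancy arithmetic: with 128 registers per thread, the per-SMX
# register file caps how many threads can be resident at once. The
# register-file sizes are the capacities implied by the text's figures.

MAX_THREADS_PER_SMX = 2048

def resident_threads(regfile_size, regs_per_thread):
    return min(MAX_THREADS_PER_SMX, regfile_size // regs_per_thread)

gk110 = resident_threads(64 * 1024, 128)    # 512 of 2048 threads
gk210 = resident_threads(128 * 1024, 128)   # concurrency doubled
```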
Shuffle Instruction
To further improve performance, Kepler implements a new Shuf-
fle instruction, which allows threads within a warp to share data.
Previously, sharing data between threads within a warp required
separate store and load operations to pass the data through
shared memory. With the Shuffle instruction, threads within
a warp can read values from other threads in the warp in just
about any imaginable permutation.
Shuffle supports arbitrary indexed references – i.e. any thread
reads from any other thread. Useful shuffle subsets, including
next-thread (offset up or down by a fixed amount) and XOR “but-
terfly” style permutations among the threads in a warp, are also
available as CUDA intrinsics. Shuffle offers a performance advan-
tage over shared memory, in that a store-and-load operation is
carried out in a single step. Shuffle also can reduce the amount of
shared memory needed per thread block, since data exchanged
at the warp level never needs to be placed in shared memory. In
the case of FFT, which requires data sharing within a warp, a 6%
performance gain can be seen just by using Shuffle.
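The XOR "butterfly" pattern can be illustrated with a plain-Python simulation of a warp-wide sum reduction; each step models every lane reading its XOR partner's value, in the style of CUDA's __shfl_xor intrinsic:

```python
# Simulation of a warp-level butterfly (XOR) reduction built from shuffle
# reads: each "lane" reads the value held by its XOR partner and adds it,
# halving the exchange offset each step until every lane holds the full
# warp sum. This models __shfl_xor-style exchange in plain Python.

WARP = 32

def butterfly_reduce(values):
    vals = list(values)
    offset = WARP // 2
    while offset >= 1:
        # every lane reads lane ^ offset, then all lanes update together
        vals = [vals[lane] + vals[lane ^ offset] for lane in range(WARP)]
        offset //= 2
    return vals

lanes = butterfly_reduce(range(WARP))        # 0 + 1 + ... + 31 = 496
```

After log2(32) = 5 exchange steps, all 32 lanes hold the same total, with no shared-memory staging at all.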
Atomic Operations
Atomic memory operations are important in parallel pro-
gramming, allowing concurrent threads to correctly perform
read-modify-write operations on shared data structures. Atomic
operations such as add, min, max, and compare-and-swap are
atomic in the sense that the read, modify, and write operations
are performed without interruption by other threads.
Atomic memory operations are widely used for parallel sorting,
reduction operations, and building data structures in parallel
without locks that serialize thread execution. Throughput of
global memory atomic operations on Kepler GK110/210 is sub-
stantially improved compared to the Fermi generation. Atomic
operation throughput to a common global memory address is
improved by 9x to one operation per clock.
Atomic operation throughput to independent global address-
es is also significantly accelerated, and logic to handle address
conflicts has been made more efficient. Atomic operations can
often be processed at rates similar to global load operations.
This speed increase makes atomics fast enough to use frequent-
ly within kernel inner loops, eliminating the separate reduction
passes that were previously required by some algorithms to con-
solidate results.
This example shows some of the variations possible using the new Shuffle instruction in Kepler.
Kepler GK110 also expands the native support for 64-bit atomic
operations in global memory. In addition to atomicAdd, atomic-
CAS, and atomicExch (which were also supported by Fermi and
Kepler GK104), GK110 supports the following:
- atomicMin
- atomicMax
- atomicAnd
- atomicOr
- atomicXor
Other atomic operations which are not supported natively (for
example 64-bit floating point atomics) may be emulated using
the compare-and-swap (CAS) instruction.
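The CAS-emulation pattern looks like this in a Python sketch, with a lock standing in for the hardware compare-and-swap primitive; the retry loop is the essential idea:

```python
# Sketch of the CAS-emulation pattern the text describes: an atomic add
# on a type without native support is built from a compare-and-swap loop
# that retries until no other thread has intervened. Python's lock here
# stands in for the hardware CAS instruction.
import threading

class Cell:
    def __init__(self, value=0.0):
        self.value = value
        self._lock = threading.Lock()

    def compare_and_swap(self, expected, new):
        """Hardware-CAS stand-in: swap only if the value is unchanged."""
        with self._lock:
            if self.value == expected:
                self.value = new
                return True
            return False

def atomic_add(cell, delta):
    while True:                        # retry until the CAS succeeds
        old = cell.value
        if cell.compare_and_swap(old, old + delta):
            return old

acc = Cell(0.0)
threads = [threading.Thread(target=lambda: [atomic_add(acc, 0.5)
                                            for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```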
Texture Improvements
The GPU’s dedicated hardware Texture units are a valuable re-
source for compute programs with a need to sample or filter
image data. The texture throughput in Kepler is significantly in-
creased compared to Fermi – each SMX unit contains 16 texture
filtering units, a 4x increase vs the Fermi GF110 SM.
In addition, Kepler changes the way texture state is managed. In
the Fermi generation, for the GPU to reference a texture, it had
to be assigned a “slot” in a fixed-size binding table prior to grid
launch.
The number of slots in that table ultimately limits how many
unique textures a program can read from at run time. Ultimately,
a program was limited to accessing only 128 simultaneous tex-
tures in Fermi.
With bindless textures in Kepler, the additional step of using
slots isn’t necessary: texture state is now saved as an object in
memory and the hardware fetches these state objects on de-
mand, making binding tables obsolete.
This effectively eliminates any limits on the number of unique
textures that can be referenced by a compute program. Instead,
programs can map textures at any time and pass texture handles
around as they would any other pointer.
Dynamic Parallelism
In a hybrid CPU-GPU system, enabling a larger amount of paral-
lel code in an application to run efficiently and entirely within
the GPU improves scalability and performance as GPUs increase
in perf/watt. To accelerate these additional parallel portions of
the application, GPUs must support more varied types of parallel
workloads. Dynamic Parallelism is introduced with Kepler GK110
and also included in GK210. It allows the GPU to generate new
work for itself, synchronize on results, and control the schedul-
ing of that work via dedicated, accelerated hardware paths, all
without involving the CPU.
Fermi was very good at processing large parallel data structures
when the scale and parameters of the problem were known at
kernel launch time. All work was launched from the host CPU,
would run to completion, and return a result back to the CPU.
The result would then be used as part of the final solution, or
would be analyzed by the CPU which would then send additional
requests back to the GPU for additional processing.
In Kepler GK110/210 any kernel can launch another kernel, and
can create the necessary streams, events and manage the de-
pendencies needed to process additional work without the need
for host CPU interaction. This architectural innovation makes it
easier for developers to create and optimize recursive and data-
dependent execution patterns, and allows more of a program to
be run directly on GPU. The system CPU can then be freed up for
additional tasks, or the system could be configured with a less
powerful CPU to carry out the same workload.
Dynamic Parallelism allows more varieties of parallel algorithms
to be implemented on the GPU, including nested loops with
differing amounts of parallelism, parallel teams of serial con-
trol-task threads, or simple serial control code offloaded to the
GPU in order to promote data-locality with the parallel portion
of the application. Because a kernel has the ability to launch ad-
ditional workloads based on intermediate, on-GPU results, pro-
grammers can now intelligently load-balance work to focus the
bulk of their resources on the areas of the problem that either
require the most processing power or are most relevant to the
solution. One example would be dynamically setting up a grid
for a numerical simulation – typically grid cells are focused in re-
gions of greatest change, requiring an expensive pre-processing
pass through the data.
Alternatively, a uniformly coarse grid could be used to prevent
wasted GPU resources, or a uniformly fine grid could be used to
ensure all the features are captured, but these options risk miss-
ing simulation features or “over-spending” compute resources
on regions of less interest. With Dynamic Parallelism, the grid
resolution can be determined dynamically at runtime in a da-
ta-dependent manner. Starting with a coarse grid, the simulation
can “zoom in” on areas of interest while avoiding unnecessary
calculation in areas with little change. Though this could be
accomplished using a sequence of CPU-launched kernels, it
would be far simpler to allow the GPU to refine the grid itself by
Dynamic Parallelism allows more parallel code in an application to be launched directly by the GPU onto itself (right side of image) rather than requiring CPU intervention (left side of image).
analyzing the data and launching additional work as part of a
single simulation kernel, eliminating interruption of the CPU and
data transfers between the CPU and GPU.
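The data-dependent refinement idea can be sketched as a recursive subdivision in which each "kernel" decides whether to launch finer work over its cell; the variance test, recursion depth, and 1-D grid are illustrative assumptions:

```python
# Sketch of the data-dependent refinement Dynamic Parallelism enables:
# a "kernel" examines its cell and launches finer-grained child work only
# where the data varies enough to warrant it, instead of using a
# uniformly fine grid. The threshold and depth limit are illustrative.

def refine(cells, field, depth=0, max_depth=3, threshold=1.0):
    """Recursively split cells whose local variation exceeds threshold."""
    out = []
    for lo, hi in cells:
        if depth < max_depth and abs(field(hi) - field(lo)) > threshold:
            mid = (lo + hi) / 2            # child "kernel launches"
            out += refine([(lo, mid), (mid, hi)], field,
                          depth + 1, max_depth, threshold)
        else:
            out.append((lo, hi))
    return out

# A field that changes sharply near x = 0.5 and is flat elsewhere.
field = lambda x: 10.0 if x > 0.5 else 0.0
grid = refine([(0.0, 1.0)], field)
```

Starting from a single coarse cell, the grid ends up fine only around the discontinuity, mirroring the "zoom in on areas of interest" behavior described above.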
Hyper-Q
One of the challenges in the past has been keeping the GPU
supplied with an optimally scheduled load of work from multi-
ple streams. The Fermi architecture supported 16-way concur-
rency of kernel launches from separate streams, but ultimately
the streams were all multiplexed into the same hardware work
queue. This allowed for false intra-stream dependencies, requir-
ing dependent kernels within one stream to complete before ad-
ditional kernels in a separate stream could be executed. While
this could be alleviated to some extent through the use of a
breadth-first launch order, as program complexity increases, this
can become more and more difficult to manage efficiently.
Kepler GK110/210 improve on this functionality with their
Hyper-Q feature. Hyper-Q increases the total number of connec-
tions (work queues) between the host and the CUDA Work Dis-
tributor (CWD) logic in the GPU by allowing 32 simultaneous,
hardware-managed connections (compared to the single con-
nection available with Fermi). Hyper-Q is a flexible solution that
allows connections from multiple CUDA streams, from multiple
Message Passing Interface (MPI) processes, or even from multiple
threads within a process. Applications that previously encoun-
tered false serialization across tasks, thereby limiting GPU utiliza-
tion, can see up to a 32x performance increase without changing
any existing code. Each CUDA stream is managed within its own
hardware work queue, inter-stream dependencies are optimized,
and operations in one stream will no longer block other streams,
enabling streams to execute concurrently without needing to
specifically tailor the launch order to eliminate possible false de-
pendencies.
Hyper-Q offers significant benefits for use in MPI-based parallel
computer systems. Legacy MPI-based algorithms were often cre-
ated to run on multi-core CPU systems, with the amount of work
assigned to each MPI process scaled accordingly. This can lead to
a single MPI process having insufficient work to fully occupy the
GPU. While it has always been possible for multiple MPI process-
es to share a GPU, these processes could become bottlenecked
by false dependencies. Hyper-Q removes those false dependen-
cies, dramatically increasing the efficiency of GPU sharing across
MPI processes.
Grid Management Unit – Efficiently Keeping the GPU Utilized
New features introduced with Kepler GK110, such as the ability
for CUDA kernels to launch work directly on the GPU with Dynam-
ic Parallelism, required that the CPU-to-GPU workflow in Kepler
offer increased functionality over the Fermi design. On Fermi, a
grid of thread blocks would be launched by the CPU and would
always run to completion, creating a simple unidirectional flow
of work from the host to the SMs via the CUDA Work Distributor
(CWD) unit. Kepler GK110/210 improve the CPU-to-GPU workflow
by allowing the GPU to efficiently manage both CPU and CUDA
created workloads.
NVIDIA discussed the ability of the Kepler GK110 GPU to allow ker-
nels to launch work directly on the GPU, and it’s important to
understand the changes made in the Kepler GK110 architecture
to facilitate these new functions. In Kepler GK110/210, a grid can
be launched from the CPU just as was the case with Fermi, how-
ever new grids can also be created programmatically by CUDA
within the Kepler SMX unit. To manage both CUDA-created and
host-originated grids, a new Grid Management Unit (GMU) was
introduced in Kepler GK110. This control unit manages and pri-
oritizes grids that are passed into the CWD to be sent to the SMX
units for execution.
The CWD in Kepler holds grids that are ready to dispatch, and
it is able to dispatch 32 active grids, which is double the capac-
ity of the Fermi CWD. The Kepler CWD communicates with the
GMU via a bi-directional link that allows the GMU to pause the
dispatch of new grids and to hold pending and suspended grids
until needed. The GMU also has a direct connection to the Kepler
SMX units to permit grids that launch additional work on the
GPU via Dynamic Parallelism to send the new work back to GMU
to be prioritized and dispatched. If the kernel that dispatched the
additional workload pauses, the GMU will hold it inactive until
the dependent work has completed.
The redesigned Kepler host-to-GPU workflow shows the new Grid Management Unit, which allows it to manage the actively dispatching grids, pause dispatch, and hold pending and suspended grids.
transtec has striven to develop well-engineered GPU Com-
puting solutions from the very beginning of the Tesla era. From
High-Performance GPU Workstations to rack-mounted Tesla
server solutions, transtec has a broad range of specially de-
signed systems available.
As an NVIDIA Tesla Preferred Provider (TPP), transtec is able to
provide customers with the latest NVIDIA GPU technology as
well as fully-engineered hybrid systems and Tesla Preconfig-
ured Clusters. Thus, customers can be assured that transtec’s
extensive experience in HPC cluster solutions is seamlessly brought
into the GPU computing world. Performance Engineering made
by transtec.
NVIDIA GPUDirect
When working with a large amount of data, increasing the data
throughput and reducing latency is vital to increasing compute
performance. Kepler GK110/210 support the RDMA feature in
NVIDIA GPUDirect, which is designed to improve performance
by allowing direct access to GPU memory by third-party devices
such as IB adapters, NICs, and SSDs.
When using CUDA 5.0 or later, GPUDirect provides the following
important features:
- Direct memory access (DMA) between NIC and GPU without
the need for CPU-side data buffering.
- Significantly improved MPISend/MPIRecv efficiency between
the GPU and other nodes in a network.
- Eliminates CPU bandwidth and latency bottlenecks.
- Works with a variety of 3rd-party network, capture and
storage devices.
Applications like reverse time migration (used in seismic imaging
for oil & gas exploration) distribute the large imaging data across
several GPUs. Hundreds of GPUs must collaborate to crunch the
data, often communicating intermediate results. GPUDirect en-
ables much higher aggregate bandwidth for this GPU-to-GPU
communication scenario within a server and across servers with
the P2P and RDMA features. Kepler GK110 also supports other
GPUDirect features such as Peer-to-Peer and GPUDirect for Video.
GPUDirect RDMA allows direct access to GPU memory from 3rd-party devices such as network adapters, which translates into direct transfers between GPUs across nodes as well.
Kepler Memory Subsystem – L1, L2, ECC
Kepler's memory hierarchy is organized similarly to Fermi. The
Kepler architecture supports a unified memory request path for
loads and stores, with an L1 cache per SMX multiprocessor. Ke-
pler GK110 also enables compiler-directed use of an additional
new cache for read-only data, as described below.
Configurable Shared Memory and L1 Cache
In the Kepler GK110 architecture, as in the previous generation
Fermi architecture, each SMX has 64 KB of on-chip memory that
can be configured as 48 KB of Shared memory with 16 KB of L1
cache, or as 16 KB of shared memory with 48 KB of L1 cache.
Kepler now allows for additional flexibility in configuring the
allocation of shared memory and L1 cache by permitting a 32 KB/
32 KB split between shared memory and L1 cache.
Configurable Shared Memory and L1 Cache
To support the increased throughput of each SMX unit, the
shared memory bandwidth for 64-bit and larger load operations
is also doubled compared to the Fermi SM, to 256 bytes per core
clock. For the GK210 architecture, the total amount of configu-
rable memory is doubled to 128 KB, allowing a maximum of 112
KB shared memory and 16 KB of L1 cache. Other possible memory
configurations are 32 KB L1 cache with 96 KB shared memory, or
48 KB L1 cache with 80 KB of shared memory.
This increase allows a similar improvement in concurrency of
threads as is enabled by the register file capacity improvement
described above.
48 KB Read-Only Data Cache
In addition to the L1 cache, Kepler introduces a 48 KB cache for
data that is known to be read-only for the duration of the func-
tion. In the Fermi generation, this cache was accessible only by
the Texture unit. Expert programmers often found it advanta-
geous to load data through this path explicitly by mapping their
data as textures, but this approach had many limitations.
In Kepler, in addition to significantly increasing the capacity of
this cache along with the texture horsepower increase, NVIDIA de-
cided to make the cache directly accessible to the SM for general
load operations. Use of the read-only path is beneficial because
it takes both load and working set footprint off of the Shared/L1
cache path. In addition, the Read-Only Data Cache’s higher tag
bandwidth supports full speed unaligned memory access pat-
terns among other scenarios.
Use of the read-only path can be managed automatically by the
compiler or explicitly by the programmer. Access to any variable
or data structure that is known to be constant through program-
mer use of the C99 standard “const __restrict” keyword may be
tagged by the compiler to be loaded through the Read-Only Data
Cache. The programmer can also explicitly use this path with the
__ldg() intrinsic.
NVIDIA, GeForce, Tesla, CUDA, PhysX, GigaThread, NVIDIA
Parallel Data Cache and certain other trademarks and logos
appearing in this brochure, are trademarks or registered
trademarks of NVIDIA Corporation.
Improved L2 Cache
The Kepler GK110/210 GPUs feature 1536 KB of dedicated L2
cache memory, double the amount of L2 available in the Fermi
architecture. The L2 cache is the primary point of data unifica-
tion between the SMX units, servicing all load, store, and texture
requests and providing efficient, high speed data sharing across
the GPU.
The L2 cache on Kepler offers up to 2x the bandwidth per clock
available in Fermi. Algorithms for which data addresses are not
known beforehand, such as physics solvers, ray tracing, and
sparse matrix multiplication especially benefit from the cache
hierarchy. Filter and convolution kernels that require multiple
SMs to read the same data also benefit.
Memory Protection Support
Like Fermi, Kepler's register files, shared memories, L1 cache, L2
cache and DRAM memory are protected by a Single-Error Correct
Double-Error Detect (SECDED) ECC code. In addition, the Read-
Only Data Cache supports single-error correction through a par-
ity check; in the event of a parity error, the cache unit automati-
cally invalidates the failed line, forcing a read of the correct data
from L2.
ECC checkbit fetches from DRAM necessarily consume some
amount of DRAM bandwidth, which results in a performance
difference between ECC-enabled and ECC-disabled operation,
especially on memory bandwidth-sensitive applications. Kepler
GK110 implements several optimizations to ECC checkbit fetch
handling based on Fermi experience. As a result, the ECC on-vs-
off performance delta has been reduced by an average of 66%,
as measured across NVIDIA's internal compute application test suite.
CUDA
CUDA is a combination hardware/software platform that
enables NVIDIA GPUs to execute programs written with C,
C++, Fortran, and other languages. A CUDA program invokes
parallel functions called kernels that execute across many
parallel threads. The programmer or compiler organizes
these threads into thread blocks and grids of thread blocks.
Each thread within a thread block executes an instance of
the kernel. Each thread also has thread and block IDs within
its thread block and grid, a program counter, registers, per-
thread private memory, inputs, and output results.
A thread block is a set of concurrently executing threads that
can cooperate among themselves through barrier synchro-
nization and shared memory. A thread block has a block ID
within its grid. A grid is an array of thread blocks that execute
the same kernel, read inputs from global memory, write re-
sults to global memory, and synchronize between dependent
kernel calls. In the CUDA parallel programming model, each
thread has a per-thread private memory space used for regis-
ter spills, function calls, and C automatic array variables. Each
thread block has a per-block shared memory space used for
inter-thread communication, data sharing, and result sharing
in parallel algorithms. Grids of thread blocks share results in
Global Memory space after kernel-wide global synchroniza-
tion.
CUDA Hardware Execution
CUDA's hierarchy of threads maps to a hierarchy of proces-
sors on the GPU; a GPU executes one or more kernel grids; a
streaming multiprocessor (SM on Fermi / SMX on Kepler) ex-
ecutes one or more thread blocks; and CUDA cores and other
execution units in the SMX execute thread instructions. The SMX
executes threads in groups of 32 threads called warps.
While programmers can generally ignore warp execution for
functional correctness and focus on programming individual
scalar threads, they can greatly improve performance by having
threads in a warp execute the same code path and access memory
with nearby addresses.
CUDA Hierarchy of threads, blocks, and grids, with corresponding per-thread private, per-block shared, and per-application global memory spaces.
CUDA Toolkit
The NVIDIA CUDA Toolkit provides a comprehensive development
environment for C and C++ developers building GPU-accelerated
applications. The CUDA Toolkit includes a compiler for NVIDIA
GPUs, math libraries, and tools for debugging and optimizing the
performance of your applications. You’ll also find programming
guides, user manuals, API reference, and other documentation to
help you get started quickly accelerating your application with
GPUs.
Dramatically simplify parallel programming
64-bit ARM Support
› Develop or recompile your applications to run on 64-bit ARM
systems with NVIDIA GPUs
Unified Memory
› Simplifies programming by enabling applications to access
CPU and GPU memory without the need to manually copy data
Drop-in Libraries and cuFFT callbacks
› Automatically accelerate applications' BLAS and FFTW calcu-
lations by up to 8x by simply replacing the existing CPU librar-
ies with the GPU-accelerated equivalents
› Use cuFFT callbacks for higher-performance custom process-
ing on input or output data
Multi-GPU scaling
› cublasXT – a BLAS GPU library that automatically scales
performance across up to eight GPUs in a single node, deliver-
ing over nine teraflops of double-precision performance per
node and supporting larger workloads than ever before (up
to 512 GB). The re-designed FFT GPU library scales up to 2 GPUs
in a single node, allowing larger transform sizes and higher
throughput.
Improved Tools Support
› Support for Microsoft Visual Studio 2013
› Improved debugging for CUDA Fortran applications
› Replay feature in Visual Profiler and nvprof
› nvprune utility to optimize the size of object files
The new Tesla K80 GPU Accelerator based on the GK210 GPU
A dual-GPU board that combines 24 GB of memory with blazing-fast
memory bandwidth and up to 2.91 Tflops of double-precision per-
formance with NVIDIA GPU Boost, the Tesla K80 GPU is designed
for the most demanding computational tasks. It is ideal for single-
and double-precision workloads that not only require leading
compute performance but also demand high data throughput.
Intel Xeon Phi Coprocessor – The Architecture
Intel Many Integrated Core (Intel MIC) architecture combines many Intel CPU cores onto
a single chip. Intel MIC architecture is targeted for highly parallel, High Performance
Computing (HPC) workloads in a variety of fields such as computational physics, chem-
istry, biology, and financial services. Today such workloads are run as task parallel pro-
grams on large compute clusters.
Engineering | Life Sciences | Automotive | Price Modelling | Aerospace | CAE | Data Analytics
What is the Intel Xeon Phi coprocessor?
Intel Xeon Phi coprocessors are PCI Express form factor add-in
cards that work synergistically with Intel Xeon processors to
enable dramatic performance gains for highly parallel code –
up to 1.2 double-precision teraFLOPS (floating point operations
per second) per coprocessor. Manufactured using Intel’s indus-
try-leading 22 nm technology with 3-D Tri-Gate transistors, each
coprocessor features more cores, more threads, and wider vector
execution units than an Intel Xeon processor. The high degree of
parallelism compensates for the lower speed of each core to de-
liver higher aggregate performance for highly parallel workloads.
What applications can benefit from the Intel Xeon Phi
coprocessor?
While a majority of applications (80 to 90 percent) will contin-
ue to achieve maximum performance on Intel Xeon processors,
certain highly parallel applications will benefit dramatically by
using Intel Xeon Phi coprocessors. To take full advantage of Intel
Xeon Phi coprocessors, an application must scale well to over 100
software threads and either make extensive use of vectors or ef-
ficiently use more local memory bandwidth than is available on
an Intel Xeon processor. Examples of segments with highly paral-
lel applications include: animation, energy, finance, life sciences,
manufacturing, medical, public sector, weather, and more.
When do I use Intel Xeon processors and Intel Xeon Phi
coprocessors?
Think "reuse" rather than "recode"
Since languages, tools, and applications are compatible with
both Intel Xeon processors and coprocessors, you can now think
"reuse" rather than "recode".
Single programming model for all your code
The Intel Xeon Phi coprocessor gives developers a hardware de-
sign point optimized for extreme parallelism, without requiring
them to re-architect or rewrite their code. No need to rethink the
entire problem or learn a new programming model; simply rec-
ompile and optimize existing code using familiar tools, libraries,
and runtimes.
Performance multiplier
By maintaining a single source code between Intel Xeon proces-
sors and Intel Xeon Phi coprocessors, developers optimize once
for parallelism but maximize performance on both processor
and coprocessor.
Execution flexibility
Intel Xeon Phi is designed from the ground up for high-perfor-
mance computing. Unlike a GPU, a coprocessor can host an oper-
ating system, be fully IP addressable, and support standards such
as Message Passing Interface (MPI). Additionally, it can operate in
multiple execution modes:
› "Symmetric" mode: Workload tasks are shared between the
host processor and coprocessor
› "Native" mode: The workload resides entirely on the coprocessor,
which essentially acts as a separate compute node
› "Offload" mode: The workload resides on the host processor, and
parts of the workload are sent out to the coprocessor as needed
Multiple Intel Xeon Phi coprocessors can be installed in a sin-
gle host system. Within a single system, the coprocessors can
communicate with each other through the PCIe peer-to-peer
interconnect without any intervention from the host. Similarly,
A Family of Coprocessors for Diverse Needs
Intel Xeon Phi coprocessors provide up to 61 cores, 244
threads, and 1.2 teraflops of performance, and they come
in a variety of configurations to address diverse hardware,
software, workload, performance, and efficiency require-
ments. They also come in a variety of form factors, including
a standard PCIe x16 form factor (with active, passive, or no
thermal solution), and a dense form factor that offers addi-
tional design flexibility.
› The Intel Xeon Phi Coprocessor 3100 family provides out-
standing parallel performance. It is an excellent choice for
compute-bound workloads, such as Monte Carlo, Black-
Scholes, HPL, LifeSc, and many others. Active and passive
cooling options provide flexible support for a variety of
server and workstation systems.
› The Intel Xeon Phi Coprocessor 5100 family is opti-
mized for high-density computing and is well-suited for
workloads that are memory-bandwidth bound, such as
STREAM, memory-capacity bound, such as ray-tracing, or
both, such as reverse time migration (RTM). These copro-
cessors are passively cooled and have the lowest thermal
design power (TDP) of the Intel Xeon Phi product family.
› The Intel Xeon Phi Coprocessor 7100 family provides the
most features and the highest performance and memory
capacity of the Intel Xeon Phi product family. This family
supports Intel Turbo Boost Technology 1.0, which increas-
es core frequencies during peak workloads when thermal
conditions allow. Passive and no thermal solution options
enable powerful and innovative computing solutions.
the coprocessors can also communicate through a network card
such as InfiniBand or Ethernet, without any intervention from
the host.
The Intel Xeon Phi coprocessor is primarily composed of process-
ing cores, caches, memory controllers, PCIe client logic, and a very
high bandwidth, bidirectional ring interconnect. Each core comes
complete with a private L2 cache that is kept fully coherent by a
global-distributed tag directory. The memory controllers and the
PCIe client logic provide a direct interface to the GDDR5 mem-
ory on the coprocessor and the PCIe bus, respectively. All these
components are connected together by the ring interconnect.
The first generation Intel Xeon Phi product codenamed “Knights Corner”
Microarchitecture
Each core in the Intel Xeon Phi coprocessor is designed to be pow-
er efficient while providing a high throughput for highly parallel
workloads. A closer look reveals that the core uses a short in-or-
der pipeline and is capable of supporting 4 threads in hardware.
It is estimated that the cost to support IA architecture legacy is a
mere 2% of the area costs of the core and is even less at a full chip
or product level. Thus the cost of bringing the Intel Architecture
legacy capability to the market is very marginal.
Vector Processing Unit
An important component of the Intel Xeon Phi coprocessor's
core is its vector processing unit (VPU). The VPU features a novel
512-bit SIMD instruction set, officially known as Intel Initial Many
Core Instructions (Intel IMCI). Thus, the VPU can execute 16 sin-
gle-precision (SP) or 8 double-precision (DP) operations per cycle.
The VPU also supports Fused Multiply-Add (FMA) instructions and
hence can execute 32 SP or 16 DP floating point operations per
cycle. It also provides support for integers.
Vector units are very power efficient for HPC workloads. A single
operation can encode a great deal of work and does not incur en-
ergy costs associated with fetching, decoding, and retiring many
instructions. However, several improvements were required to
support such wide SIMD instructions. For example, a mask register
was added to the VPU to allow per lane predicated execution. This
helped in vectorizing short conditional branches, thereby im-
proving the overall software pipelining efficiency. The VPU also
supports gather and scatter instructions, which are simply non-
unit stride vector memory accesses, directly in hardware. Thus
for codes with sporadic or irregular access patterns, vector scat-
ter and gather instructions help in keeping the code vectorized.
The VPU also features an Extended Math Unit (EMU) that can ex-
ecute transcendental operations such as reciprocal, square root,
and log, thereby allowing these operations to be executed in a
vector fashion with high bandwidth. The EMU operates by calcu-
lating polynomial approximations of these functions.
The Interconnect
The interconnect is implemented as a bidirectional ring. Each
direction comprises three independent rings. The first,
largest, and most expensive of these is the data block ring. The
data block ring is 64 bytes wide to support the high bandwidth
requirement due to the large number of cores. The address ring
is much smaller and is used to send read/write commands and
memory addresses. Finally, the smallest ring and the least expen-
sive ring is the acknowledgement ring, which sends flow control
and coherence messages.
Intel Xeon Phi Coprocessor Core
Vector Processing Unit
When a core accesses its L2 cache and misses, an address request
is sent on the address ring to the tag directories. The memory ad-
dresses are uniformly distributed amongst the tag directories
on the ring to provide a smooth traffic characteristic on the ring.
If the requested data block is found in another core’s L2 cache,
a forwarding request is sent to that core’s L2 over the address
ring and the request block is subsequently forwarded on the
data block ring. If the requested data is not found in any caches,
a memory address is sent from the tag directory to the memory
controller.
The figure below shows the distribution of the memory con-
trollers on the bidirectional ring. The memory controllers are
symmetrically interleaved around the ring. There is an all-to-all
mapping from the tag directories to the memory controllers. The
addresses are evenly distributed across the memory controllers,
The Interconnect
Distributed Tag Directories
thereby eliminating hotspots and providing a uniform access
pattern which is essential for a good bandwidth response.
During a memory access, whenever an L2 cache miss occurs on a
core, the core generates an address request on the address ring
and queries the tag directories. If the data is not found in the tag
directories, the core generates another address request and que-
ries the memory for the data. Once the memory controller fetch-
es the data block from memory, it is returned to the core
over the data ring. Thus during this process, one data block, two
address requests (and by protocol, two acknowledgement mes-
sages) are transmitted over the rings. Since the data block rings
are the most expensive and are designed to support the required
data bandwidth, it is necessary to increase the number of less ex-
pensive address and acknowledgement rings by a factor of two
to match the increased bandwidth requirement caused by the
higher number of requests on these rings.
Other Design Features
Other micro-architectural optimizations incorporated into the
Intel Xeon Phi coprocessor include a 64-entry second-level Trans-
lation Lookaside Buffer (TLB), simultaneous data cache loads and
stores, and 512 KB L2 caches. Lastly, the Intel Xeon Phi coproc-
essor implements a 16-stream hardware prefetcher to improve
cache hit rates and provide higher bandwidth. The figure below
shows the net performance improvements on the SPECfp 2006
benchmark suite for single-core, single-thread runs. The results
indicate an average improvement of over 80% per cycle, not in-
cluding frequency effects.
Per-Core ST Performance Improvement (per cycle)
Interleaved Memory Access
Interconnect: 2x AD/AK
Caches
The Intel MIC architecture invests more heavily in L1 and L2 cach-
es compared to GPU architectures. The Intel Xeon Phi coproces-
sor implements a leading-edge, very high bandwidth memory
subsystem. Each core is equipped with a 32KB L1 instruction
cache and 32KB L1 data cache and a 512KB unified L2 cache.
These caches are fully coherent and implement the x86 memory
order model. The L1 and L2 caches provide an aggregate band-
width that is approximately 15 and 7 times, respectively, faster
compared to the aggregate memory bandwidth. Hence, effective
utilization of the caches is key to achieving peak performance on
the Intel Xeon Phi coprocessor. In addition to improving band-
width, the caches are also more energy efficient for supplying
data to the cores than memory. The figure below shows the ener-
gy consumed per byte of data transferred from the memory, and
L1 and L2 caches. In the exascale compute era, caches will play a
crucial role in achieving real performance under restrictive pow-
er constraints.
Stencils
Stencils are common in physics simulations and are classic exam-
ples of workloads which show a large performance gain through
efficient use of caches.
Caches – For or Against?
Stencils are typically employed in simulation of physical sys-
tems to study the behavior of the system over time. When these
workloads are not programmed to be cache-blocked, they will be
bound by memory bandwidth. Cache blocking promises substan-
tial performance gains given the increased bandwidth and ener-
gy efficiency of the caches compared to memory. Cache blocking
improves performance by blocking the physical structure or the
physical system such that the blocked data fits well into a core’s
L1 and/or L2 caches. For example, during each time-step, the
same core can process the data which is already resident in the
L2 cache from the last time step, and hence does not need to be
fetched from the memory, thereby improving performance. Addi-
tionally, the cache coherence further aids the stencil operation
by automatically fetching the updated data from the nearest
neighboring blocks which are resident in the L2 caches of other
cores. Thus, stencils clearly demonstrate the benefits of efficient
cache utilization and coherence in HPC workloads.
Power Management
Intel Xeon Phi coprocessors are not suitable for all workloads.
In some cases, it is beneficial to run the workloads only on the
host. In such situations, where the coprocessor is not being used,
it is necessary to put the coprocessor into a power-saving mode.
To conserve power, as soon as all four threads on a core are
halted, the clock to the core is gated.
Once the clock has been gated for some programmable time, the
core power gates itself, thereby eliminating any leakage. At any
point, any number of the cores can be powered down or powered
up as shown in the figure below. Additionally, when all the cores
are power gated and the uncore detects no activity, the tag direc-
tories, the interconnect, L2 caches and the memory controllers
are clock gated.
Clock Gate Core
Power Gate Core
Stencils Example
At this point, the host driver can put the coprocessor into a deep-
er sleep or an idle state, wherein the entire uncore is power gated,
the GDDR is put into self-refresh mode, the PCIe logic is put into
a wait state for a wakeup, and the GDDR I/O consumes very lit-
tle power. These power management techniques help conserve
power and make Intel Xeon Phi coprocessor an excellent candi-
date for data centers.
Summary
The Intel Xeon Phi coprocessor provides high performance and
performance per watt for highly parallel HPC workloads, while
not requiring a new programming model, API, language, or restric-
tive memory model. It is able to do this with an array of general-
purpose cores with multiple thread contexts, wide vector units,
caches, and a high-bandwidth on-die and memory interconnect.
Knights Corner is the first Intel Xeon Phi product in the MIC ar-
chitecture family of processors from Intel, aimed at enabling the
exascale era of computing.
Package Auto C3
Intel Xeon Phi Coprocessor
An Outlook on Knights Landing
and Omni-Path
Knights Landing
Knights Landing is the codename for Intel's 2nd generation In-
tel Xeon Phi Product Family, which will deliver massive thread
parallelism, data parallelism and memory bandwidth – with
improved single-thread performance and Intel Xeon processor
binary-compatibility in a standard CPU form factor.
Additionally, Knights Landing will offer integrated Intel Om-
ni-Path fabric technology, and also be available in the tradition-
al PCIe coprocessor form factor.
Performance
3+ TeraFLOPS of double-precision peak theoretical performance
per single socket node.
Power Management: All On and Running
Integration
Intel Omni Scale fabric integration
High-performance on-package memory (MCDRAM)
› Over 5x STREAM vs. DDR4
› Over 400 GB/s
› Up to 16 GB at launch
› NUMA support
› Over 5x Energy Efficiency vs. GDDR5
› Over 3x Density vs. GDDR5
› In partnership with Micron Technology
› Flexible memory modes including cache and flat
Server Processor
› Standalone bootable processor (running host OS) and a PCIe
coprocessor (PCIe end-point device)
› Platform memory capacity comparable to Intel Xeon
Processors
› Reliability ("Intel server-class reliability")
› Power Efficiency (over 25% better than discrete coprocessor)
with over 10 GF/W
› Density (3+ KNL with fabric in 1U)
Microarchitecture
› Based on Intel's 14 nanometer manufacturing technology
› Binary compatible with Intel Xeon Processors
› Support for Intel Advanced Vector Extensions 512 (Intel AVX-512)
› 3x Single-Thread Performance compared to Knights Corner
› Cache-coherency
› 60+ cores in a 2D Mesh architecture
› 4 Threads / Core
› Deep Out-of-Order Buffers
› Gather/scatter in hardware
› Advanced Branch Prediction
› High cache bandwidth
› Most of today's parallel optimizations carry forward to KNL
› Multiple NUMA domain support per socket
Roadmap
› Knights Hill is the codename for the 3rd generation of the
Intel Xeon Phi product family
› Based on Intel's 10 nanometer manufacturing technology
› Integrated 2nd generation Intel Omni-Path Fabric
Omni-Path – the Next-Generation Fabric
Intel Omni-Path Architecture is designed to deliver the perfor-
mance for tomorrow’s high performance computing (HPC) work-
loads and the ability to scale to tens – and eventually hundreds
– of thousands of nodes at a cost-competitive price to today’s
fabrics.
This next-generation fabric from Intel builds on the Intel True
Scale Fabric with an end-to-end solution, including PCIe adapters,
silicon, switches, cables, and management software. Customers
deploying Intel True Scale Fabric today will be able to migrate to
Intel Omni-Path Architecture through an Intel upgrade program.
In the near future, Intel will integrate the Intel Omni-Path Host
Fabric Interface onto future generations of Intel Xeon processors
and Intel Xeon Phi processors to address several key challenges
of HPC computing centers:
› Performance: Processor capacity and memory bandwidth are
scaling faster than system I/O. Intel Omni-Path Architecture
will accelerate message passing interface (MPI) rates for to-
morrow's HPC.
› Cost and density: More components on a server limit density
and increase fabric cost. An integrated fabric controller helps
eliminate the additional costs and required space of discrete
cards, enabling higher server density.
› Reliability and power: Discrete interface cards consume
many watts of power. Integrated onto the processor, the In-
tel Omni-Path Host Fabric Interface will draw less power with
fewer discrete components.
For software applications, Intel Omni-Path Architecture will
maintain consistency and compatibility with existing Intel True
Scale Fabric and InfiniBand APIs by working through the open
source OpenFabrics Alliance (OFA) software stack. The OFA soft-
ware stack works with leading Linux distribution releases.
Intel is providing a clear path to addressing the issues of tomor-
row’s HPC with Intel Omni-Path Architecture.
Vector Supercomputing – The NEC SX Architecture
Initially, in the 80s and 90s, the notion of supercomputing was almost a synonym for
vector computing. That concept is now back in different flavors: almost every
architecture uses SIMD, the acronym for "single instruction, multiple data".
Life Sciences | Risk Analysis | Simulation | Big Data Analytics | High Performance Computing
History of HPC, and what we learn
In recent years, systems based on the x86 architecture domi-
nated the HPC market, with GPUs and many-core systems re-
cently making their inroads. But this period is coming to an end,
and it is again time for a differentiated complementary HPC-
targeted system. NEC will develop such a system.
The Early Days
When did HPC really start? One would probably think of the
time when Seymour Cray developed high-end machines at CDC
and then started his own company Cray Research, with the first
product being the Cray-1 in 1976. At that time supercomputing was
very special, systems were expensive, and access was extremely
limited for researchers and students. But the business case
was obvious in many fields, scientifically as well as commercially.
These Cray machines were vector-computers, initially featuring
only one CPU, as parallelization was far from mainstream,
hardly even a matter of research. And code development was
easier than today: a code of 20,000 lines was already considered
"complex" at that time, and there were not many standard
codes and algorithms available anyway. For this reason, code de-
velopment could adapt to any kind of architecture to achieve op-
timal performance.
In 1981 NEC started marketing its first vector supercomputer, the
SX-2, a single-CPU vector architecture like the Cray-1. The SX-2
was a milestone: it was the first CPU to exceed one GFlops of peak
performance.
The Attack of the “Killer-Micros”
In the 1990s the situation changed drastically, and what we see
today is a direct consequence. The “killer-micros” took over, new
parallel, later “massively parallel” systems (MPP) were built based
on micro-processor architectures which were predominantly
Vector Supercomputing
The NEC SX Architecture
141
targeted at commercial sectors or, in the case of Intel, AMD, or
Motorola, at PCs – products which successfully invaded
daily life, partly even more than experts had anticipated.
Clearly one could save a lot of cost by using, or in the worst case
slightly adapting, these CPUs for high-end systems, and by fo-
cusing investments on other ingredients, like interconnect tech-
nology. At the same time Linux was increasingly adopted and of-
fered a cheap generally available OS platform to allow everybody
to use HPC for a reasonable price.
These systems were cheaper for two reasons: the develop-
ment of standard CPUs was sponsored by different businesses,
and another expensive part, the memory architecture with many
CPUs addressing the same flat memory space, was replaced by sys-
tems comprised of many distinct so-called nodes, complete sys-
tems by themselves, plus some kind of interconnect technology,
which could also be taken from standard equipment partially.
The impact on the user was the necessity to code for these
“distributed memory systems”. Coding standards evolved, PVM
and later MPI, plus some proprietary paradigms; MPI still
dominates today.
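The distributed-memory style these standards impose can be sketched without any MPI installation: each (simulated) node owns only a slice of the data, computes a partial result locally, and the partial results are combined by an explicit reduction, the pattern behind MPI_Reduce. This is a plain-Python illustration; all function names here are ours, not part of any real MPI binding.

```python
# A toy sketch of the distributed-memory pattern (hedged: plain Python
# simulating ranks; real code would use MPI in C/Fortran or mpi4py).

def local_dot(rank, nranks, x, y):
    """Each 'rank' computes the dot product of its own slice only."""
    n = len(x)
    chunk = (n + nranks - 1) // nranks          # ceil(n / nranks)
    lo, hi = rank * chunk, min((rank + 1) * chunk, n)
    return sum(x[i] * y[i] for i in range(lo, hi))

def reduce_sum(partials):
    """Stands in for an explicit reduction such as MPI_Reduce."""
    return sum(partials)

def distributed_dot(x, y, nranks=4):
    partials = [local_dot(r, nranks, x, y) for r in range(nranks)]
    return reduce_sum(partials)
```

With x = y = [1.0] * 8 this returns 8.0, identical to the serial dot product; only the data ownership and the explicit combination step differ from shared-memory code.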
In the high end the specialized systems still dominated, because
they had architectural advantages and codes ran with very good
efficiency on them. Most codes were not parallel at all, or at least
not well parallelized, let alone for distributed memory. The skills
and experience needed to create parallel versions of codes were
lacking both in the companies selling high-end machines and on
the customers’ side. So it took some time for the MPPs to take over.
But it was inevitable, because this situation, sometimes called the
“democratization of HPC”, was positive for both commercial and
scientific developments. From then on, researchers and students
in academia and industry got access to resources they could not
have dreamt of before.
The x86 Dominance
One could dispute whether the x86 architecture is the most
advanced or elegant; undoubtedly it became the dominant one.
It was developed for mass markets, this was the strategy, and HPC
was a side effect, sponsored by the mass market. This worked
successfully, and the proportion of x86 Linux clusters in the Top500
list increased year by year. Because of the mass markets it made
economic sense to invest heavily in the development of LSI
technology (large-scale integration). The resulting progress, which
led to more and more transistors on the chips and increasing
frequencies, helped HPC applications, which got more, faster and
cheaper systems with each generation. Software did not have to
adapt much between generations, and software development
initially focused on parallelization for distributed memory, later
on new functionality and features. This worked for quite some years.
Systems of sufficient size to produce relevant results for typical
applications are cheap, and most codes do not really scale to a
level of parallelism that uses thousands of processes. Some
lighthouse projects do, after a lot of development which is not
easily affordable, but the average researcher or student is often
completely satisfied with something between a few and perhaps
a few hundred nodes of a standard Linux cluster.
Meanwhile a typical Intel-based dual-socket node of a standard
cluster already has 24 cores, soon a few more, so parallelization
is increasingly important. Then there is “Moore’s law”, formulated
by Gordon E. Moore, a co-founder of the Intel Corporation, in 1965.
Originally it stated that the number of transistors in an integrated
circuit doubles about every year; Moore later revised this to about
every two years, still an exponential increase. The statement is
often misinterpreted to mean that the performance of a CPU
doubles in the same period, which is misleading. What we observe
is ever-increasing core counts: the transistors are simply there,
so why not use them? That applications might not be able to
use these many cores is another question; in principle the
potential performance is there.
The “standard benchmark” in the HPC scene is the so-called HPL,
the High Performance LINPACK, on which the famous Top500 list
is based. It is often forgotten that HPL is only one of three parts
of the original LINPACK benchmark! But the other parts did not
show such good progress in performance, and their scalability is
limited, so they could not serve to spawn competition, even
between nations, which brings a lot of prestige. Moreover there
are at most a few real codes in the world with characteristics
similar to the HPL. In other words: it does not tell a lot. This is
a highly political situation, and probably quite some money is
wasted which would be better invested in brain-ware.
Anyway, it is increasingly doubtful that LSI technology will
advance at the same pace in the future. The lattice constant of
silicon is 5.43 Å. LSI technology is now using 14 nm processes,
the Intel Broadwell generation, which is roughly 26 times the
lattice constant, with 7 nm LSIs on the long-term horizon. Some
scientists assume there is still some room for improvement, but
once the features on the LSI are only a few atoms thick, one
cannot get another factor of ten. Sometimes the impact of some
future “disruptive technology” is claimed …
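The numbers can be checked directly: with a silicon lattice constant of 5.43 Å (0.543 nm), a 14 nm feature is only a few dozen lattice cells across, and 7 nm roughly half that. A trivial sketch:

```python
LATTICE_NM = 0.543   # silicon lattice constant, 5.43 Angstrom

def feature_in_lattice_units(feature_nm):
    """How many silicon lattice constants fit into a process feature size."""
    return feature_nm / LATTICE_NM

# 14 nm is about 26 lattice constants, 7 nm about 13: there is clearly
# no room left for another factor of ten in miniaturization.
```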
Another obstacle is the cost of building a chip fab for a new
generation of LSI technology. Certainly there are ways to optimize
both production and cost, and there will be bright ideas, but with
current generations a fab costs many billions of dollars. There
have even been ideas of slowing down the innovation cycle simply
to be able to recover the investment.
Memory Bandwidth
For many HPC applications it is not the compute power of the
CPU that counts, but the speed with which the CPU can exchange
data with the memory, the so-called memory bandwidth. If it is
neglected, the CPUs will wait for data and performance will
decrease. To describe the balance between CPU performance and
memory bandwidth one often uses the ratio between the amount
of bytes that can be transferred and the operations executed in
the same time, the “byte-per-flop ratio”. Traditionally vector
machines have always been great in this regard, and as far as
current plans are known this will continue. This is the reason
why a vector machine can realize a higher fraction of its
theoretical performance on a real application than a scalar
machine can.
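The byte-per-flop ratio can be made concrete with a back-of-the-envelope sketch (illustrative numbers, not vendor data): a triad-style loop a(i) = b(i) + s*c(i) performs two floating-point operations per iteration while moving three 8-byte operands.

```python
def byte_per_flop_demand(bytes_moved, flops):
    """Bytes the memory system must deliver per floating-point operation."""
    return bytes_moved / flops

# One iteration of a(i) = b(i) + s * c(i) in double precision:
# load b(i), load c(i), store a(i) -> 3 * 8 bytes; one add + one multiply.
triad_demand = byte_per_flop_demand(3 * 8, 2)   # 12 bytes per flop

def sustainable_gflops(bandwidth_gbs, demand_bpf):
    """Memory-bound performance ceiling: bandwidth divided by demand."""
    return bandwidth_gbs / demand_bpf
```

A CPU fed by 100 GB/s can therefore sustain at most about 8.3 GFlops on such a loop, regardless of its peak arithmetic rate; a machine with a higher byte-per-flop ratio raises this ceiling.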
Power Consumption
One other, but really overwhelming, problem is the increasing
power consumption of systems. Looking at different generations
of Linux clusters one can assume that a dual-socket node,
depending on how it is equipped with additional features, will
consume between 350 and 400 Watts.
This varies only slightly between generations and will perhaps
even increase. Even in Europe, sites are talking about electricity
budgets in excess of 1 MW, some even beyond 10 MW. One also has
to cool the system, which adds something like 10-30% to the
electricity cost! The cost of running such a system over 5 years
is in the same order of magnitude as the purchase price!
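A rough, purely illustrative calculation shows why the electricity bill reaches the order of the purchase price. The price per kWh and the cooling overhead below are assumptions for the sake of the example:

```python
def five_year_energy_cost(power_mw, eur_per_kwh=0.15, cooling_overhead=0.2,
                          years=5):
    """Electricity cost of running a system continuously, cooling included.
    All rates are illustrative assumptions, not quoted market prices."""
    hours = years * 365 * 24
    kwh = power_mw * 1000 * hours
    return kwh * eur_per_kwh * (1 + cooling_overhead)

# A 1 MW system at 0.15 EUR/kWh with 20% cooling overhead costs about
# 7.9 million EUR over five years, comparable to its purchase price.
```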
Power consumption grows with the frequency of the CPU; this is
the dominant reason why frequencies are stagnating, if not
decreasing, between product generations. There are more cores,
but the individual cores of an architecture are getting slower,
not faster! This has to be compensated on the application side by
increased scalability, which is not always easy to achieve. In the
past users easily got higher performance for almost every
application with each new generation; this is no longer the case.
Architecture Development
So the basic facts clearly indicate limitations of LSI technology,
and they have been known for quite some time. Consequently this
leads to attempts to gain additional performance by other means.
The developments in the market clearly indicate that high-end
users are willing to consider adapting their codes.
To utilize the available transistors, CPUs with more complex
functionality are designed, providing SIMD instructions or more
operations per cycle than just an add and a multiply. SIMD stands
for “Single Instruction Multiple Data”. Initially, at Intel, the SIMD
instructions provided two results in 64-bit precision per cycle
rather than one; in the meantime, with AVX, it is four, and on the
Xeon Phi already eight.
This makes a lot of sense. The electricity consumed by the
“administration” of the operations, instruction fetch and decode,
configuration of paths on the CPU and so on, is a significant
portion of the power consumption. If that energy is spent not on
just one operation but on two or even four, the energy per
operation is obviously reduced.
In essence, in order to overcome the performance bottleneck, the
architectures developed in a direction which was already realized
to a larger extent in the vector supercomputers decades ago!
There is almost no architecture left without such features. But
SIMD operations require application code to be vectorized.
Previously, code development and programming paradigms did
not care about fine-grained data parallelism, which is the basis
for the applicability of SIMD. Thus vector machines could not
show their strength; on scalar code they cannot beat a scalar CPU.
But strictly speaking there is no scalar CPU left: standard x86
systems do not achieve their optimal performance on purely
scalar code either!
Alternative Architectures?
During recent years, GPUs and many-core architectures have been
increasingly used in the HPC market. Although these systems
differ in detail, in any case the degree of parallelism needed is
much higher than with standard CPUs. This points in the same
direction: a low-level parallelism is used to address the
bottlenecks. In the case of Nvidia GPUs the individual so-called
CUDA cores, relatively simple entities, are steered by a more
complex central processor, realizing a high level of SIMD
parallelism. In the case of the Xeon Phi the individual cores are
also less powerful, but they are used in parallel through coding
paradigms which operate on the level of loop nests, an approach
which was already used on vector supercomputers during the 1980s!
So these alternative architectures not only go in the same
direction, the utilization of fine-grained parallelism, they also
indicate that coding styles which suit real vector machines will
dominate future software development. Note that at least some of
these systems have economic backing by a mass market: gaming!
History of the SX Series
As already stated, NEC started marketing the first commercially
available generation of this series, the SX-2, in the early 1980s.
The latest generation to date is the SX-ACE.
In 1998 Dr. Tadashi Watanabe, the developer of the SX series
and the major driver of NEC’s HPC business, received the
Eckert-Mauchly Award, a renowned US award, for these
achievements, and in 2006 also the Seymour Cray Award.
Watanabe was a very pragmatic developer, a knowledgeable
scientist and a visionary manager, all at the same time, and a
great person to talk to. From the beginning the SX architecture
was different from the competing Cray vector machines; it
already integrated some of the features that RISC CPUs of that
time were using, but combined them with vector registers and
pipelines.
For example, the SX could reorder instructions in hardware on
the fly, which Cray machines could not do; in effect this made
the system much easier to use. The single-processor SX-2 was the
fastest CPU of its time, the first to break the 1 GFlops barrier.
Naturally the compiler technology for automatic vectorization
and optimization was important.
With the SX-3, NEC developed a high-end shared-memory system
with up to four very strong CPUs. The resolution of memory access
conflicts between CPUs working in parallel, and the consistency of
data throughout the whole shared-memory system, are non-trivial,
and the development of such a memory architecture was a major
breakthrough.
With the SX-4, NEC implemented a high-end system based on CMOS.
It was the direct competitor of the Cray T90, with a huge advantage
in price and power consumption. The T90 used ECL technology and
achieved 1.8 GFlops peak performance. The SX-4 had a lower
frequency because of CMOS, but thanks to the higher density of the
LSIs it featured a higher level of parallelism on the CPU, leading
to 2 GFlops per CPU with up to 32 CPUs.
The memory architecture to support those CPUs was a major leap
ahead. On this system one could typically reach an efficiency of
20-40% of peak performance, measured across complete
applications, far beyond what is achieved on standard components
today. With this system NEC also implemented the first generation
of its own proprietary interconnect, the so-called IXS, which
allowed connecting several of these strong nodes into one highly
powerful system. The IXS featured advanced functions like a
global barrier and some hardware fault tolerance, things which
other companies only tackled years later.
With the SX-5 this direction was continued; it was a very
successful system, pushing the limits of CMOS-based CPUs and
multi-node configurations. At that time the efforts on the users’
side to parallelize their codes using “message passing”, PVM and
MPI back then, slowly started to pay off, so the need for huge
shared-memory systems slowly disappeared. Still, there were
quite some important codes which were only parallelized using a
shared-memory paradigm: initially the vendor-specific
“microtasking”, then “autotasking” (NEC and some other vendors
could automatically parallelize code), and later the standard
OpenMP. So there was still sufficient need for a huge and highly
capable shared-memory system.
But later it became clear that more and more codes could utilize
distributed-memory systems; moreover, the memory architecture
of huge shared-memory systems became increasingly complex and
thus costly. A memory is made of “banks”, individual pieces of
memory which are used in parallel. In order to provide data at the
necessary speed for a CPU, a certain minimum number of banks
per CPU is required, because a bank needs some cycles to recover
after every access. Consequently the number of banks had to scale
with the number of CPUs, so that the complexity of the network
connecting CPUs and banks grew with the square of the number
of CPUs. And if the CPUs became stronger they needed more
banks, which made it even worse. So this concept became too
expensive.
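The bank arithmetic behind this cost explosion is simple. If a bank is busy for r cycles after each access, at least r banks are needed to serve one access per cycle, and a full crossbar between c CPUs and c*r banks has on the order of c*c*r contact points. This is a deliberately simplified model, ignoring real network topologies:

```python
def banks_needed(busy_cycles, accesses_per_cycle=1):
    """Minimum interleaved banks to sustain one access per cycle per port."""
    return busy_cycles * accesses_per_cycle

def crossbar_links(cpus, busy_cycles):
    """Links in a full CPU-to-bank crossbar: grows with the square of CPUs."""
    return cpus * cpus * banks_needed(busy_cycles)

# Doubling the CPU count quadruples the interconnect:
# crossbar_links(8, 8) = 512, crossbar_links(16, 8) = 2048
```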
At the same time NEC had gathered a lot of experience in mixing
MPI and OpenMP parallelization in applications, and it turned out
that for the very fast CPUs the shared-memory parallelization was
typically not really efficient beyond four or eight threads. So the
concept was adapted accordingly, which led to the SX-6 and the
famous Earth Simulator, which kept the #1 spot of the Top500
for some years.
The Earth Simulator was a different machine, but based on the
same technology level and architecture as the SX-6. These systems
were the first implementation ever of a vector CPU on one single
chip. The CPU by itself was not the challenge, but rather the
interface to the memory to support the bandwidth needed for this
chip, basically the “pins”. This kind of packaging technology was
always a strong point in NEC’s technology portfolio.
The SX-7 was a somewhat special machine: it was based on the
SX-6 technology level, but featured 32 CPUs on one shared memory.
The SX-8 was again a new implementation, with some advanced
features which directly addressed the necessities of certain code
structures frequently encountered in applications. The
communication between the world-wide organization of
application analysts and the hardware and software developers
in Japan had proven very fruitful, and the SX-8 was very
successful in the market.
With the SX-9 the number of CPUs per node was pushed up again,
to 16, because of certain demands in the market. Just as an
indication of the complexity of the memory architecture: the
system had 32768 banks! The peak performance of one node was
1.6 TFlops.
Keep in mind that the efficiency achieved on real applications was
still in the range of 10% in bad cases, up to 20% or even more,
depending on the code. One should not compare this peak
performance with that of some standard chip today; that would
be misleading.
The Current Product: SX-ACE
In November 2014 NEC announced the next generation of the
SX vector product, the SX-ACE. Several systems are already
installed and accepted, also in Europe.
The main design target was the reduction of power consumption
per sustained performance, and measurements have proven that
this has been achieved. To this end, and in line with the
observation that shared-memory parallelization with vector CPUs
is useful on four cores, perhaps eight, but normally not beyond,
the individual node consists of a single-chip four-core processor,
which integrates the memory controllers and the interface to the
communication interconnect, a next generation of the proprietary
IXS.
That way a node is a small board, leading to a very compact
design. The complicated network between the memory and the
CPUs, which consumed about 70% of the power of an SX-9, was
eliminated. And naturally a new generation of LSI and PCB
technology also leads to reductions.
With a memory bandwidth of 256 GB/s this machine realizes one
“byte per flop”, a measure of the balance between pure compute
power and the memory bandwidth of a chip. In comparison,
typical standard processors nowadays offer in the order of one
eighth of this!
Memory bandwidth is increasingly a problem; this is known in the
market as the “memory wall” and has been a hot topic for more
than a decade. Standard components are built for a different
market, where memory bandwidth is not as important as it is for
HPC applications.
Two of these node boards are put into a module, eight such
modules are combined in one cage, and four of the cages, i.e. 64
nodes, are contained in one rack. The standard configuration of
512 nodes consists of eight racks.
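The packaging hierarchy multiplies out as stated:

```python
# SX-ACE packaging as described: nodes -> modules -> cages -> racks.
NODES_PER_MODULE = 2
MODULES_PER_CAGE = 8
CAGES_PER_RACK = 4

def nodes_per_rack():
    return NODES_PER_MODULE * MODULES_PER_CAGE * CAGES_PER_RACK   # 64

def racks_for(nodes):
    """Racks needed for a given node count (full racks assumed)."""
    return -(-nodes // nodes_per_rack())   # ceiling division

# The standard 512-node configuration indeed fills eight racks.
```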
NEC has already installed several SX-ACE systems in Europe, and
more installations are ongoing. The advantages of the system, the
superior memory bandwidth and the strong CPU with good
efficiency, provide scientists with performance levels on certain
codes which otherwise could not be achieved. Moreover, the
system has proven to be superior in terms of power consumption
per sustained application performance, and this is a really hot
topic nowadays.
LSI technologies have improved drastically during the last two
decades, but the ever increasing power consumption of systems
now poses serious limitations. Tasks like fetching instructions,
decoding them or configuring paths on the chip consume energy,
so it makes sense to execute these activities not for one operation
only, but for a set of such operations. This is what is meant by SIMD.

Data Parallelism
The underlying structure that enables the usage of SIMD
instructions is called “data parallelism”. For example the following
loop is data-parallel: the individual loop iterations could be
executed in any arbitrary order, and the results would stay the
same:

do i = 1, n
a(i) = b(i) + c(i)
end do

The complete independence of the loop iterations allows executing
them in parallel, and this is mapped to hardware with SIMD support.
Naturally one could also use other means to parallelize such a
loop; for example, for a very long loop one could execute chunks
of loop iterations on different processors of a shared-memory
machine using OpenMP. But this causes overhead, so it will not be
efficient for small trip counts, while a SIMD feature in hardware
can deal with such fine granularity.
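Chunking a long data-parallel loop across workers, as OpenMP does with threads, can be sketched in plain Python. There are no real threads here; the point is only that splitting the iteration space does not change the result, while each chunk carries scheduling overhead:

```python
def triad_serial(b, c, s):
    return [b[i] + s * c[i] for i in range(len(b))]

def triad_chunked(b, c, s, workers=4):
    """Split the iteration space into chunks, one per 'worker', then merge.
    This mimics what an OpenMP parallel loop does with real threads."""
    n = len(b)
    chunk = (n + workers - 1) // workers
    a = [0.0] * n
    for w in range(workers):            # each pass is one worker's share
        for i in range(w * chunk, min((w + 1) * chunk, n)):
            a[i] = b[i] + s * c[i]
    return a
```

Both versions produce identical output; for tiny trip counts the fixed cost of distributing chunks dominates, which is why SIMD hardware handles fine granularity better.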
As opposed to the example above, the following loop, a recursion,
is not data-parallel:
do i = 1, n-1
a(i+1) = a(i) + c(i)
end do
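The difference can be demonstrated directly. The sketch below executes the two loops above (translated to 0-based Python) in forward and in reverse iteration order:

```python
def recursion(a, c, reverse=False):
    """a(i+1) = a(i) + c(i) for i = 1..n-1, in the chosen iteration order."""
    a = list(a)
    idx = range(len(a) - 2, -1, -1) if reverse else range(len(a) - 1)
    for i in idx:                       # 0-based: a[i+1] = a[i] + c[i]
        a[i + 1] = a[i] + c[i]
    return a

def data_parallel(b, c, reverse=False):
    """a(i) = b(i) + c(i): the result is independent of iteration order."""
    idx = range(len(b) - 1, -1, -1) if reverse else range(len(b))
    a = [0] * len(b)
    for i in idx:
        a[i] = b[i] + c[i]
    return a
```

With a = [1, 1, 1, 1] and c = [1, 1, 1], the forward recursion yields [1, 2, 3, 4] but the reverse order yields [1, 2, 2, 2]; the data-parallel loop gives the same answer either way.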
Obviously a change of the order of the loop iterations, for example
the special case of starting with a loop index of n-1 and decreasing
it afterwards, will lead to different results. Therefore these
iterations cannot be executed in parallel.

Pipelining
The following description is very much simplified, but it explains
the basic principles. Operations like add and multiply consist of
so-called segments, which are executed one after the other in a
so-called pipeline. Each segment acts on the data to perform some
kind of intermediate step; only after the execution of the whole
pipeline is the final result available.
It is similar to an assembly line in the automotive industry. Now,
the segments do not need to deal with the same operation at a
given point in time; they can be kept active if sufficient inputs are
fed into the pipeline. Again, in the automotive industry the
different stages of the assembly line work on different cars, and
nobody would wait for a car to be completed before starting the
next one in the first stage of production, thereby keeping almost
all stages inactive. Picture such a situation as a diagram in which
time, in cycles, extends to the right and the five stages are
stacked vertically, a marked square showing which stage is busy
in a given cycle. With only one operation in flight this is
obviously not efficient, because the empty squares represent
transistors which consume power without actively producing
results. And the compute performance is not optimal: it takes five
cycles to compute one result.
As with the assembly line, what we want instead is a completely
filled pipeline: after the “latency”, five cycles, just the length of
the pipeline, we get a result every cycle. It is easy to feed the
pipeline with sufficient independent input data in the case of the
different iterations of a data-parallel loop.
If this is not possible, one could also use different operations
within the execution of a complicated loop body within one
given loop iteration. But typically operations in such a loop body
need results from previous operations of the same iteration, so
one would rather arrive at an intermediate utilization of the
pipeline: some operations can be calculated almost in parallel,
others have to wait for previous calculations to finish because
they need their results as input. In total the utilization of the
pipeline is better, but not optimal.
This was the situation before the trend towards SIMD computing
was revitalized. Before that, the advances of LSI technology
provided increased performance with every new product
generation. But then the race for ever higher frequencies stopped,
and now architectural enhancements are important again.
Single Instruction Multiple Data (SIMD)
Hardware realizations of SIMD have been present since the first
Cray-1, although that was a vector architecture. Nowadays the
instruction sets SSE and AVX, in their different generations and
flavors, use it, and similarly CUDA coding, for example, is also a
way to utilize data parallelism.
The “width” of a SIMD instruction, the number of loop iterations
executed in parallel by one instruction, is kept relatively small in
the case of SSE (two for double precision) and AVX (four for
double precision). The implementations of SIMD in standard
“scalar” processors typically involve add and multiply units which
can produce several results per clock cycle. They also feature
registers of the same “width” to keep several input parameters or
several results.
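What a compiler does when it vectorizes for a given SIMD width can be pictured as strip-mining: the loop is cut into strips of the register width, and each strip is executed by one wide instruction. The width of 4 corresponds to AVX in double precision; the simulated "instruction" below and its name are illustrative only.

```python
SIMD_WIDTH = 4   # e.g. 4 doubles per AVX register

def simd_add(b_strip, c_strip):
    """One simulated wide instruction: elementwise add of a whole strip."""
    return [x + y for x, y in zip(b_strip, c_strip)]

def vectorized_add(b, c):
    """Strip-mined loop: one 'instruction' per strip instead of per element."""
    a, instructions = [], 0
    for i in range(0, len(b), SIMD_WIDTH):
        a.extend(simd_add(b[i:i + SIMD_WIDTH], c[i:i + SIMD_WIDTH]))
        instructions += 1
    return a, instructions
```

For 10 elements this issues 3 wide instructions (two full strips and a remainder) instead of 10 scalar ones, which is exactly where the energy saving per operation comes from.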
In a way the pipelines are “doubled” for SSE, but both are
activated by the same single instruction.

Vector Computing
The difference between those SIMD architectures and SIMD with
a real vector architecture like the NEC SX-ACE, the only remaining
vector system, lies in the existence of vector registers.
In order to provide sufficient input to the functional units, vector
registers have proven to be very efficient. In a way they are a very
natural match to the idea of a pipeline, or assembly line: thinking
again of the automotive industry, there one would have a stock of
components to constantly feed the assembly line.
When an instruction is issued it takes some cycles for the first
result to appear, but after that one gets a new result every cycle,
until the vector register, which has a certain length, is completely
processed. Naturally this can only work if the loop iterations are
independent, that is, if the loop is data-parallel. It would not work
if one computation needed the result of another computation as
input while that is just being processed.
In fact the SX vector machines have always combined both ideas,
vector registers and parallel functional units, so-called
“pipe-sets”, for example four of them.
There is another advantage of the vector architecture in the SX,
and it proves to be very important. The scalar portion of the CPU
acts independently, and while the pipelines are kept busy with
input from the vector registers, the inevitable scalar instructions
which come with every loop can be executed. This involves
checking of loop counters, updating of address pointers and the
like. That way these instructions essentially do not consume time.
The number of cycles available to “hide” them is the length of the
vector registers divided by the number of pipe-sets.
The gaps between vector instructions can partially be hidden as
well, because new vector instructions can be issued in advance,
to take over immediately once the vector register with the input
data for the previous instruction is completely processed. The
instruction issue unit has the complete status information of all
available resources, specifically pipes and vector registers.
Parallel File Systems – The New Standard for HPC Storage
HPC computation results in the terabyte range are not uncommon. The problem in this
context is not so much storing the data at rest, but the performance of the necessary
copying back and forth in the course of the computation job flow, and the resulting
job turn-around time.
For interim results during a job’s runtime, and for fast storage of input and results
data, parallel file systems have established themselves as the standard to meet the
ever-increasing performance requirements of HPC storage systems.
Parallel NFS

Yesterday’s Solution: NFS for HPC Storage
The original Network File System (NFS), developed by Sun
Microsystems in the mid-eighties and now available in version 4.1,
has long been established as a de-facto standard for the
provisioning of a global namespace in networked computing.
A very widespread HPC cluster solution includes a central master
node acting simultaneously as an NFS server, with its local file
system storing input, interim and results data and exporting
them to all other cluster nodes.
There is of course an immediate bottleneck in this method: when
the load on the network is high, or where there are large numbers
of nodes, the NFS server can no longer keep up delivering or
receiving the data. In high-performance computing especially, the
nodes are interconnected at least via Gigabit Ethernet, so the total
throughput demand is well above what an NFS server with a
Gigabit interface can achieve. Even a powerful network connection
of the NFS server to the cluster, for example with 10-Gigabit
Ethernet, is only a temporary solution to this problem until the
next cluster upgrade. The fundamental problem remains: this
solution is not scalable. In addition, NFS is a difficult protocol to
cluster in terms of load balancing: either you have to ensure that
multiple NFS servers accessing the same data are constantly
synchronised, with the disadvantage of a noticeable
drop in performance, or you manually partition the global
namespace, which is also time-consuming. NFS is not suitable for
dynamic load balancing, as on paper it appears to be stateless but
in reality is, in fact, stateful.
Today’s Solution: Parallel File Systems
For some time, powerful commercial products have been available
to meet the high demands on an HPC storage system. The
open-source solutions BeeGFS, from the Fraunhofer Competence
Center for High Performance Computing, and Lustre are widely
used in the Linux HPC world, and several other free as well as
commercial parallel file system solutions exist.
What is new is that the time-honoured NFS is being upgraded,
including a parallel version, into an Internet Standard with the
aim of interoperability between all operating systems. The
original problem statement for parallel NFS access was written by
Garth Gibson, a professor at Carnegie Mellon University and
founder and CTO of Panasas. Gibson was already a renowned
figure, being one of the authors of the original paper on RAID
architecture from 1988.
The influence of Gibson and Panasas is clearly noticeable in the
design of pNFS. The powerful HPC file system developed by
Panasas, ActiveScale PanFS, with object-based storage devices
functioning as central components, is basically the commercial
continuation of the “Network-Attached Secure Disk (NASD)”
project also led by Garth Gibson at Carnegie Mellon University.
Parallel NFS
Parallel NFS (pNFS) is gradually emerging as the future standard
to meet requirements in the HPC environment. From the
industry’s as well as the user’s perspective, the benefits of
utilising standard solutions are indisputable: besides protecting
end-user investment, standards also ensure a defined level of
interoperability without restricting the choice of available
products. As a result, less user and administrator training is
required, which leads to simpler deployment and, at the same
time, greater acceptance.
As part of the NFS 4.1 Internet Standard, pNFS not only adopts the
semantics of NFS in terms of cache consistency and security, it
also represents an easy and flexible extension of the NFS 4
protocol. pNFS is optional; in other words, NFS 4.1
implementations do not have to include pNFS as a feature. The
NFS 4.1 Internet Standard is today published as IETF RFC 5661.
The pNFS protocol supports a separation of metadata and data:
a pNFS cluster comprises so-called storage devices, which store
the data of the shared file system, and a metadata server (MDS),
called Director Blade with Panasas, the actual NFS 4.1 server. The
metadata server keeps track of which data is stored on which
storage devices and how to access the files, the so-called layout.
Besides these “striping parameters”, the MDS also manages other
metadata, such as access rights, which is usually stored in a
file’s inode.
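What a layout’s “striping parameters” mean can be sketched with a toy mapping from file offset to storage device. The function and its parameter names are hypothetical; the real layout structures are defined in RFC 5661:

```python
def locate(offset, stripe_unit, num_devices):
    """Map a byte offset to (device index, stripe number, offset in unit).

    With round-robin striping, consecutive stripe units go to consecutive
    storage devices; the MDS hands the client this kind of layout so it
    can read from the devices directly, in parallel.
    """
    unit_index = offset // stripe_unit
    return (unit_index % num_devices,     # which storage device
            unit_index // num_devices,    # which stripe on that device
            offset % stripe_unit)         # offset within the stripe unit
```

With a 64 KiB stripe unit over 4 devices, offset 0 lives on device 0 and offset 200000 falls into stripe unit 3, hence device 3, so a large sequential read touches all devices concurrently.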
The layout types define which storage access protocol is used by
the clients to access the storage devices. Up until now, three
storage access protocols have been defined for pNFS: file-, block-
and object-based layouts, the first being described directly in
RFC 5661, the latter two in RFC 5663 and RFC 5664, respectively.
Last but not least, a control protocol is used by the MDS and the
storage devices to synchronise status data. This protocol is
deliberately left unspecified in the standard to give manufacturers
a certain flexibility. The NFS 4.1 standard does, however, specify
certain conditions which a control protocol has to fulfil, for
example how to deal with the change/modify time attributes
of files.
pNFS is backwards-compatible with non-pNFS-capable NFS 4
clients. In this case the MDS itself gathers the data from the
storage devices on behalf of the NFS client and presents it to the
client via NFS 4; the MDS acts as a kind of proxy server, which is,
for example, what the Director Blades from Panasas do.
What’s New in NFS 4.1?
NFS 4.1 is a minor update to NFS 4 that adds new features. One of
the optional features is parallel NFS (pNFS), but there is other new
functionality as well.
One of the technical enhancements is the use of sessions, a
persistent server object, dynamically created by the client. By
means of sessions, the state of an NFS connection can be stored,
no matter whether the connection is live or not. Sessions sur-
vive temporary downtimes both of the client and the server.
Each session has a so-called fore channel, which is the connec-
tion from the client to the server for all RPC operations, and op-
tionally a back channel for RPC callbacks from the server that
can now also be established across firewall boundaries. Sessions
can be trunked to increase the bandwidth. Besides session
trunking there is also a client ID trunking for grouping together
several sessions to the same client ID.
By means of sessions, NFS can be seen as a really stateful proto-
col with so-called “Exactly-Once Semantics” (EOS). Until now, a
necessary but unspecified reply cache within the NFS server
handled identical RPC operations that had been sent several
times. In practice, however, this statefulness is not very robust
and sometimes leads to the well-known stale NFS handles. In
NFS 4.1, the reply cache is a mandatory part of the NFS
implementation, storing the server’s replies to RPC requests
persistently on disk.
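The effect of this mandatory reply cache can be illustrated with a toy server (a simplified sketch with invented names, not the actual NFS 4.1 slot/sequence wire protocol): duplicate RPCs are answered from the cache instead of being re-executed.

```python
class SessionServer:
    """Toy exactly-once semantics: replay cached replies for retransmits."""

    def __init__(self):
        self.reply_cache = {}   # (slot, seq) -> cached reply
        self.counter = 0        # stands in for a non-idempotent side effect

    def handle(self, slot, seq, op):
        key = (slot, seq)
        if key in self.reply_cache:     # duplicate RPC: replay, don't re-execute
            return self.reply_cache[key]
        self.counter += 1               # the operation runs exactly once
        reply = f"{op}-done-{self.counter}"
        self.reply_cache[key] = reply
        return reply

srv = SessionServer()
first = srv.handle(0, 1, "CREATE")
retry = srv.handle(0, 1, "CREATE")   # client retransmit after a timeout
```

Because the cache is persisted to disk in NFS 4.1, the same replay works even across a server restart.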
Another new feature of NFS 4.1 is delegation for directories: NFS
clients can be given temporary exclusive access to directories.
Before, this has only been possible for simple files. With the
forthcoming version 4.2 of the NFS standard, federated filesys-
tems will be added as a feature, which represents the NFS coun-
terpart of Microsoft’s DFS (distributed filesystem).
Some Proposed NFSv4.2 Features
NFSv4.2 promises many features that end-users have been
requesting, and that make NFS more relevant not only as an
“every day” protocol, but as one that has application beyond the
data center.
Server Side Copy
Server-Side Copy (SSC) removes one leg of a copy operation. In-
stead of reading entire files or even directories of files from one
server through the client, and then writing them out to another,
SSC permits the destination server to communicate directly to
the source server without client involvement, and removes the
limitations on server to client bandwidth and the possible con-
gestion it may cause.
Application Data Blocks (ADB)
ADB allows definition of the format of a file; for example, a VM
image or a database. This feature will allow initialization of data
stores; a single operation from the client can create a 300GB da-
tabase or a VM image on the server.
Guaranteed Space Reservation & Hole Punching
As storage demands continue to increase, various efficiency
techniques can be employed to give the appearance of a large
virtual pool of storage on a much smaller storage system. Thin
provisioning, (where space appears available and reserved, but
is not committed) is commonplace, but often problematic to
manage in fast growing environments. The guaranteed space
reservation feature in NFSv4.2 will ensure that, regardless of the
thin provisioning policies, individual files will always have space
available for their maximum extent.
While such guarantees are a reassurance for the end-user, they
don’t help the storage administrator in his or her desire to fully
utilize all his available storage. In support of better storage ef-
ficiencies, NFSv4.2 will introduce support for sparse files. Com-
monly called “hole punching”, deleted and unused parts of files
are returned to the storage system’s free space pool.
[Figure: pNFS architecture. pNFS clients speak NFS 4.1 to the
metadata server and a Storage Access Protocol to the storage
devices; the metadata server and the storage devices synchronise
via the Control Protocol.]
pNFS Layout Types
If storage devices act simply as NFS 4 file servers, the file layout
is used. It is the only storage access protocol directly specified
in the NFS 4.1 standard. Besides the stripe sizes and stripe lo-
cations (storage devices), it also includes the NFS file handles
which the client needs to use to access the separate file areas.
The file layout is compact and static: the striping information
does not change even if the file itself changes, enabling
multiple pNFS clients to cache the layout simultaneously and
avoiding synchronisation overhead between clients and the MDS
or between the MDS and storage devices.
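Conceptually, such a file layout is a small, static record that maps byte offsets to storage devices and per-device file handles. A minimal sketch, with hypothetical field names rather than the RFC 5661 XDR definitions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileLayout:
    """Hypothetical pNFS file-layout record handed out by the MDS."""
    stripe_unit: int      # bytes per stripe unit
    devices: tuple        # ordered storage-device IDs
    fh_per_device: tuple  # NFS file handle to use on each device

    def locate(self, offset):
        """Map a byte offset to (device, file handle, offset within unit)."""
        unit = offset // self.stripe_unit
        idx = unit % len(self.devices)
        return self.devices[idx], self.fh_per_device[idx], offset % self.stripe_unit

# A 64 KiB stripe unit round-robined over three storage devices:
layout = FileLayout(65536, ("osd1", "osd2", "osd3"), ("fh1", "fh2", "fh3"))
```

Because the record never changes while the file grows, every client can cache it and compute target devices locally.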
File system authorisation and client authentication can be well
implemented with the file layout. When using NFS 4 as the stor-
age access protocol, client authentication merely depends on
the security flavor used: with the RPCSEC_GSS security flavor,
for example, client access is kerberized, and the server
controls access authorization using the specified ACLs and
cryptographic processes.
In contrast, the block/volume layout uses volume identifiers
and block offsets and extents to specify a file layout. SCSI block
commands are used to access storage devices. As the block
distribution can change with each write access, the layout
must be updated more frequently than with the file layout.
Block-based access to storage devices does not offer any secure
authentication option for the accessing SCSI initiator. Secure
SAN authorisation is possible with host granularity only, based
on World Wide Names (WWNs) with Fibre Channel or Initiator
Node Names (IQNs) with iSCSI. The server cannot enforce access
control governed by the file system. Instead, a pNFS
client essentially complies with the access rights voluntarily,
and the storage device has to trust the pNFS client, a
fundamental access control problem that has been a recurrent
issue throughout the history of the NFS protocol.
The object layout is syntactically similar to the file layout, but it
uses the SCSI object command set for data access to so-called
Object-based Storage Devices (OSDs) and is heavily based on
the DirectFLOW protocol of the ActiveScale PanFS from Pana-
sas. From the very start, Object-Based Storage Devices were
designed for secure authentication and access. So-called capa-
bilities are used for object access which involves the MDS issu-
ing so-called capabilities to the pNFS clients. The ownership of
these capabilities represents the authoritative access right to
an object.
pNFS can be upgraded to integrate other storage access pro-
tocols and operating systems, and storage manufacturers also
have the option to ship additional layout drivers for their pNFS
implementations.
The Panasas file system uses parallel and redundant access to
object storage devices (OSDs), per-file RAID, distributed metada-
ta management, consistent client caching, file locking services,
and internal cluster management to provide a scalable, fault tol-
erant, high performance distributed file system.
The clustered design of the storage system and the use of client-
driven RAID provide scalable performance to many concurrent
file system clients through parallel access to file data that is
striped across OSD storage nodes. RAID recovery is performed
in parallel by the cluster of metadata managers, and declus-
tered data placement yields scalable RAID rebuild rates as the
storage system grows larger.
Panasas HPC Storage
Panasas ActiveStor parallel storage takes a very different ap-
proach by allowing compute clients to read and write directly
to the storage, entirely eliminating filer head bottlenecks and
allowing single file system capacity and performance to scale
linearly to extreme levels using a proprietary protocol called Di-
rectFlow. Panasas has actively shared its core knowledge with
a consortium of storage industry technology leaders to create
an industry standard protocol which will eventually replace the
need for DirectFlow. This protocol, called parallel NFS (pNFS) is
now an optional extension of the NFS v4.1 standard.
The Panasas system is a production system that provides file
service to some of the largest compute clusters in the world, in
scientific labs, in seismic data processing, in digital animation
studios, in computational fluid dynamics, in semiconductor
manufacturing, and in general purpose computing environ-
ments. In these environments, hundreds or thousands of file
system clients share data and generate very high aggregate I/O
load on the file system. The Panasas system is designed to sup-
port several thousand clients and storage capacities in excess
of a petabyte.
The unique aspects of the Panasas system are its use of per-file,
client-driven RAID, its parallel RAID rebuild, its treatment of dif-
ferent classes of metadata (block, file, system) and a commodity
parts based blade hardware with integrated UPS. Of course, the
system has many other features (such as object storage, fault
tolerance, caching and cache consistency, and a simplified
management model) that are not unique, but are necessary for a
scalable system implementation.
Panasas File System Background
The storage cluster is divided into storage nodes and manager
nodes at a ratio of about 10 storage nodes to 1 manager node,
although that ratio is variable. The storage nodes implement an
object store, and are accessed directly from Panasas file system
clients during I/O operations. The manager nodes manage the
overall storage cluster, implement the distributed file system se-
mantics, handle recovery of storage node failures, and provide an
exported view of the Panasas file system via NFS and CIFS. The
following Figure gives a basic view of the system components.
Object Storage
An object is a container for data and attributes; it is analogous
to the inode inside a traditional UNIX file system implementa-
tion. Specialized storage nodes called Object Storage Devices
(OSD) store objects in a local OSDFS file system. The object in-
terface addresses objects in a two-level (partition ID/object ID)
namespace. The OSD wire protocol provides byte-oriented
access to the data, attribute manipulation, creation and deletion
of objects, and several other specialized operations. Panasas
uses an iSCSI transport to carry OSD commands that are very
similar to the OSDv2 standard currently in progress within SNIA
and ANSI-T10.
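The two-level (partition ID / object ID) namespace and the byte-oriented access model can be sketched with a toy in-memory object store (illustrative only; the real interface is the iSCSI/OSD command set):

```python
class ToyOSD:
    """In-memory stand-in for an object storage device."""

    def __init__(self):
        self.objects = {}   # (partition_id, object_id) -> bytearray
        self.attrs = {}     # (partition_id, object_id) -> attribute dict

    def create(self, pid, oid):
        self.objects[(pid, oid)] = bytearray()
        self.attrs[(pid, oid)] = {}

    def write(self, pid, oid, offset, data):
        """Byte-oriented write at an arbitrary offset, growing the object."""
        buf = self.objects[(pid, oid)]
        if len(buf) < offset + len(data):
            buf.extend(b"\x00" * (offset + len(data) - len(buf)))
        buf[offset:offset + len(data)] = data

    def read(self, pid, oid, offset, length):
        return bytes(self.objects[(pid, oid)][offset:offset + length])

osd = ToyOSD()
osd.create(7, 42)                 # partition 7, object 42
osd.write(7, 42, 4096, b"hello")  # write at a byte offset, not a block
```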
The Panasas file system is layered over the object storage. Each file
is striped over two or more objects to provide redundancy and high
bandwidth access. The file system semantics are implemented by
metadata managers that mediate access to objects from clients
of the file system. The clients access the object storage using the
iSCSI/OSD protocol for Read and Write operations. The I/O op-
erations proceed directly and in parallel to the storage nodes,
bypassing the metadata managers. The clients interact with the
out-of-band metadata managers via RPC to obtain access capa-
bilities and location information for the objects that store files.
Object attributes are used to store file-level attributes, and di-
rectories are implemented with objects that store name to ob-
ject ID mappings. Thus the file system metadata is kept in the
object store itself, rather than being kept in a separate database
or some other form of storage on the metadata nodes.
System Software Components
The major software subsystems are the OSDFS object storage
system, the Panasas file system metadata manager, the Pana-
sas file system client, the NFS/CIFS gateway, and the overall
cluster management system.
• The Panasas client is an installable kernel module that runs in-
side the Linux kernel. The kernel module implements the stand-
ard VFS interface, so that the client hosts can mount the file sys-
tem and use a POSIX interface to the storage system.
• Each storage cluster node runs a common platform that is based
on FreeBSD, with additional services to provide hardware moni-
toring, configuration management, and overall control.

[Figure: Panasas system components. The compute node runs the
PanFS client; the manager node runs the SysMgr and the NFS/CIFS
gateway; the storage node runs OSDFS. Clients reach manager
nodes via RPC and storage nodes via iSCSI/OSD.]
• The storage nodes use a specialized local file system (OSD-
FS) that implements the object storage primitives. They
implement an iSCSI target and the OSD command set. The
OSDFS object store and iSCSI target/OSD command proces-
sor are kernel modules. OSDFS is concerned with traditional
block-level file system issues such as efficient disk arm utili-
zation, media management (i.e. error handling), high through-
put, as well as the OSD interface.
• The cluster manager (SysMgr) maintains the global configura-
tion, and it controls the other services and nodes in the storage
cluster. There is an associated management application that pro-
vides both a command line interface (CLI) and an HTML interface
(GUI). These are all user level applications that run on a subset
of the manager nodes. The cluster manager is concerned with
membership in the storage cluster, fault detection, configuration
management, and overall control for operations like software
upgrade and system restart.
• The Panasas metadata manager (PanFS) implements the file
system semantics and manages data striping across the object
storage devices. This is a user level application that runs on every
manager node. The metadata manager is concerned with distrib-
uted file system issues such as secure multi-user access, main-
taining consistent file- and object-level metadata, client cache
coherency, and recovery from client, storage node, and metadata
server crashes. Fault tolerance is based on a local transaction log
that is replicated to a backup on a different manager node.
• The NFS and CIFS services provide access to the file system
for hosts that cannot use our Linux installable file system
client. The NFS service is a tuned version of the standard
FreeBSD NFS server that runs inside the kernel. The CIFS ser-
vice is based on Samba and runs at user level. In turn, these
services use a local instance of the file system client, which
runs inside the FreeBSD kernel. These gateway services run
on every manager node to provide a clustered NFS and CIFS
service.
Commodity Hardware Platform
The storage cluster nodes are implemented as blades that are
very compact computer systems made from commodity parts.
The blades are clustered together to provide a scalable plat-
form. The OSD StorageBlade module and metadata manager Di-
rectorBlade module use the same form factor blade and fit into
the same chassis slots.
Storage Management
Traditional storage management tasks involve partitioning
available storage space into LUNs (i.e., logical units that are one
or more disks, or a subset of a RAID array), assigning LUN own-
ership to different hosts, configuring RAID parameters, creating
file systems or databases on LUNs, and connecting clients to the
correct server for their storage. This can be a labor-intensive
process. Panasas provides a simplified model for storage manage-
ment that shields the storage administrator from these kinds of
details and allows a single, part-time admin to manage systems
that are hundreds of terabytes in size.
The Panasas storage system presents itself as a file system with
a POSIX interface, and hides most of the complexities of storage
management. Clients have a single mount point for the entire
system. The /etc/fstab file references the cluster manager, and
from that the client learns the location of the metadata service
instances. The administrator can add storage while the system
is online, and new resources are automatically discovered. To
manage available storage, Panasas introduces two basic stor-
age concepts: a physical storage pool called a BladeSet, and a
logical quota tree called a Volume. The BladeSet is a collection
of StorageBlade modules in one or more shelves that comprise a
RAID fault domain. Panasas mitigates the risk of large fault do-
mains with the scalable rebuild performance described below
in the text. The BladeSet is a hard physical boundary for the vol-
umes it contains. A BladeSet can be grown at any time, either by
adding more StorageBlade modules, or by merging two existing
BladeSets together.
The Volume is a directory hierarchy that has a quota constraint
and is assigned to a particular BladeSet. The quota can be
changed at any time, and capacity is not allocated to the Vol-
ume until it is used, so multiple volumes compete for space
within their BladeSet and grow on demand. The files in those
volumes are distributed among all the StorageBlade modules
in the BladeSet. Volumes appear in the file system name space
as directories. Clients have a single mount point for the whole
storage system, and volumes are simply directories below the
mount point. There is no need to update client mounts when the
admin creates, deletes, or renames volumes.
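The division of labor between the hard physical pool (BladeSet) and the soft quota trees (Volumes) can be sketched as follows; all names and capacities are invented for illustration:

```python
class BladeSet:
    """Physical pool: a hard capacity boundary for its volumes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

class Volume:
    """Quota tree: a soft limit, with space allocated only as files grow."""

    def __init__(self, bladeset, quota):
        self.bladeset, self.quota, self.used = bladeset, quota, 0

    def allocate(self, nbytes):
        if self.used + nbytes > self.quota:
            raise RuntimeError("volume quota exceeded")
        if self.bladeset.used + nbytes > self.bladeset.capacity:
            raise RuntimeError("BladeSet full")
        self.used += nbytes
        self.bladeset.used += nbytes

pool = BladeSet(capacity=100)
scratch = Volume(pool, quota=80)   # quotas may oversubscribe the pool
home = Volume(pool, quota=80)
scratch.allocate(60)               # volumes compete for the shared pool
```

The quotas together (160) exceed the pool (100): capacity is only committed on use, which is exactly why multiple volumes can grow on demand until the BladeSet itself fills.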
Automatic Capacity Balancing
Capacity imbalance occurs when expanding a BladeSet (i.e., add-
ing new, empty storage nodes), merging two BladeSets, and re-
placing a storage node following a failure. In the latter scenario,
the imbalance is the result of the RAID rebuild, which uses spare
capacity on every storage node rather than dedicating a specif-
ic “hot spare” node. This provides better throughput during re-
build, but causes the system to have a new, empty storage node
after the failed storage node is replaced. The system automati-
cally balances used capacity across storage nodes in a BladeSet
using two mechanisms: passive balancing and active balancing.
Passive balancing changes the probability that a storage node
will be used for a new component of a file, based on its availa-
ble capacity. This takes effect when files are created, and when
their stripe size is increased to include more storage nodes. Ac-
tive balancing is done by moving an existing component object
from one storage node to another, and updating the storage
map for the affected file. During the transfer, the file is transpar-
ently marked read-only by the storage management layer, and
the capacity balancer skips files that are being actively written.
Capacity balancing is thus transparent to file system clients.
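Passive balancing amounts to weighting the choice of storage node for a new component object by its free capacity, so emptier nodes fill faster and full nodes are never chosen. A sketch (not the actual Panasas placement algorithm):

```python
import random

random.seed(0)  # deterministic toy run

def pick_storage_node(free_bytes):
    """Passive balancing: choose a node with probability proportional
    to its available capacity; a full node (weight 0) is never chosen."""
    nodes = list(free_bytes)
    weights = [free_bytes[n] for n in nodes]
    return random.choices(nodes, weights=weights, k=1)[0]

free = {"sb1": 0, "sb2": 500, "sb3": 1500}  # bytes free per StorageBlade
picks = [pick_storage_node(free) for _ in range(1000)]
```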
Object RAID and Reconstruction
Panasas protects against loss of a data object or an entire stor-
age node by striping files across objects stored on different
storage nodes, using a fault-tolerant striping algorithm such as
RAID-1 or RAID-5. Small files are mirrored on two objects, and
larger files are striped more widely to provide higher bandwidth
and less capacity overhead from parity information. The per-file
RAID layout means that parity information for different files is
not mixed together, and easily allows different files to use differ-
ent RAID schemes alongside each other. This property and the
security mechanisms of the OSD protocol make it possible to
enforce access control over files even as clients access storage
nodes directly. It also enables what is perhaps the most novel
aspect of our system, client-driven RAID. That is, the clients are
responsible for computing and writing parity. The OSD security
mechanism also allows multiple metadata managers to manage
objects on the same storage device without heavyweight coor-
dination or interference from each other.
Client-driven, per-file RAID has four advantages for large-scale
storage systems. First, by having clients compute parity for
their own data, the XOR power of the system scales up as the
number of clients increases. We measured XOR processing dur-
ing streaming write bandwidth loads at 7% of the client’s CPU,
with the rest going to the OSD/iSCSI/TCP/IP stack and other file
system overhead. Moving XOR computation out of the storage
system into the client requires some additional work to handle
failures. Clients are responsible for generating good data and
good parity for it. Because the RAID equation is per-file, an er-
rant client can only damage its own data. However, if a client
fails during a write, the metadata manager will scrub parity to
ensure the parity equation is correct.
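At its core, client-driven per-file RAID-5 means the client XORs its own stripe units to produce parity, and the same XOR over the survivors reconstructs a lost unit. A minimal sketch:

```python
def xor_blocks(blocks):
    """Bytewise XOR of equal-sized blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

# The client splits its stripe into data units and computes parity
# itself, so the XOR capacity of the system grows with the clients.
data_units = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_units)

# If the node holding one unit is lost, XOR of the survivors plus
# parity reconstructs it:
recovered = xor_blocks([data_units[0], data_units[2], parity])
```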
The second advantage of client-driven RAID is that clients can
perform an end-to-end data integrity check. Data has to go
through the disk subsystem, through the network interface on
the storage nodes, through the network and routers, through
the NIC on the client, and all of these transits can introduce er-
rors with a very low probability. Clients can choose to read par-
ity as well as data, and verify parity as part of a read operation.
If errors are detected, the operation is retried. If the error is per-
sistent, an alert is raised and the read operation fails. By check-
ing parity across storage nodes within the client, the system can
ensure end-to-end data integrity. This is another novel property
of per-file, client-driven RAID.
Third, per-file RAID protection lets the metadata managers re-
build files in parallel. Although parallel rebuild is theoretically
possible in block-based RAID, it is rarely implemented. This is due
to the fact that the disks are owned by a single RAID controller,
even in dual-ported configurations. Large storage systems have
multiple RAID controllers that are not interconnected. Since the
SCSI Block command set does not provide fine-grained synchro-
nization operations, it is difficult for multiple RAID controllers
to coordinate a complicated operation such as an online rebuild
without external communication. Even if they could, without
connectivity to the disks in the affected parity group, other RAID
controllers would be unable to assist. Even in a high-availability
configuration, each disk is typically only attached to two differ-
ent RAID controllers, which limits the potential speedup to 2x.
When a StorageBlade module fails, the metadata managers that
own Volumes within that BladeSet determine what files are af-
fected, and then they farm out file reconstruction work to every
other metadata manager in the system. Metadata managers re-
build their own files first, but if they finish early or do not own
any Volumes in the affected Bladeset, they are free to aid other
metadata managers. Declustered parity groups spread out the
I/O workload among all StorageBlade modules in the BladeSet.
The result is that larger storage clusters reconstruct lost data
more quickly.
The fourth advantage of per-file RAID is that unrecoverable
faults can be constrained to individual files. The most com-
monly encountered double-failure scenario with RAID-5 is an
unrecoverable read error (i.e., grown media defect) during the
reconstruction of a failed storage device. The second storage
device is still healthy, but it has been unable to read a sector,
which prevents rebuild of the sector lost from the first drive and
potentially the entire stripe or LUN, depending on the design of
the RAID controller. With block-based RAID, it is difficult or im-
possible to directly map any lost sectors back to higher-level file
system data structures, so a full file system check and media
scan will be required to locate and repair the damage. A more
typical response is to fail the rebuild entirely. RAID control-
lers monitor drives in an effort to scrub out media defects and
avoid this bad scenario, and the Panasas system does media
scrubbing, too. However, with high capacity SATA drives, the
chance of encountering a media defect on drive B while rebuild-
ing drive A is still significant. With per-file RAID-5, this sort of
double failure means that only a single file is lost, and the spe-
cific file can be easily identified and reported to the administra-
tor. While block-based RAID systems have been compelled to
introduce RAID-6 (i.e., fault tolerant schemes that handle two
failures), the Panasas solution is able to deploy highly reliable
RAID-5 systems with large, high performance storage pools.
RAID Rebuild Performance
RAID rebuild performance determines how quickly the system
can recover data when a storage node is lost. Short rebuild
times reduce the window in which a second failure can cause
data loss. There are three techniques to reduce rebuild times: re-
ducing the size of the RAID parity group, declustering the place-
ment of parity group elements, and rebuilding files in parallel
using multiple RAID engines.
[Figure: Declustered parity groups. Parity group elements,
indicated by letters, are distributed among 8 storage devices;
the capital letters mark the parity groups that all share the
second storage node.]
The rebuild bandwidth is the rate at which reconstructed data
is written to the system when a storage node is being recon-
structed. The system must read N times as much as it writes,
depending on the width of the RAID parity group, so the overall
throughput of the storage system is several times higher than
the rebuild rate. A narrower RAID parity group requires fewer
read and XOR operations to rebuild, so will result in a higher re-
build bandwidth.
However, it also results in higher capacity overhead for parity
data, and can limit bandwidth during normal I/O. Thus, selection
of the RAID parity group size is a trade-off between capacity
overhead, on-line performance, and rebuild performance. Un-
derstanding declustering is easier with a picture. In the figure
above, each parity group has 4 elements, which are indicat-
ed by letters placed in each storage device. They are distributed
among 8 storage devices. The ratio between the parity group
size and the available storage devices is the declustering ratio,
which in this example is ½. In the picture, capital letters repre-
sent those parity groups that all share the second storage node.
If the second storage device were to fail, the system would have
to read the surviving members of its parity groups to rebuild
the lost elements. You can see that the other elements of those
parity groups occupy about ½ of each other storage device. For
this simple example you can assume each parity element is the
same size so all the devices are filled equally. In a real system,
the component objects will have various sizes depending on the
overall file size, although each member of a parity group will be
very close in size. There will be thousands or millions of objects
on each device, and the Panasas system uses active balancing
to move component objects between storage nodes to level ca-
pacity. Declustering means that rebuild requires reading a sub-
set of each device, with the proportion being approximately the
same as the declustering ratio. The total amount of data read is
the same with and without declustering, but with declustering
it is spread out over more devices. When writing the reconstruct-
ed elements, two elements of the same parity group cannot be
located on the same storage node. Declustering leaves many
storage devices available for the reconstructed parity element,
and randomizing the placement of each file’s parity group lets
the system spread out the write I/O over all the storage. Thus
declustering RAID parity groups has the important property of
taking a fixed amount of rebuild I/O and spreading it out over
more storage devices.
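The figure’s intuition can be checked numerically: placing 4-element parity groups at random over 8 devices and failing one device, each surviving device must be read in proportion to roughly (group_size - 1)/(n_devices - 1), about 0.43, close to the declustering ratio of ½ discussed above. A toy simulation (not the actual placement algorithm):

```python
import random

def rebuild_read_fraction(n_devices=8, group_size=4, n_groups=4000, failed=0):
    """Simulate random declustered placement and return the fraction of
    the surviving devices' data that must be read to rebuild `failed`."""
    random.seed(1)                      # deterministic toy run
    stored = [0] * n_devices            # parity-group elements per device
    to_read = [0] * n_devices           # elements read for the rebuild
    for _ in range(n_groups):
        members = random.sample(range(n_devices), group_size)
        for d in members:
            stored[d] += 1
        if failed in members:           # survivors of this group are read
            for d in members:
                if d != failed:
                    to_read[d] += 1
    survivors = [d for d in range(n_devices) if d != failed]
    return sum(to_read[d] for d in survivors) / sum(stored[d] for d in survivors)

frac = rebuild_read_fraction()          # ~ (4 - 1) / (8 - 1)
```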
Having per-file RAID allows the Panasas system to divide the
work among the available DirectorBlade modules by assigning
different files to different DirectorBlade modules. This division
is dynamic with a simple master/worker model in which meta-
data services make themselves available as workers, and each
metadata service acts as the master for the volumes it imple-
ments. By doing rebuilds in parallel on all DirectorBlade mod-
ules, the system can apply more XOR throughput and utilize the
additional I/O bandwidth obtained with declustering.
Metadata Management
There are several kinds of metadata in the Panasas system.
These include the mapping from object IDs to sets of block ad-
dresses, mapping files to sets of objects, file system attributes
such as ACLs and owners, file system namespace information
(i.e., directories), and configuration/management information
about the storage cluster itself.
Block-level Metadata
Block-level metadata is managed internally by OSDFS, the file
system that is optimized to store objects. OSDFS uses a floating
block allocation scheme where data, block pointers, and object
descriptors are batched into large write operations. The write
buffer is protected by the integrated UPS, and it is flushed to
disk on power failure or system panics. Fragmentation was an
issue in early versions of OSDFS that used a first-fit block allo-
cator, but this has been significantly mitigated in later versions
that use a modified best-fit allocator.
OSDFS stores higher level file system data structures, such as
the partition and object tables, in a modified BTree data struc-
ture. Block mapping for each object uses a traditional direct/
indirect/double-indirect scheme. Free blocks are tracked by
a proprietary bitmap-like data structure that is optimized for
copy-on-write reference counting, part of OSDFS’s integrated
support for object- and partition-level copy-on-write snapshots.
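The reach of a direct/indirect/double-indirect mapping is simple arithmetic. With hypothetical parameters (4 KiB blocks, 8-byte block pointers, 12 direct pointers; the actual OSDFS values are not given here), the maximum object size works out to about 1 GiB:

```python
BLOCK = 4096                 # bytes per block (hypothetical)
PTR = 8                      # bytes per block pointer (hypothetical)
PER_BLOCK = BLOCK // PTR     # pointers per indirect block: 512

N_DIRECT = 12                # direct pointers in the object descriptor
max_blocks = N_DIRECT + PER_BLOCK + PER_BLOCK ** 2
max_bytes = max_blocks * BLOCK   # ~1 GiB with these parameters
```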
Block-level metadata management consumes most of the cycles
in file system implementations. By delegating storage manage-
ment to OSDFS, the Panasas metadata managers have an or-
der of magnitude less work to do than the equivalent SAN file
system metadata manager that must track all the blocks in the
system.
File-level Metadata
Above the block layer is the metadata about files. This includes
user-visible information such as the owner, size, and modification
time, as well as internal information that identifies which ob-
jects store the file and how the data is striped across those
objects (i.e., the file’s storage map). Our system stores this file
metadata in object attributes on two of the N objects used to
store the file’s data. The rest of the objects have basic attrib-
utes like their individual length and modify times, but the high-
er-level file system attributes are only stored on the two attrib-
ute-storing components.
File names are implemented in directories similar to traditional
UNIX file systems. Directories are special files that store an array
of directory entries. A directory entry identifies a file with a tuple
of <serviceID, partitionID, objectID>, and also includes two <os-
dID> fields that are hints about the location of the attribute stor-
ing components. The partitionID/objectID is the two-level ob-
ject numbering scheme of the OSD interface, and Panasas uses
a partition for each volume. Directories are mirrored (RAID-1)
in two objects so that the small write operations associated
with directory updates are efficient.
Clients are allowed to read, cache and parse directories, or they
can use a Lookup RPC to the metadata manager to translate
a name to a <serviceID, partitionID, objectID> tuple and the
<osdID> location hints. The serviceID provides a hint about the
metadata manager for the file, although clients may be redirect-
ed to the metadata manager that currently controls the file.
The osdID hint can become out-of-date if reconstruction or
active balancing moves an object. If both osdID hints fail, the
metadata manager has to multicast a GetAttributes to the stor-
age nodes in the BladeSet to locate an object. The partitionID
and objectID are the same on every storage node that stores a
component of the file, so this technique will always work. Once
the file is located, the metadata manager automatically updates
the stored hints in the directory, allowing future accesses to
bypass this step.
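The hint-then-search lookup described above can be sketched as follows (function and parameter names are invented; the real fallback mechanism is a multicast GetAttributes RPC to the BladeSet):

```python
def locate_object(pid, oid, osd_hints, all_osds, has_object):
    """Try the stored osdID hints first; fall back to asking every
    storage node in the BladeSet. Returns (osd, hints_were_stale)."""
    for osd in osd_hints:
        if has_object(osd, pid, oid):
            return osd, False            # hint was good, no update needed
    for osd in all_osds:                 # hints stale: search the BladeSet
        if has_object(osd, pid, oid):
            return osd, True             # caller should refresh the hint
    raise FileNotFoundError((pid, oid))

# The object was moved by reconstruction or active balancing:
placement = {("p1", 42): "osd5"}
has = lambda osd, pid, oid: placement.get((pid, oid)) == osd
found, stale = locate_object("p1", 42, ["osd2", "osd3"],
                             ["osd1", "osd2", "osd3", "osd4", "osd5"], has)
```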
[Figure: Creating a file. The client sends a CREATE RPC to the
metadata server, which records the operation in its transaction
log (op-log, cap-log, reply cache), issues two CREATE OSD
operations to the storage nodes, replies to the client, and then
performs the deferred directory WRITE updates.]
File operations may require several object operations. The figure
above shows the steps used in creating a file. The metadata
manager keeps a local journal to record in-progress actions so it
can recover from object failures and metadata manager crashes
that occur when updating multiple objects. For example, creat-
ing a file is fairly complex task that requires updating the par-
ent directory as well as creating the new file. There are 2 Create
OSD operations to create the first two components of the file,
and 2 Write OSD operations, one to each replica of the parent
directory. As a performance optimization, the metadata serv-
er also grants the client read and write access to the file and
returns the appropriate capabilities to the client as part of the
FileCreate results. The server makes record of these write
capabilities to support error recovery if the client crashes while
writing the file. Note that the directory update (step 7) occurs
after the reply, so that many directory updates can be batched
together. The deferred update is protected by the op-log
record that gets deleted in step 8 after the successful directory
update.
The metadata manager maintains an op-log that records the object create and the directory updates that are in progress. This log entry is removed when the operation is complete. If the metadata service crashes and restarts, or a failure event moves the metadata service to a different manager node, then the op-log is processed to determine which operations were active at the time of the failure. The metadata manager rolls the operations forward or backward to ensure the object store is consistent. If no reply to the operation has been generated, then the operation is rolled back. If a reply has been generated but pending operations are outstanding (e.g., directory updates), then the operation is rolled forward.
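The roll-forward/roll-back rule can be captured in a few lines. This is an illustrative sketch, not Panasas code; the record format is invented.

```python
def recover(op_log):
    """Decide the fate of each in-flight operation found in the op-log.

    Each record is a dict like {"id": 17, "replied": True}. If a reply
    was already sent to the client, the pending work (e.g. the deferred
    directory update) must be rolled forward; otherwise the half-done
    operation is rolled back.
    """
    return {op["id"]: ("roll_forward" if op["replied"] else "roll_back")
            for op in op_log}
```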
The write capability is stored in a cap-log so that when a metadata server starts it knows which of its files are busy. In addition to the "piggybacked" write capability returned by FileCreate, the client can also execute a StartWrite RPC to obtain a separate write capability. The cap-log entry is removed when the client releases the write cap via an EndWrite RPC. If the client reports an error during its I/O, then a repair log entry is made and the file is scheduled for repair. Read and write capabilities are cached by the client over multiple system calls, further reducing metadata server traffic.

"Outstanding in the HPC world, the ActiveStor solutions provided by Panasas are undoubtedly the only HPC storage solutions that combine highest scalability and performance with a convincing ease of management."
Christoph Budziszewski | HPC Solution Architect

Parallel File Systems
Panasas HPC Storage
System-level Metadata
The final layer of metadata is information about the overall system itself. One possibility would be to store this information in objects and bootstrap the system through a discovery protocol. The most difficult aspect of that approach is reasoning about the fault model. The system must be able to come up and be manageable while it is only partially functional. Panasas chose instead a model with a small replicated set of system managers, each of which stores a replica of the system configuration metadata.
Each system manager maintains a local database, outside of the object storage system. Berkeley DB is used to store tables that represent the system model. The different system manager instances are members of a replication set that uses Lamport's part-time parliament (PTP) protocol to make decisions and update the configuration information. Clusters are configured with one, three, or five system managers so that the voting quorum has an odd number of members and a network partition will cause a minority of system managers to disable themselves.
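The reason for the odd member counts is simple majority arithmetic: at most one side of a partition can hold a strict majority, and with an even count a split down the middle leaves no side active. A quick sketch of that arithmetic (illustrative only, not PTP itself):

```python
def majority(total):
    # Strict majority of the configured system managers.
    return total // 2 + 1

def active_sides(partition_sizes, total):
    # Count how many sides of a network partition keep a quorum.
    return sum(1 for size in partition_sizes if size >= majority(total))
```

With five managers, a 3/2 split keeps exactly one side active; four managers split 2/2 would leave the whole cluster disabled, which is why quorums of one, three, or five are used.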
System configuration state includes both static state, such as the identity of the blades in the system, and dynamic state, such as the online/offline state of various services and error conditions associated with different system components. Each state update decision, whether it is updating the admin password or activating a service, involves a voting round and an update round according to the PTP protocol. Database updates are performed within the PTP transactions to keep the databases synchronized. Finally, the system keeps backup copies of the system configuration databases on several other blades to guard against catastrophic loss of every system manager blade.
Blade configuration is pulled from the system managers as part of each blade's startup sequence. The initial DHCP handshake conveys the addresses of the system managers, and thereafter the local OS on each blade pulls configuration information from the system managers via RPC.
The cluster manager implementation has two layers. The lower-level PTP layer manages the voting rounds and ensures that partitioned or newly added system managers will be brought up to date with the quorum. The application layer above it uses the voting and update interface to make decisions. Complex system operations may involve several steps, and the system manager has to keep track of its progress so it can tolerate a crash and roll back or roll forward as appropriate.
For example, creating a volume (i.e., a quota-tree) involves file system operations to create a top-level directory, object operations to create an object partition within OSDFS on each StorageBlade module, service operations to activate the appropriate metadata manager, and configuration database operations to reflect the addition of the volume.
Recovery is enabled by having two PTP transactions. The initial PTP transaction determines if the volume should be created, and it creates a record about the volume that is marked as incomplete. Then the system manager does all the necessary service activations and file and storage operations. When these all complete, a final PTP transaction is performed to commit the operation. If the system manager crashes before the final PTP transaction, it will detect the incomplete operation the next time it restarts, and then roll the operation forward or backward.
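The intent-record pattern behind this can be sketched as follows. Names are hypothetical; the real system manager runs these steps as PTP transactions against its Berkeley DB tables.

```python
class VolumeCreator:
    def __init__(self):
        self.db = {}  # volume name -> "incomplete" | "complete"

    def create_volume(self, name, do_work, crash_before_commit=False):
        self.db[name] = "incomplete"   # first transaction: record intent
        do_work(name)                  # service, file and storage operations
        if crash_before_commit:
            return                     # simulated crash: commit never runs
        self.db[name] = "complete"     # final transaction: commit

    def incomplete_on_restart(self):
        # On restart, these operations are rolled forward or backward.
        return [n for n, state in self.db.items() if state == "incomplete"]
```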
BeeGFS
BeeGFS combines multiple storage servers to provide a highly scalable shared network file system with striped file contents. This way, it allows users to overcome the tight performance limitations of single servers, single network interconnects, a limited number of hard drives, etc. In such a system, the high throughput demands of large numbers of clients can easily be satisfied, and even a single client can profit from the aggregated performance of all the storage servers in the system.
Key Aspects:

Maximum Flexibility
- BeeGFS supports a wide range of Linux distributions as well as Linux kernels
- Storage servers run on top of an existing local file system using the POSIX interface
- Clients and servers can be added to an existing system without downtime
- BeeGFS supports multiple networks and dynamic failover between different networks
- BeeGFS provides support for metadata and file contents mirroring
This is made possible by a separation of metadata and file contents. While storage servers are responsible for storing stripes of the actual contents of user files, metadata servers coordinate file placement and striping among the storage servers and inform the clients about certain file details when necessary. When accessing file contents, BeeGFS clients directly contact the storage servers to perform file I/O and communicate with multiple servers simultaneously, giving your applications truly parallel access to the file data. To keep the metadata access latency (e.g. directory lookups) at a minimum, BeeGFS also allows you to distribute the metadata across multiple servers, so that each of the metadata servers stores a part of the global file system namespace.
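The effect of namespace partitioning can be illustrated with a toy model. BeeGFS actually assigns each directory to a metadata server at creation time (randomly, as described below); the hash used here is only a stand-in that shows why many directories spread evenly over few servers:

```python
import hashlib

def mds_for_directory(path, num_mds):
    # Map a directory to one metadata server. A cryptographic hash gives
    # a near-uniform spread, mimicking BeeGFS's random assignment.
    digest = hashlib.sha256(path.encode()).hexdigest()
    return int(digest, 16) % num_mds
```

Hashing 8,000 directory paths onto 4 servers puts roughly 2,000 on each, so with far more directories than servers the load evens out.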
Parallel File Systems
BeeGFS
The following picture shows the system architecture and roles within a BeeGFS instance. Note that although all roles in the picture are running on different hosts, it is also possible to run any combination of client and servers on the same machine.
A very important differentiator for BeeGFS is its ease of use. The philosophy behind BeeGFS is to keep the hurdles as low as possible and to allow as many people as possible to use a parallel file system for their work. BeeGFS was designed to work with any local file system (such as ext4 or xfs) for data storage, as long as it is POSIX compliant. That way, system administrators can choose the local file system that they prefer and are experienced with, and do not need to learn new tools. This makes it much easier to adopt the technology of parallel storage, especially for sites with limited resources and manpower.
BeeGFS needs four main components to work:
- the Management Server
- the Object Storage Server
- the Metadata Server
- the file system client

There are two supporting daemons:
- The file system client needs a "helper daemon" to run on the client.
- The Admon daemon may run in the storage cluster and gives system administrators a better view of what is going on. It is not a necessary component, and a BeeGFS system is fully operable without it.
Management Server (MS)
The MS is the component of the file system making sure that all other processes can find each other. It is the first daemon that has to be set up, and all configuration files of a BeeGFS installation have to point to the same MS. There is always exactly one MS in a system. The MS maintains a list of all file system components; this includes clients, Metadata Servers, Metadata Targets, Storage Servers and Storage Targets.
(Figure: BeeGFS Architecture)

Metadata Server (MDS)
The MDS contains the metadata of the system. BeeGFS implements a fully scalable architecture for metadata and allows a practically unlimited number of MDS. Each MDS has exactly one MetaDataTarget (MDT). Each directory in the global file system is attached to exactly one MDS handling its content. Since the assignment of directories to MDS is random, BeeGFS can make efficient use of a (very) large number of MDS. As long as the number of directories is significantly larger than the number of MDS (which is normally the case), there will be a roughly equal share of load on each of the MDS.
The MDS is a fully multithreaded daemon; the number of threads to start has to be set to an appropriate value for the hardware used. Factors to consider here are the number of CPU cores in the MDS, the number of clients, the workload, and the number and type of disks forming the MDT. In general, serving metadata needs some CPU power, and especially if one wants to make use of fast underlying storage like SSDs, the CPUs should not be too weak, to avoid becoming the bottleneck.
Object Storage Server (OSS)
The OSS is the main service to store the file contents. Each OSS may have one or many ObjectStorageTargets (OSTs), where an OST is a RAID set (or LUN) with a local file system (such as xfs or ext4) on top. A typical OST consists of 6 to 12 hard drives in RAID6, so an OSS with 36 drives might be organized into 3 OSTs, each being a RAID6 with 12 drives.
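The arithmetic for the example above: RAID6 spends two drives per set on parity, so 36 drives arranged as three 12-drive OSTs leave 30 drives' worth of usable capacity. The numbers are the ones from the text, not a sizing recommendation.

```python
def raid6_data_drives(drives_per_set):
    # RAID6 stores dual parity, costing two drives per set.
    return drives_per_set - 2

total_drives, num_osts = 36, 3
drives_per_ost = total_drives // num_osts                     # 12 drives per OST
usable_drives = num_osts * raid6_data_drives(drives_per_ost)  # 30 data drives
```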
The OSS is a user space daemon that is started on the appropriate machine. It will work with any local file system that is POSIX compliant and supported by the Linux distribution. The underlying file system may be picked according to the workload, or personal preference and experience.
Like the MDS, the OSS is a fully multithreaded daemon, and as with the MDS, the choice of the number of threads has to be made with the underlying hardware in mind. The most important factor for the number of OSS threads is the performance and number of the OSTs the OSS is serving. In contrast to the MDS, the traffic on the OSTs should be (and normally is) sequential to a large extent.

"transtec has successfully deployed the first petabyte filesystem with BeeGFS. And it is very impressive to see how BeeGFS combines performance with reliability."
Hossein Rouhani | HPC Solution Engineer
File System Client
BeeGFS comes with a native client that runs on Linux; it is a kernel module that has to be compiled to match the kernel in use. The client is an open-source product made available under the GPL. The client has to be installed on all hosts that should access BeeGFS with maximum performance.
The client needs two services to start:
fhgfs-helperd: This daemon provides certain auxiliary functionalities for the fhgfs-client. The helperd does not require any additional configuration; it is simply accessed by the fhgfs-client running on the same host. The helperd mainly performs DNS resolution for the client kernel module and provides the capability to write the log file.
fhgfs-client: This service loads the client kernel module; if necessary, it will (re)compile the kernel module. The recompilation is done using an automatic build process that is started when (for example) the kernel version changes. This helps to make the file system easy to use: after applying a (minor or major) kernel update, the BeeGFS client will still start up after a reboot, recompile itself, and the file system will be available.
Maximum Scalability
- BeeGFS supports distributed file contents with flexible striping across the storage servers as well as distributed metadata
- Best-in-class client throughput of 3.5 GB/s write and 4.0 GB/s read with a single I/O stream on FDR InfiniBand
- Best-in-class metadata performance with linear scalability through dynamic metadata namespace partitioning
- Best-in-class storage throughput with a flexible choice of the underlying file system to perfectly fit the storage hardware
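Round-robin striping of file contents across storage targets can be pictured like this. It is an illustrative model; the chunk size and target count are arbitrary parameters, not BeeGFS defaults.

```python
def target_for_offset(offset, chunk_size, num_targets):
    # Striped layout: consecutive chunks go to consecutive targets,
    # wrapping around, so a large sequential stream hits all storage
    # servers in parallel.
    return (offset // chunk_size) % num_targets
```

A single client reading a large file therefore aggregates the bandwidth of every target rather than being limited to one server.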
Maximum Usability
BeeGFS was designed with easy installation and administration in mind. BeeGFS requires no kernel patches (the client is a patchless kernel module, the server components are userspace daemons), comes with graphical cluster installation tools and allows you to add more clients and servers to the running system whenever you want. The graphical administration and monitoring system enables users to handle typical management tasks in an intuitive way: cluster installation, storage service management, live throughput and health monitoring, file browsing, striping configuration and more.
Features:
- Data and metadata for each file are separated into stripes and distributed across the existing servers to aggregate server speed
- Secure authentication process between client and server to protect sensitive data
- Fair I/O option on user level to prevent a single user with multiple requests from stalling other users' requests
- Automatic network failover, i.e. if InfiniBand is down, BeeGFS automatically uses Ethernet if available
- Optional server failover for external shared memory resources (high availability)
- Runs on non-standard architectures, e.g. PowerPC or Intel Xeon Phi
- Provides a coherent mode, in which it is guaranteed that changes to a file or directory by one client are always immediately visible to other clients
transtec has an outstanding history of parallel filesystem implementations spanning more than 12 years. Starting with some of the first Lustre implementations, some of them extending to nearly 1 petabyte gross capacity, which constituted an enormous size 10 years ago, through the deployment of many enterprise-ready commercial solutions, we have now successfully deployed dozens of BeeGFS environments of various sizes. In particular, transtec has been the first provider of a 1-petabyte filesystem based on BeeGFS, then known as FraunhoferFS, or FhGFS.
No matter what the customer's requirements are: transtec has expertise in all parallel filesystem technologies and is able to provide the customer with an individually sized and tailor-made solution that meets their demands concerning capacity and performance, and above all, because this is what customers take as a matter of course, is stable and reliable at the same time.
Parallel File Systems
General Parallel File System (GPFS)
General Parallel Filesystem
The IBM General Parallel File System (GPFS) has always been considered a pioneer of big data storage and continues today to lead in introducing industry-leading storage technologies. Since 1998, GPFS has led the industry with many technologies that make the storage of large quantities of file data possible. The latest version continues in that tradition: GPFS 3.5 represents a significant milestone in the evolution of big data management. GPFS 3.5 introduces some revolutionary new features that clearly demonstrate IBM's commitment to providing industry-leading storage solutions.
What is GPFS?
GPFS is more than clustered file system software; it is a full-featured set of file management tools. This includes advanced storage virtualization, integrated high availability, automated tiered storage management and the performance to effectively manage very large quantities of file data. GPFS allows a group of computers concurrent access to a common set of file data over a common SAN infrastructure, a network or a mix of connection types. The computers can run any mix of AIX, Linux or Windows Server operating systems. GPFS provides storage management, information life cycle management tools and centralized administration, and allows for shared access to file systems from remote GPFS clusters, providing a global namespace.
A GPFS cluster can be a single node, two nodes providing a high availability platform supporting a database application, for example, or thousands of nodes used for applications like the modeling of weather patterns. The largest existing configurations exceed 5,000 nodes. GPFS has been available since 1998 and has been field-proven for more than 14 years on some of the world's most powerful supercomputers, providing reliability and efficient use of infrastructure bandwidth.
GPFS was designed from the beginning to support high performance parallel workloads and has since been proven very effective for a variety of applications. Today it is installed in clusters supporting big data analytics, gene sequencing, digital media and scalable file serving. These applications are used across many industries including financial, retail, digital media, biotechnology, science and government. GPFS continues to push technology limits by being deployed in very demanding large environments. You may not need multiple petabytes of data today, but you will, and when you get there you can rest assured GPFS has already been tested in these environments. This leadership is what makes GPFS a solid solution for any size of application. Supported operating systems for GPFS Version 3.5 include AIX, Red Hat, SUSE and Debian Linux distributions and Windows Server 2008.
The file system
A GPFS file system is built from a collection of arrays that contain the file system data and metadata. A file system can be built from a single disk or contain thousands of disks storing petabytes of data. Each file system can be accessible from all nodes within the cluster. There is no practical limit on the size of a file system; the architectural limit is 2^99 bytes. As an example, current GPFS customers are using single file systems up to 5.4 PB in size, and others have file systems containing billions of files.
Application interfaces
Applications access files through standard POSIX file system interfaces. Since all nodes see all of the file data, applications can scale out easily. Any node in the cluster can concurrently read or update a common set of files. GPFS maintains the coherency and consistency of the file system using sophisticated byte-range locking, token (distributed lock) management and journaling. This means that applications using standard POSIX locking semantics do not need to be modified to run successfully on a GPFS file system.
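At the core of byte-range locking is a simple interval-overlap test: two accesses conflict only if their ranges intersect. A minimal sketch of the idea, far removed from GPFS's actual token protocol:

```python
def ranges_conflict(a_start, a_end, b_start, b_end):
    # Half-open byte ranges [start, end) conflict iff they overlap.
    return a_start < b_end and b_start < a_end
```

Because non-overlapping writes take disjoint byte-range locks, many nodes can update different regions of one file concurrently instead of serializing on a whole-file lock.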
In addition to the standard interfaces, GPFS provides a unique set of extended interfaces which can be used to provide advanced application functionality. Using these extended interfaces, an application can determine the storage pool placement of a file, create a file clone and manage quotas. These extended interfaces provide features in addition to the standard POSIX interface.
Performance and scalability
GPFS provides unparalleled performance for unstructured data. GPFS achieves high performance I/O by:
- Striping data across multiple disks attached to multiple nodes
- High performance metadata (inode) scans
- Supporting a wide range of file system block sizes to match I/O requirements
- Utilizing advanced algorithms to improve read-ahead and write-behind I/O operations
- Using block-level locking, based on a very sophisticated scalable token management system, to provide data consistency while allowing multiple application nodes concurrent access to the files
When creating a GPFS file system you provide a list of raw devices, and they are assigned to GPFS as Network Shared Disks (NSDs). Once an NSD is defined, all of the nodes in the GPFS cluster can access the disk, using a local disk connection, or using the GPFS NSD network protocol for shipping data over a TCP/IP or InfiniBand connection.
GPFS token (distributed lock) management coordinates access to NSDs, ensuring the consistency of file system data and metadata when different nodes access the same file. Token management responsibility is dynamically allocated among designated nodes in the cluster. GPFS can assign one or more nodes to act as token managers for a single file system. This allows greater scalability when you have a large number of files with high transaction workloads. In the event of a node failure, the token management responsibility is moved to another node.
All data stored in a GPFS file system is striped across all of the disks within a storage pool, whether the pool contains 2 LUNs or 2,000 LUNs. This wide data striping allows you to get the best performance from the available storage. When disks are added to or removed from a storage pool, existing file data can be redistributed across the new storage to improve performance. Data redistribution can be done automatically or can be scheduled. When redistributing data, you can assign a single node to perform the task, to control the impact on a production workload, or have all of the nodes in the cluster participate in the data movement, to complete the operation as quickly as possible. Online storage configuration is a good example of an enterprise-class storage management feature included in GPFS.
To achieve the highest possible data access performance, GPFS recognizes typical access patterns, including sequential, reverse sequential and random, and optimizes I/O access for these patterns.
Along with distributed token management, GPFS provides scalable metadata management by allowing all nodes of the cluster accessing the file system to perform file metadata operations. This feature distinguishes GPFS from other cluster file systems, which typically have a centralized metadata server handling fixed regions of the file namespace. A centralized metadata server can often become a performance bottleneck for metadata-intensive operations, limiting scalability and possibly introducing a single point of failure. GPFS solves this problem by enabling all nodes to manage metadata.
Administration
GPFS provides an administration model that is easy to use and is consistent with standard file system administration practices, while providing extensions for the clustering aspects of GPFS. These functions support cluster management and other standard file system administration functions such as user quotas, snapshots and extended access control lists. GPFS administration tools simplify cluster-wide tasks. A single GPFS command can perform a file system function across the entire cluster, and most can be issued from any node in the cluster. Optionally, you can designate a group of administration nodes that can be used to perform all cluster administration tasks, or only authorize a single login session to perform admin commands cluster-wide. This allows for higher security by reducing the scope of node-to-node administrative access.
Rolling upgrades allow you to upgrade individual nodes in the cluster while the file system remains online. Rolling upgrades are supported between two major version levels of GPFS (and service levels within those releases). For example, you can mix GPFS 3.4 nodes with GPFS 3.5 nodes while migrating between releases.
Quotas enable the administrator to manage file system usage by users and groups across the cluster. GPFS provides commands to generate quota reports by user, by group and on a sub-tree of a file system called a fileset. Quotas can be set on the number of files (inodes) and the total size of the files. New in GPFS 3.5, you can define per-fileset user and group quotas, which allows for more options in quota configuration. In addition to traditional quota management, the GPFS policy engine can be used to query the file system metadata and generate customized space usage reports.
An SNMP interface allows monitoring by network management applications. The SNMP agent provides information on the GPFS cluster and generates traps when events occur in the cluster. For example, an event is generated when a file system is mounted or if a node fails. The SNMP agent runs on Linux and AIX. You can monitor a heterogeneous cluster as long as the agent runs on a Linux or AIX node.
You can customize the response to cluster events using GPFS callbacks. A callback is an administrator-defined script that is executed when an event occurs, for example, when a file system is unmounted or a file system is low on free space. Callbacks can be used to create custom responses to GPFS events and integrate these notifications into various cluster monitoring tools. GPFS provides support for the Data Management API (DMAPI) interface, which is IBM's implementation of the X/Open data storage management API. The DMAPI interface allows vendors of storage management applications, such as IBM Tivoli Storage Manager (TSM) and High Performance Storage System (HPSS), to provide Hierarchical Storage Management (HSM) support for GPFS.
GPFS supports POSIX and NFSv4 access control lists (ACLs). NFSv4 ACLs can be used to serve files using NFSv4, but can also be used in other deployments, for example, to provide ACL support to nodes running Windows. To provide concurrent access from multiple operating system types, GPFS allows you to run mixed POSIX and NFSv4 permissions in a single file system and map user and group IDs between Windows and Linux/UNIX environments. File systems may be exported to clients outside the cluster through NFS. GPFS is often used as the base for a scalable NFS file service infrastructure. The GPFS clustered NFS (cNFS) feature provides data availability to NFS clients by providing NFS service continuation if an NFS server fails. This allows a GPFS cluster to provide scalable file service with simultaneous access to a common set of data from multiple nodes. The clustered NFS tools include monitoring of file services and IP address failover. GPFS cNFS supports NFSv3 only; you can export a GPFS file system using NFSv4, but not with cNFS.
Data availability
GPFS is fault tolerant and can be configured for continued access to data even if cluster nodes or storage systems fail. This is accomplished through robust clustering features and support for synchronous and asynchronous data replication.
GPFS software includes the infrastructure to handle data consistency and availability. This means that GPFS does not rely on external applications for cluster operations like node failover. The clustering support goes beyond who owns the data or who has access to the disks. In a GPFS cluster, all nodes see all of the data, and all cluster operations can be done by any node in the cluster with a server license. All nodes are capable of performing all tasks. What tasks a node can perform is determined by the type of license and the cluster configuration.
As a part of the built-in availability tools, GPFS continuously monitors the health of the file system components. When failures are detected, appropriate recovery action is taken automatically.
Extensive journaling and recovery capabilities are provided
which maintain metadata consistency when a node holding
locks or performing administrative services fails.
Snapshots can be used to protect the file system's contents against user error by preserving a point-in-time version of the file system or of a sub-tree of a file system called a fileset. GPFS implements a space-efficient snapshot mechanism that generates a map of the file system or fileset at the time the snapshot is taken. New data blocks are consumed only when the file system data has been deleted or modified after the snapshot was created. This is done using a redirect-on-write technique (sometimes called copy-on-write). Snapshot data is placed in existing storage pools, simplifying administration and optimizing the use of existing storage. The snapshot function can be used with a backup program, for example, to run while the file system is in use and still obtain a consistent copy of the file system as it was when the snapshot was created. In addition, snapshots provide an online backup capability that allows files to be recovered easily from common problems such as accidental file deletion.
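The redirect-on-write idea can be modeled in a few lines: a snapshot freezes the current block map, and later writes install new block versions without touching the blocks the snapshot still references. This is an illustrative model, not GPFS internals.

```python
class SnapVolume:
    def __init__(self):
        self.blocks = {}     # logical block number -> data
        self.snapshots = []  # frozen block maps

    def write(self, lbn, data):
        # A write produces a new block version; rebuilding the map here
        # leaves earlier snapshot maps pointing at the old versions, so
        # only data changed after a snapshot consumes new space.
        self.blocks = {**self.blocks, lbn: data}

    def snapshot(self):
        self.snapshots.append(self.blocks)  # freeze the current map
        return len(self.snapshots) - 1

    def read(self, lbn, snap=None):
        src = self.blocks if snap is None else self.snapshots[snap]
        return src.get(lbn)
```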
Data Replication
For an additional level of data availability and protection, synchronous data replication is available for file system metadata and data. GPFS provides a very flexible replication model that allows you to replicate a file, a set of files, or an entire file system. The replication status of a file can be changed using a command or by using the policy-based management tools. Synchronous replication allows for continuous operation even if a path to an array, an array itself or an entire site fails.
Synchronous replication is location-aware, which allows you to optimize data access when the replicas are separated across a WAN. GPFS knows which copy of the data is "local", so read-heavy applications can get local read performance even when data is replicated over a WAN. Synchronous replication works well for many workloads, replicating data across storage arrays within a data center, within a campus or across geographical distances using high-quality wide area network connections.
When wide area network connections are not high performance or are not reliable, an asynchronous approach to data replication is required. GPFS 3.5 introduces a feature called Active File Management (AFM). AFM is a distributed disk caching technology developed at IBM Research that allows the expansion of the GPFS global namespace across geographical distances. It can be used to provide high availability between sites or to provide local "copies" of data distributed to one or more GPFS clusters. For more details on AFM see the section entitled Sharing data between clusters.
For a higher level of cluster reliability, GPFS includes advanced clustering features to maintain network connections. If a network connection to a node fails, GPFS automatically tries to reestablish the connection before marking the node unavailable. This can provide better uptime in environments communicating across a WAN or experiencing network issues.
Using these features along with a high availability infrastructure ensures a reliable enterprise-class storage solution.
GPFS Native RAID (GNR)
Larger disk drives and larger file systems are creating challenges for traditional storage controllers. Current RAID 5 and RAID 6 based arrays do not address the challenges of exabyte-scale storage performance, reliability and management. To address these challenges, GPFS Native RAID (GNR) brings storage device management into GPFS. With GNR, GPFS can directly manage thousands of storage devices. These storage devices can be individual disk drives or any other block device, eliminating the need for a storage controller.
GNR employs a de-clustered approach to RAID. The de-clustered
architecture reduces the impact of drive failures by spreading
data over all of the available storage devices improving appli-
cation IO and recovery performance. GNR provides very high
reliability through an 8+3 Reed-Solomon based RAID code that
divides each block of a file into 8 parts and associated parity.
This algorithm scales easily starting with as few as 11 storage
devices and growing to over 500 per storage pod. Spreading the
data over many devices helps provide predictable storage perfor-
mance and fast recovery times measured in minutes rather than
hours in the case of a device failure.
In addition to performance improvements GNR provides
advanced checksum protection to ensure data integrity.
Checksum information is stored on disk and verified all the way
to the NSD client.
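As a back-of-the-envelope illustration of the 8+3 layout and de-clustering described above (an illustrative calculation, not GPFS code; the numbers follow directly from the text):

```python
# Illustrative arithmetic for an 8+3 Reed-Solomon layout (not GPFS code).

DATA_STRIPS = 8    # each block is divided into 8 data parts
PARITY_STRIPS = 3  # plus 3 associated parity parts

total = DATA_STRIPS + PARITY_STRIPS      # 11 strips per block
efficiency = DATA_STRIPS / total         # usable fraction of raw capacity
tolerated_failures = PARITY_STRIPS       # concurrent device losses survived

print(f"strips per block: {total}")              # 11, matching the 11-device minimum
print(f"capacity efficiency: {efficiency:.1%}")  # roughly 72.7%
print(f"tolerates {tolerated_failures} concurrent device failures")

# De-clustering: after a failure, rebuild reads are spread over all
# surviving devices instead of a small RAID group.
def rebuild_fraction(devices: int) -> float:
    """Approximate share of the rebuild read load each surviving
    device carries when strips are spread across all devices."""
    return 1.0 / (devices - 1)

# With 11 devices each survivor does ~10% of the reads; with 500, ~0.2%,
# which is why recovery times shrink from hours to minutes.
assert rebuild_fraction(11) > rebuild_fraction(500)
```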
Information lifecycle management (ILM) toolset
GPFS can help you to achieve data lifecycle management efficiencies through policy-driven automation and tiered storage man-
agement. The use of storage pools, filesets and user-defined poli-
cies provide the ability to better match the cost of your storage to
the value of your data.
Storage pools are used to manage groups of disks within a file sys-
tem. Using storage pools you can create tiers of storage by group-
ing disks based on performance, locality or reliability characteris-
tics. For example, one pool could contain high-performance solid-state drives (SSDs) and another more economical 7,200 RPM disk
storage. These types of storage pools are called internal storage
pools. When data is placed in or moved between internal storage
pools all of the data management is done by GPFS. In addition to
internal storage pools GPFS supports external storage pools.
External storage pools are used to interact with an external stor-
age management application including IBM Tivoli Storage Man-
ager (TSM) and High Performance Storage System (HPSS). When
moving data to an external pool GPFS handles all of the metada-
ta processing then hands the data to the external application for
storage on alternate media, tape for example. When using TSM or
HPSS data can be retrieved from the external storage pool on de-
mand, as a result of an application opening a file, or data can be
retrieved in a batch operation using a command or GPFS policy.
A fileset is a sub-tree of the file system namespace and provides
a way to partition the namespace into smaller, more manageable
units.
Filesets provide an administrative boundary that can be used to
set quotas, take snapshots, define AFM relationships and be used
in user defined policies to control initial data placement or data
migration. Data within a single fileset can reside in one or more
storage pools. Where the file data resides and how it is managed
once it is created is based on a set of rules in a user defined policy.
There are two types of user defined policies in GPFS: file placement
and file management. File placement policies determine in which
storage pool file data is initially placed. File placement rules are
defined using attributes of a file known when a file is created such
as file name, fileset or the user who is creating the file. For example
a placement policy may be defined that states ‘place all files with
names that end in .mov onto the near-line SAS based storage pool
and place all files created by the CEO onto the SSD based storage
pool’ or ‘place all files in the fileset ‘development’ onto the SAS
based storage pool’.
Once files exist in a file system, file management policies can be
used for file migration, deletion, changing file replication status
or generating reports. You can use a migration policy to transpar-
ently move data from one storage pool to another without chang-
ing the file’s location in the directory structure. Similarly you can
use a policy to change the replication status of a file or set of files,
allowing fine grained control over the space used for data availa-
bility. You can use migration and replication policies together, for
example a policy that says: ‘migrate all of the files located in the
subdirectory /database/payroll which end in *.dat and are greater
than 1 MB in size to storage pool #2 and un-replicate these files’.
File deletion policies allow you to prune the file system, deleting
files as defined by policy rules. Reporting on the contents of a file
system can be done through list policies. List policies allow you
to quickly scan the file system metadata and produce information
listing selected attributes of candidate files. File management pol-
icies can be based on more attributes of a file than placement pol-
icies because once a file exists there is more known about the file.
For example, file management rules can utilize attributes such as last access time, size of the file or a mix of user and file size. This
may result in policies like: ‘Delete all files with a name ending in
.temp that have not been accessed in the last 30 days’, or ‘Migrate
all files owned by Sally that are larger than 4GB to the SATA stor-
age pool’.
Rule processing can be further automated by including attrib-
utes related to a storage pool instead of a file using the thresh-
old option. Using thresholds you can create a rule that moves
files out of the high performance pool if it is more than 80%
full, for example. The threshold option comes with the ability to
set high, low and pre-migrate thresholds. Pre-migrated files are files whose data has been copied to tape while still remaining on disk. This method is typically used to allow disk-speed access to the data while allow-
ing disk space to be freed up quickly when a maximum space
threshold is reached. This means that GPFS begins migrating
data at the high threshold, until the low threshold is reached.
If a pre-migrate threshold is set GPFS begins copying data un-
til the pre-migrate threshold is reached. This allows the data to
continue to be accessed in the original pool until it is quickly
deleted to free up space the next time the high threshold is
reached. Thresholds allow you to fully utilize your highest per-
formance storage and automate the task of making room for
new high priority content.
Policy rule syntax is based on the SQL 92 syntax standard and
supports multiple complex statements in a single rule enabling
powerful policies. Multiple levels of rules can be applied to a
file system, and rules are evaluated in order for each file when
the policy engine executes allowing a high level of flexibility.
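The quoted examples above can be expressed in this SQL-like policy syntax. The following is a sketch only; the pool and fileset names are hypothetical, and exact rule syntax may vary by GPFS release:

```
/* Placement rules: evaluated when a file is created. */
RULE 'movies'  SET POOL 'nearline' WHERE LOWER(NAME) LIKE '%.mov'
RULE 'devwork' SET POOL 'sas' FOR FILESET ('development')
RULE 'default' SET POOL 'system'

/* Management rule: prune stale temporary files. */
RULE 'cleanup' DELETE
  WHERE LOWER(NAME) LIKE '%.temp'
    AND (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 30

/* Threshold-driven migration: begin when the source pool is 80% full,
   stop when it drops to 60%. */
RULE 'tier' MIGRATE FROM POOL 'ssd' THRESHOLD(80,60)
  TO POOL 'sata' WHERE FILE_SIZE > 4294967296
```

Placement policies are installed for the file system, while management rules such as the migration and deletion rules above are typically run on demand or on a schedule.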
GPFS provides unique functionality through standard interfaces; one example is extended attributes, a standard POSIX facility.
GPFS has long supported the use of extended attributes, though
in the past they were not commonly used, in part because of per-
formance concerns. In GPFS 3.4, a comprehensive redesign of the
extended attributes support infrastructure was implemented,
resulting in significant performance improvements. In GPFS 3.5,
extended attributes are accessible by the GPFS policy engine al-
lowing you to write rules that utilize your custom file attributes.
Executing file management operations requires the ability to
efficiently process the file metadata. GPFS includes a high per-
formance metadata scan interface that allows you to efficiently
process the metadata for billions of files. This makes the GPFS ILM
toolset a very scalable tool for automating file management. This
high performance metadata scan engine employs a scale-out ap-
proach. The identification of candidate files and data movement
operations can be performed concurrently by one or more nodes
in the cluster. GPFS can spread rule evaluation and data move-
ment responsibilities over multiple nodes in the cluster pro-
viding a very scalable, high performance rule processing engine.
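As a rough illustration of this scale-out pattern (a toy model, not the GPFS scan engine; file names and the four-way partitioning are invented), candidate identification can be split across concurrent workers:

```python
# Toy model of scale-out policy scanning: the metadata (here, a list of
# dicts) is partitioned and a rule predicate is evaluated concurrently.
from concurrent.futures import ThreadPoolExecutor

# Synthetic file metadata: every other file is a ".temp" file.
files = [{"name": f"f{i}.temp" if i % 2 else f"f{i}.dat", "size": i}
         for i in range(10_000)]

def evaluate(chunk):
    # Each worker evaluates the policy predicate on its slice of metadata,
    # analogous to one node scanning part of the inode space.
    return [f["name"] for f in chunk if f["name"].endswith(".temp")]

chunks = [files[i::4] for i in range(4)]   # partition the scan 4 ways
with ThreadPoolExecutor(max_workers=4) as ex:
    candidates = [name for part in ex.map(evaluate, chunks) for name in part]

print(len(candidates))   # half the files match the ".temp" rule
```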
Cluster configurations
GPFS supports a variety of cluster configurations independent of which file system features you use. Cluster configuration options can be characterized into four basic categories:
• Shared disk
• Network block I/O
• Synchronously sharing data between clusters
• Asynchronously sharing data between clusters
Shared disk
A shared disk cluster is the most basic environment. In this con-
figuration, the storage is directly attached to all machines in the
cluster as shown in Figure 1. The direct connection means that
each shared block device is available concurrently to all of the
nodes in the GPFS cluster. Direct access means that the storage
is accessible using SCSI or another block level protocol over a SAN, InfiniBand, iSCSI, Virtual IO interface or other block level IO
connection technology. Figure 1 illustrates a GPFS cluster where
all nodes are connected to a common fibre channel SAN. This
example shows a fibre channel SAN though the storage attach-
ment technology could be InfiniBand, SAS, FCoE or any other.
The nodes are connected to the storage using the SAN and to
each other using a LAN. Data used by applications running on
the GPFS nodes flows over the SAN and GPFS control informa-
tion flows among the GPFS instances in the cluster over the LAN.
This configuration is optimal when all nodes in the cluster need
the highest performance access to the data. For example, this is
a good configuration for providing network file service to client
systems using clustered NFS, high-speed data access for digital
media applications or a grid infrastructure for data analytics.
Network-based block IO
As data storage requirements increase and new storage and con-
nection technologies are released a single SAN may not be a suf-
ficient or appropriate choice of storage connection technology.
In environments where every node in the cluster is not attached
to a single SAN, GPFS makes use of an integrated network block
device capability. GPFS provides a block level interface over
TCP/IP networks called the Network Shared Disk (NSD) protocol.
Whether using the NSD protocol or a direct attachment to the
SAN the mounted file system looks the same to the application,
GPFS transparently handles I/O requests.
GPFS clusters can use the NSD protocol to provide high speed data
access to applications running on LAN-attached nodes. Data is
served to these client nodes from one or more NSD servers. In this
configuration, disks are attached only to the NSD servers. Each
NSD server is attached to all or a portion of the disk collection.
With GPFS you can define up to eight NSD servers per disk and
it is recommended that at least two NSD servers are defined for
each disk to avoid a single point of failure.
GPFS uses the NSD protocol over any TCP/IP capable network
fabric. On Linux GPFS can use the VERBS RDMA protocol on com-
patible fabrics (such as InfiniBand) to transfer data to NSD cli-
ents. The network fabric does not need to be dedicated to GPFS;
but should provide sufficient bandwidth to meet your GPFS
performance expectations and for applications which share the
bandwidth. GPFS has the ability to define a preferred network
subnet topology, for example designate separate IP subnets for
intra-cluster communication and the public network. This pro-
vides for a clearly defined separation of communication traffic
and allows you to increase the throughput and possibly the
number of nodes in a GPFS cluster. Allowing access to the same
disk from multiple subnets means that all of the NSD clients do
not have to be on a single physical network. For example you
can place groups of clients onto separate subnets that access
a common set of disks through different NSD servers so not all
NSD servers need to serve all clients. This can reduce the net-
working hardware costs and simplify the topology reducing
support costs, providing greater scalability and greater overall
performance.
Figure 1: SAN attached Storage
An example of the NSD server model is shown in Figure 2. In this
configuration, a subset of the total node population is defined as NSD server nodes. The NSD server is responsible for the abstraction of disk data blocks across a TCP/IP or InfiniBand VERBS (Linux only) based network. The fact that the disks are remote
is transparent to the application. Figure 2 shows an example of
a configuration where a set of compute nodes are connected
to a set of NSD servers using a high-speed interconnect or an
IP-based network such as Ethernet. In this example, data to the
NSD servers flows over the SAN and both data and control infor-
mation to the clients flow across the LAN. Since the NSD servers
are serving data blocks from one or more devices data access is
similar to a SAN attached environment in that data flows from
all servers simultaneously to each client. This parallel data ac-
cess provides the best possible throughput to all clients. In ad-
dition it provides the ability to scale up the throughput even to
a common data set or even a single file.
The choice of how many nodes to configure as NSD servers is
based on performance requirements, the network architecture
and the capabilities of the storage subsystems. High bandwidth
LAN connections should be used for clusters requiring signifi-
cant data transfer rates. This can include 1 Gbit or 10 Gbit Ethernet. For additional performance or reliability you can use link ag-
gregation (EtherChannel or bonding), networking technologies
like source based routing or higher performance networks such
as InfiniBand.
The choice between SAN attachment and network block I/O is a
performance and economic one. In general, using a SAN provides
the highest performance; but the cost and management com-
plexity of SANs for large clusters is often prohibitive. In these
cases network block I/O provides an option. Network block I/O
is well suited to grid computing and clusters with sufficient net-
work bandwidth between the NSD servers and the clients. For
example, an NSD protocol based grid is effective for web applica-
tions, supply chain management or modeling weather patterns.
Mixed Clusters
The last two sections discussed shared disk and network attached GPFS cluster topologies. You can mix these storage attachment methods within a GPFS cluster to better match the IO requirements to the connection technology. A GPFS node always tries to
find the most efficient path to the storage. If a node detects a block
device path to the data it is used. If there is no block device path
then the network is used. This capability can be leveraged to pro-
vide additional availability. If a node is SAN attached to the storage
and there is an HBA failure, for example, GPFS can fail over to using
the network path to the disk. A mixed cluster topology can provide
direct storage access to non-NSD server nodes for high performance
operations including backups or data ingest.
Sharing data between clusters
There are two methods available to share data across GPFS clusters:
GPFS multi-cluster and a new feature called Active File Management
(AFM).
GPFS Multi-cluster allows you to utilize the native GPFS protocol to
share data across clusters. Using this feature you can allow other
clusters to access one or more of your file systems and you can
mount file systems that belong to other GPFS clusters for which
Figure 2: Network block IO
you have been authorized. A multi-cluster environment allows the
administrator to permit access to specific file systems from another
GPFS cluster. This feature is intended to allow clusters to share data
at higher performance levels than file sharing technologies like NFS
or CIFS. It is not intended to replace such file sharing technologies
which are optimized for desktop access or for access across unrelia-
ble network links. Multi-cluster capability is useful for sharing across
multiple clusters within a physical location or across locations. Clus-
ters are most often attached using a LAN, but in addition the cluster
connection could include a SAN. Figure 3 illustrates a multi-cluster
configuration with both LAN and mixed LAN and SAN connections.
In Figure 3, Cluster B and Cluster C need to access the data from
Cluster A. Cluster A owns the storage and manages the file system.
It may grant access to file systems which it manages to remote clus-
ters such as Cluster B and Cluster C. In this example, Cluster B and
Cluster C do not have any storage but that is not a requirement. They
could own file systems which may be accessible outside their clus-
ter. Commonly in the case where a cluster does not own storage, the
nodes are grouped into clusters for ease of management.
Figure 3: Mixed cluster architecture
When the remote clusters need access to the data, they mount the
file system by contacting the owning cluster and passing required
security checks. In Figure 3, Cluster B acesses the data through the
NSD protocol. Cluster C accesses data through an extension of the
SAN. For Cluster C, access to the data is similar to that of the nodes in the
host cluster. Control and administrative traffic flows through the IP
network shown in Figure 3 and data access is direct over the SAN.
Both types of configurations are possible and as in Figure 3 can be
mixed as required. Multi-cluster environments are well suited to
sharing data across clusters belonging to different organizations
for collaborative computing, grouping sets of clients for adminis-
trative purposes or implementing a global namespace across sep-
arate locations. A multi-cluster configuration allows you to connect
GPFS clusters within a data center, across campus or across reliable
WAN links. For sharing data between GPFS clusters across less re-
liable WAN links or in cases where you want a copy of the data in
multiple locations you can use a new feature introduced in GPFS 3.5
called Active File Management.
Active File Management (AFM) allows you to create associations be-
tween GPFS clusters. Now the location and flow of file data between
GPFS clusters can be automated. Relationships between GPFS clus-
ters using AFM are defined at the fileset level. A fileset in a file sys-
tem can be created as a “cache” that provides a view to a file system
in another GPFS cluster called the “home.” File data is moved into a
cache fileset on demand. When a file in the cache fileset is read,
Figure 4: Multi-cluster
the file data is copied from the home into the cache fileset. Data
consistency and file movement into and out of the cache is man-
aged automatically by GPFS.
Cache filesets can be read-only or writeable. Cached data is locally
read or written. On read, if the data is not in the “cache”, then
GPFS automatically creates a copy of the data. When data is written
into the cache the write operation completes locally then GPFS
asynchronously pushes the changes back to the home location.
You can define multiple cache filesets for each home data source.
The number of cache relationships for each home is limited only by
the bandwidth available at the home location. Placing a quota on a
cache fileset causes the data to be cleaned (evicted) out of the cache
automatically based on the space available. If you do not set a quota
a copy of the file data remains in the cache until manually evicted
or deleted. You can use AFM to create a global namespace within
a data center, across a campus or between data centers located
around the world. AFM is designed to enable efficient data transfers
over wide area network (WAN) connections. When file data is read
from home into cache that transfer can happen in parallel within a
node called a gateway or across multiple gateway nodes.
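The read and write-back behavior described above can be sketched as a toy model. This is illustrative only, not the GPFS implementation; real AFM additionally handles consistency, eviction and gateway-parallel WAN transfers, and all names here are invented:

```python
# Toy model of an AFM-style cache fileset (illustrative only).

class CacheFileset:
    def __init__(self, home: dict):
        self.home = home      # "home" cluster data: path -> bytes
        self.cache = {}       # locally cached copies
        self.pending = []     # writes queued for asynchronous push home

    def read(self, path: str) -> bytes:
        # On a miss, the file data is fetched from home into the cache.
        if path not in self.cache:
            self.cache[path] = self.home[path]
        return self.cache[path]   # subsequent reads are local

    def write(self, path: str, data: bytes) -> None:
        # Writes complete locally; the change is pushed home asynchronously.
        self.cache[path] = data
        self.pending.append(path)

    def flush(self) -> None:
        # Simulates the asynchronous write-back to the home cluster.
        for path in self.pending:
            self.home[path] = self.cache[path]
        self.pending.clear()

home = {"/data5/a.dat": b"v1"}
c = CacheFileset(home)
assert c.read("/data5/a.dat") == b"v1"   # first read populates the cache
c.write("/data5/a.dat", b"v2")           # local write completes immediately
assert home["/data5/a.dat"] == b"v1"     # home not yet updated
c.flush()                                # asynchronous push applied
assert home["/data5/a.dat"] == b"v2"
```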
Using AFM you can now create a truly world-wide global namespace.
Figure 5 is an example of a global namespace built using AFM. In
this example each site owns 1/3 of the namespace, store3 “owns”
/data5 and /data6 for example. The other sites have cache filesets
that point to the home location for the data owned by the other
clusters. In this example, on store3 the filesets /data1-4 cache data owned by the
other two clusters. This provides the same namespace within each
GPFS cluster providing applications one path to a file across the en-
tire organization.
You can achieve this type of global namespace with multi-cluster or
AFM; which method you choose depends on the quality of the WAN
link and the application requirements.
Figure 5: Global Namespace using AFM
What’s new in GPFS Version 3.5
For those who are familiar with GPFS 3.4 this section provides a
list of what is new in GPFS Version 3.5.
Active File Management
When GPFS was introduced in 1998 it represented a revolution
in file storage. For the first time a group of servers could share
high performance access to a common set of data over a SAN or
network. The ability to share high performance access to file data
across nodes was the introduction of the global namespace.
Later GPFS introduced the ability to share data across multiple
GPFS clusters. This multi-cluster capability enabled data sharing
between clusters allowing for better access to file data.
This further expanded the reach of the global namespace from
within a cluster to across clusters spanning a data center or a
country. There were still challenges to building a multi-cluster global namespace, the biggest being unreliable, high-latency network connections between the servers.
Active File Management (AFM) in GPFS addresses the WAN band-
width issues and enables GPFS to create a world-wide global name-
space. AFM ties the global namespace together asynchronous-
ly providing local read and write performance with automated
namespace management. It allows you to create associations be-
tween GPFS clusters and define the location and flow of file data.
High Performance Extended Attributes
GPFS has long supported the use of extended attributes, though
in the past they were not commonly used, in part because of
performance concerns. In GPFS 3.4, a comprehensive redesign of
the extended attributes support infrastructure was implement-
ed, resulting in significant performance improvements. In GPFS
3.5, extended attributes are accessible by the GPFS policy engine
allowing you to write rules that utilize your custom file attrib-
utes.
Now an application can use standard POSIX interfaces to man-
age extended attributes and the GPFS policy engine can utilize
these attributes.
Independent Filesets
Effectively managing a file system with billions of files requires
advanced file management technologies. GPFS 3.5 introduces
a new concept called the independent fileset. An independent
fileset has its own inode space. This means that an independent
fileset can be managed similar to a separate file system but still
allow you to realize the benefits of storage consolidation.
An example of an efficiency introduced with independent file-
sets is improved policy execution performance. If you use an in-
dependent fileset as a predicate in your policy, GPFS only needs
to scan the inode space represented by that fileset, so if you have
1 billion files in your file system and a fileset has an inode space
of 1 million files, the scan only has to look at 1 million inodes. This
instantly makes the policy scan much more efficient. Independ-
ent filesets enable other new fileset features in GPFS 3.5.
Fileset Level Snapshots
Snapshot granularity is now at the fileset level in addition to file
system level snapshots.
Fileset Level Quotas
User and group quotas can be set per fileset.
File Cloning
File clones are space-efficient copies of a file where two instances
of a file share data they have in common and only changed
blocks require additional storage. File cloning is an efficient way
to create a copy of a file, without the overhead of copying all of
the data blocks.
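A minimal copy-on-write sketch of this block-sharing idea (illustrative only; the classes and method names are hypothetical, not the GPFS interface):

```python
# Toy copy-on-write model of file cloning: a clone copies only the block
# map, and new storage is allocated only for blocks that change.

class BlockStore:
    def __init__(self):
        self.blocks = {}     # block id -> data
        self.next_id = 0

    def put(self, data: bytes) -> int:
        self.blocks[self.next_id] = data
        self.next_id += 1
        return self.next_id - 1

class File:
    def __init__(self, store: BlockStore, block_ids):
        self.store = store
        self.block_ids = list(block_ids)   # clone copies only this map

    def clone(self) -> "File":
        # No data blocks are copied; both files reference the same blocks.
        return File(self.store, self.block_ids)

    def write_block(self, i: int, data: bytes) -> None:
        # Copy-on-write: only the changed block consumes new storage.
        self.block_ids[i] = self.store.put(data)

store = BlockStore()
original = File(store, [store.put(b"A"), store.put(b"B"), store.put(b"C")])
copy = original.clone()
assert len(store.blocks) == 3      # the clone consumed no extra data blocks
copy.write_block(1, b"B2")
assert len(store.blocks) == 4      # only the changed block was allocated
assert original.block_ids[1] != copy.block_ids[1]
```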
IPv6 Support
IPv6 support in GPFS means that nodes can be defined using mul-
tiple addresses, both IPv4 and IPv6.
GPFS Native RAID
GPFS Native RAID (GNR) brings storage RAID management into
the GPFS NSD server. With GNR GPFS directly manages JBOD
based storage. This feature provides greater availability, flexibili-
ty and performance for a variety of application workloads.
GNR implements a Reed-Solomon based de-clustered RAID tech-
nology that can provide high availability and keep drive failures
from impacting performance by spreading the recovery tasks
over all of the disks. Unlike standard Network Shared Disk (NSD)
data access GNR is tightly integrated with the storage hardware.
For GPFS 3.5 GNR is available on the IBM Power 775 Supercom-
puter platform.
Lustre
Making Parallel Storage Simpler and More Productive
Not long ago storage for high performance computing (HPC)
meant complexity and challenging management. HPC was the
domain of government-sponsored research or data-intensive ap-
plications, like weather prediction or large-scale manufacturing
in the aerospace and automotive sectors.
Today, HPC has expanded beyond just national laboratories and
research institutes to become a key technology for enterprises
of all sizes as they seek to develop improved products or entirely
new industries. Getting the maximum performance from HPC and
data-intensive applications requires fast and scalable storage
software. Simply put, today’s HPC workloads require storage infra-
structure that scales endlessly and delivers unmatched I/O levels.
Intel Enterprise Edition for Lustre software
(Intel EE for Lustre software) unleashes the performance and
scalability of the Lustre parallel file system for HPC workloads,
including technical ‘big data’ applications common within to-
day’s enterprises. It allows end-users that need the benefits of
large–scale, high bandwidth storage to tap the power and scal-
ability of Lustre, with the simplified installation, configuration,
and management features provided by Intel Manager for Lustre
software, a management solution purpose-built by the Lustre
experts at Intel for the Lustre file system. Further, Intel EE for Lus-
tre software is backed by Intel, the recognized technical support
provider for Lustre, including 24/7 service level agreement (SLA)
coverage.
> Product Components
Intel Manager for Lustre
Intel Manager for Lustre software includes simple, but pow-
erful, management tools that provide a unified, consistent
view of Lustre storage systems and simplify the installation,
configuration, monitoring, and overall management of Lustre.
The manager consolidates all Lustre information in a central,
browser-accessible location for ease of use.
Integrated Apache Hadoop Adapter
When organizations operate both Lustre and Apache Hadoop
within a shared HPC infrastructure, there is a compelling use
case for using Lustre as the file system for Hadoop analytics, as
well as HPC storage.
Intel Enterprise Edition for Lustre includes an Intel-developed
adapter which allows users to run MapReduce applications di-
rectly on Lustre. This optimizes the performance of MapReduce
operations while delivering faster, more scalable, and easier to
manage storage.
> Benefits
Performance
Intel EE for Lustre software has been designed to enable ful-
ly parallel I/O throughput across thousands of clients, servers,
and storage devices. Metadata and data are stored on separate
servers to allow optimization of each system for the different
workloads they present. Improved metadata scalability using the Distributed Namespace (DNE) feature is now integrated into Intel
Manager for Lustre. Intel EE for Lustre can also scale down effi-
ciently to provide fast parallel storage for smaller organizations.
Capacity
The object-based storage architecture of Intel EE for Lustre soft-
ware can scale to tens of thousands of clients and petabytes of
data.
Affordability
Intel EE for Lustre software is based on the community release of
Lustre software, and is hardware, server, and network fabric neu-
tral. Enterprises can scale their storage deployments horizontal-
ly, yet continue to have simple-to-manage storage.
Maturity
Lustre has been in use in the world’s largest datacenters for over
a decade and hardened in the harshest big data environments.
Intel EE for Lustre software is rigorously tested, reliable, and
backed by Intel, the leading provider of technical support for
Lustre software. Intel EE for Lustre software delivers commercial-ready Lustre in a package that can scale efficiently both up
and down to suit your business workloads, with built-in manage-
ability.
Key Features
Lustre powers 60% of the world’s top 100 fastest computers
(www.top500.org). Unleash the Lustre parallel file system as an
enterprise platform, offering higher throughput and helping to
prevent bottlenecks.
• Built on the community release of Lustre software
• Intel Manager for Lustre simplifies install and configuration
• Enormous storage capacity and I/O
• Open, documented interfaces for deep integration
• Throughput in excess of 1 terabyte per second
• Resilient, highly available storage
• Centralized, GUI-based administration for management simplicity
• Integrated support for Hadoop MapReduce applications with Lustre storage
• Rigorously tested, stable software proven across diverse industries
• Flexible storage solution based on Intel-enhanced community release software
• Global 24/7 technical support
> What’s New
Differentiated Storage Services (DSS)
Available only with Intel EE for Lustre software, this advanced
feature allows data “hints” to be passed through Lustre, enabling
cache mechanisms to prioritize data for optimal performance.
Intel Manager for Lustre Monitoring for OpenZFS
Beginning with this release, storage solutions based on Intel EE
for Lustre software can exploit the data resiliency and volume
management capabilities available with OpenZFS.
Integrated HPC Scheduler “Connector” for MapReduce
Building upon the unique Intel software “connector” that allows
Lustre to replace HDFS, the newest connector couples SLURM
with YARN, and augments the Hadoop scheduler allowing
MapReduce applications to be scheduled like any HPC job.
Support for New System Software
Intel EE for Lustre delivers new support for Red Hat Enterprise
Linux and CentOS 6.7 servers and clients, or for Red Hat Enter-
prise Linux and CentOS 7.1 clients only.
Increasing Metadata Performance
More easily scale out the metadata performance of Lustre by tak-
ing advantage of the new DNE feature using Intel Manager for
Lustre.
Compare Editions

                                                        Enterprise  Foundation  Cloud
Delivered with powerful storage solutions
from Intel Lustre solution resellers                        x           x         x
Easy-to-use web-based GUI to simplify deployment,
management and monitoring of the Lustre file system         x                     x
  (Enterprise uses Intel Manager for Lustre software;
  Cloud uses AWS embedded monitoring tools)
Intel software connectors vital for MapReduce
and analytics with Lustre and HPC                           x
Stable enterprise platform, built for reliability
and performance, with long-term support                     x                     x
Full distribution of the community release of Lustre        x           x
Intel training for reseller partners                        x           x
Intel-backed support for debugging and patches              x           x         x
Developed and supported by Intel, the Lustre experts        x           x         x
Intel Enterprise Edition for Lustre Software couples the performance
and scalability of the most widely used file system for
HPC with advanced storage management tools for enterprises
needing scale-out storage to fuel their technical applications or
Hadoop MapReduce workloads.
Intel Foundation Edition for Lustre Software delivers a scalable
Lustre storage solution with trusted Intel support. It is ideal for
organizations that prefer to design and deploy their own community
configurations and want to augment in-house management
and support capabilities with Intel expertise.
Intel Cloud Edition for Lustre Software brings Lustre to a cloud
environment via Amazon Web Services, allowing data scientists
and developers instant access to parallel clustered storage.
Intel Omni-Path Interconnect
Intel® Omni-Path Architecture (Intel® OPA), an element of Intel® Scalable System Frame-
work, delivers the performance for tomorrow’s high performance computing (HPC)
workloads and the ability to scale to tens of thousands of nodes – and eventually more
– at a price competitive with today’s fabrics. The Intel OPA 100 Series product line is an
end-to-end solution of PCIe* adapters, silicon, switches, cables, and management soft-
ware. As the successor to Intel® True Scale Fabric, this optimized HPC fabric is built upon
a combination of enhanced IP and Intel® technology.
For software applications, Intel OPA will maintain consistency and compatibility with
existing Intel True Scale Fabric and InfiniBand* APIs by working through the open source
OpenFabrics Alliance (OFA) software stack on leading Linux* distribution releases. Intel
True Scale Fabric customers will be able to migrate to Intel OPA through an upgrade
program.
High Throughput Computing | CAD | Big Data Analytics | Simulation | Aerospace | Automotive
Architecture
The Future of High Performance Fabrics
Current standards-based high performance fabrics, such as InfiniBand,
were not originally designed for HPC, resulting in performance
and scaling weaknesses that are currently impeding
the path to Exascale computing. Intel Omni-Path Architecture
is being designed specifically to address these issues and scale
cost-effectively from entry level HPC clusters to larger clusters
with 10,000 nodes or more. To improve on the InfiniBand specifi-
cation and design, Intel is using the industry’s best technologies
including those acquired from QLogic and Cray alongside Intel®
technologies.
While both Intel OPA and InfiniBand Enhanced Data Rate (EDR)
will run at 100Gbps, there are many differences. The enhance-
ments of Intel OPA will help enable the progression towards Ex-
ascale while cost-effectively supporting clusters of all sizes with
optimization for HPC applications at both the host and fabric
levels for benefits that are not possible with the standard Infini-
Band-based designs.
Intel OPA is designed to provide:
• Features and functionality at both the host and fabric levels to greatly raise levels of scaling
• The CPU and fabric integration necessary for the increased computing density, improved reliability, reduced power, and lower costs required by significantly larger HPC deployments
• Fabric tools to readily install, verify, and manage fabrics at this level of complexity
Intel Omni-Path Key Fabric Features and Innovations
Adaptive Routing
Adaptive Routing monitors the performance of the possible
paths between fabric end-points and selects the least congest-
ed path to rebalance the packet load. While other technologies
also support routing, the implementation is vital. Intel’s imple-
mentation is based on cooperation between the Fabric Manager
and the switch ASICs. The Fabric Manager – with a global view of
the topology – initializes the switch ASICs with several egress op-
tions per destination, updating these options as the fundamen-
tal fabric changes when links are added or removed. Once the
switch egress options are set, the Fabric Manager monitors the
fabric state, and the switch ASICs dynamically monitor and react
to the congestion sensed on individual links. This approach ena-
bles Adaptive Routing to scale as fabrics grow larger and more
complex.
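The selection logic described above can be sketched in a few lines. This is a conceptual illustration only, not Intel's implementation; the function and data names are invented:

```python
# Conceptual sketch (not Intel's implementation): a switch that has been
# initialized by the Fabric Manager with several egress options per
# destination picks the least congested one, as described for Adaptive Routing.

def pick_egress(egress_options, congestion):
    """Return the egress port with the lowest sensed congestion.

    egress_options: ports the Fabric Manager configured for a destination
    congestion: mapping of port -> current congestion estimate
    """
    return min(egress_options, key=lambda port: congestion[port])

# Example: ports 1, 3 and 5 lead to the destination; port 3 is least loaded.
options = [1, 3, 5]
load = {1: 0.9, 3: 0.2, 5: 0.6}
print(pick_egress(options, load))  # -> 3
```

The division of labor in the text maps onto the two arguments: the Fabric Manager supplies `egress_options` at initialization, while the congestion estimates are sensed dynamically by the switch ASIC.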
Dispersive Routing
One of the critical roles of fabric management is the initialization
and configuration of routes through the fabric between pairs of
nodes. Intel Omni-Path Fabric supports a variety of routing meth-
ods, including defining alternate routes that disperse traffic
flows for redundancy, performance, and load balancing. Instead
of sending all packets from a source to a destination via a single
path, Dispersive Routing distributes traffic across multiple paths.
Once received, packets are reassembled in their proper order for
rapid, efficient processing. By leveraging more of the fabric to de-
liver maximum communications performance for all jobs, Disper-
sive Routing promotes optimal fabric efficiency.
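As a toy illustration of the scatter-and-reassemble idea (all names are invented; real Dispersive Routing operates on fabric packets, not Python lists):

```python
# Toy model: packets from one message are sprayed across several paths,
# may arrive out of order, and are restored to their proper order by
# sequence number at the receiver -- the behavior Dispersive Routing relies on.

def scatter(message, num_paths):
    """Split a message into sequence-numbered packets, round-robin over paths."""
    packets = [(seq, chunk) for seq, chunk in enumerate(message)]
    return [packets[i::num_paths] for i in range(num_paths)]

def reassemble(arrived):
    """Restore original order from packets that arrived in any order."""
    return [chunk for seq, chunk in sorted(arrived)]

paths = scatter(list("FABRIC"), 3)
# Simulate out-of-order arrival by draining the paths in reverse.
arrived = [pkt for path in reversed(paths) for pkt in path]
print("".join(reassemble(arrived)))  # -> FABRIC
```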
Traffic Flow Optimization
Traffic Flow Optimization optimizes the quality of service be-
yond selecting the priority – based on virtual lane or service level
– of messages to be sent on an egress port. At the Intel Omni-Path
Architecture link level, variable length packets are broken up into
fixed-sized containers that are in turn packaged into fixed-sized
Link Transfer Packets (LTPs) for transmitting over the link. Since
packets are broken up into smaller containers, a higher priority
container can request a pause and be inserted into the ISL data
stream before completing the previous data.
The key benefit is that Traffic Flow Optimization reduces the vari-
ation in latency seen through the network by high priority traffic
in the presence of lower priority traffic. It addresses a traditional
weakness of both Ethernet and InfiniBand in which a packet
must be transmitted to completion once the link starts even if
higher priority packets become available.
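The container-level preemption can be modeled roughly as follows. The container size and all names are invented for the sketch, and real LTPs carry additional framing:

```python
# Illustrative model only: variable-length packets are chopped into fixed-size
# containers, and a later high-priority packet's containers may cut in ahead
# of the remaining low-priority containers, as Traffic Flow Optimization
# permits at the link level.

CONTAINER = 4  # bytes per container in this toy model

def containerize(payload):
    return [payload[i:i + CONTAINER] for i in range(0, len(payload), CONTAINER)]

def transmit(low_priority, high_priority, arrival_index):
    """Low-priority containers are on the wire; at arrival_index a
    high-priority packet shows up and its containers are inserted."""
    low = containerize(low_priority)
    sent = low[:arrival_index]            # already transmitted
    sent += containerize(high_priority)   # high priority cuts in line
    sent += low[arrival_index:]           # low priority resumes afterwards
    return sent

wire = transmit(b"aaaaaaaaaaaa", b"HI!!", arrival_index=1)
print(wire)  # -> [b'aaaa', b'HI!!', b'aaaa', b'aaaa']
```

Note how the low-priority packet is not dropped, only delayed: its remaining containers complete after the high-priority insertion, which is what bounds the latency variation seen by high-priority traffic.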
Packet Integrity Protection
Packet Integrity Protection allows for rapid and transparent re-
covery of transmission errors between a sender and a receiver
on an Intel Omni-Path Architecture link. Given the very high In-
tel OPA signaling rate (25.78125G per lane) and the goal of sup-
porting large scale systems of a hundred thousand or more links,
transient bit errors must be tolerated while ensuring that the
performance impact is insignificant. Packet Integrity Protec-
tion enables recovery of transient errors whether it is between
a host and switch or between switches. This eliminates the need
for transport level timeouts and end-to-end retries. This is done
without the heavy latency penalty associated with alternate er-
ror recovery approaches.
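The contrast with end-to-end retries can be sketched as a small simulation. CRC-32 stands in here for the actual link error-detection code, and all names are invented:

```python
# Simplified model: each link checks an error-detection code and, on failure,
# asks the previous hop to resend -- no end-to-end timeout or retry from the
# original sender, mirroring the idea behind Packet Integrity Protection.
import zlib

def send_over_link(payload, corrupt_once):
    """Deliver payload over one link; retransmit locally if the check fails."""
    attempts = 0
    while True:
        attempts += 1
        data = bytearray(payload)
        if corrupt_once and attempts == 1:
            data[0] ^= 0xFF  # a transient bit error on the first try
        if zlib.crc32(bytes(data)) == zlib.crc32(payload):
            return attempts

def send_end_to_end(payload, bad_link):
    """Cross four links; only the faulty link pays for the retransmission."""
    return [send_over_link(payload, corrupt_once=(i == bad_link))
            for i in range(4)]

print(send_end_to_end(b"packet", bad_link=2))  # -> [1, 1, 2, 1]
```

Only the link that saw the error transmits twice; the links before and after it are untouched, which is why the added latency of a correction stays near zero.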
Dynamic Lane Scaling
Dynamic Lane Scaling allows an operation to continue even if
one or more lanes of a 4x link fail, saving the need to restart or go
to a previous checkpoint to keep the application running. The job
can then run to completion before taking action to resolve the is-
sue. Currently, InfiniBand typically drops the whole 4x link if any
of its lanes drops, costing time and productivity.
Ask How Intel Omni-Path Architecture Can Meet Your
HPC Needs
Intel is clearing the path to Exascale computing and addressing
tomorrow’s HPC issues. Contact your Intel representative or any
authorized Intel True Scale Fabric provider to discuss how Intel
Omni-Path Architecture can improve the performance of your
future HPC workloads.
Host Fabric Interface (HFI)
Designed specifically for HPC, the Intel Omni-Path Host Fabric
Interface (Intel OP HFI) uses an advanced connectionless design
that delivers performance that scales with high node and core
counts, making it the ideal choice for the most demanding application
environments. Intel OP HFI supports 100 Gbps per port,
which means each Intel OP HFI port can deliver up to 25 GBps
of bidirectional bandwidth. The same ASIC utilized in the
Intel OP HFI will also be integrated into future Intel Xeon proces-
sors and used in third-party products.
This device has not been authorized as required by the rules of
the Federal Communications Commission. This device is not, and
may not be, offered for sale or lease, or sold or leased, until autho-
rization is obtained.
Highlights
Each HFI supports:
• Multi-core scaling – support for up to 160 contexts
• 16 Send DMA engines (M2IO usage)
• Efficiency – large MTU support (4 KB, 8 KB, and 10 KB) for reduced per-packet processing overheads; improved packet-level interfaces to improve utilization of on-chip resources
• Receive DMA engine arrival notification
• Each HFI can map a ~128 GB window at 64-byte granularity
• Up to 8 virtual lanes for differentiated QoS
• ASIC designed to scale up to 160M messages/second and 300M bidirectional messages/second
Intel Omni-Path Host Fabric Interface (HFI) Optimizations
Much of the improved HPC application performance and low
end-to-end latency at scale comes from the following enhance-
ments:
Enhanced Performance Scaled Messaging (PSM).
The application view of the fabric derives heavily from – and remains application-level software compatible with – the demonstrated scalability of the Intel True Scale Fabric architecture, leveraging an enhanced next-generation version of the Performance Scaled Messaging (PSM) library. Major deployments by the US Department of Energy and others have proven this scalability advantage.
PSM is specifically designed for the Message Passing Interface
(MPI) and is very lightweight – one-tenth of the user space code
– compared to using verbs. This leads to extremely high MPI and
Partitioned Global Address Space (PGAS) message rates (short
message efficiency) compared to using InfiniBand verbs.
Intel Omni-Path Host Fabric Interface | PCIe x16 Adapter | PCIe x8 Adapter
Adapter Type | Low Profile PCIe Card (PCIe x16) | Low Profile PCIe Card (PCIe x8)
Ports | Single | Single
Connector | QSFP28 | QSFP28
Link Speed | 100 Gb/s | ~58 Gb/s on 100 Gb/s link
Power Typ./Max (Copper) | 7.4/11.7 W | 6.3/8.3 W
Power Typ./Max (Optical; transceivers up to 3 W, Class 4) | 10.6/14.9 W | 9.5/11.5 W
Thermal/Temp. | Passive (55° C @ 200 LFM) | Passive (55° C @ 200 LFM)
“Connectionless” message routing.
Intel Omni-Path Architecture – based on a connectionless design
– does not establish connection address information between
nodes, cores, or processes while a traditional implementation
maintains this information in the cache of the adapter. As a re-
sult, the connectionless design delivers consistent latency inde-
pendent of the scale or messaging partners. This implementation
offers greater potential to scale performance across a large node
or core count cluster while maintaining low end-to-end latency
as the application is scaled across the cluster.
Intel Omni-Path Fabric Director
Host Fabric Interface Adapters 100 Series
Architecture Switch Fabric Optimizations
While similar to existing technology, Intel Omni-Path Architecture
has been enhanced to overcome the scaling challenges of
large-sized clusters. These enhancements include:
High Message Rate Throughput. Intel Omni-Path Architecture
is designed to support high message rate traffic from each node
through the fabric. With ever-increasing processing power and
core counts in Intel Xeon and Intel Xeon Phi processors, that
means the fabric has to support high bandwidth as well as high
message rate throughput.
48-port Switch ASIC. The Intel OPA switch's 48-port design provides
improved fabric scalability, reduced latency, increased density,
and reduced cost and power. In fact, the 48-port ASIC can enable 5
hop configurations of up to 27,648 nodes, or over 2.3x what’s pos-
sible with current InfiniBand solutions. Depending on fabric size,
this can reduce fabric infrastructure requirements in a typical
fat tree configuration by over 50 percent, since fewer switches,
cables, racks, and power are needed as compared to today’s 36-
port switch ASICs. Table 1 summarizes the Intel OPA advantages.
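The 27,648-node figure is consistent with the textbook result that a full three-tier fat tree of radix-k switches connects k³/4 end nodes (the formula is the standard fat-tree bound, not taken from the source):

```python
# Checking the scaling claim in the text: a full three-tier fat tree built
# from radix-k switches reaches k**3 / 4 end nodes (standard fat-tree result).

def fat_tree_nodes(radix):
    return radix ** 3 // 4

opa = fat_tree_nodes(48)   # 48-port Intel OPA switch ASIC
ib = fat_tree_nodes(36)    # today's 36-port InfiniBand switch ASICs
print(opa, ib, round(opa / ib, 2))  # -> 27648 11664 2.37
```

The ratio of 2.37 matches the "over 2.3x" advantage quoted in the text.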
Deterministic Latency. Features in Intel OPA will help minimize
the negative performance impacts of large Maximum Transfer
Units (MTUs) on small messages and help maintain consistent
latency for interprocess communication (IPC) messages, such as
Message Passing Interface (MPI) messages, when large messag-
es – typically storage – are being simultaneously transmitted in
the fabric. This will allow higher priority small packets to bypass
lower priority large packets, creating a lower and more
predictable latency through the fabric.
Enhanced End-to-End Reliability. Intel Omni-Path Architecture
will also deliver efficient detection and error correction, which
is expected to be much more efficient than the forward error correction
(FEC) defined in the InfiniBand standard. Enhancements
include zero load for detection, and if a correction is required,
packets only need to be retransmitted from the last link – not all
the way from the sending node – which enables near zero addi-
tional latency for a correction.
Intel Omni-Path Edge Switch 100 Series
Model | 48 Port | 24 Port
Ports | 48 | 24
Rack Space Required | 1U (1.75") | 1U (1.75")
Capacity | 9.6 Tb/s | 4.8 Tb/s
Port Speed | 100 Gb/s | 100 Gb/s
Switch Latency | 100-110 ns | 100-110 ns
Interface | QSFP28 | QSFP28
Chassis Mgmt. | Yes | Yes
Embedded Management | Optional | Optional
Power Supplies (min/max) | 1/2 | 1/2
Input | 100-240 VAC, 50-60 Hz | 100-240 VAC, 50-60 Hz
Optical Power (Class 4 optics) | 3 W max per port | 3 W max per port
Power Typ./Max (Copper) | 189/238 W | 146/179 W
Power Typ./Max (All Optical) | 356/408 W | 231/264 W
Fans | N+1 (Speed Control) | N+1 (Speed Control)
Airflow (Customer Changeable) | Forward/Reverse | Forward/Reverse
Fabric Software Components
The Intel Omni-Path Architecture (Intel OPA) host software stack and the Intel Fabric Suite are the components of Intel OPA software. Intel will open source all host fabric components, such as the driver, PSM, libfabric providers, the Fabric Manager, FastFabric tools, HFI diagnostics, the FM GUI, and OFA enhancements (patches, plugins, etc.). In addition, host components will be upstreamed to be delivered as part of the operating system.
Intel OPA Host Software
Intel OPA runs today's open source application software written
for existing OpenFabrics Alliance interfaces with no code chang-
es required. Intel OPA host software is also being open sourced
to enable an ecosystem of applications that “just work” together.
Utilizing Performance Scaled Messaging (PSM), Intel OPA host
software creates a fast data path with a lightweight, high per-
formance computing (HPC) optimized software driver layer. In
addition, standard I/O-focused protocols are supported by the
standard verbs layer.
Intel Omni-Path Fabric Suite
Intel Omni-Path Fabric (Intel OP Fabric) provides comprehensive
control of administrative functions using a mature subnet
manager. With advanced routing algorithms, powerful diagnos-
tic tools, and full subnet manager failover, the Intel OP Fabric
manager simplifies subnet, fabric, and individual component
management, easing the deployment and optimization of large
fabrics.
Intel Omni-Path Fabric Manager GUI
The manager GUI provides an intuitive, scalable dashboard with
analysis tools for monitoring Intel OP Fabric status and configu-
ration. The GUI runs on a Linux or Windows OS with TCP/IP con-
nectivity to the Intel OP Fabric manager.
Intel Omni-Path Fabric Suite FastFabric Toolset
Guided by an intuitive interface, the FastFabric Toolset in the Intel OP Fabric suite provides rapid, error-free installation and configuration of Intel OPA host and management software tools, as well as simplified installation, configuration, validation, and optimization of HPC fabrics:
• Automated host, switch, and chassis software or firmware installation – fast and in parallel
• Powerful fabric deployment and verification tools
• Advanced fabric routing methods and analysis tools
• Quality of service (QoS) support with virtual fabrics
Numascale
Numascale’s NumaConnect technology enables computer system vendors to build
scalable servers with the functionality of enterprise mainframes at the cost level of
clusters. The technology unites all the processors, memory and IO resources in the
system in a fully virtualized environment controlled by standard operating systems.
High Throughput Computing | CAD | Big Data Analytics | Simulation | Aerospace | Automotive
NumaConnect Background
Systems based on NumaConnect will efficiently support all
classes of applications using shared memory or message pass-
ing through all popular high level programming models. System
size can be scaled to 4k nodes where each node can contain mul-
tiple processors. Memory size is limited by the 48-bit physical
address range provided by the Opteron processors, resulting in a
total system main memory of 256 TBytes.
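The 256-TByte figure follows directly from the address arithmetic (the 4,096-node breakdown of the same 48-bit space appears later in this chapter):

```python
# Checking the memory-size arithmetic in the text: a 48-bit physical address
# range covers 2**48 bytes, i.e. 256 TBytes, and 12 of those bits are enough
# to address 4,096 nodes.
TOTAL_BYTES = 2 ** 48
print(TOTAL_BYTES // 2 ** 40, "TBytes;", 2 ** 12, "nodes")  # -> 256 TBytes; 4096 nodes
```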
At the heart of NumaConnect is NumaChip, a single chip that
combines the cache coherent shared memory control logic with
an on-chip 7 way switch. This eliminates the need for a separate,
central switch and enables linear capacity and cost scaling.
The continuing trend with multi-core processor chips is enabling
more applications to take advantage of parallel processing. Nu-
maChip leverages the multi-core trend by enabling applications
to scale seamlessly without the extra programming effort re-
quired for cluster computing. All tasks can access all memory
and IO resources. This is of great value to users and the ultimate
way to virtualize all system resources.
All high speed interconnects now use the same kind of physical
interfaces resulting in almost the same peak bandwidth. The
differentiation is in latency for the critical short transfers, func-
tionality and software compatibility. NumaConnect differenti-
ates from all other interconnects through the ability to provide
unified access to all resources in a system and utilize caching
techniques to obtain very low latency.
Key Facts:
• Scalable, directory-based Cache Coherent Shared Memory interconnect for Opteron
• Attaches to coherent HyperTransport (cHT) through an HTX connector, pick-up module, or mounted directly on the main board
• Configurable Remote Cache for each node
• Full 48-bit physical address space (256 TBytes)
• Up to 4k (4,096) nodes
• ≈1 microsecond MPI latency (ping-pong/2)
• On-chip, distributed switch fabric for 2- or 3-dimensional torus topologies
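A minimal sketch of why an on-chip switch suffices for a torus: each node links only to its immediate neighbors per dimension, with wrap-around at the edges. The dimensions and sizes below are arbitrary examples, not NumaChip specifics:

```python
# In a k-ary torus each node needs links only to its nearest neighbors in
# every dimension (wrapping at the edges), so no central switch is required.

def torus_neighbors(coord, dims):
    """Neighbors of a node in a multi-dimensional torus with sizes `dims`."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size  # wrap-around link
            neighbors.append(tuple(n))
    return neighbors

# A node in a 4x4 2-D torus has 4 neighbors; in 3-D it would have 6.
print(torus_neighbors((0, 0), (4, 4)))  # -> [(3, 0), (1, 0), (0, 3), (0, 1)]
```

The wrap-around links are also what provide the inherent redundancy mentioned below: every node has two directions per dimension, so a single failed link leaves an alternate route.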
Expanding the capabilities of multi-core processors
Semiconductor technology has reached a level where proces-
sor frequency can no longer be increased much due to power
consumption with corresponding heat dissipation and thermal
handling problems. Historically, processor frequency scaled at
approximately the same rate as transistor density and resulted
in performance improvements for almost all applications with
no extra programming effort. Processor chips are now instead
being equipped with multiple processors on a single die. Utilizing
the added capacity requires software that is prepared for
parallel processing. This is quite obviously simple for individual
and separated tasks that can be run independently, but is much
more complex for speeding up single tasks.
The complexity for speeding up a single task grows with the log-
ic distance between the resources needed to do the task, i.e. the
fewer resources that can be shared, the harder it is. Multi-core
processors share the main memory and some of the cache
levels, i.e. they are classified as Symmetrical Multi Processors
(SMP). Modern processor chips are also equipped with signals
and logic that allow connecting to other processor chips while
still maintaining the same logical sharing of memory. The practical
limit is at two to four processor sockets before the overheads
reduce performance scaling instead of increasing it. This is nor-
mally restricted to a single motherboard.
Currently, scaling beyond the single/dual SMP motherboards is
done through some form of network connection using Ethernet
or a higher speed interconnect like InfiniBand. This requires processes
running on the different compute nodes to communicate
through explicit messages. With this model, programs that need
to be scaled beyond a small number of processors have to be
written in a more complex way where the data can no longer
be shared among all processes, but need to be explicitly decom-
posed and transferred between the different processors’ mem-
ories when required.
NumaConnect uses a scalable approach to sharing all memory
based on distributed directories to store information about
shared memory locations. This means that programs can be
scaled beyond the limit of a single motherboard without any
changes to the programming principle. Any process running on
any processor in the system can use any part of the memory, regardless
of whether the memory is physically located on a different
motherboard.
NumaConnect Value Proposition
NumaConnect enables significant cost savings in three dimensions:
resource utilization, system management, and programmer productivity.
According to long time users of both large shared memory sys-
tems (SMPs) and clusters in environments with a variety of ap-
plications, the former provide a much higher degree of resource
utilization due to the flexibility of all system resources. They
indicate that large mainframe SMPs can easily be kept at more
than 90% utilization and that clusters seldom can reach more
than 60-70% in environments running a variety of jobs. Better
compute resource utilization also contributes to more efficient
use of the necessary infrastructure, with power consumption
and cooling as the most prominent factors (accounting for
approximately one third of the overall cost), with floor space
as a secondary aspect.
Regarding system management, NumaChip can reduce the
number of individual operating system images significantly.
In a system with 100 Tflops computing power, the number of
system images can be reduced from approximately 1,400 to
40, a reduction factor of 35. Even if each of those 40 OS images
requires somewhat more resources for management than the
1,400 smaller ones, the overall savings are significant.
Parallel processing in a cluster requires explicit message pass-
ing programming whereas shared memory systems can utilize
compilers and other tools that are developed for multi-core pro-
cessors. Parallel programming is a complex task and programs
written for message passing normally contain 50%-100% more
code than programs written for shared memory processing.
Since all programs contain errors, the probability of errors in
message passing programs is 50%-100% higher than for shared
memory programs. A significant amount of software develop-
ment time is consumed by debugging errors further increasing
the time to complete development of an application.
In principle, servers are multi-tasking, multi-user machines that
are fully capable of running multiple applications at any given
time. Small servers are very cost-efficient measured by a peak
price/performance ratio because they are manufactured in very
high volumes and use many of the same components as desk-
side and desktop computers. However, these small to medium
sized servers are not very scalable. The most widely used config-
uration has 2 CPU sockets that hold from 4 to 16 CPU cores each.
They cannot be upgraded without changing to a different main
board that also normally requires a larger power supply and a
different chassis. In turn, this means that careful capacity plan-
ning is required to optimize cost and if compute requirements
increase, it may be necessary to replace the entire server with a
bigger and much more expensive one since the price increase is
far from linear.
NumaChip contains all the logic needed to build Scale-Up sys-
tems based on volume manufactured server components. This
drives the cost per CPU core down to the same level while offer-
ing the same capabilities as the mainframe type servers. Where
IT budgets are in focus the price difference is obvious and Nu-
maChip represents a compelling proposition to get mainframe
capabilities at the cost level of high-end cluster technology. The
expensive mainframes still include some features for dynamic
system reconfiguration that NumaChip systems will not offer initially.
Such features depend on operating system software and
can also be implemented in NumaChip-based systems.
Technology
Multi-core processors and shared memory
Shared memory programming for multi-processing boosts pro-
grammer productivity since it is easier to handle than the alter-
native message passing paradigms. Shared memory programs
are supported by compiler tools and require less code than the
alternatives resulting in shorter development time and fewer
program bugs. The availability of multi-core processors on all
major platforms including desktops and laptops is driving more
programs to take advantage of the increased performance
potential. NumaChip offers seamless scaling within the same
programming paradigm regardless of system size from a single
processor chip to systems with more than 1,000 processor chips.
Other interconnect technologies that do not offer cc-NUMA
capabilities require that applications be written for message
passing, resulting in larger programs with more bugs and cor-
respondingly longer development time while systems built with
NumaChip can run any program efficiently.
Virtualization
The strong trend of virtualization is driven by the desire to obtain
higher utilization of resources in the datacenter. In short,
it means that any application should be able to run on any serv-
er in the datacenter so that each server can be better utilized
by combining more applications on each server dynamically
according to user loads. Commodity server technology repre-
sents severe limitations in reaching this goal.
One major limitation is that the memory requirements of any
given application need to be satisfied by the physical server
that hosts the application at any given time. In turn, this means
that if any application in the datacenter is to be dynamically
executable on all of the servers at different times, all of the serv-
ers must be configured with the amount of memory required by
the most demanding application, but only the one running the
app will actually use that memory. This is where the mainframes
excel since these have a flexible shared memory architecture
where any processor can use any portion of the memory at any
given time, so they only need to be configured to be able to han-
dle the most demanding application in one instance.
NumaChip offers the exact same feature, by providing any appli-
cation with access to the aggregate amount of memory in the
system. In addition, it also offers all applications access to all I/O
devices in the system through the standard virtual view provid-
ed by the operating system.
The two distinctly different architectures of clusters and main-
frames are shown in Figure 1. In Clusters processes are loosely
coupled through a network like Ethernet or InfiniBand. An appli-
cation that needs to utilize more processors or I/O than those
present in each server must be programmed to do so from the
beginning. In the mainframe, any application can use any resource
in the system as a virtualized resource and the compiler
can generate threads to be executed on any processor.
In a system interconnected with NumaChip, all processors can
access all the memory and all the I/O resources in the system
in the same way as on a mainframe. NumaChip provides a fully
virtualized hardware environment with shared memory and I/O
and with the same ability as mainframes to utilize compiler gen-
erated parallel processes and threads.
Clustered Architecture
Mainframe Architecture
Operating Systems
Systems based on NumaChip can run standard operating sys-
tems that handle shared memory multiprocessing. Numascale
provides a bootstrap loader that is invoked after power-up and
performs initialization of the system by setting up node address
routing tables. When the standard bootstrap loader is launched,
the system will appear as a large unified shared memory system.
Cache Coherent Shared Memory
The big differentiator for NumaConnect compared to other high-speed interconnect technologies is the shared memory and cache coherency mechanisms. These features allow programs to access any memory location and any memory-mapped I/O device in a multiprocessor system with a high degree of efficiency. They provide scalable systems with a unified programming model that stays the same from the small multi-core machines used in laptops and desktops to the largest imaginable single-system-image machines that may contain thousands of processors.
There are a number of advantages to shared memory machines that lead experts to hold the architecture as the holy grail of computing compared to clusters:
• Any processor can access any data location through direct load and store operations – easier programming, less code to write and debug
• Compilers can automatically exploit loop-level parallelism – higher efficiency with less human effort
• System administration relates to a unified system as opposed to a large number of separate images in a cluster – less effort to maintain
• Resources can be mapped and used by any processor in the system – optimal use of resources in a virtualized environment
• Process scheduling is synchronized through a single, real-time clock – avoids the serialization of scheduling associated with asynchronous operating systems in a cluster and the corresponding loss of efficiency
Scalability and Robustness
The initial design aimed at scaling to very large numbers of processors with a 64-bit physical address space: 16 bits for the node identifier and 48 bits of address within each node. The current implementation for Opteron is limited by the global physical address space of 48 bits, with 12 bits used to address 4,096 physical nodes for a total physical address range of 256 Terabytes.
A directory-based cache coherence protocol was developed to handle scaling with a significant number of nodes sharing data, avoiding overloading the interconnect between nodes with coherency traffic, which would seriously reduce real data throughput.
NumaChip System Architecture
The basic ring topology with distributed switching allows a
number of different interconnect configurations that are more
scalable than most other interconnect switch fabrics. This also
eliminates the need for a centralized switch and includes inher-
ent redundancy for multidimensional topologies.
Functionality is included to manage robustness issues associ-
ated with high node counts and extremely high requirements
for data integrity with the ability to provide high availability for
systems managing critical data in transaction processing and
realtime control. All data that may exist in only one copy is ECC-protected,
with automatic scrubbing after detected single-bit errors
and automatic background scrubbing to avoid accumulation
of single-bit errors.
Integrated, distributed switching
NumaChip contains an on-chip switch to connect to other nodes
in a NumaChip based system, eliminating the need to use a
centralized switch. The on-chip switch can connect systems in
one, two or three dimensions. Small systems can use one, medium-sized systems two, and large systems will use all three dimensions to provide efficient and scalable connectivity between
processors.
The two- and three-dimensional topologies (called Torus) also
have the advantage of built-in redundancy as opposed to sys-
tems based on centralized switches, where the switch repre-
sents a single point of failure.
The distributed switching reduces the cost of the system since
there is no extra switch hardware to pay for. It also reduces the
amount of rack space required to hold the system as well as the
power consumption and heat dissipation from the switch hard-
ware and the associated power supply energy loss and cooling
requirements.
Block Diagram of NumaChip
Multiple links allow flexible system configurations in multi-dimensional topologies
Redefining Scalable OpenMP and MPI
Shared Memory Advantages
Multi-processor shared memory processing has long been the
preferred method for creating and running technical computing
codes. Indeed, this computing model now extends from a user’s
dual core laptop to 16+ core servers. Programmers often add
parallel OpenMP directives to their programs in order to take
advantage of the extra cores on modern servers. This approach
is flexible and often preserves the “sequential” nature of the
program (pthreads can of course also be used, but OpenMP is
much easier to use). To extend programs beyond a single server,
however, users must use the Message Passing Interface (MPI) to
allow the program to operate across a high-speed interconnect.
Interestingly, the advent of multi-core servers has created a par-
allel asymmetric computing model, where programs must map
themselves to networks of shared memory SMP servers. This
asymmetric model introduces two levels of communication,
local within a node, and distant to other nodes. Programmers
often create pure MPI programs that run across multiple cores
on multiple nodes. While a pure MPI program does represent the lowest common denominator, performance may be sacrificed by not utilizing the local nature of multi-core nodes.
Hybrid (combined MPI/OpenMP) models have been able to pull
more performance from cluster hardware, but often introduce
programming complexity and may limit portability. Clearly, users prefer writing software for large shared memory systems to
MPI programming. This preference becomes more pronounced
when large data sets are used. In a large SMP system the data
are simply used in place, whereas in a distributed memory clus-
ter the dataset must be partitioned across compute nodes.
In summary, shared memory systems have a number of highly
desirable features that offer ease of use and cost reduction over
traditional distributed memory systems:
� Any processor can access any data location through direct load
and store operations, allowing easier programming (less time
and training) for end users, with less code to write and debug.
� Compilers, such as those supporting OpenMP, can automatically
exploit loop level parallelism and create more efficient codes,
increasing system throughput and improving resource utilization.
� System administration of a unified system (as opposed to
a large number of separate images in a cluster) results in
reduced effort and cost for system maintenance.
� Resources can be mapped and used by any processor in
the system, with optimal use of resources in a single image
operating system environment.
Shared Memory as a Universal Platform
Although the advantages of shared memory systems are clear,
the actual implementation of such systems “at scale” has been
difficult prior to the emergence of NumaConnect technolo-
gy. There have traditionally been limits to the size and cost of
shared memory SMP systems, and as a result the HPC commu-
nity has moved to distributed memory clusters that now scale
into the thousands of cores. Distributed memory programming
occurs within the MPI library, where explicit communication
pathways are established between processors (i.e., data is es-
sentially copied from machine to machine). A large number of
existing applications use MPI as a programming model. Fortu-
nately, MPI codes can run effectively on shared memory sys-
tems. Optimizations have been built into many MPI versions
that recognize the availability of shared memory and avoid full
message protocols when communicating between processes.
Shared memory programming using OpenMP has been useful on
small-scale SMP systems such as commodity workstations and
servers. Providing large-scale shared memory environments for
these codes, however, opens up a whole new world of perfor-
mance capabilities without the need for re-programming. Using
NumaConnect technology, scalable shared memory clusters are
capable of efficiently running both large-scale OpenMP and MPI
codes without modification.
Record-Setting OpenMP Performance
In the HPC community NAS Parallel Benchmarks (NPB) have
been used to test the performance of parallel computers (http://
www.nas.nasa.gov/publications/npb.html). The benchmarks are
a small set of programs derived from Computational Fluid Dy-
namics (CFD) applications that were designed to help evaluate
the performance of parallel supercomputers. Problem sizes in
NPB are predefined and indicated as different classes (currently
A through F, with F being the largest).
Reference implementations of NPB are available in common-
ly-used programming models such as MPI and OpenMP, which
make them ideal for measuring the performance of both distrib-
uted memory and SMP systems. These benchmarks were com-
piled with Intel ifort version 14.0.0. (Note: the Intel-generated code is currently slightly faster, but Numascale is working on NumaConnect optimizations for the GNU compilers and therefore suggests using gcc and gfortran for OpenMP applications.) For the following tests, the NumaConnect Shared Memory benchmark system has 1TB of memory and 256 cores. It utilizes eight servers, each equipped with two AMD Opteron 6380 2.5 GHz CPUs, each with 16 cores and 128GB of memory. Figure One shows the results for
running NPB-SP (Scalar Penta-diagonal solver) over a range of 16
to 121 cores using OpenMP for the Class D problem size.
Figure Two shows results for the NPB-LU benchmark (Lower-Up-
per Gauss-Seidel solver) over a range of 16 to 121 cores, using
OpenMP for the Class D problem size.
Figure Three shows the NAS-SP benchmark E-class scaling perfectly from 64 processes (using affinity 0-255:4) to 121 processes (using affinity 0-241:2). Results indicate that larger problems
scale better on NumaConnect systems, and it was noted that
NASA has never seen OpenMP E Class results with such a high
number of cores.
OpenMP NAS Parallel results for NPB-SP (Class D)
OpenMP NAS Parallel results for NPB-LU (Class D)
OpenMP applications cannot run on InfiniBand clusters without
additional software layers and kernel modifications. The Numa-
Connect cluster runs a standard Linux kernel image.
Surprisingly Good MPI Performance
Despite the excellent OpenMP shared memory performance
that NumaConnect can deliver, applications have historically
been written using MPI. The performance of these applications
is presented below. As mentioned, the NumaConnect system
can easily run MPI applications.
Figure Four is a comparison of NumaConnect and FDR InfiniB-
and NPB-SP (Class D). The results indicate that NumaConnect
performance is superior to that of a traditional distributed In-
finiBand memory cluster. MPI tests were run with OpenMPI and
gfortran 4.8.1 using the same hardware mentioned above.
Both industry-standard OpenMPI and MPICH2 work in shared
memory mode. Numascale has implemented their own version
of the OpenMPI BTL (Byte Transfer Layer) to optimize the com-
munication by utilizing non-polluting store instructions. MPI
messages require data to be moved, and in a shared memory en-
vironment there is no reason to use standard instructions that
implicitly result in cache pollution and reduced performance.
This results in very efficient message passing and excellent MPI
performance.
Similar results are shown in Figure Five for the NAS-LU (Class D).
NumaConnect’s performance over InfiniBand may be one of the
more startling results for the NAS benchmarks. Recall again that
OpenMP applications cannot run on InfiniBand clusters without
additional software layers and kernel modifications.
OpenMP NAS results for NPB-SP (Class E)
NPB-SP comparison of NumaConnect to FDR InfiniBand
NPB-LU comparison of NumaConnect to FDR InfiniBand
Big Data
Big Data applications – once limited to a few exotic disciplines
– are steadily becoming the dominant feature of modern com-
puting. In industry after industry, massive datasets are being
generated by advanced instruments and sensor technology.
Consider just one example, next generation DNA sequencing
(NGS). Annual NGS capacity now exceeds 13 quadrillion base
pairs (the As, Ts, Gs, and Cs that make up a DNA sequence). Each
base pair represents roughly 100 bytes of data (raw, analyzed,
and interpreted). Turning the swelling sea of genomic data into
useful biomedical information is a classic Big Data challenge,
one of many, that didn’t exist a decade ago.
This mainstreaming of Big Data is an important transformational
moment in computation. Datasets in the 10-to-20 Terabytes (TB)
range are increasingly common. New and advanced algorithms
for memory-intensive applications in Oil & Gas (e.g. seismic data
processing), finance (real-time trading), social media (database),
and science (simulation and data analysis), to name but a few,
are hard or impossible to run efficiently on commodity clusters.
The challenge is that traditional cluster computing based on
distributed memory – which was so successful in bringing down
the cost of high performance computing (HPC) – struggles when
forced to run applications where memory requirements exceed
the capacity of a single node. Increased interconnect latencies,
longer and more complicated software development, inefficient
system utilization, and additional administrative overhead are
all adverse factors. Conversely, traditional mainframes running
shared memory architecture and a single instance of the OS have
always coped well with Big Data Crunching jobs.
Numascale’s Solution
Until now, the biggest obstacle to wider use of shared memory
computing has been the high cost of mainframes and high-end
‘super-servers’. Given the ongoing proliferation of Big Data appli-
cations, a more efficient and cost-effective approach to shared
memory computing is needed. Now, Numascale, a four-year-old
spin-off from Dolphin Interconnect Solutions, has developed a
technology, NumaConnect, which turns a collection of standard
servers with separate memories and IO into a unified system
that delivers the functionality of high-end enterprise servers and
mainframes at a fraction of the cost.
Numascale Technology snapshot (board and chip):
NumaConnect links commodity servers together to form a single
unified system where all processors can coherently access and
share all memory and I/O. The combined system runs a single
instance of a standard operating system like Linux. At the heart
of NumaConnect is NumaChip – a single chip that combines the
cache coherent shared memory control logic with an on-chip
7-way switch. This eliminates the need for a separate, central
switch and enables linear capacity and cost scaling.
Systems based on NumaConnect support all classes of applica-
tions using shared memory or message passing through all pop-
ular high level programming models. System size can be scaled
to 4k nodes where each node can contain multiple processors.
Memory size is limited only by the 48-bit physical address range
provided by the Opteron processors resulting in a record-break-
ing total system main memory of 256 TBytes. (For details of Nu-
mascale technology see: The NumaConnect White Paper)
The result is an affordable, shared memory computing option to
tackle data-intensive applications. NumaConnect-based systems
running with entire data sets in memory are “orders of magni-
tude faster than clusters or systems based on any form of existing
mass storage devices and will enable data analysis and decision
support applications to be applied in new and innovative ways,”
says Kåre Løchsen, Numascale founder and CEO.
The big differentiator for NumaConnect compared to other high-
speed interconnect technologies is the shared memory and
cache coherency mechanisms. These features allow programs to
access any memory location and any memory mapped I/O device
in a multiprocessor system with a high degree of efficiency. It provides scalable systems with a unified programming model that
stays the same from the small multi-core machines used in lap-
tops and desktops to the largest imaginable single system image
machines that may contain thousands of processors and tens to
hundreds of terabytes of main memory.
Early adopters are already demonstrating performance gains
and costs savings. A good example is Statoil, the global energy
company based in Norway. Processing seismic data requires
massive amounts of floating point operations and is normally
performed on clusters. Broadly speaking, this kind of processing
is done by programs developed for a message-passing paradigm
(MPI). Not all algorithms are suited to the message passing paradigm; the amount of code required is huge, and the development process and debugging task are complex.
Numascale’s implementation of shared memory has many inno-
vations. Its on-chip switch, for example, can connect systems in
one, two, or three dimensions (e.g. 2D and 3D Torus). Small systems can use one, medium-sized systems two, and large systems
will use all three dimensions to provide efficient and scalable
connectivity between processors. The distributed switching re-
duces the cost of the system since there is no extra switch hard-
ware to pay for. It also reduces the amount of rack space required
to hold the system as well as the power consumption and heat
dissipation from the switch hardware and the associated power
supply energy loss and cooling requirements.
Cache Coherence
Multiprocessor systems with caches and shared memory space
need to resolve the problem of keeping shared data coherent.
This means that the most recently written data to a memory lo-
cation from one processor needs to be visible to other proces-
sors in the system immediately after a synchronization point.
The synchronization point is normally implemented as a barrier
through a lock (semaphore) where the process that stores data
releases a lock after finishing the store sequence.
When the lock is released, the most recent version of the data
that was stored by the releasing process can be read by another
process. If there were no individual caches or buffers in the data
paths between processors and the shared memory system, this
would be a relatively trivial task. The presence of caches chang-
es this picture dramatically. Even if the caches were so-called
write-through, in which case all store operations proceed all
the way to main memory, it would still mean that the proces-
sor wanting to read shared data could have copies of previous
versions of the data in the cache and such a read operation
would return a stale copy seen from the producer process un-
less certain steps are taken to avoid the problem.
Early cache systems would simply require a cache clear opera-
tion to be issued before reading the shared memory buffer in
which case the reading process would encounter cache misses
and automatically fetch the most recent copy of the data from
main memory. This could be done without too much loss of per-
formance when the speed difference between memory, cache
and processors was relatively small.
As these differences grew and write-back caches were intro-
duced, such coarse-grained coherency mechanisms were no
longer adequate. A write-back cache will only store data back
to main memory upon cache line replacements, i.e. when the
cache line will be replaced due to a cache miss of some sort. To
accommodate this functionality, a new cache line state, “dirty”,
was introduced in addition to the traditional “hit/miss” states.
NUMA and ccNUMA
The term Non-Uniform Memory Access means that memory
accesses from a given processor in a multiprocessor system will
have different access time depending on the physical location
of the memory. The “cc” means Cache Coherence (i.e. all memory references from all processors will return the latest updated data from any cache in the system automatically).
Modern multiprocessor systems where processors are con-
nected through a hierarchy of interconnects (or buses) will per-
form differently depending on the relative physical location of
processors and the memory being accessed. Most small multi-
processor systems used to be symmetric (SMP – Symmetric Mul-
ti-Processor) having a couple of processors hooked up to a main
memory system through a central bus where all processors had
equal priority and the same distance from the memory.
This was changed through AMD’s introduction of HyperTransport with on-chip DRAM controllers. When more than one such processor chip (normally also with multiple CPU cores per chip) is connected, the criteria for being categorized as NUMA are fulfilled. This is illustrated by the fact that reading or writing
data from or to the local memory is indeed quite a bit faster
than performing the same operations on the memory that is
controlled by the other processor chip.
Seen from a programming point of view, the uniform, global
shared address space is a great advantage since any operand or
variable can be directly referenced from the program through a
single load or store operation. This is both simple and incurs no
overhead. In addition, the large physical memory capacity elimi-
nates the need for data decomposition and the need for explicit
message passing between processes. It is also of great impor-
tance that the programming model can be maintained through-
out the whole range of system sizes from small desktop systems
with a couple of processor cores to huge supercomputers with
thousands of processors.
The Non-Uniformity seen from the program is only coupled with
the access time difference between different parts of the physi-
cal memory. Allocation of physical memory is handled by the op-
erating system (OS) and modern OS kernels include features for
optimizing allocation of memory to match with the scheduling
of processors to execute the program. If a programmer wishes
to influence memory allocation and process scheduling in any
way, OS calls to pin-down memory and reserve processors can
be inserted in the code.
This is of course less desirable than letting the OS handle it automatically, because it increases the effort required to port the
application between different operating systems. In normal
cases with efficient caching of remote accesses, the automatic
handling by the OS will be sufficient to ensure high utilization of
processors even if the memory is physically scattered between
the processing nodes.
Allinea Forge for parallel Software Development
Allinea Forge is the complete toolsuite for software development – with everything
needed to debug, profile, optimize, edit and build C, C++ and Fortran applications on
Linux for high performance – from single threads through to complex parallel HPC codes
with MPI, OpenMP, threads or CUDA.
Leading enterprises and labs that need high performance computing rely on Allinea
Forge to help them develop fast, robust software and keep their development teams on
track. You’d be in good company with us – users on 70% of the world’s largest supercom-
puters and clusters rely on Allinea Forge.
Application Performance and Development Tools
At its heart, High Performance Computing is used to answer
scientific questions, to engineer better products, to simulate
and predict the world around us, to analyze data and to make
discoveries.
While the hardware is often the focus and grabs the head-
lines, the reality is that HPC software applications determine
everything: the only purpose of a High Performance Computing
system is to run software for “the mission”.
The whole of HPC depends on the software. For that reason it is
important to ensure that the software that is run is:
� Able to exploit the hardware
� Fast and efficient in terms of the resources and time used
� Robust and reliable
� Energy efficient
To take the first criterion as an example: procuring hardware that
has one component that is over-specified for the application pro-
file means that this part of the investment is wasted and could
have achieved more output elsewhere. If storage is the bottle-
neck of the workflow, then a faster and more expensive proces-
sor may not deliver any extra scientific or industrial throughput.
Increasingly, the demand to fit energy or power usage within physical or cost constraints is adding energy and power, as well as time, to this “optimization challenge”.
Tools from Allinea Software are widely used in the HPC industry
to analyze, tune and develop applications in order to achieve the
scientific or business demands of HPC.
Analyzing the Applications
To increase the discovery provided by a current or future HPC system, it is essential to analyze the applications that are run. This
assessment should be performed on a regular basis over the life-
time of a system – and also prior to system acquisition to drive
relevant system specification. Frequent analysis enables tem-
porary system issues and specific run-configuration issues to be
addressed quickly. The need for analysis is applicable where soft-
ware is provided by third parties such as commercial engineer-
ing software – as well as for in-house or community software.
Allinea Performance Reports provides performance characteri-
zation that helps to understand the limiting factors for an appli-
cation running on a system. Its one-page reports show the pro-
cessor, networking and I/O time balance, along with advanced
performance and resource information such as processor vector-
ization and memory access, memory usage, core usage and syn-
chronization losses. This helps users to tune runtime parameters
better and also helps developers to target optimization efforts.
Developing High Performance Software
For in-house and community software, the ability to both develop and improve software increases the results and potential of
HPC systems.
However, today’s architectures, large code bases, higher node
and thread counts and the complexity of platform performance
place high demands on those that write software.
In addition to the best compilers (generating the fastest binaries
for a given source code) and fast libraries for core operations
such as dense matrix operations, it is necessary to have tools
that enable developers and scientists to apply their skills effec-
tively and to create more robust and efficient software.
Allinea Forge is a development tool suite that includes the
world’s leading debugging and performance profiling tools –
Allinea DDT and Allinea MAP. Using Allinea Forge enables
developers to create faster and more robust code, in a fraction
of the time.
The highly capable tools have ease-of-use at the fore: the suite
provides access to local, remote and cloud-based HPC systems
for developers – enabling their editing, building, debugging and
performance profiling of code on the systems. Designed for
developers working with HPC’s multi-process and multi-threaded
codes, it is deployed on systems ranging from technical work-
stations through to the world’s largest supercomputers.
Its debugger provides powerful insight into software defects
– controlling concurrent processes and threads, applying
advanced memory debugging to complex bugs, and visualizing
distributed application data and state.
Its performance profiler pinpoints application performance
bottlenecks and enables developers to analyze key performance
issues such as OpenMP or MPI synchronization, use of cache-un-
friendly memory access patterns, excessive I/O, vectorization – in
addition to power and energy use.
What’s in Allinea Forge?
Allinea Forge includes
� The world’s leading C, C++,
Fortran and F90 debugger – Allinea DDT
� The fast profiler for high performance multi-threaded
or multi-process code – Allinea MAP
� One single, crisp, clear interface that supports debugging,
profiling, editing and building code on local and remote
systems
Edit, build and commit
Whether debugging or profiling, Allinea Forge lets you make
those quick changes easily – with its built-in editor. There’s no
need for firing up a separate editor and breaking your workflow.
When the changes are done, simply rebuild and debug again, or
profile the code to see the impact of those changes.
Forge also supports the major source control systems and can
annotate your code to let you know when code was changed,
and who by. Ideal for those large multi-developer projects: find
out who broke the build easily!
� Supports Git, Mercurial, Subversion, CVS – update the code or
commit to preserve your work – or annotate the code with
version numbers and change messages
� Full syntax highlighting and code-folding – hide long and
irrelevant code blocks so that you can see the structure of
your code more clearly.
� Configurable build – supports any build command.
� Powerful search and navigation.
� In-built static analysis for C++ projects.
Access remote or cloud systems as seamlessly as your laptop
The workflow of a developer in high performance computing or
working in the cloud has specific needs for tools: you need to
work with your code and machine often from a distance – as of-
ten as you might work on code running locally.
The whole Allinea Forge toolkit has been built with you in mind.
It has unique remote connection support that brings editing,
debugging and profiling into your local desktop with minimal
network lag - enabling access to large scale machines from home
quickly, easily and securely.
� Connects via secure shell (SSH) – running the user interface
locally and controlling the remote session using our unique
scalable low traffic control architecture.
� Known to work with most common one-time-password (OTP)
authentication tokens.
Platform Support
Allinea Forge supports the platforms of technical and high performance computing.
� Hardware and O/S: Linux on Intel Xeon, Intel Xeon
Phi, ARM 64-bit, OpenPOWER.
� Parallel processing frameworks: Almost every known imple-
mentation of the MPI standard, OpenSHMEM and OpenMP.
� Coprocessors: NVIDIA GPUs – and models/languages such as
OpenACC and CUDA – and Intel Xeon Phi.
� Systems: Multicore laptops, through to supercomputers and
clusters – including those provided by key vendors such as
Bull, Cray, Dell, HP, IBM, Lenovo and SGI.
A remote client is available to allow access to supported remote
systems from OS X and Windows laptops, in addition to Linux
platforms.
Cross-platform Tools
It’s important to have great tools in order to develop great software; that’s why we’re ensuring that everything from the latest
compilers, C++11 standards, OpenMP and MPI versions to NVIDIA
CUDA and Intel® Xeon Phi™, Intel Xeon, 64-bit ARM and Open-
POWER hardware are fully-supported by our tools.
Debugging with Allinea DDT
Allinea DDT is the debugger for software engineers and scientists
developing C++, C or Fortran parallel and threaded applications
on CPUs, GPUs and Intel Xeon Phi coprocessors.
Its powerful intuitive graphical interface with automatic detec-
tion of memory bugs, divergent behavior and lightning-fast per-
formance at all scales combine to make Allinea DDT the number
one debugger in research, industry and academia.
Amongst DDT’s features:
� Powerful memory debugging – to catch problems like
memory leaks, dangling pointers or reading beyond the
bounds of arrays.
� Intelligent aggregation of multiple processes’ and threads’
state and data – to simplify finding unusual behavior.
� Control multiple threads and processes simultaneously –
set breakpoints, run to a line, or step forwards and run.
� Automatic change detection of variables and smart
highlighting and graphs of values across threads and
processes.
� Unique large and distributed arrays viewing and filtering for
scientific codes.
� Extensible C++ STL and complex type viewing – to make
viewing opaque C++/C++11 classes easy.
Allinea MAP – intuitive lightweight profiling
Allinea MAP is a profiler that shows you which lines of code are
slow and gets everything else out of your way.
Whether at one process or ten thousand, Allinea MAP is designed
to work out-of-the-box with no need for instrumentation and no
danger of creating large, unmanageable data files. Software en-
gineers developing parallel and threaded applications on CPUs,
GPUs and Intel Xeon Phi coprocessors rely on Allinea MAP’s unri-
valled capability.
Is the compiler vectorizing this loop? Is this code CPU-bound, or
memory-bound? Why is my multi-process code slow? Which MPI
calls is it waiting at? Is multithreaded synchronization destroy-
ing performance? Allinea MAP answers all these questions and
more in a beautiful, intuitive graphical interface.
It is used from multicore Linux workstations through to the
largest supercomputers on the planet.
� Fast – typically under 5% runtime overhead means that you
can profile realistic test cases that you care most about.
� Easy – the interactive user interface is clear and intuitive,
designed for developers and computational scientists.
� Scalable – architected for the biggest clusters and super-
computers – and the biggest and most complex codes!
� No fuss – profiles C++, C, Fortran with no relinking, instrumentation or code changes required.
Allinea DDT – a debugger for parallel and threaded Linux code
Allinea DDT makes debugging faster and gives developers
the break they need to deliver software on time. The compre-
hensive debugger for C, C++, Fortran and F90 multi-process
and multi-threaded applications:
� Debugs native Linux applications, including those
running on more than one server – such as on
HPC clusters using MPI.
� Controls processes and threads – step through program
execution, stop on variable changes, errors and at user
inserted breakpoints.
� Has market leading memory debugging to detect hard-
to-find or intermittent problems caused by memory
leaks, dangling pointers, beyond-bounds array access.
� Has powerful and extensible C++ STL debugging to view
STL containers intuitively.
� Shows variables and arrays across multiple processes
and detects changes automatically.
� Provides In-built editing, building and version control
integration – with outstanding support for working on
remote systems.
� Has an Offline mode for debugging non-interactively
and recording application behavior and state.
Powerful, scalable, multi-process, multi-threaded debugging
If you work with multiple processes and threads, through
libraries like MPI or OpenMP, Allinea DDT makes concurrency
easy.
MAP exposes a wide set of performance problems and
bottlenecks by measuring:
� Computation – with self and child and call tree
representations over time.
� Thread activity – to identify over-subscribed cores and
sleeping threads that waste available CPU time for OpenMP
and pthreads.
� Instruction types (for x86_64) – to show use of e.g. vector
units or other performance extensions.
� Synchronization, communication and workload imbalance
for MPI or multi-process usage.
� I/O performance and time spent in I/O – to identify
bottlenecks in shared or local file systems.
Control threads and processes individually and collectively
� Set breakpoints on groups of MPI processes, and all or
individual threads (OpenMP and pthreads).
� Step or play processes and threads individually or en-masse.
� Create groups of processes based on variable/expression
values, current code location or process state.
See state and differences between processes and threads easily
DDT has unique data aggregation and difference-highlighting
views that automatically compare and contrast program state so
it’s quicker to handle multiple contexts.
The parallel stack view gives a scalable view of thread and pro-
cess stacks – from a handful of processes and threads to hun-
dreds of thousands – it groups stacks to highlight the differences
that can indicate divergence – ideal for MPI, OpenMP, pthreads,
CUDA and OpenACC debugging.
Unusual data values and changes are seen instantly with smart
highlighting and sparklines – thumbnail plots that show values
of variables across every process. Read more about sparklines in
our blog. Additional cross-process and thread comparison tools
help to group or search data.
DDT also makes it easier to understand the types of bugs that
multi-process codes can introduce.
� Diagnose deadlocks, livelocks and message synchroniza-
tion errors with both graphical and table-based message
queue displays for MPI.
� Integration support for open-source MPI correctness tools.
Above all, it’s tried and tested at scale – with lightning-fast
performance even at the extreme: step and display 700,000
processes in a fraction of a second.
Data Browsing for C, C++ and F90 structures and classes
We know it’s essential to see variables quickly and understand
the most complex data structures – and with Allinea DDT you can.
� View variables and expressions – and watch as they change
as you debug through your code with automatic change
detection
� Browse through variables, objects, modules and complex
C++ classes, structs or F90 derived type data structures and
multi-dimensional arrays.
� Dereference pointers or find addresses of objects – and use
memory debugging to find out where a pointer was allocat-
ed and how large the allocation is.
� Extensible intelligent display of complex data structures
such as C++ and C++11 STL, Boost and Qt data containers -
including list, set, map, multimap, QString and many more.
� Add support for your own custom data types to show only
the data you care about.
� Set watchpoints to stop when data is read or written by any
process.
Memory Debugging
DDT’s built-in fast memory debugger reveals problems in heap
(allocated) memory usage automatically.
� Instantly stops on exceptions and crashes – identify the
source and data behind any crash with an easy to use com-
prehensive interface to application state.
� Catch out-of-bounds data access as soon as it occurs.
� Consign memory leaks to history: leak reports detail the
source locations that are responsible for the memory
allocation.
� Supports custom allocators – so that allocation locations
are reported at the level that matters to your application.
� Check for dangling pointers.
� Examine a pointer – check if it is valid and view the
full stack of where it was allocated – and how large its
allocation is.
� Tracks C and C++ memory allocated with malloc, new and similar.
� Tracks Fortran and F90 allocatable memory and pointers.
Powerful navigation, edit, build and version control – in a comprehensive tool suite
It’s natural to want to improve the speed of code after removing
the bugs. That’s why Allinea DDT is part of Allinea Forge. DDT is
available separately or in an Allinea Forge license that includes
the leading Linux parallel and threaded code profiler, Allinea
MAP. MAP shares the same user interface and installation as DDT
– which means it is easy to switch to profile a code once debug-
ging is complete.
The powerful user interface allows developers to edit, build and
commit whilst debugging – and much more! The development
capabilities in Allinea DDT and Allinea Forge are explored in the
Allinea Forge features page.
Working on remote systems? Connect from your laptop with na-
tive Mac, Windows and Linux clients with the free native remote
client – no more laggy X connection forwarding!
Thinking smarter about solving bugs
We’ve put time into debugging so that you don’t have to. Allinea
DDT integrates the best practice to help you spot the causes of
problems more quickly.
� Catch common errors at source with built-in static analysis
for C, C++ and Fortran – which warns about bugs before
you’ve even hit them.
� Why does this revision of my code behave differently to the
previous one? Version control integration highlights recent
changes – and allows you to see why and where changes
have been introduced.
� A single command in Allinea DDT will automatically log the
values of variables across all processes at each changed
section of code, allowing you to track down exactly how and
why a particular change introduced problems to your code.
To help you manage your progress during debugging, our
logbook provides a handy record of the steps you’ve taken and
the things that have been observed.
� Automatic, always-on logging of your debugging activity so
that it’s easy to go back and review the evidence for things
you might have missed at the time.
� Share logbooks with others on your team,
making collaborative debugging a reality.
� Compare two traces side-by-side, so that changes between
systems, versions or process counts jump right out.
Scalable printf and powerful command-line modes
Tracepoints provide scalable “printf” debugging. DDT can insert
tracepoints into your code, and log values scalably each time a
line of code is passed. The results are available interactively – or
in the offline debugging mode.
Offline mode debugs whilst you sleep. Submit an offline debug
run to your batch scheduler from the command line and receive
a text or HTML error and trace report from DDT when the job is
done. Read more about offline debugging in our blog.
Handles and visualizes huge data sets
DDT is designed to handle the large data sets of scientific and
technical computing. It can display or filter large arrays – even
when distributed on multiple processes.
� Export to industry standard data formats HDF5 and CSV.
� Browse arrays and gather statistics – over all your process-
es in parallel – with powerful filtering to search petabytes
in parallel.
� Visualize and debug data simultaneously using Allinea DDT’s
connector to the VisIt scientific visualization tool.
Cross-platform for the HPC architectures of today and tomorrow
Allinea DDT is cross-platform – supporting all of today’s leading
technical-computing platforms. It’s a CUDA debugger, a parallel
debugger, a multi-threaded debugger, an OpenMP debugger and
a memory debugger all at the same time – and supports mixed
hybrid programming models.
Architectures
Intel Xeon, Intel Xeon Phi, NVIDIA CUDA, IBM BlueGene/Q,
OpenPOWER, ARMv8 (64-bit)
Models
MPI, OpenMP, CUDA, OpenACC, UPC, CoArray Fortran, PGAS
Languages, pthread-based multithreading, SHMEM, OpenSHMEM
Languages
Fortran, C++, C++11, C, PGAS Languages, CoArray Fortran
MPI, OpenMP, C, C++ and F90 profiling – Allinea MAP
Profiler Features and Benefits
We believe scientists and developers should be set free to spend
their time and energy doing great science and code – not bat-
tling with arcane or unnecessarily complex tools. Allinea MAP
is our way of giving time back to you – it is a profiler that shows
developers exactly where and why code is losing performance.
It provides:
� Effortless code profiling – without needing to change your
code or – on most systems – the way you build it.
� Profiling for applications running on more than one server
and multiple processes – such as on HPC clusters using MPI.
� Clear views of bottlenecks in I/O, in compute, in thread or in
multi-process activity.
� Deep insight into actual processor instruction types that
affect your performance such as vectorization and memory
bandwidth.
� Memory usage over time to discover high-watermarks and
changes in complete memory footprint.
� A powerful, navigable source browser in which you can edit,
build and commit your changes – with outstanding support
for working on remote systems.
More results, faster
Profiling your C++, C, Fortran or F90 code on Linux is as simple
as running “map -profile my_program.exe”. There are no ex-
tra steps and no complicated instrumentation. MPI, OpenMP,
pthreads or unthreaded codes can all be profiled with Allinea
MAP. The graphical results are precise, straightforward to inter-
pret and bottlenecks are shown directly in the source code.
Integration and a common interface with the debugger Allinea
DDT in the Allinea Forge tool suite makes moving between tools
a breeze, saving you time and energy throughout each stage of
the development cycle.
Low-overhead profiling:
for production and test workloads at any scale
Existing performance tools can give a powerful view, when you
run them and spend the time analyzing their output, but when
was the last time you ran a profiler on your production code?
We built Allinea MAP with less than 5% runtime overhead – and
the data files created by MAP remain small, for any size and du-
ration of run – so you can run it every day, on every change, giv-
ing you fascinating and powerful insights into the performance
of real codes under real conditions.
Fundamental to MAP is that it shows time alongside your source
code – so that bottlenecks are clearer to see, and the top-down
stack view of time across all processes means it’s easy to navi-
gate through the code to the parts that matter.
I/O profiling
As systems get larger, more and more codes are being affected
by poor I/O performance. Often, this goes unnoticed or misla-
belled as poor application scaling. Allinea MAP shows you exact-
ly where your file I/O bandwidth is being used, helping to diag-
nose overloaded shared filesystems, poor read/write patterns
and even system misconfiguration issues.
Profiling threads and OpenMP code
Getting performance from multithreaded code can be a chal-
lenge – but Allinea MAP makes it easy to see where thread syn-
chronization is costing cycles and where threads are spending
their time.
With views of CPU core activity, and code profiling by actual per-
core walltime, Allinea MAP is the thread profiler that threaded
code has been waiting for – it’s unique in profiling threads accu-
rately and quickly on real workloads.
MPI Profiling
At the core of Allinea MAP is a scalable architecture that lets it
profile hundreds, or thousands of parallel processes – such as
those in HPC’s main communication library, MPI. It gathers per-
formance data from each process, and merges the information
to present why and where your MPI or multiprocess code is slow.
MAP works with almost every MPI implementation so that your
users have the fastest and most pain-free experience of profiling
their MPI code.
Memory Profiling
As your application progresses, Allinea MAP can show you
the real memory usage across all processes in the application
and all compute nodes/servers. The memory usage helps
you identify imbalance, or changes caused by phases in your
application – and MAP shows this alongside your source code.
The visible high-water mark of usage helps to track down ap-
plications that rely on third-party libraries which temporarily
consume memory and push memory usage over the edge. Ap-
plications that use increasing memory over time – memory
leaks – can then be addressed with Allinea DDT’s in-built memory
debugging.
Additionally, the time spent in memory accesses is one of the
key metrics profiled so that poor memory access patterns and
cache use are found easily.
Accelerator Metrics
Allinea MAP supports the latest NVIDIA CUDA GPUs and helps
you to profile CUDA GPUs and the CPUs together. Profiling ena-
bles you to see how your CPU waits for GPU completion - and
view CUDA GPU time in global memory access, GPU utilization
and even GPU temperature.
For further information:
� Read more about MAP’s features for profiling codes that use
CUDA.
� Watch a video on Allinea Forge helping developers to use
CUDA and OpenACC effectively at NVIDIA’s GTC conference.
Energy Profiling
Energy consumption and peak power usage is increasingly
important for high-performance applications and their users.
With Allinea MAP’s Energy Pack, developers can optimize for
time and energy.
Intel processors from Sandy Bridge onwards (including Haswell
and Broadwell chips) are supported for CPU power measurement
via their in-built Intel RAPL capability. GPU power measurement
is available on any NVIDIA GPU with power monitoring support.
Node-level measurement is also available for systems supporting
the Intel Energy Checker API or the Cray HSS energy counters
(XK6, XC30 and above).
Compare the performance of different clusters
and architectures
Allinea MAP is a cross-platform profiler supporting the major
Linux platforms. It provides its data in an open XML format,
making it ideal for post-processing to characterize and compare
the performance of key codes on different hardware platforms.
Even without access to the original source code, Allinea MAP
tracks and reports on CPU, memory and MPI performance met-
rics over time, giving you everything you need to evaluate and
compare potential new platforms.
Allinea MAP supports all of today’s leading technical-computing
platforms – which means you can be productive on any system.
It’s an MPI profiler, a multi-threaded profiler and an OpenMP
profiler – and supports mixed hybrid programming models.
Architectures
Intel Xeon Phi, NVIDIA CUDA, OpenPOWER, ARMv8 (64-bit),
Intel Xeon
Models
MPI, OpenMP, CUDA, OpenACC, UPC, PGAS Languages,
pthread-based multithreading, SHMEM, OpenSHMEM
Languages
Fortran, C, C++, C++11, PGAS Languages
Free up support staff time to solve key challenges
HPC consultants and support staff have a deep understanding
of performance and optimization tools. Yet again and again
they tell us much of their time is spent diagnosing the same ba-
sic mistakes new programmers make over and over.
We designed Allinea MAP so that new developers of MPI, OpenMP
and regular code can see the cause of common performance
problems at once, freeing up experts to dive deeper into complex
and leadership-class optimization problems.
ACID
In computer science, ACID (Atomicity, Consistency, Isolation, Dura-
bility) is a set of properties that guarantee that database transac-
tions are processed reliably. In the context of databases, a single
logical operation on the data is called a transaction. For example,
a transfer of funds from one bank account to another, even involv-
ing multiple changes such as debiting one account and crediting
another, is a single transaction.
ACML (“AMD Core Math Library“)
A software development library released by AMD. This library
provides useful mathematical routines optimized for AMD pro-
cessors. Originally developed in 2002 for use in high-performance
computing (HPC) and scientific computing, ACML allows nearly
optimal use of AMD Opteron processors in compute-intensive
applications. ACML consists of the following main components:
� A full implementation of Level 1, 2 and 3 Basic Linear Algebra
Subprograms (→ BLAS), with optimizations for AMD Opteron
processors.
� A full suite of Linear Algebra (→ LAPACK) routines.
� A comprehensive suite of Fast Fourier transform (FFTs) in sin-
gle-, double-, single-complex and double-complex data types.
� Fast scalar, vector, and array math transcendental library
routines.
� Random Number Generators in both single- and double-pre-
cision.
AMD offers pre-compiled binaries for Linux, Solaris, and
Windows available for download. Supported compilers include
gfortran, Intel Fortran Compiler, Microsoft Visual Studio, NAG,
PathScale, PGI compiler, and Sun Studio.
BLAS (“Basic Linear Algebra Subprograms“)
Routines that provide standard building blocks for performing
basic vector and matrix operations. The Level 1 BLAS perform
scalar, vector and vector-vector operations, the Level 2 BLAS
perform matrix-vector operations, and the Level 3 BLAS perform
matrix-matrix operations. Because the BLAS are efficient,
portable, and widely available, they are commonly used in
the development of high quality linear algebra software, e.g.
→ LAPACK. Although a model Fortran implementation of the BLAS
is available from netlib in the BLAS library, it is not expected to
perform as well as a specially tuned implementation on most
high-performance computers – on some machines it may give
much worse performance – but it allows users to run → LAPACK
software on machines that do not offer any other implemen-
tation of the BLAS.
Ceph
In computing, Ceph is an → object storage based free software
storage platform that stores data on a single distributed comput-
er cluster, and provides interfaces for object-, block- and file-level
storage. Ceph aims primarily to be completely distributed without
a single point of failure, scalable to the exabyte level, and freely
available.
Ceph replicates data and makes it fault-tolerant, using commod-
ity hardware and requiring no specific hardware support. As a
result of its design, the system is both self-healing and self-man-
aging, aiming to minimize administration time and other costs.
See the main article “OpenStack for HPC Cloud Environments”
for further details.
Glossary
Cg (“C for Graphics”)
A high-level shading language developed by Nvidia in close col-
laboration with Microsoft for programming vertex and pixel
shaders. It is very similar to Microsoft’s → HLSL. Cg is based on
the C programming language and although they share the same
syntax, some features of C were modified and new data types
were added to make Cg more suitable for programming graph-
ics processing units. This language is only suitable for GPU pro-
gramming and is not a general programming language. The Cg
compiler outputs DirectX or OpenGL shader programs.
CISC (“complex instruction-set computer”)
A computer instruction set architecture (ISA) in which each in-
struction can execute several low-level operations, such as
a load from memory, an arithmetic operation, and a memory
store, all in a single instruction. The term was retroactively
coined in contrast to reduced instruction set computer (RISC).
The terms RISC and CISC have become less meaningful with the
continued evolution of both CISC and RISC designs and imple-
mentations, with modern processors also decoding and split-
ting more complex instructions into a series of smaller internal
micro-operations that can thereby be executed in a pipelined
fashion, thus achieving high performance on a much larger sub-
set of instructions.
cluster
Aggregation of several, mostly identical or similar systems to
a group, working in parallel on a problem. Previously known
as Beowulf Clusters, HPC clusters are composed of commodity
hardware, and are scalable in design. The more machines are
added to the cluster, the more performance can in principle be
achieved.
control protocol
Part of the → parallel NFS standard
CUDA driver API
Part of → CUDA
CUDA SDK
Part of → CUDA
CUDA toolkit
Part of → CUDA
CUDA (“Compute Uniform Device Architecture”)
A parallel computing architecture developed by NVIDIA. CUDA
is the computing engine in NVIDIA graphics processing units
or GPUs that is accessible to software developers through in-
dustry standard programming languages. Programmers use
“C for CUDA” (C with NVIDIA extensions), compiled through a
PathScale Open64 C compiler, to code algorithms for execution
on the GPU. CUDA architecture supports a range of computation-
al interfaces including → OpenCL and → DirectCompute. Third
party wrappers are also available for Python, Fortran, Java and
Matlab. CUDA works with all NVIDIA GPUs from the G8X series
onwards, including GeForce, Quadro and the Tesla line. CUDA
provides both a low level API and a higher level API. The initial
CUDA SDK was made public on 15 February 2007, for Microsoft
Windows and Linux. Mac OS X support was later added in ver-
sion 2.0, which supersedes the beta released February 14, 2008.
CUDA is the hardware and software architecture that enables
NVIDIA GPUs to execute programs written with C, C++, Fortran,
→ OpenCL, → DirectCompute, and other languages. A CUDA pro-
gram calls parallel kernels. A kernel executes in parallel across
a set of parallel threads. The programmer or compiler organizes
these threads in thread blocks and grids of thread blocks. The
GPU instantiates a kernel program on a grid of parallel thread
blocks. Each thread within a thread block executes an instance
of the kernel, and has a thread ID within its thread block,
program counter, registers, per-thread private memory, inputs,
and output results.
A thread block is a set of concurrently executing threads that
can cooperate among themselves through barrier synchroniza-
tion and shared memory. A thread block has a block ID within
its grid. A grid is an array of thread blocks that execute the same
kernel, read inputs from global memory, write results to global
memory, and synchronize between dependent kernel calls. In
the CUDA parallel programming model, each thread has a per-
thread private memory space used for register spills, function
calls, and C automatic array variables. Each thread block has a
per-block shared memory space used for inter-thread communi-
cation, data sharing, and result sharing in parallel algorithms.
Grids of thread blocks share results in global memory space af-
ter kernel-wide global synchronization.
CUDA’s hierarchy of threads maps to a hierarchy of processors
on the GPU; a GPU executes one or more kernel grids; a stream-
ing multiprocessor (SM) executes one or more thread blocks;
and CUDA cores and other execution units in the SM execute
threads. The SM executes threads in groups of 32 threads called
a warp. While programmers can generally ignore warp ex-
ecution for functional correctness and think of programming
one thread, they can greatly improve performance by having
threads in a warp execute the same code path and access mem-
ory in nearby addresses. See the main article “GPU Computing”
for further details.
DirectCompute
An application programming interface (API) that supports gen-
eral-purpose computing on graphics processing units (GPUs)
on Microsoft Windows Vista or Windows 7. DirectCompute is
part of the Microsoft DirectX collection of APIs and was initially
released with the DirectX 11 API but runs on both DirectX 10
and DirectX 11 GPUs. The DirectCompute architecture shares a
range of computational interfaces with → OpenCL and → CUDA.
Docker
An open-source project that automates the deployment of
applications inside software containers, by providing an
additional layer of abstraction and automation of operating-
[Figure: the CUDA memory hierarchy – each thread has per-thread
private local memory, each thread block has per-block shared
memory, and grids of thread blocks (Grid 0, Grid 1) share
per-application context global memory.]
system-level virtualization on Linux. Docker uses the resource
isolation features of the Linux kernel such as cgroups and ker-
nel namespaces, and a union-capable filesystem such as aufs
and others to allow independent “containers” to run within a
single Linux instance, avoiding the overhead of starting and
maintaining virtual machines.
The Linux kernel’s support for namespaces mostly isolates an
application’s view of the operating environment, including pro-
cess trees, network, user IDs and mounted file systems, while
the kernel’s cgroups provide resource limiting, including the
CPU, memory, block I/O and network. Since version 0.9, Docker
includes the libcontainer library as its own way to directly use
virtualization facilities provided by the Linux kernel, in addi-
tion to using abstracted virtualization interfaces via libvirt, LXC
(Linux Containers) and systemd-nspawn.
As actions are done to a Docker base image, union filesystem
layers are created and documented, such that each layer fully
describes how to recreate an action. This strategy enables
Docker’s lightweight images, as only layer updates need to be
propagated (compared to full VMs, for example). See the main
article “OpenStack for HPC Cloud Environments” for further de-
tails.
ETL (“Extract, Transform, Load”)
A process in database usage and especially in data warehousing
that involves:
� Extracting data from outside sources
� Transforming it to fit operational needs (which can include
quality levels)
� Loading it into the end target (database or data warehouse)
The first part of an ETL process involves extracting the data from
the source systems. In many cases this is the most challenging
aspect of ETL, as extracting data correctly will set the stage for
how subsequent processes will go. Most data warehousing proj-
ects consolidate data from different source systems. Each sepa-
rate system may also use a different data organization/format.
Common data source formats are relational databases and flat
files, but may include non-relational database structures such
as Information Management System (IMS) or other data struc-
tures such as Virtual Storage Access Method (VSAM) or Indexed
Sequential Access Method (ISAM), or even fetching from outside
sources such as through web spidering or screen-scraping. The
streaming of the extracted data source and load on-the-fly to
the destination database is another way of performing ETL
when no intermediate data storage is required. In general, the
goal of the extraction phase is to convert the data into a single
format which is appropriate for transformation processing.
An intrinsic part of the extraction involves the parsing of ex-
tracted data, resulting in a check if the data meets an expected
pattern or structure. If not, the data may be rejected entirely or
in part.
The transform stage applies a series of rules or functions to the
extracted data from the source to derive the data for loading
into the end target. Some data sources will require very little or
even no manipulation of data. In other cases, one or more of the
following transformation types may be required to meet the
business and technical needs of the target database.
The load phase loads the data into the end target, usually the
data warehouse (DW). Depending on the requirements of the
organization, this process varies widely. Some data warehouses
may overwrite existing information with cumulative informa-
tion; extracted data is frequently updated on a daily, weekly
or monthly basis. Other DW (or even other parts of the same DW)
may add new data in a historicized form, for example, hourly.
To understand this, consider a DW that is required to maintain
sales records of the last year. Then, the DW will overwrite any
data that is older than a year with newer data. However, the
entry of data for any one year window will be made in a histo-
ricized manner. The timing and scope to replace or append are
strategic design choices dependent on the time available and
the business needs. More complex systems can maintain a his-
tory and audit trail of all changes to the data loaded in the DW.
As the load phase interacts with a database, the constraints de-
fined in the database schema – as well as in triggers activated
upon data load – apply (for example, uniqueness, referential in-
tegrity, mandatory fields), which also contribute to the overall
data quality performance of the ETL process.
FFTW (“Fastest Fourier Transform in the West”)
A software library for computing discrete Fourier transforms
(DFTs), developed by Matteo Frigo and Steven G. Johnson at the
Massachusetts Institute of Technology. FFTW is known as the
fastest free software implementation of the Fast Fourier trans-
form (FFT) algorithm (upheld by regular benchmarks). It can
compute transforms of real and complex-valued arrays of arbi-
trary size and dimension in O(n log n) time.
floating point standard (IEEE 754)
The most widely used standard for floating-point computa-
tion, followed by many hardware (CPU and FPU) and soft-
ware implementations. Many computer languages allow or re-
quire that some or all arithmetic be carried out using IEEE 754
quire that some or all arithmetic be carried out using IEEE 754
formats and operations. The current version is IEEE 754-2008,
which was published in August 2008; the original IEEE 754-1985
was published in 1985. The standard defines arithmetic formats,
interchange formats, rounding algorithms, operations, and ex-
ception handling. The standard also includes extensive recom-
mendations for advanced exception handling, additional opera-
tions (such as trigonometric functions), expression evaluation,
and for achieving reproducible results. The standard defines
single-precision, double-precision, as well as 128-bit quadru-
ple-precision floating point numbers. The 2008 revision also
defines the 2-byte (16-bit) half-precision number format.
FraunhoferFS (FhGFS)
A high-performance parallel file system from the Fraunhofer
Competence Center for High Performance Computing. Built on
scalable multithreaded core components with native → Infini-
Band support, file system nodes can serve → InfiniBand and
Ethernet (or any other TCP-enabled network) connections at the
same time and automatically switch to a redundant connection
path in case any of them fails. One of the most fundamental
concepts of FhGFS is the strict avoidance of architectural bottle-
necks. Striping file contents across multiple storage servers is
only one part of this concept. Another important aspect is the
distribution of file system metadata (e.g. directory information)
across multiple metadata servers. Large systems and metadata
intensive applications in general can greatly profit from the lat-
ter feature.
FhGFS requires no dedicated file system partition on the servers
– it uses existing partitions, formatted with any of the standard
Linux file systems, e.g. XFS or ext4. For larger networks, it is also
possible to create several distinct FhGFS file system partitions
with different configurations. FhGFS provides a coherent mode,
in which it is guaranteed that changes to a file or directory by
one client are always immediately visible to other clients.
Global Arrays (GA)
A library developed by scientists at Pacific Northwest National
Laboratory for parallel computing. GA provides a friendly API
for shared-memory programming on distributed-memory com-
puters for multidimensional arrays. The GA library is a predeces-
sor to the GAS (global address space) languages currently being
developed for high-performance computing. The GA toolkit has
additional libraries including a Memory Allocator (MA), Aggre-
gate Remote Memory Copy Interface (ARMCI), and functionality
for out-of-core storage of arrays (ChemIO). Although GA was ini-
tially developed to run with TCGMSG, a message passing library
that came before the → MPI standard (Message Passing Inter-
face), it is now fully compatible with → MPI. GA includes simple
matrix computations (matrix-matrix multiplication, LU solve)
and works with → ScaLAPACK. Sparse matrices are available but
the implementation is not optimal yet. GA was developed by Jar-
ek Nieplocha, Robert Harrison and R. J. Littlefield. The ChemIO li-
brary for out-of-core storage was developed by Jarek Nieplocha,
Robert Harrison and Ian Foster.
The GA library is incorporated into many quantum chemistry
packages, including NWChem, MOLPRO, UTChem, MOLCAS, and
TURBOMOLE. The GA toolkit is free software, licensed under a
self-made license.
Globus Toolkit
An open source toolkit for building computing grids developed
and provided by the Globus Alliance, currently at version 5.
GMP (“GNU Multiple Precision Arithmetic Library”)
A free library for arbitrary-precision arithmetic, operating on
signed integers, rational numbers, and floating point numbers.
There are no practical limits to the precision except the ones
implied by the available memory in the machine GMP runs on
(the operand dimension limit is 2^31 bits on 32-bit machines and 2^37
bits on 64-bit machines). GMP has a rich set of functions, and
the functions have a regular interface. The basic interface is
for C but wrappers exist for other languages including C++, C#,
OCaml, Perl, PHP, and Python. In the past, the Kaffe Java virtual
machine used GMP to support Java built-in arbitrary precision
arithmetic. This feature has been removed from recent releases,
causing protests from people who claim that they used Kaffe
solely for the speed benefits afforded by GMP. As a result, GMP
support has been added to GNU Classpath. The main target ap-
plications of GMP are cryptography applications and research,
Internet security applications, and computer algebra systems.
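The kind of arbitrary-precision arithmetic GMP provides to C programs can be illustrated with Python's built-in integers, which are likewise limited only by available memory (an analogy, not GMP's API):

```python
import math

# 100! overflows any fixed-width integer type; arbitrary-precision
# arithmetic represents it exactly.
n = math.factorial(100)
print(len(str(n)))  # → 158 (decimal digits)
```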
GotoBLAS
Kazushige Goto’s implementation of → BLAS.
grid (in CUDA architecture)
Part of the → CUDA programming model
GridFTP
An extension of the standard File Transfer Protocol (FTP) for use
with Grid computing. It is defined as part of the → Globus toolkit,
under the organisation of the Global Grid Forum (specifically,
by the GridFTP working group). The aim of GridFTP is to
provide a more reliable and high performance file transfer for
Grid computing applications. This is necessary because of the
increased demands of transmitting data in Grid computing – it is
frequently necessary to transmit very large files, and this needs
to be done fast and reliably. GridFTP is the answer to the prob-
lem of incompatibility between storage and access systems.
Previously, each data provider would make their data available
in their own specific way, providing a library of access func-
tions. This made it difficult to obtain data from multiple sources,
requiring a different access method for each, and thus dividing
the total available data into partitions. GridFTP provides a uni-
form way of accessing the data, encompassing functions from
all the different modes of access, building on and extending the
universally accepted FTP standard. FTP was chosen as a basis
for it because of its widespread use, and because it has a well
defined architecture for extensions to the protocol (which may
be dynamically discovered).
Hadoop
An open-source software framework written in Java for distrib-
uted storage and distributed processing of very large data sets
on computer clusters built from commodity hardware. All the
modules in Hadoop are designed with a fundamental assump-
tion that hardware failures are common and should be auto-
matically handled by the framework. See the main article “Big
Data Analytics with Apache Hadoop” for further details.
Hierarchical Data Format (HDF)
A set of file formats and libraries designed to store and orga-
nize large amounts of numerical data. Originally developed at
the National Center for Supercomputing Applications, it is cur-
rently supported by the non-profit HDF Group, whose mission
is to ensure continued development of HDF5 technologies, and
the continued accessibility of data currently stored in HDF. In
keeping with this goal, the HDF format, libraries and associated
tools are available under a liberal, BSD-like license for general
use. HDF is supported by many commercial and non-commer-
cial software platforms, including Java, MATLAB, IDL, and Py-
thon. The freely available HDF distribution consists of the li-
brary, command-line utilities, test suite source, Java interface,
and the Java-based HDF Viewer (HDFView). There currently exist
two major versions of HDF, HDF4 and HDF5, which differ signifi-
cantly in design and API.
HLSL (“High Level Shader Language”)
The High Level Shader Language or High Level Shading Lan-
guage (HLSL) is a proprietary shading language developed by
Microsoft for use with the Microsoft Direct3D API. It is analo-
gous to the GLSL shading language used with the OpenGL stan-
dard. It is very similar to the NVIDIA Cg shading language, as it
was developed alongside it.
HLSL programs come in three forms: vertex shaders, geometry
shaders, and pixel (or fragment) shaders. A vertex shader is ex-
ecuted for each vertex that is submitted by the application, and
is primarily responsible for transforming the vertex from object
space to view space, generating texture coordinates, and calcu-
lating lighting coefficients such as the vertex’s tangent, binor-
mal and normal vectors. When a group of vertices (normally 3,
to form a triangle) come through the vertex shader, their out-
put position is interpolated to form pixels within its area; this
process is known as rasterisation. Each of these pixels comes
through the pixel shader, whereby the resultant screen colour
is calculated. Optionally, an application using a Direct3D10 in-
terface and Direct3D10 hardware may also specify a geometry
shader. This shader takes as its input the three vertices of a tri-
angle and uses this data to generate (or tessellate) additional
triangles, which are each then sent to the rasterizer.
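The interpolation step of rasterisation can be sketched in a few lines: a pixel inside the triangle receives a blend of the per-vertex colours weighted by its barycentric coordinates. This is a simplified software model of what the hardware does, not HLSL itself:

```python
def barycentric(p, a, b, c):
    # Barycentric coordinates of point p in triangle (a, b, c),
    # via the standard area-ratio formulation.
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    d = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    u = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / d
    v = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / d
    return u, v, 1.0 - u - v

def interpolate(p, verts, colors):
    # Blend per-vertex colors at pixel p, as the rasterizer does
    # before handing the pixel to the pixel shader.
    w = barycentric(p, *verts)
    return tuple(sum(wi * col[i] for wi, col in zip(w, colors))
                 for i in range(3))

tri = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
rgb = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
# At the centroid all three vertex colors contribute equally.
print(interpolate((4 / 3, 4 / 3), tri, rgb))
```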
InfiniBand
Switched fabric communications link primarily used in HPC and
enterprise data centers. Its features include high throughput,
low latency, quality of service and failover, and it is designed to
be scalable. The InfiniBand architecture specification defines a
connection between processor nodes and high performance
I/O nodes such as storage devices. Like → PCI Express, and many
other modern interconnects, InfiniBand offers point-to-point
bidirectional serial links intended for the connection of pro-
cessors with high-speed peripherals such as disks. On top of
the point to point capabilities, InfiniBand also offers multicast
operations as well. It supports several signalling rates and, as
with PCI Express, links can be bonded together for additional
throughput.
The SDR serial connection’s signalling rate is 2.5 gigabit per sec-
ond (Gbit/s) in each direction per connection. DDR is 5 Gbit/s
and QDR is 10 Gbit/s. FDR is 14.0625 Gbit/s and EDR is 25.78125
Gbit/s per lane. For SDR, DDR and QDR, links use 8B/10B encod-
ing – every 10 bits sent carry 8 bits of data – making the use-
ful data transmission rate four-fifths the raw rate. Thus single,
double, and quad data rates carry 2, 4, or 8 Gbit/s useful data,
respectively. For FDR and EDR, links use 64B/66B encoding – ev-
ery 66 bits sent carry 64 bits of data.
Implementers can aggregate links in units of 4 or 12, called 4X or
12X. A 12X QDR link therefore carries 120 Gbit/s raw, or 96 Gbit/s
of useful data. As of 2009 most systems use a 4X aggregate, im-
plying a 10 Gbit/s (SDR), 20 Gbit/s (DDR) or 40 Gbit/s (QDR) con-
nection. Larger systems with 12X links are typically used for
cluster and supercomputer interconnects and for inter-switch
connections.
The single data rate switch chips have a latency of 200 nano-
seconds, DDR switch chips have a latency of 140 nanoseconds
and QDR switch chips have a latency of 100 nanoseconds. The
end-to-end MPI latency ranges from 1.07 to 2.6 microseconds,
depending on the host channel adapter.
As of 2009 various InfiniBand host channel adapters (HCA) exist
in the market, each with different latency and bandwidth char-
acteristics. InfiniBand also provides RDMA capabilities for low
CPU overhead. The latency for RDMA operations is less than 1
microsecond.
See the main article “InfiniBand” for a further description of
InfiniBand features.
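The encoding and aggregation arithmetic above is easy to check. The helper below is illustrative, not part of any InfiniBand API: useful throughput is the raw per-lane signalling rate times the lane count times the encoding efficiency.

```python
def useful_gbits(raw_per_lane, lanes, enc_bits=(8, 10)):
    # Useful data rate = raw rate × lanes × encoding efficiency.
    # SDR/DDR/QDR use 8B/10B encoding; FDR/EDR use 64B/66B.
    data, total = enc_bits
    return raw_per_lane * lanes * data / total

print(useful_gbits(10, 12))                  # 12X QDR → 96.0 Gbit/s useful
print(useful_gbits(25.78125, 4, (64, 66)))   # 4X EDR → 100.0 Gbit/s useful
```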
Intel Integrated Performance Primitives (Intel IPP)
A multi-threaded software library of functions for multimedia
and data processing applications, produced by Intel. The library
supports Intel and compatible processors and is available for
Windows, Linux, and Mac OS X operating systems. It is available
separately or as a part of Intel Parallel Studio. The library takes
advantage of processor features including MMX, SSE, SSE2,
SSE3, SSSE3, SSE4, AES-NI and multicore processors. Intel IPP is
divided into four major processing groups: Signal (with linear
array or vector data), Image (with 2D arrays for typical color
spaces), Matrix (with n x m arrays for matrix operations), and
Cryptography. Half the entry points are of the matrix type, a
third are of the signal type and the remainder are of the im-
age and cryptography types. Intel IPP functions are provided
for several data types, including 8u (8-bit unsigned), 8s (8-bit
signed), 16s, 32f (32-bit floating-point), and 64f. Typically, an
application developer works with only one dominant data type
for most processing functions, converting from input to
processing to output formats at the end points. Version 5.2 was
introduced June 5, 2007, adding code samples for data compres-
sion, new video codec support, support for 64-bit applications
on Mac OS X, support for Windows Vista, and new functions for
ray-tracing and rendering. Version 6.1 was released with the
Intel C++ Compiler on June 28, 2009 and Update 1 for version 6.1
was released on July 28, 2009.
Intel Threading Building Blocks (TBB)
A C++ template library developed by Intel Corporation for writ-
ing software programs that take advantage of multi-core pro-
cessors. The library consists of data structures and algorithms
that allow a programmer to avoid some complications arising
from the use of native threading packages such as POSIX →
threads, Windows → threads, or the portable Boost Threads in
which individual → threads of execution are created, synchro-
nized, and terminated manually. Instead the library abstracts
access to the multiple processors by allowing the operations
to be treated as “tasks”, which are allocated to individual cores
dynamically by the library’s run-time engine, and by automat-
ing efficient use of the CPU cache. A TBB program creates, syn-
chronizes and destroys graphs of dependent tasks according
to algorithms, i.e. high-level parallel programming paradigms
(a.k.a. Algorithmic Skeletons). Tasks are then executed respect-
ing graph dependencies. This approach groups TBB in a family
of solutions for parallel programming aiming to decouple the
programming from the particulars of the underlying machine.
Intel TBB is available commercially as a binary distribution with
support and in open source in both source and binary forms.
Version 4.0 was introduced on September 8, 2011.
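The task abstraction TBB offers in C++ can be mimicked with Python's concurrent.futures: work is expressed as tasks handed to a pool, and the runtime decides which worker runs what. This is an analogy for the programming model only; TBB's actual C++ API differs.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

# Tasks are submitted to a pool; the runtime maps them onto workers,
# much as TBB's scheduler maps tasks onto cores -- no manual thread
# creation, synchronization, or teardown.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(square, range(8)))
print(results)  # → [0, 1, 4, 9, 16, 25, 36, 49]
```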
iSER (“iSCSI Extensions for RDMA“)
A protocol that maps the iSCSI protocol over a network that
provides RDMA services (like → iWARP or → InfiniBand). This
permits data to be transferred directly into SCSI I/O buffers
without intermediate data copies. The Datamover Architecture
(DA) defines an abstract model in which the movement of data
between iSCSI end nodes is logically separated from the rest of
the iSCSI protocol. iSER is one Datamover protocol. The inter-
face between the iSCSI and a Datamover protocol, iSER in this
case, is called Datamover Interface (DI).
iWARP (“Internet Wide Area RDMA Protocol”)
An Internet Engineering Task Force (IETF) update of the RDMA
Consortium’s → RDMA over TCP standard. This later standard is
zero-copy transmission over legacy TCP. Because a kernel imple-
mentation of the TCP stack is a tremendous bottleneck, a few
vendors now implement TCP in hardware. This additional hard-
ware is known as the TCP offload engine (TOE). TOE itself does
not prevent copying on the receive side, and must be combined
with RDMA hardware for zero-copy results. The main compo-
nent is the Data Direct Protocol (DDP), which permits the ac-
tual zero-copy transmission. The transmission itself is not per-
formed by DDP, but by TCP.
SDR = Single Data Rate, DDR = Double Data Rate, QDR = Quadruple
Data Rate, FDR = Fourteen Data Rate, EDR = Enhanced Data Rate

InfiniBand useful throughput by link aggregation:

      SDR        DDR        QDR        FDR         EDR
1X    2 Gbit/s   4 Gbit/s   8 Gbit/s   14 Gbit/s   25 Gbit/s
4X    8 Gbit/s   16 Gbit/s  32 Gbit/s  56 Gbit/s   100 Gbit/s
12X   24 Gbit/s  48 Gbit/s  96 Gbit/s  168 Gbit/s  300 Gbit/s
JOIN
An SQL join clause combines records from two or more tables
in a relational database. It creates a set that can be saved as a
table or used as it is. A JOIN is a means for combining fields from
two tables (or more) by using values common to each. ANSI-
standard SQL specifies five types of JOIN: INNER, LEFT OUTER,
RIGHT OUTER, FULL OUTER and CROSS. As a special case, a table
(base table, view, or joined table) can JOIN to itself in a self-join.
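The JOIN types can be tried directly with Python's built-in sqlite3 module; the tables below are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emp  (name TEXT, dept_id INTEGER);
    INSERT INTO dept VALUES (1, 'HPC'), (2, 'Storage');
    INSERT INTO emp  VALUES ('Ada', 1), ('Grace', 1), ('Linus', NULL);
""")

# INNER JOIN keeps only rows with a match in both tables; LEFT OUTER
# JOIN also keeps employees without a department, padded with NULL.
inner = con.execute("""SELECT emp.name, dept.name FROM emp
                       JOIN dept ON emp.dept_id = dept.id
                       ORDER BY emp.name""").fetchall()
left = con.execute("""SELECT emp.name, dept.name FROM emp
                      LEFT JOIN dept ON emp.dept_id = dept.id
                      ORDER BY emp.name""").fetchall()
print(inner)  # → [('Ada', 'HPC'), ('Grace', 'HPC')]
print(left)   # → [('Ada', 'HPC'), ('Grace', 'HPC'), ('Linus', None)]
```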
Kernel (in CUDA architecture)
Part of the → CUDA programming model
LAM/MPI
A high-quality open-source implementation of the → MPI speci-
fication, including all of MPI-1.2 and much of MPI-2. Superseded
by the → OpenMPI implementation.
LAPACK (“linear algebra package”)
Routines for solving systems of simultaneous linear equations,
least-squares solutions of linear systems of equations, eigen-
value problems, and singular value problems. The original goal
of the LAPACK project was to make the widely used EISPACK and
→ LINPACK libraries run efficiently on shared-memory vector
and parallel processors. LAPACK routines are written so that as
much as possible of the computation is performed by calls to
the → BLAS library. While → LINPACK and EISPACK are based on
the vector operation kernels of the Level 1 BLAS, LAPACK was
designed at the outset to exploit the Level 3 BLAS. Highly effi-
cient machine-specific implementations of the BLAS are avail-
able for many modern high-performance computers. The BLAS
enable LAPACK routines to achieve high performance with
portable software.
layout
Part of the → parallel NFS standard. Currently three types of
layout exist: file-based, block/volume-based, and object-based,
the latter making use of → object-based storage devices
LINPACK
A collection of Fortran subroutines that analyze and solve linear
equations and linear least-squares problems. LINPACK was de-
signed for supercomputers in use in the 1970s and early 1980s.
LINPACK has been largely superseded by → LAPACK, which has
been designed to run efficiently on shared-memory, vector su-
percomputers. LINPACK makes use of the → BLAS libraries for
performing basic vector and matrix operations.
The LINPACK benchmarks are a measure of a system’s float-
ing point computing power and measure how fast a computer
solves a dense N by N system of linear equations Ax=b, which is a
common task in engineering. The solution is obtained by Gauss-
ian elimination with partial pivoting, with 2/3•N³ + 2•N² floating
point operations. The result is reported in millions of floating
point operations per second (MFLOP/s, sometimes simply called
MFLOPS).
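The benchmark's core computation — Gaussian elimination with partial pivoting — can be sketched in plain Python. This is a didactic version; the real benchmark uses blocked, highly optimized BLAS code:

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting, as in the LINPACK
    # benchmark (about 2/3·n³ floating point operations).
    n = len(A)
    A = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for k in range(n):
        # Partial pivoting: bring the largest |entry| into the pivot row.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        for i in range(k + 1, n):
            f = A[i][k] / A[k][k]
            for j in range(k, n + 1):
                A[i][j] -= f * A[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        x[i] = (A[i][n] - sum(A[i][j] * x[j]
                              for j in range(i + 1, n))) / A[i][i]
    return x

print(solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]))  # → [0.8, 1.4]
```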
LNET
Communication protocol in → Lustre
Logical object volume (LOV)
A logical entity in → Lustre
Lustre
An object-based → parallel file system
Management server (MGS)
A functional component in → Lustre
MapReduce
A framework for processing highly distributable problems
across huge datasets using a large number of computers
(nodes), collectively referred to as a cluster (if all nodes use the
same hardware) or a grid (if the nodes use different hardware).
Computational processing can occur on data stored either in a
filesystem (unstructured) or in a database (structured).
“Map” step: The master node takes the input, divides it into
smaller sub-problems, and distributes them to worker nodes. A
worker node may do this again in turn, leading to a multi-level
tree structure. The worker node processes the smaller problem,
and passes the answer back to its master node.
“Reduce” step: The master node then collects the answers to all
the sub-problems and combines them in some way to form the
output – the answer to the problem it was originally trying to
solve.
MapReduce allows for distributed processing of the map and
reduction operations. Provided each mapping operation is in-
dependent of the others, all maps can be performed in parallel
– though in practice it is limited by the number of independent
data sources and/or the number of CPUs near each source. Simi-
larly, a set of ‘reducers’ can perform the reduction phase - pro-
vided all outputs of the map operation that share the same key
are presented to the same reducer at the same time. While this
process can often appear inefficient compared to algorithms
that are more sequential, MapReduce can be applied to signifi-
cantly larger datasets than “commodity” servers can handle – a
large server farm can use MapReduce to sort a petabyte of data
in only a few hours. The parallelism also offers some possibility
of recovering from partial failure of servers or storage during
the operation: if one mapper or reducer fails, the work can be
rescheduled – assuming the input data is still available.
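The two steps condense into a toy word-count, the canonical MapReduce example. This is a single-process sketch; real frameworks run the same logic distributed across nodes:

```python
from collections import defaultdict

def map_step(document):
    # "Map": emit (key, value) pairs, one count per word occurrence.
    return [(word, 1) for word in document.split()]

def reduce_step(pairs):
    # "Reduce": combine all values that share the same key.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

docs = ["to be or not to be", "be fast"]
pairs = [p for d in docs for p in map_step(d)]  # maps run independently
print(reduce_step(pairs))
# → {'to': 2, 'be': 3, 'or': 1, 'not': 1, 'fast': 1}
```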
metadata server (MDS)
A functional component in → Lustre
metadata target (MDT)
A logical entity in → Lustre
MKL (“Math Kernel Library”)
A library of optimized, math routines for science, engineering,
and financial applications developed by Intel. Core math func-
tions include → BLAS, → LAPACK, → ScaLAPACK, Sparse Solvers,
Fast Fourier Transforms, and Vector Math. The library supports
Intel and compatible processors and is available for Windows,
Linux and Mac OS X operating systems.
MPI, MPI-2 (“message-passing interface”)
A language-independent communications protocol used to
program parallel computers. Both point-to-point and collective
communication are supported. MPI remains the dominant mod-
el used in high-performance computing today. There are two
versions of the standard that are currently popular: version 1.2
(shortly called MPI-1), which emphasizes message passing and
has a static runtime environment, and MPI-2.1 (MPI-2), which in-
cludes new features such as parallel I/O, dynamic process man-
agement and remote memory operations. MPI-2 specifies over
500 functions and provides language bindings for ANSI C, ANSI
Fortran (Fortran90), and ANSI C++. Interoperability of objects de-
fined in MPI was also added to allow for easier mixed-language
message passing programming. A side effect of MPI-2 stan-
dardization (completed in 1996) was clarification of the MPI-1
standard, creating the MPI-1.2 level. MPI-2 is mostly a superset
of MPI-1, although some functions have been deprecated. Thus
MPI-1.2 programs still work under MPI implementations com-
pliant with the MPI-2 standard. The MPI Forum reconvened in
2007, to clarify some MPI-2 issues and explore developments for
a possible MPI-3.
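Point-to-point message passing can be illustrated with a small sketch in which a queue plays the role of the interconnect and two threads stand in for two MPI ranks. This mimics the model only; the real MPI API (MPI_Send/MPI_Recv) and process launch differ:

```python
import threading
import queue

channel = queue.Queue()  # stands in for the interconnect

def rank0():
    channel.put({"tag": 0, "data": [1, 2, 3]})  # like a blocking send

def rank1(result):
    msg = channel.get()                          # like a blocking receive
    result.append(sum(msg["data"]))

result = []
t0 = threading.Thread(target=rank0)
t1 = threading.Thread(target=rank1, args=(result,))
t1.start(); t0.start()
t0.join(); t1.join()
print(result)  # → [6]
```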
MPICH2
A freely available, portable → MPI 2.0 implementation,
maintained by Argonne National Laboratory
MPP (“massively parallel processing”)
So-called MPP jobs are computer programs with several parts
running on several machines in parallel, often calculating simu-
lation problems. The communication between these parts can,
for example, be realized by the → MPI software interface.
MS-MPI
Microsoft → MPI 2.0 implementation shipped with Microsoft
HPC Pack 2008 SDK, based on and designed for maximum com-
patibility with the → MPICH2 reference implementation.
MVAPICH2
An → MPI 2.0 implementation based on → MPICH2 and devel-
oped by the Department of Computer Science and Engineer-
ing at Ohio State University. It is available under BSD licens-
ing and supports MPI over InfiniBand, 10GigE/iWARP and
RDMAoE.
NetCDF (“Network Common Data Form”)
A set of software libraries and self-describing, machine-inde-
pendent data formats that support the creation, access, and
sharing of array-oriented scientific data. The project homep-
age is hosted by the Unidata program at the University Cor-
poration for Atmospheric Research (UCAR). They are also the
chief source of NetCDF software, standards development,
updates, etc. The format is an open standard. NetCDF Clas-
sic and 64-bit Offset Format are an international standard
of the Open Geospatial Consortium. The project is actively
supported by UCAR. The recently released (2008) version 4.0
greatly enhances the data model by allowing the use of the
→ HDF5 data file format. Version 4.1 (2010) adds support for
C and Fortran client access to specified subsets of remote
data via OPeNDAP. The format was originally based on the
conceptual model of the NASA CDF but has since diverged
and is not compatible with it. It is commonly used in clima-
tology, meteorology and oceanography applications (e.g.,
weather forecasting, climate change) and GIS applications. It
is an input/output format for many GIS applications, and for
general scientific data exchange. The NetCDF C library, and
the libraries based on it (Fortran 77 and Fortran 90, C++, and
all third-party libraries) can, starting with version 4.1.1, read
some data in other data formats. Data in the → HDF5 format
can be read, with some restrictions. Data in the → HDF4 for-
mat can be read by the NetCDF C library if created using the
→ HDF4 Scientific Data (SD) API.
NetworkDirect
A remote direct memory access (RDMA)-based network interface
implemented in Windows Server 2008 and later. NetworkDirect
uses a more direct path from → MPI applications to networking
hardware, resulting in very fast and efficient networking. See the
main article “Windows HPC Server 2008 R2” for further details.
NFS (Network File System)
A network file system protocol originally developed by Sun Mi-
crosystems in 1984, allowing a user on a client computer to ac-
cess files over a network in a manner similar to how local stor-
age is accessed. NFS, like many other protocols, builds on the
Open Network Computing Remote Procedure Call (ONC RPC)
system. The Network File System is an open standard defined in
RFCs, allowing anyone to implement the protocol.
Sun used version 1 only for in-house experimental purposes.
When the development team added substantial changes to NFS
version 1 and released it outside of Sun, they decided to release
the new version as V2, so that version interoperation and RPC
version fallback could be tested. Version 2 of the protocol (de-
fined in RFC 1094, March 1989) originally operated entirely over
UDP. Its designers meant to keep the protocol stateless, with
locking (for example) implemented outside of the core protocol.
Version 3 (RFC 1813, June 1995) added:
– support for 64-bit file sizes and offsets, to handle files larger
than 2 gigabytes (GB)
– support for asynchronous writes on the server, to improve
write performance
– additional file attributes in many replies, to avoid the need
to re-fetch them
– a READDIRPLUS operation, to get file handles and attributes
along with file names when scanning a directory
– assorted other improvements
At the time of introduction of Version 3, vendor support for TCP
as a transport-layer protocol began increasing. While several
vendors had already added support for NFS Version 2 with TCP
as a transport, Sun Microsystems added support for TCP as a
transport for NFS at the same time it added support for Version 3.
Using TCP as a transport made using NFS over a WAN more fea-
sible.
Version 4 (RFC 3010, December 2000; revised in RFC 3530, April
2003), influenced by AFS and CIFS, includes performance im-
provements, mandates strong security, and introduces a state-
ful protocol. Version 4 became the first version developed with
the Internet Engineering Task Force (IETF) after Sun Microsys-
tems handed over the development of the NFS protocols.
NFS version 4 minor version 1 (NFSv4.1) was approved by the
IESG and received an RFC number in January 2010. The NFSv4.1
specification aims to provide protocol support for clustered
server deployments, including the ability to provide scalable
parallel access to files distributed among multiple servers.
NFSv4.1 adds the parallel NFS (pNFS) capability, which enables
data access parallelism. The NFSv4.1 protocol
defines a method of separating the filesystem meta-data from
the location of the file data; it goes beyond the simple name/
data separation by striping the data amongst a set of data serv-
ers. This is different from the traditional NFS server which holds
the names of files and their data under the single umbrella of
the server.
In addition to pNFS, NFSv4.1 provides sessions, directory
delegation and notifications, a multi-server namespace, access
control lists (ACL/SACL/DACL), retention attributes, and
SECINFO_NO_NAME. See the main article “Parallel Filesystems” for
further details.
Work is underway on a draft for a future version 4.2 of the
NFS standard, including so-called federated file-
systems, which constitute the NFS counterpart of Microsoft’s
distributed filesystem (DFS).
NoSQL
A NoSQL database (originally “non SQL” or “non relational”)
provides a mechanism for storage and retrieval of data
which is modeled in means other than the tabular relations used
in relational databases. Such databases have existed since the
late 1960s, but did not obtain the “NoSQL” moniker until a surge
of popularity in the early twenty-first century.
Motivations for this approach include: simplicity of design, sim-
pler “horizontal” scaling to clusters of machines (which is a prob-
lem for relational databases), and finer control over availability.
The data structures used by NoSQL databases (e.g. key-value,
wide column, graph, or document) are different from those used
by default in relational databases, making some operations fast-
er in NoSQL. The particular suitability of a given NoSQL database
depends on the problem it must solve. Sometimes the data struc-
tures used by NoSQL databases are also viewed as “more flex-
ible” than relational database tables.
NoSQL databases are increasingly used in big data and real-time
web applications. NoSQL systems are also sometimes called “Not
only SQL” to emphasize that they may support SQL-like query lan-
guages.
Many NoSQL stores compromise consistency (in the sense of the
CAP theorem) in favor of availability, partition tolerance, and
speed. Barriers to the greater adoption of NoSQL stores include
the use of low-level query languages instead of SQL (for in-
stance, the inability to perform ad-hoc → JOINs across tables),
the lack of standardized interfaces, and huge previous invest-
ments in existing relational databases.
Instead, most NoSQL databases offer a concept of “eventual
consistency” in which database changes are propagated to all
nodes “eventually” (typically within milliseconds) so queries for
data might not return updated data immediately or might result
in reading data that is not accurate, a problem known as stale
reads. Additionally, some NoSQL systems may exhibit lost writes
and other forms of data loss. Fortunately, some NoSQL systems
provide concepts such as write-ahead logging to avoid data loss.
For distributed transaction processing across multiple databas-
es, data consistency is an even bigger challenge that is difficult
for both NoSQL and relational databases. Even current relational
databases do not allow referential integrity constraints to span
databases. There are few systems that maintain both → ACID
transactions and X/Open XA standards for distributed transac-
tion processing.
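Eventual consistency and stale reads can be simulated with a toy replicated key-value store. The class and method names are hypothetical; real systems replicate asynchronously over the network rather than on demand:

```python
class Replica:
    def __init__(self):
        self.data = {}

class Store:
    # A write lands on one replica first; the others see it only after
    # replication runs -- until then, reads from them are stale.
    def __init__(self, n):
        self.replicas = [Replica() for _ in range(n)]

    def write(self, key, value):
        self.replicas[0].data[key] = value

    def read(self, key, replica):
        return self.replicas[replica].data.get(key)

    def replicate(self):
        for r in self.replicas[1:]:
            r.data.update(self.replicas[0].data)

store = Store(3)
store.write("user:1", "Ada")
print(store.read("user:1", 2))  # → None (stale read)
store.replicate()
print(store.read("user:1", 2))  # → Ada (eventually consistent)
```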
NUMA (“non-uniform memory access”)
A computer memory design used in multiprocessors, where the
memory access time depends on the memory location relative
to a processor. Under NUMA, a processor can access its own local
memory faster than non-local memory, that is, memory local to
another processor or memory shared between processors.
object storage server (OSS)
A functional component in → Lustre
object storage target (OST)
A logical entity in → Lustre
object-based storage device (OSD)
An intelligent evolution of disk drives that can store and serve
objects rather than simply place data on tracks and sectors.
This task is accomplished by moving low-level storage func-
tions into the storage device and accessing the device through
an object interface. Unlike a traditional block-oriented device
providing access to data organized as an array of unrelated
blocks, an object store allows access to data by means of stor-
age objects. A storage object is a virtual entity that groups data
together that has been determined by the user to be logically
related. Space for a storage object is allocated internally by the
OSD itself instead of by a host-based file system. OSDs man-
age all necessary low-level storage, space management, and
security functions. Because there is no host-based metadata
for an object (such as inode information), the only way for an
application to retrieve an object is by using its object identifier
(OID). The SCSI interface was modified and extended by the OSD
Technical Work Group of the Storage Networking Industry Asso-
ciation (SNIA) with varied industry and academic contributors,
resulting in a draft standard to T10 in 2004. This standard was
ratified in September 2004 and became the ANSI T10 SCSI OSD
V1 command set, released as INCITS 400-2004. The SNIA group
continues to work on further extensions to the interface, such
as the ANSI T10 SCSI OSD V2 command set.
OLAP cube (“Online Analytical Processing”)
A set of data, organized in a way that facilitates non-predeter-
mined queries for aggregated information, or in other words,
online analytical processing. OLAP is one of the computer-
based techniques for analyzing business data that are collec-
tively called business intelligence. OLAP cubes can be thought
of as extensions to the two-dimensional array of a spreadsheet.
For example a company might wish to analyze some financial
data by product, by time-period, by city, by type of revenue and
cost, and by comparing actual data with a budget. These addi-
tional methods of analyzing the data are known as dimensions.
Because there can be more than three dimensions in an OLAP
system the term hypercube is sometimes used.
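Aggregation along chosen dimensions — the core cube operation — can be sketched with plain dictionaries. The sales records below are invented illustration data:

```python
from collections import defaultdict

sales = [
    {"product": "HPC node", "city": "Tübingen", "year": 2016, "revenue": 120},
    {"product": "HPC node", "city": "Berlin",   "year": 2016, "revenue": 80},
    {"product": "Storage",  "city": "Tübingen", "year": 2017, "revenue": 50},
]

def rollup(rows, dims, measure):
    # Aggregate a measure along an arbitrary subset of dimensions,
    # like slicing and rolling up an OLAP cube.
    out = defaultdict(int)
    for row in rows:
        out[tuple(row[d] for d in dims)] += row[measure]
    return dict(out)

print(rollup(sales, ("product",), "revenue"))
# → {('HPC node',): 200, ('Storage',): 50}
print(rollup(sales, ("city", "year"), "revenue"))
```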
OpenCL (“Open Computing Language”)
A framework for writing programs that execute across hetero-
geneous platforms consisting of CPUs, GPUs, and other proces-
sors. OpenCL includes a language (based on C99) for writing ker-
nels (functions that execute on OpenCL devices), plus APIs that
are used to define and then control the platforms. OpenCL pro-
vides parallel computing using task-based and data-based par-
allelism. OpenCL is analogous to the open industry standards
OpenGL and OpenAL, for 3D graphics and computer audio,
respectively. Originally developed by Apple Inc., which holds
trademark rights, OpenCL is now managed by the non-profit
technology consortium Khronos Group.
OpenMP (“Open Multi-Processing”)
An application programming interface (API) that supports
multi-platform shared memory multiprocessing programming
in C, C++ and Fortran on many architectures, including Unix and
Microsoft Windows platforms. It consists of a set of compiler di-
rectives, library routines, and environment variables that influ-
ence run-time behavior.
Jointly defined by a group of major computer hardware and
software vendors, OpenMP is a portable, scalable model that
gives programmers a simple and flexible interface for develop-
ing parallel applications for platforms ranging from the desk-
top to the supercomputer.
An application built with the hybrid model of parallel program-
ming can run on a computer cluster using both OpenMP and
Message Passing Interface (MPI), or more transparently through
the use of OpenMP extensions for non-shared memory systems.
OpenMPI
An open source → MPI-2 implementation that is developed and
maintained by a consortium of academic, research, and indus-
try partners.
OpenStack
A free and open-source software platform for cloud computing,
mostly deployed as an infrastructure-as-a-service (IaaS). The
software platform consists of interrelated components that
control hardware pools of processing, storage, and networking
resources throughout a data center. Users manage it through
a web-based dashboard, command-line tools, or a RESTful API.
The software is released under the terms of the Apache License.
OpenStack began in 2010 as a joint project of Rackspace Host-
ing and NASA. As of 2016 it is managed by the OpenStack Foun-
dation, a non-profit corporate entity established in September
2012 to promote OpenStack software and its community.
The OpenStack community collaborates around a six-month,
time-based release cycle with frequent development mile-
stones. During the planning phase of each release, the commu-
nity gathers for an OpenStack Design Summit to facilitate de-
veloper working sessions and to assemble plans. See the main
article “OpenStack for HPC Cloud Environments” for further
details.
parallel NFS (pNFS)
A → parallel file system standard, optional part of the current →
NFS standard 4.1. See the main article “Parallel Filesystems” for
further details.
PCI Express (PCIe)
A computer expansion card standard designed to replace the
older PCI, PCI-X, and AGP standards. Introduced by Intel in 2004,
PCIe (or PCI-E, as it is commonly called) is the latest standard
for expansion cards available on mainstream computers. Unlike
previous PC expansion standards, PCIe is structured around
point-to-point serial links rather than a shared parallel bus;
a pair of these links (one in each direction) makes up a lane.
Lanes  PCIe 1.x  PCIe 2.x  PCIe 3.0  PCIe 4.0
x1     256 MB/s  512 MB/s  1 GB/s    2 GB/s
x2     512 MB/s  1 GB/s    2 GB/s    4 GB/s
x4     1 GB/s    2 GB/s    4 GB/s    8 GB/s
x8     2 GB/s    4 GB/s    8 GB/s    16 GB/s
x16    4 GB/s    8 GB/s    16 GB/s   32 GB/s
x32    8 GB/s    16 GB/s   32 GB/s   64 GB/s
These lanes are routed by a hub on the mainboard acting as
a crossbar switch. This dynamic point-to-point behavior allows
more than one pair of devices to communicate with each other
at the same time. In contrast, older PC interfaces had all devices
permanently wired to the same bus; therefore, only one device
could send information at a time. This format also allows “chan-
nel grouping”, where multiple lanes are bonded to a single de-
vice pair in order to provide higher bandwidth. The number of
lanes is “negotiated” during power-up or explicitly during op-
eration. By making the lane count flexible a single standard can
provide for the needs of high-bandwidth cards (e.g. graphics
cards, 10 Gigabit Ethernet cards and multiport Gigabit Ethernet
cards) while also being economical for less demanding cards.
Unlike preceding PC expansion interface standards, PCIe is a
network of point-to-point connections. This removes the need
for “arbitrating” the bus or waiting for the bus to be free and
allows for full duplex communications. This means that while
standard PCI-X (133 MHz 64 bit) and PCIe x4 have roughly the
same data transfer rate, PCIe x4 will give better performance if
multiple device pairs are communicating simultaneously or if
communication within a single device pair is bidirectional.
Specifications of the format are maintained and developed by a
group of more than 900 industry-leading companies called the
PCI-SIG (PCI Special Interest Group). In PCIe 1.x, each lane carries
approximately 250 MB/s in each direction. PCIe 2.0, released in
late 2007, added a Gen2 signalling mode, doubling the rate to
about 500 MB/s. On November 18, 2010, the PCI-SIG published the
finalized PCI Express 3.0 specification to its members for
building devices based on this new version, which allows a
Gen3 signalling mode at about 1 GB/s per lane.
On November 29, 2011, the PCI-SIG announced PCI Express 4.0,
featuring 16 GT/s while staying on copper technology.
Additionally, active and idle power optimizations are to be
investigated. The final specification, originally expected for
2014/15, had not yet been released at the time of writing.
PETSc (“Portable, Extensible Toolkit for Scientific Computation”)
A suite of data structures and routines for the scalable (parallel)
solution of scientific applications modeled by partial differential
equations. It employs the → Message Passing Interface (MPI)
standard for all message-passing communication. PETSc 3.2 was
released on September 8, 2011. PETSc is intended for use in
large-scale application projects; many ongoing computational
science projects are built around the PETSc libraries.
Its careful design allows advanced users to have detailed control
over the solution process. PETSc includes a large suite of parallel
linear and nonlinear equation solvers that are easily used in ap-
plication codes written in C, C++, Fortran and now Python. PETSc
provides many of the mechanisms needed within parallel appli-
cation code, such as simple parallel matrix and vector assembly
routines that allow the overlap of communication and computa-
tion. In addition, PETSc includes support for parallel distributed
arrays useful for finite difference methods.
process
→ thread
PTX (“parallel thread execution”)
Parallel Thread Execution (PTX) is a pseudo-assembly language
used in NVIDIA’s CUDA programming environment. The ‘nvcc’
compiler translates code written in CUDA, a C-like language, into
PTX, and the graphics driver contains a compiler which translates
the PTX into something which can be run on the processing cores.
RDMA (“remote direct memory access”)
Allows data to move directly from the memory of one computer
into that of another without involving either one‘s operating
system. This permits high-throughput, low-latency networking,
which is especially useful in massively parallel computer clusters.
RDMA extends direct memory access (DMA) across the network. RDMA
supports zero-copy networking by enabling the network adapt-
er to transfer data directly to or from application memory, elimi-
nating the need to copy data between application memory and
the data buffers in the operating system. Such transfers require
no work to be done by CPUs, caches, or context switches, and
transfers continue in parallel with other system operations.
When an application performs an RDMA Read or Write request,
the application data is delivered directly to the network, reduc-
ing latency and enabling fast message transfer. Common RDMA
implementations include → InfiniBand, → iSER, and → iWARP.
RISC (“reduced instruction-set computer”)
A CPU design strategy emphasizing the insight that simplified
instructions that “do less“ may still provide for higher
performance if this simplicity can be utilized to make
instructions execute very quickly. Opposite: → CISC
ScaLAPACK (“scalable LAPACK”)
Library including a subset of → LAPACK routines redesigned
for distributed memory MIMD (multiple instruction, multiple
data) parallel computers. It is currently written in a Single-
Program-Multiple-Data style using explicit message passing
for interprocessor communication. ScaLAPACK is designed for
heterogeneous computing and is portable on any computer
that supports → MPI. The fundamental building blocks of the
ScaLAPACK library are distributed memory versions (PBLAS) of
the Level 1, 2 and 3 → BLAS, and a set of Basic Linear Algebra
Communication Subprograms (BLACS) for communication tasks
that arise frequently in parallel linear algebra computations. In
the ScaLAPACK routines, all interprocessor communication oc-
curs within the PBLAS and the BLACS. One of the design goals of
ScaLAPACK was to have the ScaLAPACK routines resemble their
→ LAPACK equivalents as much as possible.
service-oriented architecture (SOA)
An approach to building distributed, loosely coupled applica-
tions in which functions are separated into distinct services
that can be distributed over a network, combined, and reused.
See the main article “Windows HPC Server 2008 R2” for further
details.
single precision/double precision
→ floating point standard
SMP (“shared memory processing”)
So-called SMP jobs are computer programs with several
parts running on the same system and accessing a shared
memory region. SMP jobs are usually implemented as
→ multi-threaded programs. The communication between the single
threads can be realized, for example, via the → OpenMP software
interface standard, but also in a non-standard way by means of
native UNIX interprocess communication mechanisms.
SMP (“symmetric multiprocessing”)
A multiprocessor or multicore computer architecture where two
or more identical processors or cores can connect to a single
shared main memory in a completely symmetric way, i.e. each
part of the main memory has the same distance to each of the
cores. Opposite: → NUMA
storage access protocol
Part of the → parallel NFS standard
STREAM
A simple synthetic benchmark program that measures sustain-
able memory bandwidth (in MB/s) and the corresponding com-
putation rate for simple vector kernels.
streaming multiprocessor (SM)
Hardware component within the → Tesla GPU series
subnet manager
Application responsible for configuring the local → InfiniBand
subnet and ensuring its continued operation.
superscalar processors
A superscalar CPU architecture implements a form of parallel-
ism called instruction-level parallelism within a single proces-
sor. It thereby allows faster CPU throughput than would other-
wise be possible at the same clock rate. A superscalar processor
executes more than one instruction during a clock cycle by si-
multaneously dispatching multiple instructions to redundant
functional units on the processor. Each functional unit is not a
separate CPU core but an execution resource within a single CPU
such as an arithmetic logic unit, a bit shifter, or a multiplier.
Tesla
NVIDIA’s third brand of GPUs, based on high-end GPUs from the
G80 generation onward. Tesla is NVIDIA’s first dedicated General Purpose
GPU. Because of the very high computational power (measured
in floating point operations per second or FLOPS) compared to
recent microprocessors, the Tesla products are intended for the
HPC market. The primary function of Tesla products is to aid
in simulations, large scale calculations (especially floating-point
calculations), and image generation for professional and scien-
tific fields, with the use of → CUDA. See the main article “NVIDIA
GPU Computing” for further details.
thread
A thread of execution is a fork of a computer program into two
or more concurrently running tasks. The implementation of
threads and processes differs from one operating system to an-
other, but in most cases, a thread is contained inside a process.
On a single processor, multithreading generally occurs by multi-
tasking: the processor switches between different threads. On
a multiprocessor or multi-core system, the threads or tasks will
generally run at the same time, with each processor or core run-
ning a particular thread or task. Threads are distinguished from
processes in that processes are typically independent, while
threads exist as subsets of a process. Whereas processes have
separate address spaces, threads share their address space,
which makes inter-thread communication much easier than
classical inter-process communication (IPC).
thread (in CUDA architecture)
Part of the → CUDA programming model
thread block (in CUDA architecture)
Part of the → CUDA programming model
thread processor array (TPA)
Hardware component within the → Tesla GPU series
10 Gigabit Ethernet
An Ethernet standard first published in 2002 as IEEE Std
802.3ae-2002, defining a version of Ethernet with a nominal data
rate of 10 Gbit/s, ten times as fast as Gigabit Ethernet.
Over the years several 802.3 standards relating to 10GbE
have been published, which later were consolidated into the
IEEE 802.3-2005 standard. IEEE 802.3-2005 and the other amend-
ments have been consolidated into IEEE Std 802.3-2008. 10
Gigabit Ethernet supports only full duplex links which can be
connected by switches. Half Duplex operation and CSMA/CD
(carrier sense multiple access with collision detect) are not sup-
ported in 10GbE. The 10 Gigabit Ethernet standard encompasses
a number of different physical layer (PHY) standards. In 2008,
10 Gigabit Ethernet was still an emerging technology, with only
1 million ports shipped in 2007; it has since seen widespread
commercial acceptance.
warp (in CUDA architecture)
Part of the → CUDA programming model
Always keep in touch with the latest news. Visit us on the Web!
Here you will find comprehensive information about HPC, IT solutions for the datacenter, services and
high-performance, efficient IT systems.
Subscribe to our technology journals, E-News or the transtec newsletter and always stay up to date.
www.transtec.de/en/hpc
transtec Germany
Tel +49 (0) 7121/2678 - 400
www.transtec.de
transtec Switzerland
Tel +41 (0) 44 / 818 47 00
www.transtec.ch
transtec United Kingdom
Tel +44 (0) 1295 / 756 500
www.transtec.co.uk
ttec Netherlands
Tel +31 (0) 24 34 34 210
www.ttec.nl
transtec France
Tel +33 (0) 3.88.55.16.00
www.transtec.fr
Texts and concept:
Layout and design:
Dr. Oliver Tennert | [email protected]
Silke Lerche | [email protected]
Fanny Schwarz | [email protected]
Sebastian Friedmann | [email protected]
© transtec AG, March 2016
The graphics, diagrams and tables found herein are the intellectual property of transtec AG and may be reproduced or published only with its express permission.
No responsibility will be assumed for inaccuracies or omissions. Other names or logos may be trademarks of their respective owners.