
Technical Report

Optimizing Standard Cell Library Characterization with Cadence Virtuoso Liberate and NetApp Clustered Data ONTAP 8.2

Bikash Roy Choudhury, NetApp

Harsh Vardhan, Cadence Design Systems

Rajnish Gupta, Cadence Design Systems

February 2014 | TR-4270

Abstract

Cell library characterization is an integral part of the chip design and manufacturing process. Standard cell libraries are developed and distributed either by a library vendor who uses the process and device characteristics and electrical information provided by the foundries, or by the foundries themselves. Large fabless semiconductor companies may choose to design and characterize their own libraries for business and competitive reasons as well. The characterization step provides timing, power, and performance behavior across a wide range of voltage, temperature, and process variations that make up the electrical behavior and specifications of the device that is being designed.

A foundry manufactures only the physical system on a chip and does not get involved in any end-use application design unless it is creating a functional block that is intended for reuse and licensing by its customers, who will use it as a preverified subcircuit. During the design process, the layout is checked against the schematics of the chip, which must be modeled according to the standard requirements of all the common foundries.

The Cadence® Virtuoso® Liberate™ tool is used to characterize and model the I/O of the standard cell libraries during the design phase. Substantial distributed computing power is required to simulate and process the input data, held on a shared storage infrastructure, into a single database file that is used later in the place-and-route phase.

The NetApp® scale-out architecture with the NetApp clustered Data ONTAP® 8.2 operating system provides all the storage efficiencies for the Virtuoso Liberate files in the characterization process. It also improves job completion times by up to 15% with NFSv4.1/pNFS and with adequate storage optimization and best practices. This report highlights performance tuning on the NetApp storage and on the Linux® clients in the compute farm that improves application license efficiency and time to market. These optimizations do not change the cell characterization workflow, and they have very little impact on the existing infrastructure.


TABLE OF CONTENTS

1 Introduction
2 Target Audience and Objectives
3 Virtuoso Liberate Cell Library Characterization
  3.1 Virtuoso Liberate Cell Library Characterization Workflow—Arc Flow
  3.2 Virtuoso Liberate Tool in a Clustered Data ONTAP 8.2 Environment
4 Clustered Data ONTAP 8.2 for Cell Library Characterization Workloads
  4.1 Performance
  4.2 High Availability and Reliability
  4.3 Capacity
  4.4 Storage Efficiency
  4.5 Agile Infrastructure
  4.6 Data Protection
  4.7 Manageability
  4.8 Cost
5 Cadence Virtuoso Liberate Validation with NetApp Clustered Data ONTAP 8.2
  5.1 Performance Validation Objectives
  5.2 Virtuoso Liberate Test Lab Details
  5.3 Test Plan
  5.4 Virtuoso Liberate Lab Test Results
  5.5 Virtuoso Liberate Performance Test Observations
6 Best Practices for Virtuoso Liberate Tool with Clustered Data ONTAP 8.2
  6.1 Storage Cluster Node Architecture
  6.2 Storage Cluster Node Sizing and Optimization
  6.3 File-System Optimization
  6.4 Storage Network Optimization
  6.5 Flash Cache Optimization
  6.6 Network File System (NFSv3) Optimization
  6.7 Parallel Network File System (pNFS)
7 Other Features in Clustered Data ONTAP 8.2 for Virtuoso Liberate Workloads
  7.1 SnapVault
  7.2 Storage QoS
  7.3 Nondisruptive Operation (NDO)
8 Compute Farm Optimization
  8.1 RHEL 6.5 Clients in the Compute Farm
  8.2 Best Practices for Compute Nodes
9 Summary
10 Conclusion

LIST OF FIGURES

Figure 1) Cell library characterization overview.
Figure 2) Cell library characterization workflow.
Figure 3) Cell library characterization with clustered Data ONTAP in a data center.
Figure 4) Workload balancing for cell library characterization.
Figure 5) Virtuoso Liberate performance test results.
Figure 6) Clustered Data ONTAP logical stack layout.
Figure 7) pNFS implementation.


1 Introduction

Standard cells consist of preconfigured and laid-out functional block elements that each provide a particular operation (such as a two-input AND function, whose output is true only if both inputs are true). These cells can be simple or complex digital gates, flip-flops, and I/O cells. They are called standard cells because they typically must conform to a standard height (although their width can vary) to simplify the task of placing them on the chip, to allow the place-and-route software to connect them automatically, and to provide the most efficient packing density. These cells are designed for specific functions and usually come in libraries, each of which consists of a few hundred to a few thousand different cells.

Standard cell libraries are required for the following reasons:

- Design complexity keeps increasing because of area and yield optimization. Standardization of cell libraries by the foundries reduces manufacturing costs and enhances yields.
- Fully custom chip design is no longer practical. Standard cell libraries expedite the chip design process through logical cell layout models.
- Standardization allows efficient automated routing and connectivity among the cells for maximum packing density and improved performance. The optimized wire lengths prevent performance slowdown caused by poor connections or routing.

The characterization step provides timing, power, and performance behavior across a wide range of voltage, temperature, and process variations that make up the electrical behavior and specifications of the device being designed. Standard cell library characterization is an important component in the chip design cycle for the following reasons:

- Functionality behavior; electrical characteristics extraction; and simulation of gates, flip-flops, and so on, in a chip must be fast and simple.
- The variability of performance across a wide range of processes, voltages, and temperatures must be analyzed and accounted for to guarantee performance specifications.
- Accurate timing and power analysis is essential to guarantee device behavior under different operating conditions.

The Cadence Virtuoso Liberate tool is designed to quickly generate the timing, noise, and power profiles of the individual gates, which allows analysis of timing and power behavior at the chip level. This requires many input files to simulate the library layout models and generates many output files, depending on the number of cell libraries characterized for various functions. The Virtuoso Liberate tool uses a network of compute nodes to access files of various sizes from a shared storage infrastructure such as NetApp over a file-based protocol such as Network File System (NFS).

The NetApp clustered file system in Data ONTAP 8.2 offers scale-up and scale-out storage architectures that provide storage for the large and complex cell library characterizations that are required by chip designers. It addresses the growing storage needs of these customers while efficiently handling the different workloads that are generated during the entire cell library characterization workflow. NetApp clustered Data ONTAP 8.2 provides the following key drivers to shorten the chip design process with a faster time to market and improved return on investment (ROI):

- Performance
- High availability and reliability
- Capacity
- Storage efficiency
- Agile infrastructure
- Data protection
- Manageability
- Low cost


2 Target Audience and Objectives

The Virtuoso Liberate tool is one of the most popular cell characterization tools used by chip manufacturers. This technical paper is intended for cell library engineers, storage administrators, and architects. It provides the following information:

- Best practices and sizing required with clustered Data ONTAP 8.2 to support the performance, capacity, availability, and manageability requirements for Virtuoso Liberate workloads
- How to use NetApp's scale-out clustered file system solution for the Virtuoso Liberate application and how to validate the performance improvements during the cell library characterization process

3 Virtuoso Liberate Cell Library Characterization

Virtuoso Liberate cell library characterization tools are critical in the chip design process because they provide library views for signal integrity, timing, and power analysis for cell layout on a chip. During this characterization process, Simulation Program with Integrated Circuit Emphasis (SPICE) netlists that contain connectivity information about a cell are provided as input for each of the cells in the library, and the output load is validated and analyzed for signal integrity (noise), timing, and power. Figure 1 provides an overview of the input and output parameters during a cell characterization process.

Figure 1) Cell library characterization overview.

3.1 Virtuoso Liberate Cell Library Characterization Workflow—Arc Flow

An arc is a path from the input to the output pin of a cell. Many cell libraries are modeled with noise, timing, and power arcs to test specific functions in a chip design. The Virtuoso Liberate tool creates, characterizes, and validates the noise, timing, and leakage arcs in the cell libraries. The tool works in three phases to achieve this task. Figure 2 illustrates the phases of the cell characterization process.

Figure 2) Cell library characterization workflow.

- Pre-analysis phase: The master node first reads all the SPICE netlists and other sources of input. It then spawns the slave nodes in the compute farm and starts to submit the jobs. The slaves read the data from the volume located on the NetApp storage. During this phase, additional data structures are generated, and the results are reported to the master node.
- Analysis phase: In this phase, the slave nodes read all the .spi, .sp, and .eldo input files from the NetApp storage. Every slave node in the compute farm performs circuit simulation on the input files. This simulation is done locally on each of the compute nodes. During the circuit simulation, many temporary files and a few persistent files are generated. All these files are written back to the NetApp storage.


- Assembly phase: In the final phase, all the data created for a cell by the slaves is read from the NetApp storage by one of the slave nodes. A single new database file (.ldb) in the Liberty format is created and written to the NetApp storage. Finally, all the other files on the NetApp storage are deleted. The .ldb file is used for timing analysis throughout the phases of the chip design.
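
The three phases above can be sketched as a minimal, hypothetical Python outline. The function names and file layout are illustrative only (they are not Virtuoso Liberate APIs); a temporary directory stands in for the shared NFS volume:

```python
import os
import tempfile

def pre_analysis(netlists):
    """Master node (sketch): read netlists and build one job per cell."""
    return [{"cell": name, "netlist": f"{name}.spi"} for name in netlists]

def analysis(job, shared_dir):
    """Slave node (sketch): simulate locally, write the result to shared storage."""
    result = os.path.join(shared_dir, job["cell"] + ".result")
    with open(result, "w") as f:
        f.write(f"timing/power data for {job['cell']}\n")
    return result

def assembly(results, shared_dir):
    """One slave (sketch): merge per-cell results into a single .ldb file,
    then delete the intermediate files from shared storage."""
    ldb = os.path.join(shared_dir, "library.ldb")
    with open(ldb, "w") as out:
        for r in results:
            with open(r) as f:
                out.write(f.read())
            os.remove(r)
    return ldb

if __name__ == "__main__":
    shared = tempfile.mkdtemp()  # stands in for the NFS-mounted project volume
    jobs = pre_analysis(["and2", "dff"])
    results = [analysis(j, shared) for j in jobs]
    print(os.path.basename(assembly(results, shared)))  # library.ldb
```

The point of the sketch is the I/O pattern the report describes: many small intermediate files written during analysis, then one large merged file and a burst of deletes during assembly.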

3.2 Virtuoso Liberate Tool in a Clustered Data ONTAP 8.2 Environment

It is not uncommon for these tools to run on a large number of compute nodes (on the order of thousands of cores) with job schedulers such as Load Sharing Facility (LSF) or Sun Grid Engine (SGE) in cell library characterization environments. Apart from all the optimization that happens at the Virtuoso Liberate application layer, it is imperative to also optimize and tune the compute nodes, the network, and the storage layers to achieve faster job completion times. Depending on the size and type of the standard and custom cells, a large number of files is generated, which requires storage capacity along with efficiency and performance. Figure 3 shows a typical storage integration with the different cell library characterization workflows in a data center.

Figure 3) Cell library characterization with clustered Data ONTAP in a data center.

4 Clustered Data ONTAP 8.2 for Cell Library Characterization Workloads

NetApp clustered Data ONTAP provides advanced technologies for software-defined storage that abstract data from the underlying hardware by virtualizing the storage infrastructure with storage virtual machines (SVMs) to allow an efficient, scalable, nondisruptive environment. Some of these virtualization capabilities may be similar to past NetApp vFiler® unit functionality; others go beyond anything else available today. Clustered Data ONTAP is built on the same trusted hardware that NetApp has been selling for years. We bring together the different hardware platforms, connect them, and give them the intelligence to communicate with each other in a clustered environment. The following sections detail the key benefits that clustered Data ONTAP provides for Virtuoso Liberate workloads.

4.1 Performance

Cell or circuit design environments mostly use NFS to mount volumes from storage on compute nodes. With NFS, scaling the number of nodes in the compute farm is very easy. With clustered Data ONTAP, the storage can also scale seamlessly to provide the enhanced I/O operations per second (IOPS), bandwidth, performance, and efficiency that are required by different chip design tools.


The following are strong requirements in chip design production scenarios to provide top-notch performance:

- Larger memory footprint
- Greater number of cores for concurrent processing
- Higher capacity limits

Users may require 1,000,000 IOPS from multiple volumes on the storage for different standard and custom cell library characterization projects. In Data ONTAP operating in 7-Mode, this IOPS requirement is constrained, with symlinks and volumes limited to a single controller or a high-availability (HA) pair. With clustered Data ONTAP, however, the symlinks are replaced by cluster namespace junctions, so all the volumes that are part of a single project can be spread out across the different nodes in the cluster. Every node in the cluster then contributes to the IOPS requirement for that project.

Figure 4) Workload balancing for cell library characterization.

Figure 4 illustrates how the IOPS requirement is spread across different controllers. "Proj11" has six volumes that are spread out on four FAS nodes in a production cluster. Each node is capable of delivering more than 250,000 IOPS from the cache. The 1,000,000-IOPS requirement for Proj11 can therefore be achieved by spreading the flexible volumes across the cluster namespace within an SVM. These volumes can grow and shrink in size, and they can also be moved seamlessly, without disrupting the application, to any cluster node that is capable of providing the desired performance.
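
As a sketch, such a nondisruptive volume move is driven from the clustered Data ONTAP cluster shell; the SVM, volume, and aggregate names below are hypothetical examples, not recommendations:

```
cluster1::> volume move start -vserver svm_eda -volume proj11_vol1 -destination-aggregate aggr_node3
cluster1::> volume move show -vserver svm_eda -volume proj11_vol1
```

NFS clients keep their mounts throughout the move; only the backing aggregate changes.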

The PCI Express (PCIe)-based NetApp Flash Cache™ intelligent caching in each controller in the cluster continues to boost metadata, random read, and sequential read performance for electronic design automation (EDA) workloads.

4.2 High Availability and Reliability

With scale-out architectures, it is very important to have volumes that are highly available and accessible at all times by the cell library characterization applications. Clustered Data ONTAP 8.2 provides high availability at the following levels for all cell design workloads:

- Storage controller
- Network
- NFS protocol

The cluster can be set up so that storage fails over to the surviving partner node in the HA pair, network access fails over to another port in the cluster, and NFS access is served through a different node in the cluster if the NFS clients cannot reach the desired volumes. A cell design scenario typically consists of a single large aggregate on each controller.

NetApp RAID-DP® technology provides data resiliency against single- and double-disk failures. Nondisruptive upgrades (NDUs) of clustered Data ONTAP versions and disk shelf firmware keep operations nondisruptive to the chip design application. This allows clustered Data ONTAP to provide five-nines (99.999%) reliability.


4.3 Capacity

Clustered Data ONTAP 8.2 supports larger aggregates and flexible volumes for various hardware platforms. The number of supported flexible volumes can be higher on a single FAS controller for high-end platforms. For further details, refer to the Clustered Data ONTAP 8.2 Release Notes at https://library.netapp.com/ecm/ecm_get_file/ECMP1196821 and the System Configuration Guide at https://hwu.netapp.com/Resources/generatedPDFs/8.2_Clustered_Data_ONTAP-FAS.pdf.

Flexible volumes that host different chip designs can move nondisruptively to an aggregate on a different controller for capacity load balancing. A flexible volume can move from an aggregate on a midrange platform to an aggregate on a high-end platform to provide a higher capacity limit. This provides more autonomy for applications and services and responds dynamically to shifts in workloads.

4.4 Storage Efficiency

Clustered Data ONTAP 8.2 provides almost all the storage efficiencies—including NetApp Snapshot™ copies, thin provisioning, space-efficient cloning, deduplication, and data compression—that Data ONTAP operating in 7-Mode provides for all EDA tier 1 applications:

- Thin provisioning. Thin provisioning makes a huge impact when provisioning storage space for volumes that are part of individual projects. Thick-provisioned volumes are guaranteed 100% of their space from the start to the finish of a project even if the project files do not require the entire space, leaving very little space to provision for newer projects; Project X cannot borrow space from Project Y. Therefore, thin provisioning is enabled by default for cell design volumes mounted over NFS. As files from different projects continue to be created, updated, and deleted, the free space is managed at the aggregate level. In practice, at any given point from the start to the finish of a project, actual user data fills about 30% to 60% of the aggregate space, leaving roughly a third of the aggregate available to accommodate new chip design projects. The NetApp OnCommand® Workflow Automation (WFA) tool provides alarms that are triggered when an aggregate is filled to a configurable limit (normally 80%). Administrators can then move volumes nondisruptively to an aggregate on a different controller that has less space utilized.
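
A thin-provisioned project volume is created from the cluster shell by setting the space guarantee to none. The SVM, volume, aggregate, size, and junction path below are illustrative examples only:

```
cluster1::> volume create -vserver svm_eda -volume proj11_vol1 -aggregate aggr1 -size 10TB -space-guarantee none -junction-path /proj11/vol1
```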

- Space efficiency. Clustered Data ONTAP 8.2 and the NetApp WAFL® (Write Anywhere File Layout) file system still use a 4KB block size. Cell library characterization applications generate a combination of random reads and writes along with sequential write workloads for the log files generated during the workflow, involving many small and large files. Unlike other storage vendors, NetApp does not mirror the small files, thus improving the space efficiency of the storage while storing the design files. Also, deduplication and compression savings are preserved at the destination when data is moved by NetApp SnapVault® technology from the primary storage.

4.5 Agile Infrastructure

With clustered Data ONTAP 8.2, volumes and IP addresses are no longer tied to the physical hardware. The SVM with its cluster namespace spans multiple controllers in the cluster. Storage can be tiered with different types of disks, such as SSD, SAS, or SATA, depending on the service-level offerings for different chip design workloads. Other infrastructure features include:

- Provisioning or scaling out. New SVMs that consist of cell library volumes can be created on the existing hardware for different applications and tools. Existing SVMs can grow seamlessly as new hardware is added to the cluster. SVMs can be provisioned on the fly for individual departments, companies, or applications.
- Multi-tenancy. The physical cluster nodes can be shared by many tenants. SVMs provide a secure logical boundary between tenants. The bottom line is that data constituents such as volumes are decoupled from the hardware plane to provide more agility to the storage infrastructure.
- Unified storage. Clustered Data ONTAP 8.2 offers unified storage that natively supports NFS, CIFS, FCP, and iSCSI. Because EDA workloads are mostly on NFS, different versions of NFS, such as NFSv3 and NFSv4.1/pNFS, can coexist and access the same file system that is exported from the storage.
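
On the Linux clients, the two NFS versions can be mounted side by side against the same export. The server name, export path, and mount options below are illustrative examples, not tuning recommendations:

```
# NFSv3 mount
mount -t nfs -o vers=3,rsize=65536,wsize=65536,hard,proto=tcp svm_eda:/proj11 /mnt/proj11_v3

# NFSv4.1/pNFS mount (requires an NFSv4.1-capable client, e.g. RHEL 6.5)
mount -t nfs -o vers=4.1,hard,proto=tcp svm_eda:/proj11 /mnt/proj11_v41
```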


- Storage quality of service (QoS). Clustered Data ONTAP 8.2 introduces storage QoS, in which IOPS and bandwidth limits can be set on files, volumes, and SVMs to isolate test and development and rogue workloads from production. Storage QoS provides the following functionalities:
  - Enables the consolidation of mixed workloads without affecting the performance of different chip design volumes or files in a multi-tenant environment
  - Isolates and throttles resource-intensive workloads to deliver consistent performance
  - Simplifies workload management
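
As a sketch, a QoS policy group caps the throughput of the objects assigned to it. The policy group name, 5,000-IOPS limit, and volume name below are hypothetical examples:

```
cluster1::> qos policy-group create -policy-group pg_devtest -vserver svm_eda -max-throughput 5000iops
cluster1::> volume modify -vserver svm_eda -volume scratch_vol -qos-policy-group pg_devtest
```

Assigning a scratch or test volume to such a group throttles it so that production characterization volumes keep their IOPS.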

- Nondisruptive operation (NDO). Chip design volume and logical interface (LIF) movement within the SVM allows nondisruptive lifecycle operations that are completely transparent to the applications. NDO applies in the following scenarios:
  - Unplanned events: infrastructure resiliency against hardware and software failures
  - Planned events: capacity and performance load balancing, software upgrades, and hardware technology refreshes

These features make the infrastructure more agile and enable IT-managed data centers to provide IT as a service.

4.6 Data Protection

Clustered Data ONTAP provides a high level of data protection through file-system-consistent Snapshot copies, NetApp SnapMirror® technology, and SnapVault. Snapshot copies and SnapVault are the most commonly used tools for data protection in the cell library characterization space. In clustered Data ONTAP 8.2, SnapVault performs logical replication at the volume level that can be done within an SVM, across SVMs, and across clusters. Because a common use case for SnapVault is remote or off-site backup, the remote sites can use single-node and two-node switchless clusters to help EDA customers scale with minimal cost and complexity.

4.7 Manageability

Manageability becomes a lot easier with SVMs and cluster namespaces in the clustered scale-out architecture, compared with managing different islands of storage as was the case with Data ONTAP operating in 7-Mode. Clustered Data ONTAP 8.2 offers a single virtualized pool of all storage. A single logical pool can be provisioned across many arrays.

In traditional Data ONTAP systems operating in 7-Mode, SnapMirror was used to move volumes for more capacity, for more compute power, or for archiving purposes. This is no longer the case with clustered Data ONTAP: volumes can be moved nondisruptively within the namespace. The clustered storage can be set up and configured to provision storage and set policies for different types of workloads and nondisruptive operations by using the OnCommand Unified Manager and Workflow Automation (WFA) tools.

4.8 Cost

Clustered Data ONTAP 8.2 can provide a virtual pool of storage across different FAS platforms. A four-node cluster can include a midrange FAS3270 with SAS disks to handle entry-level projects. Later, these projects can be moved onto a high-end FAS6290 with SSD and SAS disks, along with PCIe-based Flash Cache, for high performance during the mid to final stages. Finally, the project files can be moved to a FAS3270 with SATA disks for archiving. During the entire life of the project, the chip design volumes can move across different tiers of storage that are set up according to price and performance. This is the unique aspect of the SVM in clustered Data ONTAP: the namespace spans different tiers of storage that are set up with respect to price and service-level objective (SLO) for the different phases of the cell design workloads.


5 Cadence Virtuoso Liberate Validation with NetApp Clustered Data ONTAP 8.2

The various chip design houses constantly run cell library characterization on cells or circuits during the chip design cycle, based on standard guidelines developed by the foundries. Given the complexity of chip requirements, the cell layout is always checked against the schematics on a silicon chip. The complex cell library characterization process leads to high-performance demands, with low latency and faster job-completion times.

The Virtuoso Liberate application performance was validated at the Cadence lab in San Jose, California. For this performance validation, NetApp, through its partnership with Cadence, ran the tool on production 28nm standard-cell data (385 cells) from a large foundry customer. The test results from the Cadence lab demonstrated consistent improvement when clustered Data ONTAP 8.2, the network, and the compute nodes were optimized.

5.1 Performance Validation Objectives

The primary objectives of the Virtuoso Liberate performance validation with clustered Data ONTAP 8.2 were the following:

Validate that the Virtuoso Liberate job-completion time (wall-clock time) on clustered Data ONTAP 8.2 is on par with or better than the baseline for jobs performed on Data ONTAP 8.1.2 operating in 7-Mode over NFSv3.

Explore new technology such as pNFS for cell library characterization workloads. Validate that the Virtuoso Liberate job-completion time with clustered Data ONTAP 8.2 over pNFS is comparable to or better than that of Data ONTAP 8.1.2 operating in 7-Mode over NFSv3.

Enable reduction in the job-completion time, allowing users to move on to other cell library characterizations and make better use of their licenses. This improves the ROI of the Virtuoso Liberate tool.
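To make the license argument concrete, here is a back-of-the-envelope sketch; the 8-hour job length is an illustrative assumption, and the 15% reduction is the improvement figure reported later in this validation.

```python
def jobs_per_day(job_hours: float, speedup_pct: float) -> float:
    """Characterization runs one license can complete in 24 hours,
    given a percentage reduction in wall-clock job time."""
    reduced = job_hours * (1 - speedup_pct / 100.0)
    return 24.0 / reduced

baseline = jobs_per_day(8.0, 0.0)    # 3 runs per license-day at 8 hours per job
improved = jobs_per_day(8.0, 15.0)   # roughly 3.5 runs per license-day
```

The same license inventory therefore completes more characterizations per day, which is where the ROI improvement comes from.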

5.2 Virtuoso Liberate Test Lab Details

The Cadence lab consists of three physical compute nodes. Each compute node has 32 cores and 768GB of physical memory, with a 10GbE connection to the storage through a 10GbE switch plane. The tests were able to scale up to 96 cores. All of these compute nodes ran the Red Hat Enterprise Linux® (RHEL) 6.5 kernel 2.6.32-431.14.1.el6.x86_64. LSF was used as the scheduler to submit the jobs in the compute farm. Virtuoso Liberate 13.1 ISR1_e86 was used for this test.

On the storage side, we had a four-node FAS6280 cluster with 10GbE data ports. Each cluster node had two 10GbE data ports aggregated to provide 20Gb/sec of network bandwidth, which can deliver up to approximately 2.4GB/sec of throughput. Each of the cluster nodes had four shelves of 10,000-RPM SAS disks and a 512GB PCIe-based Flash Cache card. The cluster was running clustered Data ONTAP 8.2P4.
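The quoted throughput follows from simple arithmetic: two aggregated 10GbE ports give 20Gb/sec of raw bandwidth, and dividing by 8 bits per byte (less protocol overhead) yields roughly 2.4GB/sec. A quick check, where the ~4% overhead factor is an assumption chosen only to reproduce the quoted number:

```python
ports = 2
gbit_per_port = 10
raw_gbps = ports * gbit_per_port   # 20 Gb/sec aggregated link bandwidth
raw_gBps = raw_gbps / 8            # 2.5 GB/sec before protocol overhead
overhead = 0.96                    # assumed Ethernet/TCP efficiency factor
usable_gBps = raw_gBps * overhead  # about 2.4 GB/sec usable throughput
```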

For the Data ONTAP operating in 7-Mode setup, a single FAS6280 HA pair was used with two 10GbE data ports aggregated, four shelves of 10,000-RPM SAS disks, and a 512GB PCIe-based Flash Cache card on each controller. This setup was running Data ONTAP 8.1.2 operating in 7-Mode.

Both the clustered and the 7-Mode storage had three RAID groups, with 66 disks on each of the aggregates.

5.3 Test Plan

The cell libraries from a major foundry were used to run the various tests on NetApp storage to generate the workload with the Virtuoso Liberate tool. The tool ran on 385 cells to put stress on the storage. Four different sets of tests were performed, including:

Cell library characterization tests over NFSv3 with Data ONTAP 8.1.2 operating in 7-Mode. This is a typical Cadence environment that the sample customer uses in production. This test was considered to be the baseline for the rest of the tests that followed.


Cell library characterization tests over NFSv3 with clustered Data ONTAP 8.2P4 without any optimization.

Cell library characterization tests over NFSv3 with clustered Data ONTAP 8.2P4 fully optimized for Red Hat clients and storage.

Cell library characterization tests over NFSv4.1/pNFS with delegations on clustered Data ONTAP 8.2 fully optimized.

5.4 Virtuoso Liberate Lab Test Results

The tests were performed in isolation from the standard production environment of Cadence. The Virtuoso Liberate tests were performed with 23 slaves, each using four cores. A standard-cell 28nm library with 385 cells was used for the testing.

Certain procedures and conditions were observed after every test cycle:

While host-side and server (storage)-side caching is common in any production EDA design environment, these tests were performed with no caching on the host or the storage. After every test run, the storage buffer cache was flushed, and every test was run in a different directory to avoid any host-side caching.

Cadence’s production environment already has a 20GbE network. Customers usually do not have 20GbE network connections in their environment. The baseline performance numbers were obtained in an already-optimized environment in Cadence’s production setup.

Cadence’s production environment also uses RAID groups with 66 disks or more, which is also the recommended configuration for optimized performance. The baseline performance numbers again had the advantage of this optimized environment in Cadence’s production setup.

Figure 5) Virtuoso Liberate performance test results.

5.5 Virtuoso Liberate Performance Test Observations

Figure 5 indicates that tests run with Data ONTAP operating in 7-Mode performed better than tests run with unoptimized clustered Data ONTAP, considering that the network and disk subsystem were already optimized for better performance in the Cadence environment. However, there are no further feature and functionality improvements in Data ONTAP operating in 7-Mode. Data ONTAP operating in 7-Mode does not provide the scale-out and agility features that clustered Data ONTAP provides, as described in section 4.

Most of the recent and future installed base among EDA and semiconductor customers is on clustered Data ONTAP. Figure 5 illustrates that NFSv4.1/pNFS provides the best results, with up to 15% improvement over unoptimized clustered Data ONTAP. This is a significant improvement, considering the savings it provides with regard to the license costs associated with the Virtuoso Liberate application. Overall, clustered Data ONTAP with NFSv4.1/pNFS provides performance that is comparable to that of Data ONTAP operating in 7-Mode. It also offers many enterprise-level features and functions, including scalability, storage efficiency and QoS, high reliability, NDO, and data protection.

Red Hat has officially announced that NFSv4.1/pNFS is generally available in the RHEL 6.5 release. While Cadence’s platform team is validating this new release, the Virtuoso Liberate R&D team is supporting it with the latest version of the Virtuoso Liberate tool. All the newer host-side hardware comes with new device drivers and software, so NetApp highly recommends running the hosts in the compute farm on the latest version of Linux.

Based on this performance validation, it was observed that RHEL 6.5 is a stable Linux kernel that provided better performance with NFSv4.1/pNFS than with NFSv3.

6 Best Practices for Virtuoso Liberate Tool with Clustered Data ONTAP 8.2

The Virtuoso Liberate tool is one of the most common tools used for cell library characterization during chip design cycles. An increasing number of customers are deploying clustered Data ONTAP 8.2 for storage supporting the characterization phase. Scale-out clustered FAS storage has to be properly architected to handle the Virtuoso Liberate workload. The aggregates and volumes that store the cell libraries and inputs, such as the netlists, must be optimally laid out across the cluster nodes.

The best practices and recommendations in this section provide guidance to optimize clustered Data ONTAP, the network layer, and the compute nodes for Virtuoso Liberate workloads. It is also imperative to validate some of the key clustered Data ONTAP 8.2 features and functions to improve the overall efficiency of the Virtuoso Liberate application.

6.1 Storage Cluster Node Architecture

NetApp highly recommends implementing the right storage platform in a clustered Data ONTAP setup, along with adequate storage sizing and configuration, to accommodate cell library characterization workloads for standard and custom cells that have different SLOs. If the workload is performance driven and has the highest SLO, NetApp recommends storage controllers with multiple cores and a large memory footprint. Faster serial-attached SCSI (SAS) disks should always be used for designs that require a faster response time.

Choosing the Right Hardware for Virtuoso Liberate Workloads in a Clustered Scale-Out Architecture

The Virtuoso Liberate cluster setup can provide different SLOs for standard and custom cells and other dependencies that can coexist in the same or different SVMs. The choice of hardware can differ based on the price-to-capacity (GB) and price-to-performance ratios for various SLOs:

If the Virtuoso Liberate workload requires the highest level of performance, NetApp strongly recommends FAS6290 controllers with a minimum of 900GB second-generation SAS disks and a minimum 6Gb/sec SAS backplane.

If the cluster setup is designed to accommodate library database (.ldb) files for archiving, NetApp recommends a minimum of FAS3270 controllers with SATA disks.

NetApp recommends having a minimum of 1TB of PCIe-based Flash Cache 2 cards on each controller.

A four- or eight-node, or larger, cluster with different types of disks (SSD, SAS, and SATA) can be configured based on the SLOs for different workloads.

Note: NetApp highly recommends engaging with the appropriate NetApp sales account team to evaluate your business requirements before architecting the cluster scale-out setup in your environment.

NetApp clustered Data ONTAP consists of a data stack and a network stack as part of the operating system. Figure 6 shows these logical stacks, which are transparent to the administrator and the users. Each of these stacks consists of different components that allow the I/O requests to communicate back and forth to the clients while accessing the data. The figure illustrates the different components in the network stack and the data stack.

Figure 6) Clustered Data ONTAP logical stack layout.

Data stack. This consists of WAFL, RAID, data, metadata, the memory manager, and the lock manager. The aggregates, RAID group sizes, and volumes all exist in the data layer. These components are very similar to those in Data ONTAP operating in 7-Mode. PCIe-based Flash Cache is also part of the data stack.

Network stack. This consists of a user space and a kernel space. The kernel space includes all the networking (interfaces, ifgrps, and so on) and the protocols (NFS, CIFS, and so on). All of the export rule evaluation, NIS, LDAP, and DNS lookups happen in the user space.

Clustered Data ONTAP 8.2 can help improve Virtuoso Liberate performance compared with earlier versions of clustered Data ONTAP. Following is a list of improvements offered by version 8.2 that can make a real difference in performance, on top of all the optimizations in releases of clustered Data ONTAP earlier than 8.2.

Why Use Clustered Data ONTAP 8.2 for Virtuoso Liberate Workloads?

Provides high levels of network parallelism

Mitigates the impact of large file deletions

Improves write performance with CP smoothing

Enhances storage QoS for performance efficiency

Delivers SnapVault for data protection

Improves coalescing of free space after large deletions:

Continuous segment cleaning

Offers multiprocessor support for NFSv4.x:

Fast path for local data path

Zero-copy support, as in NFSv3

6.2 Storage Cluster Node Sizing and Optimization

After the right hardware is configured and clustered Data ONTAP 8.2 is installed on the cluster setup, the following sizing efforts must be made on the RAID disk subsystem for optimal performance. All the cluster nodes are multipathed in a storage failover configuration. In the event of a hardware failure, the surviving partner node takes over the disks from the failing node to provide continuous access to the data volumes.

Page 14: Optimizing Standard Cell Library Characterization with ... · with Cadence Virtuoso Liberate and NetApp Clustered Data ONTAP 8.2 Bikash Roy Choudhury, NetApp Harsh Vardhan, Cadence

14 Optimizing Standard Cell Library Characterization with Cadence Virtuoso Liberate and NetApp Clustered Data ONTAP 8.2

How to configure clustered Data ONTAP for creating aggregates, volumes, and the NFS protocol is not within the scope of this paper. Refer to TR-4067, “Clustered Data ONTAP NFS Best Practice and Implementation Guide,” for clustered Data ONTAP 8.2 and NFS configuration details.

6.3 File-System Optimization

After the volumes have been created in the SVM, NetApp recommends certain best-practice configurations on the aggregate and volumes to address the following issues in a Virtuoso Liberate environment:

Fragmentation after constant writes and deletions to the file system during the assembly phase

Fragmentation of free space for writes to complete a full stripe

The file system can be kept healthy at all times with the help of some maintenance and housekeeping activities on the storage as it ages and grows in size, including:

Defragmenting the file system. Reallocate is a low-priority process that constantly defragments the file system, and it can run in the background. However, NetApp recommends implementing measures to keep aggregate utilization under 80%. If the aggregate runs close to 90% capacity, the following considerations apply:

Some amount of free space is required to temporarily move the data blocks to free space and to rewrite to those blocks in full and complete stripes in contiguous locations on the disk, thereby optimizing the reads that follow.

With insufficient space in the aggregate, reallocate still runs in the background, but defragmentation of the file system never completes.

An NDO to move the production chip design volume to another controller that is part of the cluster setup for capacity balancing must occur.

New shelves must be added to the original controller to provide more space to the aggregate that is running low on space.

Perform reallocate start -vserver vs1_eda_lib -path /vol/VOL06 -force true for all the volumes in that aggregate.

Aggregate and RAID Group Sizing Best Practices

Disk spindles help improve write performance by reducing CPU utilization and the time needed to find free space when writing full stripes of data, which matters as the file system ages from constant deletions and insertions. For optimal sizing:

A RAID group should have 28 SAS disks.

A minimum of 6 RAID groups should be present in a single aggregate.

Spread the volumes that are part of a single project with a high-performance requirement across all the high-end platform (FAS6290) cluster nodes. All of these volumes are connected by dynamic cluster namespace junctions:

Spreading out the volumes prevents putting all your eggs (volumes) in one basket (aggregate). In this way, all the project volumes can draw on the IOPS and bandwidth of all the controller nodes that host them instead of saturating a single controller.

Spreading out the volumes also allows moving the volumes nondisruptively in the cluster namespace for workload balancing to meet the high-performance SLO.

During the verification phase, a lot of transient data is generated that has large writes and reads accompanied by a lot of deletions. Isolating the volume that writes the transient data on a different cluster node from other volumes that are part of the project helps to prevent a single controller from being the bottleneck.
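The volume-spreading guidance above can be sketched as a simple round-robin placement across controllers; the volume and node names here are hypothetical, and a real deployment would place volumes with the cluster management tools.

```python
from itertools import cycle

def place_volumes(volumes, nodes):
    """Spread project volumes round-robin across cluster nodes so that
    no single controller absorbs all the project IOPS (sketch only)."""
    placement = {}
    node_iter = cycle(nodes)
    for vol in volumes:
        placement[vol] = next(node_iter)
    return placement

vols = ["lib_cells", "netlists", "transient_scratch", "results"]
nodes = ["node1", "node2", "node3", "node4"]
# transient_scratch lands on its own controller, isolating its heavy
# write/delete traffic from the other project volumes
layout = place_volumes(vols, nodes)
```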


Adding new shelves and then running reallocate with -force true forces all the existing volumes to spread out across the new disk spindles that were added to the aggregate. Otherwise, the new writes coming into that aggregate go only to the new disks.

Defragmenting free space. Continuous segment cleaning, which was introduced in clustered Data ONTAP 8.1.1 and further optimized in clustered Data ONTAP 8.2, helps coalesce the deleted blocks in the free pool to use for subsequent writes.

Thin provisioning. The volumes in the cluster namespace can be thin provisioned by disabling space-guarantee. This provides flexibility to provision space for chip design or different project volumes that can autogrow in increments of 1GB.

NetApp recommends enabling the following storage options to optimize the entire life of the file system.

File-System Optimization Best Practices

The following settings cannot be put into place at the admin privilege level of the cluster shell. They can be made only at the advanced (diagnostic) privilege level, indicated by the ::*> prompt:

bumblebee::*> vol modify -vserver vs1_eda_lib -volume VOL06 -min-readahead false
(volume modify)
Volume modify successful on volume: VOL06

bumblebee::*> aggr modify -aggregate aggr1_fas6280c_svl09_1 -free-space-realloc on

bumblebee::*> reallocate start -vserver vs1_eda_lib -path /vol/VOL06 -space-optimized true -interval 3

bumblebee::*> vol modify -volume VOL06 -read-realloc space-optimized
(volume modify)
Volume modify successful on volume: VOL06

NetApp recommends always setting up an alarm that triggers as soon as the aggregate reaches 80% capacity. The critical chip design volumes that need more space can then be moved to another aggregate on a different controller, either automatically by using WFA or manually.
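A monitoring hook for the 80% alarm can be as simple as the following sketch; the aggregate names and utilization figures are hypothetical, and in practice the numbers would come from the monitoring tools rather than a hard-coded map.

```python
def aggregates_over_threshold(utilization, threshold=0.80):
    """Return the aggregates whose used-capacity fraction exceeds the
    threshold, flagging candidates for a nondisruptive volume move.
    utilization maps aggregate name -> used fraction (0.0 to 1.0)."""
    return sorted(name for name, used in utilization.items()
                  if used > threshold)

# Hypothetical snapshot of aggregate usage across three cluster nodes
usage = {"aggr1_node1": 0.62, "aggr1_node2": 0.87, "aggr2_node3": 0.79}
alarms = aggregates_over_threshold(usage)   # ["aggr1_node2"]
```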

NetApp recommends thin provisioning the volumes. This can be done when the volumes are created, or they can be modified later. It can also be implemented by using OnCommand System Manager 3.0 from a GUI:

bumblebee::*> vol modify -vserver vs1_eda_lib -volume VOL06 -space-guarantee none
(volume modify)
Volume modify successful on volume: VOL06

Adequate sizing is required for the number of files in each directory and the path name lengths:

Longer path names lead to a higher number of NFS LOOKUP operations.

Default quotas cannot be implemented for users and groups:

Include an explicit quota entry for users and groups.
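The effect of path-name length can be approximated with a simple model in which each path component costs one NFS LOOKUP on an uncached traversal; real clients cache intermediate lookups, so this is an upper bound, and the example paths are illustrative.

```python
def nfs_lookups(path: str) -> int:
    """Count path components: roughly one LOOKUP per component on an
    uncached traversal (simplified model; client caching reduces this)."""
    return len([c for c in path.strip("/").split("/") if c])

shallow = nfs_lookups("/proj/lib/cell.sp")                      # 3 components
deep = nfs_lookups("/proj/28nm/std/lib/char/run1/out/cell.sp")  # 8 components
```

Keeping directory trees shallow therefore directly reduces metadata traffic to the storage.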

6.4 Storage Network Optimization

After you create the aggregates and volumes based on the recommended sizes to support the cell library workload, you must then configure the network. At that time, the cluster, management, and data ports are all physically connected and configured on all the cluster nodes to the cluster switches. Configuring the network includes:

Data port aggregation. Before the LIFs and routing tables are configured for each SVM, it is very important to aggregate at least two 10GbE data ports for handling the cell library workloads.

Page 16: Optimizing Standard Cell Library Characterization with ... · with Cadence Virtuoso Liberate and NetApp Clustered Data ONTAP 8.2 Bikash Roy Choudhury, NetApp Harsh Vardhan, Cadence

16 Optimizing Standard Cell Library Characterization with Cadence Virtuoso Liberate and NetApp Clustered Data ONTAP 8.2

Depending on the number of chip design and tool volumes that each controller has, NetApp recommends aggregating a larger number of data ports than required to achieve the desired SLO.

LIF failover. As mentioned in section 4.5, LIF IP addresses are no longer tied to physical network ports. They are part of the SVM. When LIF IP addresses are created, NetApp recommends configuring a failover path in case the home port goes offline. If a data port failure occurs, the LIF can fail over nondisruptively to another controller. This allows the application to continue accessing the volume even though the LIF moved to a different controller in the SVM.

Storage Network Optimization Best Practices

Aggregate at least two 10GbE data ports on each cluster node that interface with the compute farm:

bumblebee::*> network port ifgrp create -node fas6280c-svl07 -ifgrp e7e -distr-func ip -mode multimode

bumblebee::*> network port ifgrp add-port -node fas6280c-svl07 -ifgrp e7e -port e0d

bumblebee::*> network port ifgrp add-port -node fas6280c-svl07 -ifgrp e7e -port e0f

Use the following option to configure LIF failover for any LIF configured in the SVM, clusterwide:

bumblebee::*> net int modify -vserver vs1_eda_lib -failover-group clusterwide -lif vs1_eda_lib_data3 -home-node fas6280c-svl09 -home-port e9e -address 172.31.22.172 -netmask 255.255.255.0 -routing-group d172.31.22.0/24

Always follow a ratio of 1 volume to 1 LIF; that is, every volume has its own LIF. If the volume moves to a different controller, the LIF should move along with it.

6.5 Flash Cache Optimization

A caching tier on the storage supplements the I/O capability available to the cell library workload. Flash Cache accelerates the read workload: metadata, random reads, and sequential reads. Random data access is otherwise a function of the disks; a higher number of disk spindles helps to generate a greater amount of read I/O. Storage platforms with PCIe-based Flash Cache improve read performance:

Flash Cache serves additional I/O requests from the flash-based cache, while disk spindles help improve the performance of write-intensive workloads.

All the movement of the data between the base memories (DRAM), Flash Cache, and disks happens transparently to the application.

NetApp recommends enabling flexscale.lopri_blocks in Flash Cache. The Virtuoso Liberate tool tends to read data from the storage as soon as it is written. This option allows those I/O requests to be served from Flash Cache instead of from the disks, which improves application performance. This option also allows caching of sequential data from the disks.

To cache random read workloads, flexscale.normal_data_blocks should be enabled.

Page 17: Optimizing Standard Cell Library Characterization with ... · with Cadence Virtuoso Liberate and NetApp Clustered Data ONTAP 8.2 Bikash Roy Choudhury, NetApp Harsh Vardhan, Cadence

17 Optimizing Standard Cell Library Characterization with Cadence Virtuoso Liberate and NetApp Clustered Data ONTAP 8.2

Flash Cache Optimization Best Practices

The following settings cannot be made from the cluster shell. They can be made only at the node shell CLI of each controller. These commands allow you to get from the cluster shell to the node shell on each controller:

bumblebee::*> system node run -node fas6280c-svl09
Type 'exit' or 'Ctrl-D' to return to the CLI
fas6280c-svl09> priv set diag
Warning: These diagnostic commands are for use by NetApp personnel only.
fas6280c-svl09*>

Enable Flash Cache:

options flexscale.enable on

Enable caching of metadata and random read data:

options flexscale.normal_data_blocks on

Enable caching of sequential read data:

options flexscale.lopri_blocks on

Type exit in the node shell to get back to the cluster shell:

fas6280c-svl09*> exit

logout

bumblebee::*>

6.6 Network File System (NFSv3) Optimization

Almost all of the cell library characterization workload accesses the file system from the back-end storage controllers over the Network File System version 3 (NFSv3) protocol:

NFSv3 is a stateless protocol and is geared primarily toward performance-driven workloads such as the Virtuoso Liberate environment with asynchronous writes.

Communication between the NFSv3 client and the storage happens over Remote Procedure Calls.

Red Hat Enterprise Linux (RHEL) 5.x is the most common Linux vendor–supported version that is used by most of the semiconductor companies in Virtuoso Liberate compute farm environments.

NFS runs in the kernel space of the network stack in the clustered Data ONTAP code. Minimal tuning is required for NFS running on the network stack.

As one of the benefits of clustered Data ONTAP 8.2, a fast path for the local data path is available for NFSv3.

With a large number of compute nodes accessing files from a single controller, the TCP receive window, or receive buffer, can quickly become exhausted. The storage cannot accept more data over the wire until the receive buffer is freed. NetApp therefore recommends increasing the TCP receive buffer value.

NetApp recommends enabling NFS failover groups to provide another layer of protection at the protocol level.
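The receive buffer requirement follows from the bandwidth-delay product: the buffer must hold at least link speed × round-trip time, or the window, rather than the link, becomes the bottleneck. A quick calculation, where the 10Gb/sec link and 1ms RTT are assumed example values:

```python
def min_receive_buffer_bytes(link_gbps: float, rtt_ms: float) -> int:
    """Bandwidth-delay product: the smallest TCP receive buffer that
    keeps a link of the given speed full at the given round-trip time."""
    bits_in_flight = link_gbps * 1e9 * (rtt_ms / 1e3)
    return int(bits_in_flight / 8)

# A 10Gb/sec path with 1 ms of round-trip latency needs about 1.25MB
# of receive buffer before the window limits throughput.
needed = min_receive_buffer_bytes(10, 1.0)   # 1250000 bytes
```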

Page 18: Optimizing Standard Cell Library Characterization with ... · with Cadence Virtuoso Liberate and NetApp Clustered Data ONTAP 8.2 Bikash Roy Choudhury, NetApp Harsh Vardhan, Cadence

18 Optimizing Standard Cell Library Characterization with Cadence Virtuoso Liberate and NetApp Clustered Data ONTAP 8.2

NFSv3 Optimization Best Practices

The force-spinnp-readdir option enables making effective readdir calls from the data stack; increasing the TCP maximum transfer size also optimizes performance:

nfs modify -vserver vs1_eda_lib -force-spinnp-readdir true -tcp-max-xfer-size 65536

The following steps must be followed to configure the NFS failover groups. The example shows how the LIFs vs1_eda_lib_data3 and vs1_eda_lib_data4, which are assigned to an NFS failover group, move the NFS traffic over port e7e on node fas6280c-svl07.

bumblebee::*> network interface failover-groups create -failover-group lib_failover_group -node fas6280c-svl07 -port e7e

bumblebee::*> network interface failover-groups show -failover-group lib_failover_group -instance

Failover Group Name: lib_failover_group
Node: fas6280c-svl07
Port: e7e
1 entries were displayed.

bumblebee::*> network interface modify -vserver vs1_eda_lib -lif vs1_eda_lib_data3,vs1_eda_lib_data4 -failover-group lib_failover_group

2 entries were modified.

6.7 Parallel Network File System (pNFS)

NFSv3 has been very popular and is the de facto protocol required by most of the cell library characterization applications. The NFSv3 protocol has generally met the performance needs of most cell design applications to date. NFSv4 was never a performance player; that version of NFS was intended mainly as a security and reliability play, with features like Kerberos, access control lists, and delegations. However, with NFSv4.1 file delegations and pNFS, we can achieve performance along with the security and reliability that the NFSv4.x protocol provides.

pNFS is an extension to the minor version NFSv4.1. Unlike NFSv3, NFSv4, and NFSv4.1 without pNFS, which send metadata and data over a single I/O path, pNFS isolates the metadata from the data. pNFS primarily consists of two main components:

Metadata server (MDS): Handles all metadata operations, such as GETATTR, ACCESS, LOOKUP, and SETATTR, and serves file layout information.

Data server (DS): Stores all the inode information and the real data. The clients get a direct path to access the DS.
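The MDS/DS split can be illustrated with a toy model: the client fetches a layout from the metadata server once, then performs I/O directly against the data server that owns the volume. All names and the layout shape here are illustrative sketches, not ONTAP internals or the NFSv4.1 wire protocol.

```python
# Toy model of the pNFS split (illustrative only)
VOLUMES = {"vol_lib": "node3", "vol_scratch": "node1"}  # volume -> data server

def layoutget(path: str) -> dict:
    """MDS role: map a file path to the data server holding its volume."""
    volume = path.strip("/").split("/")[0]
    return {"volume": volume, "data_server": VOLUMES[volume]}

def read_file(path: str) -> str:
    """Client role: fetch the layout once, then go direct to the DS."""
    layout = layoutget(path)  # metadata path (MDS)
    return f"read {path} directly from {layout['data_server']}"

result = read_file("/vol_lib/cells/inv_x1.sp")
```

Because the data path bypasses the metadata server, adding controllers adds data bandwidth without funneling every request through one node.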

Figure 7 shows how a pNFS client communicates with the MDS and the DS. The diagram on the right also illustrates how pNFS is implemented in clustered Data ONTAP 8.2. pNFS is purely a clustered Data ONTAP implementation; Data ONTAP operating in 7-Mode does not support pNFS. For more details on pNFS, refer to TR-4063, “Parallel Network File System Configuration and Best Practices for Clustered Data ONTAP 8.2.”

Page 19: Optimizing Standard Cell Library Characterization with ... · with Cadence Virtuoso Liberate and NetApp Clustered Data ONTAP 8.2 Bikash Roy Choudhury, NetApp Harsh Vardhan, Cadence

19 Optimizing Standard Cell Library Characterization with Cadence Virtuoso Liberate and NetApp Clustered Data ONTAP 8.2

Figure 7) pNFS implementation.

The diagram on the left side of Figure 7 shows the generic pNFS implementation, in which a pNFS client communicates with the metadata server to get file location information. File layout information is sent to the client from the MDS, which hands out the location of the file in the DSs along with the network path to get to that location. The control protocol provides synchronization between the MDS and the DS.

The diagram on the right side of Figure 7 illustrates that every node in clustered Data ONTAP is both an MDS and a DS. For any LIF IP address in the SVM that is mounted by the pNFS client, that cluster node becomes the MDS for that client. Any node that holds a data volume, whether that same node or another node in the cluster setup, acts as the DS. Eventually the pNFS clients can reach the data volumes that are located in the cluster namespace through a local data path. NetApp products implement pNFS only over files; there is no block or object implementation of pNFS available at this time.

Enable NFSv4.1/pNFS with a Delegation in Clustered Data ONTAP 8.2

Clustered Data ONTAP 8.2 supports NFSv4.1 file delegations. The following options must be enabled on the SVM to use NFSv4.1/pNFS with read and write delegations:

nfs modify -vserver vs1_eda_lib -v4.1-pnfs enabled -v4.1-read-delegation enabled -v4.1-write-delegation enabled
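The resulting settings can be confirmed from the cluster shell. The following is a sketch using the `nfs show -fields` convention; the exact field names may vary by Data ONTAP release:

```shell
# Verify the NFSv4.1/pNFS and delegation settings on the SVM
# (field names assumed; check "nfs show -fields ?" on your release)
nfs show -vserver vs1_eda_lib -fields v4.1,v4.1-pnfs,v4.1-read-delegation,v4.1-write-delegation
```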

NFSv4.1/pNFS has a client dependency. RHEL 6.5 is now the generally available version that supports pNFS over files. Clustered Data ONTAP 8.2 is optimized to perform better with RHEL 6.4 over NFSv3 and NFSv4.1/pNFS with file delegations enabled. Section 5 provides more information about the performance validation that was done in the Virtuoso Liberate lab. Cadence is currently validating and qualifying RHEL 6.5 for Virtuoso Liberate applications.

NFSv4.1/pNFS is a step in the right direction to handle the growing performance demands of the cell library characterization workloads, specifically in the assembly phase. Newer versions of the Virtuoso Liberate tool support a 64-bit architecture. A parallel file system such as pNFS on clustered Data ONTAP can achieve the high concurrency and performance requirements of the Virtuoso Liberate application and complete jobs faster than over NFSv3. The performance details are discussed in the following sections.

Virtuoso Liberate workloads always consist of large amounts of metadata. In a traditional NFSv3 setup, a single controller gets bottlenecked because of large metadata operations. Because pNFS isolates the metadata from the data, an innovative way to spread the metadata over all the controllers in a cluster setup was tested by using an on-box DNS round-robin. This also allows distribution of the NFSv4.1 locks across all the cluster nodes rather than bottlenecking a single node where the cell libraries are located.

On-Box DNS Round-Robin

Clustered Data ONTAP 8.2 provides the ability to leverage the named service on each node to service

DNS requests from clients. Clustered Data ONTAP 8.2 also issues data LIF IP addresses based on an

algorithm that calculates CPU and node throughput to provide the least utilized data LIF for proper load

balancing across the cluster for mount requests. When a mount is successful, the client continues to use

that connection until remount. This differs from round-robin DNS, because the external DNS server

services all requests and has no insight into how busy a node in the cluster is. Instead, the DNS server

simply issues an IP address based on which IP is next in the list.

Additionally, round-robin DNS issues IP addresses with a time to live (TTL). This caches the DNS request

in Microsoft® Windows® for 24 hours by default. On-box DNS issues a TTL of 0, which means that DNS is

never cached on the client and a new IP is always issued based on load.
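The zero TTL can be observed directly from a Linux client. This quick check is not part of the original report; it assumes the delegated zone name configured later in this section:

```shell
# Query the delegated zone a few times. Each answer should come from a
# different data LIF, and the TTL column in the answer section should read 0.
dig +noall +answer lib.eda.local.com
dig +noall +answer lib.eda.local.com
```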

On-Box DNS Round-Robin Configuration

The following steps illustrate how to configure the on-box DNS round-robin on clustered Data ONTAP 8.2

and Microsoft Windows Server® 2008 R2.

Clustered Data ONTAP 8.2

The following example consists of a four-node FAS6280 cluster. The SVM vs1_eda_lib spans all four cluster nodes. There are four LIF IP addresses configured for this SVM. Each cluster node has an LIF IP address configured on its home port:

FAS6280-svl07: 172.31.22.170

FAS6280-svl08: 172.31.22.171

FAS6280-svl09: 172.31.22.172

FAS6280-svl10: 172.31.22.173

The Windows Server 2008 R2 DNS IP address is 172.31.22.151.


Enable the LIFs to Query the DNS Server

Check whether the DNS is configured correctly on the SVM. Refer to TR-4067, “Clustered Data ONTAP NFS Best Practice and Implementation Guide,” to configure DNS on clustered Data ONTAP 8.2:

bumblebee::*> vserver services dns create -vserver vs1_eda_lib -domains

eda.local.com -state enabled -timeout 2 -attempts 1 -name-servers

172.31.22.151,172.31.21.151

bumblebee::*> dns show

(vserver services dns show)

Name

Vserver State Domains Servers

--------------- --------- ----------------------------------- --------------

--

vs1_eda_lib enabled eda.local.com, 172.31.22.151,

eda-win-1.eda.local.com 172.31.21.151

Configure the LIFs to query the DNS server. This is a new feature in clustered Data ONTAP 8.2:

bumblebee::*> net int show -vserver vs1_eda_lib -fields address

(network interface show)

vserver lif address

----------- ----------------- -------------

vs1_eda_lib vs1_eda_lib_data1 172.31.22.170

vs1_eda_lib vs1_eda_lib_data2 172.31.22.171

vs1_eda_lib vs1_eda_lib_data3 172.31.22.172

vs1_eda_lib vs1_eda_lib_data4 172.31.22.173

4 entries were displayed.

bumblebee::*> net int show -vserver vs1_eda_lib -fields dns-zone,listen-for-dns-query

(network interface show)

vserver lif dns-zone listen-for-dns-query

----------- ----------------- ----------------- --------------------

vs1_eda_lib vs1_eda_lib_data1 lib.eda.local.com true

vs1_eda_lib vs1_eda_lib_data2 lib.eda.local.com true

vs1_eda_lib vs1_eda_lib_data3 lib.eda.local.com true

vs1_eda_lib vs1_eda_lib_data4 lib.eda.local.com true

4 entries were displayed.
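The dns-zone and listen-for-dns-query values shown above are set per LIF. A sketch of the modify command that produces this configuration, run once for each data LIF:

```shell
# Assign a data LIF to the delegated DNS zone and let it answer DNS queries
network interface modify -vserver vs1_eda_lib -lif vs1_eda_lib_data1 \
  -dns-zone lib.eda.local.com -listen-for-dns-query true
```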

Windows Server 2008 R2 for DNS

From the DNS Manager, create a new delegation for the host name lib. Do not append the fully

qualified domain name. The DNS server appends that automatically. Figure 8 shows a screen shot of this action.


Figure 8) Configuring the delegation on the DNS server.

Add all four IP addresses (listed in this example) one by one to host the delegated zone.


Resolving lib to Different IP Addresses by Using the On-Box DNS Round-Robin

C:\Users\Administrator>nslookup lib

Server: localhost

Address: 127.0.0.1

Non-authoritative answer:

Name: lib.eda.local.com

Address: 172.31.22.171

C:\Users\Administrator>nslookup lib

Server: localhost

Address: 127.0.0.1

Non-authoritative answer:

Name: lib.eda.local.com

Address: 172.31.22.173

C:\Users\Administrator>nslookup lib

Server: localhost

Address: 127.0.0.1

Non-authoritative answer:

Name: lib.eda.local.com

Address: 172.31.22.172

C:\Users\Administrator>nslookup lib

Server: localhost

Address: 127.0.0.1

Non-authoritative answer:

Name: lib.eda.local.com

Address: 172.31.22.171

C:\Users\Administrator>nslookup lib

Server: localhost

Address: 127.0.0.1

Non-authoritative answer:

Name: lib.eda.local.com

Address: 172.31.22.173

C:\Users\Administrator>


Mounts on the Compute Nodes

Because of the on-box DNS round-robin, compute nodes mount different IP addresses each time they

mount the chip design volumes by using the hostname lib.

Node 1:

lib:/VOL6 on /vol6-pnfs type nfs (rw,bg,rsize=65536,wsize=65536,hard,intr,proto=tcp,timeo=600,vers=4,minorversion=1,clientaddr=172.17.44.232,addr=172.31.22.170)

Node 2:

lib:/VOL6 on /vol6-pnfs type nfs (rw,bg,rsize=65536,wsize=65536,hard,intr,proto=tcp,timeo=600,vers=4,minorversion=1,addr=172.31.22.173,clientaddr=172.31.22.160)

NFSv4.1/pNFS Best Practices

Read and write file delegations should be enabled for NFSv4.1 to promote aggressive caching.

pNFS provides data locality. The volume can be accessed over a direct path from anywhere in the cluster.

There is no requirement to have a ratio of 1 LIF to 1 volume for NFSv4.1/pNFS with delegations, compared with the recommendation in section 6.4 for NFSv3.

If a volume is moved for capacity or workload balancing, there is no requirement to move or migrate the LIF around in the cluster namespace to provide local access to the volumes.

NFSv4.1 is a stateful protocol, unlike NFSv3. If there is ever a requirement to migrate an LIF, the I/O operations stall for 45 seconds to migrate the lock states over to the new location.

7 Other Features in Clustered Data ONTAP 8.2 for Virtuoso Liberate

Workloads

7.1 SnapVault

SnapVault was introduced for the first time in clustered Data ONTAP 8.2. SnapVault performs a logical replication at the volume level, asynchronously. In addition:

SnapVault does not generate excessive metadata while replicating the data.

The SnapVault destination can retain an asymmetric number of Snapshot copies relative to the origin.

Users can browse and restore single files (by using ndmpcopy) as well as specific volumes.

Storage efficiency is preserved with SnapVault. The SnapVault destination preserves the deduplication and compression of data when replicated from the origin.

SnapVault uses SVM peering and cluster peering to mirror data across different SVMs and clusters, respectively.

SnapVault SVM Peering Configuration

The following commands configure SVM peering to start the SnapVault process:


bumblebee::*> vserver peer show

There are no Vserver peer relationships.

bumblebee::*> vserver peer create -vserver vs1 -peer-vserver vs1_eda_lib -

applications snapmirror

Info: 'vserver peer create' command is successful.

bumblebee::*> vserver peer show

Peer Peer

Vserver Vserver State

----------- ----------- ------------

vs1 vs1_eda_lib peered

vs1_eda_lib vs1 peered

2 entries were displayed.

bumblebee::*> vol create -vserver vs1 -volume VOL06VAULT -aggregate

aggr1_fas6280c_svl07_1 -size 3t -type dp

(volume create)

[Job 16024] Job succeeded: Successful

bumblebee::*> snapmirror create -S vs1_eda_lib:VOL06 vs1:VOL06VAULT -type

XDP

Operation succeeded: snapmirror create the relationship with destination

vs1:VOL06VAULT.

bumblebee::*> snapmirror show

Progress

Source Destination Mirror Relationship Total Last

Path Type Path State Status Progress Healthy

Updated

----------- ---- ------------ ------- -------------- --------- ------- -----

---

vs1_eda_lib:VOL06

XDP vs1:VOL06VAULT

Uninitialized

Idle - true -
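The relationship above is still uninitialized. As a sketch of the remaining steps, following the standard SnapMirror workflow, a baseline transfer is started first and then updated incrementally:

```shell
# Perform the baseline transfer for the XDP (SnapVault) relationship
snapmirror initialize -destination-path vs1:VOL06VAULT

# Subsequent incremental updates (typically attached to a schedule)
snapmirror update -destination-path vs1:VOL06VAULT
```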

7.2 Storage QoS

Storage QoS provides another level of storage efficiency in which IOPS and bandwidth limits can be set

for workloads that are not critical or when setting up SLOs on different workloads. In EDA environments,

storage QoS plays an important role:

Rogue workloads can be isolated with proper IOPS and bandwidth limits set in a different QoS policy group for users who generate these kinds of workloads in a production environment. This can be done at an SVM, volume, or specific file level.

In an IT-managed cloud infrastructure, storage QoS helps to run multiple tenants with different service-level offerings. New tenants can be added to the existing ones as long as the storage platform has the headroom to handle all the workload requirements. Different workloads, such as builds, verifications, cell library characterization, and other EDA tools, can coexist on the same storage controller. Each of the individual workloads can have a different performance SLO assigned to it.

Storage QoS Configuration

A QoS policy group must be created for different SVMs in the cluster. In the following example, two QoS policy groups, business_critical and non_critical, are created with different bandwidth and IOPS limits:

bumblebee::*> qos policy-group create -policy-group business_critical -vserver

vs1_eda_lib -max-throughput 1.2GB/sec

bumblebee::*> qos policy-group create -policy-group non_critical -vserver vs1_eda_lib -max-throughput 2000IOPS

bumblebee::*> qos policy-group show

Name Vserver Class Wklds Throughput

---------------- ----------- ------------ ----- ------------

business_critical

vs1_eda_lib user-defined - 0-1.20GB/S

non_critical vs1_eda_lib user-defined - 0-2000IOPS

2 entries were displayed.

The volume CMSGE is then assigned the QoS policy group non_critical:

bumblebee::*> vol modify -vserver vs1_eda_lib -volume CMSGE -qos-policy-group

non_critical

(volume modify)

Volume modify successful on volume: CMSGE

The file writerandom.2g.88.log is then assigned to the non_critical QoS policy group. Note that you cannot set a QoS policy group on a file when the volume that holds that file already has a QoS policy group set on it. The QoS policy group on the volume must be removed before the policy can be set on a particular file in that volume:

bumblebee::*> file modify -vserver vs1_eda_lib -volume VOL06 -file

//OpenSPARCT1/Cloud_free_trial_demo/OpenSparc-

T1/model_dir/farm_cpu_test/writerandom.2g.88.log -qos-policy-group non_critical

bumblebee::*> qos workload show

Workload Wid Policy Group Vserver Volume LUN Qtree File

-------------- ----- ------------ -------- -------- ------ ------ -------------

CMSGE-wid12296 12296 non_critical vs1_eda_lib

CMSGE - - -

file-writerandom-wid11328

11328 non_critical vs1_eda_lib

VOL06

- -

/OpenSPARCT1/Cloud_free_trial_demo/OpenSparc-

T1/model_dir/farm_cpu_test/writerandom.2g.88.log

2 entries were displayed.
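Conversely, when a volume-level policy must be cleared before a policy can be assigned to a file inside that volume, the policy group is removed by setting it to none. A sketch using the standard QoS syntax:

```shell
# Remove the QoS policy group from a volume so that a file-level
# policy group can then be applied to a file in that volume
vol modify -vserver vs1_eda_lib -volume VOL06 -qos-policy-group none
```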


7.3 Nondisruptive Operation (NDO)

NDO completely changes the way that clustered Data ONTAP keeps data alive and available to the

application and the users who access the data. Disruptive scenarios were tested in the Cadence lab

under Virtuoso Liberate workloads to determine whether there was disruption to users at the application

layer:

When a data port was taken offline, the LIF IP address instantly failed over to another node in the cluster. This did not cause any outage for the user accessing the data under the load.

The chip library volume was moved to a different cluster node under the active Virtuoso Liberate load for capacity- and workload-balancing reasons. The volume and the LIF were moved to the new location in the cluster namespace without disrupting the user’s running jobs on the chip design volume.

Nondisruptive Operation with Volume Move

In this example, the volume VOL06 is moved from an aggregate in FAS6280-svl10 to an aggregate

in FAS6280-svl07 while the Virtuoso Liberate workload is in progress. There is no disruption to the application when the volumes are moved on the storage.

bumblebee::*> vol move start -vserver vs1_eda_lib -volume VOL06 -

destination-aggregate aggr1_fas6280c_svl07_1

(volume move start)

[Job 17268] Job is queued: Move "VOL06" in Vserver "vs1_eda_lib" to

aggregate "aggr1_fas6280c_svl07_1". Use the "volume move show -vserver

vs1_eda_lib -volume VOL06" command to view the status of this operation.

job show <job_id> can be used to check the status of the “vol move.”

bumblebee::*> job show 17268

Owning

Job ID Name Vserver Node State

------ -------------------- ---------- -------------- ----------

17268 Volume Move bumblebee fas6280c-svl10 Success

Description: Move "VOL06" in Vserver "vs1_eda_lib" to aggregate

"aggr1_fas6280c_svl07_1"

NDO can also be performed during hardware technical refreshes when all the volumes on an entire node can be evacuated to another cluster node and moved back nondisruptively to the new controllers after the refresh process.

Nondisruptive upgrades (NDUs) can also be performed on clustered Data ONTAP versions and the shelf and disk firmware without causing any outage to the application.

8 Compute Farm Optimization

The engineering compute farms in a chip design environment consist of tens of thousands of cores, which translates to hundreds or thousands of physical compute nodes. Virtualization is usually not deployed in these farms. Linux is the most commonly used operating system in the compute farm. Linux clients in the compute farm provide the number of cores that are required to process the number of jobs submitted.

For better client-side performance with clustered Data ONTAP 8.2, the cell library characterization

application and the schedulers, such as Sun Grid Engine (SGE) or Load Sharing Facility (LSF), must be

run on RHEL 5.8 and later or RHEL 6.5 and later. Cadence has validated and certified RHEL 6.5 for cell

library characterization tools.


8.1 RHEL 6.5 Clients in the Compute Farm

Why Deploy the RHEL 6.5 Kernel in the Compute Farm

RHEL 6.5 has more optimizations in the TCP stack to handle NFS requests, compared with earlier versions. RHEL 6.4 is also a generally available version with pNFS support for files. Versions earlier than RHEL 6.4 are not qualified to run NFSv4.1/pNFS.

Clustered Data ONTAP 8.2 and RHEL 6.5 have been tested and validated by NetApp's NFS engineering QA team along with Red Hat NFS engineering. Many bugs have been jointly scrubbed and fixed in the RHEL 6.5 kernel and in clustered Data ONTAP 8.2 as well. For more details, refer to TR-3183, “Using Red Hat Client with NetApp Storage over NFS.”

The RHEL 6.5 generally available kernel does not have all the Bugzilla fixes. You must perform a yum update, which downloads the latest errata (z-stream) kernel from Red Hat Network with the fixes for most of the known issues. After the kernel has been updated, the new kernel should show as 2.6.32-431.14.1.el6.x86_64.

How to Deploy the RHEL 6.5 Kernel in the Compute Farm

It will not be easy to upgrade hundreds of compute nodes to RHEL 6.5. RHEL 6.5 can be added as a new

deployment in the compute farm after the release becomes qualified to be used in your environment. The

file system exported from clustered Data ONTAP 8.2 can be mounted on various Linux kernels in your

compute farm. All your pre-RHEL 6.5 clients can continue mounting the file system over NFSv3, and the

same file system can also be mounted over NFSv4.1/pNFS on the RHEL 6.5 clients. This means that

NFSv3 and NFSv4.1/pNFS can coexist in the compute farm for one or more exported file systems.
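This coexistence needs no special export configuration; the same export is simply mounted with different options on different clients. A sketch using the mount options recommended in section 8.2, with the hostname and export path from the earlier examples:

```shell
# Pre-RHEL 6.5 client: mount the export over NFSv3
mount -t nfs -o vers=3,rw,bg,hard,rsize=65536,wsize=65536,proto=tcp,intr,timeo=600 \
  lib:/VOL6 /vol6-nfs3

# RHEL 6.5 client: mount the same export over NFSv4.1/pNFS
mount -t nfs -o vers=4,minorversion=1,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600 \
  lib:/VOL6 /vol6-pnfs
```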

EDA tool vendors such as Cadence have qualified and provide support for RHEL 6.5. Once RHEL 6.5 is qualified by Cadence for its cell library characterization applications on x86_64 kernels, NFSv4.1/pNFS will be just another protocol, like NFSv3, to be used in the Virtuoso Liberate environment.

The benefits of having both NFSv3 and NFSv4.1/pNFS protocols coexisting in the compute farm include:

No change is required for the existing compute nodes that mount the file systems over NFSv3. There is no disruption to the existing clients in the compute farm as more RHEL 6.5 nodes are added to scale the number of jobs. The same file system can be mounted over NFSv3 or NFSv4.1/pNFS from the new pNFS-capable RHEL 6.5 clients.

Based on the performance validation documented in section 5, NFSv4.1/pNFS provides a significant improvement in job completion times. Critical chip designs can be isolated from the rest to provide faster job completion times and better SLOs.

8.2 Best Practices for Compute Nodes

Considering the high volume of nodes in the compute farm, it is unrealistic to make significant changes

dynamically on each of the clients. Based on the Virtuoso Liberate workload evaluation, the following

recommendations on the Linux clients make a big contribution to improving the job completion times for

various chip design activities.


Compute Node Optimization for NFSv3 Mounts

Turn off hyperthreading on the BIOS setting of each of the Linux nodes.

Use the recommended mount options while mounting over NFSv3 on the Linux compute nodes: vers=3,rw,bg,hard,rsize=65536,wsize=65536,proto=tcp,intr,timeo=600.

Set sunrpc.tcp_slot_table_entries = 128; this increases the number of concurrent RPC requests that the client keeps in flight. This option is fine for pre-RHEL 6.4 kernels that mount over NFSv3. RHEL 6.4 and later, however, size the TCP slot table dynamically. Therefore, the following lines must be included when mounting file systems on a RHEL 6.5 kernel over NFSv3. (They are not required when mounting over NFSv4.1.) Without this cap, NetApp storage may have its network buffers depleted by a flood of RPC requests from Linux clients over NFSv3:

Create a new file: /etc/modprobe.d/sunrpc-local.conf

Add the following entry: options sunrpc tcp_max_slot_table_entries=128

If the compute nodes are using 10GbE connections, the following tuning options are required. These changes do not apply to clients that use 1GbE connections:

Disable irqbalance on the nodes:

[root@ibmx3650-svl51 ~]# service irqbalance stop

Stopping irqbalance: [ OK ]

[root@ibmx3650-svl51 ~]# chkconfig irqbalance off

Set net.core.netdev_max_backlog = 300000 to avoid dropped packets on a 10GbE connection.
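The client tunables above are lost on reboot unless they are persisted. A sketch of how this is commonly done on RHEL 6 (file paths are standard; the values are the ones recommended above):

```shell
# Persist the sysctl tunables across reboots
cat >> /etc/sysctl.conf <<'EOF'
sunrpc.tcp_slot_table_entries = 128
net.core.netdev_max_backlog = 300000
EOF
sysctl -p

# RHEL 6.4+ kernels size the RPC slot table dynamically; cap it via modprobe
echo "options sunrpc tcp_max_slot_table_entries=128" \
  > /etc/modprobe.d/sunrpc-local.conf
```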


Compute Node Optimization for NFSv4.1/pNFS Mounts for RHEL 6.5 Clients

Turn off hyperthreading on the BIOS setting of each of the Linux nodes.

Use the recommended mount options while mounting over NFSv4.1 on the Linux compute nodes:

vers=4,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600,minorversion=1
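After mounting with these options, it is worth confirming that the client actually negotiated pNFS. A sketch of one way to check on RHEL 6.5 (the nfs_layout_nfsv41_files module handles the files layout type):

```shell
# Mount over NFSv4.1 with the recommended options
mount -t nfs -o vers=4,minorversion=1,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600 \
  lib:/VOL6 /vol6-pnfs

# The files-layout module should be loaded when pNFS is in use
lsmod | grep nfs_layout_nfsv41_files

# LAYOUTGET operations in the per-mount statistics indicate pNFS activity
grep LAYOUTGET /proc/self/mountstats
```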

Set the ntp or time server on all the compute nodes:

[root@ibmx3650-svl50 /]# ntpdate -q 172.17.0.11

server 172.17.0.11, stratum 3, offset 40.629293, delay 0.02606

31 Jul 12:55:59 ntpdate[1567]: step time server 172.17.0.11 offset 40.629293

sec

[root@ibmx3650-svl50 /]# ntpdate 172.17.0.11

31 Jul 12:56:52 ntpdate[1568]: step time server 172.17.0.11 offset 40.629315

sec

[root@ibmx3650-svl50 /]# chkconfig ntpdate on

[root@ibmx3650-svl50 /]# service ntpd restart

Shutting down ntpd: [FAILED]

Starting ntpd: [ OK ]

Set the tuned-adm profile latency-performance for compute-intensive workloads. The profile changes the following parameters at the kernel level:

/sys/block/<device>/queue/scheduler (for example, /sys/block/sdd) is set to [deadline]; the default is [cfq].

In /etc/sysconfig/cpuspeed, GOVERNOR is set to performance; by default it is unset. This uses the performance governor for p-states through cpuspeed.

In RHEL 6.5 and later, the profile requests a cpu_dma_latency value of 1.

If the compute nodes are using 10GbE connections, the following tuning options are required. These changes do not apply to clients that use 1GbE connections:

Disable irqbalance on the nodes:

[root@ibmx3650-svl51 ~]# service irqbalance stop

Stopping irqbalance: [ OK ]

[root@ibmx3650-svl51 ~]# chkconfig irqbalance off

Set net.core.netdev_max_backlog = 300000 to avoid dropped packets on a 10GbE connection.

9 Summary

Cell or circuit design is getting more complicated with respect to size and yield optimization on the silicon layers. As more and more sub-20nm chips are designed and manufactured for different consumer products, characterization of standard cell libraries becomes critical for profiling the characteristics and behavior of the functions of a chip design across a broad range of operating conditions.

It is imperative to expedite the characterization process of the standard cell libraries to improve the overall

chip design process time. Storing, accessing, and managing all the cell libraries in a shared storage

infrastructure require low latency, high reliability, efficiency, and a single pane of manageability of the cell

library data. All the validations and best practices listed in this report clearly indicate that NetApp

clustered Data ONTAP 8.2, with the recommended storage optimizations and sizing, can accelerate the

cell library characterization process.


The main objective for integrating Cadence’s Virtuoso Liberate tool with the NetApp clustered file system

is to improve the job completion time at the application layer. Another very important factor that is

conducive to the chip design process is the fact that NetApp clustered Data ONTAP 8.2, with adequate

storage optimization and sizing, can improve the performance of workloads generated from the various

design and characterization tools that coexist in the chip design and manufacturing process. With QoS,

workloads can be tied to different SLOs for tools running on the same node or in different cluster nodes in

a scale-out architecture.

NetApp clustered Data ONTAP 8.2 also allows load balancing of the cell library volume in the cluster namespace: the volume can be moved to a different controller node while under load, without any disruption to the application.

10 Conclusion

There is a high level of complexity in the cell library characterization process where foundry process,

voltage, and temperature are validated and modeled for smaller silicon surface areas that are designed to

perform different functions. As Cadence keeps optimizing the Virtuoso Liberate tool in every new release,

it is important that the compute nodes, the storage, and the protocol all contribute to the overall job

completion time.

NetApp originally set out to improve the job completion time for the Virtuoso Liberate application with

clustered Data ONTAP 8.2. With all the validations and optimizations at the compute nodes, NFS

protocol, and storage layers, we ended up achieving improvement of up to 15% for cell library

characterization with NFSv4.1/pNFS. These improvements provide a huge benefit on top of what we

already provide with Data ONTAP operating in 7-Mode.

This result leads to the conclusion that sizing adequately, optimizing the storage along with regular file-

system maintenance, and choosing the right protocol can improve cell library characterization

performance significantly. This translates into two important factors that drive business in the EDA

industry:

Improved ROI with optimized license costs

Faster time to market


NetApp provides no representations or warranties regarding the accuracy, reliability, or serviceability of any information or recommendations provided in this publication, or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS, and the use of this information or the implementation of any recommendations or techniques herein is a customer’s responsibility and depends on the customer’s ability to evaluate and integrate them into the customer’s operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.

© 2014 NetApp, Inc. All rights reserved. No portions of this document may be reproduced without prior written consent of NetApp, Inc. Specifications are subject to change without notice. NetApp, the NetApp logo, Go further, faster, Data ONTAP, Flash Cache, OnCommand, RAID-DP, SnapMirror, Snapshot, SnapVault, vFiler, and WAFL are trademarks or registered trademarks of NetApp, Inc. in the United States and/or other countries. Cadence, the Cadence logo, and Virtuoso are registered trademarks and Liberate is a trademark of Cadence Design Systems, Inc. Linux is a registered trademark of Linus Torvalds. Microsoft, Windows, and Windows Server are registered trademarks of Microsoft Corporation. All other brands or products are trademarks or registered trademarks of their respective holders and should be treated as such. TR-4270-0214

Refer to the Interoperability Matrix Tool (IMT) on the NetApp Support site to validate that the exact product and feature versions described in this document are supported for your specific environment. The NetApp IMT defines the product components and versions that can be used to construct configurations that are supported by NetApp. Specific results depend on each customer's installation in accordance with published specifications.