Luncheon Webinar Series – December 18th, 2015
How to get started with DataStage (aka IBM InfoSphere
Information Server) running natively on Hadoop
presented by Beate Porst
Sponsored By:
How to get started with DataStage (aka IBM InfoSphere Information Server) running natively on Hadoop
Questions and suggestions regarding presentation topics? - send to
Downloading the presentation
• http://www.dsxchange.net/20151218dsx.html
• Replay will be available within one day with email with details
Pricing and configuration - send to [email protected] Subject line : Pricing
For those that stay through the entire presentation, we have an extra giveaway!
Bonus Offer – Free premium membership for your DataStage management! Submit
your management’s email address and we will offer them access on your behalf.
• Email [email protected] subject line “Managers special”.
• Join us all at Linkedin http://tinyurl.com/DSXmembers
© 2015 IBM Corporation
How to get started with DataStage
v11.5 running natively on Hadoop
December, 2015
Beate Porst ([email protected])
Product Manager, IBM InfoSphere Information Server
Agenda
• Quick Introduction into InfoSphere Information Server v11.5
• Architecture and System topologies for Information Server on Hadoop
• Installation & Setup
• Performance Observations
• Q&A
… powered by Information Server
Integrating and transforming data and content to deliver
accurate, consistent, timely and complete information on a
single platform unified by a common metadata layer
Information Empowerment for your Data Ecosystem
Information Governance Catalog – Understand & Collaborate
• Catalog technical metadata & align w/ business language
• Manage (big) data lineage
• New compliance reporting
Data Quality – Cleanse & Monitor
• Analyze & validate w/ enhanced classification
• Cleanse & standardize
• Define, manage & monitor data rules + exceptions
Data Integration – Transform & Deliver
• Massive scalability
• Power for any complexity
• Deliver in batch and/or real-time with change capture
Shared foundation: common connectivity • shared metadata • security (new data privacy functions included) • common execution engine with flexible deployments (new native MPP runtime on Hadoop)
Information Server Release History (2006–2015)
Timeline of releases: 8.0, 8.1, 8.5, 8.7, 9.1, 9.1.2, 11.3, 11.5
• 11.5 GA: 9/25 (new)
• Earlier releases approaching end of support: EOS 9/2016 and EOS 4/2017
Information Server Recent Activity (2012–2015)

9.1
• Business Driven Governance: policy and rules support for information governance; web-based blueprints; integrated metadata mgmt enhancements
• Sustainable Quality: Data Quality Console; Standardization Rules Designer; data rules advancements
• Agile Integration: InfoSphere Data Click; enhanced Workload Mgmt; ODM integration; Hadoop Balanced Optimization; HDFS extensions

9.1.2
• Business Driven Governance: IDA 8.5; additional workflow roles; data rules metadata; bulk metadata import
• Sustainable Quality: profiling big data; Exception Stage; new QS standardization rulesets
• Agile Integration: big data features (JSON support, JDBC connector); DB2 on z/OS load optimization; Data Click new data sources/targets

11.3
• Business Driven Governance: Info Governance Catalog; Shop for Data; Smart Hover; Collect & Share; Lineage@Scale
• Sustainable Quality: Governance Dashboard integration; performance optimizations; productivity enhancements; global geocoding
• Agile Integration: self-service data integration; cloud connectors; MDM integration; sort compress; Hadoop currency; Greenplum Connector

11.3.1 (+FP1)
• Business Driven Governance: Subscription Manager; Stewardship Center (w/ BPM); term custom attributes; customizable attribute display; Lineage Admin Console; prebuilt governance content; IGC data classification
• Sustainable Quality: data quality exception management updates; exception SQL views; Stewardship Center data remediation workflow; data classification; global geocoding support
• Agile Integration: Cognos TM1 connector and metadata import; HDFS secure connector; IDAA pushdown support; hypervisor support for v11.3.1; BigInsights v4 support
Summary: Information Server v11.5 (GA 2015; fix packs FP1 and FP2 into 2016)
• Platform Extensions: native execution on Hadoop; in-place upgrade v11.3.1 → v11.5
• Business Driven Governance: Governance Catalog extensible framework; column-level lineage for Hadoop files; multi-language support; XML Schema Definition support; data class definitions; asset interchange for extended lineage content
• Sustainable Quality: enhanced data classification; address verification and enrichment advancements
• Agile Integration: data integration running natively on Hadoop; automatic HDFS metadata import; comprehensive and fast HDFS connectivity; out-of-the-box database pushdown; out-of-the-box ERP Pack support; embedded sensitive data protection
V11.5 Detailed Capability Comparison
Offerings compared: InfoSphere Information Governance Catalog; InfoSphere Information Server for Data Integration; InfoSphere Information Server for Data Quality; InfoSphere Information Server Enterprise Edition; BigInsights BigIntegrate (new offering); BigInsights BigQuality (new offering).
Capabilities (numbers refer to the footnotes below; the per-offering checkmark matrix is not preserved in this text version):
• Business Glossary (1)
• Metadata Management and Lineage (1)
• Logical and Physical Data Modeling
• Data Cleansing and Enrichment
• Data Quality Validation & Monitoring
• Data Stewardship
• SOA Deployment
• Data Specification Mapping
• Extract, transform, load (ETL)
• Change Data Delivery (2)
• Self Serve Data Access
• Data Masking (5)
• View reports in Cognos (3)
• IBM BigInsights included (4)
• Runs natively in Hadoop
Footnotes:
1. Limited to 250 assets (any combination of glossary terms, categories, information governance policies and information governance rules)
2. One database source or capture agent, excluding z/OS; must be used with DataStage as the target
3. View-only access for any pre-defined report provided for Information Server
4. Maximum of a 5-node cluster of IBM BigInsights Data Scientist v4.1 installed in support of Information Server
5. Requires additional entitlement for Optim ODPP
Separate add-on purchases: data replication, ERP connectors (SAP, SAS), postal address verification / geocoding
Key Use Cases for Data Integration on Hadoop
• Data Reservoir & Logical Warehouse: modernize the warehouse architecture through the Data Reservoir, improving efficiency (TCO) and extending analytics
• Enhanced 360º view: enhance insight into key business entities (e.g. customer) by integrating and correlating new data sources and building an integrated view
• Warehouse Offloading: improve the efficiency of existing warehouse investments by offloading “dark data” or augmenting it with sandboxes
• Exploratory Analysis: discover & explore new insights more rapidly and in a more agile & iterative manner
Common to all four (across HDFS, MDM and warehouse systems): Integrate | Transform | Cleanse | Govern
Information Server – BigIntegrate
Ingest, transform, process and deliver any data into & within Hadoop. Satisfy the most complex transformation requirements with the most scalable runtime available, in batch or real time.
Connect
• Connect to a wide range of traditional enterprise data sources as well as Hadoop data sources
• Native connectors with the highest level of performance and scalability for key data sources
Design & Transform
• Transform and aggregate any data volume
• Benefit from hundreds of built-in transformation functions
• Leverage metadata-driven productivity and enable collaboration
Manage & Monitor
• Use a simple, web-based dashboard to manage your runtime environment
Information Server – BigQuality
Analyze, cleanse and monitor your big data: the most comprehensive data quality capabilities that run natively on Hadoop.
Analyze
• Discovers data of interest to the organization based on business-defined data classes
• Analyzes data structure, content and quality
• Automates your data analysis process
Cleanse
• Investigate, standardize, match and survive data at scale and with the full power of common data integration processes
Monitor
• Assess and monitor the quality of your data in any place and across systems
• Align quality indicators to business policies
• Engage the data steward team when issues exceed the thresholds of the business
Information Server on Hadoop Offering
• The most scalable transformation, data integration and data quality engine now runs natively on Hadoop
• Runs 10x-20x faster than MapReduce
• Get enterprise-class transformation and cleansing for your Hadoop data
• Use the power of your Hadoop cluster to integrate, transform & cleanse data without writing a single line of code
• Hadoop distribution currency:
– BigInsights 4.0 & 4.1
– HortonWorks 2.2 & 2.3
– Cloudera 5.3 & 5.4
Optimize your integration/transformation and data quality workload based on data locality and resource availability.
Design your integration, data preparation or cleansing once and run it on your Hadoop cluster, on your traditional engine, or optimize it to run on your database.
Native Hadoop Runtime
Information Server on Hadoop Features
• Full support for Information Analyzer, QualityStage, DataStage and Data Click jobs
• Support for Kerberos-enabled clusters
• Full Edge/Client node support for Engine Tier install
• Automatic binary distribution (if not detected) to data nodes or NFS mount
• Data locality support for HDFS file reads (e.g. BDFS, DataSet etc.)
• Container size estimation
• Visibility in the DS job log (Hadoop tracking URL) & YARN job browser
• Support for Hadoop Node Labels
• Support for YARN scheduler queues
• Support for ODP distributions (BigInsights, HortonWorks, Pivotal etc.) and Cloudera
RUNTIME ARCHITECTURE & DEPLOYMENT OPTIONS
System Topology
(Diagram: a Hadoop cluster of DataNodes, each with /opt/IBM/InformationServer; the IS Engine Tier on a Hadoop Edge Node at /opt/IBM/InformationServer; IS Service Tier, IS Metadata Repository Tier and IS Client Tier alongside.)
• The IS Engine Tier is installed on a Hadoop Edge Node
• All other IS tiers can be on the Edge Node or outside the cluster
• Information Server binaries live on all DataNodes that will run DataStage jobs
• Information Server binaries are copied to DataNodes at job run time using HDFS if the binaries don't already exist
Grid Deployments on and off Hadoop
(Diagram: a stand-alone Information Server grid next to an Information Server grid on Hadoop.)
Information Server on Hadoop: Deployment Models
Three different deployment models for Information Server within a typical Hadoop environment
One Information Server Instance – Multiple Engines, On and Off Hadoop
(Diagram: shared Services & Repository serving a PX Engine “Stand-alone” running DS Project A and a PX Engine “On Hadoop” running DS Project B.)
Requirement:
• All components need to be v11.5 (no version mix between components)
DataStage Job Runtime Architecture on Hadoop
(Diagram: each DataNode runs a Section Leader and Players 1..N inside YARN containers, with /opt/IBM/InformationServer local; the Hadoop Edge Node hosts the IS Engine Tier with the Conductor and IS YARN Client; an IS Application Master runs in the cluster; the IS Service Tier, IS Metadata Repository Tier and IS Client Tier sit outside.)
1. Jobs are submitted from an IS Client
2. The Conductor asks the IS YARN Client for an Application Master (AM) to run the job
3. The IS YARN Client manages the IS AM pool, starting new AMs when necessary
4. The Conductor passes the IS AM resource requirements and commands to start Section Leaders; the IS AM gets containers from the YARN Resource Manager (not pictured)
5. YARN Node Managers (NM) on the DataNodes start YARN containers with Section Leaders
6. Section Leaders connect back to the Conductor and start Players
INSTALLATION & SETUP

Installation – Edge Node Provisioning
• The edge node is provisioned through Ambari (pictured), Cloudera Manager, or manually
• The required clients to install are HDFS and YARN
• Validate by running yarn and hdfs commands
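The validation step above can be sketched as a small shell check. This is a sketch under the assumption that the HDFS and YARN client binaries should already be on the PATH of a correctly provisioned edge node; the helper function name is illustrative, not from the slides.

```shell
# Report any required Hadoop client commands missing from the PATH.
# Prints "missing: none" when both the hdfs and yarn clients are found.
check_clients() {
  missing=""
  for cmd in hdfs yarn; do
    command -v "$cmd" >/dev/null 2>&1 || missing="$missing $cmd"
  done
  echo "missing:${missing:- none}"
}
check_clients

# On a correctly provisioned edge node, validate further with:
#   yarn version       # prints the Hadoop build the YARN client belongs to
#   hdfs dfs -ls /     # confirms the HDFS client can reach the NameNode
```

The `yarn version` and `hdfs dfs -ls` calls are left as comments because they only succeed against a live cluster.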
Installation – Information Server on Hadoop
• Information Server tiers are installed in the typical fashion through the IBM Information Server installer
Validate the Engine Tier install:
• Make sure a simple job with a Transformer stage can compile and run locally
• Run with the default config file on the local node
• Don't run on Hadoop yet (leave APT_YARN_CONFIG unset for now)
Creating Local Information Server Binary Paths
• Create /opt/IBM/InformationServer on each DataNode that will run jobs
• Currently a manual step, since jobs don't run as root
• Be careful to create the paths with the correct permissions
• Cluster settings affect who the owner should be
Setting up Users on Hadoop
• Gather the user & group names that will run jobs
• Create HDFS home directories for those users:
– sudo -u hdfs hadoop fs -mkdir /user/InfoSphere_Information_Server_user_name
– sudo -u hdfs hadoop fs -chown InfoSphere_Information_Server_user_name:InfoSphere_Information_Server_user_group /user/InfoSphere_Information_Server_user_name
– E.g., to create a user folder for the user dsadm, issue:
• sudo -u hdfs hadoop fs -mkdir /user/dsadm
• sudo -u hdfs hadoop fs -chown dsadm:dstage /user/dsadm
• Additional settings might be required if not running on an edge node
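For several job-running users at once, the mkdir/chown pair above can be generated in a loop. The user names below are hypothetical placeholders (only dsadm/dstage appear in the slides); the sketch prints the commands for review rather than executing them, so nothing touches HDFS until you run the output yourself.

```shell
# Hypothetical user list and group -- substitute your site's names.
IS_USERS="dsadm etluser1"
IS_GROUP="dstage"

# Print the HDFS home-directory commands for each user; review them,
# then run them by hand (or pipe to sh) on a node with the hadoop client.
for u in $IS_USERS; do
  echo "sudo -u hdfs hadoop fs -mkdir /user/$u"
  echo "sudo -u hdfs hadoop fs -chown $u:$IS_GROUP /user/$u"
done
```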
Starting the Information Server YARN Client
• Can be started manually using PXEngine/etc/yarn_conf/start-pxyarn.sh
• Will be started automatically with the first job run on Hadoop
• Starts 2 Application Masters by default; tunable with APT_YARN_AM_POOL_SIZE
• Troubleshoot with PXEngine/logs/yarn_logs/yarn_client_out.0
(Diagram: the IS YARN Client on the edge node manages two IS Application Masters in the Hadoop cluster.)
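A manual start-and-check sequence for the IS YARN client might look like the sketch below. The PXEngine root is the default install path used throughout the slides; adjust it if your install differs. The start and tail commands are left as comments because they only make sense on a configured engine tier.

```shell
# Default PX engine root from the slides; change if installed elsewhere.
PXHOME="/opt/IBM/InformationServer/Server/PXEngine"

START_SCRIPT="$PXHOME/etc/yarn_conf/start-pxyarn.sh"
CLIENT_LOG="$PXHOME/logs/yarn_logs/yarn_client_out.0"

# Start the client by hand (it would otherwise start with the first Hadoop job):
#   "$START_SCRIPT"
# and watch the client log while the Application Masters come up:
#   tail -f "$CLIENT_LOG"
echo "start script: $START_SCRIPT"
echo "client log:   $CLIENT_LOG"
```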
Create a Static Configuration File with All Cluster Nodes
• This will localize binaries on all nodes with the first job run

{
  node "conductor_node"
  {
    fastname "myconductor.mycompany.com"
    pools "conductor" "export"
    resource disk "/data" {pool "" "export" "conductor_node"}
    resource scratchdisk "/scratch" {}
  }
  node "node0"
  {
    fastname "compute1.mycompany.com"
    pools ""
    resource disk "/data" {pool "" "export" "node0"}
    resource scratchdisk "/scratch" {}
  }
  node "node1"
  {
    fastname "compute2.mycompany.com"
    pools ""
    resource disk "/data" {pool "" "export" "node1"}
    resource scratchdisk "/scratch" {}
  }
}
Validate Running on Hadoop
• Make sure a simple job with a Transformer stage can run on Hadoop
• Run with the static config file on all nodes:
– APT_YARN_CONFIG=/opt/IBM/InformationServer/Server/PXEngine/etc/yarn_conf/yarnconfig.cfg
– APT_YARN_MODE=true (set in yarnconfig.cfg)
How Binary Localization Works
• Binaries are cached in HDFS by the IS YARN Client on startup
• Jobs localize the binaries from the HDFS cache if they don't exist at job run time
• Requires ~4 GB of space in /tmp
• Tunable with APT_YARN_BINARY_COPY_MODE
(Diagram: the IS YARN Client on the edge node seeds the HDFS cache; DataNodes pull /opt/IBM/InformationServer from it as needed.)
Dynamic Configuration Files
• Dynamic configuration files take advantage of resource management and HDFS for DataSets
• Predefined dynamic config file: /opt/IBM/InformationServer/Server/dynamic_config

{
  node "conductor_node"
  {
    fastname "myconductor.mycompany.com"
    pools "conductor" "export"
    resource disk "/data" {pool "" "export" "conductor_node"}
    resource scratchdisk "/scratch" {}
  }
  node "node0"
  {
    fastname "$host"
    pools ""
    resource disk "/data" {pool "" "export" "node0"}
    resource scratchdisk "/scratch" {}
  }
  node "node1"
  {
    fastname "$host"
    pools ""
    resource disk "/data" {pool "" "export" "node1"}
    resource scratchdisk "/scratch" {}
  }
}

With "$host", node assignment is resolved at run time; DataSets can reside on HDFS or local disk.
The Information Server YARN Config File (yarnconfig.cfg)
Located in /opt/IBM/InformationServer/Server/PXEngine/etc/yarn_conf/yarnconfig.cfg
• APT_YARN_MODE=true – if defined and set to 1 or true, runs the given PX job on the local Hadoop install in YARN mode
• APT_YARN_CONTAINER_SIZE=64 – size in MB of the containers requested to run PX Section Leader and Player processes; the default is 64 MB if not set
• APT_YARN_CONTAINER_VCORES=0 – number of virtual cores the containers request; the default is 0, which means "don't set it"
• APT_YARN_AM_CONTAINER_SIZE=256 – size in MB of the container requested to run the PX Application Master process; the default is 256 MB if not set
• APT_YARN_AM_POOL_SIZE=2 – number of pre-started Application Masters; the default is 2
• APT_YARN_NODE_LABEL_EXPR= – the node label that Information Server jobs should use when submitted to the YARN scheduler
• APT_YARN_SCHEDULER_QUEUE= – the default queue that Information Server jobs should use when submitted to the YARN scheduler; the default is empty, which uses the default scheduler queue
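Putting the settings above together, a minimal yarnconfig.cfg sketch using the documented defaults might look like this (values are the defaults listed above, not tuning recommendations):

```text
APT_YARN_MODE=true
APT_YARN_CONTAINER_SIZE=64
APT_YARN_CONTAINER_VCORES=0
APT_YARN_AM_CONTAINER_SIZE=256
APT_YARN_AM_POOL_SIZE=2
APT_YARN_NODE_LABEL_EXPR=
APT_YARN_SCHEDULER_QUEUE=
```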
DataStage Job Run-Time Logs
(Screenshot: the DataStage job log records the YARN client connection, the Hadoop tracking URL, the Application Master connection, YARN container allocation, and the job processes running.)
DataStage Job Runtime – Hadoop Console
(Screenshot: the Hadoop console shows the DataStage Application Master, the Information Server application at run time, and the resources allocated to each container.)
Using Hadoop Node Labels
(Diagram: DataNodes labelled IISNode run Information Server work, while e.g. GPUNode nodes serve other workloads.)
• Separate application workloads
• Supported by Apache Hadoop 2.6, HDP 2.2, CDH 5.4, IOP 4.0
• The IIS node label can be controlled by a Hadoop scheduler queue or passed with jobs
• Unlabelled nodes are available to any application, depending on queue configuration
• Not yet supported for the Fair Scheduler (YARN-2497)
• Apache Hadoop 2.8 allows borrowing nodes to increase cluster utilization
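Defining the label on the cluster side might look like the sketch below. The label and node name are hypothetical, and the exact `yarn rmadmin` syntax varies across Hadoop 2.x releases, so check your distribution's documentation; the commands are printed for review rather than executed.

```shell
# Hypothetical label and node name -- adjust for your cluster.
LABEL="IISNode"
NODE="datanode1.mycompany.com"

# Register the label, then attach it to a node (syntax varies by version):
echo "yarn rmadmin -addToClusterNodeLabels $LABEL"
echo "yarn rmadmin -replaceLabelsOnNode $NODE=$LABEL"

# Then point Information Server jobs at the label, e.g. in yarnconfig.cfg:
#   APT_YARN_NODE_LABEL_EXPR=IISNode
```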
HDFS Data Replication
• An IIS job writes two partition data files, P1 and P2
• One block will always reside local to the writing node
• Other blocks are replicated based on the HDFS rack-awareness algorithm
• The number of replicas depends on the HDFS configuration (default = 3)
• An IIS job that reads P1 and P2 requests to run local to the blocks
• The job will read a block from another node if locality isn't possible
(Diagram: blocks 1 and 2 of P1 and P2 spread across the DataNodes.)
HADOOP / YARN Environment Settings

• yarn.log-aggregation-enable – Manages YARN log files. Set this parameter to false if you want the log files stored in the local file system. Default: true. Recommended: false.
• yarn.nodemanager.log.retain-seconds – Specifies the duration in seconds that Hadoop retains container logs. Default: 10800.
• yarn.nodemanager.pmem-check-enabled – Determines whether physical memory limits exist for containers. If set to true, a job is stopped if a container uses more than the physical memory limit that you specify. Set this parameter to false if you do not want jobs to fail when containers consume more memory than they are allocated. Default: true.
• yarn.nodemanager.resource.memory-mb – Sets the amount of physical memory that can be allocated for containers. Default: 8192 MB.
• yarn.nodemanager.vmem-check-enabled – Determines whether virtual memory limits exist for containers. If set to true, the job is stopped if a container uses more than the virtual limit that you specify. Set this parameter to false if you do not want jobs to fail when containers consume more memory than they are allocated. Default: true.
• yarn.nodemanager.vmem-pmem-ratio – Sets the ratio of virtual memory to physical memory limits for containers. If yarn.nodemanager.vmem-check-enabled is set to true, jobs might be stopped by YARN if the ratio of virtual memory a container consumes to its physical memory is greater than the ratio you specify. Default: 2.1.
• yarn.resourcemanager.nodemanagers.heartbeat-interval-ms – Controls the start time for parallel jobs. For clusters that have fewer than 50 nodes, 1000 ms is often too long and leads to longer start times for parallel jobs. You can set this value to 50 ms to ensure parallel jobs start in a timely manner. Default: 1000 ms. Recommended: 50 ms.
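Expressed as a yarn-site.xml fragment, the two recommended changes from the settings above might look like this sketch (on Ambari- or Cloudera-managed clusters, apply the change through the management console rather than by editing files directly):

```xml
<!-- Keep container logs on the local file system -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>false</value>
</property>
<!-- Faster heartbeats so parallel jobs start promptly on small clusters -->
<property>
  <name>yarn.resourcemanager.nodemanagers.heartbeat-interval-ms</name>
  <value>50</value>
</property>
```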
HADOOP / YARN Environment Settings (continued)

• yarn.scheduler.capacity.maximum-am-resource-percent – Specifies the maximum percentage of resources for all queues in the cluster that can be used to run Application Masters, and controls the number of concurrently active applications. Defaults vary between distributions of Hadoop.
• yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent – Specifies the maximum percentage of resources for a single queue in the cluster that can be used to run Application Masters, and controls the number of concurrently active applications. Defaults vary between distributions of Hadoop.
• yarn.scheduler.increment-allocation-mb – Indicates how much the container size can be incremented. If you submit tasks with resource requests lower than the minimum-allocation value, the requests are set to the minimum-allocation value. Default: 512 MB on Cloudera.
• yarn.scheduler.minimum-allocation-mb – Helps conserve resources on the cluster by setting the minimum amount of memory that can be requested for a container. The default container size for parallel processes is 64 MB. Note: if changing the yarn.scheduler.minimum-allocation-mb value with Ambari 2.1, you must specify whether the changes should be applied to the MapReduce-specific resource settings. If you are significantly reducing the value, do not change the MapReduce values based on the new value, because that could cause MapReduce jobs to fail. Default: 1024 MB for most Hadoop distributions. Recommended: 256 MB or less.
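The scheduler-side recommendation above could likewise be sketched as a yarn-site.xml fragment (on managed clusters, change it through Ambari or Cloudera Manager, and note the Ambari 2.1 caveat about MapReduce settings before lowering the value):

```xml
<!-- Allow containers as small as 256 MB so 64 MB PX processes
     are not rounded up to a full 1024 MB container -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>256</value>
</property>
```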
PERFORMANCE OBSERVATIONS

Performance Observations – Running Information Server jobs natively on Hadoop / YARN
• Running Information Server jobs natively under YARN scales out linearly!
– Throughput doubles if the number of Hadoop data nodes doubles
• YARN introduces some overhead for job startup time
– Job startup time is slightly slower than a non-YARN startup
• Storing data on HDFS is up to 13% slower than native OS storage
• Observations when running a realistic DataStage workload on a YARN-managed Hadoop cluster:
– Using static configuration files:
• Performance running on/off Hadoop is similar (for similar resources)
• This is mostly because DataStage-specific files don't need to be stored on HDFS, as jobs run on statically defined nodes
– Using dynamic configuration files:
• We observed a performance penalty on Hadoop of up to 13% due to the HDFS usage
• Storing data on HDFS is significantly slower than native OS storage due to factors such as the replication factor
Test System Topology
(Diagram: a BigInsights cluster with a Master Node and Data Nodes 1..N; a DB2 server hosting the data warehouse for the TPC-DI workload; Information Server providing Services, Repository and Engine.)
• Number of systems: 11
• The specs for each box are identical (IBM xSeries High Volume Racks x3630 M4):
– CPU: 32 cores (4 Sandy Bridge-EP sockets, each with 8 cores)
– Memory: 64 GB
– Disk: 14 x 1 TB
– Network: interconnected with 10 GbE

Scale-Out Test
• DataStage throughput doubled when doubling the number of Hadoop data nodes
(Chart: TPC-DI workload performance in different modes.)
Q&A
Where to get more information
• Product documentation: IBM Information Server Knowledge Center:
– http://www-01.ibm.com/support/knowledgecenter/SSZJPZ_11.5.0/com.ibm.swg.im.iis.ishadoop.nav.doc/containers/cont_iisinfsrv_hadoop.html?lang=en
– Remember: BigIntegrate / BigQuality are only offerings – the actual product is Information Server
• Tutorial on how to set up Information Server on Hadoop on Cloudera CDH 5.4:
– https://app.box.com/s/b0wonh8vv5bn8g8eaaj76cy7deui27cx
• Contact: Beate Porst ([email protected]) – Product Manager, Data Integration
Q&A
• What are IBM BigInsights BigIntegrate & IBM BigInsights BigQuality?
– These are offerings (specific bundles/licenses/prices) for your Hadoop data integration & data quality needs. These offerings are powered by InfoSphere Information Server, now running natively on Hadoop / YARN.
• Which Hadoop distributions are supported?
– ODP distributions (e.g. IBM BigInsights, HortonWorks, Pivotal) and Cloudera, running on Linux (x86).
• Can I connect (read/write) to data sources outside of Hadoop?
– Yes, you can connect to pretty much any data source accessible by Information Server (from mainframe to cloud).
• Where will data transformation / quality processes run?
– Processes will run on any/all of the data nodes in the Hadoop distribution on which the product is installed. The number of data nodes used to run a particular job depends on the partitioning level associated with the job at job startup (configuration file).
• Do I need to know how to write Java, HiveQL, Pig or any other programming language to create data integration or quality processes?
– No, data integration and quality processes are designed using an intuitive graphical design interface. You compose your transformation logic out of pre-built operators (think of them as LEGO bricks) that you hook together to form a final flow of data.
Q&A
• Will I be able to get data lineage or impact analysis for jobs running on Hadoop?
– Yes. Information Server on Hadoop utilizes Information Server's shared metadata feature, which automatically captures design & operational metadata and deduces data lineage and dependency analysis no matter where the job runs.
• Is Information Server on Hadoop using MapReduce?
– No, jobs are processed by the Information Server Parallel Execution Engine, which is a highly scalable MPP (cluster) engine. Each data node has a copy of the PX engine libraries, so a job can run in parallel on multiple data nodes.
• Are the BigIntegrate & BigQuality offerings the only option to license Information Server on Hadoop?
– No, any of the Information Server v11.5 offerings can be deployed on Hadoop.
• Is the Information Server Parallel Execution Engine (PX) faster than Spark?
– The IBM PX engine and Spark are both high-performance MPP cluster computing engines. In internal tests, we have seen many use cases, specifically when processing large volumes of data, where the IBM PX engine was more performant than Spark.
THANK YOU