
Informatica PowerCenter (Version 9.5.1)

Big Data Edition

Informatica PowerCenter Big Data Edition

Version 9.5.1
December 2012

Copyright (c) 2012 Informatica. All rights reserved.

This software and documentation contain proprietary information of Informatica Corporation and are provided under a license agreement containing restrictions on use and disclosure and are also protected by copyright law. Reverse engineering of the software is prohibited. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. This Software may be protected by U.S. and/or international Patents and other Patents Pending.

Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable software license agreement and as provided in DFARS 227.7202-1(a) and 227.7702-3(a) (1995), DFARS 252.227-7013(c)(1)(ii) (OCT 1988), FAR 12.212(a) (1995), FAR 52.227-19, or FAR 52.227-14 (ALT III), as applicable.

The information in this product or documentation is subject to change without notice. If you find any problems in this product or documentation, please report them to us in writing.

Informatica, Informatica Platform, Informatica Data Services, PowerCenter, PowerCenterRT, PowerCenter Connect, PowerCenter Data Analyzer, PowerExchange, PowerMart, Metadata Manager, Informatica Data Quality, Informatica Data Explorer, Informatica B2B Data Transformation, Informatica B2B Data Exchange, Informatica OnDemand, Informatica Identity Resolution, Informatica Application Information Lifecycle Management, Informatica Complex Event Processing, Ultra Messaging and Informatica Master Data Management are trademarks or registered trademarks of Informatica Corporation in the United States and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks of their respective owners.

Portions of this software and/or documentation are subject to copyright held by third parties, including without limitation: Copyright DataDirect Technologies. All rights reserved. Copyright © Sun Microsystems. All rights reserved. Copyright © RSA Security Inc. All Rights Reserved. Copyright © Ordinal Technology Corp. All rights reserved. Copyright © Aandacht c.v. All rights reserved. Copyright Genivia, Inc. All rights reserved. Copyright Isomorphic Software. All rights reserved. Copyright © Meta Integration Technology, Inc. All rights reserved. Copyright © Intalio. All rights reserved. Copyright © Oracle. All rights reserved. Copyright © Adobe Systems Incorporated. All rights reserved. Copyright © DataArt, Inc. All rights reserved. Copyright © ComponentSource. All rights reserved. Copyright © Microsoft Corporation. All rights reserved. Copyright © Rogue Wave Software, Inc. All rights reserved. Copyright © Teradata Corporation. All rights reserved. Copyright © Yahoo! Inc. All rights reserved. Copyright © Glyph & Cog, LLC. All rights reserved. Copyright © Thinkmap, Inc. All rights reserved. Copyright © Clearpace Software Limited. All rights reserved. Copyright © Information Builders, Inc. All rights reserved. Copyright © OSS Nokalva, Inc. All rights reserved. Copyright Edifecs, Inc. All rights reserved. Copyright Cleo Communications, Inc. All rights reserved. Copyright © International Organization for Standardization 1986. All rights reserved. Copyright © ej-technologies GmbH. All rights reserved. Copyright © Jaspersoft Corporation. All rights reserved. Copyright © International Business Machines Corporation. All rights reserved. Copyright © yWorks GmbH. All rights reserved. Copyright © Lucent Technologies. All rights reserved. Copyright (c) University of Toronto. All rights reserved. Copyright © Daniel Veillard. All rights reserved. Copyright © Unicode, Inc. Copyright IBM Corp. All rights reserved. Copyright © MicroQuill Software Publishing, Inc. All rights reserved. Copyright © PassMark Software Pty Ltd. All rights reserved. Copyright © LogiXML, Inc. All rights reserved. Copyright © 2003-2010 Lorenzi Davide, All rights reserved. Copyright © Red Hat, Inc. All rights reserved. Copyright © The Board of Trustees of the Leland Stanford Junior University. All rights reserved. Copyright © EMC Corporation. All rights reserved. Copyright © Flexera Software. All rights reserved.

This product includes software developed by the Apache Software Foundation (http://www.apache.org/), and other software which is licensed under the Apache License, Version 2.0 (the "License"). You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

This product includes software which was developed by Mozilla (http://www.mozilla.org/), software copyright The JBoss Group, LLC, all rights reserved; software copyright © 1999-2006 by Bruno Lowagie and Paulo Soares and other software which is licensed under the GNU Lesser General Public License Agreement, which may be found at http://www.gnu.org/licenses/lgpl.html. The materials are provided free of charge by Informatica, "as-is", without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose.

The product includes ACE(TM) and TAO(TM) software copyrighted by Douglas C. Schmidt and his research group at Washington University, University of California, Irvine, and Vanderbilt University, Copyright (©) 1993-2006, all rights reserved.

This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit (copyright The OpenSSL Project. All Rights Reserved) and redistribution of this software is subject to terms available at http://www.openssl.org and http://www.openssl.org/source/license.html.

This product includes Curl software which is Copyright 1996-2007, Daniel Stenberg, <[email protected]>. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://curl.haxx.se/docs/copyright.html. Permission to use, copy, modify, and distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

The product includes software copyright 2001-2005 (©) MetaStuff, Ltd. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://www.dom4j.org/license.html.

The product includes software copyright © 2004-2007, The Dojo Foundation. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://dojotoolkit.org/license.

This product includes ICU software which is copyright International Business Machines Corporation and others. All rights reserved. Permissions and limitations regarding this software are subject to terms available at http://source.icu-project.org/repos/icu/icu/trunk/license.html.

This product includes software copyright © 1996-2006 Per Bothner. All rights reserved. Your right to use such materials is set forth in the license which may be found at http://www.gnu.org/software/kawa/Software-License.html.

This product includes OSSP UUID software which is Copyright © 2002 Ralf S. Engelschall, Copyright © 2002 The OSSP Project, Copyright © 2002 Cable & Wireless Deutschland. Permissions and limitations regarding this software are subject to terms available at http://www.opensource.org/licenses/mit-license.php.

This product includes software developed by Boost (http://www.boost.org/) or under the Boost software license. Permissions and limitations regarding this software are subject to terms available at http://www.boost.org/LICENSE_1_0.txt.

This product includes software copyright © 1997-2007 University of Cambridge. Permissions and limitations regarding this software are subject to terms available at http://www.pcre.org/license.txt.

This product includes software copyright © 2007 The Eclipse Foundation. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://www.eclipse.org/org/documents/epl-v10.php.

This product includes software licensed under the terms at http://www.tcl.tk/software/tcltk/license.html, http://www.bosrup.com/web/overlib/?License, http://www.stlport.org/doc/license.html, http://www.asm.ow2.org/license.html, http://www.cryptix.org/LICENSE.TXT, http://hsqldb.org/web/hsqlLicense.html, http://httpunit.sourceforge.net/doc/license.html, http://jung.sourceforge.net/license.txt, http://www.gzip.org/zlib/zlib_license.html, http://www.openldap.org/software/release/license.html, http://www.libssh2.org, http://slf4j.org/license.html, http://www.sente.ch/software/OpenSourceLicense.html, http://fusesource.com/downloads/license-agreements/fuse-message-broker-v-5-3-license-agreement, http://antlr.org/license.html, http://aopalliance.sourceforge.net/, http://www.bouncycastle.org/licence.html, http://www.jgraph.com/jgraphdownload.html, http://www.jcraft.com/jsch/LICENSE.txt, http://jotm.objectweb.org/bsd_license.html, http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231, http://www.slf4j.org/license.html, http://developer.apple.com/library/mac/#samplecode/HelpHook/Listings/HelpHook_java.html, http://nanoxml.sourceforge.net/orig/copyright.html, http://www.json.org/license.html, http://forge.ow2.org/projects/javaservice/, http://www.postgresql.org/about/licence.html, http://www.sqlite.org/copyright.html, http://www.tcl.tk/software/tcltk/license.html, http://www.jaxen.org/faq.html, http://www.jdom.org/docs/faq.html, http://www.slf4j.org/license.html, http://www.iodbc.org/dataspace/iodbc/wiki/iODBC/License, http://www.keplerproject.org/md5/license.html, http://www.toedter.com/en/jcalendar/license.html, http://www.edankert.com/bounce/index.html, http://www.net-snmp.org/about/license.html, http://www.openmdx.org/#FAQ, http://www.php.net/license/3_01.txt, http://srp.stanford.edu/license.txt, http://www.schneier.com/blowfish.html, http://www.jmock.org/license.html, http://xsom.java.net, and http://benalman.com/about/license/.

This product includes software licensed under the Academic Free License (http://www.opensource.org/licenses/afl-3.0.php), the Common Development and Distribution License (http://www.opensource.org/licenses/cddl1.php), the Common Public License (http://www.opensource.org/licenses/cpl1.0.php), the Sun Binary Code License Agreement Supplemental License Terms, the BSD License (http://www.opensource.org/licenses/bsd-license.php), the MIT License (http://www.opensource.org/licenses/mit-license.php) and the Artistic License (http://www.opensource.org/licenses/artistic-license-1.0).

This product includes software copyright © 2003-2006 Joe Walnes, 2006-2007 XStream Committers. All rights reserved. Permissions and limitations regarding this software are subject to terms available at http://xstream.codehaus.org/license.html. This product includes software developed by the Indiana University Extreme! Lab. For further information please visit http://www.extreme.indiana.edu/.

This product includes software developed by Andrew Kachites McCallum. "MALLET: A Machine Learning for Language Toolkit." http://mallet.cs.umass.edu (2002).

This Software is protected by U.S. Patent Numbers 5,794,246; 6,014,670; 6,016,501; 6,029,178; 6,032,158; 6,035,307; 6,044,374; 6,092,086; 6,208,990; 6,339,775; 6,640,226; 6,789,096; 6,820,077; 6,823,373; 6,850,947; 6,895,471; 7,117,215; 7,162,643; 7,243,110; 7,254,590; 7,281,001; 7,421,458; 7,496,588; 7,523,121; 7,584,422; 7,676,516; 7,720,842; 7,721,270; and 7,774,791, international Patents and other Patents Pending.

DISCLAIMER: Informatica Corporation provides this documentation "as is" without warranty of any kind, either express or implied, including, but not limited to, the implied warranties of noninfringement, merchantability, or use for a particular purpose. Informatica Corporation does not warrant that this software or documentation is error free. The information provided in this software or documentation may include technical inaccuracies or typographical errors. The information in this software and documentation is subject to change at any time without notice.

NOTICES

This Informatica product (the "Software") includes certain drivers (the "DataDirect Drivers") from DataDirect Technologies, an operating company of Progress Software Corporation ("DataDirect") which are subject to the following terms and conditions:

1. THE DATADIRECT DRIVERS ARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.

2. IN NO EVENT WILL DATADIRECT OR ITS THIRD PARTY SUPPLIERS BE LIABLE TO THE END-USER CUSTOMER FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, CONSEQUENTIAL OR OTHER DAMAGES ARISING OUT OF THE USE OF THE ODBC DRIVERS, WHETHER OR NOT INFORMED OF THE POSSIBILITIES OF DAMAGES IN ADVANCE. THESE LIMITATIONS APPLY TO ALL CAUSES OF ACTION, INCLUDING, WITHOUT LIMITATION, BREACH OF CONTRACT, BREACH OF WARRANTY, NEGLIGENCE, STRICT LIABILITY, MISREPRESENTATION AND OTHER TORTS.

Part Number: PC-BDE-95100-0000

Table of Contents

Preface
Informatica Resources
Informatica Customer Portal
Informatica Documentation
Informatica Web Site
Informatica How-To Library
Informatica Knowledge Base
Informatica Multimedia Knowledge Base
Informatica Global Customer Support

Chapter 1: Introduction to PowerCenter Big Data Edition
PowerCenter Big Data Edition Overview
Big Data Access
Data Replication
High-Performance Processing in the Native Environment
Native Environment Processing Architecture
High-Performance Processing in a Hive Environment
Hive Environment Processing Architecture
Big Data Processing Example

Chapter 2: Installation and Configuration
Installation and Configuration Overview
PowerCenter Big Data Edition Installation Process
Before You Begin
Install and Configure PowerCenter Standard Edition
Install and Configure PowerExchange Adapters
Install and Configure Data Replication
Pre-Installation Tasks for a Single Node Environment
Pre-Installation Tasks for a Cluster Environment
PowerCenter Big Data Edition Installation
Installing in a Single Node Environment
Installing in a Cluster Environment from the Primary NameNode Using SCP Protocol
Installing in a Cluster Environment from the Primary NameNode Using FTP, HTTP, or NFS Protocol
Installing in a Cluster Environment from any Machine
After You Install
Hadoop Environment Variable Properties
Hadoop Pushdown Properties for the Data Integration Service
HDFS Security Configuration
Set Up Address Validation
PowerCenter Big Data Edition Uninstallation
Uninstalling PowerCenter Big Data Edition

Chapter 3: Connections
Connections Overview
HDFS Connection Properties
Hive Connection Properties
Creating a Connection

Chapter 4: Mappings in the Native Environment
Mappings in the Native Environment Overview
Data Processor Mappings
HDFS Mappings
HDFS Mapping Example
Hive Mappings
Hive Mapping Example
Social Media Mappings
Twitter Mapping Example

Chapter 5: Mappings in a Hive Environment
Mappings in a Hive Environment Overview
Datatypes in a Hive Environment
Sources in a Hive Environment
Flat File Sources
Hive Sources
Relational Sources
Targets in a Hive Environment
Flat File Targets
HDFS Flat File Targets
Hive Targets
Relational Targets
Transformations in a Hive Environment
Functions in a Hive Environment
Variable Ports in a Hive Environment
Mappings in a Hive Environment
Workflows that Run Mappings in a Hive Environment
Configuring a Mapping to Run in a Hive Environment
Hive Execution Plan
Hive Execution Plan Details
Viewing the Hive Execution Plan for a Mapping
Monitoring a Mapping
Logs
Troubleshooting a Mapping in a Hive Environment

Chapter 6: Profiles
Profiles Overview
Native and Hadoop Environments
Supported Data Source and Run-time Environments
Run-time Environment Setup and Validation
Run-time Environment and Profile Performance
Profile Types on Hadoop
Column Profiles on Hadoop
Rule Profiles on Hadoop
Data Domain Discovery on Hadoop
Running a Single Data Object Profile on Hadoop
Running Multiple Data Object Profiles on Hadoop
Monitoring a Profile
Viewing Profile Results
Troubleshooting

Chapter 7: Native Environment Optimization
Native Environment Optimization Overview
Processing Big Data on a Grid
Data Integration Service Grid
PowerCenter Integration Service Grid
Grid Optimization
Processing Big Data on Partitions
Partition Optimization
High Availability

Appendix A: Datatype Reference
Datatype Reference Overview
Hive Complex Datatypes
Hive Datatypes and Transformation Datatypes

Appendix B: Glossary

Index


Preface

The Informatica for Hadoop User Guide provides information about how to configure Informatica products for Hadoop.

Informatica Resources

Informatica Customer Portal

As an Informatica customer, you can access the Informatica Customer Portal site at http://mysupport.informatica.com. The site contains product information, user group information, newsletters, access to the Informatica customer support case management system (ATLAS), the Informatica How-To Library, the Informatica Knowledge Base, the Informatica Multimedia Knowledge Base, Informatica Product Documentation, and access to the Informatica user community.

Informatica Documentation

The Informatica Documentation team takes every effort to create accurate, usable documentation. If you have questions, comments, or ideas about this documentation, contact the Informatica Documentation team through email at [email protected]. We will use your feedback to improve our documentation. Let us know if we can contact you regarding your comments.

The Documentation team updates documentation as needed. To get the latest documentation for your product, navigate to Product Documentation from http://mysupport.informatica.com.

Informatica Web Site

You can access the Informatica corporate web site at http://www.informatica.com. The site contains information about Informatica, its background, upcoming events, and sales offices. You will also find product and partner information. The services area of the site includes important information about technical support, training and education, and implementation services.

Informatica How-To Library

As an Informatica customer, you can access the Informatica How-To Library at http://mysupport.informatica.com. The How-To Library is a collection of resources to help you learn more about Informatica products and features. It includes articles and interactive demonstrations that provide solutions to common problems, compare features and behaviors, and guide you through performing specific real-world tasks.


Informatica Knowledge Base

As an Informatica customer, you can access the Informatica Knowledge Base at http://mysupport.informatica.com. Use the Knowledge Base to search for documented solutions to known technical issues about Informatica products. You can also find answers to frequently asked questions, technical white papers, and technical tips. If you have questions, comments, or ideas about the Knowledge Base, contact the Informatica Knowledge Base team through email at [email protected].

Informatica Multimedia Knowledge Base

As an Informatica customer, you can access the Informatica Multimedia Knowledge Base at http://mysupport.informatica.com. The Multimedia Knowledge Base is a collection of instructional multimedia files that help you learn about common concepts and guide you through performing specific tasks. If you have questions, comments, or ideas about the Multimedia Knowledge Base, contact the Informatica Knowledge Base team through email at [email protected].

Informatica Global Customer Support

You can contact a Customer Support Center by telephone or through the Online Support. Online Support requires a user name and password. You can request a user name and password at http://mysupport.informatica.com.

Use the following telephone numbers to contact Informatica Global Customer Support:

North America / South America

Toll Free
Brazil: 0800 891 0202
Mexico: 001 888 209 8853
North America: +1 877 463 2435

Europe / Middle East / Africa

Toll Free
France: 0805 804632
Germany: 0800 5891281
Italy: 800 915 985
Netherlands: 0800 2300001
Portugal: 800 208 360
Spain: 900 813 166
Switzerland: 0800 463 200
United Kingdom: 0800 023 4632

Standard Rate
Belgium: +31 30 6022 797
France: +33 1 4138 9226
Germany: +49 1805 702 702
Netherlands: +31 306 022 797
United Kingdom: +44 1628 511445

Asia / Australia

Toll Free
Australia: 1 800 151 830
New Zealand: 09 9 128 901

Standard Rate
India: +91 80 4112 5738







CHAPTER 1

Introduction to PowerCenter Big Data Edition

This chapter includes the following topics:

- PowerCenter Big Data Edition Overview
- Big Data Access
- Data Replication
- High-Performance Processing in the Native Environment
- High-Performance Processing in a Hive Environment
- Big Data Processing Example

PowerCenter Big Data Edition Overview

PowerCenter Big Data Edition includes functionality from the following Informatica products: PowerCenter, Data Explorer, Data Quality, Data Replication, Data Transformation, PowerExchange for Hive, PowerExchange for HDFS, PowerExchange for Hadoop, and social media adapters.

In addition to basic functionality associated with the Informatica products, you can use the following functionality associated with big data:

Access big data sources

Access unstructured and semi-structured data, social media data, and data in Hive and HDFS.

Replicate data

Replicate large amounts of transactional data between heterogeneous databases and platforms.

Configure high-performance processing in the native environment

Distribute mapping, session, and workflow processing across nodes in a grid, enable partitioning to process partitions of data in parallel, and process data through highly available application services in the domain.

Configure high-performance processing in a Hive environment

Distribute mapping and profile processing across cluster nodes in a Hive environment.

You can process data in the native environment or a Hive environment. In the native environment, an Integration Service processes the data. You can run Model repository mappings and profiles on the Data Integration Service. You can run PowerCenter sessions and workflows on a PowerCenter Integration Service. In a Hive environment, nodes in a Hadoop cluster process the data.

Big Data Access

In addition to relational and flat file data, you can access unstructured and semi-structured data, social media data, and data in a Hive or Hadoop Distributed File System (HDFS) environment.

You can access the following types of data:

Transaction data

You can access different types of transaction data, including data from relational database management systems, online transaction processing systems, online analytical processing systems, enterprise resource planning systems, customer relationship management systems, mainframe, and cloud.

Unstructured and semi-structured data

You can use parser transformations to read and transform unstructured and semi-structured data. For example, you can use the Data Processor transformation in a workflow to parse a Microsoft Word file and load customer and order data into relational database tables.

You can use HParser to transform complex data into flattened, usable formats for Hive, Pig, and MapReduce processing. HParser processes complex files, such as messaging formats, HTML pages, and PDF documents. HParser also transforms formats such as ACORD, HIPAA, HL7, EDI-X12, EDIFACT, AFP, and SWIFT.

Social media data

You can use PowerExchange adapters for social media to read data from social media web sites such as Facebook, Twitter, and LinkedIn. You can also use PowerExchange for DataSift to extract real-time data from different social media web sites and capture data from DataSift regarding sentiment and language analysis. You can use PowerExchange for Web Content-Kapow to extract data from any web site.

Data in Hive and HDFS

You can use other PowerExchange adapters to read data from or write data to Hadoop. For example, you can use PowerExchange for Hive to read data from or write data to Hive. Also, you can use PowerExchange for HDFS to extract data from and load data to HDFS.

Data Replication

You can replicate large amounts of transactional data between heterogeneous databases and platforms with Data Replication. You might replicate data to distribute or migrate the data across your environment.

With Data Replication, you can perform the following types of data replication:

Low-latency data replication

You can perform low-latency batched replication to replicate data on an interval. You can also perform continuous replication to replicate data in near real time.

For example, you can use continuous replication to send transactional changes to a staging database or operational data store. You can then use PowerCenter to extract data from Data Replication target tables and transform the data before loading it to an active enterprise data warehouse.

Data replication for Hadoop processing

You can extract transactional changes into text files. You can then use PowerCenter to move the text files to Hadoop to be processed.

High-Performance Processing in the Native Environment

You can optimize the native environment to process big data quickly and reliably. You can run an Integration Service on a grid to distribute the processing across nodes in the grid. You can process partitions of a session in parallel. You can also enable high availability.

You can enable the following features to optimize the native environment:

PowerCenter Integration Service on grid

You can run PowerCenter sessions and workflows on a grid. The grid is an alias assigned to a group of nodes that run PowerCenter sessions and workflows. When you run a session or workflow on a grid, the PowerCenter Integration Service distributes the processing across multiple nodes in the grid.

Data Integration Service on grid

You can run Model repository mappings and profiles on a grid. The grid is an alias assigned to a group of nodes that run mappings and profiles assigned to the Data Integration Service. When you run a mapping or profile on a grid, the Data Integration Service distributes the processing across multiple nodes in the grid.

Partitioning

You can create partitions in a PowerCenter session to increase performance. When you run a partitioned session, the PowerCenter Integration Service performs the extract, transformation, and load for each partition in parallel.

High availability

You can enable high availability to eliminate single points of failure for PowerCenter application services. PowerCenter application services can continue running despite temporary network or hardware failures.

For example, if you run the PowerCenter Integration Service on a grid and one of the nodes becomes unavailable, the PowerCenter Integration Service recovers the tasks and runs them on a different node. If you run the PowerCenter Integration Service on a single node and you enable high availability, you can configure backup nodes in case the primary node becomes unavailable.

Native Environment Processing Architecture

You can run sessions, profiles, and workflows on an Integration Service grid. You can run PowerCenter sessions and workflows on a PowerCenter Integration Service grid. You can run Model repository profiles and workflows on a Data Integration Service grid.

The following diagram shows the service process distribution when you run a PowerCenter workflow on a PowerCenter Integration Service grid with three nodes:

When you run the workflow on a grid, the PowerCenter Integration Service process distributes the tasks in the following way:

• On Node 1, the master service process starts the workflow and runs workflow tasks other than the Session, Command, and predefined Event-Wait tasks. The Load Balancer dispatches the Session, Command, and predefined Event-Wait tasks to other nodes.

• On Node 2, the worker service process starts a process to run a Command task and starts a DTM process to run Session task 1.

• On Node 3, the worker service process runs a predefined Event-Wait task and starts a DTM process to run Session task 2.

If the master service process becomes unavailable while running a workflow, the PowerCenter Integration Service can recover the workflow based on the workflow state and recovery strategy. If the workflow was enabled for high availability recovery, the PowerCenter Integration Service restores the state of operation for the workflow and recovers the workflow from the point of interruption.

If a worker service process becomes unavailable while running tasks of a workflow, the master service process can recover tasks based on task state and recovery strategy.

High-Performance Processing in a Hive Environment

You can run Model repository mappings and profiles in a Hive environment to process large amounts of data of 10 terabytes or more. In the Hive environment, the Data Integration Service converts the mapping or profile into MapReduce programs to enable the Hadoop cluster to process the data.

Hive Environment Processing Architecture

You can run Model repository mappings or profiles in a Hive environment.

To run a mapping or profile in a Hive environment, the Data Integration Service creates HiveQL queries based on the transformation or profiling logic. The Data Integration Service submits the HiveQL queries to the Hive driver. The Hive driver converts the HiveQL queries to MapReduce jobs, and then sends the jobs to the Hadoop cluster.

The following diagram shows the architecture of how a Hadoop cluster processes MapReduce jobs sent from the Hive driver:


The following events occur when the Hive driver sends MapReduce jobs to the Hadoop cluster:

1. The Hive driver sends the MapReduce jobs to the JobTracker in the Hive environment.

2. The JobTracker retrieves a list of TaskTracker nodes that can process the MapReduce jobs from the NameNode.

3. The JobTracker assigns MapReduce jobs to TaskTracker nodes.

4. The Hive driver also connects to the Hive metadata database through the Hive metastore to determine where to create temporary tables. The Hive driver uses temporary tables to process the data. The Hive driver removes temporary tables after completing the task.

Big Data Processing Example

Every week, an investment banking organization manually calculates the popularity and risk of stocks, and then matches stocks to each customer based on the preferences of the customer. However, the organization now wants you to automate this process.

You use the Developer tool to create a workflow that calculates the popularity and risk of each stock, matches stocks to each customer, and then sends an email with a list of stock recommendations for all customers. To determine the popularity of a stock, you count the number of times that the stock was included in Twitter feeds and the number of times customers inquired about the stock on the company stock trade web site.

The following diagram shows the components of the workflow:

You configure the workflow to complete the following tasks:

1. Extract and count the number of inquiries about stocks from weblogs.

Extracts the inquiries about each stock from the weblogs, and then counts the number of inquiries about each stock. The weblogs are from the company stock trade web site.

2. Extract and count the number of tweets for each stock from Twitter.

Extracts tweets from Twitter, and then counts the number of tweets about each stock.

3. Extract market data and calculate the risk of each stock based on market data.

Extracts the daily high stock value, daily low stock value, and volatility of each stock from a flat file provided by a third-party vendor. The workflow calculates the risk of each stock based on the extracted market data.

4. Combine the inquiry count, tweet count, and risk for each stock.

Combines the inquiry count, tweet count, and risk for each stock from the weblogs, Twitter, and market data, respectively.

5. Extract historical stock transactions for each customer.

Extracts historical stock purchases of each customer from a database.

6. Calculate the average risk and average popularity of the stocks purchased by each customer.

Calculates the average risk and average popularity of all stocks purchased by each customer.

7. Match stocks to each customer based on their preferences.

Matches stocks that have the same popularity and risk as the average popularity and average risk of the stocks that the customer previously purchased.

8. Load stock recommendations into the data warehouse.

Loads the stock recommendations into the data warehouse to retain a history of the recommendations.

9. Send an email with stock recommendations.

Consolidates the stock recommendations for all customers, and sends an email with the list of recommendations.

After you create the workflow, you configure it to run in a Hive environment because the workflow must process 15 terabytes of data each time it creates recommendations for customers.


CHAPTER 2

Installation and Configuration

This chapter includes the following topics:

• Installation and Configuration Overview

• Before You Begin

• PowerCenter Big Data Edition Installation

• After You Install

• PowerCenter Big Data Edition Uninstallation

Installation and Configuration Overview

The PowerCenter Big Data Edition installation is distributed as a Red Hat Package Manager (RPM) installation package.

The RPM package includes the Informatica 9.5.1 engine and adapter components. The RPM package and the binary files needed to run the PowerCenter Big Data Edition installation are compressed into a tar.gz file.

PowerCenter Big Data Edition Installation Process

You can install PowerCenter Big Data Edition in a single node or cluster environment.

Installing in a Single Node Environment

You can install PowerCenter Big Data Edition in a single node environment.

1. Extract the PowerCenter Big Data Edition tar.gz file to the machine.

2. Install PowerCenter Big Data Edition by running the installation shell script in a Linux environment.

Installing in a Cluster Environment

You can install PowerCenter Big Data Edition in a cluster environment.

1. Extract the PowerCenter Big Data Edition tar.gz file to a machine.

2. Distribute the RPM package to all of the nodes within the Hadoop cluster. You can distribute the RPM package using any of the following protocols: File Transfer Protocol (FTP), Hypertext Transfer Protocol (HTTP), Network File System (NFS), or Secure Copy (SCP).

3. Install PowerCenter Big Data Edition by running the installation shell script in a Linux environment. You can install PowerCenter Big Data Edition from the primary NameNode or from any machine using the HadoopDataNodes file.

• Install from the primary NameNode. You can install PowerCenter Big Data Edition using the FTP, HTTP, NFS, or SCP protocol. During the installation, the installer shell script picks up all of the DataNodes from the $HADOOP_HOME/conf/slaves file and copies the PowerCenter Big Data Edition binary files to the /<PowerCenterBigDataEditionInstallationDirectory>/Informatica directory on each of the DataNodes. You can perform this step only if you are deploying Hadoop from the primary NameNode.

• Install from any machine. Add the IP addresses or machine host names, one per line, for each of the nodes in the Hadoop cluster to the HadoopDataNodes file. During the PowerCenter Big Data Edition installation, the installation shell script picks up all of the nodes from the HadoopDataNodes file and copies the PowerCenter Big Data Edition binary files to the /<PowerCenterBigDataEditionInstallationDirectory>/Informatica directory on each of the nodes.
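As an illustration of the HadoopDataNodes file format described above, the following sketch writes a sample file with one entry per line and then lists the entries. The addresses and host name are placeholders chosen for this example, the scratch directory is an assumption, and the list_nodes helper is hypothetical and not part of the installer.

```shell
# Scratch directory for the example only; the installer reads the real
# HadoopDataNodes file from its own location.
work_dir=$(mktemp -d)

# Sample HadoopDataNodes file: one IP address or host name per line.
# These entries are placeholders, not real cluster nodes.
cat > "$work_dir/HadoopDataNodes" <<'EOF'
192.0.2.11
192.0.2.12
datanode03.example.com
EOF

# Hypothetical helper: print each non-blank entry of a node list file.
list_nodes() {
  grep -v '^[[:space:]]*$' "$1"
}

list_nodes "$work_dir/HadoopDataNodes"
```

Listing the entries before you start the installation is a quick way to confirm that the file contains exactly the nodes you intend to install on.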

Before You Begin

Before you begin the PowerCenter Big Data Edition installation, install the PowerCenter components and PowerExchange adapters, and perform the pre-installation tasks.

Install and Configure PowerCenter Standard Edition

Before you install PowerCenter Big Data Edition, install and configure Informatica 9.5.1 PowerCenter Standard Edition.

The PowerCenter Standard Edition installation consists of a server component and a client component.

Informatica Services

Run the Informatica services installation to configure the PowerCenter domain and create the Informatica services.

Informatica Clients

Run the Informatica client installation to create the PowerCenter Client.

For information, see the Informatica PowerCenter Installation and Configuration Guide.

Install and Configure PowerExchange Adapters

Based on your business needs, install and configure PowerExchange adapters.

Use PowerCenter Big Data Edition with PowerCenter and Informatica adapters for access to sources and targets.

You must install and configure PowerExchange for Hive to run Informatica mappings in a Hive environment. For information, see the Informatica PowerExchange for Hive User Guide.

PowerCenter Adapters

Use PowerCenter adapters, such as PowerExchange for Hadoop, to define sources and targets in PowerCenter mappings.

For more information about installing and configuring PowerCenter adapters, see the PowerExchange adapter documentation.

Informatica Adapters

You can use the following Informatica adapters as part of PowerCenter Big Data Edition:

• PowerExchange for DataSift

• PowerExchange for Facebook

• PowerExchange for HDFS

• PowerExchange for Hive

• PowerExchange for LinkedIn

• PowerExchange for Teradata Parallel Transporter API

• PowerExchange for Twitter

• PowerExchange for Web Content-Kapow Katalyst

For more information, see the PowerExchange adapter documentation.

Install and Configure Data Replication

Before you install PowerCenter Big Data Edition, install and configure Data Replication.

To migrate data with minimal downtime and perform auditing and operational reporting functions, install and configure Data Replication. For information, see the Informatica Data Replication User Guide.

Pre-Installation Tasks for a Single Node Environment

Before you begin the PowerCenter Big Data Edition installation in a single node environment, complete the following pre-installation tasks.

• Verify that Hadoop is installed with Hadoop Distributed File System (HDFS) and MapReduce. Informatica supports the Cloudera (CDH Version 3 Update 4) and Apache (Hadoop 1.0.3) Hadoop distributions. Refer to http://hadoop.apache.org/ for more information. The Hadoop installation should include a Hive data warehouse that is configured to use a MySQL database as the MetaStore. You can configure Hive to use a local or remote MetaStore server.

Note: Informatica does not support embedded MetaStore server setups.

• Install the required third-party client software to perform both read and write operations in native mode. For example, install the Oracle client to connect to the Oracle database.

• Verify that the PowerCenter Big Data Edition administrator user can run sudo commands or has root user privileges.

• Verify that the temporary folder on the local node has at least 700 MB of disk space.

• Download the following file to the temporary folder: InformaticaHadoop-<InformaticaForHadoopVersion>.tar.gz

• Extract the InformaticaHadoop-<InformaticaForHadoopVersion>.tar.gz file to the local node where you want to run the PowerCenter Big Data Edition installation.
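The disk space check in the list above can be scripted. The following is a minimal sketch, assuming that the temporary folder is the one named by TMPDIR (or /tmp when TMPDIR is unset); the has_free_mb helper is hypothetical and is not part of the Informatica installer.

```shell
# Hypothetical helper: succeed if the file system that holds directory $1
# has at least $2 MB available. "df -Pk" prints sizes in 1024-byte blocks
# in the portable POSIX format; column 4 of the data row is the available space.
has_free_mb() {
  avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $(( $2 * 1024 )) ]
}

# Assumption: the temporary folder is $TMPDIR, or /tmp if TMPDIR is unset.
tmp_dir=${TMPDIR:-/tmp}
if has_free_mb "$tmp_dir" 700; then
  echo "$tmp_dir has at least 700 MB available"
else
  echo "$tmp_dir has less than 700 MB available" >&2
fi
```

The check reports the space available to the current user on the file system that contains the folder, which is what matters when the package is downloaded and extracted there.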

Pre-Installation Tasks for a Cluster Environment

Before you begin the PowerCenter Big Data Edition installation in a cluster environment, complete the following pre-installation tasks.

• Verify that Hadoop is installed with Hadoop Distributed File System (HDFS) and MapReduce on every node within the cluster. Informatica supports the Cloudera (CDH Version 3 Update 4) and Apache (Hadoop 1.0.3) Hadoop distributions. Refer to http://hadoop.apache.org/ for more information. The Hadoop installation should include a Hive data warehouse that is configured to use a MySQL database as the MetaStore. You can configure Hive to use a local or remote MetaStore server.

Note: Informatica does not support embedded MetaStore server setups.

• Install the required third-party client software to perform both read and write operations in native mode. For example, install the Oracle client to connect to the Oracle database. Install the third-party client software on all of the nodes within the Hadoop cluster. Informatica requires the client software on every node to run MapReduce jobs.

• Verify that the PowerCenter Big Data Edition administrator user can run sudo commands or has root user privileges.

• Verify that the RPM package can be distributed by File Transfer Protocol (FTP), Hypertext Transfer Protocol (HTTP), Network File System (NFS), or Secure Copy (SCP) to all of the nodes that are to be included in the cluster.

• If you are installing PowerCenter Big Data Edition in a cluster environment that uses the FTP protocol, verify that the FTP service is running.

• If you are installing PowerCenter Big Data Edition in a cluster environment that uses the HTTP protocol, verify that the web server is running.

• If you are installing PowerCenter Big Data Edition in a cluster environment that uses the SCP protocol, verify that the SCP service is running.

• If you are installing PowerCenter Big Data Edition in a cluster environment, set up a password-less Secure Shell (SSH) connection between the machine where you want to run the PowerCenter Big Data Edition installation and all of the nodes on which PowerCenter Big Data Edition will be installed.

• Verify that the temporary folder on each of the nodes on which PowerCenter Big Data Edition will be installed has at least 700 MB of disk space.

• Download the following file to a temporary folder: InformaticaHadoop-<InformaticaForHadoopVersion>.tar.gz

• Copy the following package to a shared directory: InformaticaHadoop-<InformaticaForHadoopVersion>.rpm

For example,

- For HTTP protocol: /var/www/html

- For FTP protocol: /var/ftp/pub

- For NFS: <Shared location on the node. The file location must be accessible by all the nodes in the cluster.>

Note: The RPM package must be stored on local disk and not on HDFS.

• Extract the InformaticaHadoop-<InformaticaForHadoopVersion>.tar.gz file to the machine from where you want to distribute the RPM package and run the PowerCenter Big Data Edition installation.

• In the config file on the machine where you want to run the PowerCenter Big Data Edition installation, set DISTRIBUTOR_NODE to one of the following values:

- For FTP protocol, set DISTRIBUTOR_NODE=ftp://<Distributor Node IP Address>/pub

- For HTTP protocol, set DISTRIBUTOR_NODE=http://<Distributor Node IP Address>

- For NFS protocol, set DISTRIBUTOR_NODE=<Shared file location on the node. The file location must be accessible by all the nodes in the cluster.>
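To make the three DISTRIBUTOR_NODE forms above concrete, the following sketch prints the line for each protocol. The distributor_node_line helper is hypothetical, and the IP address and shared path are placeholder values chosen for this example.

```shell
# Hypothetical helper: print the DISTRIBUTOR_NODE line for a protocol.
#   $1 = ftp | http | nfs
#   $2 = distributor node IP address (ftp, http) or shared file location (nfs)
distributor_node_line() {
  case $1 in
    ftp)  echo "DISTRIBUTOR_NODE=ftp://$2/pub" ;;
    http) echo "DISTRIBUTOR_NODE=http://$2" ;;
    nfs)  echo "DISTRIBUTOR_NODE=$2" ;;
    *)    echo "unsupported protocol: $1" >&2; return 1 ;;
  esac
}

# Placeholder values for illustration only.
distributor_node_line ftp 192.0.2.10    # DISTRIBUTOR_NODE=ftp://192.0.2.10/pub
distributor_node_line http 192.0.2.10   # DISTRIBUTOR_NODE=http://192.0.2.10
distributor_node_line nfs /mnt/shared/informatica
```

Only one of these lines belongs in the config file, matching the protocol that the nodes use to reach the RPM package.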


PowerCenter Big Data Edition Installation

You can install PowerCenter Big Data Edition in a single node environment. You can also install PowerCenter Big Data Edition in a cluster environment from the primary NameNode or from any machine.

Install PowerCenter Big Data Edition in a single node environment or cluster environment in one of the following ways:

• Install PowerCenter Big Data Edition in a single node environment.

• Install PowerCenter Big Data Edition in a cluster environment from the primary NameNode using SCP protocol.

• Install PowerCenter Big Data Edition in a cluster environment from the primary NameNode using FTP, HTTP, or NFS protocol.

• Install PowerCenter Big Data Edition in a cluster environment from any machine.

Install PowerCenter Big Data Edition from a shell command line.

Installing in a Single Node Environment

You can install PowerCenter Big Data Edition in a single node environment.

1. Log in to the machine.

2. Run the following command from the PowerCenter Big Data Edition root directory to start the installation in console mode:

bash InformaticaHadoopInstall.sh

3. Press y to accept the PowerCenter Big Data Edition terms of agreement.

4. Press Enter.

5. Press 1 to install PowerCenter Big Data Edition in a single node environment.

6. Press Enter.

7. Type the absolute path for the PowerCenter Big Data Edition installation directory.

Start the path with a slash. The directory names in the path must not contain spaces or the following special characters: { } ! @ # $ % ^ & * ( ) : ; | ' ` < > , ? + [ ] \

If you type a directory path that does not exist, the installer creates the entire directory path on each of the nodes during the installation. Default is /opt.

8. Press Enter.

The installer creates the /<PowerCenterBigDataEditionInstallationDirectory>/Informatica directory and populates all of the file systems with the contents of the RPM package.

You can view the informatica-hadoop-install.<DateTimeStamp>.log installation log file to get more information about the tasks performed by the installer.
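The path rules for the installation directory can be checked before you start the installer. The following is a minimal sketch of such a check; the valid_install_dir helper and the sample path are hypothetical, and the installer may apply additional rules that are not covered here.

```shell
# Hypothetical helper: succeed if $1 is an absolute path that contains no
# spaces and none of the special characters listed in the guide.
valid_install_dir() {
  # The path must start with a slash.
  case $1 in
    /*) ;;
    *) return 1 ;;
  esac
  # Quoted characters in a case pattern are matched literally.
  case $1 in
    *' '*) return 1 ;;
    *'{'*|*'}'*|*'!'*|*'@'*|*'#'*|*'$'*|*'%'*|*'^'*|*'&'*|*'*'*) return 1 ;;
    *'('*|*')'*|*':'*|*';'*|*'|'*|*"'"*|*'`'*|*'<'*|*'>'*) return 1 ;;
    *','*|*'?'*|*'+'*|*'['*|*']'*|*'\'*) return 1 ;;
  esac
  return 0
}

# Sample path for illustration only.
valid_install_dir /opt/informatica_bde && echo "path accepted"
```

A path such as "/opt/my dir" or "opt/informatica" fails the check, the first because it contains a space and the second because it does not start with a slash.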

Installing in a Cluster Environment from the Primary NameNode Using SCP Protocol

You can install PowerCenter Big Data Edition in a cluster environment from the primary NameNode using SCP protocol.

1. Log in to the primary NameNode.

2. Run the following command to start the PowerCenter Big Data Edition installation in console mode:

bash InformaticaHadoopInstall.sh

3. Press y to accept the PowerCenter Big Data Edition terms of agreement.

4. Press Enter.

5. Press 2 to install PowerCenter Big Data Edition in a cluster environment.

6. Press Enter.

7. Type the absolute path for the PowerCenter Big Data Edition installation directory.



8. Press Enter.

9. Press 1 to install PowerCenter Big Data Edition from the primary NameNode.

10. Press Enter.

11. Type the absolute path for the Hadoop installation directory. Start the path with a slash.

12. Press Enter.

13. Type y.

14. Press Enter.

The installer retrieves a list of DataNodes from the $HADOOP_HOME/conf/slaves file. On each of the DataNodes, the installer creates the /<PowerCenterBigDataEditionInstallationDirectory>/Informatica directory and populates all of the file systems with the contents of the RPM package.


Installing in a Cluster Environment from the Primary NameNode Using FTP, HTTP, or NFS Protocol

You can install PowerCenter Big Data Edition in a cluster environment from the primary NameNode using FTP, HTTP, or NFS protocol.

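1. Log in to the primary NameNode.

2. Run the following command to start the PowerCenter Big Data Edition installation in console mode:

bash InformaticaHadoopInstall.sh

3. Press y to accept the PowerCenter Big Data Edition terms of agreement.

4. Press Enter.

5. Press 2 to install PowerCenter Big Data Edition in a cluster environment.

6. Press Enter.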

7. Type the absolute path for the PowerCenter Big Data Edition installation directory.



8. Press Enter.

9. Press 1 to install PowerCenter Big Data Edition from the primary NameNode.

10. Press Enter.

11. Type the absolute path for the Hadoop installation directory. Start the path with a slash.

12. Press Enter.


13. Type n.

14. Press Enter.

15. Type y.

16. Press Enter.

The installer retrieves a list of DataNodes from the $HADOOP_HOME/conf/slaves file. On each of the DataNodes, the installer creates the /<PowerCenterBigDataEditionInstallationDirectory>/Informatica directory and populates all of the file systems with the contents of the RPM package.


Installing in a Cluster Environment from any Machine

You can install PowerCenter Big Data Edition in a cluster environment from any machine.

1. Verify that the PowerCenter Big Data Edition administrator has root user privileges on the node that will be running the PowerCenter Big Data Edition installation.

2. Log in to the machine as the root user.

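3. In the HadoopDataNodes file on the node from where you want to launch the PowerCenter Big Data Edition installation, add the IP addresses or machine host names, one per line, of the nodes in the Hadoop cluster on which you want to install PowerCenter Big Data Edition.

4. Run the following command to start the PowerCenter Big Data Edition installation in console mode:

bash InformaticaHadoopInstall.sh

5. Press y to accept the PowerCenter Big Data Edition terms of agreement.

6. Press Enter.

7. Press 2 to install PowerCenter Big Data Edition in a cluster environment.

8. Press Enter.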

9. Type the absolute path for the PowerCenter Big Data Edition installation directory. Start the path with a slash. Default is /opt.

10. Press Enter.

11. Press 2 to install PowerCenter Big Data Edition using the HadoopDataNodes file.

12. Press Enter.

The installer creates the /<PowerCenterBigDataEditionInstallationDirectory>/Informatica directory and populates all of the file systems with the contents of the RPM package on the first node that appears in the HadoopDataNodes file. The installer repeats the process for each node in the HadoopDataNodes file.

After You Install

After you install PowerCenter Big Data Edition, perform the post-installation tasks to ensure that PowerCenter Big Data Edition runs properly.

Complete the following tasks:

• Configure the PowerCenter Big Data Edition environment variable properties file.

• Configure the Data Integration Service pushdown properties for Hadoop.

• Install the Address Validation reference data.

Hadoop Environment Variable Properties

After you install PowerCenter Big Data Edition, configure the hadoopEnv.properties file to meet the PowerCenter Big Data Edition requirements.

Configure the hadoopEnv.properties file with the Informatica, locale, and library path environment variables you want to include in the PowerCenter Big Data Edition environment.

1. Go to the following location: <InformaticaInstallationDir>/services/shared/hadoop/conf

2. Find the file named hadoopEnv.properties.

Back up the file before you modify it.

3. Use a text editor to open the file and modify the properties.

4. Save the properties file with the name hadoopEnv.properties.
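The backup step in the procedure above can be scripted. The following is a minimal sketch; the backup_hadoop_env helper and the timestamped ".bak" naming are assumptions made for this example, not a convention that the product defines.

```shell
# Hypothetical helper: copy hadoopEnv.properties in directory $1 to a
# timestamped backup in the same directory, and print the backup path.
backup_hadoop_env() {
  src="$1/hadoopEnv.properties"
  backup="$src.$(date +%Y%m%d%H%M%S).bak"
  # -p preserves the modification time and permissions of the original.
  cp -p "$src" "$backup" && echo "$backup"
}

# Usage: replace the placeholder from the guide with your installation
# directory before you run the command.
# backup_hadoop_env "<InformaticaInstallationDir>/services/shared/hadoop/conf"
```

Keeping the backup next to the original makes it easy to compare the two files or restore the original if an edited property causes problems.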

Hadoop Pushdown Properties for the Data Integration Service

You must configure Hadoop pushdown properties for the Data Integration Service to run mappings or profiles in a Hive environment.

You can configure Hadoop pushdown properties for the Data Integration Service from the Administrator tool.

The following table describes the Hadoop pushdown properties for the Data Integration Service:

Property / Description

Informatica Home Directory on Hadoop

The PowerCenter Big Data Edition home directory on every data node created by the Hadoop RPM install. Type /<PowerCenterBigDataEditionInstallationDirectory>/Informatica.

Hadoop Distribution Directory

The directory containing a collection of Hive and Hadoop JARS on the data nodes from the Hive and Hadoop install locations. The directory contains the minimum set of JARS required to process Informatica mappings in a Hadoop environment. Type /<PowerCenterBigDataEditionInstallationDirectory>/Informatica/services/shared/hadoop/cdh3u4.

You can modify the Hadoop distribution directory on the data nodes and set this path from the Administrator tool. To specify a different Hadoop distribution directory:

1. Use the JARS from compatible Hive and Hadoop install locations.

2. Create a Hadoop distribution directory in the following directory path: /<PowerCenterBigDataEditionInstallationDirectory>/Informatica/services/shared/hadoop/[Hadoop_distribution_name] or another location.

3. Copy the required Hive JARS from the Hive install location at /usr/lib/hive/lib to the following directory: /<PowerCenterBigDataEditionInstallationDirectory>/Informatica/services/shared/hadoop/[Hadoop_distribution_name].

4. Copy the required Hadoop JARS from the Hadoop install location at /usr/lib/hadoop/lib to the following directory: /<PowerCenterBigDataEditionInstallationDirectory>/Informatica/services/shared/hadoop/[Hadoop_distribution_name].

5. If you are using the Cloudera distribution, copy the required Snappy libraries from /usr/lib/hadoop/lib/native to the following directory: /<PowerCenterBigDataEditionInstallationDirectory>/Informatica/services/shared/hadoop/[Hadoop_distribution_name]/lib/native.

Data Integration Service Hadoop Distribution Directory

The Hadoop distribution directory on the Data Integration Service node. The contents of the Data Integration Service Hadoop distribution directory must be identical to the Hadoop distribution directory on the data nodes.

Hadoop Distribution Directory

You can modify the Hadoop distribution directory on the data nodes.

When you modify the Hadoop distribution directory, you must copy the minimum set of Hive and Hadoop JARS, and the Snappy libraries required to process Informatica mappings in a Hive environment, from your Hadoop install location. The actual Hive and Hadoop JARS can vary depending on the Hadoop distribution version.
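Copying a named set of JARS into a custom distribution directory can be sketched as follows. The copy_required_jars helper is hypothetical, and the JAR names in the usage comment are examples only; the exact set depends on your Hadoop distribution version.

```shell
# Hypothetical helper: copy named JAR files from a source directory into a
# destination directory, creating the destination if it does not exist.
#   $1 = source directory, $2 = destination directory, remaining args = JAR names
copy_required_jars() {
  src_dir=$1
  dest_dir=$2
  shift 2
  mkdir -p "$dest_dir" || return 1
  for jar in "$@"; do
    # Stop at the first missing JAR so that an incomplete copy is noticed.
    cp "$src_dir/$jar" "$dest_dir/" || return 1
  done
}

# Usage (the destination path is a placeholder; the JAR names are examples):
# copy_required_jars /usr/lib/hive/lib "<HadoopDistributionDirectory>" \
#   hive-exec-0.7.1-cdh3u4.jar hive-metastore-0.7.1-cdh3u4.jar
```

Passing the JAR names explicitly, rather than copying every file in the source directory, keeps the distribution directory limited to the minimum set that the guide calls for.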


The following table lists the contents of the Hadoop distribution directory that are installed with the Hadoop RPM for Cloudera in the /<PowerCenterBigDataEditionInstallationDirectory>/Informatica/services/shared/hadoop/[Hadoop_distribution_name] directory:

Directory Cloudera Distribution Files

conf hive-default.xml

lib The directory must contain the following JARs from the Hive installation:
- hive-cli-0.7.1-cdh3u4.jar
- hive-exec-0.7.1-cdh3u4.jar
- hive-jdbc-0.7.1-cdh3u4.jar
- hive-metastore-0.7.1-cdh3u4.jar
- hive-serde-0.7.1-cdh3u4.jar
- hive-service-0.7.1-cdh3u4.jar
- hive-shims-0.7.1-cdh3u4.jar
- ant-contrib-1.0b3.jar
- antlr-runtime-3.0.1.jar
- asm-3.1.jar
- commons-cli-1.2.jar
- commons-codec-1.3.jar
- commons-collections-3.2.1.jar
- commons-dbcp-1.4.jar
- commons-lang-2.4.jar
- commons-logging-1.0.4.jar
- commons-logging-api-1.0.4.jar
- commons-pool-1.5.4.jar
- datanucleus-connectionpool-2.0.3.jar
- datanucleus-core-2.0.3.jar
- datanucleus-enhancer-2.0.3.jar
- datanucleus-rdbms-2.0.3.jar
- derby.jar
- jackson-core-asl-1.7.3.jar
- jackson-mapper-asl-1.7.3.jar
- jdo2-api-2.3-ec.jar
- jline-0.9.94.jar
- json.jar
- libthrift.jar
- slf4j-api-1.6.1.jar
- slf4j-log4j12-1.6.1.jar
- thrift-fb303-0.5.0.jar
The directory must contain the following JARs from the Hadoop installation:
- hadoop-core-0.20.2-cdh3u4.jar
- guava-r09-jarjar.jar

lib/native The directory Linux-amd64-64 or Linux-i386-32 must contain the following libraries:
- libhadoop.a
- libhadoop.la
- libhadoop.so
- libhadoop.so.1
- libhadoop.so.1.0.0
- libsnappy.a
- libsnappy.la
- libsnappy.so
- libsnappy.so.1
- libsnappy.so.1.1.1
- libsnappyjava.so
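Because the required file set is long, it can help to check a distribution directory programmatically after you modify it. The following Python sketch is illustrative only and is not an Informatica utility. The function name and the short file list are hypothetical; extend the list with the entries from the table for your distribution.

```python
from pathlib import Path

# Hypothetical subset of the JARs listed in the table; extend as needed.
REQUIRED_LIB_FILES = [
    "hive-cli-0.7.1-cdh3u4.jar",
    "hive-exec-0.7.1-cdh3u4.jar",
    "hadoop-core-0.20.2-cdh3u4.jar",
    "guava-r09-jarjar.jar",
]


def missing_files(distribution_dir, required=REQUIRED_LIB_FILES):
    """Return the required file names that are absent from <distribution_dir>/lib."""
    lib_dir = Path(distribution_dir) / "lib"
    return [name for name in required if not (lib_dir / name).is_file()]
```

An empty list means that every file in the list was found. Running the same check on the Data Integration Service node and on a data node is one way to spot differences between the two directories.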

The following table lists the contents of the Hadoop distribution directory that are installed with the Hadoop RPM for Apache in the /opt/Informatica/services/shared/hadoop/[Hadoop_distribution_name] directory:

Directory Apache Distribution Files

conf hive-default.xml

lib The directory must contain the following JARs from the Hive installation:
- hive-cli-0.7.1.jar
- hive-exec-0.7.1.jar
- hive-jdbc-0.7.1.jar
- hive-metastore-0.7.1.jar
- hive-serde-0.7.1.jar
- hive-service-0.7.1.jar
- hive-shims-0.7.1.jar
- ant-contrib-1.0b3.jar
- antlr-runtime-3.0.1.jar
- asm-3.2.jar
- commons-cli-1.2.jar
- commons-codec-1.4.jar
- commons-collections-3.2.1.jar
- commons-dbcp-1.4.jar
- commons-lang-2.4.jar
- commons-logging-1.1.1.jar
- commons-logging-api-1.0.4.jar
- commons-pool-1.5.4.jar
- datanucleus-connectionpool-2.0.3.jar
- datanucleus-core-2.0.3.jar
- datanucleus-enhancer-2.0.3.jar
- datanucleus-rdbms-2.0.3.jar
- derby.jar
- jdo2-api-2.3-ec.jar
- jline-0.9.94.jar
- json.jar
- libthrift.jar
- slf4j-api-1.6.1.jar
- slf4j-log4j12-1.6.1.jar
- thrift-fb303-0.5.0.jar
The directory must contain the following JARs from the Hadoop installation:
- hadoop-core-1.0.3.jar
- commons-configuration-1.6.jar
- jackson-core-asl-1.8.8.jar
- jackson-mapper-asl-1.8.8.jar

HDFS Security Configuration

You must enable the HDFS security property dfs.permissions in the following location: /usr/lib/hadoop/conf/hdfs-site.xml. You must create a Hadoop user with the same user name as the Data Integration Service user name on the cluster nodes using the following commands:

hadoop fs -mkdir /user/
hadoop fs -chown :/user/
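As a rough illustration, the following Python sketch assembles the two commands for one user. It assumes that the arguments not shown in the commands are the Data Integration Service user name and a group of the same name, which this guide does not state explicitly. The function name and the user name infa_user are hypothetical.

```python
def hdfs_home_commands(user_name, group_name=None):
    """Build the mkdir and chown commands for a user's HDFS home directory.

    Assumption: the home directory is /user/<user_name> and, unless a group
    is given, the group has the same name as the user.
    """
    group = group_name or user_name
    home = f"/user/{user_name}"
    return [
        ["hadoop", "fs", "-mkdir", home],
        ["hadoop", "fs", "-chown", f"{user_name}:{group}", home],
    ]


# Dry run: print the commands instead of executing them.
for command in hdfs_home_commands("infa_user"):
    print(" ".join(command))
```

Printing the commands first lets an administrator review them before running them on the cluster.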


Update the Hive warehouse directory in the following location: $INFA_HOME/services/shared/hadoop/hadoopEnv.properties

Set Up Address Validation

After you install PowerCenter Big Data Edition, optionally install address reference data files on the DataNodes.

If you use PowerCenter Big Data Edition with a Data Quality license, you can push a mapping that validates the accuracy of postal address records to a Hadoop cluster. The mapping uses address reference data files to validate the records.

You purchase address reference data files from Informatica on a subscription basis. You can download the current address reference data files from Informatica at any time during the subscription period.

Installing the Address Reference Data Files

Create an automation script to install the address reference data files on each DataNode in the cluster.

1. Browse to the address reference data files that you downloaded from Informatica.

2. Extract the compressed address reference data files.

3. Stage the files to the NameNode machine or to another machine that can write to the DataNodes.

4. Create an automation script to copy the files to each DataNode.

The default directory for the address reference data files in the Hadoop environment is /reference_data.

• If you staged the files on the NameNode, use the slaves file for the Hadoop cluster to identify the DataNodes.

• If you staged the files on another machine, use the Hadoop_Nodes.txt file to identify the DataNodes. You can find this file in the PowerCenter Big Data Edition installation package.

5. Run the script.

The script copies the address reference data files to the DataNodes.
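The automation script in steps 4 and 5 could take many forms. The following Python sketch is one possibility, not the script that Informatica provides. It assumes that the node list contains one host per line, that rsync and password-less SSH are available on the staging machine, and that the staging path /staging/address_data is a placeholder. The function names are hypothetical.

```python
def read_hosts(path):
    """Read one host name or IP address per line, skipping blanks and # comments."""
    with open(path, encoding="utf-8") as handle:
        lines = (line.strip() for line in handle)
        return [line for line in lines if line and not line.startswith("#")]


def build_copy_commands(hosts, source_dir, target_dir="/reference_data"):
    """Build one rsync command per DataNode.

    The trailing slash on the source copies the directory contents rather
    than the directory itself.
    """
    source = source_dir.rstrip("/") + "/"
    target = target_dir.rstrip("/") + "/"
    return [["rsync", "-a", source, f"{host}:{target}"] for host in hosts]


# Dry run with hypothetical host names: print the commands for review.
for command in build_copy_commands(["datanode01", "datanode02"], "/staging/address_data"):
    print(" ".join(command))
```

To run the copy for real, pass each command list to subprocess.run with check=True after you verify the printed output.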

PowerCenter Big Data Edition Uninstallation

The PowerCenter Big Data Edition uninstallation deletes the PowerCenter Big Data Edition binary files from all of the DataNodes within the Hadoop cluster. Uninstall PowerCenter Big Data Edition from a shell command.

Uninstalling PowerCenter Big Data Edition

To uninstall PowerCenter Big Data Edition in a single node or cluster environment:

1. Verify that the PowerCenter Big Data Edition administrator can run sudo commands.

2. If you are uninstalling PowerCenter Big Data Edition in a cluster environment, set up a password-less Secure Shell (SSH) connection between the machine where you want to run the PowerCenter Big Data Edition installation and all of the nodes on which PowerCenter Big Data Edition will be uninstalled.

3. If you are uninstalling PowerCenter Big Data Edition in a cluster environment using the HadoopDataNodes file, verify that the HadoopDataNodes file contains the IP addresses or machine host names, one for each line, of each of the nodes in the Hadoop cluster from which you want to uninstall PowerCenter Big Data Edition.

4. Log in to the machine. The machine you log in to depends on the PowerCenter Big Data Edition environment and uninstallation method:

• If you are uninstalling PowerCenter Big Data Edition in a single node environment, log in to the machine on which PowerCenter Big Data Edition is installed.

• If you are uninstalling PowerCenter Big Data Edition in a cluster environment using the HADOOP_HOME environment variable, log in to the primary NameNode.

• If you are uninstalling PowerCenter Big Data Edition in a cluster environment using the HadoopDataNodes file, log in to any node.

5. Run the following command to start the PowerCenter Big Data Edition uninstallation in console mode:

bash InformaticaHadoopInstall.sh


7. Press Enter.

8. Select 3 to uninstall PowerCenter Big Data Edition.

9. Press Enter.

10. Select the uninstallation option, depending on the PowerCenter Big Data Edition environment:

• Select 1 to uninstall PowerCenter Big Data Edition in a single node environment.

• Select 2 to uninstall PowerCenter Big Data Edition in a cluster environment.

11. Press Enter.

12. If you are uninstalling PowerCenter Big Data Edition in a cluster environment, select the uninstallation option, depending on the uninstallation method:

• Select 1 to uninstall PowerCenter Big Data Edition from the primary NameNode.

• Select 2 to uninstall PowerCenter Big Data Edition using the HadoopDataNodes file.

13. Press Enter.

14. If you are uninstalling PowerCenter Big Data Edition in a cluster environment from the primary NameNode, type the absolute path for the Hadoop installation directory. Start the path with a slash.

The uninstaller deletes all of the PowerCenter Big Data Edition binary files from the /<PowerCenterBigDataEditionInstallationDirectory>/Informatica directory. In a cluster environment, the uninstaller deletes the binary files from all of the nodes within the Hadoop cluster.


CHAPTER 3

Connections

This chapter includes the following topics:

• Connections Overview

• HDFS Connection Properties

• Hive Connection Properties

• Creating a Connection

Connections Overview

Define the connections you want to use to access data in Hive or HDFS.

You can create the following types of connections:

• HDFS connection. Create an HDFS connection to read data from or write data to the Hadoop cluster.

• Hive connection. Create a Hive connection to access Hive data or run Informatica mappings in the Hadoop cluster. Create a Hive connection in the following connection modes:

  - Use the Hive connection to access Hive as a source or target. If you want to use Hive as a target, you need to have the same connection or another Hive connection that is enabled to run mappings in the Hadoop cluster. You can access Hive as a source if the mapping is enabled for the native or Hive environment. You can access Hive as a target only if the mapping is run in the Hadoop cluster.

  - Use the Hive connection to validate or run an Informatica mapping in the Hadoop cluster. Before you run mappings in the Hadoop cluster, review the information in this guide about rules and guidelines for mappings that you can run in the Hadoop cluster.

You can create the connections using the Developer tool, Administrator tool, and infacmd.

Note: For information about creating connections to other sources or targets such as social media web sites or Teradata, see the respective PowerExchange adapter user guide.

HDFS Connection Properties

Use the HDFS connection to access files in the Hadoop Distributed File System.


The following table describes the properties for an HDFS connection:


Name The name of the connection. The name is not case sensitive and must be unique within the domain. You can change this property after you create the connection. It cannot exceed 128 characters, contain spaces, or contain the following special characters: ~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /

ID String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It must be 255 characters or less and must be unique in the domain. You cannot change this property after you create the connection. Default value is the connection name.

Description The description of the connection. The description cannot exceed 765 characters.

Location The domain where you want to create the connection.

Type The connection type. Default is Hadoop File System.

User Name User name to access HDFS.

NameNode URI The URI to access HDFS. The URI must be in the following format: hdfs://<namenode>:<port>
Where:
- <namenode> is the host name or IP address of the NameNode.
- <port> is the port on which the NameNode listens for remote procedure calls (RPC).
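The naming rules and the URI format in the table lend themselves to a quick pre-check before you create a connection. The following Python sketch is illustrative only. The function names are hypothetical, and the tools themselves may apply additional checks that are not described here.

```python
from urllib.parse import urlsplit

# Special characters that the Name property does not allow, per the table.
FORBIDDEN_NAME_CHARS = set("~`!$%^&*()-+={[}]|\\:;\"'<,>.?/")


def connection_name_problems(name):
    """Return a list of rule violations for a connection name (empty if none)."""
    problems = []
    if len(name) > 128:
        problems.append("longer than 128 characters")
    if " " in name:
        problems.append("contains spaces")
    bad = sorted(set(name) & FORBIDDEN_NAME_CHARS)
    if bad:
        problems.append("contains special characters: " + " ".join(bad))
    return problems


def is_valid_namenode_uri(uri):
    """Check that a URI has the form hdfs://<namenode>:<port>."""
    parts = urlsplit(uri)
    try:
        port = parts.port  # raises ValueError for a non-numeric port
    except ValueError:
        return False
    return (
        parts.scheme == "hdfs"
        and bool(parts.hostname)
        and port is not None
        and parts.path in ("", "/")
    )
```

For example, connection_name_problems("HDFS_conn") returns an empty list, and is_valid_namenode_uri rejects a URI that omits the port.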

Hive Connection Properties

Use a Hive connection to access data in Hive or to run a mapping in a Hadoop cluster.

General Properties

The following table describes the general properties that you configure for a Hive connection:


Name The name of the connection. The name is not case sensitive and must be unique within the domain. You can change this property after you create the connection. The name cannot exceed 128 characters, contain spaces, or contain the following special characters: ~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /

ID String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It must be 255 characters or less and must be unique in the domain. You cannot change this property after you create the connection. Default value is the connection name.

Description The description of the connection. The description cannot exceed 4000 characters.

Location The domain where you want to create the connection.

Type The connection type. Select Hive.

Connection Modes Hive connection mode. Select at least one of the following options:
- Access Hive as a source or target. Select this option if you want to use the connection to access the Hive data warehouse. Note that if you want to use Hive as a target, you need to enable the same connection or another Hive connection to run mappings in the Hadoop cluster.
- Use Hive to run mappings in Hadoop cluster. Select this option if you want to use the connection to run mappings in the Hadoop cluster.
You can select both options. Default is Access Hive as a source or target.

Environment SQL SQL commands to set the Hadoop environment. In a native environment, the Data Integration Service executes the environment SQL each time it creates a connection to the Hive metastore. If you use a Hive connection to run mappings in a Hadoop cluster, the Data Integration Service executes the environment SQL at the start of each Hive session.
The following rules and guidelines apply to the usage of environment SQL in both connection modes:
- Use the environment SQL to specify Hive queries.
- Use the environment SQL to set the classpath for Hive user-defined functions and then use either environment SQL or PreSQL to specify the Hive user-defined functions. You cannot use PreSQL in the data object properties to specify the classpath. The path must be the fully qualified path to the JAR files used for user-defined functions. Set the parameter hive.aux.jars.path with all the entries in infapdo.aux.jars.path and the path to the JAR files for user-defined functions.
- You can also use environment SQL to define Hadoop or Hive parameters that you intend to use in the PreSQL commands or in custom queries.
If the Hive connection is used to run mappings in the Hadoop cluster, only the environment SQL of the Hive connection is executed. The different environment SQL commands for the connections of the Hive source or target are not executed, even if the Hive sources and targets are on different clusters.
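The guideline about hive.aux.jars.path can be sketched as a small helper that combines the existing infapdo.aux.jars.path entries with the JAR paths for user-defined functions. This Python example is an assumption-laden illustration: the function name is hypothetical, the entries are joined with commas because Hive treats hive.aux.jars.path as a comma-separated list, and the SET prefix and trailing semicolon follow common Hive usage rather than anything this guide specifies. The sample paths are placeholders.

```python
def build_aux_jars_statement(infapdo_entries, udf_jar_paths):
    """Compose a SET statement for hive.aux.jars.path.

    Keeps every existing infapdo.aux.jars.path entry and appends the
    fully qualified paths of the user-defined function JARs.
    """
    for path in udf_jar_paths:
        # Treat absolute paths and scheme-qualified URIs as fully qualified.
        if not (path.startswith("/") or "://" in path):
            raise ValueError(f"not a fully qualified path: {path}")
    entries = list(infapdo_entries) + list(udf_jar_paths)
    return "SET hive.aux.jars.path=" + ",".join(entries) + ";"


print(build_aux_jars_statement(
    ["file:///opt/example/infa_one.jar"],   # placeholder for existing entries
    ["file:///opt/example/my_udfs.jar"],    # placeholder UDF JAR
))
```

Validating the paths before composing the statement catches relative paths early, which matters because the classpath cannot be corrected later through PreSQL.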

Properties to Access Hive as Source or Target

The following table describes the connection properties that you configure to access Hive as a source or target:


Metadata Connection String The JDBC connection URI used to access the metadata from the Hadoop server. The connection string must be in the following format: jdbc:hive://<hostname>:<port>/<db>
Where:
- hostname is the name or IP address of the machine on which the Hive server is running.
- port is the port on which the Hive server is listening.
- db is the database name to which you want to connect. If you do not provide the database name, the Data Integration Service uses the default database details.

Bypass Hive JDBC Server JDBC driver mode. Select the checkbox to use the embedded mode or embedded JDBC driver.
To use the JDBC embedded mode, perform the following tasks:
- Verify that Hive client and Informatica Services are installed on the same machine.
- Configure the Hive connection properties to run mappings in the Hadoop cluster.
If you choose the non-embedded mode, you must configure the Data Access Connection String.
The JDBC embedded mode is preferred to the non-embedded mode.

Data Access Connection String The connection string used to access data from the Hadoop data store. The non-embedded JDBC mode connection string must be in the following format: jdbc:hive://<hostname>:<port>/<db>
Where:
- hostname is the name or IP address of the machine on which the Hive server is running.
- port is the port on which the Hive server is listening. Default is 10000.
- db is the database to which you want to connect. If you do not provide the database name, the Data Integration Service uses the default database details.
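To make the connection string format concrete, the following Python sketch builds a string from its parts. The function name and the host name hiveserver01 are hypothetical. The sketch always includes a database name and falls back to the name default, which is an assumption made to avoid guessing how the string looks when the database is omitted.

```python
def hive_jdbc_url(hostname, port=10000, db="default"):
    """Build a connection string in the form jdbc:hive://<hostname>:<port>/<db>."""
    if not hostname:
        raise ValueError("hostname is required")
    port = int(port)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"jdbc:hive://{hostname}:{port}/{db}"


print(hive_jdbc_url("hiveserver01"))
```

The same helper works for both the Metadata Connection String and the Data Access Connection String because the two properties share one format.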

Properties to Run Mappings in the Hadoop Cluster

The following table describes the Hive connection properties that you configure when you want to use the Hive connection to run Informatica mappings in a Hive environment:


Database Name Namespace for tables. Use the name default for tables that do not have a specified database name.

Default FS URI The URI to access the default Hadoop Distributed File System (HDFS). The FS URI must be in the following format: hdfs://<node name>:<port>
Where:
- node name is the host name or IP address of the NameNode.
- port is the port on which the NameNode listens for remote procedure calls (RPC).

JobTracker URI The service within Hadoop that submits the MapReduce tasks to specific nodes in the cluster. JobTracker URI must be in the following format: <jobtrackername>:<port>
Where:
- jobtrackername is the host name or IP address of the JobTracker.
- port is the port on which the JobTracker listens for remote procedure calls (RPC).

Hive Warehouse Directory on HDFS The absolute HDFS file path of the default database for the warehouse, which is local to the cluster. For example, the following file path specifies a local warehouse: /user/hive/warehouse

Metastore Execution Mode Controls whether to connect to a remote metastore or a local metastore. By default, local is selected. For a local metastore, you must specify the Metastore Database URI, Driver, Username, and Password. For a remote metastore, you must specify only the Remote Metastore URI.

Metastore Database URI The JDBC connection URI used to access the data store in a local metastore setup. The URI must be in the following format: jdbc:<datastore type>://<node name>:<port>/<database name>
Where:
- node name is the host name or IP address of the data store.
- data store type is the type of the data store.
- port is the port on which the data store listens for remote procedure calls (RPC).
- database name is the name of the database.
For example, the following URI specifies a local metastore that uses MySQL as a data store: jdbc:mysql://hostname23:3306/metastore

Metastore Database Driver Driver class name for the JDBC data store. For example, the following class name specifies a MySQL driver: com.mysql.jdbc.Driver

Metastore Database Username The metastore database user name.

Metastore Database Password The password for the metastore user name.

Remote Metastore URI The metastore URI used to access metadata in a remote metastore setup. For a remote metastore, you must specify the Thrift server details. The URI must be in the following format: thrift://<hostname>:<port>
Where:
- hostname is the name or IP address of the Thrift metastore server.
- port is the port on which the Thrift server is listening.
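The dependency between the Metastore Execution Mode and the other metastore properties can be expressed as a simple check. The following Python sketch is illustrative only: the dictionary keys are hypothetical names chosen for this example and are not option names of the Developer tool, Administrator tool, or infacmd.

```python
# Hypothetical keys that mirror the property names in the table.
REQUIRED_BY_MODE = {
    "local": [
        "metastore_database_uri",
        "metastore_database_driver",
        "metastore_database_username",
        "metastore_database_password",
    ],
    "remote": ["remote_metastore_uri"],
}


def metastore_problems(settings):
    """Return a list of problems with the metastore settings (empty if none)."""
    mode = settings.get("metastore_execution_mode", "local")  # local is the default
    if mode not in REQUIRED_BY_MODE:
        return [f"unknown metastore execution mode: {mode}"]
    problems = [f"missing {key}" for key in REQUIRED_BY_MODE[mode] if not settings.get(key)]
    remote_uri = settings.get("remote_metastore_uri", "")
    if mode == "remote" and remote_uri and not remote_uri.startswith("thrift://"):
        problems.append("remote_metastore_uri must start with thrift://")
    database_uri = settings.get("metastore_database_uri", "")
    if mode == "local" and database_uri and not database_uri.startswith("jdbc:"):
        problems.append("metastore_database_uri must start with jdbc:")
    return problems
```

A remote configuration therefore needs only the Thrift URI, while a local configuration must supply all four database properties.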

Creating a Connection

Create a connection before you import data objects, preview data, profile data, and run mappings.

1. Click Window > Preferences.

2. Select Informatica > Connections.

3. Expand the domain in the Available Connections list.

4. Select the type of connection that you want to create:

• To select a Hive connection, select Database > Hive.

• To select an HDFS connection, select File Systems > Hadoop File System.

5. Click Add.

6. Enter a connection name and optional description.

7. Click Next.


8. Configure the connection properties. For a Hive connection, you must choose the Hive connection mode and specify the commands for environment SQL. The SQL commands apply to both connection modes. Select at least one of the following connection modes:

Option Description

Access Hive as a source or target Use the connection to access Hive data. If you select this option and click Next, the Properties to Access Hive as a source or target page appears. Configure the connection strings.

Run mappings in a Hadoop cluster. Use the Hive connection to validate and run Informatica mappings in the Hadoop cluster. If you select this option and click Next, the Properties used to Run Mappings in the Hadoop Cluster page appears. Configure the properties.

9. Click Test Connection to verify the connection.

You can test a Hive connection that is configured to access Hive data. You cannot test a Hive connection that is configured to run Informatica mappings in the Hadoop cluster.

10. Click Finish.


CHAPTER 4

Mappings in the Native Environment

This chapter includes the following topics:

• Mappings in the Native Environment Overview

• Data Processor Mappings

• HDFS Mappings

• Hive Mappings

• Social Media Mappings

Mappings in the Native Environment Overview

You can run a mapping in the native or Hive environment. In the native environment, the Data Integration Service runs the mapping from the Developer tool. You can run standalone mappings or mappings that are a part of a workflow.

In the native environment, you can read and process data from large unstructured and semi-structured files, Hive, or social media web sites. You can include the following objects in the mappings:

• Hive sources

• Flat file sources or targets in the local system or in HDFS

• Complex file sources in the local system or in HDFS

• Data Processor transformations to process unstructured and semi-structured file formats

• Social media sources

You can also import PowerCenter mappings in the Developer tool and run them in the native environment.

Data Processor Mappings

The Data Processor transformation processes unstructured and semi-structured file formats in a mapping. It converts source data to flat CSV records that MapReduce applications can process.

You can configure the Data Processor transformation to process messaging formats, HTML pages, XML, and PDF documents. You can also configure it to transform structured formats such as ACORD, HIPAA, HL7, EDI-X12, EDIFACT, AFP, and SWIFT.

For example, an application produces hundreds of data files per second and writes the files to a directory. You can create a mapping that extracts the files from the directory, passes them to a Data Processor transformation, and writes the data to a target.

HDFS Mappings

Create an HDFS mapping to read or write to HDFS.

You can read and write fixed-width and delimited file formats. You can read or write compressed files. You can read text files and binary file formats such as sequence file from HDFS. You can specify the compression format of the files. You can use the binary stream output of the complex file data object as input to a Data Processor transformation to parse the file.

You can define the following objects in an HDFS mapping:

• Flat file data object or complex file data object operation as the source to read data from HDFS.

• Transformations.

• Flat file data object as the target to write data to HDFS or any target.

Validate and run the mapping. You can deploy the mapping and run it or add the mapping to a Mapping task in a workflow.

HDFS Mapping Example

Your organization, HypoMarket Corporation, needs to analyze purchase order details such as customer ID, item codes, and item quantity. The purchase order details are stored in a semi-structured compressed XML file in HDFS. The hierarchical data includes a purchase order parent hierarchy level and a customer contact details child hierarchy level. Create a mapping that reads all the purchase records from the file in HDFS. The mapping must convert the hierarchical data to relational data and load it in a relational target.

You can use the extracted data for business analytics.

The following figure shows the example mapping:

You can use the following objects in an HDFS mapping:

HDFS input

The input, Read_Complex_File, is a compressed XML file stored in HDFS.

Data Processor Transformation

The Data Processor transformation, Data_Processor_XML_to_Relational, parses the XML file and provides a relational output.

Relational output

The output, Write_Relational_Data_Object, is a table in an Oracle database.


When you run the mapping, the Data Integration Service reads the file in a binary stream and passes it to the Data Processor transformation. The Data Processor transformation parses the specified file and provides a relational output. The output is loaded into the relational target.

You can configure the mapping to run in the native or Hive environment.

Complete the following tasks to configure the mapping:

1. Create an HDFS connection to read files from the Hadoop cluster.

2. Create a complex file data object operation. Specify the following parameters:

• The file as the resource in the data object.

• The file compression format.

• The HDFS file location.

3. Optionally, you can specify the input format that the Mapper uses to read the file.

4. Drag and drop the data object operation into a mapping.

5. Create a Data Processor transformation. Configure the following properties in the Data Processor transformation:

• An input port set to buffer input and binary datatype.

• Relational output ports depending on the number of columns you want in the relational output. Specify the port size for the ports. Use an XML schema reference that describes XML hierarchy. Specify the normalized output you want. For example, you can specify PurchaseOrderNumber_Key as a generated key that relates the Purchase Orders output group to a Customer Details group.

• Create a Streamer object and specify Streamer as a startup component.

6. Create a relational connection to an Oracle database.

7. Import a relational data object.

8. Create a write transformation for the relational data object and add it to the mapping.
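To show what converting hierarchical data to relational data means in this example, the following Python sketch flattens a small XML document into two row sets that share a generated key. It illustrates the idea only and does not reproduce how the Data Processor transformation works. The element names and sample values are hypothetical; only PurchaseOrderNumber_Key comes from the steps above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real schema is defined by your XML schema reference.
SAMPLE_XML = """
<PurchaseOrders>
  <PurchaseOrder>
    <CustomerID>C-77</CustomerID><ItemCode>A12</ItemCode><ItemQuantity>3</ItemQuantity>
    <CustomerContact><Name>Ana</Name><Phone>555-0100</Phone></CustomerContact>
  </PurchaseOrder>
  <PurchaseOrder>
    <CustomerID>C-78</CustomerID><ItemCode>B07</ItemCode><ItemQuantity>1</ItemQuantity>
    <CustomerContact><Name>Raj</Name><Phone>555-0101</Phone></CustomerContact>
  </PurchaseOrder>
</PurchaseOrders>
"""


def flatten_purchase_orders(xml_text):
    """Split the hierarchy into a parent row set and a child row set.

    Each purchase order receives a generated integer key that also appears
    on its customer contact rows, so the two row sets can be joined.
    """
    root = ET.fromstring(xml_text)
    orders, contacts = [], []
    for key, order in enumerate(root.iter("PurchaseOrder"), start=1):
        orders.append({
            "PurchaseOrderNumber_Key": key,
            "CustomerID": order.findtext("CustomerID"),
            "ItemCode": order.findtext("ItemCode"),
            "ItemQuantity": int(order.findtext("ItemQuantity")),
        })
        for contact in order.iter("CustomerContact"):
            contacts.append({
                "PurchaseOrderNumber_Key": key,
                "Name": contact.findtext("Name"),
                "Phone": contact.findtext("Phone"),
            })
    return orders, contacts
```

Each returned list corresponds to one output group: the first to Purchase Orders and the second to Customer Details.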

Hive Mappings

Based on the mapping environment, you can read data from or write data to Hive.

In a native environment, you can read data from Hive. To read data from Hive, complete the following steps:

1. Create a Hive connection.

2. Configure the Hive connection mode to access Hive as a source or target.

3. Use the Hive connection to create a data object to read from Hive.

4. Add the data object to a mapping and configure the mapping to run in the native environment.

You can write to Hive in a Hive environment. To write data to Hive, complete the following steps:

1. Create a Hive connection.

2. Configure the Hive connection mode to access Hive as a source or target.

3. Use the Hive connection to create a data object to write to Hive.

4. Add the data object to a mapping and configure the mapping to run in the Hive environment.

You can define the following types of objects in a Hive mapping:

• A read data object to read data from Hive

• Transformations

• A target or an SQL data service. You can write to Hive if you run the mapping in a Hadoop cluster.

Validate and run the mapping. You can deploy the mapping and run it or add the mapping to a Mapping task in a workflow.

Hive Mapping Example

Your organization, HypoMarket Corporation, needs to analyze customer data. Create a mapping that reads all the customer records. Create an SQL data service to make a virtual database available for end users to query.

You can use the following objects in a Hive mapping:

Hive input

The input file is a Hive table that contains the customer names and contact details.

Create a relational data object. Configure the Hive connection and specify the table that contains the customer data as a resource for the data object. Drag the data object into a mapping as a read data object.

SQL Data Service output

Create an SQL data service in the Developer tool. To make it available to end users, include it in an application, and deploy the application to a Data Integration Service. When the application is running, connect to the SQL data service from a third-party client tool by supplying a connect string.

You can run SQL queries through the client tool to access the customer data.

Social Media Mappings

Create mappings to read social media data from sources such as Facebook and LinkedIn.

You can extract social media data and load it to a target in the native environment only. You can choose to parse this data or use the data for data mining and analysis.

To process or analyze the data in Hadoop, you must first move the data to a relational or flat file target and then run the mapping in the Hadoop cluster.

You can use the following Informatica adapters in the Developer tool:

• PowerExchange for DataSift

• PowerExchange for Facebook

• PowerExchange for LinkedIn

• PowerExchange for Twitter

• PowerExchange for Web Content-Kapow Katalyst

Review the respective PowerExchange adapter documentation for more information.

Twitter Mapping Example

Your organization, HypoMarket Corporation, needs to review all the tweets that mention your product "HypoBasket" with a positive attitude since the time you released the product in February 2012.

Create a mapping that identifies tweets that contain the word HypoBasket and writes those records to a table.


The following figure shows the example mapping:

You can use the following objects in a Twitter mapping:

Twitter input

The mapping source is a Twitter data object that contains the resource Search.

Create a physical data object and add the data object to the mapping. Add the Search resource to the physical data object. Modify the query parameter with the following query:

QUERY=HypoBasket:)&since:2012-02-01

Sorter transformation

Optionally, sort the data based on the timestamp.

Add a Sorter transformation to the mapping. Specify the timestamp as the sort key with direction as ascending.

Mapping output

Add a relational data object to the mapping as a target.

After you run the mapping, the Data Integration Service writes the extracted tweets to the target table. You can use text analytics and sentiment analysis tools to analyze the tweets.
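The selection and sorting logic of this example can be pictured with a few lines of Python. This sketch is only an analogy for what the mapping produces: it does not call Twitter or any PowerExchange adapter, and the record layout with text and created_at fields is hypothetical.

```python
from datetime import datetime


def select_and_sort(tweets, keyword="HypoBasket", since="2012-02-01"):
    """Keep tweets that mention the keyword on or after the date, oldest first."""
    cutoff = datetime.strptime(since, "%Y-%m-%d")
    matches = [
        tweet for tweet in tweets
        if keyword in tweet["text"] and tweet["created_at"] >= cutoff
    ]
    # Ascending sort on the timestamp, as the Sorter transformation is configured.
    return sorted(matches, key=lambda tweet: tweet["created_at"])
```

Sentiment scoring is deliberately left out; in the example above, that analysis happens after the tweets reach the target table.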


CHAPTER 5

Mappings in a Hive Environment

This chapter includes the following topics:

• Mappings in a Hive Environment Overview

• Datatypes in a Hive Environment

• Sources in a Hive Environment

• Targets in a Hive Environment

• Transformations in a Hive Environment

• Functions in a Hive Environment

• Variable Ports in a Hive Environment

• Mappings in a Hive Environment

• Workflows that Run Mappings in a Hive Environment

• Configuring a Mapping to Run in a Hive Environment

• Hive Execution Plan

• Monitoring a Mapping

• Logs

• Troubleshooting a Mapping in a Hive Environment

Mappings in a Hive Environment Overview

You can run a mapping on a Hadoop cluster. The Data Integration Service can push mappings that are imported from PowerCenter or developed in the Developer tool to a Hadoop cluster. You can run standalone mappings or mappings that are a part of a workflow.

When you run a mapping on a Hadoop cluster, you must configure a Hive validation environment, a Hive run-time environment, and a Hive connection for the mapping. Validate the mapping to ensure you can push the mapping logic to Hadoop. After you validate a mapping for the Hive environment, you can run the mapping.

To run a mapping on a Hadoop cluster, complete the following steps:

1. In the Developer tool, create a Hive connection.

2. Create a mapping in the Developer tool or import a mapping from PowerCenter.

3. Configure the mapping to run in a Hive environment.

4. Validate the mapping.

5. Optionally, include the mapping in a workflow.


6. Run the mapping or workflow.

When you run the mapping, the Data Integration Service converts the mapping to a Hive execution plan that runs on a Hadoop cluster. You can view the Hive execution plan using the Developer tool or the Administrator tool.

The Data Integration Service has a Hive executor that can process the mapping. The Hive executor simplifies the mapping to an equivalent mapping with a reduced set of instructions and generates a Hive execution plan. The Hive execution plan is a series of Hive queries. The Hive execution plan contains tasks to start the mapping, run the mapping, and clean up the temporary tables and files. You can view the Hive execution plan that the Data Integration Service generates before you run the mapping.

You can monitor Hive queries and the Hadoop jobs associated with a query in the Administrator tool. The Data Integration Service logs messages from the DTM, Hive session, and Hive tasks in the runtime log files.

Datatypes in a Hive Environment

Due to the differences between the native environment and a Hive environment, some variations apply in the processing and validity of datatypes when you push datatypes to a Hive environment.

The following variations apply in datatype processing and validity:

• A Binary datatype in a field or an expression function is not valid. If a transformation has a port with a Binary datatype that is not used in the mapping, you can validate and run the mapping in a Hive environment.

• A high precision Decimal datatype is not valid. A mapping is run in low precision mode in a Hive environment.

• The results of arithmetic operations on floating point types, such as a Double or a Decimal, can vary up to 0.1 percent between the native environment and a Hive environment.

• Hive complex datatypes in a Hive source or Hive target are not valid.

• When the Data Integration Service converts a decimal with a precision of 10 and a scale of 3 to a string datatype and writes to a flat file target, the results can differ between the native environment and a Hive environment. For example, in a Hive environment, HDFS writes the output string for the decimal 19711025 with a precision of 10 and a scale of 3 as 1971. In the native environment, the flat file writer sends the output string for the decimal 19711025 with a precision of 10 and a scale of 3 as 1971.000.
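When you compare output from the two environments, these variations mean that an exact match is too strict a test. The following Python sketch shows one way to compare values with the 0.1 percent tolerance and to normalize decimal strings to a fixed scale before comparison. The function names are hypothetical, and treating 0.1 percent as a relative tolerance of 0.001 is an interpretation of the statement above.

```python
import math
from decimal import Decimal


def within_hive_tolerance(native_value, hive_value, rel_tol=0.001):
    """Return True if two results differ by no more than 0.1 percent."""
    return math.isclose(native_value, hive_value, rel_tol=rel_tol)


def normalize_decimal_string(text, scale=3):
    """Rewrite a decimal string with a fixed number of fractional digits."""
    quantum = Decimal(10) ** -scale  # for scale=3 this is Decimal("0.001")
    return str(Decimal(text).quantize(quantum))
```

With these helpers, the strings 1971 and 1971.000 from the example above normalize to the same value, so a comparison script would not report them as a difference.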

Sources in a Hive EnvironmentDue to the differences between the native environment and a Hive environment, you can only push certainsources to a Hive environment. Some of the sources that are valid in mappings in a Hive environment haverestrictions.

You can run mappings with the following sources in a Hive environment:

¨ IBM DB2

¨ Flat file

¨ HDFS complex file

¨ HDFS flat file

¨ Hive

¨ ODBC

¨ Oracle

Flat File Sources

Flat file sources are valid in mappings in a Hive environment with some restrictions. A mapping with a flat file source can fail to run in certain cases.

Flat file sources are valid in mappings in a Hive environment with the following restrictions:

¨ You cannot use a command to generate or transform flat file data and send the output to the flat file reader at runtime.

¨ You cannot use an indirect source type.

¨ The row size in a flat file source cannot exceed 190 MB.
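The row size restriction can be checked before you run a mapping. The following Python sketch is one possible pre-check and is not an Informatica utility. It assumes newline-delimited rows and treats 1 MB as 1024 * 1024 bytes, which this guide does not specify. The function name is hypothetical.

```python
from typing import Iterator

# The 190 MB limit comes from the restriction above. Treating 1 MB as
# 1024 * 1024 bytes is an assumption; the guide does not define the unit.
MAX_ROW_BYTES = 190 * 1024 * 1024

def oversized_rows(path: str, limit: int = MAX_ROW_BYTES) -> Iterator[int]:
    """Yield 1-based numbers of newline-delimited rows longer than limit bytes."""
    with open(path, "rb") as handle:
        for row_number, row in enumerate(handle, start=1):
            # Exclude the line terminator from the measured row size.
            if len(row.rstrip(b"\r\n")) > limit:
                yield row_number
```

An empty result suggests that no row exceeds the limit under these assumptions.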

Hive Sources

Hive sources are valid in mappings in a Hive environment with some restrictions.

Hive sources are valid in mappings in a Hive environment with the following restrictions:

¨ The Data Integration Service can run pre-mapping SQL commands against the source database before it reads from a Hive source. When you run a mapping with a Hive source in a Hive environment, references to a local path in pre-mapping SQL commands are relative to the Data Integration Service node. When you run a mapping with a Hive source in the native environment, references to a local path in pre-mapping SQL commands are relative to the Hive server node.

¨ A mapping fails to validate when you configure post-mapping SQL commands. The Data Integration Service does not run post-mapping SQL commands against a Hive source.

¨ A mapping fails to run when you have Unicode characters in a Hive source definition.

Relational Sources

The Data Integration Service does not run pre-mapping SQL commands or post-mapping SQL commands against relational sources. You cannot validate and run a mapping with PreSQL or PostSQL properties for a relational source in a Hive environment.

Targets in a Hive Environment

Due to the differences between the native environment and a Hive environment, you can push only certain targets to a Hive environment. Some of the targets that are valid in mappings in a Hive environment have restrictions.

You can run mappings with the following targets in a Hive environment:

¨ IBM DB2

¨ Flat file

¨ HDFS flat file

¨ Hive

¨ ODBC

¨ Oracle

¨ Teradata

Flat File Targets

Flat file targets are valid in mappings in a Hive environment with some restrictions.

Flat file targets are valid in mappings in a Hive environment with the following restrictions:

¨ The Data Integration Service truncates the target files and reject files before writing the data. When you use a flat file target, you cannot append output data to target files and reject files.

¨ The Data Integration Service can write to a file output for a flat file target. When you have a flat file target in a mapping, you cannot write data to a command.

HDFS Flat File Targets

HDFS flat file targets are valid in mappings in a Hive environment with some restrictions.

When you use an HDFS flat file target in a mapping, you must specify the full path that includes the output file directory and file name. The Data Integration Service may generate multiple output files in the output directory when you run the mapping in a Hive environment.

Hive Targets

Hive targets are valid in mappings in a Hive environment with some restrictions.

Hive targets are valid in mappings in a Hive environment with the following restrictions:

¨ The Data Integration Service does not run pre-mapping or post-mapping SQL commands against the target database for a Hive target. You cannot validate and run a mapping with PreSQL or PostSQL properties for a Hive target.

¨ A mapping fails to run if the Hive target definition differs in the number and order of the columns from the relational table in the Hive database.

¨ The Data Integration Service truncates the table to overwrite data in a Hive target. The Data Integration Service ignores write properties, update override, delete, insert, and update strategy when it writes data to a Hive target.

¨ A mapping fails to run when you use Unicode characters in a Hive target definition.
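The column restriction above can be checked ahead of a run by comparing the two column lists. The following Python sketch is a hypothetical helper, not an Informatica API. It assumes that you already have the column names from the Hive target definition and from the table in the Hive database, and it compares names without regard to case, which is an assumption about how the names should be matched.

```python
from typing import List, Sequence

def column_mismatches(definition: Sequence[str], table: Sequence[str]) -> List[str]:
    """Describe differences in column count or order between two column lists."""
    problems = []
    if len(definition) != len(table):
        problems.append(f"column count differs: {len(definition)} vs {len(table)}")
    # Compare position by position so that reordered columns are reported.
    for position, (expected, actual) in enumerate(zip(definition, table), start=1):
        if expected.lower() != actual.lower():
            problems.append(f"position {position}: {expected} vs {actual}")
    return problems
```

An empty list means that the number and order of the columns match under these assumptions.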

Relational Targets

The Data Integration Service does not run pre-mapping SQL commands or post-mapping SQL commands against relational targets in a Hive environment. You cannot validate and run a mapping with PreSQL or PostSQL properties for a relational target in a Hive environment.


Transformations in a Hive Environment

Due to the differences between the native environment and a Hive environment, only certain transformations are valid or valid with restrictions in a Hive environment. The Data Integration Service does not process transformations that contain functions, expressions, datatypes, and variable fields that are not valid in a Hive environment.

The following table describes the rules and guidelines for transformations:

Transformation Rules and Guidelines

Address Validator You can push mapping logic that includes an Address Validator transformation to Hadoop if you use a Data Quality product license. The following limitation applies to Address Validator transformations:
- An Address Validator transformation does not generate a certification report when it runs in a mapping on Hadoop. If you select a certification report option on the transformation, the mapping validation fails when you attempt to push transformation logic to Hadoop.

Aggregator An Aggregator transformation with pass-through fields is valid if they are group-by fields.

Case Converter The Data Integration Service can push a Case Converter transformation to Hadoop.

Comparison You can push mapping logic that includes a Comparison transformation to Hadoop if you use a Data Quality product license.

Consolidation You can push mapping logic that includes a Consolidation transformation to Hadoop if you use a Data Quality product license. The following limitation applies to Consolidation transformations:
- A Consolidation transformation may process records in a different order in native and Hadoop environments. The transformation may identify a different record as the survivor record in each environment.

Data Processor The following limitations apply when a Data Processor transformation directly connects to a complex file reader:
- Ports cannot be defined as file.
- Input port must be defined as binary.
- Output port cannot be defined as binary.
- A Streamer must be defined as startup component.
- Pass-through ports cannot be used.
- Additional input ports cannot be used.
The following limitations apply when a mapping has a Data Processor transformation:
- Ports cannot be defined as file.
- Ports cannot be defined as binary.
- Streamer cannot be defined as startup component.

Decision You can push mapping logic that includes a Decision transformation to Hadoop if you use a Data Quality product license.

Expression An Expression transformation with a user-defined function returns a null value for rows that have an exception error in the function. The Data Integration Service returns an infinite or a NaN (not a number) value when you push transformation logic to Hadoop for expressions that result in numerical errors. For example:
- Divide by zero
- SQRT (negative number)
- ASIN (out-of-bounds number)
In the native environment, the expressions that result in numerical errors return null values and the rows do not appear in the output.

Filter The Data Integration Service can push a Filter transformation to Hadoop.

Java You must copy external JAR files that a Java transformation requires to the Informatica installation directory in the Hadoop cluster nodes at the following location: [$HADOOP_NODE_INFA_HOME]/services/shared/jars/platform/dtm/
The following limitations apply to the transformation scope property:
- If the transformation scope is set to Transaction, you cannot validate the Java transformation. The Data Integration Service cannot apply transformation logic to all rows in a transaction.
- If the transformation scope is set to Row, a Java transformation is run by the mapper script.
- If you select a port for Java partition key, the transformation scope is set to All Input.
- If the transformation scope is set to All Input, a Java transformation is run by the reducer script and you must set at least one input field as a group-by field for the reducer key.
The Java code in the transformation cannot write output to standard output when you push transformation logic to Hadoop. The Java code can write output to standard error, which appears in the log files.

Joiner A Joiner transformation cannot contain inequality joins in the outer join condition.

Key Generator You can push mapping logic that includes a Key Generator transformation to Hadoop if you use a Data Quality product license.

Labeler You can push mapping logic that includes a Labeler transformation to Hadoop when you configure the transformation to use probabilistic matching techniques. You can push mapping logic that includes all types of Labeler configuration if you use a Data Quality product license.

Lookup The following limitations apply to Lookup transformations:
- An unconnected Lookup transformation is not valid.
- You cannot configure an uncached lookup source.
- You cannot configure a persistent lookup cache for the lookup source.
- You cannot use a Hive source for a relational lookup source.
- When you run mappings that contain Lookup transformations, the Data Integration Service creates lookup cache JAR files. Hive copies the lookup cache JAR files to the following temporary directory: /tmp/<user_name>/hive_resources. The Hive parameter hive.downloaded.resources.dir determines the location of the temporary directory. You can delete the lookup cache JAR files specified in the LDTM log after the mapping completes to reclaim disk space.

Match You can push mapping logic that includes a Match transformation to Hadoop if you use a Data Quality product license. The following limitation applies to Match transformations:
- A Match transformation generates cluster ID values differently in native and Hadoop environments. In a Hadoop environment, the transformation appends a group ID value to the cluster ID.

Merge The Data Integration Service can push a Merge transformation to Hadoop.

Parser You can push mapping logic that includes a Parser transformation to Hadoop when you configure the transformation to use probabilistic matching techniques. You can push mapping logic that includes all types of Parser configuration if you use a Data Quality product license.

Rank A comparison is valid if it is case sensitive.

Router The Data Integration Service can push a Router transformation to Hadoop.

Sorter The Data Integration Service ignores the Sorter transformation when you push mapping logic to Hadoop.

SQL The Data Integration Service can push SQL transformation logic to Hadoop. You cannot use a Hive connection.

Standardizer You can push mapping logic that includes a Standardizer transformation to Hadoop if you use a Data Quality product license.

Union The custom source code in the transformation cannot write output to standard output when you push transformation logic to Hadoop. The custom source code can write output to standard error, which appears in the runtime log files.

Weighted Average You can push mapping logic that includes a Weighted Average transformation to Hadoop if you use a Data Quality product license.
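The Expression entry in the table above describes a difference that matters when you compare results across environments: numerical errors produce infinite or NaN values on Hadoop, while the native environment returns null and omits the rows. The following Python sketch shows one way to bring Hadoop-style output in line with the native behavior after the fact. It is an illustration under that assumption, with hypothetical function names, and is not product code.

```python
import math
from typing import Iterable, List, Optional

def null_if_not_finite(value: Optional[float]) -> Optional[float]:
    """Map infinite and NaN values to None, leaving other values unchanged."""
    if value is None or not math.isfinite(value):
        return None
    return value

def drop_error_rows(values: Iterable[Optional[float]]) -> List[float]:
    """Keep only finite values, mirroring how error rows are omitted natively."""
    cleaned = (null_if_not_finite(v) for v in values)
    return [v for v in cleaned if v is not None]
```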

Functions in a Hive Environment

Some transformation language functions that are valid in the native environment are not valid or have limitations in a Hive environment.

The following table describes the functions that are not valid or have limitations in a Hive environment:

Name Limitation

ABORT String argument is not valid.

AES_DECRYPT Not valid

AES_ENCRYPT Not valid

COMPRESS Not valid

CRC32 Not valid

CUME Not valid

DECODE Not valid

DEC_BASE64 Not valid

DECOMPRESS Not valid

ENC_BASE64 Not valid

ERROR String argument is not valid.

FIRST Not valid

LAST Not valid

MAX (Dates) Not valid

MD5 Not valid

MIN (Dates) Not valid

MOVINGAVG Not valid

MOVINGSUM Not valid
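The table above lends itself to a quick screening of expression text before validation. The following Python sketch is a rough, hypothetical check and is not how the Developer tool validates expressions. It looks only for a function name followed by an opening parenthesis, so a name that appears inside a string literal is also reported.

```python
import re
from typing import List

# Functions the table above lists as "Not valid". ABORT, ERROR, MAX (Dates),
# and MIN (Dates) are omitted because they are limited rather than excluded.
NOT_VALID_IN_HIVE = {
    "AES_DECRYPT", "AES_ENCRYPT", "COMPRESS", "CRC32", "CUME", "DECODE",
    "DEC_BASE64", "DECOMPRESS", "ENC_BASE64", "FIRST", "LAST", "MD5",
    "MOVINGAVG", "MOVINGSUM",
}

def invalid_functions(expression: str) -> List[str]:
    """Return the not-valid function names called in the expression, sorted."""
    names = re.findall(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*\(", expression)
    called = {name.upper() for name in names}
    return sorted(called & NOT_VALID_IN_HIVE)
```

For example, an expression that calls MD5 and MOVINGAVG is reported with both names, while an expression that calls only UPPER returns an empty list.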

Variable Ports in a Hive Environment

A transformation that contains a stateful variable port is not valid in a Hive environment.

A stateful variable port refers to values from previous rows.
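The distinction can be illustrated outside the product. The following Python sketch is an analogy, not Informatica code: the first function computes each output from its own row only, while the second carries a value forward from previous rows in the way that a stateful variable port does.

```python
from typing import Iterable, List

def stateless_double(rows: Iterable[int]) -> List[int]:
    """Each output depends only on its own row."""
    return [value * 2 for value in rows]

def stateful_running_total(rows: Iterable[int]) -> List[int]:
    """Each output depends on the rows that came before it."""
    total = 0
    totals = []
    for value in rows:
        total += value  # carries state from the previous rows
        totals.append(total)
    return totals
```

Reordering the input changes the intermediate results of the stateful function but not the per-row results of the stateless one.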

Mappings in a Hive Environment

You can run mappings in a Hive environment. Some differences in processing and configuration apply when you run mappings in a Hive environment.

The following processing differences apply to mappings in a Hive environment:

¨ A mapping is run in low precision mode. The Data Integration Service ignores high precision mode in a Hive environment. Mappings that require high precision mode may fail to run in a Hive environment.

¨ In a Hive environment, sources that have data errors in a column result in a null value for the column. In the native environment, the Data Integration Service does not process the rows that have data errors in a column.

¨ When you cancel a mapping that reads from a flat file source, the file copy process that copies flat file data to HDFS may continue to run. The Data Integration Service logs the command to kill this process in the Hive session log, and cleans up any data copied to HDFS. Optionally, you can run the command to kill the file copy process.

The following configuration differences apply to mappings in a Hive environment:

¨ Set the optimizer level to none or minimal if a mapping validates but fails to run. If you set the optimizer level to use cost-based or semi-join optimization methods, the Data Integration Service ignores this setting at run time and uses the default.

¨ Mappings that contain a Hive source or a Hive target must use the same Hive connection to push the mapping to Hadoop.

¨ The Data Integration Service ignores the data file block size configured for HDFS files in the hdfs-site.xml file. The Data Integration Service uses a default data file block size of 64 MB for HDFS files. To change the data file block size, copy /usr/lib/hadoop/conf/hdfs-site.xml to the following location in the Hadoop distribution directory for the Data Integration Service node: /opt/Informatica/services/shared/hadoop/[Hadoop_distribution_name]/conf. You can also update the data file block size in the following file: /opt/Informatica/services/shared/hadoop/[Hadoop_distribution_name]/conf/hive-default.xml.
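If you maintain the copied configuration file with a script, the block size can be set programmatically. The following Python sketch edits Hadoop-style configuration XML in memory. It is a minimal example under stated assumptions: this guide does not name the block size property, so the name passed in the usage note (dfs.block.size, the name used by older Hadoop releases) is an assumption that you should confirm for your distribution, and the function name is hypothetical. Note that this approach discards XML comments in the file.

```python
import xml.etree.ElementTree as ET

def set_property(config_xml: str, name: str, value: str) -> str:
    """Set or add a <property> in Hadoop-style configuration XML and return the result."""
    root = ET.fromstring(config_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_element = prop.find("value")
            if value_element is None:
                value_element = ET.SubElement(prop, "value")
            value_element.text = value
            break
    else:
        # The property is not present yet, so append a new one.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")
```

For example, set_property(text, "dfs.block.size", str(128 * 1024 * 1024)) would request a 128 MB block size, assuming that property name applies to your Hadoop version.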

Workflows that Run Mappings in a Hive Environment

You can add a mapping configured to run in a Hive environment to a Mapping task in a workflow. When you deploy and run the workflow, the Mapping task runs the mapping.

You might want to run a mapping from a workflow so that you can run multiple mappings sequentially, make a decision during the workflow, or send an email notifying users of the workflow status. Or, you can develop a workflow that runs commands to perform steps before and after the mapping runs.

When a Mapping task runs a mapping configured to run in a Hive environment, do not assign the Mapping task outputs to workflow variables. Mappings that run in a Hive environment do not provide the total number of target, source, and error rows. When a Mapping task includes a mapping that runs in a Hive environment, the task outputs contain a value of zero (0).

Configuring a Mapping to Run in a Hive Environment

You can use the Developer tool to configure a mapping to run in a Hive environment. To configure a mapping, you must specify a Hive validation environment, a Hive run-time environment, and a Hive connection.

1. Open the mapping in the Developer tool.

2. In the Advanced properties, select Hive as the validation environment.

3. In the Run-time properties, select Hive as the run-time environment.

4. In the Run-time properties, select a Hive connection.

Hive Execution Plan

The Data Integration Service generates a Hive execution plan for a mapping when you run a mapping in a Hive environment. A Hive execution plan is a series of Hive tasks that the Hive executor generates after it processes a mapping for a Hive environment.

Hive Execution Plan Details

You can view the details of a Hive execution plan for a mapping from the Developer tool.

The following table describes the properties of a Hive execution plan:


Script Name Name of the Hive script.

Script Hive script that the Data Integration Service generates based on the mapping logic.

Depends On Tasks that the script depends on. Tasks include other scripts and Data Integration Service tasks, like the Start task.
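The Depends On property implies an ordering among the scripts in a plan. The following Python sketch models the three properties above and derives one valid order from the dependencies. The class, the function, and the example script names are hypothetical; the Developer tool does not expose the plan in this form.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # Python 3.9 or later
from typing import Dict, List

@dataclass
class PlanTask:
    """Hypothetical model of one row of the execution plan details."""
    script_name: str
    script: str = ""
    depends_on: List[str] = field(default_factory=list)

def run_order(tasks: List[PlanTask]) -> List[str]:
    """Return task names in an order that respects every Depends On entry."""
    graph: Dict[str, List[str]] = {t.script_name: t.depends_on for t in tasks}
    return list(TopologicalSorter(graph).static_order())
```

For a plan in which script_2 depends on script_1, and script_1 depends on the Start task, the function returns Start, script_1, script_2.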


Viewing the Hive Execution Plan for a Mapping

You can view the Hive execution plan for a mapping that runs in a Hive environment. You do not have to run the mapping to view the Hive execution plan in the Developer tool.

Note: You can also view the Hive execution plan in the Administrator tool.

1. In the Developer tool, open the mapping.

2. Select the Data Viewer tab.

3. Select Show Execution Plan.

The Data Viewer tab shows the details for the Hive execution plan.

Monitoring a Mapping

You can monitor a mapping that is running on a Hadoop cluster.

1. Open the Monitoring tab in the Administrator tool.

2. Select Jobs in the Navigator.

3. Select the mapping job.

4. Click the View Logs for Selected Object button to view the run-time logs for the mapping.

The log shows the results of the Hive queries run by the Data Integration Service. This includes the location of the Hive session logs and the Hive session history file.

5. To view the Hive execution plan for the mapping, select the Hive Query Plan view.

6. To view each script and query included in the Hive execution plan, expand the mapping job node, and select the Hive script or query.

7. To view the MapReduce jobs in Jobtracker for a Hive query, select the query, and then click the job ID link in the Properties view.

Jobtracker opens and shows details about MapReduce jobs that ran or are running on Hadoop.

Logs

The Data Integration Service generates log events when you run a mapping in a Hive environment.

You can view log events relating to different types of errors such as Hive connection failures, Hive query failures, Hive command failures, or other Hadoop job failures. You can find the information about these log events in the following log files:

LDTM log

The Logical DTM logs the results of the Hive queries run for the mapping. You can view the Logical DTM log from the Developer tool or the Administrator tool for a mapping job.

Hive session log

For every Hive script in the Hive execution plan for a mapping, the Data Integration Service opens a Hive session to run the Hive queries. A Hive session updates a log file in the following directory on the Data Integration Service node: <InformaticaInstallationDir>/tomcat/bin/disTemp/. The full path to the Hive session log appears in the LDTM log.

Hadoop log

To view the details about the MapReduce jobs for a Hive query, you can use the Hadoop JobTracker in the Administrator tool to navigate to the Hadoop job page. You can also find the Hadoop JobTracker URL in the LDTM log.
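Because the full path to each Hive session log appears in the LDTM log, you can extract the paths with a short script. The following Python sketch is an assumption-laden example: this guide does not document the format of LDTM log lines, so the code only looks for path-like tokens that contain the tomcat/bin/disTemp directory named above. The function name is hypothetical, and the pattern assumes forward slashes in paths.

```python
import re
from typing import List

# The directory name comes from the text above; the surrounding log line
# format is unknown, so this only looks for path-like tokens that contain it.
SESSION_LOG_PATTERN = re.compile(r"\S*/tomcat/bin/disTemp/\S+")

def hive_session_log_paths(ldtm_log_text: str) -> List[str]:
    """Return unique path-like tokens under tomcat/bin/disTemp, in order of appearance."""
    seen: List[str] = []
    for match in SESSION_LOG_PATTERN.findall(ldtm_log_text):
        path = match.rstrip(".,;]")  # drop trailing punctuation from the sentence
        if path not in seen:
            seen.append(path)
    return seen
```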

Troubleshooting a Mapping in a Hive Environment

When I run a mapping with a Hive source or a Hive target on a different cluster, the Data Integration Service fails to push the mapping to Hadoop with the following error: Failed to execute query [exec0_query_6] with error code [10], error message [FAILED: Error in semantic analysis: Line 1:181 Table not found customer_eur], and SQL state [42000]].

When you run a mapping in a Hive environment, the Hive connection selected for the Hive source or Hive target, and the mapping must be on the same Hive metastore.
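When you triage errors of this kind from log files, it can help to split the message into its parts. The following Python sketch parses messages that have the shape of the error shown above. The pattern is inferred from that single example, so other messages may not match, and the function name is hypothetical.

```python
import re
from typing import Dict, Optional

# Pattern inferred from the single example message in this section.
ERROR_PATTERN = re.compile(
    r"Failed to execute query \[(?P<query>[^\]]+)\]\s*"
    r"with error code \[(?P<code>\d+)\],\s*"
    r"error\s*message \[(?P<message>.*?)\],\s*"
    r"and SQL state\s*\[(?P<sql_state>[^\]]+)\]"
)

def parse_hive_error(text: str) -> Optional[Dict[str, str]]:
    """Extract the query name, error code, message, and SQL state, or None if absent."""
    match = ERROR_PATTERN.search(text)
    return match.groupdict() if match else None
```

For the error above, the parsed message is "FAILED: Error in semantic analysis: Line 1:181 Table not found customer_eur", which points to a table that the Hive metastore in use does not contain.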


CHAPTER 6

Profiles

This chapter includes the following topics:

¨ Profiles Overview

¨ Native and Hadoop Environments

¨ Profile Types on Hadoop

¨ Running a Single Data Object Profile on Hadoop

¨ Running Multiple Data Object Profiles on Hadoop

¨ Monitoring a Profile

¨ Viewing Profile Results

¨ Troubleshooting

Profiles Overview

You can run a profile on HDFS and Hive data sources in the Hadoop environment. The Hadoop environment helps improve the performance. The run-time environment, native Data Integration Service or Hadoop, does not affect the profile results.

You can run a column profile, rule profile, and data domain discovery on a single data object profile in the Hadoop environment. You can perform these profiling capabilities on both native and Hadoop data sources. A native data source is a non-Hadoop source, such as a flat file, relational source, or mainframe source. A Hadoop data source can be either a Hive or HDFS source.

If you use Informatica Developer, you can choose either the native or Hadoop run-time environment to run a profile. If you choose the Hadoop environment, the Developer tool sets the run-time environment in the profile definition. Informatica Analyst supports the native environment that uses the Data Integration Service.

You run a profile in the Hadoop run-time environment from the Developer tool. You validate a data source to run the profile in both native and Hadoop environments. To validate the profile run in the Hadoop environment, you must select a Hive connection. You can then choose to run the profile in either the native or Hadoop run-time environment.

You can view the Hive query plan in the Administrator tool. The Hive query plan consists of one or more scripts that the Data Integration Service generates based on the logic defined in the profile. Each script contains Hive queries that run against the Hive database. One query contains details about the MapReduce job. The remaining queries perform other actions such as creating and dropping tables in the Hive database.

You can use the Monitoring tab of the Administrator tool to monitor a profile and Hive statements running on Hadoop. You can expand a profile job to view the Hive queries generated for the profile. You can also view the run-time log for each profile. The log shows run-time details, such as the time each task runs, the Hive queries that run on Hadoop, and errors that occur.

The Monitoring tab contains the following views:

Properties view

The Properties view shows properties about the selected profile. You can access the MapReduce (MR) details for the profile in Jobtracker from the Properties view. Jobtracker is a Hadoop component that shows the status of MapReduce jobs that run on nodes in the Hadoop environment.

Hive Query Plan view

The Hive Query Plan view shows the Hive query plan for the selected profile.

Native and Hadoop Environments

When you run a profile in the native environment, the Analyst tool or Developer tool submits the profile jobs to the Profiling Service Module. The Profiling Service Module then breaks down the profile jobs into a set of mappings. The Data Integration Service runs these mappings and writes the profile results to the profile warehouse.

The native environment runs the mappings on the same machine where the Data Integration Service runs. The Hadoop environment runs the mappings on a Hadoop cluster. The Data Integration Service pushes the mapping execution to the Hadoop cluster through a Hive connection. This environment makes all the sources, transformations, and Hive and HDFS sources available for profile run.

If you choose a native source for the Hadoop run-time environment, the Data Integration Service runs the profile on Hadoop. You cannot run a Hadoop data source in the native run-time environment.

Supported Data Source and Run-time Environments

In the Developer tool, you can run a profile on native, Hive, and HDFS data sources. You can run a profile on both Hive and HDFS sources in the Hadoop environment.

The following table describes the combination of data source types and run-time environments that Data Explorer supports:

Data Source Type Run-time Environment

Native sources such as flat files, relational sources, and mainframes Native, Hadoop

Hive Hadoop

HDFS Hadoop
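The table above can be expressed as a simple lookup. The following Python sketch encodes the three rows and answers whether a given combination is listed. It restates the table for illustration only; the names are hypothetical and the code is not part of the product.

```python
# Each key is a data source type; each value is the set of run-time
# environments that the table above lists for it.
SUPPORTED_ENVIRONMENTS = {
    "native": {"native", "hadoop"},  # flat files, relational sources, mainframes
    "hive": {"hadoop"},
    "hdfs": {"hadoop"},
}

def can_run_profile(source_type: str, environment: str) -> bool:
    """Return True when the table lists the environment for the source type."""
    supported = SUPPORTED_ENVIRONMENTS.get(source_type.lower(), set())
    return environment.lower() in supported
```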

You cannot run some of the profile definitions in either the native or Hadoop environment. The following table describes some of the run-time scenarios and whether you can run the profile in different run-time environments:

Scenario Hadoop Run-time Environment Native Run-time Environment

Running a profile on a Hive or HDFS source within a mapping specification. No No

Running a profile on a mapping specification with a Hive or HDFS data source. Yes Yes

Running a profile on a logical data object with a Hive or HDFS data source. Yes Yes

Running a column profile on a mapping or mapplet object with a Hive or Hadoop source. No Yes

Comparing the column profile results of two objects in a mapping or mapplet object with a Hive or HDFS source. No Yes

Run-time Environment Setup and Validation

By default, all profiles run in the native run-time environment. You can change the run-time environment to Hadoop in the Developer tool and run a profile. Before you run a profile, you need to verify whether the validation settings in the profile definition match its run-time requirements.

The validation settings determine whether the profile definition suits the native run-time environment, Hadoop run-time environment, or both. The steps to complete the run-time environment setup and validation are as follows:

1. Choose the validation environments. Validation environments are the environments that you want to set up for the profile run. The Developer tool validates the data sources and transformations for these environments. You must choose at least one of the environments. If you choose both environments, you must choose the run-time environment for the profile.

2. Choose the run-time environment. When you choose the run-time environment, the Developer tool saves one of the associated validation environments for profile run. If you choose Hadoop, you must select a Hive connection. The Hive connection helps the Data Integration Service communicate with the Hadoop cluster to push down the mapping execution from the Data Integration Service to the Hadoop cluster.

The validation environments determine whether the sources and transformations that any of the source rules and data domains may contain are valid for the environments. The Developer tool validates a profile definition before you run it.

The following table describes the validation environment settings that you can configure for a profile:

Option Description

Native (Data Integration Service) The Data Integration Service runs the profile.

Hadoop Runs the profile in the Hadoop environment. If you select this option, you must specify the Hive connection.

Hive connection The Hive connection to run a profile in the Hadoop environment.

You can specify both native and Hadoop options when you set up the validation environments for a profile. You choose either Native or Hadoop as the run-time environment.
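The setup rules in this section can be summarized as a small validation routine. The following Python sketch is an interpretation of those rules, not the Developer tool's validation logic. In particular, the rule that the run-time environment must be one of the chosen validation environments is an assumption drawn from the description of step 2, and all names are hypothetical.

```python
from typing import List, Optional, Set

def run_setting_errors(validation_envs: Set[str], runtime_env: str,
                       hive_connection: Optional[str] = None) -> List[str]:
    """Return rule violations for a set of run settings; an empty list means valid."""
    allowed = {"native", "hadoop"}
    errors = []
    if not validation_envs:
        errors.append("choose at least one validation environment")
    if not validation_envs <= allowed or runtime_env not in allowed:
        errors.append("environments must be native or hadoop")
    if runtime_env not in validation_envs:
        # Assumption: the run-time environment is one of the validation environments.
        errors.append("run-time environment must be one of the validation environments")
    if "hadoop" in validation_envs | {runtime_env} and not hive_connection:
        errors.append("Hadoop requires a Hive connection")
    return errors
```

For example, choosing both validation environments with Hadoop as the run-time environment and a Hive connection produces no errors, while choosing Hadoop without a Hive connection produces one.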

Run-time Environment and Profile Performance

In general, you run a profile on Hadoop data in the Hadoop run-time environment. For non-Hadoop data, profiles on smaller data sources run faster in the native run-time environment.

You can run a profile on bigger data sources in the Hadoop run-time environment. In addition to the data size, you also need to consider many other factors such as the network configuration, Data Integration Service configuration, and Hadoop cluster configuration. Unless you need to run non-Hadoop data in the Hadoop run-time environment at a later stage, you run a profile on data in the environment in which it resides.

Profile Types on Hadoop

You can run a column profile, data domain profile, and column profile with rules in the Hadoop environment.

You can run a column profile in the Hadoop environment to determine the characteristics of source columns such as value frequency, percentages, patterns, and datatypes. Run a data domain profile in the Hadoop environment to discover source column data that match predefined data domains based on data and column name rules. You can also run a profile that has associated rules in the Hadoop environment.

Note: Random sampling may not apply when you run a column profile in the Hadoop environment.

Column Profiles on Hadoop

You can import a native or Hadoop data source into the Developer tool and then run a column profile on it. When you create a column profile, you select the columns, set up filters, and sampling options. Column profile results include value frequency distribution, unique values, null values, and datatypes.

Complete the following steps to run a column profile on Hadoop.

1. Open a connection in the Developer tool to import the native or Hadoop source.

2. Import the data source as a data object. The Developer tool saves the data object in the Model repository.

3. Create a profile on the imported data object.

4. Set up the configuration options. These options include validation environment settings, run-time settings, and the Hive connection.

5. Run the profile to view the results.


Rule Profiles on Hadoop

You can run profiles on Hadoop that apply business rules to identify problems in the source data. In the Developer tool, you can create a mapplet and validate the mapplet as a rule for reuse. You can also add a rule to a column profile on Hadoop.

You cannot run profiles that contain stateful functions, such as MOVINGAVG, MOVINGSUM, DECODE, or COMPRESS.

For more information about stateful functions, see Rules and Guidelines for Functions.

Data Domain Discovery on Hadoop

Data domain discovery is the process of discovering logical datatypes in the data sources based on the semantics of data. You can run a data domain profile on Hadoop and view the results in the Developer tool.

Data domain discovery results display statistics about columns that match data domains, including the percentage of matching column data and whether column names match data domains. You can drill down the results further for analysis, verify the results on all the rows of the data source, and add the results to a data model from the profile model.

Running a Single Data Object Profile on Hadoop

After you set up the validation and run-time environments for a profile, you can run the profile to view its results.

1. In the Object Explorer view, select the data object you want to run a profile on.

2. Click File > New > Profile.

The profile wizard appears.

3. Select Profile and click Next.

4. Enter a name and description for the profile and verify the project location. If required, browse to a new location.

Verify that Run Profile on finish is selected.

5. Click Next.

6. Configure the column profiling and domain discovery options.

7. Click Run Settings.

The Run Settings pane appears.

8. Select Hive as the validation environment.

You can select both Native and Hive as the validation environments.

9. Select Hive as the run-time environment.

10. Select a Hive connection.

11. Click Finish.

Running Multiple Data Object Profiles on Hadoop

You can run a column profile on multiple data source objects. The Developer tool uses default column profiling options to generate the results for multiple data sources.

1. In the Object Explorer view, select the data objects you want to run a profile on.

2. Click File > New > Profile to open the New Profile wizard.

3. Select Multiple Profiles and click Next.

4. Select the location where you want to create the profiles. You can create each profile at the same location as the data object, or you can specify a common location for the profiles.

5. Verify that the names of the data objects you selected appear within the Data Objects section.

Optionally, click Add to add another data object.

6. Optionally, specify the number of rows to profile, and choose whether to run the profile when the wizard completes.

7. Click Next.

The Run Settings pane appears. You can specify the Hive settings.

8. Select Hive and select a Hive connection.

You can select both Native and Hive as the validation environments.

9. In the Run-time Environment field, select Hive.

10. Click Finish.

11. Optionally, enter prefix and suffix strings to add to the profile names.

12. Click OK.

Monitoring a Profile

You can monitor a profile that is running on Hadoop.

1. Open the Monitoring tab in the Administrator tool.

2. Select Jobs in the Navigator.

3. Select the profiling job.

4. Click the View Logs for Selected Object button to view the run-time logs for the profile.

The log shows all the Hive queries that the Data Integration Service ran on the Hadoop cluster.

5. To view the Hive query plan for the profile, select the Hive Query Plan view.

You can also view the Hive query plan in the Developer tool.

6. To view each script and query included in the Hive query plan, expand the profiling job node, and select the Hive script or query.

7. To view the MapReduce jobs in JobTracker for a Hive query, select the query, and then click the job ID link in the Properties view.

JobTracker opens and shows details about MapReduce jobs that ran or are running on Hadoop.


Viewing Profile Results

You can view the column profile and data domain discovery results after you run a profile on Hadoop.

1. In the Object Explorer view, select the profile you want to view the results for.

2. Right-click the profile and select Run Profile.

The Run Profile dialog box appears.

3. Click the Results tab, if not selected already, in the right pane.

You can view the column profile and data domain discovery results in separate panes.

Troubleshooting

Can I drill down on profile results if I run a profile in the Hadoop environment?

Yes, except for profiles in which you have set the option to drill down on staged data.

I get the following error message when I run a profile in the Hadoop environment: "[LDTM_1055] The Integration Service failed to generate a Hive workflow for mapping [Profile_CUSTOMER_INFO12_14258652520457390]." How do I resolve this?

This error can result from a data source, rule transformation, or run-time environment that is not supported in the Hadoop environment. For more information about objects that are not valid in the Hadoop environment, see "Rules and Guidelines for Running Mappings in a Hadoop Environment."

You can change the data source, rule, or run-time environment and run the profile again. View the profile log file for more information about the error.

I see "N/A" in the profile results for all columns after I run a profile. How do I resolve this?

Verify that the profiling results are in the profiling warehouse. If you do not see the profile results, verify that the database path is accurate in the HadoopEnv.properties file. You can also verify the database path from the Hadoop job tracker on the Monitoring tab of the Administrator tool.

After I run a profile on a Hive source, I do not see the results. When I verify the Hadoop job tracker in the Administrator tool, I see the following error when I open the profile job: "XML Parsing Error: no element found." What does this mean?

The Hive data source is empty. The data source must contain at least one row of data for the profile run to succeed.

After I run a profile on a Hive source, I cannot view some of the column patterns. Why?

When you import a Hive source, the Developer tool sets the precision for string columns to 4000. The Developer tool cannot derive the pattern for a string column with a precision greater than 255. To resolve this issue, set the precision of these string columns in the data source to 255 and run the profile again.
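The precision rule above can be expressed as a simple check. The following Python sketch is illustrative only: the column list, the `MAX_PATTERN_PRECISION` constant, and the `columns_needing_lower_precision` function are hypothetical and are not part of any Informatica API. Only the limits 4000 and 255 come from the text.

```python
# Limit stated in the text: patterns cannot be derived above this precision.
MAX_PATTERN_PRECISION = 255


def columns_needing_lower_precision(columns):
    """Return the names of string columns whose precision exceeds the limit.

    `columns` is a hypothetical list of (name, datatype, precision) tuples.
    """
    return [
        name
        for name, datatype, precision in columns
        if datatype == "string" and precision > MAX_PATTERN_PRECISION
    ]


# Made-up column metadata for illustration.
example_columns = [
    ("customer_name", "string", 4000),  # default precision after a Hive import
    ("country_code", "string", 2),
    ("order_total", "double", 15),
]

flagged = columns_needing_lower_precision(example_columns)
print(flagged)
```

Columns returned by the check are the ones whose precision you would lower to 255 in the data source before you run the profile again.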

When I run a profile on large Hadoop sources, the profile job fails and I get an "execution failed" error. What is a possible cause?

One possible cause is a connection issue. Perform the following steps to identify and resolve the connection issue:

1. Go to the Monitoring tab in the Administrator tool.

2. Open the Hadoop job tracker.

3. Identify the profile job and open it to view the MapReduce jobs.

4. Click the hyperlink for the failed job to view the error message. If the error message contains the text "java.net.ConnectException: Connection refused", the problem occurred because of an issue with the Hadoop cluster. Contact your network administrator to resolve the issue.
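If you copy the error message out of the job tracker, the check in step 4 amounts to a substring test. The following Python sketch assumes you have the message as a string. The function name and the sample message are hypothetical; only the search text comes from the step above.

```python
# Text that step 4 tells you to look for in the failed job's error message.
CONNECTION_REFUSED = "java.net.ConnectException: Connection refused"


def is_cluster_connection_issue(error_message):
    """Return True if the message points to a Hadoop cluster connection issue."""
    return CONNECTION_REFUSED in error_message


# Made-up message for illustration; a real message will contain more detail.
sample_error = (
    "Job failed: java.net.ConnectException: Connection refused "
    "(hypothetical sample message)"
)

print(is_cluster_connection_issue(sample_error))
```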


CHAPTER 7

Native Environment Optimization

This chapter includes the following topics:

• Native Environment Optimization Overview

• Processing Big Data on a Grid

• Processing Big Data on Partitions

• High Availability

Native Environment Optimization Overview

You can optimize the native environment to increase performance. To increase performance, you can configure the Integration Service to run on a grid and create partitions for PowerCenter sessions. You can also enable high availability to ensure that the domain can continue running despite temporary network, hardware, or service failures.

You can run profiles, sessions, and workflows on a grid to increase the processing bandwidth. A grid is an alias assigned to a group of nodes that run profiles, sessions, and workflows. When you enable grid, the Integration Service runs a service process on each available node of the grid to increase performance and scalability.

You can also run a PowerCenter session with partitioning to increase session performance. When you create partitions for a PowerCenter session, the PowerCenter Integration Service performs the extract, transformation, and load for each partition in parallel.

You can configure high availability for the domain. High availability eliminates a single point of failure in a domain and provides minimal service interruption in the event of failure.

Processing Big Data on a Grid

You can run an Integration Service on a grid to increase the processing bandwidth. When you enable grid, the Integration Service runs a service process on each available node of the grid to increase performance and scalability.

Big data may require additional bandwidth to process large amounts of data. For example, when you run a Model repository profile on an extremely large data set, the Data Integration Service grid splits the profile into multiple mappings and runs the mappings simultaneously on different nodes in the grid.

Data Integration Service Grid

You can run Model repository mappings and profiles on a Data Integration Service grid.

When you run mappings on a grid, the Data Integration Service distributes the mappings to multiple DTM processes on nodes in the grid. When you run a profile on a grid, the Data Integration Service splits the profile into multiple mappings and distributes the mappings to multiple DTM processes on nodes in the grid.
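The split-and-distribute idea can be pictured with a small conceptual sketch. The following Python code is not how the Data Integration Service is implemented. It assumes, for illustration only, that a profile is split by column, that each resulting "mapping" counts null values, and that a thread pool stands in for the DTM processes on grid nodes. All names and the sample rows are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up source rows for illustration.
ROWS = [
    {"id": 1, "city": "Austin", "age": 34},
    {"id": 2, "city": None, "age": 41},
    {"id": 3, "city": "Austin", "age": None},
]


def split_profile(columns, mapping_count):
    """Split the profiled columns round-robin into one group per mapping."""
    groups = [columns[i::mapping_count] for i in range(mapping_count)]
    return [group for group in groups if group]


def run_mapping(column_group):
    """Stand-in for one mapping: count null values for each column."""
    return {
        column: sum(1 for row in ROWS if row[column] is None)
        for column in column_group
    }


def run_profile_on_grid(columns, node_count):
    """Run the mappings concurrently and merge the partial results."""
    results = {}
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        for partial in pool.map(run_mapping, split_profile(columns, node_count)):
            results.update(partial)
    return results


null_counts = run_profile_on_grid(["id", "city", "age"], node_count=2)
print(null_counts)
```

In this sketch, more workers allow more mappings to run at the same time, which mirrors the reason for adding nodes to a grid.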

For more information about the Data Integration Service grid, see the Informatica Administrator Guide.

PowerCenter Integration Service Grid

You can run PowerCenter repository sessions and workflows on a PowerCenter Integration Service grid.

When you run a session on a grid, the PowerCenter Integration Service distributes session threads to multiple DTM processes on nodes in the grid. When you run a workflow on a grid, the PowerCenter Integration Service distributes the workflow and tasks included in the workflow across the nodes in the grid.

For more information about the PowerCenter Integration Service grid, see the PowerCenter Advanced Workflow Guide.

Grid Optimization

You can optimize the grid to increase performance and scalability of the Data Integration Service or PowerCenter Integration Service.

To optimize the grid, complete the following tasks:

Add nodes to the grid.

Add nodes to the grid to increase processing bandwidth of the Integration Service.

Use a high-throughput network.

Use a high-throughput network when you access sources and targets over the network or when you run PowerCenter sessions on a grid.

Store files in an optimal storage location for the PowerCenter Integration Service processes.

Store files on a shared file system when all of the PowerCenter Integration Service processes need to access the files. You can store files on low-bandwidth and high-bandwidth shared file systems. Place files that are accessed often on a high-bandwidth shared file system. Place files that are accessed less often on a low-bandwidth shared file system.

When only one PowerCenter Integration Service process has to access a file, store the file on the local machine that runs the Integration Service process instead of on a shared file system.

For more information, see the PowerCenter Performance Tuning Guide.

Processing Big Data on Partitions

You can run a PowerCenter session with partitioning to increase session performance. When you run a PowerCenter session configured with partitioning, the PowerCenter Integration Service performs the extract, transformation, and load for each partition in parallel.

For more information, see the PowerCenter Advanced Workflow Guide.
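As a conceptual illustration of per-partition parallelism, the following Python sketch runs an extract, transform, and load step for each partition on its own worker thread. It is not PowerCenter code. The round-robin partitioning, the doubling transformation, and every function name are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def extract(partition):
    """Stand-in for reading one partition of the source."""
    return list(partition)


def transform(rows):
    """Stand-in transformation: double each value."""
    return [value * 2 for value in rows]


def load(rows):
    """Stand-in for writing to the target: return the rows that were written."""
    return rows


def run_partition(partition):
    """Run extract, transform, and load for a single partition."""
    return load(transform(extract(partition)))


def run_session(source_rows, partition_count):
    """Split the source round-robin and process all partitions in parallel."""
    partitions = [source_rows[i::partition_count] for i in range(partition_count)]
    with ThreadPoolExecutor(max_workers=partition_count) as pool:
        loaded = list(pool.map(run_partition, partitions))
    return [row for rows in loaded for row in rows]


target_rows = run_session([1, 2, 3, 4, 5, 6], partition_count=3)
print(sorted(target_rows))
```

Each partition is processed independently, so the order of rows across partitions is not guaranteed to match the source order. The example sorts the result only for display.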

Partition Optimization

You can optimize the partitioning of PowerCenter sessions to improve session performance. You can add more partitions, select the best performing partition types, use more CPUs, and optimize the source or target database for partitioning.

To optimize partitioning, perform the following tasks:

Increase the number of partitions.

Increase the number of partitions to enable the PowerCenter Integration Service to create multiple connections to sources and process partitions of source data concurrently. Increasing the number of partitions or partition points increases the number of threads, which also increases the load on the nodes in the Integration Service. If the Integration Service node or nodes contain ample CPU bandwidth, processing rows of data in a session concurrently can increase session performance.

Note: If you use a single-node Integration Service and you create a large number of partitions or partition points in a session that processes large amounts of data, you can overload the system.

Select the best performing partition types at particular points in a pipeline.

Select the best performing partition type to optimize session performance. For example, use the database partitioning partition type for source and target databases.

Use multiple CPUs.

If you have a symmetric multi-processing (SMP) platform, you can use multiple CPUs to concurrently process session data or partitions of data.

Optimize the source database for partitioning.

You can optimize the source database for partitioning. For example, you can tune the database, enable parallel queries, separate data into different table spaces, and group sorted data.

Optimize the target database for partitioning.

You can optimize the target database for partitioning. For example, you can enable parallel inserts into the database and use a Router transformation to enable each partition to write to a single database partition.

For more information, see the PowerCenter Performance Tuning Guide.

High Availability

High availability eliminates a single point of failure in an Informatica domain and provides minimal service interruption in the event of failure. When you configure high availability for a domain, the domain can continue running despite temporary network, hardware, or service failures. You can configure high availability among the Service Manager, PowerCenter and PowerExchange application services, PowerCenter Client, and command line programs.

The following high availability components make services highly available in an Informatica domain:

• Resilience. The ability of an Informatica domain to tolerate temporary connection failures until either the resilience timeout expires or the failure is fixed.

• Restart and failover. The restart of a service or task or the migration to a backup node after the service becomes unavailable on the primary node.

• Recovery. The completion of operations after a service is interrupted. After a service process restarts or fails over, it restores the service state and recovers operations.

When you plan a highly available Informatica environment, consider the differences between internal Informatica components and systems that are external to Informatica. Internal components include the Service Manager, application services, the PowerCenter Client, and command line programs. External systems include the network, hardware, database management systems, FTP servers, message queues, and shared storage.

If you have the high availability option, you can achieve full high availability of internal Informatica components. You can achieve high availability with external components based on the availability of those components. If you do not have the high availability option, you can achieve some high availability of internal components.

Example

While you are fetching a mapping into the PowerCenter Designer workspace, the PowerCenter Repository Service becomes unavailable, and the request fails. The PowerCenter Repository Service fails over to another node because it cannot restart on the same node.

The PowerCenter Designer is resilient to temporary failures and tries to establish a connection to the PowerCenter Repository Service. The PowerCenter Repository Service starts within the resilience timeout period, and the PowerCenter Designer reestablishes the connection.

After the PowerCenter Designer reestablishes the connection, the PowerCenter Repository Service recovers from the failed operation and fetches the mapping into the PowerCenter Designer workspace.
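The resilience behavior in this example can be approximated as a retry loop that gives up when the resilience timeout expires. The following Python sketch is a conceptual illustration, not Informatica client code. The `connect_with_resilience` function, the `FlakyService` class, and the timeout values are all hypothetical.

```python
import time


def connect_with_resilience(connect, resilience_timeout, retry_interval=0.01):
    """Retry `connect` until it succeeds or the resilience timeout expires."""
    deadline = time.monotonic() + resilience_timeout
    while True:
        try:
            return connect()
        except ConnectionError:
            # Give up only after the timeout has passed; otherwise retry.
            if time.monotonic() >= deadline:
                raise
            time.sleep(retry_interval)


class FlakyService:
    """Hypothetical service that is unavailable for the first few attempts."""

    def __init__(self, failures):
        self.remaining_failures = failures

    def connect(self):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("service unavailable")
        return "connected"


# The service comes back within the timeout, so the client reconnects.
service = FlakyService(failures=2)
status = connect_with_resilience(service.connect, resilience_timeout=1.0)
print(status)
```

If the service does not return before the timeout expires, the last connection error is raised to the caller, which corresponds to the point at which resilience alone can no longer hide the failure.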

APPENDIX A

Datatype Reference

This appendix includes the following topics:

• Datatype Reference Overview

• Hive Complex Datatypes

• Hive Datatypes and Transformation Datatypes

Datatype Reference Overview

Informatica Developer uses the following datatypes in Hive mappings:

• Hive native datatypes. Hive datatypes appear in the physical data object column properties.

• Transformation datatypes. Set of datatypes that appear in the transformations. They are internal datatypes based on ANSI SQL-92 generic datatypes, which the Data Integration Service uses to move data across platforms. Transformation datatypes appear in all transformations in a mapping.

When the Data Integration Service reads source data, it converts the native datatypes to the comparable transformation datatypes before transforming the data. When the Data Integration Service writes to a target, it converts the transformation datatypes to the comparable native datatypes.

Hive Complex Datatypes

Hive complex datatypes such as arrays, maps, and structs are a composite of primitive or complex datatypes. Informatica Developer represents the complex datatypes with the string datatype and uses delimiters to separate the elements of the complex datatype.

Note: Hive complex datatypes in a Hive source or Hive target are not supported when you run mappings in a Hadoop cluster.

The following table describes the transformation types and delimiters that are used to represent the complex datatypes:

| Complex Datatype | Description |
|---|---|
| Array | The elements in the array are of string datatype. Each element of the array is delimited by commas. For example, an array of fruits is represented as [apple,banana,orange]. |
| Map | Maps contain key-value pairs and are represented as pairs of strings and integers delimited by the = character. Each string and integer pair is delimited by commas. For example, a map of fruits is represented as [1=apple,2=banana,3=orange]. |
| Struct | Structs are represented as pairs of strings and integers delimited by the : character. Each string and integer pair is delimited by commas. For example, a struct of fruits is represented as [1,apple]. |
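To make the delimited string representation concrete, the following Python sketch parses the array and map examples from the table into native Python structures. It is a minimal illustration that assumes the values themselves do not contain the bracket, comma, or = characters. The function names are hypothetical and are not part of Informatica Developer.

```python
def parse_array(text):
    """Parse a string such as "[apple,banana,orange]" into a list of strings."""
    inner = text.strip()[1:-1]  # drop the surrounding brackets
    return inner.split(",") if inner else []


def parse_map(text):
    """Parse a string such as "[1=apple,2=banana]" into a dict of strings."""
    inner = text.strip()[1:-1]
    result = {}
    if inner:
        for pair in inner.split(","):
            key, value = pair.split("=", 1)
            result[key] = value
    return result


fruits_array = parse_array("[apple,banana,orange]")
fruits_map = parse_map("[1=apple,2=banana,3=orange]")
print(fruits_array)
print(fruits_map)
```

The sketch keeps map keys as strings because the complex datatype as a whole is carried in a string column. Converting keys to integers would be an additional assumption about the data.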

Hive Datatypes and Transformation Datatypes

The following table lists the Hive datatypes that the Data Integration Service supports and the corresponding transformation datatypes:

| Hive Datatype | Transformation Datatype | Range and Description |
|---|---|---|
| Tiny Int | Integer | -32,768 to 32,767 |
| Integer | Integer | -2,147,483,648 to 2,147,483,647. Precision 10, scale 0 |
| Bigint | Bigint | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. Precision 19, scale 0 |
| Double | Double | Precision 15 |
| Float | Double | Precision 15 |
| String | String | 1 to 104,857,600 characters |
| Boolean* | Integer | 1 or 0 |
| Arrays | String | 1 to 104,857,600 characters |
| Struct | String | 1 to 104,857,600 characters |
| Maps | String | 1 to 104,857,600 characters |

* The default transformation type for boolean is integer. You can also set this to string datatype with values of True and False.
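The mapping in the table can be captured as a lookup for quick reference in scripts or tests. The following Python sketch transcribes the table and the boolean footnote. The dictionary name, the function, and the `boolean_as_string` flag are hypothetical conveniences, not Informatica settings.

```python
# Hive datatype -> transformation datatype, transcribed from the table above.
HIVE_TO_TRANSFORMATION = {
    "tiny int": "integer",
    "integer": "integer",
    "bigint": "bigint",
    "double": "double",
    "float": "double",
    "string": "string",
    "boolean": "integer",  # default; see the footnote about the string option
    "arrays": "string",
    "struct": "string",
    "maps": "string",
}


def transformation_datatype(hive_datatype, boolean_as_string=False):
    """Look up the transformation datatype for a Hive datatype name."""
    key = hive_datatype.strip().lower()
    if key == "boolean" and boolean_as_string:
        return "string"
    return HIVE_TO_TRANSFORMATION[key]


print(transformation_datatype("Float"))
print(transformation_datatype("Boolean", boolean_as_string=True))
```

An unknown datatype name raises a `KeyError`, which is a deliberate choice in this sketch so that unsupported types are noticed rather than silently mapped.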

APPENDIX B

Glossary

A

Apache Hadoop

An open-source software framework that supports data-intensive distributed applications.

B

big data

A set of data that is so large and complex that it cannot be processed through standard database management tools.

C

Cloudera's Distribution Including Apache Hadoop (CDH)

Cloudera's version of the open-source Hadoop software framework.

CompressionCodec

Hadoop compression interface. A codec is the implementation of a compression-decompression algorithm. In Hadoop, a codec is represented by an implementation of the CompressionCodec interface.

D

DataNode

An HDFS node that stores data in the Hadoop File System. An HDFS cluster can have more than one DataNode, with data replicated across them.

H

Hadoop cluster

A cluster of machines that is configured to run Hadoop applications and services. A typical Hadoop cluster includes a master node and several worker nodes. The master node runs the master daemons JobTracker and NameNode. A slave or worker node runs the DataNode and TaskTracker daemons. In small clusters, the master node may also run the slave daemons.

Hadoop Distributed File System (HDFS)

A distributed file storage system used by Hadoop applications.

Hive environment

An environment that you can configure to run a mapping or a profile on a Hadoop cluster. You must configure Hive as the validation and run-time environment.

Hive

A data warehouse infrastructure built on top of Hadoop. Hive supports an SQL-like language called HiveQL for data summarization, query, and analysis.

Hive executor

A component of the DTM that can simplify and convert a mapping or a profile to a Hive execution plan that runs on a Hadoop cluster.

Hive execution plan

A series of Hive tasks that the Hive executor generates after it processes a mapping or a profile. A Hive execution plan can also be referred to as a Hive workflow.

Hive scripts

Scripts in the Hive query language that contain Hive queries and Hive commands to run the mapping.

Hive task

A task in the Hive execution plan. A Hive execution plan contains many Hive tasks. A Hive task contains a Hive script.

J

JobTracker

A Hadoop service that coordinates map and reduce tasks and schedules them to run on TaskTrackers.

M

MapReduce

A programming model for processing large volumes of data in parallel.

MapReduce job

A unit of work that consists of the input data, the MapReduce program, and configuration information. Hadoop runs the MapReduce job by dividing it into map tasks and reduce tasks.

metastore

A database that Hive uses to store metadata of the Hive tables stored in HDFS. Metastores can be local, embedded, or remote.

N

NameNode

A node in the Hadoop cluster that manages the file system namespace and maintains the file system tree and the metadata for all the files and directories in the tree.

native environment

The default environment in the Informatica domain that runs a mapping, a workflow, or a profile. The Integration Service performs data extraction, transformation, and loading.

R

run-time environment

The environment you configure to run a mapping or a profile. The run-time environment can be native or Hive.

S

stateful variable port

A variable port that refers to values from previous rows.

T

TaskTracker

A node in the Hadoop cluster that runs tasks such as map or reduce tasks. TaskTrackers send progress reports to the JobTracker.

V

validation environment

The environment you configure to validate a mapping or a profile. You validate a mapping or a profile to ensure that it can run in a run-time environment. The validation environment can be Hive, native, or both.
