HP Vertica Analytics Platform 7.0.x
Hadoop Integration Guide
HP Vertica Analytic Database
Software Version: 7.0.x
Document Release Date: 5/2/2018

Legal Notices

Warranty
The only warranties for Micro Focus products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. Micro Focus shall not be liable for technical or editorial errors or omissions contained herein.

The information contained herein is subject to change without notice.

Restricted Rights Legend
Confidential computer software. Valid license from Micro Focus required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license.

Copyright Notice
© Copyright 2006 - 2014 Micro Focus International plc.

Trademark Notices
Adobe® is a trademark of Adobe Systems Incorporated.

Microsoft® and Windows® are U.S. registered trademarks of Microsoft Corporation.

UNIX® is a registered trademark of The Open Group.

This document may contain branding from Hewlett-Packard Company (now HP Inc.) and Hewlett Packard Enterprise Company. As of September 1, 2017, the document is now offered by Micro Focus, a separately owned and operated company. Any reference to the HP and Hewlett Packard Enterprise/HPE marks is historical in nature, and the HP and Hewlett Packard Enterprise/HPE marks are the property of their respective owners.


Contents

Contents 3

Using the Vertica Connector for Hadoop Map Reduce 7

Prerequisites 7

How Hadoop and Vertica Work Together 7

Hadoop and Vertica Cluster Scaling 8

Vertica Connector for Hadoop Features 8

Installing the Connector 8

Accessing Vertica Data From Hadoop 11

Selecting VerticaInputFormat 11

Setting the Query to Retrieve Data From Vertica 12

Using a Simple Query to Extract Data From Vertica 12

Using a Parameterized Query and Parameter Lists 13

Using a Discrete List of Values 13

Using a Collection Object 13

Scaling Parameter Lists for the Hadoop Cluster 14

Using a Query to Retrieve Parameter Values for a Parameterized Query 14

Writing a Map Class That Processes Vertica Data 15

Working with the VerticaRecord Class 15

Writing Data to Vertica From Hadoop 17

Configuring Hadoop to Output to Vertica 17

Defining the Output Table 17

Writing the Reduce Class 18

Storing Data in the VerticaRecord 19

Passing Parameters to the Vertica Connector for Hadoop Map Reduce At Run Time 22

Specifying the Location of the Connector .jar File 22

Specifying the Database Connection Parameters 22

Parameters for a Separate Output Database 23

Example Vertica Connector for Hadoop Map Reduce Application 23

Compiling and Running the Example Application 27


Compiling the Example (optional) 28

Running the Example Application 29

Verifying the Results 31

Using Hadoop Streaming with the Vertica Connector for Hadoop Map Reduce 31

Reading Data From Vertica in a Streaming Hadoop Job 31

Writing Data to Vertica in a Streaming Hadoop Job 33

Loading a Text File From HDFS into Vertica 35

Accessing Vertica From Pig 37

Registering the Vertica .jar Files 37

Reading Data From Vertica 37

Writing Data to Vertica 38

Using the Vertica Connector for HDFS 39

Vertica Connector for HDFS Requirements 39

Kerberos Authentication Requirements 40

Testing Your Hadoop WebHDFS Configuration 40

Installing the Vertica Connector for HDFS 42

Downloading and Installing the Vertica Connector for HDFS Package 42

Loading the HDFS User Defined Source 43

Loading Data Using the Vertica Connector for HDFS 44

Copying Files in Parallel 45

Viewing Rejected Rows and Exceptions 46

Creating an External Table Based on HDFS Files 46

Load Errors in External Tables 47

Vertica Connector for HDFS Troubleshooting Tips 48

Installing and Configuring Kerberos on Your Vertica Cluster 49

Installing and Configuring Kerberos on Your Vertica Cluster 50

Preparing Keytab Files for the Vertica Connector for HDFS 51

Using the HCatalog Connector 53

Hive, HCatalog, and WebHCat Overview 53

HCatalog Connection Features 53

HCatalog Connection Considerations 54


How the HCatalog Connector Works 54

HCatalog Connector Requirements 55

Vertica Requirements 55

Hadoop Requirements 55

Testing Connectivity 56

Installing the Java Runtime on Your Vertica Cluster 57

Installing a Java Runtime 57

Setting the JavaBinaryForUDx Configuration Parameter 58

Defining a Schema Using the HCatalog Connector 59

Querying Hive Tables Using HCatalog Connector 60

Viewing Hive Schema and Table Metadata 61

Synching an HCatalog Schema With a Local Schema 65

Data Type Conversions from Hive to Vertica 67

Data-Width Handling Differences Between Hive and Vertica 67

Using Non-Standard SerDes 68

Determining Which SerDe You Need 68

Installing the SerDe on the Vertica Cluster 69

Troubleshooting HCatalog Connector Problems 69

Connection Errors 69

SerDe Errors 70

Differing Results Between Hive and Vertica Queries 71

Preventing Excessive Query Delays 71

Integrating Vertica with the MapR Distribution of Hadoop 73

We appreciate your feedback! 75


Using the Vertica Connector for Hadoop Map Reduce

Apache Hadoop is a software platform for performing distributed data processing. Micro Focus has created an interface between Vertica and Hadoop that lets you take advantage of both platforms' strengths. Using the interface, a Hadoop application can access and store data in a Vertica database.

You use the Vertica Connector for Hadoop Map Reduce when you want Hadoop to be able to read data from and write data to Vertica. If you want Vertica to import data stored in a Hadoop Distributed File System (HDFS), use the Vertica Connector for HDFS. See Using the Vertica Connector for HDFS.

Prerequisites

Before you can use the Vertica Connector for Hadoop Map Reduce, you must install and configure Hadoop and be familiar with developing Hadoop applications. For details on installing and using Hadoop, please see the Apache Hadoop Web site.

See Vertica 7.0.x Supported Platforms for a list of the versions of Hadoop and Pig that the connector supports.

How Hadoop and Vertica Work Together

Hadoop and Vertica share some common features—they are both platforms that use clusters of hosts to store and operate on very large sets of data. Their key difference is the type of data they work with best. Hadoop is best suited for tasks involving unstructured data such as natural language content. Vertica works best on structured data that can be loaded into database tables.

The Vertica Connector for Hadoop Map Reduce lets you use these two platforms together to take advantage of their strengths. With the connector, you can use data from Vertica in a Hadoop job, and store the results of a Hadoop job in a Vertica database. For example, you can use Hadoop to extract and process keywords from a mass of unstructured text messages from a social media site, turning the material into structured data that is then loaded into a Vertica database. Once you have loaded the data into Vertica, you can perform many different analytic queries on the data far faster than by using Hadoop alone.

For more on how Hadoop and Vertica work together, see the Vertica Map Reduce Web page.

Note: Vertica supports two different programming interfaces—user defined functions (UDFs) and external procedures—which can handle complex processing tasks that are hard to resolve using SQL. You should consider using either of these interfaces instead of Hadoop if possible, since they do not require the overhead of maintaining a Hadoop cluster.

Since code written to use either user defined functions or external procedures runs on the Vertica cluster, it can impact the performance of your database. In general, if your application has a straightforward operation that is hard to do in SQL, you should consider using UDFs or external procedures rather than resorting to Hadoop. If your application needs to perform significant data extraction and processing, Hadoop is often a better solution, because the processing takes place off of the Vertica cluster.

Hadoop and Vertica Cluster Scaling

Nodes in the Hadoop cluster connect directly to Vertica nodes when retrieving or storing data, allowing large volumes of data to be transferred in parallel. If the Hadoop cluster is larger than the Vertica cluster, this parallel data transfer can negatively impact the performance of the Vertica database.

To avoid performance impacts on your Vertica database, you should ensure that your Hadoop cluster cannot overwhelm your Vertica cluster. The exact sizing of each cluster depends on how fast your Hadoop cluster generates data requests and the load placed on the Vertica database by queries from other sources. A good rule of thumb to follow is for your Hadoop cluster to be no larger than your Vertica cluster.

Vertica Connector for Hadoop Features

The Vertica Connector for Hadoop Map Reduce:

- gives Hadoop access to data stored in Vertica.

- lets Hadoop store its results in Vertica. The Connector can create a table for the Hadoop data if it does not already exist.

- lets applications written in Apache Pig access and store data in Vertica.

- works with Hadoop streaming.

The Connector runs on each node in the Hadoop cluster, so the Hadoop nodes and Vertica nodes communicate with each other directly. Direct connections allow data to be transferred in parallel, dramatically increasing processing speed.

The Connector is written in Java, and is compatible with all platforms supported by Hadoop.

Note: To prevent Hadoop from potentially inserting multiple copies of data into Vertica, the Vertica Connector for Hadoop Map Reduce disables Hadoop's speculative execution feature.

Installing the Connector

Follow these steps to install the Vertica Connector for Hadoop Map Reduce:

If you have not already done so, download the Vertica Connector for Hadoop Map Reduce installation package from the myVertica portal. Be sure to download the package that is compatible with your version of Hadoop. You can find your Hadoop version by issuing the following command on a Hadoop node:


# hadoop version

You will also need a copy of the Vertica JDBC driver which you can also download from the myVertica portal.

You need to perform the following steps on each node in your Hadoop cluster:

1. Copy the Vertica Connector for Hadoop Map Reduce .zip archive you downloaded to a temporary location on the Hadoop node.

2. Copy the Vertica JDBC driver .jar file to the same location on your node. If you haven't already, you can download this driver from the myVertica portal.

3. Unzip the connector .zip archive into a temporary directory. On Linux, you usually use the command unzip.

4. Locate the Hadoop home directory (the directory where Hadoop is installed). The location of this directory depends on how you installed Hadoop (manual install versus a package supplied by your Linux distribution or Cloudera). If you do not know the location of this directory, you can try the following steps:

   - See if the HADOOP_HOME environment variable is set by issuing the command echo $HADOOP_HOME on the command line.

   - See if Hadoop is in your path by typing hadoop classpath on the command line. If it is, this command lists the paths of all the jar files used by Hadoop, which should tell you the location of the Hadoop home directory.

   - If you installed using a .deb or .rpm package, you can look in /usr/lib/hadoop, as this is often the location where these packages install Hadoop.

5. Copy the file hadoop-vertica.jar from the directory where you unzipped the connector archive to the lib subdirectory in the Hadoop home directory.

6. Copy the Vertica JDBC driver file (vertica_x.x.x_jdk_5.jar) to the lib subdirectory in the Hadoop home directory ($HADOOP_HOME/lib).

7. Edit the $HADOOP_HOME/conf/hadoop-env.sh file, and find the lines:

# Extra Java CLASSPATH elements. Optional.
# export HADOOP_CLASSPATH=

Uncomment the export line by removing the hash character (#) and add the absolute path of the JDBC driver file you copied in the previous step. For example:

export HADOOP_CLASSPATH=$HADOOP_HOME/lib/vertica_x.x.x_jdk_5.jar

This environment variable ensures that Hadoop can find the Vertica JDBC driver.


8. Also in the $HADOOP_HOME/conf/hadoop-env.sh file, ensure that the JAVA_HOME environment variable is set to your Java installation.

9. If you want your application written in Pig to be able to access Vertica, you need to:

   a. Locate the Pig home directory. Often, this directory is in the same parent directory as the Hadoop home directory.

   b. Copy the file named pig-vertica.jar from the directory where you unpacked the connector .zip file to the lib subdirectory in the Pig home directory.

   c. Copy the Vertica JDBC driver file (vertica_x.x.x_jdk_5.jar) to the lib subdirectory in the Pig home directory.


Accessing Vertica Data From Hadoop

You need to follow three steps to have Hadoop fetch data from Vertica:

- Set the Hadoop job's input format to be VerticaInputFormat.

- Give the VerticaInputFormat class a query to be used to extract data from Vertica.

- Create a Mapper class that accepts VerticaRecord objects as input.

The following sections explain each of these steps in greater detail.

Selecting VerticaInputFormat

The first step to reading Vertica data from within a Hadoop job is to set its input format. You usually set the input format within the run() method in your Hadoop application's class. To set up the input format, pass the job.setInputFormatClass method the VerticaInputFormat.class, as follows:

public int run(String[] args) throws Exception {
    // Set up the configuration and job objects
    Configuration conf = getConf();
    Job job = new Job(conf);

(later in the code)

    // Set the input format to retrieve data from
    // Vertica.
    job.setInputFormatClass(VerticaInputFormat.class);

Setting the input to the VerticaInputFormat class means that the map method will get VerticaRecord objects as its input.


Setting the Query to Retrieve Data From Vertica

A Hadoop job that reads data from your Vertica database has to execute a query that selects its input data. You pass this query to your Hadoop application using the setInput method of the VerticaInputFormat class. The Vertica Connector for Hadoop Map Reduce sends this query to the Hadoop nodes, which then individually connect to Vertica nodes to run the query and get their input data.

A primary consideration for this query is how it segments the data being retrieved from Vertica. Since each node in the Hadoop cluster needs data to process, the query result needs to be segmented between the nodes.

There are three formats you can use for the query you want your Hadoop job to use when retrieving input data. Each format determines how the query's results are split across the Hadoop cluster. These formats are:

- A simple, self-contained query.

- A parameterized query along with explicit parameters.

- A parameterized query along with a second query that retrieves the parameter values for the first query from Vertica.

The following sections explain each of these methods in detail.

Using a Simple Query to Extract Data From Vertica

The simplest format for the query that Hadoop uses to extract data from Vertica is a self-contained hard-coded query. You pass this query in a String to the setInput method of the VerticaInputFormat class. You usually make this call in the run method of your Hadoop job class. For example, the following code retrieves the entire contents of the table named allTypes.

// Sets the query to use to get the data from the Vertica database.
// Simple query with no parameters
VerticaInputFormat.setInput(job,
    "SELECT * FROM allTypes ORDER BY key;");

The query you supply must have an ORDER BY clause, since the Vertica Connector for Hadoop Map Reduce uses it to figure out how to segment the query results between the Hadoop nodes. When it gets a simple query, the connector calculates limits and offsets to be sent to each node in the Hadoop cluster, so they each retrieve a portion of the query results to process.

Having Hadoop use a simple query to retrieve data from Vertica is the least efficient method, since the connector needs to perform extra processing to determine how the data should be segmented across the Hadoop nodes.


Using a Parameterized Query and Parameter Lists

You can have Hadoop retrieve data from Vertica using a parameterized query, to which you supply a set of parameters. The parameters in the query are represented by a question mark (?).

You pass the query and the parameters to the setInput method of the VerticaInputFormat class. You have two options for passing the parameters: using a discrete list, or by using a Collection object.

Using a Discrete List of Values

To pass a discrete list of parameters for the query, you include them in the setInput method call in a comma-separated list of string values, as shown in the next example:

// Simple query with supplied parameters
VerticaInputFormat.setInput(job,
    "SELECT * FROM allTypes WHERE key = ?", "1001", "1002", "1003");

The Vertica Connector for Hadoop Map Reduce tries to evenly distribute the query and parameters among the nodes in the Hadoop cluster. If the number of parameters is not a multiple of the number of nodes in the cluster, some nodes will get more parameters to process than others. Once the connector divides up the parameters among the Hadoop nodes, each node connects to a host in the Vertica database and executes the query, substituting in the parameter values it received.

This format is useful when you have a discrete set of parameters that will not change over time. However, it is inflexible because any changes to the parameter list require you to recompile your Hadoop job. An added limitation is that the query can contain just a single parameter, because the setInput method only accepts a single parameter list. The more flexible way to use parameterized queries is to use a collection to contain the parameters.

Using a Collection Object

The more flexible method of supplying the parameters for the query is to store them into a Collection object, then include the object in the setInput method call. This method allows you to build the list of parameters at run time, rather than having them hard-coded. You can also use multiple parameters in the query, since you will pass a collection of ArrayList objects to the setInput statement. Each ArrayList object supplies one set of parameter values, and can contain values for each parameter in the query.

The following example demonstrates using a collection to pass the parameter values for a query containing two parameters. The collection object passed to setInput is an instance of the HashSet class. This object contains four ArrayList objects added within the for loop. This example just adds dummy values (the loop counter and the string "FOUR"). In your own application, you usually calculate parameter values in some manner before adding them to the collection.

Note: If your parameter values are stored in Vertica, you should specify the parameters using a query instead of a collection. See Using a Query to Retrieve Parameter Values for a Parameterized Query for details.

// Collection to hold all of the sets of parameters for the query.
Collection<List<Object>> params = new HashSet<List<Object>>() {};

// Each set of parameters lives in an ArrayList. Each entry
// in the list supplies a value for a single parameter in
// the query. Here, ArrayList objects are created in a loop
// that adds the loop counter and a static string as the
// parameters. The ArrayList is then added to the collection.
for (int i = 0; i < 4; i++) {
    ArrayList<Object> param = new ArrayList<Object>();
    param.add(i);
    param.add("FOUR");
    params.add(param);
}

VerticaInputFormat.setInput(job,
    "select * from allTypes where key = ? AND NOT varcharcol = ?",
    params);

Scaling Parameter Lists for the Hadoop Cluster

Whenever possible, make the number of parameter values you pass to the Vertica Connector for Hadoop Map Reduce equal to the number of nodes in the Hadoop cluster, because each parameter value is assigned to a single Hadoop node. This ensures that the workload is spread across the entire Hadoop cluster. If you supply fewer parameter values than there are nodes in the Hadoop cluster, some of the nodes will not get a value and will sit idle. If the number of parameter values is not a multiple of the number of nodes in the cluster, Hadoop randomly assigns the extra values to nodes in the cluster. It does not perform scheduling—it does not wait for a node to finish its task and become free before assigning additional tasks. In this case, a node could become a bottleneck if it is assigned the longer-running portions of the job.

In addition to the number of parameters in the query, you should make the parameter values yield roughly the same number of results. Ensuring each parameter yields the same number of results helps prevent a single node in the Hadoop cluster from becoming a bottleneck by having to process more data than the other nodes in the cluster.
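As an illustration only, the following sketch (a fragment from a run method) builds one single-value parameter set per Hadoop node. It assumes the node count is supplied by the caller as the first command-line argument; the connector does not provide this value itself.

// Illustrative sketch: size the parameter collection to the Hadoop cluster.
// numHadoopNodes is assumed to be passed in by the caller; adjust to however
// you track the size of your cluster.
int numHadoopNodes = Integer.parseInt(args[0]);
Collection<List<Object>> params = new HashSet<List<Object>>();
for (int i = 0; i < numHadoopNodes; i++) {
    ArrayList<Object> param = new ArrayList<Object>();
    param.add(i); // one parameter value (and therefore one work unit) per node
    params.add(param);
}
VerticaInputFormat.setInput(job,
    "SELECT * FROM allTypes WHERE key = ?", params);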

Using a Query to Retrieve Parameter Values for a Parameterized Query

You can pass the Vertica Connector for Hadoop Map Reduce a query to extract the parameter values for a parameterized query. This query must return a single column of data that is used as parameters for the parameterized query.

To use a query to retrieve the parameter values, supply the VerticaInputFormat class's setInput method with the parameterized query and a query to retrieve parameters. For example:


// Sets the query to use to get the data from the Vertica database.
// Query using a parameter that is supplied by another query
VerticaInputFormat.setInput(job,
    "select * from allTypes where key = ?",
    "select distinct key from regions");

When it receives a query for parameters, the connector runs the query itself, then groups the results together to send out to the Hadoop nodes, along with the parameterized query. The Hadoop nodes then run the parameterized query using the set of parameter values sent to them by the connector.

Writing a Map Class That Processes Vertica Data

Once you have set up your Hadoop application to read data from Vertica, you need to create a Map class that actually processes the data. Your Map class's map method receives LongWritable values as keys and VerticaRecord objects as values. The key values are just sequential numbers that identify the row in the query results. The VerticaRecord class represents a single row from the result set returned by the query you supplied to the VerticaInputFormat.setInput method.

Working with the VerticaRecord Class

Your map method extracts the data it needs from the VerticaRecord class. This class contains three main methods you use to extract data from the record set (a short usage sketch follows the list):

- get retrieves a single value, either by index value or by name, from the row sent to the map method.

- getOrdinalPosition takes a string containing a column name and returns the column's number.

- getType returns the data type of a column in the row specified by index value or by name. This method is useful if you are unsure of the data types of the columns returned by the query. The types are stored as integer values defined by the java.sql.Types class.
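A brief sketch of these accessors inside a map method appears below. The exact return types are assumptions based on the descriptions above, and the column name "varcharcol" is purely illustrative.

// Sketch only: look up a column by name, check its SQL type, then read it.
public void map(LongWritable key, VerticaRecord value, Context context)
        throws IOException, InterruptedException {
    int col = value.getOrdinalPosition("varcharcol");   // column name -> column number
    if (value.getType(col) == java.sql.Types.VARCHAR) { // type is a java.sql.Types value
        String text = (String) value.get(col);          // fetch the value by index
        context.write(new Text(text), new DoubleWritable(1.0));
    }
}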

The following example shows a Mapper class and map method that accepts VerticaRecord objects. In this example, no real work is done. Instead, two values are selected as the key and value to be passed on to the reducer.

public static class Map extends
        Mapper<LongWritable, VerticaRecord, Text, DoubleWritable> {
    // This mapper accepts VerticaRecords as input.
    public void map(LongWritable key, VerticaRecord value, Context context)
            throws IOException, InterruptedException {
        // In your mapper, you would do actual processing here.
        // This simple example just extracts two values from the row of
        // data and passes them to the reducer as the key and value.
        if (value.get(3) != null && value.get(0) != null) {
            context.write(new Text((String) value.get(3)),
                new DoubleWritable((Long) value.get(0)));
        }
    }
}

If your Hadoop job has a reduce stage, all of the map method output is managed by Hadoop. It is not stored or manipulated in any way by Vertica. If your Hadoop job does not have a reduce stage, and needs to store its output into Vertica, your map method must output its keys as Text objects and values as VerticaRecord objects.
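For example, a map-only job might be configured as in the following sketch. Disabling the reduce stage with setNumReduceTasks(0) is standard Hadoop; having the map stage emit the table name and a populated VerticaRecord, exactly as a reducer would, is an assumption based on the description above.

// Sketch of a map-only job whose map output is stored directly in Vertica.
job.setNumReduceTasks(0);                             // no reduce stage
job.setOutputKeyClass(Text.class);                    // map emits the target table name...
job.setOutputValueClass(VerticaRecord.class);         // ...and a populated VerticaRecord
job.setOutputFormatClass(VerticaOutputFormat.class);  // hand the output to the connector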


Writing Data to Vertica From Hadoop

There are three steps you need to take for your Hadoop application to store data in Vertica:

- Set the output value class of your Hadoop job to VerticaRecord.

- Set the details of the Vertica table where you want to store your data in the VerticaOutputFormat class.

- Create a Reduce class that adds data to a VerticaRecord object and calls its write method to store the data.

The following sections explain these steps in more detail.

Configuring Hadoop to Output to Vertica

To have your Hadoop application output data to Vertica, configure it to output to the Vertica Connector for Hadoop Map Reduce. You will normally perform these steps in your Hadoop application's run method. There are three methods that need to be called in order to set up the output to be sent to the connector and to set the output of the Reduce class, as shown in the following example:

// Set the output format of the Reduce class. It will
// output VerticaRecords that will be stored in the
// database.
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(VerticaRecord.class);
// Tell Hadoop to send its output to the
// Vertica Connector for Hadoop Map Reduce.
job.setOutputFormatClass(VerticaOutputFormat.class);

The call to setOutputValueClass tells Hadoop that the output of the Reduce.reduce method is a VerticaRecord class object. This object represents a single row of a Vertica database table. You tell your Hadoop job to send the data to the connector by setting the output format class to VerticaOutputFormat.

Defining the Output Table

Call the VerticaOutputFormat.setOutput method to define the table that will hold the Hadoop application output:

VerticaOutputFormat.setOutput(jobObject, tableName, [truncate, ["columnName1 dataType1" [, "columnNamen dataTypen" ...]]]);

jobObject
    The Hadoop job object for your application.

tableName
    The name of the table to store Hadoop's output. If this table does not exist, the Vertica Connector for Hadoop Map Reduce automatically creates it. The name can be a full database.schema.table reference.

truncate
    A Boolean controlling whether to delete the contents of tableName if it already exists. If set to true, any existing data in the table is deleted before Hadoop's output is stored. If set to false or not given, the Hadoop output is added to the existing data in the table.

"columnName1 dataType1"
    The table column definitions, where columnName1 is the column name and dataType1 the SQL data type. These two values are separated by a space. If not specified, the existing table is used.

The first two parameters are required. You can add as many column definitions as you need in your output table.

You usually call the setOutput method in your Hadoop class's run method, where all other setup work is done. The following example sets up an output table named mrtarget that contains 8 columns, each containing a different data type:

// Sets the output format for storing data in Vertica. It defines the
// table where data is stored, and the columns it will be stored in.
VerticaOutputFormat.setOutput(job, "mrtarget", true, "a int",
    "b boolean", "c char(1)", "d date", "f float", "t timestamp",
    "v varchar", "z varbinary");

If the truncate parameter is set to true for the method call, and the target table already exists in the Vertica database, the connector deletes the table contents before storing the Hadoop job's output.

Note: If the table already exists in the database, and the method call's truncate parameter is set to false, the Vertica Connector for Hadoop Map Reduce adds new application output to the existing table. However, the connector does not verify that the column definitions in the existing table match those defined in the setOutput method call. If the new application output values cannot be converted to the existing column values, your Hadoop application can throw casting exceptions.

Writing the Reduce Class

Once your Hadoop application is configured to output data to Vertica and has its output table defined, you need to create the Reduce class that actually formats and writes the data for storage in Vertica.

The first step your Reduce class should take is to instantiate a VerticaRecord object to hold the output of the reduce method. This is a little more complex than just instantiating a base object, since the VerticaRecord object must have the columns defined in it that match the output table's columns (see Defining the Output Table for details). To get the properly configured VerticaRecord object, you pass the constructor the configuration object.

You usually instantiate the VerticaRecord object in your Reduce class's setup method, which Hadoop calls before it calls the reduce method. For example:

// Sets up the output record that will be populated by
// the reducer and eventually written out.
public void setup(Context context) throws IOException,
        InterruptedException {
    super.setup(context);
    try {
        // Instantiate a VerticaRecord object that has the proper
        // column definitions. The object is stored in the record
        // field for use later.
        record = new VerticaRecord(context.getConfiguration());
    } catch (Exception e) {
        throw new IOException(e);
    }
}

Storing Data in the VerticaRecord

Your reduce method starts the same way any other Hadoop reduce method does—it processes its input key and value, performing whatever reduction task your application needs. Afterwards, your reduce method adds the data to be stored in Vertica to the VerticaRecord object that was instantiated earlier. Usually you use the set method to add the data:

VerticaRecord.set(column, value);

column
    The column to store the value in. This is either an integer (the column number) or a String (the column name, as defined in the table definition).

    Note: The set method throws an exception if you pass it the name of a column that does not exist. You should always use a try/catch block around any set method call that uses a column name.

value
    The value to store in the column. The data type of this value must match the definition of the column in the table.

Note: If you do not have the set method validate that the data types of the value and the column match, the Vertica Connector for Hadoop Map Reduce throws a ClassCastException if it finds a mismatch when it tries to commit the data to the database. This exception causes a rollback of the entire result. By having the set method validate the data type of the value, you can catch and resolve the exception before it causes a rollback.

In addition to the set method, you can also use the setFromString method to have the Vertica Connector for Hadoop Map Reduce convert the value from String to the proper data type for the column:

VerticaRecord.setFromString(column, "value");

column
    The column number to store the value in, as an integer.

value
    A String containing the value to store in the column. If the String cannot be converted to the correct data type to be stored in the column, setFromString throws an exception (ParseException for date values, NumberFormatException for numeric values).


Your reduce method must output a value for every column in the output table. If you want a column to have a null value, you must explicitly set it.

After it populates the VerticaRecord object, your reduce method calls the Context.write method, passing it the name of the table to store the data in as the key, and the VerticaRecord object as the value.
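For example, a minimal sketch follows. Passing null to set in order to store a SQL NULL is an assumption based on the description above; the column index and table name come from the mrtarget example.

// Sketch only: explicitly store a NULL in column 6, then write the row.
try {
    record.set(6, null); // assumed way to set an explicit NULL value
} catch (Exception e) {
    e.printStackTrace();
}
context.write(new Text("mrtarget"), record);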

The following example shows a simple Reduce class that stores data into Vertica. To make the example as simple as possible, the code doesn't actually process the input it receives, and instead just writes dummy data to the database. In your own application, you would process the key and values into data that you then store in the VerticaRecord object.

public static class Reduce extends
        Reducer<Text, DoubleWritable, Text, VerticaRecord> {
    // Holds the records that the reducer writes its values to.
    VerticaRecord record = null;

    // Sets up the output record that will be populated by
    // the reducer and eventually written out.
    public void setup(Context context) throws IOException,
            InterruptedException {
        super.setup(context);
        try {
            // Need to call VerticaOutputFormat to get a record object
            // that has the proper column definitions.
            record = new VerticaRecord(context.getConfiguration());
        } catch (Exception e) {
            throw new IOException(e);
        }
    }

    // The reduce method.
    public void reduce(Text key, Iterable<DoubleWritable> values,
            Context context) throws IOException, InterruptedException {
        // Ensure that the record object has been set up properly. This is
        // where the results are written.
        if (record == null) {
            throw new IOException("No output record found");
        }
        // In this part of your application, your reducer would process the
        // key and values parameters to arrive at values that you want to
        // store into the database. For simplicity's sake, this example
        // skips all of the processing, and just inserts arbitrary values
        // into the database.
        //
        // Use the .set method to store a value in the record to be stored
        // in the database. The first parameter is the column number,
        // the second is the value to store.
        //
        // Column numbers start at 0.
        //
        // Set column 0 to an integer value. You
        // should always use a try/catch block to catch the exception.
        try {
            record.set(0, 125);
        } catch (Exception e) {
            // Handle the improper data type here.
            e.printStackTrace();
        }

        // You can also set column values by name rather than by column
        // number. However, this requires a try/catch since specifying a
        // non-existent column name will throw an exception.
        try {
            // The second column, named "b", contains a Boolean value.
            record.set("b", true);
        } catch (Exception e) {
            // Handle an improper column name here.
            e.printStackTrace();
        }

        // Column 2 stores a single char value.
        record.set(2, 'c');

        // Column 3 is a date. Value must be a java.sql.Date type.
        record.set(3, new java.sql.Date(
            Calendar.getInstance().getTimeInMillis()));

        // You can use the setFromString method to convert a string
        // value into the proper data type to be stored in the column.
        // You need to use a try...catch block in this case, since the
        // string to value conversion could fail (for example, trying to
        // store "Hello, World!" in a float column is not going to work).
        try {
            record.setFromString(4, "234.567");
        } catch (ParseException e) {
            // Thrown if the string cannot be parsed into the data type
            // to be stored in the column.
            e.printStackTrace();
        }

        // Column 5 stores a timestamp.
        record.set(5, new java.sql.Timestamp(
            Calendar.getInstance().getTimeInMillis()));
        // Column 6 stores a varchar.
        record.set(6, "example string");
        // Column 7 stores a varbinary.
        record.set(7, new byte[10]);
        // Once the columns are populated, write the record to store
        // the row into the database.
        context.write(new Text("mrtarget"), record);
    }
}


Passing Parameters to the Vertica Connector for Hadoop Map Reduce At Run Time

Specifying the Location of the Connector .jar File

Recent versions of Hadoop fail to find the Vertica Connector for Hadoop Map Reduce classes automatically, even though they are included in the Hadoop lib directory. Therefore, you need to manually tell Hadoop where to find the connector .jar file using the libjars argument:

hadoop jar myHadoopApp.jar com.myorg.hadoop.myHadoopApp \
    -libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
    . . .

Specifying the Database Connection Parameters

You need to pass connection parameters to the Vertica Connector for Hadoop Map Reduce when starting your Hadoop application, so it knows how to connect to your database. At a minimum, these parameters must include the list of hostnames in the Vertica database cluster, the name of the database, and the user name. The common parameters for accessing the database appear in the following table. Usually, you will only need the basic parameters listed in this table in order to start your Hadoop application.

mapred.vertica.hostnames (required; default: none)
    A comma-separated list of the names or IP addresses of the hosts in the Vertica cluster. You should list all of the nodes in the cluster here, since individual nodes in the Hadoop cluster connect directly with a randomly assigned host in the cluster. The hosts in this cluster are used for both reading from and writing data to the Vertica database, unless you specify a different output database (see below).

mapred.vertica.port (optional; default: 5433)
    The port number for the Vertica database.

mapred.vertica.database (required)
    The name of the database the Hadoop application should access.

mapred.vertica.username (required)
    The username to use when connecting to the database.

mapred.vertica.password (optional; default: empty)
    The password to use when connecting to the database.


You pass the parameters to the connector using the -D command line switch in the command you use to start your Hadoop application. For example:

hadoop jar myHadoopApp.jar com.myorg.hadoop.myHadoopApp \
    -libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
    -Dmapred.vertica.hostnames=Vertica01,Vertica02,Vertica03,Vertica04 \
    -Dmapred.vertica.port=5433 -Dmapred.vertica.username=exampleuser \
    -Dmapred.vertica.password=password123 -Dmapred.vertica.database=ExampleDB

Parameters for a Separate Output Database

The parameters in the previous table are all you need if your Hadoop application accesses a single Vertica database. You can also have your Hadoop application read from one Vertica database and write to a different Vertica database. In this case, the parameters shown in the previous table apply to the input database (the one Hadoop reads data from). The following table lists the parameters that you use to supply your Hadoop application with the connection information for the output database (the one it writes its data to). None of these parameters is required. If you do not assign a value to one of these output parameters, it inherits its value from the input database parameters. An example command line appears after the table.

mapred.vertica.hostnames.output (default: input hostnames)
    A comma-separated list of the names or IP addresses of the hosts in the output Vertica cluster.

mapred.vertica.port.output (default: 5433)
    The port number for the output Vertica database.

mapred.vertica.database.output (default: input database name)
    The name of the output database.

mapred.vertica.username.output (default: input database user name)
    The username to use when connecting to the output database.

mapred.vertica.password.output (default: input database password)
    The password to use when connecting to the output database.
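For example, a job that reads from one Vertica database and writes to another might be started with a command line like the following sketch. The host names, user names, and database names are placeholders; any .output parameter you leave unspecified falls back to the corresponding input setting.

hadoop jar myHadoopApp.jar com.myorg.hadoop.myHadoopApp \
    -libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
    -Dmapred.vertica.hostnames=InputVertica01,InputVertica02 \
    -Dmapred.vertica.database=InputDB \
    -Dmapred.vertica.username=exampleuser \
    -Dmapred.vertica.hostnames.output=OutputVertica01,OutputVertica02 \
    -Dmapred.vertica.database.output=OutputDB \
    -Dmapred.vertica.username.output=outputuser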

Example Vertica Connector for Hadoop Map Reduce Application

This section presents an example of using the Vertica Connector for Hadoop Map Reduce to retrieve and store data from a Vertica database. The example pulls together the code that has appeared in the previous topics to present a functioning example.


This application reads data from a table named allTypes. The mapper selects two values from this table to send to the reducer. The reducer doesn't perform any operations on the input, and instead inserts arbitrary data into the output table named mrtarget.

package com.vertica.hadoop;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Collection;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.math.BigDecimal;
import java.sql.Date;
import java.sql.Timestamp;
// Needed when using the setFromString method, which throws this exception.
import java.text.ParseException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.vertica.hadoop.VerticaConfiguration;
import com.vertica.hadoop.VerticaInputFormat;
import com.vertica.hadoop.VerticaOutputFormat;
import com.vertica.hadoop.VerticaRecord;

// This is the class that contains the entire Hadoop example.
public class VerticaExample extends Configured implements Tool {

    public static class Map extends
            Mapper<LongWritable, VerticaRecord, Text, DoubleWritable> {
        // This mapper accepts VerticaRecords as input.
        public void map(LongWritable key, VerticaRecord value, Context context)
                throws IOException, InterruptedException {
            // In your mapper, you would do actual processing here.
            // This simple example just extracts two values from the row of
            // data and passes them to the reducer as the key and value.
            if (value.get(3) != null && value.get(0) != null) {
                context.write(new Text((String) value.get(3)),
                    new DoubleWritable((Long) value.get(0)));
            }
        }
    }

    public static class Reduce extends
            Reducer<Text, DoubleWritable, Text, VerticaRecord> {
        // Holds the records that the reducer writes its values to.
        VerticaRecord record = null;

        // Sets up the output record that will be populated by
        // the reducer and eventually written out.
        public void setup(Context context) throws IOException,
                InterruptedException {
            super.setup(context);
            try {
                // Need to call VerticaOutputFormat to get a record object
                // that has the proper column definitions.
                record = new VerticaRecord(context.getConfiguration());
            } catch (Exception e) {
                throw new IOException(e);
            }
        }

        // The reduce method.
        public void reduce(Text key, Iterable<DoubleWritable> values,
                Context context) throws IOException, InterruptedException {
            // Ensure that the record object has been set up properly. This is
            // where the results are written.
            if (record == null) {
                throw new IOException("No output record found");
            }
            // In this part of your application, your reducer would process the
            // key and values parameters to arrive at values that you want to
            // store into the database. For simplicity's sake, this example
            // skips all of the processing, and just inserts arbitrary values
            // into the database.
            //
            // Use the .set method to store a value in the record to be stored
            // in the database. The first parameter is the column number,
            // the second is the value to store.
            //
            // Column numbers start at 0.
            //
            // Set column 0 to an integer value. You
            // should always use a try/catch block to catch the exception.
            try {
                record.set(0, 125);
            } catch (Exception e) {
                // Handle the improper data type here.
                e.printStackTrace();
            }

            // You can also set column values by name rather than by column
            // number. However, this requires a try/catch since specifying a
            // non-existent column name will throw an exception.
            try {
                // The second column, named "b", contains a Boolean value.
                record.set("b", true);
            } catch (Exception e) {
                // Handle an improper column name here.
                e.printStackTrace();
            }

            // Column 2 stores a single char value.
            record.set(2, 'c');

            // Column 3 is a date. Value must be a java.sql.Date type.
            record.set(3, new java.sql.Date(
                Calendar.getInstance().getTimeInMillis()));

            // You can use the setFromString method to convert a string
            // value into the proper data type to be stored in the column.
            // You need to use a try...catch block in this case, since the
            // string to value conversion could fail (for example, trying to
            // store "Hello, World!" in a float column is not going to work).
            try {
                record.setFromString(4, "234.567");
            } catch (ParseException e) {
                // Thrown if the string cannot be parsed into the data type
                // to be stored in the column.
                e.printStackTrace();
            }

            // Column 5 stores a timestamp.
            record.set(5, new java.sql.Timestamp(
                Calendar.getInstance().getTimeInMillis()));
            // Column 6 stores a varchar.
            record.set(6, "example string");
            // Column 7 stores a varbinary.
            record.set(7, new byte[10]);
            // Once the columns are populated, write the record to store
            // the row into the database.
            context.write(new Text("mrtarget"), record);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        // Set up the configuration and job objects
        Configuration conf = getConf();
        Job job = new Job(conf);
        conf = job.getConfiguration();
        conf.set("mapreduce.job.tracker", "local");
        job.setJobName("vertica test");

        // Set the input format to retrieve data from
        // Vertica.
        job.setInputFormatClass(VerticaInputFormat.class);

        // Set the output format of the mapper. This is the interim
        // data format passed to the reducer. Here, we will pass in a
        // Double. The interim data is not processed by Vertica in any
        // way.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DoubleWritable.class);

        // Set the output format of the Hadoop application. It will
        // output VerticaRecords that will be stored in the
        // database.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(VerticaRecord.class);
        job.setOutputFormatClass(VerticaOutputFormat.class);
        job.setJarByClass(VerticaExample.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        // Sets the output format for storing data in Vertica. It defines the
        // table where data is stored, and the columns it will be stored in.
        VerticaOutputFormat.setOutput(job, "mrtarget", true, "a int",
            "b boolean", "c char(1)", "d date", "f float", "t timestamp",
            "v varchar", "z varbinary");

        // Sets the query to use to get the data from the Vertica database.
        // Query using a list of parameters.
        VerticaInputFormat.setInput(job, "select * from allTypes where key = ?",
            "1", "2", "3");

        job.waitForCompletion(true);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new VerticaExample(),
            args);
        System.exit(res);
    }
}

Compiling and Running the Example Application

To run the example Hadoop application, you first need to set up the allTypes table that the example reads as input. To set up the input table, save the following Perl script as MakeAllTypes.pl to a location on one of your Vertica nodes:

#!/usr/bin/perl
open FILE, ">datasource" or die $!;
for ($i=0; $i < 10; $i++) {
    print FILE $i . "|" . rand(10000);
    print FILE "|one|ONE|1|1999-01-08|1999-02-23 03:11:52.35";
    print FILE '|1999-01-08 07:04:37|07:09:23|15:12:34 EST|0xabcd|';
    print FILE '0xabcd|1234532|03:03:03' . qq(\n);
}
close FILE;

Then follow these steps:


1. Connect to the node where you saved the MakeAllTypes.pl file.

2. Run the MakeAllTypes.pl file. This will generate a file named datasource in the current directory.

Note: If your node does not have Perl installed, you can run this script on a system that does have Perl installed, then copy the datasource file to a database node.

3. On the same node, use vsql to connect to your Vertica database.

4. Run the following query to create the allTypes table:

CREATE TABLE allTypes (key identity,
    intcol integer,
    floatcol float,
    charcol char(10),
    varcharcol varchar,
    boolcol boolean,
    datecol date,
    timestampcol timestamp,
    timestampTzcol timestamptz,
    timecol time,
    timeTzcol timetz,
    varbincol varbinary,
    bincol binary,
    numcol numeric(38,0),
    intervalcol interval
);

5. Run the following query to load data from the datasource file into the table:

COPY allTypes COLUMN OPTION (varbincol FORMAT 'hex', bincol FORMAT 'hex')
    FROM '/path-to-datasource/datasource' DIRECT;

Replace path-to-datasource with the absolute path to the datasource file located in the same directory where you ran MakeAllTypes.pl.

Compiling the Example (optional)

The example code presented in this section is based on example code distributed along with the Vertica Connector for Hadoop Map Reduce in the file hadoop-vertica-example.jar. If you just want to run the example, skip to the next section and use the hadoop-vertica-example.jar file that came as part of the connector package rather than a version you compiled yourself.

To compile the example code listed in Example Vertica Connector for Hadoop Map Reduce Application, follow these steps:


1. Log into a node on your Hadoop cluster.

2. Locate the Hadoop home directory. See Installing the Connector for tips on how to find this directory.

3. If it is not already set, set the environment variable HADOOP_HOME to the Hadoop home directory:

export HADOOP_HOME=path_to_Hadoop_home

If you installed Hadoop using an .rpm or .deb package, Hadoop is usually installed in /usr/lib/hadoop:

export HADOOP_HOME=/usr/lib/hadoop

4. Save the example source code to a file named VerticaExample.java on your Hadoop node.

5. In the same directory where you saved VerticaExample.java, create a directory named classes. On Linux, the command is:

mkdir classes

6. Compile the Hadoop example:

javac -classpath \
    $HADOOP_HOME/hadoop-core.jar:$HADOOP_HOME/lib/hadoop-vertica.jar \
    -d classes VerticaExample.java \
    && jar -cvf hadoop-vertica-example.jar -C classes .

Note: If you receive errors about missing Hadoop classes, check the name of the hadoop-core.jar file. Most Hadoop installers (including Cloudera's) create a symbolic link named hadoop-core.jar to a version-specific .jar file (such as hadoop-core-0.20.203.0.jar). If your Hadoop installation did not create this link, you will have to supply the .jar file name with the version number.

When the compilation finishes, you will have a file named hadoop-vertica-example.jar in the same directory as the VerticaExample.java file. This is the file you will have Hadoop run.

Running the Example Application

Once you have compiled the example, run it using the following command line:

hadoop jar hadoop-vertica-example.jar \
    com.vertica.hadoop.VerticaExample \
    -Dmapred.vertica.hostnames=VerticaHost01,VerticaHost02,... \
    -Dmapred.vertica.port=portNumber \
    -Dmapred.vertica.username=userName \
    -Dmapred.vertica.password=dbPassword \
    -Dmapred.vertica.database=databaseName

This command tells Hadoop to run your application's .jar file and supplies the parameters needed for your application to connect to your Vertica database. Fill in your own values for the hostnames, port, user name, password, and database name of your Vertica database.

After entering the command, you will see output similar to the following as Hadoop processes the data:

12/01/11 10:41:19 INFO mapred.JobClient: Running job: job_201201101146_0005
12/01/11 10:41:20 INFO mapred.JobClient:  map 0% reduce 0%
12/01/11 10:41:36 INFO mapred.JobClient:  map 33% reduce 0%
12/01/11 10:41:39 INFO mapred.JobClient:  map 66% reduce 0%
12/01/11 10:41:42 INFO mapred.JobClient:  map 100% reduce 0%
12/01/11 10:41:45 INFO mapred.JobClient:  map 100% reduce 22%
12/01/11 10:41:51 INFO mapred.JobClient:  map 100% reduce 100%
12/01/11 10:41:56 INFO mapred.JobClient: Job complete: job_201201101146_0005
12/01/11 10:41:56 INFO mapred.JobClient: Counters: 23
12/01/11 10:41:56 INFO mapred.JobClient:   Job Counters
12/01/11 10:41:56 INFO mapred.JobClient:     Launched reduce tasks=1
12/01/11 10:41:56 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=21545
12/01/11 10:41:56 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/01/11 10:41:56 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/01/11 10:41:56 INFO mapred.JobClient:     Launched map tasks=3
12/01/11 10:41:56 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=13851
12/01/11 10:41:56 INFO mapred.JobClient:   File Output Format Counters
12/01/11 10:41:56 INFO mapred.JobClient:     Bytes Written=0
12/01/11 10:41:56 INFO mapred.JobClient:   FileSystemCounters
12/01/11 10:41:56 INFO mapred.JobClient:     FILE_BYTES_READ=69
12/01/11 10:41:56 INFO mapred.JobClient:     HDFS_BYTES_READ=318
12/01/11 10:41:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=89367
12/01/11 10:41:56 INFO mapred.JobClient:   File Input Format Counters
12/01/11 10:41:56 INFO mapred.JobClient:     Bytes Read=0
12/01/11 10:41:56 INFO mapred.JobClient:   Map-Reduce Framework
12/01/11 10:41:56 INFO mapred.JobClient:     Reduce input groups=1
12/01/11 10:41:56 INFO mapred.JobClient:     Map output materialized bytes=81
12/01/11 10:41:56 INFO mapred.JobClient:     Combine output records=0
12/01/11 10:41:56 INFO mapred.JobClient:     Map input records=3
12/01/11 10:41:56 INFO mapred.JobClient:     Reduce shuffle bytes=54
12/01/11 10:41:56 INFO mapred.JobClient:     Reduce output records=1
12/01/11 10:41:56 INFO mapred.JobClient:     Spilled Records=6
12/01/11 10:41:56 INFO mapred.JobClient:     Map output bytes=57
12/01/11 10:41:56 INFO mapred.JobClient:     Combine input records=0
12/01/11 10:41:56 INFO mapred.JobClient:     Map output records=3
12/01/11 10:41:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=318
12/01/11 10:41:56 INFO mapred.JobClient:     Reduce input records=3

Note: The version of the example supplied in the Hadoop Connector download package will produce more output, since it runs several input queries.

Verifying the Results

Once your Hadoop application finishes, you can verify that it ran correctly by looking at the mrtarget table in your Vertica database:

Connect to your Vertica database using vsql and run the following query:

=> SELECT * FROM mrtarget;

The results should look like this:

  a  | b | c |     d      |    f    |            t            |       v        |                    z
-----+---+---+------------+---------+-------------------------+----------------+------------------------------------------
 125 | t | c | 2012-01-11 | 234.567 | 2012-01-11 10:41:48.837 | example string | \000\000\000\000\000\000\000\000\000\000
(1 row)

Using Hadoop Streaming with the Vertica Connector for Hadoop Map Reduce

Hadoop streaming allows you to create an ad-hoc Hadoop job that uses standard commands (such as UNIX command-line utilities) for its map and reduce processing. When using streaming, Hadoop executes the command you pass to it as a mapper and breaks each line from its standard output into key and value pairs. By default, the key and value are separated by the first tab character in the line. These key/value pairs are then passed on standard input to the command that you specified as the reducer. See the Hadoop wiki's topic on streaming for more information. You can have a streaming job retrieve data from a Vertica database, store data into a Vertica database, or both.
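As a quick illustration of that key/value convention, the following Python sketch (not part of the connector; the sample lines are invented for illustration) splits lines at the first tab, the same way streaming does by default:

#!/usr/bin/python
import sys

# Illustration only: Hadoop streaming treats everything before the first tab
# on a line as the key and the rest of the line as the value. The sample
# lines below stand in for hypothetical mapper output.
sample_mapper_output = [
    "1\t1,3165.75,ONE",
    "2\t2,8257.77,ONE",
]

for line in sample_mapper_output:
    # split("\t", 1) splits only at the first tab, mirroring the default
    # streaming behavior described above.
    key, value = line.split("\t", 1)
    sys.stdout.write("key=%s  value=%s\n" % (key, value))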

Reading Data From Vertica in a Streaming Hadoop Job

To have a streaming Hadoop job read data from a Vertica database, you set the inputformat argument of the Hadoop command line to com.vertica.hadoop.deprecated.VerticaStreamingInput. You also need to supply parameters that tell the Hadoop job how to connect to your Vertica database. See Passing Parameters to the Vertica Connector for Hadoop MapReduce At Run Time for an explanation of these command-line parameters.

Note: The VerticaStreamingInput class is within the deprecated namespace because the current version of Hadoop (as of 0.20.1) has not defined a current API for streaming. Instead, the streaming classes conform to the Hadoop version 0.18 API.


In addition to the standard command-line parameters that tell Hadoop how to access your database, there are streaming-specific parameters you need to use. These parameters supply Hadoop with the query it should use to extract data from Vertica, along with other query-related options.

mapred.vertica.input.query
    The query to use to retrieve data from the Vertica database. See Setting the Query to Retrieve Data from Vertica for more information.
    Required: Yes. Default: none.

mapred.vertica.input.paramquery
    A query to execute to retrieve parameters for the query given in the .input.query parameter.
    Required: If the query has a parameter and no discrete parameters are supplied.

mapred.vertica.query.params
    A discrete list of parameters for the query.
    Required: If the query has a parameter and no parameter query is supplied.

mapred.vertica.input.delimiter
    The character to use for separating column values. The command you use as a mapper needs to split individual column values apart using this delimiter.
    Required: No. Default: 0xa.

mapred.vertica.input.terminator
    The character used to signal the end of a row of data from the query result.
    Required: No. Default: 0xb.

The following command demonstrates reading data from a table named allTypes. This command uses the UNIX cat command as a mapper and reducer, so it just passes the contents through.

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
    -Dmapred.vertica.hostnames=VerticaHost01,VerticaHost02,... \
    -Dmapred.vertica.database=ExampleDB \
    -Dmapred.vertica.username=ExampleUser \
    -Dmapred.vertica.password=password123 \
    -Dmapred.vertica.port=5433 \
    -Dmapred.vertica.input.query="SELECT key, intcol, floatcol, varcharcol FROM allTypes ORDER BY key" \
    -Dmapred.vertica.input.delimiter=, \
    -Dmapred.map.tasks=1 \
    -inputformat com.vertica.hadoop.deprecated.VerticaStreamingInput \
    -input /tmp/input -output /tmp/output -reducer /bin/cat -mapper /bin/cat

The results of this command are saved in the /tmp/output directory on your HDFS filesystem. On a four-node Hadoop cluster, the results would be:


# $HADOOP_HOME/bin/hadoop dfs -ls /tmp/output
Found 5 items
drwxr-xr-x   - release supergroup          0 2012-01-19 11:47 /tmp/output/_logs
-rw-r--r--   3 release supergroup         88 2012-01-19 11:47 /tmp/output/part-00000
-rw-r--r--   3 release supergroup         58 2012-01-19 11:47 /tmp/output/part-00001
-rw-r--r--   3 release supergroup         58 2012-01-19 11:47 /tmp/output/part-00002
-rw-r--r--   3 release supergroup         87 2012-01-19 11:47 /tmp/output/part-00003
# $HADOOP_HOME/bin/hadoop dfs -tail /tmp/output/part-00000
1       2,1,3165.75558015273,ONE,
5       6,5,1765.76024139635,ONE,
9       10,9,4142.54176256463,ONE,
# $HADOOP_HOME/bin/hadoop dfs -tail /tmp/output/part-00001
2       3,2,8257.77313710329,ONE,
6       7,6,7267.69718012601,ONE,
# $HADOOP_HOME/bin/hadoop dfs -tail /tmp/output/part-00002
3       4,3,443.188765520475,ONE,
7       8,7,4729.27825566408,ONE,
# $HADOOP_HOME/bin/hadoop dfs -tail /tmp/output/part-00003
0       1,0,2456.83076632307,ONE,
4       5,4,9692.61214265391,ONE,
8       9,8,3327.25019418294,ONE,
13      1015,15,15.1515,FIFTEEN
2       1003,3,333.0,THREE
3       1004,4,0.0,FOUR
4       1005,5,0.0,FIVE
5       1007,7,0.0,SEVEN
6       1008,8,1.0E36,EIGHT
7       1009,9,-1.0E36,NINE
8       1010,10,0.0,TEN
9       1011,11,11.11,ELEVEN

Notes

- Even though the input is coming from Vertica, you need to supply the -input parameter to Hadoop for it to process the streaming job.

- The -Dmapred.map.tasks=1 parameter prevents multiple Hadoop nodes from reading the same data from the database, which would result in Hadoop processing multiple copies of the data.

Writing Data to Vertica in a Streaming Hadoop Job

Similar to reading from a streaming Hadoop job, you write data to Vertica by setting the outputformat parameter of your Hadoop command to com.vertica.hadoop.deprecated.VerticaStreamingOutput. This class requires key/value pairs, but the keys are ignored. The values passed to VerticaStreamingOutput are broken into rows and inserted into a target table. Since keys are ignored, you can use the keys to partition the data for the reduce phase without affecting Vertica's data transfer.

As with reading from Vertica, you need to supply parameters that tell the streaming Hadoop job how to connect to the database. See Passing Parameters to the Vertica Connector for Hadoop MapReduce At Run Time for an explanation of these command-line parameters. If you are reading data from one Vertica database and writing to another, you need to use the output parameters, just as you would if you were reading and writing to separate databases using a Hadoop application. There are also additional parameters that configure the output of the streaming Hadoop job, listed in the following table.

mapred.vertica.output.table.name
    The name of the table where Hadoop should store its data.
    Required: Yes. Default: none.

mapred.vertica.output.table.def
    The definition of the table. The format is the same as used for defining the output table for a Hadoop application. See Defining the Output Table for details.
    Required: If the table does not already exist in the database.

mapred.vertica.output.table.drop
    Whether to truncate the table before adding data to it.
    Required: No. Default: false.

mapred.vertica.output.delimiter
    The character to use for separating column values.
    Required: No. Default: 0x7 (ASCII bell character).

mapred.vertica.output.terminator
    The character used to signal the end of a row of data.
    Required: No. Default: 0x8 (ASCII backspace).

The following example demonstrates reading two columns of data from a Vertica database table named allTypes and writing it back to the same database in a table named hadoopout. The command provides the definition for the table, so you do not have to manually create the table beforehand.

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
    -Dmapred.vertica.output.table.name=hadoopout \
    -Dmapred.vertica.output.table.def="intcol integer, varcharcol varchar" \
    -Dmapred.vertica.output.table.drop=true \
    -Dmapred.vertica.hostnames=VerticaHost01,VerticaHost02,VerticaHost03 \
    -Dmapred.vertica.database=ExampleDB \
    -Dmapred.vertica.username=ExampleUser \
    -Dmapred.vertica.password=password123 \
    -Dmapred.vertica.port=5433 \
    -Dmapred.vertica.input.query="SELECT intcol, varcharcol FROM allTypes ORDER BY key" \
    -Dmapred.vertica.input.delimiter=, \
    -Dmapred.vertica.output.delimiter=, \
    -Dmapred.vertica.input.terminator=0x0a \
    -Dmapred.vertica.output.terminator=0x0a \
    -inputformat com.vertica.hadoop.deprecated.VerticaStreamingInput \
    -outputformat com.vertica.hadoop.deprecated.VerticaStreamingOutput \
    -input /tmp/input \
    -output /tmp/output \
    -reducer /bin/cat \
    -mapper /bin/cat

After running this command, you can view the result by querying your database:


=> SELECT * FROM hadoopout;
 intcol | varcharcol
--------+------------
      1 | ONE
      5 | ONE
      9 | ONE
      2 | ONE
      6 | ONE
      0 | ONE
      4 | ONE
      8 | ONE
      3 | ONE
      7 | ONE
(10 rows)

Loading a Text File From HDFS into Vertica

One common task when working with Hadoop and Vertica is loading text files from the Hadoop Distributed File System (HDFS) into a Vertica table. You can load these files using Hadoop streaming, saving yourself the trouble of having to write custom map and reduce classes.

Note: Hadoop streaming is less efficient than a Java map/reduce Hadoop job, since it passes data through several different interfaces. Streaming is best used for smaller, one-time loads. If you need to load large amounts of data on a regular basis, you should create a standard Hadoop map/reduce job in Java or a script in Pig.

For example, suppose the text file in HDFS that you want to load contains values delimited by pipe characters (|), with each line of the file terminated by a carriage return:

# $HADOOP_HOME/bin/hadoop dfs -cat /tmp/textdata.txt
1|1.0|ONE
2|2.0|TWO
3|3.0|THREE

In this case, the line delimiter poses a problem. You can easily include the column delimiter in the Hadoop command-line arguments. However, it is hard to specify a carriage return in the Hadoop command line. To get around this issue, you can write a mapper script to strip the carriage return and replace it with some other character that is easy to enter in the command line and that does not occur in the data.

Below is an example of a mapper script written in Python. It performs two tasks:

- Strips the carriage returns from the input text and terminates each line with a tilde (~).

- Adds a key value (the string "streaming") followed by a tab character at the start of each line of the text file. The mapper script needs to do this because the streaming job that reads text files skips the reducer stage. The reducer isn't necessary, since all of the data being read from the text file should be stored in the Vertica table. However, the VerticaStreamingOutput class requires key and value pairs, so the mapper script adds the key.


#!/usr/bin/python
import sys
for line in sys.stdin.readlines():
    # Get rid of carriage returns.
    # CR is used as the record terminator by Streaming.jar
    line = line.strip()
    # Add a key. The key value can be anything.
    # The convention is to use the name of the
    # target table, as shown here.
    sys.stdout.write("streaming\t%s~\n" % line)

The Hadoop command to stream text files from the HDFS into Vertica using the above mapper script appears below.

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -libjars $HADOOP_HOME/lib/hadoop-vertica.jar \
    -Dmapred.reduce.tasks=0 \
    -Dmapred.vertica.output.table.name=streaming \
    -Dmapred.vertica.output.table.def="intcol integer, floatcol float, varcharcol varchar" \
    -Dmapred.vertica.hostnames=VerticaHost01,VerticaHost02,VerticaHost03 \
    -Dmapred.vertica.port=5433 \
    -Dmapred.vertica.username=ExampleUser \
    -Dmapred.vertica.password=password123 \
    -Dmapred.vertica.database=ExampleDB \
    -Dmapred.vertica.output.delimiter="|" \
    -Dmapred.vertica.output.terminator="~" \
    -input /tmp/textdata.txt \
    -output output \
    -mapper "python path-to-script/mapper.py" \
    -outputformat com.vertica.hadoop.deprecated.VerticaStreamingOutput

Notes

- The -Dmapred.reduce.tasks=0 parameter disables the streaming job's reducer stage. It does not need a reducer since the mapper script processes the data into the format that the VerticaStreamingOutput class expects.

- Even though the VerticaStreamingOutput class is handling the output from the mapper, you need to supply a valid output directory to the Hadoop command.

The result of running the command is a new table in the Vertica database:

=> SELECT * FROM streaming;
 intcol | floatcol | varcharcol
--------+----------+------------
      3 |        3 | THREE
      1 |        1 | ONE
      2 |        2 | TWO
(3 rows)


Accessing Vertica From Pig

The Vertica Connector for Hadoop MapReduce includes a Java package that lets you access a Vertica database using Pig. You must copy this .jar file to somewhere in your Pig installation's CLASSPATH (see Installing the Connector for details).

Registering the Vertica .jar Files

Before it can access Vertica, your Pig Latin script must register the Vertica-related .jar files. All of your Pig scripts should start with the following commands:

REGISTER 'path-to-pig-home/lib/vertica_7.0.x_jdk_5.jar';
REGISTER 'path-to-pig-home/lib/pig-vertica.jar';

These commands ensure that Pig can locate the Vertica JDBC classes, as well as the interface for the connector.

Reading Data From Vertica

To read data from a Vertica database, you tell Pig Latin's LOAD statement to use a SQL query and to use the VerticaLoader class as the load function. Your query can be hard coded, or it can contain a parameter. See Setting the Query to Retrieve Data from Vertica for details.

Note: You can only use a discrete parameter list or supply a query to retrieve parameter values; you cannot use a collection to supply parameter values as you can from within a Hadoop application.

The format for calling the VerticaLoader is:

com.vertica.pig.VerticaLoader('hosts','database','port','username','password');

hosts A comma-separated list of the hosts in the Vertica cluster.

database The name of the database to be queried.

port The port number for the database.

username The username to use when connecting to the database.

password The password to use when connecting to the database. This is the only optional parameter. If not present, the connector assumes the password is empty.

The following Pig Latin command extracts all of the data from the table named allTypes using a simple query:

A = LOAD 'sql://{SELECT * FROM allTypes ORDER BY key}' USING
    com.vertica.pig.VerticaLoader('Vertica01,Vertica02,Vertica03',
    'ExampleDB','5433','ExampleUser','password123');


This example uses a parameter and supplies a discrete list of parameter values:

A = LOAD 'sql://{SELECT * FROM allTypes WHERE key = ?};{1,2,3}' USING
    com.vertica.pig.VerticaLoader('Vertica01,Vertica02,Vertica03',
    'ExampleDB','5433','ExampleUser','password123');

This final example demonstrates using a second query to retrieve parameters from the Vertica database.

A = LOAD 'sql://{SELECT * FROM allTypes WHERE key = ?};sql://{SELECT DISTINCT key FROM allTypes}'
    USING com.vertica.pig.VerticaLoader('Vertica01,Vertica02,Vertica03',
    'ExampleDB','5433','ExampleUser','password123');

Writing Data to Vertica

To write data to a Vertica database, you tell Pig Latin's STORE statement to save data to a database table (optionally giving the definition of the table) and to use the VerticaStorer class as the save function. If the table you specify as the destination does not exist, and you supplied the table definition, the table is automatically created in your Vertica database and the data from the relation is loaded into it.

The syntax for calling the VerticaStorer is the same as calling VerticaLoader:

com.vertica.pig.VerticaStorer('hosts','database','port','username','password');

The following example demonstrates saving a relation into a table named hadoopOut, which must already exist in the database:

STORE A INTO '{hadoopOut}' USING
    com.vertica.pig.VerticaStorer('Vertica01,Vertica02,Vertica03','ExampleDB','5433',
    'ExampleUser','password123');

This example shows how you can add a table definition to the table name, so that the table is created in Vertica if it does not already exist:

STORE A INTO '{outTable(a int, b int, c float, d char(10), e varchar, f boolean, g date,
    h timestamp, i timestamptz, j time, k timetz, l varbinary, m binary,
    n numeric(38,0), o interval)}' USING
    com.vertica.pig.VerticaStorer('Vertica01,Vertica02,Vertica03','ExampleDB','5433',
    'ExampleUser','password123');

Note: If the table already exists in the database, and the definition that you supply differs from the table's definition, the table is not dropped and recreated. This may cause data type errors when data is being loaded.


Using the Vertica Connector for HDFS

The Hadoop Distributed File System (HDFS) is where Hadoop usually stores its input and output files. It stores files across the Hadoop cluster redundantly, ensuring files remain available even if some nodes are down. It also makes Hadoop more efficient by spreading file access tasks across the cluster, which helps limit I/O bottlenecks.

The Vertica Connector for HDFS lets you load files from HDFS into Vertica using the COPY statement. You can also create external tables that access data stored on HDFS as if it were a native Vertica table. The connector is useful if your Hadoop job does not directly store its data in Vertica using the Vertica Connector for Hadoop MapReduce (see Using the Vertica Connector for Hadoop MapReduce) or if you use HDFS to store files and want to process them using Vertica.

Note: The files you load from HDFS using the Vertica Connector for HDFS are usually in a delimited format, where column values are separated by a character such as a comma or a pipe character (|).

Like the Vertica Connector for Hadoop MapReduce, the Vertica Connector for HDFS takes advantage of the distributed nature of both Vertica and Hadoop. Individual nodes in the Vertica cluster connect directly to nodes in the Hadoop cluster when loading multiple files from the HDFS. This parallel operation decreases load times.

The connector is read-only; it cannot write data to HDFS. Use it when you want to import data from HDFS into Vertica. If you want Hadoop to be able to read data from and write data to Vertica, use the Vertica Connector for Hadoop MapReduce.

The Vertica Connector for HDFS can connect to a Hadoop cluster through unauthenticated and Kerberos-authenticated connections.

Vertica Connector for HDFS Requirements

The Vertica Connector for HDFS connects to the Hadoop file system using WebHDFS, a built-in component of HDFS that gives applications outside of Hadoop access to HDFS files. WebHDFS was added to Hadoop in version 1.0, so your Hadoop installation must be version 1.0 or later. The connector has been tested with Apache Hadoop 1.0.0, Hortonworks 1.1, and Cloudera CDH4.

In addition, the WebHDFS system must be enabled. See your Hadoop distribution's documentation for instructions on configuring and enabling WebHDFS.

Note: HTTPfs (also known as HOOP) is another method of accessing files stored in an HDFS. It relies on a separate server process that receives requests for files and retrieves them from the HDFS. Since it uses a REST API that is compatible with WebHDFS, it could theoretically work with the connector. However, the connector has not been tested with HTTPfs, and Micro Focus does not support using the Vertica Connector for HDFS with HTTPfs. In addition, since all of the files retrieved from HDFS must pass through the HTTPfs server, it is less efficient than WebHDFS, which lets Vertica nodes connect directly to the Hadoop nodes storing the file blocks.

Kerberos Authentication Requirements

The Vertica Connector for HDFS can connect to the Hadoop file system using Kerberos authentication. To use Kerberos, your connector must meet these additional requirements:

- Your Vertica installation must be Kerberos-enabled.

- Your Hadoop cluster must be configured to use Kerberos authentication.

- Your connector must be able to connect to the Kerberos-enabled Hadoop cluster.

- The Kerberos server must be running version 5.

- The Kerberos server must be accessible from every node in your Vertica cluster.

- You must have Kerberos principals (users) that map to Hadoop users. You use these principals to authenticate your Vertica users with the Hadoop cluster.

Before you can use the Vertica Connector for HDFS with Kerberos, you must install the Kerberos client and libraries on your Vertica cluster.

Testing Your Hadoop WebHDFS Configuration

To ensure that your Hadoop installation's WebHDFS system is configured and running, follow these steps:

1. Log into your Hadoop cluster and locate a small text file on the Hadoop filesystem. If you do not have a suitable file, you can create a file named test.txt in the /tmp directory using the following command:

echo -e "A|1|2|3\nB|4|5|6" | hadoop fs -put - /tmp/test.txt

2. Log into a host in your Vertica database using the database administrator account.

3. If you are using Kerberos authentication, authenticate with the Kerberos server using the keytab file for a user who is authorized to access the file. For example, to authenticate as a user named exampleuser@MYCOMPANY.COM, use the command:

$ kinit exampleuser@MYCOMPANY.COM -k -t /path/exampleuser.keytab

Where path is the path to the keytab file you copied over to the node. You do not receive any message if you authenticate successfully. You can verify that you are authenticated by using the klist command:


$ klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: exampleuser@MYCOMPANY.COM

Valid starting     Expires            Service principal
07/24/13 14:30:19  07/25/13 14:30:19  krbtgt/MYCOMPANY.COM@MYCOMPANY.COM
        renew until 07/24/13 14:30:19

4. Test retrieving the file:

- If you are not using Kerberos authentication, run the following command from the Linux command line:

curl -i -L \
    "http://hadoopNameNode:50070/webhdfs/v1/tmp/test.txt?op=OPEN&user.name=hadoopUserName"

Replace hadoopNameNode with the hostname or IP address of the name node in your Hadoop cluster, /tmp/test.txt with the path to the file in the Hadoop filesystem you located in step 1, and hadoopUserName with the user name of a Hadoop user that has read access to the file.

If successful, you will see output similar to the following:

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Set-Cookie: hadoop.auth="u=hadoopUser&p=password&t=simple&e=1344383263490&s=n8YB/CHFg56qNmRQRTqO0IdRMvE="; Version=1; Path=/
Content-Type: application/octet-stream
Content-Length: 16
Date: Tue, 07 Aug 2012 13:47:44 GMT

A|1|2|3
B|4|5|6

- If you are using Kerberos authentication, run the following command from the Linux command line:

curl --negotiate -i -L -u:anyUser \
    http://hadoopNameNode:50070/webhdfs/v1/tmp/test.txt?op=OPEN

Replace hadoopNameNode with the hostname or IP address of the name node in your Hadoop cluster, and /tmp/test.txt with the path to the file in the Hadoop filesystem you located in step 1.

If successful, you will see output similar to the following:

HTTP/1.1 401 Unauthorized
Content-Type: text/html; charset=utf-8
WWW-Authenticate: Negotiate
Content-Length: 0
Server: Jetty(6.1.26)

HTTP/1.1 307 TEMPORARY_REDIRECT
Content-Type: application/octet-stream
Expires: Thu, 01-Jan-1970 00:00:00 GMT
Set-Cookie: hadoop.auth="u=exampleuser&p=exampleuser@MYCOMPANY.COM&t=kerberos&e=1375144834763&s=iY52iRvjuuoZ5iYG8G5g12O2Vwo=";Path=/
Location: http://hadoopnamenode.mycompany.com:1006/webhdfs/v1/user/release/docexample/test.txt?op=OPEN&delegation=JAAHcmVsZWFzZQdyZWxlYXNlAIoBQCrfpdGKAUBO7CnRju3TbBSlID_osB658jfGfRpEt8-u9WHymRJXRUJIREZTIGRlbGVnYXRpb24SMTAuMjAuMTAwLjkxOjUwMDcw&offset=0
Content-Length: 0
Server: Jetty(6.1.26)

HTTP/1.1 200 OK
Content-Type: application/octet-stream
Content-Length: 16
Server: Jetty(6.1.26)

A|1|2|3
B|4|5|6

If the curl command fails, you must review the error messages and resolve any issues before using the Vertica Connector for HDFS with your Hadoop cluster. Some debugging steps include:

- Verify the HDFS service's port number.

- Verify that the Hadoop user you specified exists and has read access to the file you are attempting to retrieve.

Installing the Vertica Connector for HDFS

The Vertica Connector for HDFS is not included as part of the Vertica Server installation. You must download it from my.vertica.com and install it on all nodes in your Vertica database.

The connector installation package contains several support libraries in addition to the library for the connector. Unlike some other packages supplied by Micro Focus, you need to install this package on all of the hosts in your Vertica database so each host has a copy of the support libraries.

Downloading and Installing the Vertica Connector for HDFS Package

Follow these steps to install the Vertica Connector for HDFS:

1. Use a Web browser to log into the myVertica portal.

2. Click the Download tab.


3. Locate the section for the Vertica Connector for HDFS that you want and download the installation package that matches the Linux distribution on your Vertica cluster.

4. Copy the installation package to each host in your database.

5. Log into each host as root and run the following command to install the connector package.

- On Red Hat-based Linux distributions, use the command:

rpm -Uvh /path/installation-package.rpm

For example, if you downloaded the Red Hat installation package to the dbadmin home directory, the command to install the package is:

rpm -Uvh /home/dbadmin/vertica-hdfs-connectors-7.0.x86_64.RHEL5.rpm

- On Debian-based systems, use the command:

dpkg -i /path/installation-package.deb

Once you have installed the connector package on each host, you need to load the connector library into Vertica. See Loading the HDFS User Defined Source for instructions.

Loading the HDFS User Defined Source

Once you have installed the Vertica Connector for HDFS package on each host in your database, you need to load the connector's library into Vertica and define the User Defined Source (UDS) in the Vertica catalog. The UDS is what you use to access data from the HDFS. The connector install package contains a SQL script named install.sql that performs these steps for you. To run it:

1. Log into the Administration Host using the database administrator account.

2. Execute the installation script:

vsql -f /opt/vertica/packages/hdfs_connectors/install.sql

3. Enter the database administrator password if prompted.

Note: You only need to run the installation script once in order to create the User Defined Source in the Vertica catalog. You do not need to run the install script on each node.

The SQL install script loads the Vertica Connector for HDFS library and defines the User Defined Source, named HDFS. The output of a successful run of the installation script looks like this:


                  version
-------------------------------------------
 Vertica Analytic Database v7.0.x
(1 row)

CREATE LIBRARY
CREATE SOURCE FUNCTION

Once the install script finishes running, the connector is ready to use.

Loading Data Using the Vertica Connector for HDFS

Once you have installed the Vertica Connector for HDFS, you can use the HDFS User Defined Source (UDS) in a COPY statement to load data from HDFS files.

The basic syntax for using the HDFS UDS in a COPY statement is:

COPY tableName SOURCE Hdfs(url='HDFSFileURL', username='username');

tableName The name of the table to receive the copied data.

HDFSFileURL A string containing one or more comma-separated URLs that identify the file or files to be read. See below for details.

Note: If the URL contains commas, they must be escaped.

username The username of a Hadoop user that has permissions to access the files you want to copy.

The HDFSFileURL parameter is a string containing one or more comma-separated HTTP URLs that identify the files in the HDFS that you want to load. The format for each URL in this string is:

http://NameNode:port/webhdfs/v1/HDFSFilePath

NameNode The host name or IP address of the Hadoop cluster's name node.

Port The port number on which the WebHDFS service is running. This is usually 50070 or 14000, but may be different in your Hadoop installation.

webhdfs/v1/ The protocol being used to retrieve the file. This portion of the URL is always the same; it tells Hadoop to use version 1 of the WebHDFS API.


HDFSFilePath The path from the root of the HDFS filesystem to the file or files you want to load. This path can contain standard Linux wildcards.

Note: Any wildcards you use to specify multiple input files must resolve to files only, and not include any directories. For example, if you specify the path /user/HadoopUser/output/*, and the output directory contains a subdirectory, the connector returns an error message.

For example, the following command loads a single file named /tmp/test.txt from the Hadoop cluster whose name node's host name is hadoop using the Vertica Connector for HDFS:

=> COPY testTable SOURCE Hdfs(url='http://hadoop:50070/webhdfs/v1/tmp/test.txt',
->                            username='hadoopUser');
 Rows Loaded
-------------
           2
(1 row)

Copying Files in Parallel

The basic COPY command shown above copies a single file using just a single host in the Vertica cluster. This is not efficient, since only one Vertica host performs the data load.

To make the load process more efficient, the data you want to load should be in multiple files (which it usually is if you are loading the results of a Hadoop job). You can then load them all by using wildcards in the URL, by supplying multiple comma-separated URLs in the url parameter of the Hdfs user-defined source function call, or both. Loading multiple files through the Vertica Connector for HDFS results in a very efficient load, since the Vertica hosts connect directly to individual nodes in the Hadoop cluster to retrieve files. It is less likely that a single node in either cluster will become a bottleneck that slows down the operation.

For example, the following statement loads all of the files whose filenames start with "part-" located in the /user/hadoopUser/output directory on the HDFS. Assuming there are at least as many files in this directory as there are nodes in your Vertica cluster, all of the nodes in your cluster will fetch and load the data from the HDFS.

=> COPY Customers SOURCE
-> Hdfs(url='http://hadoop:50070/webhdfs/v1/user/hadoopUser/output/part-*',
-> username='hadoopUser');
 Rows Loaded
-------------
       40008
(1 row)

Using multiple comma-separated URLs in the URL string you supply in the COPY statement lets you load data from multiple directories on the HDFS at once (or even data from different Hadoop clusters):


=> COPY Customers SOURCE
-> Hdfs(url='http://hadoop:50070/webhdfs/v1/user/HadoopUser/output/part-*,
-> http://hadoop:50070/webhdfs/v1/user/AnotherUser/part-*',
-> username='HadoopUser');
 Rows Loaded
-------------
       80016
(1 row)

Note: Vertica statements have a 65,000 character limit. If you supply too many long URLs in a single statement, you could go over this limit. Normally, you would only approach this limit if you have automated the generation of the COPY statement.
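If you do generate COPY statements automatically, a simple length check helps you stay under that limit. The following Python sketch is illustrative only; the URLs are taken from the examples above, but the helper itself is hypothetical and not part of the connector:

#!/usr/bin/python
# Hypothetical helper: build a COPY statement from a list of WebHDFS URLs
# and check it against Vertica's 65,000-character statement limit.
STATEMENT_LIMIT = 65000

urls = [
    "http://hadoop:50070/webhdfs/v1/user/hadoopUser/output/part-*",
    "http://hadoop:50070/webhdfs/v1/user/AnotherUser/part-*",
]

copy_stmt = (
    "COPY Customers SOURCE Hdfs(url='%s', username='hadoopUser');"
    % ",".join(urls)
)

if len(copy_stmt) > STATEMENT_LIMIT:
    # Refuse to emit a statement that Vertica would reject.
    raise ValueError("COPY statement is %d characters; over the %d-character limit"
                     % (len(copy_stmt), STATEMENT_LIMIT))

print(copy_stmt)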

Viewing Rejected Rows and Exceptions

COPY statements that use the Vertica Connector for HDFS use the same method for recording rejections and exceptions as other COPY statements. Rejected rows and exceptions are saved to log files, stored by default in the CopyErrorLogs subdirectory in the database's catalog directory. Due to the distributed nature of the Vertica Connector for HDFS, you cannot use the ON option to force all of the exception and rejected-row information to be written to log files on a single Vertica host. You need to collect all of the log files from across the hosts in order to review all of the exception and rejection information.
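One way to gather those files is with a small helper script. The sketch below is hypothetical and not part of the connector: the host names and the CopyErrorLogs path are placeholders you would replace with your own values, and it assumes password-less scp access from the machine where you run it:

#!/usr/bin/python
# Hypothetical log-collection sketch: copy the CopyErrorLogs directory from
# each Vertica node into a local per-host directory so rejection and
# exception files can be reviewed in one place.
import os
import subprocess

vertica_hosts = ["VerticaHost01", "VerticaHost02", "VerticaHost03"]
# Replace with the CopyErrorLogs path inside each node's catalog directory.
error_log_dir = "path-to-catalog-directory/CopyErrorLogs"

for host in vertica_hosts:
    dest = os.path.join("collected_logs", host)
    if not os.path.isdir(dest):
        os.makedirs(dest)
    # Copy the node's rejection/exception logs into the per-host directory.
    subprocess.check_call(["scp", "-r", "%s:%s" % (host, error_log_dir), dest])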

Creating an External Table Based on HDFS Files

You can use the Vertica Connector for HDFS as a source for an external table that lets you directly perform queries on the contents of files on the Hadoop Distributed File System (HDFS). See Using External Tables in the Administrator's Guide for more information on external tables.

Using an external table to access data stored on an HDFS is useful when you need to extract data from files that are periodically updated, or have additional files added on the HDFS. It saves you from having to drop previously loaded data and then reload the data using a COPY statement. The external table always accesses the current version of the files on the HDFS.

Note: An external table performs a bulk load each time it is queried. Its performance is significantly slower than querying an internal Vertica table. You should only use external tables for infrequently-run queries (such as daily reports). If you need to frequently query the content of the HDFS files, you should either use COPY to load the entire content of the files into Vertica, or save the results of a query run on an external table to an internal table, which you then use for repeated queries.

To create an external table that reads data from HDFS, you use the Hdfs User Defined Source (UDS) in a CREATE EXTERNAL TABLE AS COPY statement. The COPY portion of this statement has the same format as the COPY statement used to load data from HDFS. See Loading Data Using the Vertica Connector for HDFS for more information.


The following simple example shows how to create an external table that extracts data from every file in the /user/hadoopUser/example/output directory using the Vertica Connector for HDFS.

=> CREATE EXTERNAL TABLE hadoopExample (A VARCHAR(10), B INTEGER, C INTEGER, D INTEGER)
-> AS COPY SOURCE Hdfs(url=
-> 'http://hadoop01:50070/webhdfs/v1/user/hadoopUser/example/output/*',
-> username='hadoopUser');
CREATE TABLE
=> SELECT * FROM hadoopExample;
   A   | B | C | D
-------+---+---+---
 test1 | 1 | 2 | 3
 test1 | 3 | 4 | 5
(2 rows)

Later, after another Hadoop job adds contents to the output directory, querying the table produces different results:

=> SELECT * FROM hadoopExample;
   A   | B  | C  | D
-------+----+----+----
 test3 | 10 | 11 | 12
 test3 | 13 | 14 | 15
 test2 |  6 |  7 |  8
 test2 |  9 |  0 | 10
 test1 |  1 |  2 |  3
 test1 |  3 |  4 |  5
(6 rows)

Load Errors in External Tables

Normally, querying an external table on HDFS does not produce any errors, even if rows are rejected by the underlying COPY statement (for example, rows containing columns whose contents are incompatible with the data types in the table). Rejected rows are handled the same way they are in a standard COPY statement: they are written to a rejected data file, and are noted in the exceptions file. For more information on how COPY handles rejected rows and exceptions, see Capturing Load Rejections and Exceptions in the Administrator's Guide.

Rejection and exception files are created on all of the nodes that load data from the HDFS. You cannot specify a single node to receive all of the rejected row and exception information. These files are created on each Vertica node as it processes files loaded through the Vertica Connector for HDFS.

Note: Since the connector is read-only, there is no way to store rejection and exception information on the HDFS.

Fatal errors during the transfer of data (for example, specifying files that do not exist on the HDFS) do not occur until you query the external table. The following example shows what happens if you recreate the table based on a file that does not exist on HDFS.

=> DROP TABLE hadoopExample;
DROP TABLE
=> CREATE EXTERNAL TABLE hadoopExample (A INTEGER, B INTEGER, C INTEGER, D INTEGER)
-> AS COPY SOURCE HDFS(url='http://hadoop01:50070/webhdfs/v1/tmp/nofile.txt',
-> username='hadoopUser');
CREATE TABLE
=> SELECT * FROM hadoopExample;
ERROR 0:  Error calling plan() in User Function HdfsFactory at
[src/Hdfs.cpp:222], error code: 0, message: No files match
[http://hadoop01:50070/webhdfs/v1/tmp/nofile.txt]

Note that it is not until you actually query the table that the connector attempts to read the file. Only then does it return an error.

Vertica Connector for HDFS Troubleshooting Tips

Here are a few things to check if you have issues using the Vertica Connector for HDFS:

- If you find a user is suddenly unable to connect to Hadoop through the connector in a Kerberos-enabled environment, it could be that someone exported a new keytab file for the user, which invalidates existing keytab files. You can determine if this is the problem by comparing the key version number associated with the user's principal in Kerberos with the key version number stored in the keytab file.

a. To find the key version number for a user in Kerberos, start the kadmin utility (kadmin.local if you are logged into the Kerberos Key Distribution Center) from the Linux command line and run the getprinc command for the user:

$ sudo kadmin
[sudo] password for dbadmin:
Authenticating as principal root/admin@MYCOMPANY.COM with password.
Password for root/admin@MYCOMPANY.COM:
kadmin:  getprinc exampleuser@MYCOMPANY.COM
Principal: exampleuser@MYCOMPANY.COM
Expiration date: [never]
Last password change: Fri Jul 26 09:40:44 EDT 2013
Password expiration date: [none]
Maximum ticket life: 1 day 00:00:00
Maximum renewable life: 0 days 00:00:00
Last modified: Fri Jul 26 09:40:44 EDT 2013 (root/admin@MYCOMPANY.COM)
Last successful authentication: [never]
Last failed authentication: [never]
Failed password attempts: 0
Number of keys: 2
Key: vno 3, des3-cbc-sha1, no salt
Key: vno 3, des-cbc-crc, no salt
MKey: vno 0
Attributes:
Policy: [none]


In the above example, there are two keys stored for the user, both of which are at version number (vno) 3.

b. Use the klist command to get the version numbers of the keys stored in the keytab file:

$ sudo klist -ek exampleuser.keytab
Keytab name: FILE:exampleuser.keytab
KVNO Principal
---- ----------------------------------------------------------------------
   2 exampleuser@MYCOMPANY.COM (des3-cbc-sha1)
   2 exampleuser@MYCOMPANY.COM (des-cbc-crc)
   3 exampleuser@MYCOMPANY.COM (des3-cbc-sha1)
   3 exampleuser@MYCOMPANY.COM (des-cbc-crc)

The first column in the output lists the key version number. In the above example, the keytab includes both key versions 2 and 3, so the keytab file can be used to authenticate the user with Kerberos.

- You may receive an error message similar to this:

ERROR 5118: UDL specified no execution nodes; at least one execution node must be
specified

To fix this error, ensure that all of the nodes in your cluster have the correct Vertica Connector for HDFS package installed. This error can occur if one or more of the nodes does not have the supporting libraries installed. This may occur if you missed one of the nodes when initially installing the connector, or if a node has been recovered or added since the connector was installed.

Installing and Configuring Kerberos on Your Vertica Cluster

You must perform several steps to configure your Vertica cluster before the Vertica Connector for HDFS can authenticate a Vertica user using Kerberos.

Note: You only need to perform these configuration steps if you are using the connector with Kerberos. In a non-Kerberos environment, the connector does not require your Vertica cluster to have any special configuration.

Perform the following steps on each node in your Vertica cluster:

1. Install the Kerberos libraries and client. To learn how to install these packages, see the documentation for your Linux distribution. On some Red Hat and CentOS versions, you can install these packages by executing the following command as root:


yum install krb5-libs krb5-workstation

On some versions of Debian, you would use the command:

apt-get install krb5-config krb5-user krb5-clients

2. Update the Kerberos configuration file (/etc/krb5.conf) to reflect your site's Kerberos configuration. The easiest method of doing this is to copy the /etc/krb5.conf file from your Kerberos Key Distribution Center (KDC) server.

3. Copy the keytab files for the users to a directory on the node (see Preparing Keytab Files for the Vertica Connector for HDFS). The absolute path to these files must be the same on every node in your Vertica cluster.

4. Ensure the keytab files are readable by the database administrator's Linux account (usually dbadmin). The easiest way to do this is to change ownership of the files to dbadmin:

sudo chown dbadmin *.keytab


Preparing Keytab Files for the Vertica Connector for HDFS

The Vertica Connector for HDFS uses keytab files to authenticate Vertica users with Kerberos, so they can access files on the Hadoop HDFS filesystem. These files take the place of entering a password at a Kerberos login prompt.

You must have a keytab file for each Vertica user that needs to use the connector. The keytab file must match the Kerberos credentials of a Hadoop user that has access to the HDFS.

Caution: Exporting a keytab file for a user changes the version number associated with the user's Kerberos account. This change invalidates any previously exported keytab file for the user. If a keytab file has already been created for a user and is currently in use, you should use that keytab file rather than exporting a new keytab file. Otherwise, exporting a new keytab file will cause any processes using an existing keytab file to no longer be able to authenticate.

To export a keytab file for a user:

1. Start the kadmin utility:

- If you have access to the root account on the Kerberos Key Distribution Center (KDC) system, log into it and use the kadmin.local command. (If this command is not in the system search path, try /usr/kerberos/sbin/kadmin.local.)

- If you do not have access to the root account on the Kerberos KDC, you can use the kadmin command from a Kerberos client system as long as you have the password for the Kerberos administrator account. When you start kadmin, it will prompt you for the Kerberos administrator's password. You may need to have root privileges on the client system in order to run kadmin.

2. Use kadmin's xst (export) command to export the user's credentials to a keytab file:

xst -k username.keytab username@YOURDOMAIN.COM


where username is the name of the Kerberos principal you want to export, and YOURDOMAIN.COM is your site's Kerberos realm. This command creates a keytab file named username.keytab in the current directory.

The following example demonstrates exporting a keytab file for a user named exampleuser@MYCOMPANY.COM using the kadmin command on a Kerberos client system:

$ sudo kadmin
[sudo] password for dbadmin:
Authenticating as principal root/admin@MYCOMPANY.COM with password.
Password for root/admin@MYCOMPANY.COM:
kadmin:  xst -k exampleuser.keytab exampleuser@MYCOMPANY.COM
Entry for principal exampleuser@MYCOMPANY.COM with kvno 2, encryption
type des3-cbc-sha1 added to keytab WRFILE:exampleuser.keytab.
Entry for principal exampleuser@MYCOMPANY.COM with kvno 2, encryption
type des-cbc-crc added to keytab WRFILE:exampleuser.keytab.

After exporting the keytab file, you can use the klist command to list the keys stored in the file:

$ sudo klist -k exampleuser.keytab
[sudo] password for dbadmin:
Keytab name: FILE:exampleuser.keytab
KVNO Principal
---- -----------------------------------------------------------------
   2 exampleuser@MYCOMPANY.COM
   2 exampleuser@MYCOMPANY.COM


Using the HCatalog Connector

The Vertica HCatalog Connector lets you access data stored in Apache's Hive data warehouse software the same way you access it within a native Vertica table.

Hive, HCatalog, and WebHCat Overview

There are several Hadoop components that you need to understand in order to use the HCatalog Connector:

- Apache's Hive lets you query data stored in a Hadoop Distributed File System (HDFS) the same way you query data stored in a relational database. Behind the scenes, Hive uses a set of serializer and deserializer (SerDe) classes to extract data from files stored on the HDFS and break it into columns and rows. Each SerDe handles data files in a specific format. For example, one SerDe extracts data from comma-separated data files while another interprets data stored in JSON format.

- Apache HCatalog is a component of the Hadoop ecosystem that makes Hive's metadata available to other Hadoop components (such as Pig).

- WebHCat (formerly known as Templeton) makes HCatalog and Hive data available via a REST web API. Through it, you can make an HTTP request to retrieve data stored in Hive, as well as information about the Hive schema.

Vertica's HCatalog Connector lets you transparently access data that is available through WebHCat. You use the connector to define a schema in Vertica that corresponds to a Hive database or schema. When you query data within this schema, the HCatalog Connector transparently extracts and formats the data from Hadoop into tabular data. The data within this HCatalog schema appears as if it is native to Vertica. You can even perform operations such as joins between Vertica-native tables and HCatalog tables. For more details, see How the HCatalog Connector Works.

HCatalog Connection Features

The HCatalog Connector lets you query data stored in Hive using the Vertica native SQL syntax. Some of its main features are:

l The HCatalog Connector always reflects the current state of data stored in Hive.

l The HCatalog Connector uses the parallel nature of both Vertica and Hadoop to process Hive data. The result is that querying data through the HCatalog Connector is often faster than querying the data directly through Hive.

l Since Vertica performs the extraction and parsing of data, the HCatalog Connector does not significantly increase the load on your Hadoop cluster.


l The data you query through the HCatalog Connector can be used as if it were native Vertica data. For example, you can execute a query that joins data from a table in an HCatalog schema with a native table.

HCatalog Connection Considerations

There are a few things to keep in mind when using the HCatalog Connector:

l Hive's data is stored in flat files in a distributed filesystem, requiring it to be read and deserialized each time it is queried. This deserialization causes Hive's performance to be much slower than Vertica. The HCatalog Connector has to perform the same process as Hive to read the data. Therefore, querying data stored in Hive using the HCatalog Connector is much slower than querying a native Vertica table. If you need to perform extensive analysis on data stored in Hive, you should consider loading it into Vertica through the HCatalog Connector or the WebHDFS connector. Vertica optimization often makes querying data through the HCatalog Connector faster than directly querying it through Hive.

l Hive supports complex data types such as lists, maps, and structs that Vertica does not support. Columns containing these data types are converted to a JSON representation of the data type and stored as a VARCHAR. See Data Type Conversions from Hive to Vertica.

Note: The HCatalog Connector is read only. It cannot insert data into Hive.

How the HCatalog Connector Works

When planning a query that accesses data from a Hive table, the Vertica HCatalog Connector on the initiator node contacts the WebHCat server in your Hadoop cluster to determine if the table exists. If it does, the connector retrieves the table's metadata from the metastore database so the query planning can continue. When the query executes, all nodes in the Vertica cluster directly retrieve the data necessary for completing the query from the Hadoop HDFS. They then use the Hive SerDe classes to extract the data so the query can execute.


This approach takes advantage of the parallel nature of both Vertica and Hadoop. In addition, by performing the retrieval and extraction of data directly, the HCatalog Connector reduces the impact of the query on the Hadoop cluster.

HCatalog Connector Requirements

Both your Vertica and Hadoop installations must meet some requirements before you can use the HCatalog Connector.

Vertica Requirements

All of the nodes in your cluster must have a Java Virtual Machine (JVM) installed. See Installing the Java Runtime on Your Vertica Cluster.

Hadoop Requirements

Your Hadoop cluster must meet several requirements in order for it to work with the HCatalog Connector:

l It must have Hive and HCatalog installed and running. See Apache's HCatalog page for more information.

l It must have WebHCat (formerly known as Templeton) installed and running. See Apache's WebHCat page for details.


l The WebHCat server and all of the HDFS nodes that store HCatalog data must be directly accessible from all of the hosts in your Vertica database. You need to ensure that any firewall separating the Hadoop cluster and the Vertica cluster will let WebHCat, metastore database, and HDFS traffic through.

l The data that you want to query must be in an internal or external Hive table.

l If a table you want to query uses a non-standard SerDe, you must install the SerDe's classes on your Vertica cluster before you can query the data. See Using Non-Standard SerDes.

The Vertica Connector for HCatalog has been tested with:

l HCatalog 0.5

l Hive 0.10 and 0.11

l Hadoop 2.0

Testing Connectivity

To test the connection between your database cluster and WebHCat, log into a node in your Vertica cluster and run the following command to execute an HCatalog query:

$ curl http://webHCatServer:port/templeton/v1/status?user.name=hcatUsername

Where:

l webHCatServer is the IP address or hostname of theWebHCat server

l port is the port number assigned to theWebHCat service (usually 50111)

l hcatUsername is a valid username authorized to use HCatalog

You will usually want to append ;echo to the command to add a linefeed after the curl command's output. Otherwise, the command prompt is appended to the command's output, making it harder to read.

For example:

$ curl http://hcathost:50111/templeton/v1/status?user.name=hive; echo

If there are no errors, this command returns a status message in JSON format, similar to the following:

{"status":"ok","version":"v1"}

This result indicates that WebHCat is running and that the Vertica host can connect to it and retrieve a result. If you do not receive this result, troubleshoot your Hadoop installation and the connectivity between your Hadoop and Vertica clusters.


You can also run some queries to verify that WebHCat is correctly configured to work with Hive. The following example demonstrates listing the databases defined in Hive and the tables defined within a database:

$ curl http://hcathost:50111/templeton/v1/ddl/database?user.name=hive; echo
{"databases":["default","production"]}
$ curl http://hcathost:50111/templeton/v1/ddl/database/default/table?user.name=hive; echo
{"tables":["messages","weblogs","tweets","transactions"],"database":"default"}

See Apache's WebHCat reference for details about querying Hive usingWebHCat.

Installing the Java Runtime on Your Vertica Cluster

The HCatalog Connector requires a 64-bit Java Virtual Machine (JVM). The JVM must support Java 6 or later.

Note: If your Vertica cluster is configured to execute User Defined Extensions (UDxs) written in Java, it already has a correctly-configured JVM installed. See Developing User Defined Functions in Java in the Programmer's Guide for more information.

Installing Java on your Vertica cluster is a two-step process:

1. Install a Java runtime on all of the hosts in your cluster.

2. Set the JavaBinaryForUDx configuration parameter to tell Vertica the location of the Java executable.

Installing a Java Runtime

For Java-based features, Vertica requires a 64-bit Java 6 (Java version 1.6) or later Java runtime. Vertica supports runtimes from either Oracle or OpenJDK. You can choose to install either the Java Runtime Environment (JRE) or Java Development Kit (JDK), since the JDK also includes the JRE.

Many Linux distributions include a package for the OpenJDK runtime. See your Linux distribution's documentation for information about installing and configuring OpenJDK.

To install the Oracle Java runtime, see the Java Standard Edition (SE) Download Page. You usually must run the installation package as root. See the download page for instructions.

Once you have installed a JVM on each host, ensure that the java command is in the search path and calls the correct JVM by running the command:

$ java -version

This command should print something similar to:


java version "1.6.0_37"
Java(TM) SE Runtime Environment (build 1.6.0_37-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01, mixed mode)

Note: Any previously installed Java VM on your hosts may interfere with a newly installed Java runtime. See your Linux distribution's documentation for instructions on configuring which JVM is the default. Unless absolutely required, you should uninstall any incompatible version of Java before installing the Java 6 or Java 7 runtime.

Setting the JavaBinaryForUDx Configuration Parameter

The JavaBinaryForUDx configuration parameter tells Vertica where to look for the JRE to execute Java UDxs. After you have installed the JRE on all of the nodes in your cluster, set this parameter to the absolute path of the Java executable. You can use the symbolic link that some Java installers create (for example /usr/bin/java). If the Java executable is in your shell search path, you can get the path of the Java executable by running the following command from the Linux command line shell:

$ which java
/usr/bin/java

If the java command is not in the shell search path, use the path to the Java executable in the directory where you installed the JRE. Suppose you installed the JRE in /usr/java/default (which is where the installation package supplied by Oracle installs the Java 1.6 JRE). In this case, the Java executable is /usr/java/default/bin/java.

You set the configuration parameter by executing the following statement as a database superuser:

=> SELECT SET_CONFIG_PARAMETER('JavaBinaryForUDx','/usr/bin/java');

See SET_CONFIG_PARAMETER in the SQL Reference Manual for more information on setting configuration parameters.

To view the current setting of the configuration parameter, query the CONFIGURATION_PARAMETERS system table:

=> \x
Expanded display is on.
=> SELECT * FROM CONFIGURATION_PARAMETERS WHERE parameter_name = 'JavaBinaryForUDx';
-[ RECORD 1 ]-----------------+----------------------------------------------------------
node_name                     | ALL
parameter_name                | JavaBinaryForUDx
current_value                 | /usr/bin/java
default_value                 |
change_under_support_guidance | f
change_requires_restart       | f
description                   | Path to the java binary for executing UDx written in Java


Once you have set the configuration parameter, Vertica can find the Java executable on each node in your cluster.

Note: Since the location of the Java executable is set by a single configuration parameter for the entire cluster, you must ensure that the Java executable is installed in the same path on all of the hosts in the cluster.

Defining a Schema Using the HCatalog Connector

After you set up the HCatalog Connector, you can use it to define a schema in your Vertica database to access the tables in a Hive database. You define the schema using the CREATE HCATALOG SCHEMA statement. See CREATE HCATALOG SCHEMA in the SQL Reference Manual for a full description.

When creating the schema, you must supply at least two pieces of information:

l the name of the schema to define in Vertica

l the host name or IP address of Hive's metastore database (the database server that contains metadata about Hive's data, such as the schema and table definitions)

Other parameters are optional. If you do not supply a value, Vertica uses default values.

After you define the schema, you can query the data in the Hive data warehouse in the same way you query a native Vertica table. The following example demonstrates creating an HCatalog schema and then querying several system tables to examine the contents of the new schema. See Viewing Hive Schema and Table Metadata for more information about these tables.

=> CREATE HCATALOG SCHEMA hcat WITH hostname='hcathost' HCATALOG_SCHEMA='default'
-> HCATALOG_USER='hcatuser';
CREATE SCHEMA
=> -- Show list of all HCatalog schemas
=> \x
Expanded display is on.
=> SELECT * FROM v_catalog.hcatalog_schemata;
-[ RECORD 1 ]--------+------------------------------
schema_id            | 45035996273748980
schema_name          | hcat
schema_owner_id      | 45035996273704962
schema_owner         | dbadmin
create_time          | 2013-11-04 15:09:03.504094-05
hostname             | hcathost
port                 | 9933
webservice_hostname  | hcathost
webservice_port      | 50111
hcatalog_schema_name | default
hcatalog_user_name   | hcatuser
metastore_db_name    | hivemetastoredb


=> -- List the tables in all HCatalog schemas
=> SELECT * FROM v_catalog.hcatalog_table_list;
-[ RECORD 1 ]------+------------------
table_schema_id    | 45035996273748980
table_schema       | hcat
hcatalog_schema    | default
table_name         | messages
hcatalog_user_name | hcatuser
-[ RECORD 2 ]------+------------------
table_schema_id    | 45035996273748980
table_schema       | hcat
hcatalog_schema    | default
table_name         | weblog
hcatalog_user_name | hcatuser
-[ RECORD 3 ]------+------------------
table_schema_id    | 45035996273748980
table_schema       | hcat
hcatalog_schema    | default
table_name         | tweets
hcatalog_user_name | hcatuser

Querying Hive Tables Using HCatalog Connector

Once you have defined the HCatalog schema, you can query data from the Hive database by using the schema name in your query.

=> SELECT * from hcat.messages limit 10;
 messageid |   userid   |        time         |               message
-----------+------------+---------------------+-------------------------------------
         1 | nPfQ1ayhi  | 2013-10-29 00:10:43 | hymenaeos cursus lorem Suspendis
         2 | N7svORIoZ  | 2013-10-29 00:21:27 | Fusce ad sem vehicula morbi
         3 | 4VvzN3d    | 2013-10-29 00:32:11 | porta Vivamus condimentum
         4 | heojkmTmc  | 2013-10-29 00:42:55 | lectus quis imperdiet
         5 | coROws3OF  | 2013-10-29 00:53:39 | sit eleifend tempus a aliquam mauri
         6 | oDRP1i     | 2013-10-29 01:04:23 | risus facilisis sollicitudin sceler
         7 | AU7a9Kp    | 2013-10-29 01:15:07 | turpis vehicula tortor
         8 | ZJWg185DkZ | 2013-10-29 01:25:51 | sapien adipiscing eget Aliquam tor
         9 | E7ipAsYC3  | 2013-10-29 01:36:35 | varius Cum iaculis metus
        10 | kStCv      | 2013-10-29 01:47:19 | aliquam libero nascetur Cum mal
(10 rows)

Since the tables you access through the HCatalog Connector act like Vertica tables, you can perform operations that use both Hive data and native Vertica data, such as a join:

=> SELECT u.FirstName, u.LastName, d.time, d.Message from UserData u
-> JOIN hcat.messages d ON u.UserID = d.UserID LIMIT 10;
 FirstName | LastName |        time         |              Message
-----------+----------+---------------------+-------------------------------------
 Whitney   | Kerr     | 2013-10-29 00:10:43 | hymenaeos cursus lorem Suspendis
 Troy      | Oneal    | 2013-10-29 00:32:11 | porta Vivamus condimentum
 Renee     | Coleman  | 2013-10-29 00:42:55 | lectus quis imperdiet
 Fay       | Moss     | 2013-10-29 00:53:39 | sit eleifend tempus a aliquam mauri
 Dominique | Cabrera  | 2013-10-29 01:15:07 | turpis vehicula tortor
 Mohammad  | Eaton    | 2013-10-29 00:21:27 | Fusce ad sem vehicula morbi
 Cade      | Barr     | 2013-10-29 01:25:51 | sapien adipiscing eget Aliquam tor
 Oprah     | Mcmillan | 2013-10-29 01:36:35 | varius Cum iaculis metus
 Astra     | Sherman  | 2013-10-29 01:58:03 | dignissim odio Pellentesque primis
 Chelsea   | Malone   | 2013-10-29 02:08:47 | pede tempor dignissim Sed luctus
(10 rows)

Viewing Hive Schema and Table Metadata

When using Hive, you access metadata about schemata and tables by executing statements written in HiveQL (Hive's version of SQL) such as SHOW TABLES. When using the HCatalog Connector, you can get metadata about the tables in the Hive database through several Vertica system tables.

There are four system tables that contain metadata about the tables accessible through the HCatalog Connector:

l HCATALOG_SCHEMATA lists all of the schemata (plural of schema) that have been defined using the HCatalog Connector. See HCATALOG_SCHEMATA in the SQL Reference Manual for detailed information.

l HCATALOG_TABLE_LIST contains an overview of all of the tables available from all schemata defined using the HCatalog Connector. This table only shows the tables which the user querying the table can access. The information in this table is retrieved using a single call to WebHCat for each schema defined using the HCatalog Connector, which means there is a little overhead when querying this table. See HCATALOG_TABLE_LIST in the SQL Reference Manual for detailed information.

l HCATALOG_TABLES contains more in-depth information than HCATALOG_TABLE_LIST. However, querying this table results in Vertica making a REST web service call to WebHCat for each table available through the HCatalog Connector. If there are many tables in the HCatalog schemata, this query could take a while to complete. See HCATALOG_TABLES in the SQL Reference Manual for more information.

l HCATALOG_COLUMNS lists metadata about all of the columns in all of the tables available through the HCatalog Connector. Similarly to HCATALOG_TABLES, querying this table results in one call to WebHCat per table, and therefore can take a while to complete. See HCATALOG_COLUMNS in the SQL Reference Manual for more information.

The following example demonstrates querying the system tables containing metadata for the tables available through the HCatalog Connector.

=> CREATE HCATALOG SCHEMA hcat WITH hostname='hcathost'
-> HCATALOG_SCHEMA='default' HCATALOG_DB='default' HCATALOG_USER='hcatuser';
CREATE SCHEMA
=> SELECT * FROM HCATALOG_SCHEMATA;


-[ RECORD 1 ]--------+-----------------------------
schema_id            | 45035996273864536
schema_name          | hcat
schema_owner_id      | 45035996273704962
schema_owner         | dbadmin
create_time          | 2013-11-05 10:19:54.70965-05
hostname             | hcathost
port                 | 9083
webservice_hostname  | hcathost
webservice_port      | 50111
hcatalog_schema_name | default
hcatalog_user_name   | hcatuser
metastore_db_name    | hivemetastoredb

=> SELECT * FROM HCATALOG_TABLE_LIST;
-[ RECORD 1 ]------+------------------
table_schema_id    | 45035996273864536
table_schema       | hcat
hcatalog_schema    | default
table_name         | hcatalogtypes
hcatalog_user_name | hcatuser
-[ RECORD 2 ]------+------------------
table_schema_id    | 45035996273864536
table_schema       | hcat
hcatalog_schema    | default
table_name         | tweets
hcatalog_user_name | hcatuser
-[ RECORD 3 ]------+------------------
table_schema_id    | 45035996273864536
table_schema       | hcat
hcatalog_schema    | default
table_name         | messages
hcatalog_user_name | hcatuser
-[ RECORD 4 ]------+------------------
table_schema_id    | 45035996273864536
table_schema       | hcat
hcatalog_schema    | default
table_name         | msgjson
hcatalog_user_name | hcatuser

=> -- Get detailed description of a specific table
=> SELECT * FROM HCATALOG_TABLES WHERE table_name = 'msgjson';
-[ RECORD 1 ]---------+-----------------------------------------------------------
table_schema_id       | 45035996273864536
table_schema          | hcat
hcatalog_schema       | default
table_name            | msgjson
hcatalog_user_name    | hcatuser
min_file_size_bytes   | 13524
total_number_files    | 10
location              | hdfs://hive.example.com:8020/user/exampleuser/msgjson
last_update_time      | 2013-11-05 14:18:07.625-05
output_format         | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
last_access_time      | 2013-11-11 13:21:33.741-05
max_file_size_bytes   | 45762
is_partitioned        | f
partition_expression  |
table_owner           | hcatuser
input_format          | org.apache.hadoop.mapred.TextInputFormat
total_file_size_bytes | 453534
hcatalog_group        | supergroup
permission            | rwxr-xr-x

=> -- Get list of columns in a specific table
=> SELECT * FROM HCATALOG_COLUMNS WHERE table_name = 'hcatalogtypes'
-> ORDER BY ordinal_position;
-[ RECORD 1 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | intcol
hcatalog_data_type       | int
data_type                | int
data_type_id             | 6
data_type_length         | 8
character_maximum_length |
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 1
-[ RECORD 2 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | floatcol
hcatalog_data_type       | float
data_type                | float
data_type_id             | 7
data_type_length         | 8
character_maximum_length |
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 2
-[ RECORD 3 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | doublecol
hcatalog_data_type       | double
data_type                | float
data_type_id             | 7
data_type_length         | 8
character_maximum_length |
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 3
-[ RECORD 4 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | charcol
hcatalog_data_type       | string
data_type                | varchar(65000)
data_type_id             | 9
data_type_length         | 65000
character_maximum_length | 65000
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 4
-[ RECORD 5 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | varcharcol
hcatalog_data_type       | string
data_type                | varchar(65000)
data_type_id             | 9
data_type_length         | 65000
character_maximum_length | 65000
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 5
-[ RECORD 6 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | boolcol
hcatalog_data_type       | boolean
data_type                | boolean
data_type_id             | 5
data_type_length         | 1
character_maximum_length |
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 6
-[ RECORD 7 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | timestampcol
hcatalog_data_type       | string
data_type                | varchar(65000)
data_type_id             | 9
data_type_length         | 65000
character_maximum_length | 65000
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 7
-[ RECORD 8 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | varbincol
hcatalog_data_type       | binary
data_type                | varbinary(65000)
data_type_id             | 17
data_type_length         | 65000
character_maximum_length | 65000
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 8
-[ RECORD 9 ]------------+-----------------
table_schema             | hcat
hcatalog_schema          | default
table_name               | hcatalogtypes
is_partition_column      | f
column_name              | bincol
hcatalog_data_type       | binary
data_type                | varbinary(65000)
data_type_id             | 17
data_type_length         | 65000
character_maximum_length | 65000
numeric_precision        |
numeric_scale            |
datetime_precision       |
interval_precision       |
ordinal_position         | 9

Synching an HCatalog Schema With a Local Schema

Querying data from an HCatalog schema can be slow due to Hive and WebHCat performance issues. This slow performance can be especially annoying when you use the HCatalog Connector to query the HCatalog schema's metadata to examine the structure of the tables in the Hive database.

To avoid this problem, you can use the SYNC_WITH_HCATALOG_SCHEMA function to create a snapshot of the HCatalog schema's metadata within a Vertica schema. You supply this function with the name of a pre-existing Vertica schema and an HCatalog schema available through the HCatalog Connector. It creates a set of external tables within the Vertica schema that you can then use to examine the structure of the tables in the Hive database. Since the data in the Vertica schema is local, queries are much faster. You can also use standard Vertica statements and system table queries to examine the structure of Hive tables in the HCatalog schema.


Caution: The SYNC_WITH_HCATALOG_SCHEMA function overwrites tables in the Vertica schema whose names match a table in the HCatalog schema. To avoid losing data, always create an empty Vertica schema to sync with an HCatalog schema.

The Vertica schema is just a snapshot of the HCatalog schema's metadata. Vertica does not synchronize later changes to the HCatalog schema with the local schema after you call SYNC_WITH_HCATALOG_SCHEMA. You can call the function again to re-synchronize the local schema to the HCatalog schema.

Note: By default, the function does not drop tables that appear in the local schema but do not appear in the HCatalog schema. Thus, after the function call, the local schema does not reflect tables that have been dropped in the Hive database. You can change this behavior by supplying the optional third Boolean argument, which tells the function to drop any table in the local schema that does not correspond to a table in the HCatalog schema (see the sketch after the example below).

The following example demonstrates calling SYNC_WITH_HCATALOG_SCHEMA to sync the HCatalog schema named hcat with a local schema.

=> CREATE SCHEMA hcat_local;
CREATE SCHEMA
=> SELECT sync_with_hcatalog_schema('hcat_local', 'hcat');
       sync_with_hcatalog_schema
----------------------------------------
Schema hcat_local synchronized with hcat
tables in hcat = 56
tables altered in hcat_local = 0
tables created in hcat_local = 56
stale tables in hcat_local = 0
table changes erred in hcat_local = 0
(1 row)

=> -- Use vsql's \d command to describe a table in the synced schema
=> \d hcat_local.messages
List of Fields by Tables
   Schema   |  Table   | Column  |      Type      | Size  | Default | Not Null | Primary Key | Foreign Key
------------+----------+---------+----------------+-------+---------+----------+-------------+-------------
 hcat_local | messages | id      | int            |     8 |         | f        | f           |
 hcat_local | messages | userid  | varchar(65000) | 65000 |         | f        | f           |
 hcat_local | messages | "time"  | varchar(65000) | 65000 |         | f        | f           |
 hcat_local | messages | message | varchar(65000) | 65000 |         | f        | f           |
(4 rows)
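If you also want the sync to drop stale local tables, you can supply the optional third argument described in the note above. The following is a sketch only, assuming the argument is passed as a Boolean literal:

=> -- Re-sync and drop any local table that no longer exists in the HCatalog schema
=> SELECT sync_with_hcatalog_schema('hcat_local', 'hcat', true);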

Note: You can query tables in the local schema you synched with an HCatalog schema. Querying tables in a synched schema isn't much faster than directly querying the HCatalog schema, because SYNC_WITH_HCATALOG_SCHEMA only duplicates the HCatalog schema's metadata. The data in the tables is still retrieved using the HCatalog Connector.

Data Type Conversions from Hive to Vertica

The data types recognized by Hive differ from the data types recognized by Vertica. The following table lists how the HCatalog Connector converts Hive data types into data types compatible with Vertica.

Hive Data Type       Vertica Data Type
TINYINT (1-byte)     TINYINT (8-bytes)
SMALLINT (2-bytes)   SMALLINT (8-bytes)
INT (4-bytes)        INT (8-bytes)
BIGINT (8-bytes)     BIGINT (8-bytes)
BOOLEAN              BOOLEAN
FLOAT (4-bytes)      FLOAT (8-bytes)
DOUBLE (8-bytes)     DOUBLE PRECISION (8-bytes)
STRING (2GB max)     VARCHAR (65000)
BINARY (2GB max)     VARBINARY (65000)
LIST/ARRAY           VARCHAR (65000) containing a JSON-format representation of the list.
MAP                  VARCHAR (65000) containing a JSON-format representation of the map.
STRUCT               VARCHAR (65000) containing a JSON-format representation of the struct.

Data-Width Handling Differences Between Hive and Vertica

The HCatalog Connector relies on Hive SerDe classes to extract data from files on HDFS. Therefore, the data read from these files is subject to Hive's data width restrictions. For example, suppose the SerDe parses a value for an INT column into a value that is greater than 2^31-1 (the maximum value for a signed 32-bit integer). In this case, the value is rejected even though it would fit into Vertica's 64-bit INTEGER column, because it cannot fit into Hive's 32-bit INT.

Once the value has been parsed and converted to a Vertica data type, it is treated as native data. This treatment can result in some confusion when comparing the results of an identical query run in Hive and in Vertica. For example, if your query adds two INT values and the result is larger than 2^31-1, the value overflows its 32-bit INT data type, causing Hive to return an error. When running the same query with the same data in Vertica using the HCatalog Connector, the value will probably still fit within Vertica's 64-bit INTEGER, so the addition succeeds and returns a value.
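As an illustration, the following sketch uses the intcol column of the hcatalogtypes table from the earlier metadata examples; the behavior described for large values follows from the width rules above rather than from a specific test run:

-- In Hive, this addition is performed with 32-bit INT arithmetic and can overflow:
--   hive> SELECT intcol + intcol FROM hcatalogtypes;
-- Through the HCatalog Connector, intcol is already a 64-bit Vertica INTEGER,
-- so the same addition succeeds as long as the result fits in 64 bits:
=> SELECT intcol + intcol FROM hcat.hcatalogtypes;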

Using Non-Standard SerDes

Hive stores its data in unstructured flat files located in the Hadoop Distributed File System (HDFS). When you execute a Hive query, it uses a set of serializer and deserializer (SerDe) classes to extract data from these flat files and organize it into a relational database table. For Hive to be able to extract data from a file, it must have a SerDe that can parse the data the file contains. When you create a table in Hive, you can select the SerDe to be used for the table's data.

Hive has a set of standard SerDes that handle data in several formats such as delimited data and data extracted using regular expressions. You can also use third-party or custom-defined SerDes that allow Hive to process data stored in other file formats. For example, some commonly-used third-party SerDes handle data stored in JSON format.

The HCatalog Connector directly fetches file segments from HDFS and uses Hive's SerDe classes to extract data from them. The Connector includes all of Hive's standard SerDe classes, so it can process data stored in any file format that Hive natively supports. If you want to query data from a Hive table that uses a custom SerDe, you must first install the SerDe classes on the Vertica cluster.

Determining Which SerDe You Need

If you have access to the Hive command line, you can determine which SerDe a table uses by using Hive's SHOW CREATE TABLE statement. This statement shows the HiveQL statement needed to recreate the table. For example:

hive> SHOW CREATE TABLE msgjson;
OK
CREATE EXTERNAL TABLE msgjson(
messageid int COMMENT 'from deserializer',
userid string COMMENT 'from deserializer',
time string COMMENT 'from deserializer',
message string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://hivehost.example.com:8020/user/exampleuser/msgjson'
TBLPROPERTIES (
'transient_lastDdlTime'='1384194521')
Time taken: 0.167 seconds

In the example, ROW FORMAT SERDE indicates that a special SerDe is used to parse the data files. The next row shows that the class for the SerDe is named org.apache.hadoop.hive.contrib.serde2.JsonSerde. You must provide the HCatalog Connector with a copy of this SerDe class so that it can read the data from this table.

Hadoop Integration GuideUsing the HCatalog Connector

Micro Focus Vertica Analytic Database (7.0.x) Page 68 of 75

Page 69: HP Vertica Analytics Platform 7.0.x Hadoop Integration …HADOOP_HOME/conf/hadoop-env.shfile,ensurethattheJAVA_HOMEenvironment variableissettoyourJavainstallation.

You can also find out which SerDe class you need by querying the table that uses the custom SerDe. The query will fail with an error message that contains the class name of the SerDe needed to parse the data in the table. In the following example, the portion of the error message that names the missing SerDe class is the line reading SerDe com.cloudera.hive.serde.JSONSerDe does not exist.

=> SELECT * FROM hcat.jsontable;
ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined
Object [VHCatSource], error code: 0
com.vertica.sdk.UdfException: Error message is [
org.apache.hcatalog.common.HCatException : 2004 : HCatOutputFormat not
initialized, setOutput has to be called. Cause : java.io.IOException:
java.lang.RuntimeException:
MetaException(message:org.apache.hadoop.hive.serde2.SerDeException
SerDe com.cloudera.hive.serde.JSONSerDe does not exist) ] HINT If error
message is not descriptive or local, may be we cannot read metadata from hive
metastore service thrift://hcathost:9083 or HDFS namenode (check
UDxLogs/UDxFencedProcessesJava.log in the catalog directory for more information)
at com.vertica.hcatalogudl.HCatalogSplitsNoOpSourceFactory
.plan(HCatalogSplitsNoOpSourceFactory.java:98)
at com.vertica.udxfence.UDxExecContext.planUDSource(UDxExecContext.java:898)
. . .

Installing the SerDe on the Vertica Cluster

You usually have two options for getting the SerDe class file the HCatalog Connector needs:

l Find the installation files for the SerDe, then copy those over to your Vertica cluster. For example, there are several third-party JSON SerDes available from sites like Google Code and GitHub. You may find the one that matches the file installed on your Hive cluster. If so, then download the package and copy it to your Vertica cluster.

l Directly copy the JAR files from a Hive server onto your Vertica cluster. The location for the SerDe JAR files depends on your Hive installation. On some systems, they may be located in /usr/lib/hive/lib.

Wherever you get the files, copy them into the /opt/vertica/packages/hcat/lib directory on every node in your Vertica cluster.

Important: If you add a new host to your Vertica cluster, remember to copy every custom SerDe JAR file to it.

Troubleshooting HCatalog Connector Problems

You may encounter the following issues when using the HCatalog Connector.

Connection Errors

When you use CREATE HCATALOG SCHEMA to create a new schema, the HCatalog Connector does not actually attempt to connect to the WebHCat or metastore servers. When you execute a query using the schema or HCatalog-related system tables, the connector attempts to connect to and retrieve data from your Hadoop cluster.

The types of errors you get will depend on which parameters are incorrect. Suppose you have incorrect parameters for the metastore database, but correct parameters for WebHCat. In this case, HCatalog-related system table queries succeed, while queries on the HCatalog schema fail. The following example demonstrates creating an HCatalog schema with the correct default WebHCat information but the incorrect port number for the metastore database.

=> CREATE HCATALOG SCHEMA hcat2 WITH hostname='hcathost'
-> HCATALOG_SCHEMA='default' HCATALOG_USER='hive' PORT=1234;
CREATE SCHEMA
=> SELECT * FROM HCATALOG_TABLE_LIST;
-[ RECORD 1 ]------+---------------------
table_schema_id    | 45035996273864536
table_schema       | hcat2
hcatalog_schema    | default
table_name         | test
hcatalog_user_name | hive

=> SELECT * FROM hcat2.test;
ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined
Object [VHCatSource], error code: 0
com.vertica.sdk.UdfException: Error message is [
org.apache.hcatalog.common.HCatException : 2004 : HCatOutputFormat not
initialized, setOutput has to be called. Cause : java.io.IOException:
MetaException(message:Could not connect to meta store using any of the URIs
provided. Most recent failure: org.apache.thrift.transport.TTransportException:
java.net.ConnectException: Connection refused
at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:277)
. . .

To resolve these issues, you must drop the schema and recreate it with the correct parameters. If you still have issues, determine whether there are connectivity issues between your Vertica cluster and your Hadoop cluster. Such issues can include a firewall that prevents one or more Vertica hosts from contacting the WebHCat, metastore, or HDFS hosts.
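For example, to correct the schema created above, you might drop it and recreate it pointing at the metastore's actual port (9083 in the earlier HCATALOG_SCHEMATA output). This is a sketch rather than output from a live session; depending on your catalog contents, the CASCADE keyword may or may not be required:

=> DROP SCHEMA hcat2 CASCADE;
=> CREATE HCATALOG SCHEMA hcat2 WITH hostname='hcathost'
-> HCATALOG_SCHEMA='default' HCATALOG_USER='hive' PORT=9083;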

SerDe Errors

If you attempt to query a Hive table that uses a non-standard SerDe, and you have not installed the SerDe JAR files on your Vertica cluster, you will receive an error similar to the following example:

=> SELECT * FROM hcat.jsontable;
ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined
Object [VHCatSource], error code: 0
com.vertica.sdk.UdfException: Error message is [
org.apache.hcatalog.common.HCatException : 2004 : HCatOutputFormat not
initialized, setOutput has to be called. Cause : java.io.IOException:
java.lang.RuntimeException:
MetaException(message:org.apache.hadoop.hive.serde2.SerDeException
SerDe com.cloudera.hive.serde.JSONSerDe does not exist) ] HINT If error
message is not descriptive or local, may be we cannot read metadata from hive
metastore service thrift://hcathost:9083 or HDFS namenode (check
UDxLogs/UDxFencedProcessesJava.log in the catalog directory for more information)
at com.vertica.hcatalogudl.HCatalogSplitsNoOpSourceFactory
.plan(HCatalogSplitsNoOpSourceFactory.java:98)
at com.vertica.udxfence.UDxExecContext.planUDSource(UDxExecContext.java:898)
. . .

In the error message, you can see that the root cause is a missing SerDe class (the SerDe com.cloudera.hive.serde.JSONSerDe does not exist portion of the message). To resolve this issue, you need to install the SerDe class on your Vertica cluster. See Using Non-Standard SerDes for more information.

This error may occur intermittently if just one or a few hosts in your cluster do not have the SerDe class.

Differing Results Between Hive and Vertica Queries

You may find that running the same query on Hive and on Vertica through the HCatalog Connector can return different results. This discrepancy is often caused by the differences between the data types supported by Hive and Vertica. See Data Type Conversions from Hive to Vertica for more information.

Preventing Excessive Query Delays

Network issues or high system loads on the WebHCat server can cause long delays while querying a Hive database using the HCatalog Connector. While Vertica cannot resolve these issues, you can set parameters that limit how long it waits before canceling a query on an HCatalog schema. These parameters can be set globally via Vertica configuration parameters. You can also set them for specific HCatalog schemas in the CREATE HCATALOG SCHEMA statement. These specific settings override the settings in the configuration parameters.

The HCatConnectionTimeout configuration parameter and the CREATE HCATALOG SCHEMA statement's HCATALOG_CONNECTION_TIMEOUT parameter control how many seconds the HCatalog Connector waits for a connection to the WebHCat server. A value of 0 (the default setting for the configuration parameter) means to wait indefinitely. If the WebHCat server does not respond by the time this timeout elapses, the HCatalog Connector breaks the connection and cancels the query. If you find that some queries on an HCatalog schema pause excessively, try setting this parameter to a timeout value, so the query does not hang indefinitely.

The other two parameters work together to terminate HCatalog queries that are running too slowly. After it successfully connects to the WebHCat server and requests data, the HCatalog Connector waits for the number of seconds set in the HCatSlowTransferTime configuration parameter. This value can also be set using the CREATE HCATALOG SCHEMA statement's HCATALOG_SLOW_TRANSFER_TIME parameter. After this time has elapsed, if the data transfer rate from the WebHCat server is not at least the value set in the HCatSlowTransferLimit configuration parameter (or by the CREATE HCATALOG SCHEMA statement's HCATALOG_SLOW_TRANSFER_LIMIT parameter), the HCatalog Connector terminates the connection and cancels the query. You can use these parameters to cancel queries that run very slowly but would eventually complete.
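A sketch of both approaches follows; the parameter names come from the description above, while the numeric values are arbitrary examples (the transfer limit is assumed to be in bytes per second), not recommendations:

=> -- Cluster-wide defaults via configuration parameters
=> SELECT SET_CONFIG_PARAMETER('HCatConnectionTimeout', 30);
=> SELECT SET_CONFIG_PARAMETER('HCatSlowTransferTime', 60);
=> SELECT SET_CONFIG_PARAMETER('HCatSlowTransferLimit', 1048576);
=> -- Per-schema overrides at creation time
=> CREATE HCATALOG SCHEMA hcat_timed WITH hostname='hcathost' HCATALOG_SCHEMA='default'
-> HCATALOG_USER='hcatuser' HCATALOG_CONNECTION_TIMEOUT=30
-> HCATALOG_SLOW_TRANSFER_TIME=60 HCATALOG_SLOW_TRANSFER_LIMIT=1048576;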


Note: Query delays are usually caused by a slow connection rather than a problem establishing the connection. Therefore, try adjusting the slow transfer rate settings first. If you find the cause of the issue is connections that never complete, you can adjust the Linux TCP socket timeouts to a suitable value instead of relying on the HCatConnectionTimeout parameter.


Integrating Vertica with the MapR Distribution of Hadoop

MapR is a distribution of Apache Hadoop produced by MapR Technologies that extends the standard Hadoop components with its own features. By adding Vertica to a MapR cluster, you can benefit from the advantages of both Vertica and Hadoop. To learn more about integrating Vertica and MapR, see Configuring HP Vertica Analytics Platform with MapR, which appears on the MapR website.


We appreciate your feedback!

If you have comments about this document, you can contact the documentation team by email. If an email client is configured on this system, click the link above and an email window opens with the following information in the subject line:

Feedback on Hadoop Integration Guide (Vertica Analytic Database 7.0.x)

Just add your feedback to the email and click send.

If no email client is available, copy the information above to a new message in a webmail client, and send your feedback to [email protected].
