Apache Hive Hook

Post on 11-May-2015

I couldn't find enough information about Hive hooks, so I made this presentation. I hope it will be useful when you want to use hooks. It also includes some information about metastore event listeners. It is based on the release-0.11 tag.

Transcript of Apache Hive Hook

Apache Hive Hook

2013. 8

Minwoo Kim

michael.kim@nexr.com

Apache Hive Hook

• I made this because Ryan asked me about Hive hooks, and I couldn't find any information about hooks in the Hive wiki.

• I hope this will be helpful for developing applications on Hive when you want to get extra information while a query is executing.

• This document is based on the release-0.11 tag.

• Source:

- https://github.com/apache/hive (mirror of apache hive)

What is a hook?

• As you know, this is a computer programming technique, but ..

• Hooking

- Techniques for intercepting function calls or messages or events in an operating system, applications, and other software components.

• Hook

- Code that handles intercepted function calls, events or messages

Hive provides some hooking points

• pre-execution

• post-execution

• execution-failure

• pre- and post-driver-run

• pre- and post-semantic-analyze

• metastore-initialize

How to set up hooks in Hive

hive-site.xml

Setting the hook property:

<property>
  <name>hive.exec.pre.hooks</name>
  <value></value>
  <description>
    Comma-separated list of pre-execution hooks to be invoked for each statement.
    A pre-execution hook is specified as the name of a Java class which implements the
    org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface.
  </description>
</property>

Setting the path of the jars that contain the implementations of the hook interfaces or abstract classes:

<property>
  <name>hive.aux.jars.path</name>
  <value></value>
</property>

You can use hive.added.jars.path instead of hive.aux.jars.path.

Hive hook properties and interfaces

Property                      Interface or abstract class

hive.exec.pre.hooks           org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext (PreExecute is deprecated)
hive.exec.post.hooks          org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext (PostExecute is deprecated)
hive.exec.failure.hooks       org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext
hive.metastore.init.hooks     org.apache.hadoop.hive.metastore.MetaStoreInitListener
hive.exec.driver.run.hooks    org.apache.hadoop.hive.ql.HiveDriverRunHook
hive.semantic.analyzer.hook   org.apache.hadoop.hive.ql.parse.AbstractSemanticAnalyzerHook

When do those hooks fire?

• You can submit a query on Hive through the following entry points

- CliDriver main method (called by shell script)

- HCatCli main method (called by shell script)

- HiveServer (called by thrift client)

- HiveServer2 (called by thrift client or beeline)

CliDriver

CliDriver.main() ➔ run() ➔ executeDriver() ➔ processLine() ➔ processCmd() ➔ is remote?

yes ↳ CliSessionState.getClient() ↳ HiveClient.execute() ➠

no ➔ processLocalCmd() ➔ Driver.run() ➠

HCatCli

HCatCli.main() ➔ processLine() ➔ processCmd() ➔ HCatDriver.run() ⤇ Driver.run() ➠

HiveServer

HiveServer.execute() ➔ Driver.run() ➠

HiveServer2

ThriftCLIService.ExecuteStatement() ➔ CLIService.executeStatement()

CLIService.executeStatement()

↳ SessionManager.getSession()

↳ HiveSession.executeStatement()

↳ OperationManager.newExecuteStatementOperation()

↳ SQLOperation.run() ➔ Driver.run() ➠

• OperationManager.newExecuteStatementOperation() is a kind of factory

- AddResourceOperation, DeleteResourceOperation, DfsOperation, GetCatalogsOperation, GetColumnsOperation, GetFunctionsOperation, GetSchemasOperation, GetTablesOperation, GetTableTypesOperation, GetTypeInfoOperation, SetOperation, SQLOperation

➠ Driver.run()

➔ Driver.runInternal()

↳ Driver.compile()

↳ ParseDriver.parse()

↝ HiveParser

• HiveParser.g

- SelectClauseParser.g

- FromClauseParser.g

- IdentifiersParser.g

• ParseDriver.parse()

- Command String ➡ root of the AST

↳ SemanticAnalyzer.analyze()

• SemanticAnalyzerFactory.get(conf, ast)

- SemanticAnalyzer, ColumnStatsSemanticAnalyzer, ExplainSemanticAnalyzer, ExportSemanticAnalyzer, FunctionSemanticAnalyzer, ImportSemanticAnalyzer, LoadSemanticAnalyzer, MacroSemanticAnalyzer

➔ analyzeInternal()

• processPositionAlias()

• doPhase1()

• getMetaData()

• genPlan()

• Optimizer.optimize()

• MapReduceCompiler.compile()

Operators in the plan that genPlan() generates:

• FilterOperator

• SelectOperator

• ForwardOperator

• FileSinkOperator

• ScriptOperator

• PTFOperator

• ReduceSinkOperator

• ExtractOperator

• GroupByOperator

• JoinOperator

• MapJoinOperator

• SMBMapJoinOperator

• LimitOperator

• TableScanOperator

• UnionOperator

• UDTFOperator

• LateralViewJoinOperator

• LateralViewForwardOperator

• HashTableDummyOperator

• HashTableSinkOperator

• DummyStoreOperator

• DemuxOperator

• MuxOperator

Transformations that Optimizer.optimize() can apply:

• PredicateTransitivePropagate

• PredicatePushDown

• PartitionPruner

• PartitionConditionRemover

• ListBucketingPruner

• ListBucketingPruner

• ColumnPruner

• SkewJoinOptimizer

• RewriteGBUsingIndex

• GroupByOptimizer

• SamplePruner

• MapJoinProcessor

• BucketMapJoinOptimizer

• BucketMapJoinOptimizer

• SortedMergeBucketMapJoinOptimizer

• BucketingSortingReduceSinkOptimizer

• UnionProcessor

• JoinReorder

• ReduceSinkDeDuplication

• NonBlockingOpDeDupProc

• GlobalLimitOptimizer

• CorrelationOptimizer

• SimpleFetchOptimizer

Tasks that the compiled plan can contain:

• MapRedTask

• FetchTask

• ConditionalTask

• ExplainTask

• CopyTask

• DDLTask

• MoveTask

• FunctionTask

• StatsTask

• ColumnStatsTask

• DependencyCollectionTask

↳ Driver.execute()

➔ loop (List<Task>)

⟳ Driver.launchTask()

➔ TaskRunner.runSequential() ➔ Task.executeTask()

➔ Task.execute()

• e.g. MapRedTask.execute() ⤇ ExecDriver.execute() ➔ JobClient.submitJob() (the submitted job runs ExecMapper and ExecReducer)

➠ Driver.run()

➔ Driver.runInternal() ← PRE- and POST-DRIVER-RUN

↳ Driver.compile()

↳ ParseDriver.parse()

↳ SemanticAnalyzer.analyze() ← PRE- and POST-SEMANTIC-ANALYZE

↳ Driver.execute() ← PRE-, POST-EXEC and ON-FAILURE

➔ loop (List<Task>)

⟳ Driver.launchTask()

➔ TaskRunner.runSequential() ➔ Task.executeTask()

➔ Task.execute()

HiveServer2.main() ➔ HiveServer2.start() ➔ CLIService.start() ➔ new HiveMetaStoreClient() ➠

CLIService.executeStatement() ⇒ GetColumnsOperation.run(), GetSchemasOperation.run(), GetTablesOperation.run() ➔ HiveSession.getMetaStoreClient() ➔ new HiveMetaStoreClient() ➠

SemanticAnalyzer ↝ Hive ↝ getMSC() (getMSC() is invoked by many other methods of the Hive object)

Hive.getMSC() ➔ Hive.createMetaStoreClient() ➔ RetryingHMSHandler.getProxy() ➠

➠ new HiveMetaStoreClient()

➔ HiveMetaStore.newHMSHandler()

➔ RetryingHMSHandler.getProxy()

➔ new RetryingHMSHandler()

➔ new HMSHandler() ➔ HMSHandler.init()

➔ HiveMetaStore.init() ← METASTORE-INIT

How Hive executes hooks

• Hive runs every hook registered at a hook point, one after another.

ex. Driver.runInternal()

List<HiveDriverRunHook> driverRunHooks;
try {
  driverRunHooks = getHooks(HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS, HiveDriverRunHook.class);
  for (HiveDriverRunHook driverRunHook : driverRunHooks) {
    driverRunHook.preDriverRun(hookContext);
  }
} catch (Exception e) {
  // ~ ellipsis ~
}
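The getHooks() call above takes a comma-separated list of class names from the configuration and turns it into hook instances. The following is a minimal sketch of that idea using plain Java reflection. It is not Hive's implementation: SketchHook, FirstHook, SecondHook and HookLoaderSketch are hypothetical names made up for this example, so that it compiles without Hive on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-in for a Hive hook interface (hypothetical, not a Hive type).
interface SketchHook {
    void run(String command, List<String> log);
}

// Two hypothetical hooks, known to the loader only by class name.
class FirstHook implements SketchHook {
    public void run(String command, List<String> log) {
        log.add("first: " + command);
    }
}

class SecondHook implements SketchHook {
    public void run(String command, List<String> log) {
        log.add("second: " + command);
    }
}

final class HookLoaderSketch {

    // Splits a comma-separated list of class names and instantiates each class.
    static <T> List<T> getHooks(String csvClassNames, Class<T> type) {
        List<T> hooks = new ArrayList<>();
        if (csvClassNames == null || csvClassNames.trim().isEmpty()) {
            return hooks; // an empty property value means "no hooks"
        }
        for (String name : csvClassNames.split(",")) {
            try {
                Class<?> cls = Class.forName(name.trim());
                hooks.add(type.cast(cls.getDeclaredConstructor().newInstance()));
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("cannot load hook " + name.trim(), e);
            }
        }
        return hooks;
    }

    // Runs every configured hook, in the order the class names were listed.
    static void runAll(String csvClassNames, String command, List<String> log) {
        for (SketchHook hook : getHooks(csvClassNames, SketchHook.class)) {
            hook.run(command, log);
        }
    }
}
```

Calling runAll() with the names of FirstHook and SecondHook (obtained with getName()) and the command "select 1" leaves the two entries "first: select 1" and "second: select 1" in the log, in that order.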

1. MetaStoreInitListener

public abstract class MetaStoreInitListener implements Configurable {

  private Configuration conf;

  public MetaStoreInitListener(Configuration config) {
    this.conf = config;
  }

  public abstract void onInit(MetaStoreInitContext context) throws MetaException;

  @Override
  public Configuration getConf() {
    return this.conf;
  }

  @Override
  public void setConf(Configuration config) {
    this.conf = config;
  }
}

What MetaStoreInitContext contains

• Nothing!

- This hook just notifies you when the metastore initializes (but you can, of course, get the HiveConf by calling getConf()).

public class MetaStoreInitContext { }

2. HiveDriverRunHook

• preDriverRun

- Invoked before Hive begins any processing of a command in the Driver, before compilation

• postDriverRun

- Invoked after Hive performs any processing of a command, just before a response is returned to the entity calling the Driver.run()

public interface HiveDriverRunHook extends Hook {

  public void preDriverRun(HiveDriverRunHookContext hookContext) throws Exception;

  public void postDriverRun(HiveDriverRunHookContext hookContext) throws Exception;
}

What HiveDriverRunHookContext contains

• You can get command string from this hook context.

- This is the only thing that HiveDriverRunHookContext has.

public interface HiveDriverRunHookContext extends Configurable {

  public String getCommand();

  public void setCommand(String command);
}
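Since the command string is all that a driver-run hook gets, a typical use is auditing: record the command before the run, then record it again with the elapsed time after the run. Here is a minimal sketch of that pattern. RunContext and RunHook are local stand-ins that only mirror the shape of the two interfaces above; CommandAuditHook and simulate() are hypothetical names. A real hook would implement org.apache.hadoop.hive.ql.HiveDriverRunHook instead.

```java
import java.util.ArrayList;
import java.util.List;

final class DriverRunHookSketch {

    // Stand-in for HiveDriverRunHookContext: it only exposes the command string.
    interface RunContext {
        String getCommand();
    }

    // Stand-in for HiveDriverRunHook, with the same two callbacks.
    interface RunHook {
        void preDriverRun(RunContext ctx) throws Exception;
        void postDriverRun(RunContext ctx) throws Exception;
    }

    // Hypothetical hook that records each command and how long the run took.
    static final class CommandAuditHook implements RunHook {
        final List<String> events = new ArrayList<>();
        long elapsedNanos;
        private long startNanos;

        @Override
        public void preDriverRun(RunContext ctx) {
            startNanos = System.nanoTime();
            events.add("pre: " + ctx.getCommand());
        }

        @Override
        public void postDriverRun(RunContext ctx) {
            elapsedNanos = System.nanoTime() - startNanos;
            events.add("post: " + ctx.getCommand());
        }
    }

    // Plays the role of the driver: pre hook, (pretend) work, post hook.
    static List<String> simulate(String command) {
        CommandAuditHook hook = new CommandAuditHook();
        RunContext ctx = () -> command;
        hook.preDriverRun(ctx);
        // ... the driver would compile and execute the command here ...
        hook.postDriverRun(ctx);
        return hook.events;
    }
}
```

The sketch keeps the start time in a field, which is only safe if one hook instance serves one run at a time; whether that holds in a given Hive deployment is an assumption you would need to check.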

3. AbstractSemanticAnalyzerHook

• You can get

- HiveSemanticAnalyzerHookContext and ASTNode (root node of the abstract syntax tree) before analysis.

- HiveSemanticAnalyzerHookContext and List<Task> after analysis.

public abstract class AbstractSemanticAnalyzerHook implements HiveSemanticAnalyzerHook {

  public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context, ASTNode ast)
      throws SemanticException {
    return ast;
  }

  public void postAnalyze(HiveSemanticAnalyzerHookContext context,
      List<Task<? extends Serializable>> rootTasks) throws SemanticException {
  }
}

What HiveSemanticAnalyzerHookContext contains

• Hive object

- contains information about a set of data in HDFS organized for query processing. (from comment)

• ReadEntity, WriteEntity

• The update method is invoked after the semantic analyzer completes.

public interface HiveSemanticAnalyzerHookContext extends Configurable {

  public Hive getHive() throws HiveException;

  public void update(BaseSemanticAnalyzer sem);

  public Set<ReadEntity> getInputs();

  public Set<WriteEntity> getOutputs();
}

How Hive executes analyzer hooks

List<AbstractSemanticAnalyzerHook> saHooks =
    getHooks(HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK, AbstractSemanticAnalyzerHook.class);

// ~ ellipsis ~

HiveSemanticAnalyzerHookContext hookCtx = new HiveSemanticAnalyzerHookContextImpl();
hookCtx.setConf(conf);
for (AbstractSemanticAnalyzerHook hook : saHooks) {
  tree = hook.preAnalyze(hookCtx, tree);
}
sem.analyze(tree, ctx);
hookCtx.update(sem);
for (AbstractSemanticAnalyzerHook hook : saHooks) {
  hook.postAnalyze(hookCtx, sem.getRootTasks());
}
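Two things follow from the loop above: the tree returned by one preAnalyze() is what the next hook and the analyzer see, and a hook can stop a statement by throwing before analysis starts. The sketch below shows both under simplifying assumptions. Ast, AnalyzerHook, DenyDropTableHook and TaskCountHook are hypothetical stand-ins (a real hook would extend AbstractSemanticAnalyzerHook and throw SemanticException), and representing a statement by a single token name such as TOK_DROPTABLE is a simplification of the real ASTNode.

```java
import java.util.Arrays;
import java.util.List;

final class AnalyzerHookSketch {

    // Stand-in for ASTNode: it only carries the name of the root token.
    static final class Ast {
        final String token;
        Ast(String token) { this.token = token; }
    }

    // Stand-in for AbstractSemanticAnalyzerHook, with the same no-op defaults.
    static abstract class AnalyzerHook {
        Ast preAnalyze(Ast ast) { return ast; }
        void postAnalyze(List<String> rootTasks) { }
    }

    // Vetoes a statement before analysis by throwing.
    static final class DenyDropTableHook extends AnalyzerHook {
        @Override
        Ast preAnalyze(Ast ast) {
            if ("TOK_DROPTABLE".equals(ast.token)) {
                throw new IllegalStateException("DROP TABLE is not allowed");
            }
            return ast;
        }
    }

    // Only observes: remembers how many root tasks the analysis produced.
    static final class TaskCountHook extends AnalyzerHook {
        int seen = -1;
        @Override
        void postAnalyze(List<String> rootTasks) { seen = rootTasks.size(); }
    }

    // Mirrors the loop on the slide: pre hooks, analysis, post hooks.
    static List<String> compile(Ast tree, List<AnalyzerHook> hooks) {
        for (AnalyzerHook hook : hooks) {
            tree = hook.preAnalyze(tree); // each hook may replace the tree
        }
        List<String> rootTasks = Arrays.asList("task for " + tree.token); // pretend analysis
        for (AnalyzerHook hook : hooks) {
            hook.postAnalyze(rootTasks);
        }
        return rootTasks;
    }
}
```

With both hooks registered, compiling a TOK_QUERY tree reaches the post hooks, while compiling a TOK_DROPTABLE tree fails in the first pre hook and never reaches the analysis step.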

4. ExecuteWithHookContext

• Can be used with the following properties

- hive.exec.pre.hooks

- hive.exec.post.hooks

- hive.exec.failure.hooks

public interface ExecuteWithHookContext extends Hook {

  /**
   * @param hookContext
   *          The hook context passed to each hook.
   * @throws Exception
   */
  void run(HookContext hookContext) throws Exception;
}

What HookContext contains

• HookType

- PRE_EXEC_HOOK, POST_EXEC_HOOK, ON_FAILURE_HOOK

• QueryPlan

• HiveConf

• LineageInfo

• UserGroupInformation

• OperationName

• List<TaskRunner> completeTaskList

• Set<ReadEntity> inputs

• Set<WriteEntity> outputs

• Map<String, ContentSummary> inputPathToContentSummary
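Because the same interface serves all three properties, one class can be registered under pre, post and failure hooks and branch on the hook type from the context. The sketch below shows that dispatch. Context, ExecHook and AuditHook are hypothetical stand-ins, and the context here carries only two of the fields listed above. The enum constants are the ones named on the slide.

```java
import java.util.ArrayList;
import java.util.List;

final class ExecHookSketch {

    // The three hook types named on the slide.
    enum HookType { PRE_EXEC_HOOK, POST_EXEC_HOOK, ON_FAILURE_HOOK }

    // Stand-in for HookContext, reduced to the hook type and the operation name.
    static final class Context {
        final HookType hookType;
        final String operationName;

        Context(HookType hookType, String operationName) {
            this.hookType = hookType;
            this.operationName = operationName;
        }
    }

    // Stand-in for ExecuteWithHookContext.
    interface ExecHook {
        void run(Context ctx) throws Exception;
    }

    // One hypothetical hook class that can be registered at all three hook points.
    static final class AuditHook implements ExecHook {
        final List<String> lines = new ArrayList<>();

        @Override
        public void run(Context ctx) {
            switch (ctx.hookType) {
                case PRE_EXEC_HOOK:
                    lines.add("starting " + ctx.operationName);
                    break;
                case POST_EXEC_HOOK:
                    lines.add("finished " + ctx.operationName);
                    break;
                case ON_FAILURE_HOOK:
                    lines.add("failed " + ctx.operationName);
                    break;
            }
        }
    }
}
```

Keeping the three cases in one class makes it easy to share state such as a logger between them; separate classes per property work just as well if the cases have nothing in common.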

How to make Hive fire hooks without physically executing a query

ALTER TABLE table_name TOUCH [PARTITION partitionSpec];

• This statement has the effect of causing the pre- and post-execution hooks to fire.

MetaStore Event Listeners

Property                                 Abstract class

hive.metastore.pre.event.listeners       MetaStorePreEventListener
hive.metastore.end.function.listeners    MetaStoreEndFunctionListener
hive.metastore.event.listeners           MetaStoreEventListener

Package: org.apache.hadoop.hive.metastore

• I think these listeners look like hooks.

• At a glance, I couldn't find any particular difference between listeners and hooks. The only one I found is that listeners can't affect query processing; they can only read.

• Anyway, they look useful for letting you know when a metastore does something.

MetaStoreEventListener

• The following methods are invoked when the corresponding event occurs on a metastore.

- onCreateTable

- onDropTable

- onAlterTable

- onDropPartition

- onAlterPartition

- onCreateDatabase

- onDropDatabase

- onLoadPartitionDone

If you need more details, see org.apache.hadoop.hive.metastore.MetaStoreEventListener

Be careful!

• Hooks

- can be a critical failure point! (You had better catch runtime exceptions.)

- are performed synchronously.

- can affect query processing time.
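One way to act on these warnings is to wrap a hook so that its exceptions are caught and its running time is measured. The following is a minimal sketch of such a wrapper. Hook and SafeHook are hypothetical names for this example, not Hive types, and the Hook interface is a simplified stand-in that receives only a command string.

```java
final class SafeHookSketch {

    // Simplified stand-in for a hook interface.
    interface Hook {
        void run(String command) throws Exception;
    }

    // Wraps another hook: failures are counted and logged instead of propagated.
    static final class SafeHook implements Hook {
        private final Hook delegate;
        int failures;
        long lastNanos;

        SafeHook(Hook delegate) {
            this.delegate = delegate;
        }

        @Override
        public void run(String command) {
            long start = System.nanoTime();
            try {
                delegate.run(command);
            } catch (Exception e) {
                // Catches checked and runtime exceptions, but not Errors.
                failures++;
                System.err.println("hook failed, ignoring: " + e);
            } finally {
                // The hook runs synchronously, so this time is added to the query.
                lastNanos = System.nanoTime() - start;
            }
        }
    }
}
```

Swallowing the exception suits hooks that only observe, such as audit or metrics hooks. A hook that is meant to reject a query has to let its exception through, so it should not be wrapped this way.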

Let's try it out

• Demo

- Don’t be surprised if it doesn’t work.

- That’s the way the demo is...

Thanks!

• Questions?

• Resources

- https://cwiki.apache.org/confluence/display/Hive/

- https://github.com/apache/hive