Apache Hive Hook

Apache Hive Hook, 2013. 8, Minwoo Kim


I couldn't find enough information about Hive hooks, so I made this. I hope this presentation is useful when you want to use hooks. It also includes some information about metastore event listeners. It is based on the release-0.11 tag.

Transcript of Apache Hive Hook

Page 2: Apache Hive Hook

Apache Hive Hook

• I made this because Ryan asked me about Hive hooks, and I couldn't find any information about them in the Hive wiki.

• I hope it helps when you develop applications on Hive and want to get extra information while a query executes.

• This document is based on the release-0.11 tag.

• Source:

- https://github.com/apache/hive (mirror of apache hive)

Page 3: Apache Hive Hook

What is a hook?

• As you know, this is a computer programming technique, but ..

• Hooking

- Techniques for intercepting function calls or messages or events in an operating system, applications, and other software components.

• Hook

- Code that handles intercepted function calls, events or messages
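The two definitions above can be illustrated with a minimal sketch in plain Java. Nothing here is Hive code: CallHook, HookedFunction and RecordingHook are hypothetical names chosen for illustration. HookedFunction does the hooking (it intercepts a call), and RecordingHook is the hook (the code that handles the intercepted call).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical names throughout; this is an illustration, not Hive code.
interface CallHook {
    void before(String input);
    void after(String input, String result);
}

// "Hooking": wraps a target function and intercepts every call to it.
class HookedFunction {
    private final Function<String, String> target;
    private final List<CallHook> hooks = new ArrayList<>();

    HookedFunction(Function<String, String> target) {
        this.target = target;
    }

    void register(CallHook hook) {
        hooks.add(hook);
    }

    // Hooks see the input first, then the target runs, then hooks see the result.
    String call(String input) {
        for (CallHook hook : hooks) {
            hook.before(input);
        }
        String result = target.apply(input);
        for (CallHook hook : hooks) {
            hook.after(input, result);
        }
        return result;
    }
}

// "Hook": the code that handles the intercepted call. Here it only records events.
class RecordingHook implements CallHook {
    final List<String> events = new ArrayList<>();

    @Override
    public void before(String input) {
        events.add("before:" + input);
    }

    @Override
    public void after(String input, String result) {
        events.add("after:" + input + "->" + result);
    }
}
```

The target function never learns that it is being observed, which is the same property Hive relies on: hook classes are plugged in through configuration rather than by changing the code that runs the query.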

Page 4: Apache Hive Hook

Hive provides some hooking points

• pre-execution

• post-execution

• execution-failure

• pre- and post-driver-run

• pre- and post-semantic-analyze

• metastore-initialize

Page 5: Apache Hive Hook

How to set up hooks in Hive

hive-site.xml

Setting the hook property:

<property>
  <name>hive.exec.pre.hooks</name>
  <value></value>
  <description>
    Comma-separated list of pre-execution hooks to be invoked for each statement.
    A pre-execution hook is specified as the name of a Java class which implements
    the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface.
  </description>
</property>

Setting the path of the jars that contain the implementations of the hook interfaces or abstract classes:

<property>
  <name>hive.aux.jars.path</name>
  <value></value>
</property>

You can use hive.added.jars.path instead of hive.aux.jars.path

Page 6: Apache Hive Hook

Hive hook properties and interfaces

Property                       Interface or abstract class
hive.exec.pre.hooks            org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext (PreExecute is deprecated)
hive.exec.post.hooks           org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext (PostExecute is deprecated)
hive.exec.failure.hooks        org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext
hive.metastore.init.hooks      org.apache.hadoop.hive.metastore.MetaStoreInitListener
hive.exec.driver.run.hooks     org.apache.hadoop.hive.ql.HiveDriverRunHook
hive.semantic.analyzer.hook    org.apache.hadoop.hive.ql.parse.AbstractSemanticAnalyzerHook

Page 7: Apache Hive Hook

When do those hooks fire?

• You can submit a query on Hive through the following entry points

- CliDriver main method (called by shell script)

- HCatCli main method (called by shell script)

- HiveServer (called by thrift client)

- HiveServer2 (called by thrift client or beeline)

Page 10: Apache Hive Hook

HiveServer

HiveServer.execute() ➔ Driver.run() ➠

CliDriver

CliDriver.main() ➔ run() ➔ executeDriver() ➔ processLine() ➔ processCmd() ➔ is remote?

yes ↳ CliSessionState.getClient() ↳ HiveClient.execute() ➠

no ➔ processLocalCmd() ➔ Driver.run() ➠

HCatCli

HCatCli.main() ➔ processLine() ➔ processCmd() ➔ HCatDriver.run() ⤇ Driver.run() ➠

Page 12: Apache Hive Hook

HiveServer2

ThriftCLIService.ExecuteStatement() ➔ CLIService.executeStatement()

CLIService.executeStatement()

↳ SessionManager.getSession()

↳ HiveSession.executeStatement()

↳ OperationManager.newExecuteStatementOperation()

↳ SQLOperation.run() ➔ Driver.run() ➠

• OperationManager.newExecuteStatementOperation() acts as a factory that returns one of the following operations:

- AddResourceOperation, DeleteResourceOperation, DfsOperation, GetCatalogsOperation, GetColumnsOperation, GetFunctionsOperation, GetSchemasOperation, GetTablesOperation, GetTableTypesOperation, GetTypeInfoOperation, SetOperation, SQLOperation

Page 14: Apache Hive Hook

➠ Driver.run()

➔ Driver.runInternal()

↳ Driver.compile()

↳ ParseDriver.parse()

↝ HiveParser

• HiveParser.g

- SelectClauseParser.g

- FromClauseParser.g

- IdentifiersParser.g

• ParseDriver.parse()

- Command string ➡ root of the AST

Page 15: Apache Hive Hook

➠ Driver.run()

➔ Driver.runInternal()

↳ Driver.compile()

↳ ParseDriver.parse()

↳ SemanticAnalyzer.analyze()

• SemanticAnalyzerFactory.get(conf, ast)

- SemanticAnalyzer, ColumnStatsSemanticAnalyzer, ExplainSemanticAnalyzer, ExportSemanticAnalyzer, FunctionSemanticAnalyzer, ImportSemanticAnalyzer, LoadSemanticAnalyzer, MacroSemanticAnalyzer

Page 17: Apache Hive Hook

➠ Driver.run()

➔ Driver.runInternal()

↳ Driver.compile()

↳ ParseDriver.parse()

↳ SemanticAnalyzer.analyze()

• FilterOperator

• SelectOperator

• ForwardOperator

• FileSinkOperator

• ScriptOperator

• PTFOperator

• ReduceSinkOperator

• ExtractOperator

• GroupByOperator

• JoinOperator

• MapJoinOperator

• SMBMapJoinOperator

• LimitOperator

• TableScanOperator

• UnionOperator

• UDTFOperator

• LateralViewJoinOperator

• LateralViewForwardOperator

• HashTableDummyOperator

• HashTableSinkOperator

• DummyStoreOperator

• DemuxOperator

• MuxOperator

➔ analyzeInternal()

• processPositionAlias()

• doPhase1()

• getMetaData()

• genPlan()

• Optimizer.optimize()

• MapReduceCompiler.compile()

Page 18: Apache Hive Hook

➠ Driver.run()

➔ Driver.runInternal()

↳ Driver.compile()

↳ ParseDriver.parse()

↳ SemanticAnalyzer.analyze()

• PredicateTransitivePropagate

• PredicatePushDown

• PartitionPruner

• PartitionConditionRemover

• ListBucketingPruner

• ColumnPruner

• SkewJoinOptimizer

• RewriteGBUsingIndex

• GroupByOptimizer

• SamplePruner

• MapJoinProcessor

• BucketMapJoinOptimizer

• SortedMergeBucketMapJoinOptimizer

• BucketingSortingReduceSinkOptimizer

• UnionProcessor

• JoinReorder

• ReduceSinkDeDuplication

• NonBlockingOpDeDupProc

• GlobalLimitOptimizer

• CorrelationOptimizer

• SimpleFetchOptimizer

➔ analyzeInternal()

• processPositionAlias()

• doPhase1()

• getMetaData()

• genPlan()

• Optimizer.optimize()

• MapReduceCompiler.compile()

Page 19: Apache Hive Hook

➠ Driver.run()

➔ Driver.runInternal()

↳ Driver.compile()

↳ ParseDriver.parse()

↳ SemanticAnalyzer.analyze()

• MapRedTask

• FetchTask

• ConditionalTask

• ExplainTask

• CopyTask

• DDLTask

• MoveTask

• FunctionTask

• StatsTask

• ColumnStatsTask

• DependencyCollectionTask

➔ analyzeInternal()

• processPositionAlias()

• doPhase1()

• getMetaData()

• genPlan()

• Optimizer.optimize()

• MapReduceCompiler.compile()

Page 21: Apache Hive Hook

➠ Driver.run()

➔ Driver.runInternal()

↳ Driver.compile()

↳ ParseDriver.parse()

↳ SemanticAnalyzer.analyze()

↳ Driver.execute()

➔ loop (List<Task>)

⟳ Driver.launchTask()

➔ TaskRunner.runSequential() ➔ Task.executeTask()

➔ Task.execute()

➔ analyzeInternal()

• processPositionAlias()

• doPhase1()

• getMetaData()

• genPlan()

• Optimizer.optimize()

• MapReduceCompiler.compile()

• For example, MapRedTask.execute() ⤇ ExecDriver.execute() ➔ JobClient.submitJob() (ExecMapper, ExecReducer)

Page 22: Apache Hive Hook

➠ Driver.run()

➔ Driver.runInternal()

↳ Driver.compile()

↳ ParseDriver.parse()

↳ SemanticAnalyzer.analyze()

↳ Driver.execute()

➔ loop (List<Task>)

⟳ Driver.launchTask()

➔ TaskRunner.runSequential() ➔ Task.executeTask()

➔ Task.execute()

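Hook points along this path:

• PRE- and POST-DRIVER-RUN: in Driver.runInternal()

• PRE- and POST-SEMANTIC-ANALYZE: in Driver.compile(), around SemanticAnalyzer.analyze()

• PRE-, POST-EXEC and ON-FAILURE: in Driver.execute()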

Page 26: Apache Hive Hook

HiveServer2.main() ➔ HiveServer2.start()

➔ CLIService.start() ➔ new HiveMetaStoreClient() ➠

➔ HiveSession.getMetaStoreClient()

➔ new HiveMetaStoreClient() ➠

➠ new HiveMetaStoreClient()

➔ HiveMetaStore.newHMSHandler()

➔ RetryingHMSHandler.getProxy()

➔ new RetryingHMSHandler()

➔ new HMSHandler() ➔ HMSHandler.init()

➔ HiveMetaStore.init()

CLIService.executeStatement()

METASTORE-INIT

SemanticAnalyzer ↝ Hive ↝ getMSC() is invoked by many other methods in Hive object

Hive.getMSC() ➔ Hive.createMetaStoreClient() ➔ RetryingHMSHandler.getProxy() ➠

GetColumnsOperation.run()

GetSchemasOperation.run()

GetTablesOperation.run()

Page 27: Apache Hive Hook

How Hive executes hooks

• Hive can execute multiple hooks at each hook point.

Example from Driver.runInternal():

List<HiveDriverRunHook> driverRunHooks;
try {
  driverRunHooks = getHooks(HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS, HiveDriverRunHook.class);
  for (HiveDriverRunHook driverRunHook : driverRunHooks) {
    driverRunHook.preDriverRun(hookContext);
  }
} catch (Exception e) {
  // ~ ellipsis ~
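A minimal sketch of how a getHooks-style helper could turn a comma-separated property value into hook instances, assuming reflection with a no-argument constructor. This is not Hive's implementation: Hook, RunHook, EchoHook, LengthHook and HookLoader are stand-in names defined here so the example compiles without Hive on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a marker interface such as org.apache.hadoop.hive.ql.hooks.Hook.
interface Hook {}

// Hypothetical hook type with a single callback.
interface RunHook extends Hook {
    String run(String command);
}

class EchoHook implements RunHook {
    @Override
    public String run(String command) {
        return "echo:" + command;
    }
}

class LengthHook implements RunHook {
    @Override
    public String run(String command) {
        return "length:" + command.length();
    }
}

class HookLoader {
    // Splits a comma-separated list of class names, instantiates each class
    // through its no-argument constructor, and keeps the configured order.
    // Blank entries are skipped; a class that cannot be loaded is an error.
    static <T extends Hook> List<T> getHooks(String csv, Class<T> type) {
        List<T> hooks = new ArrayList<>();
        if (csv == null || csv.trim().isEmpty()) {
            return hooks;
        }
        for (String name : csv.split(",")) {
            String className = name.trim();
            if (className.isEmpty()) {
                continue;
            }
            try {
                Object instance = Class.forName(className).getDeclaredConstructor().newInstance();
                hooks.add(type.cast(instance));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot load hook " + className, e);
            }
        }
        return hooks;
    }
}
```

Because the list order is preserved, hooks run in the order they appear in the property value, so a slow or failing hook early in the list delays or blocks every hook after it.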

Page 28: Apache Hive Hook

1. MetaStoreInitListener

public abstract class MetaStoreInitListener implements Configurable {

  private Configuration conf;

  public MetaStoreInitListener(Configuration config) {
    this.conf = config;
  }

  public abstract void onInit(MetaStoreInitContext context) throws MetaException;

  @Override
  public Configuration getConf() {
    return this.conf;
  }

  @Override
  public void setConf(Configuration config) {
    this.conf = config;
  }
}

Page 30: Apache Hive Hook

What MetaStoreInitContext has

• Nothing!

- This hook just notifies you when the metastore initializes (you can, of course, still get the configuration by calling getConf()).

public class MetaStoreInitContext { }

Page 31: Apache Hive Hook

2. HiveDriverRunHook

• preDriverRun

- Invoked before Hive begins any processing of a command in the Driver, before compilation

• postDriverRun

- Invoked after Hive performs any processing of a command, just before a response is returned to the entity calling the Driver.run()

public interface HiveDriverRunHook extends Hook {
  public void preDriverRun(HiveDriverRunHookContext hookContext) throws Exception;
  public void postDriverRun(HiveDriverRunHookContext hookContext) throws Exception;
}

Page 32: Apache Hive Hook

What HiveDriverRunHookContext has

• You can get the command string from this hook context.

- This is the only thing that HiveDriverRunHookContext has, apart from the Configuration it inherits from Configurable.

public interface HiveDriverRunHookContext extends Configurable {
  public String getCommand();
  public void setCommand(String command);
}

Page 33: Apache Hive Hook

3. AbstractSemanticAnalyzerHook

• You can get

- HiveSemanticAnalyzerHookContext and ASTNode (root node of the abstract syntax tree) before analysis.

- HiveSemanticAnalyzerHookContext and List<Task> after analysis.

public abstract class AbstractSemanticAnalyzerHook implements HiveSemanticAnalyzerHook {
  public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context, ASTNode ast) throws SemanticException {
    return ast;
  }
  public void postAnalyze(HiveSemanticAnalyzerHookContext context, List<Task<? extends Serializable>> rootTasks) throws SemanticException {
  }
}

Page 34: Apache Hive Hook

What HiveSemanticAnalyzerHookContext has

• Hive object

- contains information about a set of data in HDFS organized for query processing. (from comment)

• ReadEntity, WriteEntity

• The update method is invoked after the semantic analyzer completes.

public interface HiveSemanticAnalyzerHookContext extends Configurable {
  public Hive getHive() throws HiveException;
  public void update(BaseSemanticAnalyzer sem);
  public Set<ReadEntity> getInputs();
  public Set<WriteEntity> getOutputs();
}

Page 35: Apache Hive Hook

How Hive executes analyzer hooks

List<AbstractSemanticAnalyzerHook> saHooks = getHooks(HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK, AbstractSemanticAnalyzerHook.class);
// ~ ellipsis ~
HiveSemanticAnalyzerHookContext hookCtx = new HiveSemanticAnalyzerHookContextImpl();
hookCtx.setConf(conf);
for (AbstractSemanticAnalyzerHook hook : saHooks) {
  tree = hook.preAnalyze(hookCtx, tree);
}
sem.analyze(tree, ctx);
hookCtx.update(sem);
for (AbstractSemanticAnalyzerHook hook : saHooks) {
  hook.postAnalyze(hookCtx, sem.getRootTasks());
}
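The loop above has two properties worth noticing: each preAnalyze call receives the tree returned by the previous hook, and postAnalyze only runs if analysis succeeded. The following sketch models that control flow with stand-in types so it runs without Hive. A String replaces ASTNode, a List<String> replaces the root tasks, and AnalyzerHook, RejectDropHook, CountingHook and AnalyzerHookChain are hypothetical names, not Hive classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for AbstractSemanticAnalyzerHook: both callbacks default to no-ops.
abstract class AnalyzerHook {
    String preAnalyze(String tree) {
        return tree;
    }

    void postAnalyze(List<String> rootTasks) {
    }
}

// Vetoes a statement before analysis by throwing from preAnalyze.
class RejectDropHook extends AnalyzerHook {
    @Override
    String preAnalyze(String tree) {
        if (tree.startsWith("TOK_DROPTABLE")) {
            throw new IllegalArgumentException("DROP TABLE is not allowed");
        }
        return tree;
    }
}

// Observes the result of analysis without changing it.
class CountingHook extends AnalyzerHook {
    int lastTaskCount = -1;

    @Override
    void postAnalyze(List<String> rootTasks) {
        lastTaskCount = rootTasks.size();
    }
}

class AnalyzerHookChain {
    // Hypothetical stand-in for sem.analyze(): one "task" per token.
    static List<String> analyze(String tree) {
        return new ArrayList<>(Arrays.asList(tree.trim().split("\\s+")));
    }

    static List<String> run(String tree, List<AnalyzerHook> hooks) {
        String current = tree;
        for (AnalyzerHook hook : hooks) {
            current = hook.preAnalyze(current); // each hook may replace the tree
        }
        List<String> rootTasks = analyze(current);
        for (AnalyzerHook hook : hooks) {
            hook.postAnalyze(rootTasks);
        }
        return rootTasks;
    }
}
```

An exception thrown from preAnalyze stops the chain before analysis starts, which is one way a semantic analyzer hook could be used to reject statements.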

Page 40: Apache Hive Hook

4. ExecuteWithHookContext

• Can be used with the following properties

- hive.exec.pre.hooks

- hive.exec.post.hooks

- hive.exec.failure.hooks

public interface ExecuteWithHookContext extends Hook {
  /**
   * @param hookContext
   *          The hook context passed to each hook.
   * @throws Exception
   */
  void run(HookContext hookContext) throws Exception;
}

Page 41: Apache Hive Hook

What HookContext has

• HookType

- PRE_EXEC_HOOK, POST_EXEC_HOOK, ON_FAILURE_HOOK

• QueryPlan

• HiveConf

• LineageInfo

• UserGroupInformation

• OperationName

• List<TaskRunner> completeTaskList

• Set<ReadEntity> inputs

• Set<WriteEntity> outputs

• Map<String, ContentSummary> inputPathToContentSummary
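Because the same interface serves the pre-execution, post-execution and failure properties, one class can be registered under all three and branch on the hook type. Here is a minimal sketch of that pattern with stand-in types: the HookType constants mirror the names listed above, while SketchHookContext and AuditHook are hypothetical and carry only two of the fields a real HookContext would.

```java
import java.util.ArrayList;
import java.util.List;

// Mirrors the three hook type names listed above.
enum HookType { PRE_EXEC_HOOK, POST_EXEC_HOOK, ON_FAILURE_HOOK }

// Hypothetical, much-reduced stand-in for HookContext.
class SketchHookContext {
    final HookType hookType;
    final String userName;
    final String operationName;

    SketchHookContext(HookType hookType, String userName, String operationName) {
        this.hookType = hookType;
        this.userName = userName;
        this.operationName = operationName;
    }
}

// One hook class registered for all three phases; it branches on the hook type.
class AuditHook {
    final List<String> lines = new ArrayList<>();

    void run(SketchHookContext ctx) {
        String who = ctx.userName + " " + ctx.operationName;
        switch (ctx.hookType) {
            case PRE_EXEC_HOOK:
                lines.add("START " + who);
                break;
            case POST_EXEC_HOOK:
                lines.add("OK " + who);
                break;
            case ON_FAILURE_HOOK:
                lines.add("FAILED " + who);
                break;
        }
    }
}
```

In a real hook the audit lines would go to a log or an external system instead of an in-memory list.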

Page 42: Apache Hive Hook

How to fire hooks without physically executing a query

ALTER TABLE table_name TOUCH [PARTITION partitionSpec];

• This statement causes the pre- and post-execution hooks to fire.

Page 43: Apache Hive Hook

MetaStore Event Listeners

Property                                 Abstract class (package org.apache.hadoop.hive.metastore)
hive.metastore.pre.event.listeners       MetaStorePreEventListener
hive.metastore.end.function.listeners    MetaStoreEndFunctionListener
hive.metastore.event.listeners           MetaStoreEventListener

• I think these listeners look like hooks.

• At a glance, I couldn’t find any particular difference between listeners and hooks. The only one I found is that listeners can’t affect query processing; they can only read.

• Anyway, they look useful for letting you know when a metastore does something.

Page 44: Apache Hive Hook

MetaStoreEventListener

• The following methods are invoked when the corresponding event occurs on a metastore.

- onCreateTable

- onDropTable

- onAlterTable

- onDropPartition

- onAlterPartition

- onCreateDatabase

- onDropDatabase

- onLoadPartitionDone

If you need more details, see org.apache.hadoop.hive.metastore.MetaStoreEventListener

Page 45: Apache Hive Hook

Be careful!

• Hooks

- can be a critical failure point! (You had better catch runtime exceptions.)

- are performed synchronously.

- can affect query processing time.
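One way to act on these warnings is to wrap the real hook logic so that an exception is recorded instead of propagated, and to measure how long the hook took. The sketch below assumes that swallowing the failure is acceptable for your use case (for an audit hook it may not be). QueryHook and SafeHook are hypothetical names, not Hive interfaces.

```java
// Hypothetical hook interface; like Hive's hook callbacks, it may throw.
interface QueryHook {
    void run(String command) throws Exception;
}

// Decorator that keeps a failing or slow delegate from breaking the caller.
class SafeHook implements QueryHook {
    private final QueryHook delegate;
    private int failures;
    private long lastElapsedNanos;

    SafeHook(QueryHook delegate) {
        this.delegate = delegate;
    }

    @Override
    public void run(String command) {
        long start = System.nanoTime();
        try {
            delegate.run(command);
        } catch (Exception e) {
            // Covers RuntimeException too; the failure is counted, not rethrown.
            failures++;
        } finally {
            lastElapsedNanos = System.nanoTime() - start;
        }
    }

    int failures() {
        return failures;
    }

    long lastElapsedNanos() {
        return lastElapsedNanos;
    }
}
```

The wrapper does not make the hook asynchronous: the delegate still runs on the calling thread, so a slow hook still adds its full running time to the query.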

Page 46: Apache Hive Hook

Let's try it out

• Demo

- Don’t be surprised if it doesn’t work.

- That’s the way the demo is...

Page 47: Apache Hive Hook

Thanks!

• Questions?

• Resources

- https://cwiki.apache.org/confluence/display/Hive/

- https://github.com/apache/hive