HCatalog
(and friends)
Sushanth Sowmyan
Committer, Apache HCatalog
[email protected] @khorgath
Hortonworks Inc. 2011
Let's think about data for a bit...
From Wikipedia: Data (/ˈdeɪtə/ DAY-tə, /ˈdætə/ DA-tə, or /ˈdɑːtə/ DAH-tə)
Qualitative or quantitative attributes of a variable or set of variables. Data are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which information and then knowledge are derived. Raw data, i.e., unprocessed data, refers to a collection of numbers, characters, images or other outputs from devices that collect information to convert physical quantities into symbols.
Architecting the Future of Big Data Hortonworks Inc. 2011
So what is needed to make Data useful?
Arguably, tools to convert data into information.
Arguably also, knowledge about the data, so that the tools can then make use of the data in a meaningful sense, to extract information from it.
So what are the characteristics of a Data Warehouse?
Data is present, organized, recorded, and catalogued. Tools exist that are able to operate on the data.
So what do tools need to be able to operate on data?
Finding it
Photo credit : dkeats on flickr
Finding it
Knowing where data is.
Evolve : Knowing which data is where (naming data)
Evolve : Organization to support various data modeling concepts (table, partitions, columns, records)
Evolve : "done" semantics (is the data complete?), existence semantics (is the data there yet?)
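A minimal sketch of that progression, assuming a toy in-memory catalog rather than HCatalog's actual API: names map to tables, tables have columns and partitions, and a partition only becomes visible to readers once its writer marks it done. Every name here (`ToyCatalog`, `add_partition`, `mark_done`, the paths) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    location: str          # where the files live (hypothetical path)
    done: bool = False     # "done" semantics: is the data complete?


@dataclass
class Table:
    columns: list
    partitions: dict = field(default_factory=dict)  # partition key -> Partition


class ToyCatalog:
    """Hypothetical in-memory catalog; not the HCatalog API."""

    def __init__(self):
        self._tables = {}

    def create_table(self, name, columns):
        self._tables[name] = Table(columns=list(columns))

    def add_partition(self, table, key, location):
        self._tables[table].partitions[key] = Partition(location)

    def mark_done(self, table, key):
        self._tables[table].partitions[key].done = True

    def exists(self, table, key):
        # Existence semantics: has the partition been registered at all?
        return table in self._tables and key in self._tables[table].partitions

    def readable_locations(self, table):
        # Readers only see partitions whose writers have declared them done.
        parts = self._tables[table].partitions
        return [p.location for _, p in sorted(parts.items()) if p.done]


catalog = ToyCatalog()
catalog.create_table("clicks", ["user", "url"])
catalog.add_partition("clicks", "date=2011-11-01", "/data/clicks/2011-11-01")
catalog.add_partition("clicks", "date=2011-11-02", "/data/clicks/2011-11-02")
catalog.mark_done("clicks", "date=2011-11-01")
print(catalog.readable_locations("clicks"))
```

The second partition exists but is not yet readable, which is the difference between existence semantics and done semantics.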
Reading it
Photo credit : kylesteed on flickr
Reading it
Each tool having its own storage space, its own private world
Evolve : Abstracting away storage mechanism and having tools sit on top of file formats and mechanisms, so now, suddenly, tools have interoperability.
Evolve : Having a storage abstraction that adapts to existing storage mechanisms in an easy to develop manner
Who are the various actors in a data ecosystem?
Analyst - uses SQL (Hive) and/or JDBC-based tools
Programmer - cares about data transformation, uses Pig or M/R
Project owner - cares about amount of resources used, data portability, data connectors
Ops - needs to manage data storage and the cluster, needs to control data expiry, replication, import and export.
(stealing slide from Alan's TriHUG talk)
Also :
People who help aforementioned people:
Tool Writer - wants abstractions to deal with variances, wants to be able to store and retrieve relevant metadata and data, so they can focus on their user
Storage subsystem writer - wants standardization so that their storage can be used by other actors.
What do they all want?
Need it Working
Correctness
Speed, Efficiency
Interoperability, Convenience
Did somebody say Interoperability?
Making Your Structured Data Available to the MapReduce Engine
[Diagram labels: MapReduce, Pig, Hive; HCatalog; HDFS, HBase, MPP Store]
Users can query data with Pig, Hive, or custom MapReduce jobs
Standard HDFS formats available Q1 2012
HBase data by early Q2 2012
HCatalog underlying architecture
[Diagram labels: HCatLoader, HCatStorer; HCatInputFormat, HCatOutputFormat; CLI; Notification; Hive MetaStore Client; Generated Thrift Client; Hive MetaStore; RDBMS]
Problem: Need to Know Where Data Is
[Diagram labels: Pig, Hive, MapReduce; Storage]
Solution: Register Through HCatalog
[Diagram labels: Pig, Hive, MapReduce; HCatalog; Storage]
Problem: Data in a variety of formats
Data files may be organized in different formats
Data files may contain different formats in different partitions
Storage (HDFS, HBase, etc.)
Solution: HCat provides common abstraction
Registered Data w/ Schema
HCat normalizes data to application
[Diagram labels: Hadoop Application; HCatalog; Storage]
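One way to picture the common abstraction is a toy reader registry, sketched below. This is an illustration only and not HCatalog's implementation: the table layout, the `READERS` registry, and the format names `csv` and `jsonl` are all assumptions. Each partition records its own format, and the application only ever sees records shaped by the registered schema.

```python
import csv
import io
import json


def read_csv(text, columns):
    # Each CSV row becomes a dict keyed by the registered column names.
    return [dict(zip(columns, row)) for row in csv.reader(io.StringIO(text))]


def read_jsonl(text, columns):
    # Each non-empty line is a JSON object; keep only the registered columns.
    rows = (json.loads(line) for line in text.splitlines() if line.strip())
    return [{c: row[c] for c in columns} for row in rows]


# Hypothetical registry of readers, keyed by a format name stored per partition.
READERS = {"csv": read_csv, "jsonl": read_jsonl}

# Toy "registered data with schema": one table, two partitions, two formats.
TABLE = {
    "columns": ["user", "url"],
    "partitions": [
        {"format": "csv", "data": "alice,/home\nbob,/about\n"},
        {"format": "jsonl", "data": '{"user": "carol", "url": "/home"}\n'},
    ],
}


def read_table(table):
    # The application asks for records; the format lookup happens here.
    for part in table["partitions"]:
        yield from READERS[part["format"]](part["data"], table["columns"])


records = list(read_table(TABLE))
print(records)
```

Adding a new storage format in this sketch means registering one more reader; the application code calling `read_table` does not change.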
Getting Involved
Incubator site : http://incubator.apache.org/hcatalog
User list: [email protected]
Dev list: [email protected]
TODO
HCATALOG-8 : HCatalog needs a logo
HBase integration, trying to nail down a better table metaphor
Hive integration: interoperability between the notion of StorageDriver and StorageHandler, project dependency management
0.23 Work
HCATALOG-182 : Improve the "and friends" bit.
Waitaminnit... what was that about friends ?
Templeton
A Webservices API for Hadoop
Photo credit : PKMousie on flickr
Templeton: ISV Front-door for Hadoop
Insulation from interface changes release to release
Opens the door to languages other than Java
Thin clients through webservices vs forced fat-clients in gatewa
Still prototyping! But see a common need.
Templeton Specific Support
Move data directly into/out-of HDFS through WebHDFS
Webservice calls to HCatalog
  Register table relationships for data (e.g., createTable, createDatabase)
  Adjust tables (e.g., AlterTable)
  Look at statistics (e.g., ShowTable)
Webservice calls to start work
  MapReduce, Pig, Hive
  Poll for job status
  Notification URL when job completes (optional)
Stateless Server
  Horizontally scale for load
  Configurable for HA
  Currently requires ZooKeeper to track job status info
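A rough sketch of why a stateless server pairs well with polling, assuming a toy model and not Templeton's actual API: job status lives in an external store (ZooKeeper in the slide, a plain dict here), so any server instance can answer a status poll. The class names, the status strings, and the `on_poll` hook are all hypothetical.

```python
import itertools


class StatusStore:
    """Stand-in for the external store (ZooKeeper in the slides; a dict here)."""

    def __init__(self):
        self._status = {}

    def set(self, job_id, status):
        self._status[job_id] = status

    def get(self, job_id):
        return self._status.get(job_id, "UNKNOWN")


class ToyServer:
    """Hypothetical stateless front end: keeps no job state of its own."""

    def __init__(self, store):
        self._store = store

    def submit(self, job_id):
        self._store.set(job_id, "RUNNING")
        return job_id

    def status(self, job_id):
        return self._store.get(job_id)


def poll_until_done(servers, job_id, max_polls, on_poll=None):
    # Round-robin across servers, as a load balancer might.
    for attempt, server in zip(range(max_polls), itertools.cycle(servers)):
        status = server.status(job_id)
        if status in ("SUCCEEDED", "FAILED"):
            return status
        if on_poll:
            on_poll(attempt)  # hook used below to simulate the job progressing
    return "TIMED_OUT"


store = StatusStore()
servers = [ToyServer(store), ToyServer(store)]
job = servers[0].submit("job-1")


def finish_after_two(attempt):
    # Simulate the job completing after the second poll.
    if attempt == 1:
        store.set(job, "SUCCEEDED")


result = poll_until_done(servers, job, max_polls=5, on_poll=finish_after_two)
print(result)
```

Because neither `ToyServer` holds the status itself, polls can land on either instance, which is what lets the front end scale horizontally.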
ANY QUESTIONS ?