Sparkling Water 5 28-14

Transcript of Sparkling Water 5 28-14

  • Meetup 5/28/2014: Sparkling Water. Michal Malohlava, @mmalohlava, @hexadata
  • Who am I? Background: PhD in CS from Charles University in Prague, 2012; 1 year as a postdoc at Purdue University, experimenting with algorithms for large computations; 1 year at 0xdata, helping to develop the H2O engine for big data computation. Experience with domain-specific languages, distributed systems, software engineering, and big data.
  • Overview: 1. Towards H2O and Spark integration 2. Details and demo 3. Next steps
  • Vision Towards Spark and H2O integration
  • User-friendly API; large and active community; platform components (SQL); multitenancy. Memory efficient; performance of computation; machine learning algorithms; parser, R interface.
  • Combine the benefits of both tools and make H2O a killer application for Spark
  • Steps towards interoperability: 1. Data sharing between Spark and H2O 2. Optimize & improve 3. Low-level integration
  • Data sharing scenario (diagram built up over several slides; labels: RDD, SQL query, Frame, Algo)
  • Data sharing strategies (Spark to H2O). Possible solutions: direct, distributed, socket-based, file-based, Tachyon-based
  • Data sharing via Tachyon (diagram: H2O node with Spark driver, Tachyon), built up step by step: 1. Invoke Spark driver 2. Load data 3. Query 4. Persist data to Tachyon 5. Load data into H2O frame 6. Invoke GBM on data
  • Spark 1.0-rc11: SQL component; implemented a proper parser/serializer to satisfy the H2O parser
  • Latest H2O version (2.5-SNAPSHOT): Tachyon support included; embedded Spark driver
  • Key requirements: transparent approach; work with many columns; preserve NAs; preserve headers
  • Solved challenges: large number of columns in the SQL schema. More than 22 columns runs into the case class restriction; solved via the Product interface:
      class Airlines(
        year          : Option[Int],     // 0
        month         : Option[Int],     // 1
        dayOfMonth    : Option[Int],     // 2
        dayOfWeek     : Option[Int],     // 3
        crsDepTime    : Option[Int],     // 5
        crsArrTime    : Option[Int],     // 7
        uniqueCarrier : Option[String],  // 8
        flightNum     : Option[Int],     // 9
        tailNum       : Option[Int],     // 10
        crsElapsedTime: Option[Int],     // 12
        origin        : Option[String],  // 16
        dest          : Option[String],  // 17
        distance      : Option[Int],     // 18
        isArrDelayed  : Option[Boolean], // 29
        isDepDelayed  : Option[Boolean]  // 30
      ) extends Product { }
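The Product workaround named on this slide can be sketched as follows. This is a minimal illustration, not the demo's actual code: `Flight` is a hypothetical, shortened stand-in for `Airlines` with only four fields, and the sketch assumes that the consumer only needs a class implementing `scala.Product`, which is how the Spark 1.0 documentation describes getting past the 22-field limit of Scala 2.10 case classes.

```scala
// Hypothetical, shortened stand-in for the Airlines class on the slide.
// A plain class is not subject to the 22-field case class limit of Scala 2.10,
// so it implements the Product members by hand.
class Flight(
  val year: Option[Int],
  val uniqueCarrier: Option[String],
  val dest: Option[String],
  val isDepDelayed: Option[Boolean]
) extends Product with Serializable {

  // Number of fields exposed through the Product interface.
  override def productArity: Int = 4

  // Positional access to the fields, in declaration order.
  override def productElement(n: Int): Any = n match {
    case 0 => year
    case 1 => uniqueCarrier
    case 2 => dest
    case 3 => isDepDelayed
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }

  // Required by scala.Equals, which Product extends.
  override def canEqual(that: Any): Boolean = that.isInstanceOf[Flight]
}

object FlightDemo {
  def main(args: Array[String]): Unit = {
    val f = new Flight(Some(2008), Some("WN"), Some("SFO"), None)
    // productIterator is derived from productArity and productElement.
    println(f.productIterator.mkString(","))
  }
}
```

With the full 15-field `Airlines` class the same three members would be written out, with `productArity` returning 15 and one `case` per field.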
  • Solved challenges: handling NAs during load: store them in the SQL RDD (solved by https://github.com/apache/spark/pull/658); use Option[T] or a non-primitive Java type. Handling NAs during save: a simple sql.Row serializer that handles NA values.
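The NA handling described on this slide can be illustrated with a small, self-contained sketch. It is not the serializer used in the demo: `NaHandling`, `parseInt`, and `serializeRow` are hypothetical names, rows are modelled as plain `Seq[Option[Any]]` rather than Spark's `sql.Row`, and the choice of `NA` as the missing-value token on both the load and save side is an assumption.

```scala
object NaHandling {
  // Tokens treated as missing values on load; the exact set is an assumption.
  private val NaTokens = Set("", "NA")

  // Load side: parse one CSV field into Option[Int].
  // Missing or malformed input becomes None instead of a sentinel such as 0.
  def parseInt(field: String): Option[Int] = {
    val s = field.trim
    if (NaTokens.contains(s)) None
    else scala.util.Try(s.toInt).toOption
  }

  // Save side: serialize one row of optional values to a CSV line,
  // writing the NA token for None so that missing values survive the round trip.
  def serializeRow(row: Seq[Option[Any]]): String =
    row.map(_.map(_.toString).getOrElse("NA")).mkString(",")
}

object NaHandlingDemo {
  def main(args: Array[String]): Unit = {
    val year = NaHandling.parseInt("2008")
    val flightNum = NaHandling.parseInt("NA")
    println(NaHandling.serializeRow(Seq(year, flightNum, Some("SFO"))))
  }
}
```

Keeping missing values as `None` end to end is what lets the "preserve NAs" requirement hold across the transfer; a primitive `Int` column would force a placeholder value.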
  • Time for Demo
  • Step-by-step: 1. Start Spark cloud (1 worker) 2. Start Tachyon storage 3. Start H2O slave node 4. Start H2O master node with Scala master program. Demo code:
      override def run(conf: DemoConf): Unit = {
        // Dataset
        val dataset = "data/allyears2k_headers.csv"
        // Row parser
        val rowParser = AirlinesParser
        // Table name for SQL
        val tableName = "airlines_table"
        // Select all flights with destination == SFO
        val query = """SELECT * FROM airlines_table WHERE dest="SFO" """

        // Connect to the Spark cluster, run the query over the airlines table, and transfer the data into H2O
        val frame: Frame = executeSpark[Airlines](dataset, rowParser,
          conf.extractor, tableName, query, local = conf.local)

        // Now make a blocking call of GBM directly via Java API
        gbm(frame, frame.vec("isDepDelayed"), 100, true)
      }
  • Next steps: optimize data transfers (have a notion of an H2O RDD inside Spark); H2O backend for MLlib (based on the H2O RDD; use H2O algos)
  • Open challenges: see http://jira.0xdata.com and the Sparkling component. PUB-730: transfer results from H2O frame into RDD; PUB-732: Parquet support for H2O; PUB-733: MLlib backend; PUB-734: H2O-based RDD
  • Time for questions Thank you!
  • Thank you! Learn more about H2O at 0xdata.com or by cloning the repositories:
      neo> for r in h2o h2o-sparkling; do git clone git@github.com:0xdata/$r.git; done
    Follow us at @hexadata