df: Dataframe on Spark


Description

Many data scientists use the Python library pandas for quick exploration of data. Its most useful construct (borrowed from R) is the dataframe: a 2D array (a.k.a. matrix) with the option to name the columns (and rows). But pandas is not distributed, so there is a limit on the size of the data that can be explored. Spark is a MapReduce-like framework that can handle very big data on a shared-nothing cluster of machines. This work is an attempt to provide a pandas-like DSL on top of Spark, so that data scientists familiar with pandas have a very gradual learning curve.
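To make the "named columns" idea concrete, here is a toy model in plain Scala (the language df is written in). The Frame class and the sample columns below are invented for illustration; they are not the df library's API:

  // A toy model of the dataframe idea: a 2D table whose columns have names.
  case class Frame(columns: Map[String, Vector[Double]]) {
    // Column access by name, in the spirit of frame("a")
    def apply(name: String): Vector[Double] = columns(name)
  }

  object FrameDemo extends App {
    val f = Frame(Map(
      "avg" -> Vector(1.5, 2.0, 3.25),
      "cnt" -> Vector(2.0, 4.0, 1.0)
    ))
    // Columns are addressed by name instead of by integer index
    val total = f("avg").zip(f("cnt")).map { case (a, c) => a * c }
    println(total)  // prints Vector(3.0, 8.0, 3.25)
  }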

Transcript of df: Dataframe on Spark

Page 1: df: Dataframe on Spark
Page 2: df: Dataframe on Spark

df: Dataframe on Spark

Mohit Jaggi, Code Ninja and Troublemaker at Ayasdi

Page 3: df: Dataframe on Spark

Agenda

• About Ayasdi

• df

• Brief Demo (if time allows)

• Conclusion

Page 4: df: Dataframe on Spark

About Ayasdi

Page 5: df: Dataframe on Spark

Traditional Analytics

[Diagram: Hypothesis? → CODE]

Page 6: df: Dataframe on Spark

Automated Insights

Page 7: df: Dataframe on Spark

Ayasdi Solution

[Diagram: ETL → Ayasdi Platform (Distributed Computing, Algorithmic Reach) → UX]

Page 8: df: Dataframe on Spark

df

Page 9: df: Dataframe on Spark

Day in a data scientist's life

• Get data

• Need more/something else

• Data wrangling

• Rinse, repeat

• Load into analysis software like Ayasdi Core

• Actual data analysis, model building, etc.

Page 10: df: Dataframe on Spark

Data Wrangling Tools

• grep, cut, wc -l, head, tail

• Python Pandas

• Most useful construct: the pandas data frame, à la Excel but with a CLI

Page 11: df: Dataframe on Spark

Challenges

• Applying data science techniques to data larger than a single machine's memory

• It is easier to procure a cluster of small machines than one big machine

• Processing takes too long

Page 12: df: Dataframe on Spark

Solution: Distribute

• Hadoop ecosystem: Spark is great

• Learning curve: what is this RDD thing? Where is my familiar data frame?

• There is pyspark, but to get the best out of Spark you should use Scala, which is another learning curve

Page 13: df: Dataframe on Spark

df: Gentle Incline

"I want to put my projects on hold, and learn several new things simultaneously"

- No One Ever

• Attempts to provide an API on Spark that looks and feels like the pandas data frame

e.g. in pandas:

df["a"]

and in df:

df("a")

• Also intuitive for R programmers
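The pandas-like syntax is possible because Scala rewrites t("a") as t.apply("a") and t("a") = v as t.update("a", v). Here is a minimal sketch of that mechanism in plain Scala; it is illustrative only, not the df library's actual implementation:

  import scala.collection.mutable

  // Scala desugars t("a") to t.apply("a") and t("a") = v to t.update("a", v),
  // which is what lets a Scala class mimic pandas-style column access.
  class Table {
    private val cols = mutable.Map[String, Vector[Double]]()
    def apply(name: String): Vector[Double] = cols(name)     // enables t("a")
    def update(name: String, col: Vector[Double]): Unit =    // enables t("a") = ...
      cols(name) = col
  }

  object SugarDemo extends App {
    val t = new Table
    t("a") = Vector(1.0, 2.0, 3.0)  // calls t.update("a", Vector(...))
    t("b") = t("a").map(_ * 10)     // calls t.apply("a"), then t.update("b", ...)
    println(t("b"))                 // prints Vector(10.0, 20.0, 30.0)
  }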

Page 14: df: Dataframe on Spark

Advantages

• Runs quite transparently on Spark: distributed processing

• Is in Scala: no layering overhead

• Is in Scala: can directly call cutting-edge Spark libraries like MLlib (pyspark wrappers are usually a bit behind)

• Is an "internal DSL": advanced users can augment it with arbitrary Scala code (a Python wrapper is still possible)

• Is an "internal DSL": fast without resorting to code generation

• Fully open source, Apache license

Page 15: df: Dataframe on Spark

Real Life Examples

Snippets of data scientist code that were "converted" from pandas to df to make them scale to larger data.

Add a column with the total:

  mppu["total"] = mppu["avg"] * mppu["c_line_srvc_cnt"]

→

  mppu("total") = mppu("avg") * mppu("c_line_srvc_cnt")

Remove $ and , from numbers representing money:

  mppu["de-comma"] = mppu["dollar"].str.replace('$', '')
  mppu["de-dollar"] = mppu["de-comma"].str.replace(',', '').astype(float)

→

  mppu("de-dollar") = mppu("dollar").map { x: String => x.replace("$", "").replace(",", "").toDouble }
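For reference, here is the same cleaning logic as self-contained, runnable Scala without Spark, showing what the map call above does to each cell; the sample dollar strings are made up for illustration:

  object CleanDollarsDemo extends App {
    // Strip the currency symbol and thousands separators, then parse
    val dollar = Vector("$1,234.50", "$99", "$2,000,000")
    val deDollar = dollar.map { x: String =>
      x.replace("$", "").replace(",", "").toDouble
    }
    println(deDollar)  // prints Vector(1234.5, 99.0, 2000000.0)
  }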

Page 16: df: Dataframe on Spark

Demo

Page 17: df: Dataframe on Spark

Future

• pyspark wrapper

• more data sources like SQL, Parquet, HDF5, etc.

• charts and graphs

• contributors welcome!

Page 18: df: Dataframe on Spark

Conclusion

Page 19: df: Dataframe on Spark

Summary

• pandas is awesome

• df scales to bigger data, looks and feels like pandas

• fully open source

https://github.com/AyasdiOpenSource/df

• Check out our website. We are hiring!

http://engineering.ayasdi.com/

http://www.ayasdi.com/careers/

Page 20: df: Dataframe on Spark

Acknowledgements

• Max Song for introducing me to Pandas

• Jean-Ezra Young for insurance claims example

• Ayasdi for open-sourcing this work

• Hadoop and Spark communities for the awesome platform

• Pandas team for the awesome tool