df: Dataframe on Spark
Transcript of df: Dataframe on Spark
df: Dataframe on Spark
Mohit Jaggi, Code Ninja and Troublemaker at Ayasdi
Agenda
• About Ayasdi
• df
• Brief Demo (if time allows)
• Conclusion
About Ayasdi
[Diagram: Traditional Analytics (code, hypothesis-driven) vs. Automated Insights]
Ayasdi Solution
[Diagram: Ayasdi Platform — ETL, UX, Distributed Computing, Algorithmic Reach]
df
Day in a data scientist’s life
• Get data
• Need more/something else
• Data wrangling
• Rinse, repeat
• Load into analysis software like Ayasdi Core
• Actual data analysis, model building, etc.
Data Wrangling Tools
• grep, cut, wc -l, head, tail
• Python Pandas
• Most useful construct: the pandas data frame, à la Excel but with a CLI
Challenges
• Applying data science techniques to data larger than single machine’s memory
• Easier to procure a cluster of small machines than one big machine
• Processing takes too long
Solution: Distribute
• Hadoop ecosystem: Spark is great
• Learning curve: what is this RDD thing? Where is my familiar data frame? (see the sketch below)
• There is pyspark, but to get the best out of Spark you need Scala: another learning curve
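To make the learning-curve point concrete, here is a minimal sketch (not from the slides; the file name and column layout are made up) of what a trivial "multiply two columns" task looks like against a raw RDD in Scala:

import org.apache.spark.{SparkConf, SparkContext}

object RddWrangling {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-wrangling").setMaster("local[*]"))

    val lines  = sc.textFile("claims.csv")          // hypothetical CSV with a header row
    val header = lines.first()
    val rows   = lines.filter(_ != header).map(_.split(","))

    // Column positions must be tracked by hand; there is no named-column access.
    val totals = rows.map(r => r(0).toDouble * r(1).toDouble)
    totals.take(5).foreach(println)

    sc.stop()
  }
}

Nothing here is hard, but none of it looks like the data frame a pandas user already knows.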
df: Gentle Incline
“I want to put my projects on hold, and learn several new things simultaneously”
- No One Ever
• Attempts to provide an API on Spark that looks and feels like the pandas data frame
e.g. in pandas: df["a"]
in df: df("a")
• Also intuitive for R programmers
Advantages
• Quite transparently runs on Spark: distributed processing
• Is in Scala: No layering overhead
• Is in Scala: Can directly call cutting-edge Spark libraries like MLlib [pyspark wrappers are usually a bit behind] (see the sketch after this list)
• Is an “internal DSL”: Advanced users can augment with arbitrary Scala code. [python wrapper still possible]
• Is an “internal DSL”: Fast without resorting to code-generation
• Fully open sourced, Apache license
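A short sketch of the "call MLlib directly" bullet. This is plain Spark plus MLlib rather than the df API, and the data points are made up; the point is that once wrangled columns are available as an RDD of vectors, Scala code can hand them straight to MLlib with no wrapper layer in between:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object MLlibDirect {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mllib-direct").setMaster("local[*]"))

    // Stand-in for two numeric columns pulled out of a data frame.
    val points = sc.parallelize(Seq(
      Vectors.dense(1.0, 1.0), Vectors.dense(1.2, 0.9),
      Vectors.dense(8.0, 8.1), Vectors.dense(7.9, 8.3)
    ))

    // Cluster with MLlib's KMeans: k = 2, 20 iterations.
    val model = KMeans.train(points, 2, 20)
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}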
Real Life Examples
Snippets of data scientist code that were “converted” from pandas to df to make them scale to larger data
Add a column with total
pandas: mppu["total"] = mppu["avg"] * mppu["c_line_srvc_cnt"]
df:     mppu("total") = mppu("avg") * mppu("c_line_srvc_cnt")
Remove $ and , from numbers representing money
pandas: mppu["de-comma"] = mppu["dollar"].str.replace('$', '')
        mppu["de-dollar"] = mppu["de-comma"].str.replace(',', '').astype(float)
df:     mppu("de-dollar") = mppu("dollar").map { x: String => x.replace("$", "").replace(",", "").toDouble }
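The de-dollar example above also illustrates the "internal DSL" advantage: the cleanup logic is just ordinary Scala, so it can be pulled out into a plain function and reused or tested without any Spark or df dependency. A standalone sketch (names are illustrative only):

object MoneyParsing {
  // Same logic as the map body in the df example above.
  def deDollar(s: String): Double =
    s.replace("$", "").replace(",", "").toDouble

  def main(args: Array[String]): Unit = {
    println(deDollar("$1,234.50")) // 1234.5
    println(deDollar("987"))       // 987.0
  }
}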
Demo
Future
• pyspark wrapper
• more data sources like SQL, Parquet, HDF5, etc.
• charts and graphs
• contributors welcome!
Conclusion
Summary
• pandas is awesome
• df scales to bigger data, looks and feels like pandas
• fully open source
https://github.com/AyasdiOpenSource/df
• Check out our website. We are hiring!
http://engineering.ayasdi.com/
http://www.ayasdi.com/careers/
Acknowledgements
• Max Song for introducing me to Pandas
• Jean-Ezra Young for insurance claims example
• Ayasdi for open-sourcing this work
• Hadoop and Spark communities for the awesome platform
• Pandas team for the awesome tool