df: Dataframe on Spark


Description

Many data scientists use the Python library pandas for quick exploration of data. Its most useful construct (borrowed from R) is the dataframe: a 2D array (a.k.a. matrix) with the option to name the columns (and rows). But pandas is not distributed, so there is a limit on the size of the data that can be explored. Spark is a MapReduce-like framework that can handle very big data on a shared-nothing cluster of machines. This work is an attempt to provide a pandas-like DSL on top of Spark, so that data scientists familiar with pandas have a very gradual learning curve.
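To make the "named columns" idea concrete, here is a toy model in plain Scala (the language df is written in). The Frame class and the sample columns below are invented for illustration; they are not the df library's API:

  // A toy model of the dataframe idea: a 2D table whose columns have names.
  case class Frame(columns: Map[String, Vector[Double]]) {
    // Column access by name, in the spirit of frame("a")
    def apply(name: String): Vector[Double] = columns(name)
  }

  object FrameDemo extends App {
    val f = Frame(Map(
      "avg" -> Vector(1.5, 2.0, 3.25),
      "cnt" -> Vector(2.0, 4.0, 1.0)
    ))
    // Columns are addressed by name instead of by integer index
    val total = f("avg").zip(f("cnt")).map { case (a, c) => a * c }
    println(total)  // prints Vector(3.0, 8.0, 3.25)
  }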

Transcript of df: Dataframe on Spark

Page 1: df: Dataframe on Spark
Page 2: df: Dataframe on Spark

df: Dataframe on Spark

Mohit Jaggi, Code Ninja and Troublemaker at Ayasdi

Page 3: df: Dataframe on Spark

Agenda

• About Ayasdi

• df

• Brief Demo (if time allows)

• Conclusion

Page 4: df: Dataframe on Spark

About Ayasdi

Page 5: df: Dataframe on Spark

Traditional Analytics

[Diagram: Hypothesis? → CODE]

Page 6: df: Dataframe on Spark

Automated Insights

Page 7: df: Dataframe on Spark

Ayasdi Solution

[Diagram: ETL → Ayasdi Platform (Distributed Computing, Algorithmic Reach) → UX]

Page 8: df: Dataframe on Spark

df

Page 9: df: Dataframe on Spark

Day in a data scientist's life

• Get data

• Need more/something else

• Data wrangling

• Rinse, repeat

• Load into analysis software like Ayasdi Core

• Actual data analysis, model building, etc.

Page 10: df: Dataframe on Spark

Data Wrangling Tools

• grep, cut, wc -l, head, tail

• Python Pandas

• Most useful construct: the pandas data frame, à la Excel but with a CLI

Page 11: df: Dataframe on Spark

Challenges

• Applying data science techniques to data larger than a single machine's memory

• It is easier to procure a cluster of small machines than one big machine

• Processing takes too long

Page 12: df: Dataframe on Spark

Solution: Distribute

• Hadoop ecosystem: Spark is great

• Learning curve: what is this RDD thing? Where is my familiar data frame?

• There is pyspark, but to get the best out of Spark you should use Scala, which is another learning curve

Page 13: df: Dataframe on Spark

df: Gentle Incline

"I want to put my projects on hold, and learn several new things simultaneously"

- No One Ever

• Attempts to provide an API on Spark that looks and feels like the pandas data frame

e.g. in pandas:

df["a"]

and in df:

df("a")

• Also intuitive for R programmers
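The pandas-like syntax is possible because Scala rewrites t("a") as t.apply("a") and t("a") = v as t.update("a", v). Here is a minimal sketch of that mechanism in plain Scala; it is illustrative only, not the df library's actual implementation:

  import scala.collection.mutable

  // Scala desugars t("a") to t.apply("a") and t("a") = v to t.update("a", v),
  // which is what lets a Scala class mimic pandas-style column access.
  class Table {
    private val cols = mutable.Map[String, Vector[Double]]()
    def apply(name: String): Vector[Double] = cols(name)     // enables t("a")
    def update(name: String, col: Vector[Double]): Unit =    // enables t("a") = ...
      cols(name) = col
  }

  object SugarDemo extends App {
    val t = new Table
    t("a") = Vector(1.0, 2.0, 3.0)  // calls t.update("a", Vector(...))
    t("b") = t("a").map(_ * 10)     // calls t.apply("a"), then t.update("b", ...)
    println(t("b"))                 // prints Vector(10.0, 20.0, 30.0)
  }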

Page 14: df: Dataframe on Spark

Advantages

• Runs quite transparently on Spark: distributed processing

• Is in Scala: no layering overhead

• Is in Scala: can directly call cutting-edge Spark libraries like MLlib (pyspark wrappers are usually a bit behind)

• Is an "internal DSL": advanced users can augment it with arbitrary Scala code (a Python wrapper is still possible)

• Is an "internal DSL": fast without resorting to code generation

• Fully open source, Apache license

Page 15: df: Dataframe on Spark

Real Life Examples

Snippets of data scientist code that were "converted" from pandas to df to make them scale to larger data.

Add a column with the total:

  mppu["total"] = mppu["avg"] * mppu["c_line_srvc_cnt"]

→

  mppu("total") = mppu("avg") * mppu("c_line_srvc_cnt")

Remove $ and , from numbers representing money:

  mppu["de-comma"] = mppu["dollar"].str.replace('$', '')
  mppu["de-dollar"] = mppu["de-comma"].str.replace(',', '').astype(float)

→

  mppu("de-dollar") = mppu("dollar").map { x: String => x.replace("$", "").replace(",", "").toDouble }
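For reference, here is the same cleaning logic as self-contained, runnable Scala without Spark, showing what the map call above does to each cell; the sample dollar strings are made up for illustration:

  object CleanDollarsDemo extends App {
    // Strip the currency symbol and thousands separators, then parse
    val dollar = Vector("$1,234.50", "$99", "$2,000,000")
    val deDollar = dollar.map { x: String =>
      x.replace("$", "").replace(",", "").toDouble
    }
    println(deDollar)  // prints Vector(1234.5, 99.0, 2000000.0)
  }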

Page 16: df: Dataframe on Spark

Demo

Page 17: df: Dataframe on Spark

Future

• pyspark wrapper

• more data sources like SQL, Parquet, HDF5, etc.

• charts and graphs

• contributors welcome!

Page 18: df: Dataframe on Spark

Conclusion

Page 19: df: Dataframe on Spark

Summary

• pandas is awesome

• df scales to bigger data, looks and feels like pandas

• fully open source

https://github.com/AyasdiOpenSource/df

• Check out our website. We are hiring!

http://engineering.ayasdi.com/

http://www.ayasdi.com/careers/

Page 20: df: Dataframe on Spark

Acknowledgements

• Max Song for introducing me to Pandas

• Jean-Ezra Young for insurance claims example

• Ayasdi for open-sourcing this work

• Hadoop and Spark communities for the awesome platform

• Pandas team for the awesome tool