The QuantCell Big Data Spreadsheet - Meetup: files.meetup.com/3168962/QuantCell_SIG_March_23_2013.pdf (Agust Egilsson, PhD, Big Data Science, agust@quantcell.com)


The QuantCell Big Data Spreadsheet

Agust Egilsson, PhD Big Data Science

agust@quantcell.com

Saturday, March 23, 2013

* Image cropped from article about QuantCell Research in Java Magazine JULY/AUGUST 2012



We will talk about:

• Java based programming for big data scientists & end-users
• How the experts benefit directly from open source Java APIs
• Transferring JVM performance to the expert or programmer
• Deployment of solutions into production
• Big Data analytics created and consumed by end-users
• Long running operations, multithreading and garbage collection
• Simplifying coding for the data scientists
• Questions

Java based programming for big data scientists & end-users

Why spreadsheets for data-scientists and domain experts?

• shorter turnaround times (e.g. financial products)
• dynamic execution, debugging and testing
• integrated runtime and development environments
• experiment driven programming
• expression-oriented programming
• minimum or no GUI design
• by far the most widely used programming system


Why Java?

• large ecosystem of analytical tools and resources
• explosive growth in publicly available APIs
• concurrency support
• big data analytics & technologies are mostly Java based
• HPC and cloud ready
• performance
• optimization


The QuantCell big data spreadsheet supports

• high performance and access to Hadoop clusters
• intuitive access to local and remote data-sources
• access to a variety of algorithms and methods
• simplified programming already familiar to the expert
• effortless deployment of solutions to Hadoop and into production


Common use cases include

• big data analytics
• data mining using Mahout, Weka, etc.
• risk analysis, pricing and trading strategies

Live demo: Simple Java spreadsheet expressions: Data Market, Bio Data and simple analysis.

How the experts benefit directly from open source Java APIs

Explosive growth in publicly available Java analytical and visualization libraries


For example:

• OpenGamma (695,000 lines)
• Weka (507,000 lines)
• RapidMiner/YALE (535,000 lines)
• BioJava (270,000 lines)
• Chemistry Development Kit (861,000 lines)
• NASA WorldWind (420,000 lines)
• and so on …


The same is true for Java based analytical frameworks

• Apache Hadoop (2,200,000 lines, Java and XML)
• Apache Pig (320,000 lines, analyzing large datasets)
• Apache Hive (420,000 lines, data warehousing)
• …

Taking advantage of these libraries from the spreadsheet is simple, and in many cases non-developers can do it.

Live demo: OpenGamma Financial API example.

Transferring JVM performance to the expert or programmer

Analytical projects require top performance

• expressions are compiled to byte code
• expressions are optimized by Java
• dynamically loaded into the JVM for execution
• just-in-time compilation

Live demo: Let’s look at a few Java optimization tricks and confirm that these are used dynamically in the spreadsheet to optimize user expressions/functions
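The compile, load and execute steps listed on this slide can be sketched with the standard javax.tools API. This is only an illustration of the general technique, not QuantCell's implementation: the class names DynamicExpression and CellExpr, the method evaluate, and the convention that the expression refers to a double named x are all assumptions made for this example. It needs a JDK (not just a JRE) at runtime.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class DynamicExpression {

    /**
     * Wraps a Java expression over a double named x in a throwaway class
     * (hypothetical name CellExpr), compiles it to byte code, loads the
     * class into the running JVM and evaluates it.
     */
    public static double evaluate(String expression, double x) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system compiler found; a JDK is required");
        }
        try {
            // Write the generated source to a temporary directory.
            Path dir = Files.createTempDirectory("cellexpr");
            Path file = dir.resolve("CellExpr.java");
            String source = "public class CellExpr {\n"
                    + "    public static double eval(double x) { return " + expression + "; }\n"
                    + "}\n";
            Files.write(file, source.getBytes(StandardCharsets.UTF_8));

            // Compile to byte code; a non-zero status means compilation failed.
            int status = compiler.run(null, null, null, "-d", dir.toString(), file.toString());
            if (status != 0) {
                throw new IllegalArgumentException("Expression did not compile: " + expression);
            }

            // Load the fresh class and call it; from here on the JIT treats
            // it like any other loaded code.
            try (URLClassLoader loader = new URLClassLoader(new URL[] {dir.toUri().toURL()})) {
                Class<?> cls = loader.loadClass("CellExpr");
                Method eval = cls.getMethod("eval", double.class);
                return (Double) eval.invoke(null, x);
            }
        } catch (IOException | ReflectiveOperationException e) {
            throw new IllegalStateException("Could not evaluate expression", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate("x * x + 1", 3.0));
    }
}
```

Each call uses its own temporary directory and class loader, so repeated evaluations do not collide on the generated class name.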


Let’s run the code with interpreted mode (-Xint) turned on and off, and then change the expression to prevent the optimization

double (c2) = {
    long start = System.nanoTime();
    double add = c2;
    for (int i = 0; i < 2_000_000_000; i++)
        add++;
    return (System.nanoTime() - start) / 1000000000.0;
}

Hint: replace “start” with “start + add - add” in the last line
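Outside the spreadsheet, the same experiment can be sketched as a plain Java class. The class name LoopTiming and its two methods are hypothetical names chosen for this example. The first method mirrors the expression above, where the loop result is never used; the second applies the hint so that the result feeds into the return value. Whether a given JVM actually removes the unused loop depends on the JIT compiler and the flags in use, so the timings are something to observe rather than a guaranteed outcome.

```java
public class LoopTiming {

    /** Times a loop whose result is never used; a JIT compiler may treat it as dead code. */
    public static double timeUnused(double seed, int iterations) {
        long start = System.nanoTime();
        double add = seed;
        for (int i = 0; i < iterations; i++) {
            add++;
        }
        return (System.nanoTime() - start) / 1_000_000_000.0;
    }

    /** Same loop, but the result is folded into the return value as the hint suggests. */
    public static double timeUsed(double seed, int iterations) {
        long start = System.nanoTime();
        double add = seed;
        for (int i = 0; i < iterations; i++) {
            add++;
        }
        // "start + add - add" makes the return value depend on the loop result.
        return (System.nanoTime() - (start + add - add)) / 1_000_000_000.0;
    }

    public static void main(String[] args) {
        int n = 2_000_000_000;
        System.out.println("result unused: " + timeUnused(1.0, n) + " s");
        System.out.println("result used:   " + timeUsed(1.0, n) + " s");
    }
}
```

Running it once normally and once with java -Xint LoopTiming allows the compiled and interpreted timings to be compared; the interpreted run can take considerably longer.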

Deployment of solutions into production

Deployment

• deployment depends on the production environment
• the user logic should be created independent of the eventual deployment path chosen

Live demo: Deploying MapReduce algorithms to Cloudera’s CDH or to EMR

Big Data analytics created and consumed by end-users

Spreadsheets are becoming more and more popular for various Big Data tasks

• driven by high demand and low supply of Big Data experts (deep analytic talents and data-savvy managers)

• using the Java based spreadsheet tool is also beneficial for other reasons

Live demo: How the spreadsheet uses both local cycles and Hadoop cloud resources

Long running operations, multithreading and garbage collection

The spreadsheet banks on Java to

• support multiple threading techniques
• reclaim memory
• distribute work between hardware resources/cores

Live demo: Long running operations; let’s write and execute a fork and join algorithm in the spreadsheet …
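For readers who have not seen the pattern, a fork and join algorithm in plain Java might look like the sketch below. It is not the code from the demo: the class name ForkJoinSum, the threshold value and the choice of summing an array are assumptions made for illustration. The task splits its range in half until the pieces are small enough to sum directly, and the pool spreads the pieces across the available cores.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ForkJoinSum extends RecursiveTask<Double> {

    // Below this many elements the range is summed sequentially (arbitrary value).
    private static final int THRESHOLD = 10_000;

    private final double[] data;
    private final int from;
    private final int to;

    ForkJoinSum(double[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Double compute() {
        if (to - from <= THRESHOLD) {
            double sum = 0.0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = (from + to) >>> 1;
        ForkJoinSum left = new ForkJoinSum(data, from, mid);
        ForkJoinSum right = new ForkJoinSum(data, mid, to);
        left.fork();                         // run the left half asynchronously
        double rightSum = right.compute();   // work on the right half in this thread
        return left.join() + rightSum;       // wait for the left half and combine
    }

    /** Sums the array using the common fork/join pool. */
    public static double sum(double[] data) {
        return ForkJoinPool.commonPool().invoke(new ForkJoinSum(data, 0, data.length));
    }

    public static void main(String[] args) {
        double[] ones = new double[1_000_000];
        java.util.Arrays.fill(ones, 1.0);
        System.out.println(sum(ones));
    }
}
```

Forking one half and computing the other in the current thread keeps the calling worker busy instead of leaving it idle while it waits on both subtasks.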

Simplifying coding for the data scientists

End-user coding

• encourage Java spreadsheet-like functions
• allow Java code expressions common to C-like languages (Java, C, C++, C#, …)
• include Scala, SQL, Hive and Impala expressions
• use wizards to generate the more complex expressions


Thank you for attending.

Q & A

Sign up for our upcoming Beta release: www.quantcell.com

Agust Egilsson (agust@quantcell.com) and Bjorn Jonsson
