R in practice: language, environment and a Telecom case
André Andrade Baceti
March 25, 2015
Start-up, about 1 year old
In-company maintenance
Traditional project structure (heritage from traditional IT projects)
Billing by forecast and accuracy
Creation of a basic structure for mining modeling opportunities in the company
Development of explanatory and prediction models
…
Call center demand
Credit score
Identification of optimality frontiers
Project delivery/billing methodologies
Development of in-company data science labs
Pre-defined scope projects
Forecast as a service
Company Structure
Board
Eugenio Caner
Electrical engineer (Saint Petersburg State University, Russia), PhD from UFRJ
Currently board president at Murabei
Worked as a researcher on the ATLAS experiment at CERN (Higgs boson)
Roberto Ono
Physicist (USP) with a specialization in astrophysics
Headed projects with multinationals in different economic sectors
Currently focused on modeling and dealing with big data
André Baceti
Biologist with a master's in microbiology (emphasis on biosensor development), currently finishing a second degree in mathematics applied to signal and system control
Worked on different data science projects
Currently focused on the development of PumpWood (this lecture)
Other members of Murabei
Statisticians
Web developer
IT professional
Summary
R in practice
Language: Syntax Overview, Speed, Memory Consumption, Learning Curve, In-Company Usage
Environment: Development Platforms, Community, Packages
Telecom case
Project overview
R Development Challenges
Proposed solution: PumpWood
PumpWood Overall Results
R Environment
R is a language and environment for statistical computing and graphics
Operating systems
Community
Stack Overflow: R 78,792; SAS 3,651; Stata 1,137; SPSS 550
O'Reilly Data Scientist Survey for 2014
Nice graphics
Integration
Database querying: Access, MySQL, Oracle, PostgreSQL, SQLite, Teradata, Hadoop, MongoDB, SQL Server, …
R Environment
In-database execution: Oracle, Exasol, "Tableau", …
Execution from other programs and languages: SAS, Stata, Python, SPSS, Perl, …
Other programming languages in R: Fortran, C++, Java, Julia, C, Python, …
Packages
6218 packages available (2015)
Hornik, 2012
R Language
R syntax is similar to that of other programming languages such as C and Java; statement blocks are enclosed in braces.
If Else
if (condition) { statement 1 } else if (cond.) { statement 2 }
For
for (i in 1:5) { statement }
But it has its own ways…
All primitive variables are vectors (arrays) by definition
No need to define the size of arrays in the declaration
Class programming is available too
Class definitions are tricky, not so easy to work with
There are some tutorials on the internet
R is optimized to deal with arrays
For and while loops are slow
Operations on entire arrays and the apply functions are much faster
Generic functions (this one is nice)
Ex.: summary(obj), predict(model)
summary(obj) behaves differently depending on the class of obj
R Language: Speed
R is slow
There are some paid alternative implementations of the language
You can always go back
Implement time-demanding steps in C or Fortran and wrap them in an R function
Low-quality code
Data is analyzed just once
No need for optimization or documentation
Memory consumption
R runs in RAM
It has problems managing memory
Limiting factor, mainly for GLMs
Solutions
Paid, closed source
Open source packages
ff: disk-stored datasets
ffbase: statistics for ff objects
biglm: GLM models on disk
Memory consumption… still a problem?
In-memory database era…
R Language: Learning Curve
From point and click to more difficult: SPSS, SAS, R
R has some point-and-click interfaces
Is R difficult?
When you don't have any programming skills… yeah!
Can you build a data science project without programming skills?
I don't think so!
In-Company Usage
Resistance to open source technologies
IT departments
OK… 6218 packages are too many to validate
FDA recommends basic packages for drug trial studies
Like to pay? Closed source
Telecom case: Project overview
Call center demand forecast
Predict total incoming calls per day, 60 days ahead
Two different areas
Area 1: 24 different skills
Area 2: 15 different skills
Project overview
Twice per month
Alignment with affected areas
Follow-up meetings
Model improvement
Client tested a proprietary modeling tool
"Modeling toolbox"
A large toolbox does not guarantee a correct repair
A data scientist is needed to pilot the system
Model used to verify impacts
Models were also used to verify the impacts of different actions
Client mailing
Promotions
Expiry dates
Telecom case: R Development Challenges
R
Easily create/modify models: Nope. There is no point-and-click interface to define and modify models; it has to be done in the script file.
Model versioning: Nope. Models are defined in script files. You can store old files or work with version control software (git, mercurial)… not so easy.
Model evaluation: Yes. A script can be developed to report model deviance and other parameters, but the file has to be opened to change constants such as file name, date, series, etc.
Model sharing and validation: Nope. With models in scripts, model sharing means sending script files and data from one person to another… under limited time this tends toward chaos.
Programming language: no out-of-the-box solutions
We need an interface for the users
Model sharing and validation
A way for everybody to see the same model and results: a web interface!
Telecom case: R Development Challenges
R has all we needed for statistical models and analysis in the project
We had to find a way to manage the drawbacks
So let's divide the tasks
R: statistical language, does the stats
Advantages
Platform independent: no recompiling, and easy to go mobile!
No code on users' machines: nothing to install
Central data service: everybody sees the same results
Easier to find developers: much easier to find a web developer than an R, C++, etc. developer
Telecom case: Proposed solution
Web interface!
…
Django: nice catchphrase, "The web framework for perfectionists with deadlines"
Python: N-dimensional array objects and linear algebra methods
Data plotting, optimization, statistical models, etc.
Possible to open R objects and code in Python
Database
PostgreSQL plays well with Django and R
Has a spatial extension
Open source
Also plays well with Django!
Telecom case: Proposed solution
Model / View / Template framework
Task division helps to keep the code minimal
Model: each model defines a unit of information (much like a class), and each entry of this unit (class object) is saved in the DB
View: defines the functions responsible for user interaction
Template: stores the website's HTML code. Templates can inherit from other pages, which helps keep the code from being repetitive
Auto-generated admin web page
Change, remove and add database entries
Specify functions associated with models
Ex.: DescriptionModel -> run model
Object-relational mapper
Automatic database management
Tables / fields / constraints / keys / etc.
Fields in the DB are like object attributes
Telecom case: Proposed solution
Solution overview: Job Division
Main user interface
Manage created models
Load data
Create predictions (RPy2)
Estimate models
Store all data regarding models, results, historical data and predictions
Scalability
Modular design
Each part of the system can be boosted by adding new nodes
Still have to stress-test the system and finish some implementations
Telecom case: Proposed solution
Communication between parts
JSON
DB connection
DB connection
Django -> R
Just confirmations
Model info
All saved in DB
Django login and Cross-Site Request Forgery (CSRF) protection
Login token
CSRF: stored in R as a local variable
Packages used in implementation
rjson: transforms R objects to JSON and vice versa
Does not remove the JSON vulnerability guard (")]}," at the beginning of the string)
httr: makes requests from R; manages cookies too
gsubfn: modifies strings, making it easy to build Django-style URLs in R: /descriptionmodel/%(id)s
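The two string-handling steps above (dropping the guard prefix before parsing JSON, and filling a Django-style URL template) can be sketched as follows. The deck does this in R with rjson and gsubfn; this is only an illustrative Python sketch. The function names and the sample payload are hypothetical, and the guard string is copied from the slide as an assumption about what the server sends.

```python
import json

# Assumption: the server prepends this guard to JSON bodies, as written on the slide.
JSON_GUARD = ")]},"


def parse_guarded_json(body, guard=JSON_GUARD):
    """Drop the guard prefix if it is present, then parse the JSON payload."""
    if body.startswith(guard):
        body = body[len(guard):]
    return json.loads(body)


def build_url(template, **params):
    """Fill a Django-style template such as '/descriptionmodel/%(id)s'."""
    return template % params


# Hypothetical response body and URL template, for illustration only.
payload = parse_guarded_json(')]},{"id": 42, "status": "finished"}')
url = build_url("/descriptionmodel/%(id)s", id=payload["id"])
print(url)
```

Named placeholders keep the URL patterns on the client side readable next to the Django routes they mirror.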
Packages used in implementation
RPostgreSQL: PostgreSQL database connection, queries, inserts, etc.
reshape2: easily pivots and unpivots tables; makes it easy to build the regression matrix
gsubfn: modifies strings, making it easy to create dynamic queries
SELECT * FROM table_1 WHERE id = %(id)s
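The named-placeholder query above can also be run with bound parameters instead of string substitution. The sketch below is an assumption-laden illustration, not the deck's R code: it uses Python's built-in sqlite3 with an in-memory database so that it runs without a server, the table contents are invented, and `fetch_by_id` is a hypothetical helper. sqlite3 writes the placeholder as `:id`; PostgreSQL drivers such as psycopg2 accept the `%(id)s` form shown on the slide.

```python
import sqlite3

# In-memory database with made-up rows, only to make the example runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_1 (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO table_1 (id, name) VALUES (?, ?)",
    [(1, "area_1"), (2, "area_2")],
)


def fetch_by_id(connection, row_id):
    """Run the slide's query with a bound named parameter."""
    cursor = connection.execute(
        "SELECT * FROM table_1 WHERE id = :id", {"id": row_id}
    )
    return cursor.fetchall()


rows = fetch_by_id(conn, 2)
print(rows)
```

Binding the value through the driver avoids quoting problems that plain string substitution can introduce when the value comes from user input.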
Packages used in implementation
Pandas: creates R-like data frames; pivots and unpivots tables
A little tricky to work with… Pivoting and grouping are faster than in R and SQL
Django REST framework: easily creates REST services. Really, this one is awesome! Define serializers for the models
South: automatically creates database migrations based on model modifications
Custom complex migrations can be built too
Telecom case: Proposed solution
Models are run according to a hierarchical queue
A model is only run once all its dependencies are finished
REST communication between R and Django
R can call Django to generate new models
Loops of interaction between R and Django
More complex model structures
Stepwise implementation
Queen model: coordinates model breeding and checks whether the best model has been achieved
Breeding model: creates new models based on the best model of the previous layer
Diagram legend: Queen, Breeding, Ordinary, Best one, Mold
The queen is associated with a mold
The mold stores the model parameters and which variables should be used in the stepwise procedure
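The queue rule on this slide (a model runs only once all its dependencies are finished) can be sketched as below. This is a minimal illustration, not PumpWood's implementation: `run_queue`, the model names and the dependency layout are hypothetical, chosen so that the queen waits for the ordinary models, which in turn wait for the breeding model.

```python
def run_queue(dependencies, run_model):
    """Run each model once every model it depends on has finished.

    dependencies: dict mapping a model name to the set of names it waits for.
    run_model: callable taking a model name (stands in for the estimation step).
    Returns the order in which the models were run.
    """
    finished = []
    waiting = {name: set(deps) for name, deps in dependencies.items()}
    while waiting:
        done = set(finished)
        # Models whose dependencies are all finished are ready to run.
        ready = sorted(name for name, deps in waiting.items() if deps <= done)
        if not ready:
            raise ValueError("circular or unsatisfied dependency: %s" % sorted(waiting))
        for name in ready:
            run_model(name)
            finished.append(name)
            del waiting[name]
    return finished


# Hypothetical hierarchy: breeding first, then the bred models, then the queen.
deps = {
    "breeding": set(),
    "ordinary_1": {"breeding"},
    "ordinary_2": {"breeding"},
    "queen": {"ordinary_1", "ordinary_2"},
}
order = run_queue(deps, run_model=lambda name: None)
print(order)
```

Raising an error when nothing is ready keeps a circular dependency from stalling the queue silently.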
The model queue hierarchy means the queen model has to wait for the breeding model to finish before it runs
Red arrows indicate dependencies
R calls Django (JSON), asking it to breed models according to the mold
Django places the new models under the queen and above the breeding model in the hierarchy
R calls Django to execute the queen method
The queen chooses the best model under it
Is this best model better than the one before?
There is no previous one
Next breeding
The breeding model compares the mold and the previous best model to see which variables still have to be tested
The queen is set to waiting again
Final Model
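The queen/breeding loop walked through in the slides above resembles forward stepwise selection, and can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: in the deck the steps are split between R and Django, while here `breed`, `queen`, `stepwise` and the toy scoring function are all hypothetical, and the empty model is used as the starting baseline.

```python
def breed(best_variables, mold_variables):
    """Breeding step: one candidate per mold variable not yet in the best model."""
    return [best_variables + [v] for v in mold_variables if v not in best_variables]


def queen(candidates, score):
    """Queen step: pick the candidate with the lowest score (lower is better)."""
    return min(candidates, key=score)


def stepwise(mold_variables, score):
    """Alternate breeding and queen steps until no candidate improves the score."""
    best, best_score = [], score([])
    while True:
        candidates = breed(best, mold_variables)
        if not candidates:
            return best  # every mold variable is already in the model
        challenger = queen(candidates, score)
        challenger_score = score(challenger)
        if challenger_score >= best_score:
            return best  # no improvement: the previous best is the final model
        best, best_score = challenger, challenger_score


# Toy score (an assumption): each variable lowers the error by a fixed gain
# and costs a penalty of 1.0, loosely mimicking an information criterion.
gains = {"trend": 5.0, "weekday": 3.0, "promo": 0.5}


def toy_score(variables):
    return 10.0 - sum(gains[v] for v in variables) + 1.0 * len(variables)


final = stepwise(["trend", "weekday", "promo"], toy_score)
print(final)
```

In a real run the score would come from the estimated models (for example a deviance-based criterion); the fixed gains only make the loop easy to follow.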
Telecom case: PumpWood
Birth of a new point-and-click (part of it at least) statistical system
Russian word for an ant colony
Why ants?
An ant tells little about the colony
You have to look at the BIG picture… Data Science, Big Data
Name of a pioneer tropical tree which has developed a symbiosis with ants
The tree houses and feeds the ants
The ants protect the tree from predators
Telecom case: PumpWood
Project PumpWood's overall results
System stability
Django: last months with no need to restart due to a crash
R: there are some hiccups in the implementation (multicollinearity of models), but OK, stable too
Hardware usage
All nodes on the same machine
Usually running with 3 Django processes and 3 R processes
PostgreSQL: 12 GB of disk usage
8 GB RAM
Hardware investment: less than R$5k
Telecom case: PumpWood
Project PumpWood's overall results
In one year
That is too many models!
A model is defined by its inputs and output
Each change in a model's inputs leads to a new model
40193 estimated models
35711 different models created; 2411 without the stepwise ones
6200 without the stepwise ones
Would senior analysts be more economical when using PumpWood?
(24 x 2 + 15) x 12 = 756; 2411 / 756 ≈ 3.19 (retries per update)
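The slide's arithmetic can be checked with a few lines. The figures are taken from the slide as written; the variable names are only illustrative labels, not the author's terms.

```python
# Figures copied from the slide; the names are illustrative assumptions.
updates_per_year = (24 * 2 + 15) * 12
models_without_stepwise = 2411

retries_per_update = models_without_stepwise / updates_per_year
print(updates_per_year, round(retries_per_update, 2))
```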
Telecom case: PumpWood
Current development
Migrating out of Django's admin
Use Django as a RESTful service that can receive CORS requests
Django admin is very useful, but limited
Easier to change the GUI platform
Development already under way
Advantages
Single-page app
More freedom for design
Graphics and other data visualization options
Working on a touch-friendly design
Less network traffic
R Trends
References
R project: www.r-project.org
Kurt Hornik, Austrian Journal of Statistics, 41(1), pp. 59–66 (2012)
Companies using R: http://www.revolutionanalytics.com/companies-using-r
Pumpwood photos: http://espacepourlavie.ca/en/biodome-flora/shield-leaf-pumpwood and https://treesandfish.wordpress.com/2011/06/25/trees-of-puerto-rico-part-1-cecsch-and-schmor/
Thank you
Contact: [email protected]
PumpWood: A data science tool
Big Data
Huge amount of data
Usually associated with user information
Profile
Service usage
Navigation path
Geospatial data
…
Challenges
Fast algorithms
Infrastructure
Parallel processing
Science
Science is a method
Reproducible experiments
Testable hypotheses
Diffusion of the results
PumpWood: A data science tool
Model evolution history
Files and reports can be attached to the deliveries
A model's objects hold the information necessary to rerun it
Create deliveries inside PumpWood (helps with the diffusion of information)
Study correlation between sales and price for food and beverage
Delivery basket
Description
Price and sales correlation showed a significant negative effect of price on sales. Despite that, high-value products have a reduced elasticity. To see more, check the reports attached to the basket
Notes
Delivery date: 2014-03-05 User: abaceti
Insights!!
PumpWood: A data science tool
Overview of data analysts' development
Check how the job is done by your best ones
Grant and revoke production status for different models
Models can also be used in frequent tasks
This helps to keep track of which ones are being used in production or are part of a development
Improve the rest of the team with the lessons learned!