PyData Paris 2015 - Track 4.2 Jean Maynier

Post on 19-Jul-2015

Python, SQLAlchemy and Scrapy at Kpler

Jean Maynier

CTO

Business: commodities markets

• Domain: trading / maritime transport

• Customers: physical traders / analysts

• Needs: follow cargo flows and their price impact

Commercial Database

“Meta” AIS (10 networks)

Ports Authorities

Gas Inventories

Methodology: 4 data layers

The first “Cargo-Tracking” system, bridging the gap between LNG Shipping and Trading.

Why Python?

• Simple, fast to prototype

• Data libraries (SQLAlchemy, pandas, scikit-learn)

• Data-oriented community

• Scrapy: best scraping framework

Step 1

Scrapy

• Simplifies crawling

• Reusable extractors (XPath, CSS selectors)

• PaaS: Scrapinghub

• Blacklists: Tor or Crawlera for IP rotation
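The "reusable extractor" idea above can be sketched without a full Scrapy project. Scrapy's selectors are backed by parsel/lxml; to keep this self-contained, the same XPath-style extraction is shown with the standard library's ElementTree. The page layout and field names are hypothetical, not one of Kpler's actual sources.

```python
# Sketch of a reusable XPath extractor, as a Scrapy spider callback
# might use it. Scrapy itself uses parsel/lxml selectors; here we use
# the XPath subset supported by the stdlib's ElementTree instead.
import xml.etree.ElementTree as ET

# Hypothetical table-shaped page listing port calls.
PORT_CALL_PAGE = """
<table>
  <tr><td class="vessel">Methane Rita</td><td class="port">Zeebrugge</td></tr>
  <tr><td class="vessel">Arctic Lady</td><td class="port">Montoir</td></tr>
</table>
"""

def extract_port_calls(document):
    """Pull (vessel, port) pairs out of a table-shaped page."""
    root = ET.fromstring(document)
    calls = []
    for row in root.findall(".//tr"):
        vessel = row.find("td[@class='vessel']").text
        port = row.find("td[@class='port']").text
        calls.append({"vessel": vessel, "port": port})
    return calls

print(extract_port_calls(PORT_CALL_PAGE))
```

The same extractor function can then be reused across every source that publishes the same table shape, which is the point of the bullet above.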

Step 2

Process raw data

• Clean and normalize data:

  • Positions

  • Port calls

  • Contracts

  • Prices

• Python pub/sub system

• Persist data: SQLAlchemy & PostgreSQL
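The persistence step above can be sketched with a minimal SQLAlchemy model. The talk pairs SQLAlchemy with PostgreSQL; an in-memory SQLite database is used here only to keep the example self-contained, and the table and column names are illustrative, not Kpler's actual schema.

```python
# Minimal sketch of "persist data" with the SQLAlchemy ORM.
# Assumes SQLAlchemy 1.4+; SQLite stands in for PostgreSQL.
from sqlalchemy import create_engine, Column, Integer, String, Float
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Position(Base):
    """One cleaned AIS position report for a vessel (hypothetical schema)."""
    __tablename__ = "positions"
    id = Column(Integer, primary_key=True)
    vessel = Column(String, nullable=False)
    lat = Column(Float)
    lon = Column(Float)

engine = create_engine("sqlite://")  # in-memory database for the sketch
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Position(vessel="Methane Rita", lat=51.33, lon=3.20))
    session.commit()

with Session(engine) as session:
    stored = session.query(Position).filter_by(vessel="Methane Rita").one()
    print(stored.lat, stored.lon)
```

Swapping the connection URL to a PostgreSQL DSN is the only change the sketch would need to match the stack described in the slide.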

Step 3

Enrich data

Done

• Vessel dock/undock events

• Predict next destination

• Find buyer/seller

• Basic routing
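The dock/undock item above can be sketched as a simple state machine over the position stream: a vessel counts as docked when its reported speed stays below a threshold. The threshold and the data shape are assumptions for illustration, not Kpler's actual rules.

```python
# Hedged sketch of deriving dock/undock events from positions.
DOCK_SPEED_KNOTS = 0.5  # below this we treat the vessel as stationary

def dock_events(positions):
    """Yield ('dock'|'undock', timestamp) transitions from
    (timestamp, speed_knots) samples sorted by time."""
    docked = False
    for ts, speed in positions:
        if not docked and speed < DOCK_SPEED_KNOTS:
            docked = True
            yield ("dock", ts)
        elif docked and speed >= DOCK_SPEED_KNOTS:
            docked = False
            yield ("undock", ts)

track = [(0, 12.0), (1, 3.1), (2, 0.2), (3, 0.1), (4, 6.5)]
print(list(dock_events(track)))  # dock at t=2, undock at t=4
```

A real implementation would also gate on proximity to a known berth, but the transition logic stays the same shape.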

Todo / WIP

• Estimate price

• Routing with weather

• Portfolio optimisation

• Fleet optimisation
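The "predict next destination" item in the Done list can be sketched as a first-order transition model over historical port calls: predict the port most often visited after the current one. The talk does not describe Kpler's actual model; this frequency approach and the port names are purely illustrative.

```python
# Toy next-destination predictor: count port-to-port transitions
# in historical voyages, then pick the most frequent successor.
from collections import Counter, defaultdict

def train_transitions(voyages):
    """voyages: list of port sequences, e.g. [['Ras Laffan', 'Zeebrugge']]."""
    counts = defaultdict(Counter)
    for ports in voyages:
        for src, dst in zip(ports, ports[1:]):
            counts[src][dst] += 1
    return counts

def predict_next(counts, current_port):
    """Most frequent next port seen after current_port, or None."""
    successors = counts.get(current_port)
    return successors.most_common(1)[0][0] if successors else None

history = [
    ["Ras Laffan", "Zeebrugge", "Ras Laffan"],
    ["Ras Laffan", "Zeebrugge"],
    ["Ras Laffan", "Montoir"],
]
model = train_transitions(history)
print(predict_next(model, "Ras Laffan"))  # Zeebrugge (2 of 3 voyages)
```

A production model would add features like draught, charterer and season, but the prediction target is the same.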

Step 4

Architecture

Websites sources / API sources
  → Scraping: Python / Scrapy
  → Cache: Elasticsearch
  → Data: ETL + models (Python)
  → Web: REST API (Scala)
  → Clients: web client / mobile client / Excel

Metrics trends

Present

• DBs < 100 GB

• 50 sources

• 500 vessels

• Positions every 3 min

Future

• 1-10 TB

• 100 sources

• 10k to 100k vessels

• Positions every 30 s

Performance problem!

Batch to data streaming

• Parallelization

• Granularity: item vs. source

• Handle failures, no data loss, monitoring

• Candidates: Akka Streams, Spark Streaming, Celery, Storm
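The item-vs-source granularity point above can be sketched with a plain thread pool: each position is an independent unit of work, so one bad item fails alone instead of aborting a whole source batch. `process_position` is a stand-in for the real per-item ETL step, not Kpler's actual code.

```python
# Sketch of item-level parallelization with the stdlib thread pool.
from concurrent.futures import ThreadPoolExecutor

def process_position(raw):
    """Hypothetical per-item step: normalize one raw AIS position."""
    vessel, lat, lon = raw
    return {"vessel": vessel.strip().upper(),
            "lat": round(lat, 4),
            "lon": round(lon, 4)}

raw_positions = [
    (" methane rita ", 51.33333, 3.20001),
    ("arctic lady", 47.30002, -2.15002),
]

# Item granularity: the pool schedules each position independently.
with ThreadPoolExecutor(max_workers=4) as pool:
    processed = list(pool.map(process_position, raw_positions))

print(processed)
```

The frameworks listed in the slide add what this sketch lacks: failure handling, delivery guarantees and monitoring across machines rather than threads.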

Step 5

Storm at Kpler: POC

Websites sources / API sources
  → Scraping: Python / Scrapy
  → ETL + models, restructured as stream topology steps:
      Positions, Dock/undock, Custom prices, Match prices,
      Next destination, Trades, Cargo volumes
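The POC above chains the enrichment steps as a stream topology. Storm itself is JVM-based (Python components plug in via its multi-lang protocol); this generator pipeline only mimics the spout-to-bolt shape with hypothetical stages, to show how the slide's steps compose.

```python
# Generator pipeline mimicking a Storm spout -> bolt -> bolt chain.
def position_spout():
    """Emit raw position tuples (vessel, speed_knots)."""
    yield ("Methane Rita", 0.2)
    yield ("Arctic Lady", 14.0)

def dock_bolt(stream):
    """First stage: tag each position as docked or underway."""
    for vessel, speed in stream:
        yield (vessel, "docked" if speed < 0.5 else "underway")

def trade_bolt(stream):
    """Downstream stage: only docked vessels can load/unload a cargo."""
    for vessel, state in stream:
        if state == "docked":
            yield {"vessel": vessel, "event": "possible cargo transfer"}

trades = list(trade_bolt(dock_bolt(position_spout())))
print(trades)
```

In Storm each stage would run as its own bolt with parallelism, acking and replay on failure; the composition of small per-item steps is the part this sketch preserves.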

We are recruiting, join us!

jobs@kpler.com