Big Data Analysis with Crate and Python
Matthias Wahl - developer @ crate.io
Email: [email protected]
Crate
shared-nothing, massively scalable datastore
standing on the shoulders of giants
Crate
get it at: https://crate.io/download
# bash -c "$(curl -L try.crate.io)"
Crate
automatic sharding and replication
(semi-) structured models
single table only
SQL query language
Crate
all common SQL types (and more)
powerful aggregations (‘GROUP BY’)
linear scalability - data and query execution is distributed
basic arithmetic (next release 0.39)
Aggregation Execution
SELECT station_name, max(temp), avg(temp), min(temp), count(distinct date)
FROM weather_de
WHERE temp != -999
GROUP BY station_name
ORDER BY station_name ASC;
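To make the semantics of this query concrete, here is a minimal sketch that runs the same statement against an in-memory SQLite table. SQLite is only a stand-in so the example runs without a Crate cluster; the sample rows, the reduced three-column schema, and the helper name run_demo are assumptions made for illustration and are not taken from the real weather_de data.

```python
import sqlite3

# Hypothetical sample rows: (station_name, date in ms since epoch, temp).
# A temp of -999 marks a missing reading, which the WHERE clause filters out.
SAMPLE = [
    ("Konstanz", 1303365600000, 12.5),
    ("Konstanz", 1303452000000, 14.5),
    ("Konstanz", 1303538400000, -999),
    ("Aachen", 1303365600000, 10.0),
    ("Aachen", 1303365600000, 11.0),
]

QUERY = """
SELECT station_name, max(temp), avg(temp), min(temp), count(DISTINCT date)
FROM weather_de
WHERE temp != -999
GROUP BY station_name
ORDER BY station_name ASC
"""


def run_demo():
    """Load the sample rows into SQLite and run the aggregation query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE weather_de (station_name TEXT, date INTEGER, temp REAL)"
    )
    conn.executemany("INSERT INTO weather_de VALUES (?, ?, ?)", SAMPLE)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in run_demo():
        print(row)
```

Each result row holds the station name, the maximum, average and minimum temperature, and the number of distinct dates with a valid reading.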
[Diagram, shown in four steps: one node labeled H, three nodes labeled M, three nodes labeled R]
Step 1: Request, collect
Step 2: collect, hash based distribution
Step 3: group results
Step 4: final reduce, Response
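The four steps on these slides (collect, hash based distribution, group results, final reduce) can be sketched in plain Python. This is a single-process illustration of the general technique under stated assumptions, not Crate's actual implementation: the function names, the use of crc32 as the hash, and the partial-state layout (min, max, sum, count) are all hypothetical.

```python
from collections import defaultdict
from zlib import crc32


def collect(shard_rows):
    """Step 1, per data node: build partial (min, max, sum, count) per station."""
    partial = {}
    for station, temp in shard_rows:
        if temp == -999:  # skip missing readings, like the WHERE clause
            continue
        lo, hi, total, n = partial.get(station, (temp, temp, 0.0, 0))
        partial[station] = (min(lo, temp), max(hi, temp), total + temp, n + 1)
    return partial


def distribute(partial, num_reducers):
    """Step 2: hash based distribution, one reducer per group key.

    crc32 is used because Python's built-in str hash differs between runs.
    """
    buckets = [dict() for _ in range(num_reducers)]
    for station, state in partial.items():
        buckets[crc32(station.encode("utf-8")) % num_reducers][station] = state
    return buckets


def merge(a, b):
    """Combine two partial states for the same station."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2], a[3] + b[3])


def group_results(incoming):
    """Step 3, per reducer: merge the partial states received from all nodes."""
    merged = {}
    for partial in incoming:
        for station, state in partial.items():
            merged[station] = merge(merged[station], state) if station in merged else state
    return merged


def final_reduce(reducer_outputs):
    """Step 4, on the handling node: finish the averages and sort the rows."""
    rows = []
    for merged in reducer_outputs:
        for station, (lo, hi, total, n) in merged.items():
            rows.append((station, hi, total / n, lo))
    return sorted(rows)


def execute(shards, num_reducers=3):
    """Run all four steps over a list of per-node row lists."""
    partials = [collect(rows) for rows in shards]
    routed = [distribute(p, num_reducers) for p in partials]
    reduced = [group_results(r[i] for r in routed) for i in range(num_reducers)]
    return final_reduce(reduced)
```

Calling execute with one list of (station_name, temp) tuples per node returns rows of (station_name, max, avg, min). The count(distinct date) column is left out to keep the sketch short; it would need a set of dates as part of the partial state. Because every station name hashes to exactly one reducer, no two reducers ever hold state for the same group, so the final step only has to concatenate and sort.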
Using the Python client
>>> from crate.client.http import Client
>>> client = Client(["127.0.0.1:4200"])
>>> response = client.sql("select * from weather_de limit 1")
>>> print(response)
{u'duration': 659,
 u'rowcount': 1,
 u'rows': [[1303365600000, 82.0, None, None, None, 0, u'954', 54.1667, 7.45,
            u'UFS Deutsche Bucht', 60.0, 10.9, 100, 5.2]],
 u'cols': [u'date', ...]}
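The response is a plain dictionary with parallel 'cols' and 'rows' entries, so a small helper can turn it into one dictionary per row. The sketch below is an illustration only: the helper names rows_as_dicts and millis_to_datetime are hypothetical, the example response is a shortened, made-up two-column result (the full column list is not shown on the slide), and it assumes the date value is milliseconds since the Unix epoch.

```python
from datetime import datetime, timezone


def rows_as_dicts(response):
    """Pair each row with the column names from the same response."""
    cols = response["cols"]
    return [dict(zip(cols, row)) for row in response["rows"]]


def millis_to_datetime(ms):
    """Convert milliseconds since the epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Hypothetical, shortened response in the same shape as the one printed above.
example = {
    "cols": ["date", "station_name"],
    "rows": [[1303365600000, "UFS Deutsche Bucht"]],
    "rowcount": 1,
    "duration": 659,
}

if __name__ == "__main__":
    for record in rows_as_dicts(example):
        print(record["station_name"], millis_to_datetime(record["date"]).isoformat())
```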
Using SQLAlchemy
>>> import sqlalchemy as sa
>>> from sqlalchemy.ext.declarative import declarative_base
>>> from sqlalchemy.orm import sessionmaker
>>> engine = sa.create_engine("crate://localhost:4200")
>>> Base = declarative_base()
Using SQLAlchemy
>>> from sqlalchemy import Column, DateTime, Float, Integer, String
>>> class Weather(Base):
...
...     __tablename__ = 'weather_de'
...
...     station_id = Column('station_id', String, primary_key=True)
...     station_name = Column('station_name', String)
...     station_lat = Column('station_lat', Float)
...     station_long = Column('station_lon', Float)
...     station_height = Column('station_height', Integer)
...     date = Column('date', DateTime, primary_key=True)
...     temp = Column('temp', Float)
...     humility = Column(Float)
...     sunshine_hours = Column(Float)
...     wind_speed = Column(Float)
...     wind_direction = Column(Integer)
...     rainfall_fallen = Column(Integer)
...     rainfall_height = Column(Float)
...     rainfall_form = Column(Integer)
Using SQLAlchemy
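>>> from sqlalchemy import func
>>> # session setup is not shown on the slides; assumed here
>>> DBSession = sessionmaker(bind=engine)()
>>> res = (DBSession.query(
...            Weather.station_name,
...            func.avg(Weather.temp))
...        .group_by(Weather.station_name)
...        .order_by(Weather.station_name)
...        .limit(10)
...        .all())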
SELECT station_name, avg(temp)
FROM weather_de
GROUP BY station_name
ORDER BY station_name
LIMIT 10;
Using SQLAlchemy
# Average sunshine hours
from sqlalchemy.sql import func
DBSession.query(func.avg(Weather.sunshine_hours)).scalar()

# Average sunshine hours in Konstanz
DBSession.query(func.avg(Weather.sunshine_hours)).filter(
    Weather.station_name == 'Konstanz').scalar()
Feature Requests
I’m no data scientist
Please tell us what you would like to see in crate.
CRATE
Thank you
web: https://crate.io/
github: https://github.com/crate
twitter: @cratedata
IRC: #crate
stackoverflow tag: cratedata