Handle TBs with $1500 per month
by hung-lin
Transcript of Handle TBs with $1500 per month
Handle TBs with $1500/M (or less)
By @hunglin
Because We Are All Curious
And We Have (some useful) Tools Now
What If Data Can Be Easy...
Story at VideoBlocks
Context: Storyteller
● Data Handyman at VideoBlocks
● Organizer of DC Scala meetup
● I LOVE DATA
● Also love Scala and Spark
Context: VideoBlocks
● A media company
○ Creative Content Everyone Can Afford
● 3 websites, 100K paid customers
● Hosted on AWS
● 16 engineers (80 employees total)
● 9M requests per day, peaking at 300 reqs/sec
● Deploys about 5 times a week
We Want to Know Everything About Our (Potential) Customers
Our Data Issues
● Data everywhere (data silos)
● Data integration (mismatched formats, e.g. "" vs 0 vs null)
● Data latency: sub-second, sub-minute, sub-hour, and sub-day requirements are very different.
Our Solutions
● Use S3 as the data lake: load MySQL, Mongo, click stream, AdWords, Facebook ads, ... onto S3. It is the source of truth.
● Use Redshift as the SQL interface to the S3 data.
● Use SQL to process data.
● Run nightly jobs to create materialized views (aggregated data) for query speed.
● S3/Redshift is the engine behind all data tools: Spark, Python, R, dashboards, the alert system.
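As a rough illustration of the load step in this architecture, here is a minimal sketch of building the Redshift COPY statement that pulls gzipped JSON off S3. Every name in it (table, bucket, prefix, IAM role) is hypothetical, not from the talk:

```python
# Sketch only: render a Redshift COPY statement that loads one day of
# click-stream JSON from S3 into a table. All identifiers are made up.

def build_copy_statement(table: str, bucket: str, prefix: str, iam_role: str) -> str:
    """Return a Redshift COPY statement for gzipped JSON under an S3 prefix."""
    return (
        f"copy {table}\n"
        f"from 's3://{bucket}/{prefix}'\n"
        f"iam_role '{iam_role}'\n"
        "json 'auto' gzip\n"
        "timeformat 'epochmillisecs';"
    )

stmt = build_copy_statement(
    table="event.page_view_raw",
    bucket="example-data-lake",
    prefix="clickstream/2016/01/01/",
    iam_role="arn:aws:iam::123456789012:role/redshift-copy",
)
print(stmt)
```

The nightly job would run statements like this for each source, which is what makes S3 the single source of truth and Redshift just a query layer on top of it.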
Click streams to redshift, How?
[Diagram: an EC2 instance running webhead, fluentd, and loggly containers, plus kinesis-firehose; an Event-Log-Loader pulls the events for loading.]
Wait! The data format doesn't match
create_temp_table.sql
create table {{tempEventTable}}_dup (
    "name" varchar(40),
    "uuid" varchar(40),
    "requestid" varchar(40),
    "country" varchar(40),
    "subdomain" varchar(40),
    "vid" varchar(70),
    "mid" int,
    "payload" varchar(65535),
    "date" timestamp,
    primary key ("uuid")
)
distkey("uuid")
sortkey("uuid");
load.sql
copy {{tempEventTable}}_dup
from '{{dataUrl}}'
credentials '{{credentials}}'
json 'auto' gzip
timeformat 'epochmillisecs'
maxerror 10000;

select distinct * into {{tempEventTable}} from {{tempEventTable}}_dup;
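The explicit dedup step matters because Redshift declares primary keys but does not enforce them, and an at-least-once delivery pipeline can deliver the same event twice. A tiny demonstration of the same SELECT DISTINCT idea, using sqlite3 as a stand-in for Redshift (table and column names are made up):

```python
# Sketch of the dedup step, with sqlite3 standing in for Redshift,
# which does not enforce primary keys, so duplicates must be removed.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table temp_event_dup (uuid text, name text)")
conn.executemany(
    "insert into temp_event_dup values (?, ?)",
    [("a1", "page_view"), ("a1", "page_view"), ("b2", "download")],  # "a1" arrives twice
)
# Same idea as: select distinct * into {{tempEventTable}} from {{tempEventTable}}_dup;
conn.execute("create table temp_event as select distinct * from temp_event_dup")
rows = conn.execute("select uuid, name from temp_event order by uuid").fetchall()
print(rows)  # the duplicate "a1" row collapses to one
```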
process.sql
insert into event.{{ siteName }}_page_view
    (uuid, request_id, vid, mid, date, uri, referrer_uri, campaign, ...)
select
    uuid,
    requestid,
    vid,
    mid,
    date,
    etl_text(json_extract_path_text(payload, 'uri'), 1000),
    etl_text(json_extract_path_text(payload, 'referrerUri'), 200),
    etl_text(json_extract_path_text(payload, 'utm', 'campaign'), 80),
    ...
from {{ tempEventTable }}
where name = '{{ eventName }}'
  and length(vid) = 64
  and uuid not in (
      select uuid from event.{{ siteName }}_page_view
      where date >= (select min(date) from {{ tempEventTable }})
  );
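The {{ ... }} placeholders in these .sql files imply a templating step before execution. The talk doesn't show the actual mechanism, so here is a minimal stand-in renderer to make the idea concrete:

```python
# Hypothetical renderer for the {{ placeholder }} syntax in the .sql
# files above; the real loader's templating mechanism isn't shown.
import re

def render(sql_template: str, params: dict) -> str:
    """Replace {{ name }} placeholders with values from params."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: params[m.group(1)],
        sql_template,
    )

sql = render(
    "copy {{tempEventTable}}_dup from '{{dataUrl}}';",
    {"tempEventTable": "tmp_page_view", "dataUrl": "s3://bucket/events.gz"},
)
print(sql)  # copy tmp_page_view_dup from 's3://bucket/events.gz';
```

Templating the table names is what lets the same create/load/process scripts run for every site and event type.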
SQL?! Is it 1990? Aren't we in NoSql era already?!
NoSql means Not yet SQL
Benefits of this approach
● Scalable by default.
● Understandable / editable by the product, analytics, and management teams.
● Scalable cost: $115/month per node. VideoBlocks started with a 2-node cluster ($230/month) and has grown to a 12-node cluster ($1,380/month).
● On-demand processing power: teams can bring up a cluster from a snapshot to run a data test, and kill it after getting the result.
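The per-node pricing above implies linear cost scaling, which is easy to sanity-check against the figures quoted in the talk:

```python
# Quick check of the quoted Redshift pricing: cost scales linearly
# with node count at $115/month per node.
COST_PER_NODE = 115  # USD per month, as quoted in the talk

def monthly_cost(nodes: int) -> int:
    return nodes * COST_PER_NODE

print(monthly_cost(2))   # 230  -> the starting 2-node cluster
print(monthly_cost(12))  # 1380 -> the current 12-node cluster
```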
Things to improve
● SQL code is ugly, hard to unit test and debug
● Performance issues
○ mismatched sortkeys or distkeys
○ inefficient queries
● Reads and writes on the same cluster (resource management on the Redshift cluster)
○ Write at night, read in the morning (if one day of data latency is OK)
○ Use multiple Redshift clusters (more expensive)
In Conclusion
● Redshift is cost efficient.
● SQL is "still" the most common data language.
● SQL is also the most supported data language.
● Scalable by default (with caveats, like all other systems).
● On-demand data + processing power using snapshots - multiple stages of deployment.
● Good enough UI to get a high-level idea of the cluster.
● Can only use SQL (compared to a Spark cluster).
● SQL is not the ideal programming language.
● Monitoring, performance tuning, and debugging need some trial and learning (better than other systems IMO).
Questions?