Building Audience Analytics Platform
Jothi Padmanabhan, InMobi
6-Sep-2014
Motivation
➔ An audience analytics platform is extremely critical
➔ Segmentation
  ➔ Rule based
  ➔ Inferred via data-science modeling
  ➔ Third party
➔ Targeting
  ➔ Maximize CTR and CVR
Challenges
➔ Scale
  ➔ Billions of ad requests/day, peak 25K rps, 800M users
➔ Multiple input sources and types
  ➔ Fact data, dimension data
➔ Multiple consumers
  ➔ Reporting, segmentation and targeting, inference
Challenges
➔ Data curation
  ➔ Define and measure data quality
  ➔ Track sources and possibly assign confidence
➔ Governance and licensing restrictions
➔ Consistent querying interface
Challenges
● Storage capacity and retention
● Optimal usage of grid resources
Activity Data
➔ Records actual activity
➔ Time-series data
➔ Immutable, actual facts
➔ Comprises dimensions and measures
➔ Measures
  ➔ Ad requests, impressions, clicks, conversions, ...
Dimension Data
➔ Domain-specific metadata (user, location, app, etc.)
➔ Each domain has its own schema
  ➔ User (uid, age, gender, interests, etc.)
  ➔ Location (lat/long – zip/city/country, etc.)
  ➔ Device (handset model, OS, version, etc.)
➔ Mutable (but possibly slowly changing)
ETL
➔ Need to ingest data from different sources
➔ Transform the data into a format optimized for storage and easy querying
➔ Query interface for different consumers
ETL - Ingestion
➔ Naive -- custom ingestion flow per source
  ➔ Quick to develop
  ➔ Can be highly optimized
  ➔ Not scalable
➔ Generic ingestion framework
  ➔ Streamlined and scalable
  ➔ Might need more processing
ETL - Storage
➔ Naive -- storage schema closely coupled with ingestion schema
  ➔ Multiple representations of the same data: age could arrive as DOB or as years
➔ A consistent representation is a must
  ➔ Requires transformation from the input schema to the storage schema
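The age example above can be sketched as a small normalization adapter; the function name and formats here are illustrative assumptions, not InMobi's actual code:

```python
from datetime import date

def normalize_age(value, today=None):
    """Map either a 'YYYY-MM-DD' DOB string or an integer number of years
    to the single canonical storage representation: age in whole years."""
    today = today or date.today()
    if isinstance(value, int):
        return value
    dob = date.fromisoformat(value)
    years = today.year - dob.year
    # Subtract one if the birthday has not yet occurred this year
    if (today.month, today.day) < (dob.month, dob.day):
        years -= 1
    return years

print(normalize_age(34))                                    # already in years
print(normalize_age("1990-03-15", today=date(2014, 9, 6)))  # DOB -> years
```

Every downstream consumer then queries one representation, regardless of what the source sent.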
ETL - Storage
➔ Location – lat/long, zip, city, country
➔ Need to store at the lowest possible granularity (lat/long)
➔ GPS readings come with an accuracy that needs to be recorded
➔ Queries are almost always nearness queries, not exact matches
ETL - Storage
➔ Quadtile representation
➔ Use leading bits for the tile id, remaining bits for storing accuracy
➔ Transform all location information to such ids
➔ Nearness with lat/long distance is a cross-product join
➔ With tiles, we can translate this into equi-joins (of course with some loss of accuracy)
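A minimal sketch of the quadtile idea, using the standard web-map quadkey scheme (the zoom level and bit layout here are assumptions, not the talk's exact encoding). Nearby points land in the same tile, so a nearness test becomes an equality test on tile id:

```python
import math

def to_tile(lat, lon, zoom):
    """Map a lat/long to integer tile coordinates at the given zoom level."""
    lat_rad = math.radians(lat)
    n = 1 << zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

def tile_id(lat, lon, zoom):
    """Interleave the x/y bits into a single quadkey-style integer id."""
    x, y = to_tile(lat, lon, zoom)
    tid = 0
    for i in reversed(range(zoom)):
        tid = (tid << 2) | (((y >> i) & 1) << 1) | ((x >> i) & 1)
    return tid

# Two nearby points share a tile id at a coarse enough zoom, so a
# cross-product distance join becomes an equi-join on tile_id.
a = tile_id(12.9716, 77.5946, 14)
b = tile_id(12.9720, 77.5950, 14)
print(a == b)
```

The loss of accuracy mentioned in the slide shows up here as the tile boundary: two points just across a boundary get different ids even when they are close.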
ETL - Querying
➔ Naive -- users are aware of the multiple feeds and schemas and query them directly
  ➔ Extremely difficult as schemas change and new feeds get added
  ➔ Closely coupled with the internal representation; not good
ETL - Querying
➔ Have a consistent, published schema
  ➔ Enables exploration and discovery
➔ Well-defined querying interfaces that abstract out the internal representation
➔ Provide primitives (for example, UDFs for nearness calculations) for easier querying
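A nearness primitive of the kind the last bullet describes could look like the following; the function names and the 1 km default are illustrative assumptions, not the platform's actual UDFs:

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points, in km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def is_near(lat1, lon1, lat2, lon2, radius_km=1.0):
    """UDF-style predicate: are the two points within radius_km of each other?"""
    return haversine_km(lat1, lon1, lat2, lon2) <= radius_km

print(is_near(12.9716, 77.5946, 12.9720, 77.5950))
```

Exposing such a predicate through the query layer lets users write nearness filters without knowing whether the store keeps raw lat/long or quadtile ids underneath.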
Ingestion Server
● Curation to filter out dubious records
● Adapters for transformation
● REST-based ingestion server
  – Supports multiple compression types
  – Supports multiple serialization formats
  – Handles rate-limiting/throttling
  – Bulk/streaming inputs
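The rate-limiting/throttling bullet above is commonly implemented as a per-client token bucket; this is a minimal sketch under that assumption, not InMobi's implementation:

```python
class TokenBucket:
    """Per-client throttle: allow short bursts, cap the sustained rate."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now):
        """Refill based on elapsed time, then try to spend one token."""
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2.0, capacity=5)   # 2 requests/sec, burst of 5
accepted = [bucket.allow(now=0.0) for _ in range(7)]
print(accepted.count(True))  # the burst of 5 is accepted, the next 2 throttled
```

At 25K rps peak, the server would keep one such bucket per client (or API key) and reject or queue requests when `allow` returns False.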
Storage and Querying
● Possibly a different schema than the ingestion schema
● Columnar storage format (Parquet/ORC)
● Predominantly Hive-friendly
● No direct access to internal storage; access only through an HQL-like query layer
● Export option for other use cases (online store)
Tech Stack
● Pig for most pipeline tasks
● Grill for the analytics interface
● Hive as the primary execution engine
● Tez as the runtime environment
● ORC/Parquet for the storage format
Questions