Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together
April 10-12, Chicago, IL
Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together
Dianne Cantwell and Denny Lee
Agenda
• Yahoo! Business Case for Hadoop and BI
• Big Data, Fast Queries
• Big Data / BI Themes
  • Get the Hardware Balance Right
  • Partitioning, Partitioning, Partitioning
  • Keep it Simple
  • It is the order of things
Yahoo! TAO Business Challenge
• Yahoo! manages a powerful, scalable advertising exchange that includes publishers and advertisers.
• Advertisers want to get the best bang for their buck by reaching their targeted audiences effectively and efficiently.
• Yahoo! needs visibility into how consumers are responding to ads along many dimensions (web sites, creatives, time of day, and user segments such as gender, age, and location) to make the exchange work as efficiently and effectively as possible.
Yahoo! TAO Technical Requirements
• Visitors to Yahoo! branded sites: 680,000,000
• Ad Impressions: 3,500,000,000 (per day)
• Rows Loaded: 464,000,000,000 (per qtr)
• Refresh Frequency: Hourly
• Average Query Time: <10 seconds
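To put the load figures in perspective, the quarterly row count can be converted into a sustained rate. The sketch below is only back-of-the-envelope arithmetic; the 91-day quarter is an assumption, not a number from the slides, and real loads are unlikely to be spread evenly across the day.

```python
# Figure quoted on the slide.
ROWS_PER_QUARTER = 464_000_000_000

# Assumption (not from the slides): a quarter is roughly 91 days.
DAYS_PER_QUARTER = 91


def sustained_rates(rows_per_quarter, days=DAYS_PER_QUARTER):
    """Convert a quarterly row count into average daily/hourly/per-second rates."""
    rows_per_day = rows_per_quarter / days
    rows_per_hour = rows_per_day / 24
    rows_per_second = rows_per_hour / 3600
    return {
        "rows_per_day": rows_per_day,
        "rows_per_hour": rows_per_hour,
        "rows_per_second": rows_per_second,
    }


if __name__ == "__main__":
    for name, value in sustained_rates(ROWS_PER_QUARTER).items():
        print(f"{name}: {value:,.0f}")
```

Under that assumption, an hourly refresh has to land on the order of two hundred million rows per cycle, which is why load throughput gets so much attention later in the deck.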
Yahoo! TAO Platform Architecture: How did we load so much so quickly?
• Data Aggregation & ETL: Hadoop (2PB cluster)
• Data Archive & Staging: Oracle 11G RAC (File 1, File 2, … File N; Partition 1, Partition 2, … Partition N)
• BI Server: SQL Server Analysis Services 2008 R2 (Partition 1, Partition 2, … Partition N; 24TB cube/qtr)
• Data volume: 1.2TB/day, 135GB/day compressed
Yahoo! TAO Platform Architecture: Queries at the “speed of thought”
• BI Query Servers: SQL Server Analysis Services 2008 R2 (24TB cube/qtr)
• 464B rows of event-level data/qtr
• Dimensions: 42
• Attributes: 296
• Measures: 278
• Adhoc Query/Visualization: Tableau Desktop 7
• Optimization Application: Custom J2EE App
• Avg Query Time: 2 secs
• Avg Query Time: 5 secs
Yahoo! TAO Return on Investment
• For campaigns optimized using TAO, advertisers spent more with Yahoo! than before
• For campaigns optimized using TAO, more eCPMs (revenue)!
• Yahoo! TAO exposed customer segment performance to campaign managers and advertisers for the first time! No longer “flying audience blind”
Yahoo! TAO Future Direction
• Increase Segments by 3x
  • Increase data size and cartesian
• No longer doing distinct count
  • Built frequency reports and sampling to deliver this due to the inherent complexity!
• Current Challenge
  • Hadoop to SSAS cube (more later)
  • External access to cubes
  • More disk due to need for more IO
Big Data Analytics Challenges
Get the data out!
Extracting the data
• File Generation
  • Hadoop jobs create many files that are exported / dumped to disk in tabular format
• File Staging
  • Files are propped to a staging folder for relational DB access
• Oracle External Tables
  • Generate external tables that point to the staged files
  • No need to import the data
  • Processing is slow
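As an illustration of the external-table step, the sketch below builds an Oracle `CREATE TABLE … ORGANIZATION EXTERNAL` statement for one staged file. The table, column, directory, and file names are all hypothetical, and the comma delimiter is an assumption (the slides only say "tabular format"). The Oracle directory object is assumed to exist already and to point at the staging folder.

```python
def external_table_ddl(table, columns, directory, filename, delimiter=","):
    """Build Oracle DDL for an external table over one staged flat file.

    columns is a list of (name, oracle_type) pairs. Nothing is imported:
    the table simply points at the file in the given directory object.
    """
    column_sql = ",\n    ".join(f"{name} {sqltype}" for name, sqltype in columns)
    return (
        f"CREATE TABLE {table} (\n"
        f"    {column_sql}\n"
        f")\n"
        f"ORGANIZATION EXTERNAL (\n"
        f"    TYPE ORACLE_LOADER\n"
        f"    DEFAULT DIRECTORY {directory}\n"
        f"    ACCESS PARAMETERS (\n"
        f"        RECORDS DELIMITED BY NEWLINE\n"
        f"        FIELDS TERMINATED BY '{delimiter}'\n"
        f"    )\n"
        f"    LOCATION ('{filename}')\n"
        f")\n"
        f"REJECT LIMIT UNLIMITED"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(external_table_ddl(
        table="stg_ad_events",
        columns=[("event_hour", "VARCHAR2(16)"), ("segment_id", "NUMBER"),
                 ("impressions", "NUMBER")],
        directory="tao_staging_dir",
        filename="part-00000.csv",
    ))
```

Because the data stays in the flat file, every query re-parses it, which is consistent with the "processing is slow" point above.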
AS on Oracle Case
• Oracle OLEDB: 10K rows/sec
• SSIS Connector: 20K rows/sec
• 100K rows/sec
• Oracle → Analysis Services
• Oracle → SQL → Analysis Services
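The practical impact of these throughput figures is easier to see as load time for a fixed batch. The sketch below uses the three rates from the slide; the 100-million-row batch size is a hypothetical value chosen only for illustration.

```python
# Throughput figures quoted on the slide, in rows per second.
RATES_ROWS_PER_SEC = {
    "Oracle OLEDB": 10_000,
    "SSIS Connector": 20_000,
    "100K rows/sec figure": 100_000,
}

# Hypothetical batch size, not taken from the slides.
BATCH_ROWS = 100_000_000


def load_time_minutes(rows, rows_per_sec):
    """Minutes needed to move `rows` at a constant `rows_per_sec`."""
    return rows / rows_per_sec / 60


if __name__ == "__main__":
    for label, rate in RATES_ROWS_PER_SEC.items():
        minutes = load_time_minutes(BATCH_ROWS, rate)
        print(f"{label}: {minutes:.1f} minutes")
```

At 10K rows/sec a batch of that size would take well over two hours, while at 100K rows/sec it fits comfortably inside an hourly refresh window.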
Passthrough Query to Linked Server
http://msdn.microsoft.com/en-us/library/jj710329.aspx
Partitioning, Partitioning, Partitioning
Partitions
• Data is streamed in to Oracle to files
• To get max processing, 30 threads are fired because all T (temp) partitions are processed concurrently
• Super fast data loads
• Problem is that it requires constant merging of partitions

Files are streamed in as they become available:
• Files: 10/10/10 T360772, 10/10/10 T360773, … 10/10/10 T361645
• Oracle 10g partitions: 10/10/10 T360772, 10/10/10 T360773, … 10/10/10 T361645
• SSAS partitions: 10/10/10 T360772, 10/10/10 T360773, … 10/10/10 T361645, merged into 10/10/10
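The streaming model above (many temp partitions processed concurrently, then merged into the day's partition) can be sketched as follows. This is an illustration only: `process_temp_partition` and `merge_partitions` are hypothetical stand-ins for the real SSAS processing and merge operations, and the row counts are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_THREADS = 30  # thread count quoted on the slide


def process_temp_partition(name):
    """Hypothetical stand-in for processing one T (temp) partition."""
    return {"partition": name, "rows": 1000}  # placeholder row count


def merge_partitions(day, processed):
    """Hypothetical stand-in for merging temp partitions into the day partition."""
    return {
        "partition": day,
        "rows": sum(p["rows"] for p in processed),
        "merged_from": len(processed),
    }


def load_day(day, temp_names):
    # All temp partitions are processed concurrently, up to MAX_THREADS at once.
    with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
        processed = list(pool.map(process_temp_partition, temp_names))
    # The merge step has to run again every time new temp partitions arrive.
    return merge_partitions(day, processed)


if __name__ == "__main__":
    names = [f"10/10/10 T{n}" for n in range(360772, 360777)]
    print(load_day("10/10/10", names))
```

The concurrency gives fast loads, but the merge is a recurring cost, which is the problem the slide calls out.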
Partitions – Directly Merging
• New model allows for set hourly partitions
• No more streaming data but with hourly partitions, cannot have as many threads for fast data loads, unless…
• Process multiple cubes or measure groups in parallel

Hourly partitions:
• Oracle 10g partitions: 10/10/10 00:00, 10/10/10 01:00, … 10/10/10 23:00
• SSAS partitions (Segments): 10/10/10 00:00, 10/10/10 01:00, … 10/10/10 23:00
• SSAS partitions (Activities): 10/10/10 00:00, 10/10/10 01:00, … 10/10/10 23:00
• SSAS partitions (Uniques): 10/10/10 00:00, 10/10/10 01:00, … 10/10/10 23:00
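A minimal sketch of the hourly model: generate the 24 hourly partition names for a day, then process the measure groups in parallel to recover some concurrency. The measure group names come from the slide; `process_measure_group` is a hypothetical placeholder for the real processing call.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta

MEASURE_GROUPS = ["Segments", "Activities", "Uniques"]  # names from the slide


def hourly_partitions(day):
    """Return the 24 hourly partition names for one day, e.g. '10/10/10 00:00'."""
    return [(day + timedelta(hours=h)).strftime("%m/%d/%y %H:%M") for h in range(24)]


def process_measure_group(group, partitions):
    """Hypothetical stand-in: process each hourly partition of one measure group."""
    processed = 0
    for _partition in partitions:
        processed += 1  # a real implementation would issue a process command here
    return group, processed


def process_day(day):
    partitions = hourly_partitions(day)
    # One worker per measure group: parallelism comes from the groups,
    # not from splitting the hour into many temp partitions.
    with ThreadPoolExecutor(max_workers=len(MEASURE_GROUPS)) as pool:
        futures = [pool.submit(process_measure_group, g, partitions)
                   for g in MEASURE_GROUPS]
        return dict(f.result() for f in futures)


if __name__ == "__main__":
    print(process_day(datetime(2010, 10, 10)))
```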
It is the order of things
“I am a Jem'Hadar. He is a Vorta. It is the order of things.”
“Do you really want to give up your life for the 'order of things'?”
“It is not my life to give up, Captain – and it never was.”
Rocks and Shoals, Deep Space Nine
Written by Ronald D. Moore
Segments and the importance of sort order

Data File                  Sorted    Not Sorted   % Diff
fact.data             195,708,592   344,502,968   43.19%
agg.rigid.data        106,825,677   106,825,677    0.00%
dim1.dim2.fact.map     17,332,729    32,989,946   47.46%
dim1.dim3.fact.map     16,923,276    32,222,813   47.48%
dim1.dim4.fact.map      6,079,396    12,286,978   50.52%
dim5.dim6.fact.map      2,630,888     6,057,334   56.57%
dim1.dim7.fact.map      1,809,725     3,904,004   53.64%
dim8.dim9.fact.map      1,592,886     3,793,452   58.01%
dim1.dim10.fact.map     1,419,255     3,108,248   54.34%
dim8.dim11.fact.map     1,301,221     3,042,638   57.23%
dim1.dim12.fact.map     2,949,432     2,949,432    0.00%
dim1.dim13.fact.map     2,934,836     2,934,836    0.00%
dimA.dimA.fact.map      1,101,552     2,716,289   59.45%
dim8.dimB.fact.map        961,332     2,451,956   60.79%
dim1.dimC.fact.map      1,027,305     2,323,906   55.79%
dim8.dim8.fact.map      1,592,886     2,308,232   30.99%
dimA.dimD.fact.map        851,095     2,170,962   60.80%
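The % Diff column is consistent with the size reduction relative to the unsorted file, (Not Sorted − Sorted) / Not Sorted. The sketch below reproduces that calculation and, as a rough analogy, shows the same effect with a general-purpose compressor: sorted low-cardinality keys form long runs that compress far better. The use of `zlib` and the synthetic key data are assumptions for illustration; this is not how SSAS stores its files.

```python
import random
import zlib


def pct_diff(sorted_size, not_sorted_size):
    """Size reduction from sorting, as a percentage of the unsorted size."""
    return (not_sorted_size - sorted_size) / not_sorted_size * 100


def compressed_sizes(values):
    """Return (sorted_size, unsorted_size) in bytes after zlib compression.

    values must be integers in the range 0..255 so they fit one byte each.
    """
    unsorted_bytes = bytes(values)
    sorted_bytes = bytes(sorted(values))
    return len(zlib.compress(sorted_bytes)), len(zlib.compress(unsorted_bytes))


if __name__ == "__main__":
    # First row of the table above (fact.data).
    print(f"fact.data: {pct_diff(195_708_592, 344_502_968):.2f}%")

    # Synthetic low-cardinality keys, standing in for dimension keys.
    random.seed(0)
    keys = [random.randrange(100) for _ in range(100_000)]
    sorted_size, unsorted_size = compressed_sizes(keys)
    print(f"synthetic keys: {pct_diff(sorted_size, unsorted_size):.2f}% smaller when sorted")
```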
Across the Eighth Dimension!
How do you associate dimensions with Star Trek Into Darkness?
Back to cube dimensions
• Running ProcessUpdate
  • Takes a long time to run because all of the fact partitions are re-indexed!
• Minimize likelihood by building SCD-2 dimensions
  • Composite Key based on lowest level unique values to represent row
  • Sometimes identity can be just as effective though hashing requires mapping or lookup tables
• Create SK to allow for SCD-2 dimensions
  • Key is that we keep the memory space of the SK small
  • Composite(Composite) or Hash(Composite) is good for dimensions loaded from fact BUT do not expect Type-2 for fact-based dimensions
  • Important to call out restatement based on current data (high cost associated with keeping versioned history of dimension tables)
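One way to read the Hash(Composite) point is sketched below: the composite key is built from the lowest-level unique values, hashed, and then mapped through a lookup table to a small integer surrogate key (SK). The class name, the MD5 choice, the `|` separator, and the sample attribute values are all hypothetical; the slides do not specify an implementation.

```python
import hashlib


class SurrogateKeyMap:
    """Lookup table from a hashed composite key to a small integer SK."""

    def __init__(self):
        self._sk_by_hash = {}

    def key_for(self, *composite):
        # Hash the composite of lowest-level values that identify the row.
        # Assumes the values themselves never contain the '|' separator.
        joined = "|".join(str(part) for part in composite)
        digest = hashlib.md5(joined.encode("utf-8")).hexdigest()
        # The hash is wide, so map it to a compact sequential integer;
        # this is the mapping/lookup table the hashing approach requires.
        if digest not in self._sk_by_hash:
            self._sk_by_hash[digest] = len(self._sk_by_hash) + 1
        return self._sk_by_hash[digest]


if __name__ == "__main__":
    sk_map = SurrogateKeyMap()
    print(sk_map.key_for("US", "F", "25-34"))  # new member
    print(sk_map.key_for("US", "M", "18-24"))  # another new member
    print(sk_map.key_for("US", "F", "25-34"))  # same member, same SK
```

Keeping the SK a small sequential integer rather than the raw hash is what keeps its memory footprint down.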
Let’s aggregate it up
Thank you!