“Big” numbers for GP today
• 70K/day – Query Rate
• 6.5 PB – Dataset Size
• +100 GB/s – Analysis Rate
• +3 GB/s – Net Loading Rate
• 100,000/s – Transaction Rate
• 56 TB/kW, 1.6 GB/s/kW – Power Rate
• 100s – Number of Data/Compute Nodes
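As a rough cross-check of these figures, the implied power draw can be derived from the stated rates (a back-of-envelope sketch using only the numbers on this slide, assuming decimal units):

```python
# Back-of-envelope check of the headline power figures.
# All constants come from the bullets above; units: TB, GB/s, kW.

DATASET_TB = 6.5 * 1000       # 6.5 PB expressed in TB
STORAGE_TB_PER_KW = 56        # 56 TB per kW
ANALYSIS_GBPS = 100           # +100 GB/s analysis rate
ANALYSIS_GBPS_PER_KW = 1.6    # 1.6 GB/s per kW

# Power to hold the dataset and power to sustain the scan rate.
storage_power_kw = DATASET_TB / STORAGE_TB_PER_KW
analysis_power_kw = ANALYSIS_GBPS / ANALYSIS_GBPS_PER_KW

print(round(storage_power_kw, 1), analysis_power_kw)  # 116.1 62.5
```

So the quoted rates put the cluster in the low-hundreds-of-kilowatts range, consistent with the "100s of nodes" bullet.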
05/04/23 2
Things I’ve Heard
• Tiered computing
  – Organizational / political / geographic boundaries require it
• Metadata computing for HEP
  – “10 TB sounds small but it’s not easy”
• Processing for Radio Astronomy, HEP
  – Data-intensive computing
  – Requires an efficient pipeline from raw to consumables
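The "raw to consumables" pipeline above can be sketched as a chain of streaming stages (an illustrative Python sketch, not Greenplum's actual engine; the stage names and the calibration step are hypothetical):

```python
# Minimal streaming pipeline: raw records -> parsed -> calibrated -> "consumables".
# Each stage is a generator, so rows flow through without materializing
# intermediate results — the property an efficient raw-to-consumable
# pipeline needs at scale.

def parse(raw_lines):
    """Split comma-separated raw lines into (sensor, value) tuples."""
    for line in raw_lines:
        sensor, value = line.split(",")
        yield sensor, float(value)

def calibrate(records, offset=0.5):
    """Apply a (hypothetical) calibration offset to every reading."""
    for sensor, value in records:
        yield sensor, value + offset

def consumables(records, threshold=1.0):
    """Keep only readings above a signal threshold."""
    for sensor, value in records:
        if value >= threshold:
            yield sensor, value

raw = ["a,0.2", "b,0.9", "c,2.0"]
result = list(consumables(calibrate(parse(raw))))
print(result)  # [('b', 1.4), ('c', 2.5)]
```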
Thoughts
• A lot of plumbing! Moving data around, pipeline processing
  – The core engine should do this so the plumbing isn’t done over and over
• Need for specialized access methods and storage classes
• “Computing in data” is key to success
GP Basic Features
• Access Methods
  – Compression, Column Store, Heap Store, External Tables, Indexes (GiST, GIN, R-tree, Bitmap, B-Tree, …)
  – Network ingest / export directly into the parallel pipeline
  – Logical partitioning by Range, List
• Parallel Programming Languages
  – SQL 2003 with Analytics
  – MapReduce in Perl, Python, C, SQL, …
  – PL/R, PL/Python, PL/Perl, PL/C, PL/pgSQL, SQL, …
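The MapReduce support listed above can be illustrated with a plain-Python word-count sketch (illustrative only; Greenplum's actual MapReduce interface is declarative and runs the MAP and REDUCE functions inside the parallel engine):

```python
from collections import defaultdict

# Word count — the canonical MapReduce example — in plain Python.

def map_phase(docs):
    """Emit (word, 1) pairs: the MAP function."""
    for doc in docs:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    """Sum counts per key: the REDUCE function."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data", "big fast data"]
counts = reduce_phase(map_phase(docs))
print(counts)  # {'big': 2, 'data': 2, 'fast': 1}
```

In the parallel setting, the map phase runs on each data node and only the (word, count) pairs move across the network, which is what makes "computing in data" efficient.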
From Enterprise Data Clouds
• Elastic / adaptive infrastructure for data warehousing and analytics
  – IT Operations deploy pools of low-cost commodity infrastructure
    • Physical servers, virtual infrastructure, or on-ramp to public cloud
  – DBAs and Analysts provision sandboxes and warehouses in minutes
    • Assemble the data they need (common, private, etc.) for agile analytics
[Diagram: IT Operations manage the infrastructure pool; DBAs and Analysts provision warehouses (Consumer Division, Packaged Goods, Finance) from allocated and free node capacity.]
Use Case: Big Telco – Data Mart Consolidation
Goals:
• Reduce maintenance and support costs from the proliferation of data mart platforms
• Reduce risks and exposure due to data in shadow IT systems
• Break down silo walls – provide a unified way to find and access all data

Approach:
• Embrace data – encourage ‘physical consolidation’ in advance of data model unification
• Provide a ‘self-serve’ model to bring shadow IT into the light
• Allow unified data access, and pragmatic ‘logical’ data model unification incrementally
[Diagram: data sources consolidated onto a US-West cluster of 100 nodes.]
Use Case: Big Ad Network – Project Sandboxes
Goals:
• Remove IT barriers to analyst productivity and value creation
• Dramatically reduce IT resource constraints and delays – i.e., realize ideas sooner
• Combine centralized ‘EDW’ data with freshly discovered feeds and other useful sources

Approach:
• Self-serve creation of project warehouses in minutes – with elastic expansion as needed
• Load new data feeds without requiring formal modeling
• Bring together any data within the EDC – even if globally distributed – and analyze it
[Diagram: an analyst self-serves a new warehouse on a US-East cluster of 100 nodes, combining a private data feed with EDC data via a self-serve dashboard.]
GP is Software – Develop Now
• Download at:
  – gpn.greenplum.com
  – Get the VMware image, or run it on OS X, Linux, or Solaris
Think Big. Think Fast.