Compressing Tabular Data via Pairwise Dependencies
Amir Ingber, Yahoo! Research
TCE Conference, June 22, 2017
Joint work with Dmitri Pavlichin, Tsachy Weissman (Stanford)
Huge datasets: everywhere
- Internet
- Science
- Media
- …
At Yahoo:
- More than 100k servers in ~36 clusters
- More than 800PB of storage
- Lots of data, always want to store more
Compressing big data: does it matter?
It’s expensive:
- Cost of storing 1 PB: around $300k/year (e.g., on AWS)
It’s big:
- Example: storing an event log, ~1B events/day × 6 months
- Stored for analytics / machine learning
Lossless compression: dictionary methods
Typical compression: gzip (DEFLATE)
- Based on LZ77 + Huffman
- Popular, fast
- Recent variants: zstd (FB, 2015), Brotli (Google, 2015), …
- Good at: detecting temporal dependencies, e.g. text
Main idea: find repetitions in a sliding window
the brown fox jumped over the brownish jumping bear
→ the brown fox jumped over [the brown]ish [jump]ing bear  (repeated substrings replaced by back-references)
Tabular data
Typical dataset: a table
- Each row has several fields, with complex dependencies
  - Temporal dependencies? Cross-field dependencies?
- Example:

UserID    Age  Location  Device    Time  DocID
4324234   25   90210     iPhone 7  9pm   33221
1223231   49   94087     iPad pro  10am  66543
…         …    …         …         …     …
Entropy coding 101
▪ Given: a stream of i.i.d. symbols of a R.V. $X$
▪ Encode each symbol as a (prefix-free) bit string of variable length
  › More frequent symbols → shorter codewords
  › Theorem: avg. code length $\geq H(X) = \sum_x p(x) \log \frac{1}{p(x)}$
▪ Huffman code: optimal. Rate $\leq H(X) + 1$ bit
▪ Better: arithmetic coding
  › Approaches entropy
  › Requires: the distribution $p(x)$
▪ Black box
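To make the theorem concrete, here is a minimal Python sketch (not from the talk): it builds a Huffman code from empirical counts and checks that the average codeword length lands between $H(X)$ and $H(X)+1$. The `entropy` and `huffman_lengths` helpers and the `abracadabra` toy data are illustrative assumptions.

```python
import heapq
import math
from collections import Counter

def entropy(counts):
    """Empirical entropy H(X) in bits from a symbol -> count map."""
    total = sum(counts.values())
    return sum(-c / total * math.log2(c / total) for c in counts.values())

def huffman_lengths(counts):
    """Codeword length per symbol for a Huffman code over `counts`."""
    # Heap entries: (count, tie-break id, {symbol: depth-so-far}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        merged = {s: l + 1 for s, l in {**d1, **d2}.items()}  # one level deeper
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]

data = "abracadabra" * 100
counts = Counter(data)
lengths = huffman_lengths(counts)
avg_len = sum(counts[s] * lengths[s] for s in counts) / len(data)
# The theorem guarantees H(X) <= avg_len < H(X) + 1.
print(f"H(X) = {entropy(counts):.3f} bits, Huffman avg = {avg_len:.3f} bits")
```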
Assumptions:
▪ Records in the table are i.i.d. samples of $[X_1, X_2, \ldots, X_n]$
▪ Only dependence is between fields
▪ Example: independence: $P_{X_1 X_2 \cdots X_n}(x_1, x_2, \ldots, x_n) = \prod_i P_{X_i}(x_i)$
▪ Expected compression rate: $\sum_{i=1}^n H(X_i)$ per record
▪ Fine print: for each RV, need to save
  › The distribution and/or codebook (Huffman / arithmetic coder)
  › A dictionary (to translate back to the original values)
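As an illustration of the independence model, a small sketch that estimates the expected rate $\sum_i H(X_i)$ from empirical per-field counts; the two-row toy table echoes the example slide and is purely hypothetical.

```python
import math
from collections import Counter

def field_entropy(column):
    """Empirical entropy (bits) of one field of the table."""
    counts, n = Counter(column), len(column)
    return sum(-c / n * math.log2(c / n) for c in counts.values())

# Hypothetical toy table with the fields from the example slide.
rows = [
    (4324234, 25, 90210, "iPhone 7", "9pm", 33221),
    (1223231, 49, 94087, "iPad pro", "10am", 66543),
    # ... more records ...
]
columns = list(zip(*rows))
rate = sum(field_entropy(col) for col in columns)
print(f"independence model: ~{rate:.2f} bits per record")
```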
Fancier models: Bayesian networks
▪ Bayes net
  › DAG with $n$ nodes
  › Nodes are the RVs; edges encode the (conditional) independence structure
▪ Example (4 nodes $X_1, X_2, X_3, X_4$):
  $P(x_1, x_2, x_3, x_4) = P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_1)\, P(x_4 \mid x_2, x_3)$
▪ Dense graph → more general
  › Compression rate: $H(X_1) + H(X_2 \mid X_1) + H(X_3 \mid X_1) + H(X_4 \mid X_2, X_3)$
▪ Usage for compression: compress according to the graph edges
  › Metadata: larger codebooks / distributions (conditional!)
▪ Not a new idea [e.g. Davies & Moore, KDD’99]
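One way to evaluate that rate from data is with plug-in (empirical) entropies, as in this sketch; the `dag_rate` helper and its column arguments are mine, not from the talk.

```python
import math
from collections import Counter

def H(samples):
    """Empirical entropy (bits) of the values/tuples in `samples`."""
    counts, n = Counter(samples), len(samples)
    return sum(-c / n * math.log2(c / n) for c in counts.values())

def H_cond(xs, parents):
    """Plug-in conditional entropy H(X | parents) = H(X, parents) - H(parents)."""
    return H(list(zip(xs, *parents))) - H(list(zip(*parents)))

def dag_rate(x1, x2, x3, x4):
    """Bits/record of the example DAG:
    H(X1) + H(X2|X1) + H(X3|X1) + H(X4|X2,X3)."""
    return (H(x1)
            + H_cond(x2, [x1])
            + H_cond(x3, [x1])
            + H_cond(x4, [x2, x3]))
```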
How to choose a Bayes net for compression?
▪ Another assumption: each node can only have a single parent
▪ DAG ⇒ tree; simpler compression
  › Each field is conditioned on only a single RV
  › Example: $P(x_1, x_2, x_3, x_4) = P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_1)\, P(x_4 \mid x_2)$
  › Compression rate:
    $H(X_{\mathrm{root}}) + \sum_{\mathrm{edges}\,(i,j)} H(X_i \mid X_j) = \sum_{i=1}^n H(X_i) - \sum_{\mathrm{edges}\,(i,j)} I(X_i; X_j)$
▪ Best tree?
Searching for the best tree
▪ Rate: $H(X_{\mathrm{root}}) + \sum_{\mathrm{edges}\,(i,j)} H(X_i \mid X_j) = \sum_{i=1}^n H(X_i) - \sum_{\mathrm{edges}\,(i,j)} I(X_i; X_j)$
▪ Algorithm (a sketch appears below):
  › Calculate $I(X_i; X_j)$ for all $1 \leq i, j \leq n$ — $O(n^2)$ pairs
  › Set $w_{ij} = -I(X_i; X_j)$
  › Find the minimum spanning tree!
  › Efficient algorithms exist [Fredman & Tarjan, 1987]
  › Also: minimizes the KL divergence w.r.t. the true distribution
▪ Known as a Chow-Liu tree [Chow & Liu, 1968]
  › Extensions exist [e.g. Williamson, 2000]
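A self-contained sketch of the procedure, assuming plug-in (empirical) mutual information and a simple Kruskal MST; the names `mutual_information` and `chow_liu_edges` are mine, not from the talk.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical I(X;Y) in bits from two aligned columns."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # I = sum over pairs of p(x,y) * log( p(x,y) / (p(x) p(y)) )
    return sum(c / n * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def chow_liu_edges(columns):
    """Max-MI spanning tree = MST under weights w_ij = -I(Xi;Xj) (Kruskal)."""
    n = len(columns)
    edges = sorted(
        (-mutual_information(columns[i], columns[j]), i, j)
        for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))          # union-find over the n fields
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    tree = []
    for w, i, j in edges:            # ascending weight = descending MI
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j, -w))  # keep the edge and its MI
    return tree                      # n-1 edges: the Chow-Liu tree
```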
Example: MST with Mutual Information Weights
[Figure: the example table (UserID, Age, Location, Device, Time, DocID) and the resulting Chow-Liu tree over these fields]
Chow-Liu compression in real life
▪ Compressing $X_i$ given $X_j$: for each possible $x_j$, store $P_{X_i \mid X_j}(\cdot \mid x_j)$
▪ The dataset is not infinite – metadata takes space!
▪ Example:
  › 1B records, two variables with alphabet sizes 10k and 100k
  › → conditional distribution with 1B values (comparable to the dataset itself)
  › → then maybe choosing these two is not the best idea…
[Figure: compressed output layout: entropy-coded data + metadata]
Revised Chow-Liu tree
▪ Take the model size into account. Actual rate:
  $\sum_{i=1}^n H(X_i) - \sum_{\mathrm{edges}\,(i,j)} I(X_i; X_j) + \frac{1}{\#\mathrm{rows}} \sum_{\mathrm{edges}\,(i,j)} \mathrm{Size}\!\left(P_{X_i \mid X_j}\right)$
▪ → Revised weights for the Chow-Liu tree:
  $w_{ij} = -I(X_i; X_j) + \frac{1}{\#\mathrm{rows}} \mathrm{Size}\!\left(P_{X_i \mid X_j}\right)$
▪ Negative gain? → might opt to drop dependencies → forest
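Continuing the sketch above, the weight adjustment might look as follows. The metadata size model (nonzero joint entries × a fixed bits-per-entry cost) is an assumption, since the talk leaves $\mathrm{Size}(\cdot)$ unspecified; `mutual_information` is the helper defined earlier.

```python
from collections import Counter

BITS_PER_ENTRY = 32  # hypothetical per-entry storage cost of the table

def adjusted_weight(xs, ys):
    """w_ij = -I(Xi;Xj) + Size(P_{Xi|Xj}) / #rows, with Size(.)
    approximated by (# nonzero joint entries) * BITS_PER_ENTRY."""
    n_rows = len(xs)
    nonzero = len(Counter(zip(xs, ys)))   # support of the joint table
    size_bits = nonzero * BITS_PER_ENTRY
    return -mutual_information(xs, ys) + size_bits / n_rows

# In the Kruskal sketch above, additionally skip edges whose adjusted
# weight is >= 0: they cost more in metadata than they save in entropy,
# so dropping them leaves a forest rather than a spanning tree.
```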
Example: MST with Mutual Information Weights – revised
[Figure: two trees over UserID, Age, Device, Location, DocID, Time — one built with the plain weights $w_{ij} = -I(X_i; X_j)$, one with the adjusted weights $w_{ij} = -I(X_i; X_j) + \frac{1}{\#\mathrm{rows}} \mathrm{Size}(P_{X_i \mid X_j})$ — each with its entropy-coded data and metadata shares]
Storing the metadata
How to store the distribution $P(X \mid Y)$?
- Naïve: save the entire matrix
- Lossless compression: gzip / exploit sparsity (see the sketch below)
- Lossy compression!
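For the sparsity option, one concrete (hypothetical) choice is to store only the nonzero entries of the conditional table, e.g. with SciPy; the indices and probabilities below are placeholder values.

```python
import scipy.sparse as sp

# Hypothetical conditional table P(X|Y): rows indexed by y, columns by x.
rows = [0, 0, 1, 2]           # y indices of observed pairs
cols = [5, 9, 5, 7]           # x indices
vals = [0.7, 0.3, 1.0, 1.0]   # P(x | y) for each observed pair
P = sp.csr_matrix((vals, (rows, cols)), shape=(3, 10))

sp.save_npz("cond_dist.npz", P)  # stores only nonzeros, compressed on disk
```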
Improvements: lossy model compression
▪ Compressing $X$ given $Y$ (the data compression itself is still lossless)
▪ True distribution: $P_{XY}$
▪ A lossy representation results in a distorted distribution $Q_{XY}$
▪ Code rate: $H(X \mid Y) + D\!\left(P_{X|Y} \,\|\, Q_{X|Y} \mid P_Y\right) + \frac{1}{\#\mathrm{rows}} \mathrm{Size}\!\left(Q_{X|Y}\right)$
▪ Want to minimize both model storage size and divergence!
  › Related to MDL
  › Can be used to modify edge weights
Proposed approach:
▪ Add a virtual variable $Z$ with a small alphabet, s.t. $X - Z - Y$ form a Markov chain:
  $P_{XY}(x, y) \approx \sum_z Q_{Y|Z}(y \mid z)\, Q_{X|Z}(x \mid z)\, Q_Z(z)$
▪ Storage size decreases from $|X| \cdot |Y|$ to $(|X| + |Y|) \cdot |Z|$
▪ $|Z|$ controls the tradeoff between the two objectives
▪ Finding $\{Q_{Y|Z}(y \mid z),\, Q_{X|Z}(x \mid z),\, Q_Z(z)\}$ (a sketch appears below):
  › Iterate through the three terms, minimizing the KL divergence; repeat until convergence
  › Not optimal! The optimization is hard
  › Similar in spirit to [Lee & Seung, NIPS 2001]
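The alternating KL minimization is the same family of updates as multiplicative-update NMF, so a rough stand-in can be sketched with scikit-learn. The random $P$, the choice $|Z| = 8$, and the renormalization step are illustrative assumptions, not the talk’s implementation.

```python
import numpy as np
from sklearn.decomposition import NMF

# Placeholder joint distribution P_XY as a |X| x |Y| nonnegative matrix.
rng = np.random.default_rng(0)
P = rng.random((500, 300))
P /= P.sum()

k = 8  # |Z|: small alphabet of the virtual variable, controls the tradeoff
nmf = NMF(n_components=k, beta_loss="kullback-leibler",
          solver="mu", init="random", max_iter=500, random_state=0)
W = nmf.fit_transform(P)   # |X| x |Z| factor
Hm = nmf.components_       # |Z| x |Y| factor, so P ~ W @ Hm

# Renormalize the factorization into the three distributions:
# P(x, y) ~ sum_z Q_Z(z) Q_{X|Z}(x|z) Q_{Y|Z}(y|z)
col = W.sum(axis=0)                    # mass per z in W
row = Hm.sum(axis=1)                   # mass per z in Hm
Q_Z = col * row / (col * row).sum()    # distribution over z
Q_X_given_Z = W / col                  # each column sums to 1
Q_Y_given_Z = Hm / row[:, None]        # each row sums to 1
```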
Example: Criteo dataset
▪ A Kaggle competition for click prediction, by Criteo
▪ Dataset: 45M records
▪ Mutual information and the resulting Chow-Liu tree:
[Figure: pairwise mutual-information matrix and the Chow-Liu tree for the Criteo dataset]
Example: Criteo dataset
▪ Variables 3 and 8 have large alphabets, 5,500 and 14k (vs 16M records) → can’t store the conditional distribution
▪ Results of NNMF:
[Figure: low-rank NNMF approximation of the joint distribution of variables 3 and 8]
Experiments
▪ Datasets: machine learning, US census, etc.
  › #features: 10–68
  › #lines: 60K – 45M
▪ Current version:
  › MST with adjusted weights
  › Sparse encoding of metadata + lossless compression
[Figure: speed vs. compression efficiency]
Summary
▪ Dataset compression via probabilistic assumptions
  › Bayes nets, Chow-Liu trees
  › Metadata encoding + weight modification
▪ Lossless compression via lossy model compression
  › Add a new RV with a Markov restriction
  › Balance metadata size vs. model inaccuracy
▪ Take-home message:
  › Choose the right metric
  › Revisit old ideas