
Model Compression (2019-08-30)

Joseph E. Gonzalez, Co-director of the RISE Lab

jegonzal@cs.berkeley.edu

What is the Problem Being Solved?

Ø Large neural networks are costly to deploy
Ø Gigaflops of computation, hundreds of MB of storage

Ø Why are they costly?
Ø Added computation requirements adversely affect throughput/latency/energy
Ø Added memory requirements adversely affect
Ø download/storage of model parameters (OTA)
Ø throughput and latency through caching
Ø Energy! (5 pJ for an SRAM cache read, 640 pJ for DRAM vs. 0.9 pJ for a FLOP)

Approaches to “Compressing” Models

Ø Architectural Compression
Ø Layer Design → Typically using factorization techniques to reduce storage and computation
Ø Pruning → Eliminating weights, layers, or channels to reduce storage and computation from large pre-trained models

Ø Weight Compression
Ø Low Bit Precision Arithmetic → Weights and activations are stored and computed using low bit precision
Ø Quantized Weight Encoding → Weights are quantized and stored using dictionary encodings.

ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices (2017)

Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, Jian Sun

Megvii Inc. (Face++)

Related work (at the time)

Ø SqueezeNet (2016) – Aggressively leverages 1x1 (point-wise) convolutions to reduce inputs to 3x3 convolutions.
Ø 57.5% accuracy (comparable to AlexNet)
Ø 1.2M parameters → compressed down to 0.47MB

Ø MobileNetV1 (2017) – Aggressively leverages depth-wise separable convolutions to achieve
Ø 70.6% accuracy on ImageNet
Ø 569M Mult-Adds
Ø 4.2M parameters

Background

Regular Convolution

[Figure: a 3x3 kernel applied over Height × Width × (Input Channels), producing (Output Channels) feature maps]

Computation (for 3x3 kernel)
Computational Complexity: W · H · C_out · C_in · K² (width, height, channels out, channels in, filter size)

Combines information across space and across channels.
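
As a rough sanity check on the complexity above, here is a minimal sketch (assuming PyTorch; the helper name conv_macs and the tensor sizes are purely illustrative) that pairs the analytic count W · H · C_out · C_in · K² with an actual nn.Conv2d:

```python
import torch
import torch.nn as nn

def conv_macs(w, h, c_in, c_out, k):
    # Multiply-accumulates for a stride-1, 'same'-padded K x K convolution:
    # every output pixel (w * h) of every output channel (c_out)
    # sums over c_in * k * k input values.
    return w * h * c_out * c_in * k * k

w, h, c_in, c_out, k = 56, 56, 64, 128, 3
conv = nn.Conv2d(c_in, c_out, kernel_size=k, padding=k // 2, bias=False)
y = conv(torch.randn(1, c_in, h, w))
print(y.shape)                          # torch.Size([1, 128, 56, 56])
print(conv_macs(w, h, c_in, c_out, k))  # 231211008 multiply-accumulates
```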

1x1 Convolution (Point Convolution)

[Figure: a 1x1 kernel applied over Height × Width × (Input Channels), producing (Output Channels) feature maps]

Computation
Computational Complexity: W · H · C_out · C_in (the K² factor becomes 1)

Combines information across channels only.
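
A 1x1 convolution is just a per-pixel linear map across channels; a minimal PyTorch sketch (sizes are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

c_in, c_out, h, w = 64, 128, 56, 56
pointwise = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)  # 1x1 "point" convolution
y = pointwise(torch.randn(1, c_in, h, w))
print(y.shape)  # torch.Size([1, 128, 56, 56])
# MACs: w * h * c_out * c_in -- no spatial support, so no K^2 factor
```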

Depthwise Convolution

[Figure: one 3x3 kernel per input channel applied over Height × Width × (Input Channels); Output Channels = Input Channels]

Computation (for 3x3 kernel)
Computational Complexity: W · H · C_in · K² (one filter per channel, so no C_out · C_in product)

Combines information across space only.
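
A minimal PyTorch sketch of a depthwise convolution, which is expressed by setting groups equal to the channel count (sizes illustrative):

```python
import torch
import torch.nn as nn

c, h, w, k = 64, 56, 56, 3
# groups=c gives one K x K filter per channel: spatial mixing only, no channel mixing.
depthwise = nn.Conv2d(c, c, kernel_size=k, padding=k // 2, groups=c, bias=False)
y = depthwise(torch.randn(1, c, h, w))
print(y.shape)  # torch.Size([1, 64, 56, 56])
# MACs: w * h * c * k * k -- cheaper than the regular conv by a factor of c_out
```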

MobileNet Layer Architecture

[Figure: a 3x3 depthwise convolution (spatial aggregation) followed by a 1x1 pointwise convolution (channel aggregation)]

Computational Cost: W · H · C_in · K² (depthwise) + W · H · C_in · C_out (pointwise)
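
Putting the two pieces together, here is a minimal sketch of a depthwise-separable block in the MobileNet style with an illustrative cost comparison; it omits the BatchNorm/ReLU layers the real MobileNet interleaves, and the class name DepthwiseSeparable and all sizes are just for illustration:

```python
import torch
import torch.nn as nn

class DepthwiseSeparable(nn.Module):
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        # 3x3 depthwise: spatial aggregation (one filter per input channel)
        self.depthwise = nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False)
        # 1x1 pointwise: channel aggregation
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

w, h, c_in, c_out, k = 56, 56, 64, 128, 3
block = DepthwiseSeparable(c_in, c_out, k)
print(block(torch.randn(1, c_in, h, w)).shape)            # torch.Size([1, 128, 56, 56])

regular   = w * h * c_out * c_in * k * k                   # 231,211,008 MACs
separable = w * h * c_in * k * k + w * h * c_in * c_out    # 27,496,448 MACs
print(regular / separable)                                 # ~8.4x fewer MACs
```

Note that in this example the 1x1 pointwise term dominates the separable cost (roughly 93% of the MACs), which is consistent with the MobileNet observation quoted on the next slide.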

Observation from MobileNet Paper

“MobileNet spends 95% of it’s computation time in 1x1 convolutions which also has 75% of the parameters as can be seen in Table 2.”

Ø Idea: eliminate the dense 1x1 conv but still achieve mixing of channel information?
Ø Pointwise (1x1) Group Convolution
Ø Channel Shuffle

Group Convolution

[Figure: input channels split into groups; each group gets its own 3x3 convolution producing its share of the output channels]

Used in AlexNet to partition the model across GPUs.

Computational Complexity: W · H · C_out · (C_in / g) · K², where g is the number of groups

Combines some information across space and across channels.
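
A minimal PyTorch sketch of a grouped 3x3 convolution (sizes illustrative; groups must divide both channel counts):

```python
import torch
import torch.nn as nn

c_in, c_out, h, w, k, g = 64, 128, 56, 56, 3, 4
# Each of the g groups convolves c_in / g input channels into c_out / g output channels.
gconv = nn.Conv2d(c_in, c_out, kernel_size=k, padding=k // 2, groups=g, bias=False)
y = gconv(torch.randn(1, c_in, h, w))
print(y.shape)  # torch.Size([1, 128, 56, 56])
# MACs: w * h * c_out * (c_in / g) * k * k -- a g-fold reduction vs. the dense conv
```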

Can we apply to 1x1 (pointwise) convolution?

Pointwise (1x1) Group Convolution

[Figure: input channels split into groups; each group gets its own 1x1 convolution producing its share of the output channels]

Computational Complexity: W · H · C_out · (C_in / g)

Combines some information across channels.

Issue: If applied repeatedly, channels in different groups remain independent; information never crosses group boundaries (see the sketch below).
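
A minimal sketch of 1x1 group convolutions, with a small illustrative check of the independence issue: if the input is nonzero only in the first group's channels, the output of two stacked 1x1 group convolutions is still nonzero only in that group's channels (sizes chosen just to make the effect visible):

```python
import torch
import torch.nn as nn

c, g = 8, 4  # tiny sizes just to make the effect visible
gconv1 = nn.Conv2d(c, c, kernel_size=1, groups=g, bias=False)
gconv2 = nn.Conv2d(c, c, kernel_size=1, groups=g, bias=False)

x = torch.zeros(1, c, 4, 4)
x[:, : c // g] = 1.0               # only the first group's channels carry signal
y = gconv2(gconv1(x))
print(y.abs().sum(dim=(0, 2, 3)))  # nonzero only in the first c // g channels
```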

Channel Shuffle

Ø Permute channels between group convolution stages (see the sketch below)
Ø Each group should get a channel from each of the previous groups.

Ø No arithmetic operations but does require data movement
Ø Good or bad for hardware?

[Figure: channel shuffle inserted between two group convolution stages]
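
The shuffle is typically implemented as a reshape-transpose-reshape over the channel axis (as described in the ShuffleNet paper); a minimal PyTorch sketch:

```python
import torch

def channel_shuffle(x, groups):
    # x: (N, C, H, W) with C divisible by `groups`.
    n, c, h, w = x.shape
    # Reshape channels to (groups, channels_per_group), swap the two axes,
    # then flatten back so each new group draws from every old group.
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

x = torch.arange(8).float().view(1, 8, 1, 1)  # channels labeled 0..7
print(channel_shuffle(x, 2).flatten())         # tensor([0., 4., 1., 5., 2., 6., 3., 7.])
```

No arithmetic is performed; the cost is purely the data movement of the transpose.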

ShuffleNet Architecture

[Figure: ShuffleNet Unit vs. MobileNet Unit, based on Residual Networks, with stride = 1 and stride = 2 variants]
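
For concreteness, here is a minimal sketch of a stride-1 ShuffleNet-style unit. It repeats the channel_shuffle helper from above so the block is self-contained, omits BatchNorm and the stride-2 variant (which uses an average-pooled shortcut and concatenation), and all sizes and the bottleneck ratio are illustrative:

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous().view(n, c, h, w)

class ShuffleUnit(nn.Module):
    """Stride-1 ShuffleNet-style unit (BatchNorm omitted for brevity)."""
    def __init__(self, channels, groups, bottleneck=4):
        super().__init__()
        mid = channels // bottleneck
        self.groups = groups
        self.gconv1 = nn.Conv2d(channels, mid, 1, groups=groups, bias=False)     # 1x1 group conv
        self.dwconv = nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False)  # 3x3 depthwise
        self.gconv2 = nn.Conv2d(mid, channels, 1, groups=groups, bias=False)     # 1x1 group conv
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.gconv1(x))
        out = channel_shuffle(out, self.groups)  # mix channels between the two group convs
        out = self.dwconv(out)
        out = self.gconv2(out)
        return self.relu(out + x)                # residual (identity) shortcut, stride-1 case

unit = ShuffleUnit(channels=240, groups=3)
print(unit(torch.randn(1, 240, 28, 28)).shape)   # torch.Size([1, 240, 28, 28])
```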

Alternative Visualization

Images from https://medium.com/@yu4u/why-mobilenet-and-its-variants-e-g-shufflenet-are-fast-1c7048b9618d

[Figure: ShuffleNet Unit vs. MobileNet Unit, with computation visualized along the spatial and channel dimensions]

Notice the cost of the 1x1 convolution.

ShuffleNet Architecture

Increased width when increasing number of groups → Constant FLOPs

Observation:

Shallow networks need more channels (width) to maintain accuracy.
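
A bit of illustrative arithmetic (not the paper's exact table) for why more groups buy more width at constant cost: considering only a 1x1 group convolution with c_in = c_out = c, its MAC count is H · W · c² / g, so at a fixed budget the channel count grows roughly as √g:

```python
# Illustrative arithmetic only: budget and spatial size are made-up round numbers.
h, w, budget = 28, 28, 25_000_000
for g in (1, 2, 4, 8):
    c = int((budget * g / (h * w)) ** 0.5)
    print(f"groups={g}: ~{c} channels at the same pointwise-conv budget")
```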

What are the Metrics of Success?

Ø Reduction in network size (parameters)

Ø Reduction in computation (FLOPs)

Ø Accuracy

Ø Runtime (Latency)

Comparisons to Other Architectures

Ø Less computation and more accurate than MobileNet

Ø Can be configured to match the accuracy of other models while using less compute.

Runtime Performance on a Mobile Processor (Qualcomm Snapdragon 820)

Ø Faster and more accurate than MobileNet

Ø Caveats
Ø Evaluated using a single thread
Ø Unclear how this would perform on a GPU … no numbers reported

Model Size

Ø They don’t report the memory footprint of the model
Ø ONNX implementation is 5.6MB → ~1.4M parameters

Ø MobileNet reports model size
Ø 4.2M parameters → ~16MB

Ø Generally relatively small
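
The sizes above are consistent with 4-byte (fp32) weights; a quick back-of-the-envelope check:

```python
BYTES_PER_PARAM = 4                   # fp32 weights
print(4.2e6 * BYTES_PER_PARAM / 1e6)  # 16.8 -> MobileNet's ~16MB
print(5.6e6 / BYTES_PER_PARAM / 1e6)  # 1.4  -> ~1.4M parameters in a 5.6MB ONNX file
```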

Limitations and Future Impact

Ø Limitations
Ø Decreases arithmetic intensity
Ø They disable pointwise group convolution on smaller input layers due to “performance issues”

Ø Future Impact
Ø Not yet as widely used as MobileNet

Ø Discussion:
Ø Could it potentially benefit from hardware optimization?