VXA: A Virtual Architecture for Durable Compressed ArchivesVXA: A Virtual Architecture for Durable...

Post on 26-Aug-2021

5 views 0 download

Transcript of VXA: A Virtual Architecture for Durable Compressed ArchivesVXA: A Virtual Architecture for Durable...

VXA: A Virtual Architecture forDurable Compressed Archives

Bryan FordComputer Science and Artificial Intelligence Laboratory

Massachusetts Institute of Technology

http://pdos.csail.mit.edu/~baford/vxa/

The Ubiquity of Data Compression

Everything is compressed these days– Archive/Backup/Distribution: ZIP, tar.gz, ...

– Multimedia streams: mp3, ogg, wmv, ...

– Office documents: XML­in­ZIP

– Digital cameras: JPEG, proprietary RAW, ...

– Video camcorders: DV, MPEG­2, ...

Compressed Data Formats

Observation #1:Data compression formats evolve rapidly

— SQ

— L

U—

 ARC

— Z

OO

— L

Harc

— Z

IP

— bz

ip2

— gz

ip

— co

mpress

— 7z

Lossless Compression—

 RAR

1980 1985 1990 1995 2000 2005

Compressed Data Formats

Observation #1:Data compression formats evolve rapidly

— SQ

— L

U—

 ARC

— Z

OO

— L

Harc

— Z

IP

— bz

ip2

— gz

ip

— co

mpress

— 7z

Lossless Compression—

 RAR

— M

PEG­2

— D

V—

 WM

V7

Video Encoding

Audio Encoding

— M

PEG­1

— M

PEG­4

— M

P3

— R

ealA

udio

— A

AC

— FLAC

— V

orbis

— Sore

nson

— Q

uickT

ime

— JP

EG

— G

IF

— PNG

— JP

EG 2000

Image Encoding—

 TIF

F

— W

MA7

— W

MA9

— W

MV9

— W

MV8

— FLIC

— W

AV

— A

IFF

— 8S

VX

— A

NIM

— A

IFF­C

— IL

BM

— PCX

— T

GA

— B

MP

1980 1985 1990 1995 2000 2005

Compressed Data Formats

Observation #1:Data compression formats evolve rapidly

Problems:– Inconvenient:

each new algorithm requires decoder install/upgrade

– Impedes data portability:data unusable on systems without supported decoder

– Threatens long­term data usability:old decoders may not run on new operating systems

Archiving Compressed Data

Observation #2:Processor architectures evolve more conservatively

1980 1985 1990 1995 2000 2005

x86 Architecture—

 8086

— 80

386

  (32­b

it)

— x8

6­64

  (64­b

it)

— SSE

  (FP ve

ctor)

— M

MX

  (int 

vecto

r)

Archiving Compressed Data

Observation #2:Processor architectures evolve more conservatively

1980 1985 1990 1995 2000 2005

x86 Architecture—

 8086

— 80

386

  (32­b

it)

— x8

6­64

  (64­b

it)

— SSE

  (FP ve

ctor)

— M

MX

  (int 

vecto

r)

Fully Backward Compatible Extensions

Archiving Compressed Data

Observation #2:Processor architectures evolve more conservatively

1980 1985 1990 1995 2000 2005

x86 Architecture—

 8086

— 80

386

  (32­b

it)

— x8

6­64

  (64­b

it)

— SSE

  (FP ve

ctor)

— M

MX

  (int 

vecto

r)

— 68

000

— M

IPS

— A

RM

— PA­R

ISC

— Pow

erPC

— D

EC Alph

a

— It

anium

Other Architectures—

 SPARC

— 68

000

— M

IPS

— A

RM

— PA­R

ISC

— Pow

erPC

— D

EC Alph

a

— It

anium

Other Architectures—

 SPARC

Archiving Compressed Data

Observation #2:Processor architectures evolve more conservatively

1980 1985 1990 1995 2000 2005

x86 Architecture—

 8086

— 80

386

  (32­b

it)

— x8

6­64

  (64­b

it)

— SSE

  (FP ve

ctor)

— M

MX

  (int 

vecto

r)

— 68

000

— M

IPS

— A

RM

— PA­R

ISC

— Pow

erPC

— D

EC Alph

a

— It

anium

Other Architectures—

 SPARC

Archiving Compressed Data

Observation #2:Processor architectures evolve more conservatively

1980 1985 1990 1995 2000 2005

x86 Architecture—

 8086

— 80

386

  (32­b

it)

— x8

6­64

  (64­b

it)

— SSE

  (FP ve

ctor)

— M

MX

  (int 

vecto

r)

— 68

000

— M

IPS

— A

RM

— PA­R

ISC

— Pow

erPC

— D

EC Alph

a

— It

anium

Other Architectures—

 SPARC

Archiving Compressed Data

Observation #2:Processor architectures evolve more conservatively

1980 1985 1990 1995 2000 2005

x86 Architecture—

 8086

— 80

386

  (32­b

it)

— x8

6­64

  (64­b

it)

— SSE

  (FP ve

ctor)

— M

MX

  (int 

vecto

r)

— 68

000

— M

IPS

— A

RM

— PA­R

ISC

— Pow

erPC

— D

EC Alph

a

— It

anium

Other Architectures—

 SPARC

Archiving Compressed Data

Observation #2:Processor architectures evolve more conservatively

1980 1985 1990 1995 2000 2005

x86 Architecture—

 8086

— 80

386

  (32­b

it)

— x8

6­64

  (64­b

it)

— SSE

  (FP ve

ctor)

— M

MX

  (int 

vecto

r)

Itani

c

VXA: Virtual Executable Archives

Observation 1+2: Instruction formats are historically more durable than compressed data formats

Make archive self­extracting (data + executable decoder)

To extract data, archive reader runs embedded decoder

ArchiveWriter

ArchiveReader

D D

Archive

Encoder

Decoder

Goals of VXA

Make self­extracting archives...

ArchiveWriter

ArchiveReader

D D

Archive

Encoder

Decoder

Goals of VXA

Make self­extracting archives...

1. Safe: malicious decoders can't compromise host

2. Future­proof: simple, well­defined architecture [Lorie]

ArchiveWriter

ArchiveReader

D

Emulator

D

Archive

Encoder

Decoder

Goals of VXA

Make self­extracting archives...

1. Safe: malicious decoders can't compromise host

2. Future­proof: simple, well­defined architecture [Lorie]

3. Easy: allow reuse of existing code, languages, tools

ArchiveWriter

ArchiveReader

D

x86Emulator

D

Archive

Encoder

Decoder

Goals of VXA

Make self­extracting archives...

1. Safe: malicious decoders can't compromise host

2. Future­proof: simple, well­defined architecture [Lorie]

3. Easy: allow reuse of existing code, languages, tools

4. Efficient: practical for short term data packaging too

ArchiveWriter

ArchiveReader

D

Fast x86Emulator

D

Archive

Encoder

Decoder

Outline

● Archiver Operation● vxZIP Archive Format● Decoder Architecture● Emulator Design & Implementation● Evaluation (performance, storage overhead)● Conclusion

Archive Writer Operation

Archive

VXA Archiver

Archive Writer Operation

Archive

D1

Uncompressed Input Files

GeneralCompressor

Decoder1

VXA Archiver

Archive Writer Operation

Archive

D1

Uncompressed Input Files

GeneralCompressor

Decoder1

VXA Archiver

Archive Writer Operation

Archive

D1 D2 D3

Uncompressed Input Files

GeneralCompressor

Decoder1

ImageCompressor

Decoder2

AudioCompressor

Decoder3

Archive Writer Operation

Archive

Uncompressed Input Files Pre­Compressed Input Files

D1 D2 D3

GeneralCompressor

ImageCompressor

AudioCompressor

Decoder1 Decoder2 Decoder3

Archive Writer Operation

Archive

D4 D5

Uncompressed Input Files Pre­Compressed Input Files

Image FormatRecognizer

Decoder4

Audio FormatRecognizer

Decoder5

D1 D2 D3

GeneralCompressor

ImageCompressor

AudioCompressor

Decoder1 Decoder2 Decoder3

Archive Reader Operation

Archive

VXA Archive Readerx86 Emulator

D4 D5D1 D2 D3

VXA Archive Reader

Archive Reader Operation

Archive

Original Uncompressed Files

x86 Emulator

Decoder1

D4 D5D1 D2 D3

VXA Archive Reader

Archive Reader Operation

Archive

Original Uncompressed Files

x86 Emulator

Decoder1 Decoder2 Decoder3

D4 D5D1 D2 D3

VXA Archive Reader

Archive Reader Operation

Archive

Original Uncompressed Files Original Pre­Compressed Files

x86 Emulator

D4 D5D1 D2 D3

Decoder1 Decoder2 Decoder3

VXA Archive Reader

Archive Reader Operation

Archive

Original Uncompressed Files De­compressed Files

x86 Emulator

Decoder4 Decoder5

D4 D5D1 D2 D3

Decoder1 Decoder2 Decoder3

vxZIP Archive Format

● Backward compatible with legacy ZIP format

Central Directory

Audio file

Audio file

Image file

vxZIP Archive

vxZIP Archive Format

● Backward compatible with legacy ZIP format

● Decoders intermixed with archived files

Central Directory

Audio file

FLAC Decoder

Audio file

Image file

JP2 Decoder

vxZIP Archive

vxZIP Archive Format

● Backward compatible with legacy ZIP format

● Decoders intermixed with archived files

● Archived files havenew extension header pointing to decoder

Central Directory

Audio file(FLAC-encoded)

FLAC Decoder

Audio file(FLAC-encoded)

Image file(JP2-encoded)

JP2 Decoder

vxZIP Archive

vxZIP Archive Format

● Backward compatible with legacy ZIP format

● Decoders intermixed with archived files

● Archived files havenew extension header pointing to decoder

● Decoders are hidden,“deflated” (gzip)

Central Directory

Audio file(FLAC-encoded)

FLAC Decoder(deflated)

Audio file(FLAC-encoded)

Image file(JP2-encoded)

JP2 Decoder(deflated)

vxZIP Archive

vxZIP Decoder Architecture

● Decoders are ELF executables for x86­32– Can be written in any language, safe or unsafe

– Compiled using ordinary tools (GCC)

● Decoders have access to five “system calls”:– read stdin, write stdout, malloc, next file, exit

● Decoders cannot:– open files, windows, devices, network connections, ...

– get system info: user name, current time, OS type, ...

Decoders Ported So Far(using existing implementations in C, mostly unmodified)

General­purpose (lossless):● zlib: Classic gzip/deflate algorithm● bzip2: Burrows­Wheeler algorithm

Still image codecs:● jpeg: Classic lossy image compression scheme● jp2: JPEG 2000 wavelet­based algorithm, lossy or lossless

Audio codecs:● flac: Free Lossless Audio Codec● vorbis: Standard lossy audio codec for Ogg streams

vx32 Emulator Architecture

Runs in vxUnZIP process● Loads decoder into 

address space sandbox● Restricts decoder's 

memory accessesto sandbox

● Dispatches decoder's VXA “system calls”to vxUnZIP(not to host OS!)

VXA DecoderAddress Space

(up to 1GB)

vxUnZIPProcessAddress

Space

vxUnZIPApplication

DecoderAddress Space

0

0

VXA System Calls

vx32 Emulator library

vx32 Emulator Implementation

On x86­{32/64} hosts:– Secure fault isolation 

[Wahbe]

– Data sandboxing viacustom LDT segments

– Code sandboxing viainstruction rewriting[Sites, Nethercote]

– No privileges orkernel extensions

KernelAddress Space

CodeRewriting

VXA DecoderAddress Space

(up to 1GB)

Flat-ModelCode/Data

Segment

vxUnZIPApplication

DecoderData Segment(LDT)

0

0

ControlledProcedureCalls

vx32 Emulator library

Transformed code cache

vx32 Emulator Implementation

On other host architectures:– Portable but slow “fallback” instruction interpreter

(mostly done)

– Fast x86­to­PowerPC binary translator(in progress)

– Hopefully more in the future

Emulator implemented as generic library– Can be used for other sandboxing applications

Evaluation

Two issues to address:● Performance overhead of emulated decoders

– not important for long­term archival storage, but...

– very important for common short­term uses of archives:backups, software distribution, structured documents, ...

● Storage overhead of archived decoders

Performance Test Method

Run 6 ported decoders on appropriate data sets– Athlon 64 3000+ PC running SuSE Linux 9.3

– Measure user­mode CPU time (not wall­clock time)

Compare:– Emulated vs native execution

– Running on x86­32 vs x86­64 host environment

Performance Overhead

zlib bzip2 jpeg jp2 flac vorbis0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1.1

1.2

native x86-32 vx32 on x86-32 native x86-64 vx32 on x86-64

No

rma

lize

d U

ser-

mo

de

Exe

cutio

n T

ime

Performance Overhead

zlib bzip2 jpeg jp2 flac vorbis0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1.1

1.2

native x86-32 vx32 on x86-32 native x86-64 vx32 on x86-64

No

rma

lize

d U

ser-

mo

de

Exe

cutio

n T

ime

Storage Overhead

Archiver stores only one copy of each decoder– Storage cost amortized over all files of same type

– Relative overhead depends on size of archive

Therefore, measure only absolute decoder size

(compressed, as stored in archive)

Storage Overhead

zlib bzip2 jpeg jp2 flac vorbis0

10

20

30

40

50

60

70

80

90

100

110

120

130

Decoder

C library

Co

mp

ress

ed

co

de

siz

e (

KB

yte

s)

Conclusion

VXA makes self­extracting archives...

● Safe: decoders fully sandboxed

● Future­proof: simple, OS­independent environment

● Easy: re­use existing decoders, languages, tools

● Efficient: ≤ 11% slowdown vs native x86­32

Available at: http://pdos.csail.mit.edu/~baford/vxa/