Compiler++ Evolving the compiler - C2.DLL

Post on 24-Feb-2016


Compiler++ Evolving the compiler - C2.DLL

Jim Radigan - Architect C++ Optimizer

Mission: Evolving the C++ compiler

1. ~Absolute Correctness
2. Compiler throughput
3. Code size
4. Code quality

$87.7 B

$100.0B+

Evolve the red arrow

3,100,000 Transistors

Ivy Bridge

1.4 Billion Transistors

TEGRA 3 - 5 cores / 128 bit vector instructions

Haswell C++

Built with C++

Windows SQL Office

Mission critical correctness and compile time

Compiler++ “Evolving the compiler” • How we work

• Core Technologies

• Where we are going

Full compile, test build Windows – N hours
24 cores + 32 GB memory, 3 RAID 0 drives

… if you’re in a hurry – 40 cores

X86, ARM, X64 - retail and checked

N Applications - then stress a compiler’s build

Compiler developer – bad day

Win8 improved – but still a work/life balance thing

Compiler++ “Evolving the compiler” • How we work

• Core Technologies

• Where we are going

“Compiler Business”

• Absolutely NO new compiler optimization switches

• Each switch would cost millions $$

Core Technologies
• Code size / stack size / data alignment
• Vectorization/Parallelization of existing C++
• Security
• Parallelizing C++ control flow
• Alias analysis

• FOR ALL HARDWARE & RUNTIMES!!

Code Size / Stack Size

Foo (int p1, int p2, int p3) {
    int w, x, y, z
    ….
    if (flag) {
        w =
        x = w + z …
        return x
    } else {
        y =
    }

[ebp+10]  Parameter 3
[ebp+0C]  Parameter 2
[ebp+08]  Parameter 1
[ebp+04]  Return address
[ebp+00]  Old ebp
[ebp-04]  Local 1  // w
[ebp-08]  Local 2  // x
[ebp-0C]  Local 3  // z or y

Stack Packing

?Bind_DeterminePinned@CBase@@UAEXXZ:
638643E0: 8B FF       mov   edi,edi
638643E2: 53          push  ebx
638643E3: 56          push  esi
638643E4: 8B F1       mov   esi,ecx
638643E6: 8B 5E 18    mov   ebx,dword ptr [esi+18h]
638643E9: 8B 46 04    mov   eax,dword ptr [esi+4]
638643EC: F6 C3 01    test  bl,1
638643EF: 74 08       je    638643F9
638643F1: 3B 46 08    cmp   eax,dword ptr [esi+8]
638643F4: 76 1E       jbe   63864414
638643F6: 5E          pop   esi
638643F7: 5B          pop   ebx
638643F8: C3          ret
MORE COLD CODE

No Stack Packing (R1 – R5 reasons for bad code)

?Bind_DeterminePinned@CBase@@UAEXXZ:
639E2840: 8B FF       mov   edi,edi
639E2842: 55          push  ebp                      #R1
639E2843: 8B EC       mov   ebp,esp
639E2845: 51          push  ecx                      #R2
639E2846: 53          push  ebx
639E2847: 56          push  esi
639E2848: 8B F1       mov   esi,ecx
639E284A: 57          push  edi                      #R3
639E284B: 8B 5E 18    mov   ebx,dword ptr [esi+18h]
639E284E: 8B 46 04    mov   eax,dword ptr [esi+4]
639E2851: F6 C3 01    test  bl,1
639E2854: 74 0C       je    639E2862
639E2856: 3B 46 08    cmp   eax,dword ptr [esi+8]
639E2859: 76 3F       jbe   639E289A
639E285B: 5F          pop   edi                      #R4
639E285C: 5E          pop   esi
639E285D: 5B          pop   ebx
639E285E: 8B E5       mov   esp,ebp                  #R5
639E2860: 5D          pop   ebp
639E2861: C3          ret
MORE COLD CODE
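The "Local 3 // z or y" slot above is the core of stack packing: two locals that are never live at the same time can share one stack slot. The following is a minimal sketch of that idea, not the compiler's actual algorithm. `LiveRange` and `packStackSlots` are hypothetical names, and live ranges are simplified to single linear intervals.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical live range of one local: [start, end] in instruction order.
struct LiveRange {
    int start;
    int end;
};

// Greedy slot assignment: reuse a slot whose previous occupant is already dead.
// Returns the slot index chosen for each range, in input order.
std::vector<int> packStackSlots(const std::vector<LiveRange>& ranges) {
    std::vector<std::size_t> order(ranges.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return ranges[a].start < ranges[b].start;
                     });

    std::vector<int> slotEnd;                    // last use recorded per slot
    std::vector<int> slotOf(ranges.size(), -1);  // result, indexed like ranges
    for (std::size_t idx : order) {
        int chosen = -1;
        for (std::size_t s = 0; s < slotEnd.size(); ++s) {
            if (slotEnd[s] < ranges[idx].start) {  // previous occupant is dead
                chosen = static_cast<int>(s);
                break;
            }
        }
        if (chosen < 0) {                          // no free slot: grow the frame
            chosen = static_cast<int>(slotEnd.size());
            slotEnd.push_back(ranges[idx].end);
        } else {
            slotEnd[chosen] = ranges[idx].end;
        }
        slotOf[idx] = chosen;
    }
    return slotOf;
}
```

With ranges for w, x, z, y given as `{0,9}, {0,9}, {0,4}, {5,9}`, the last two land in the same slot, so four locals need only three slots.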

It’s all about…

CACHE LINES

NTSTATUS
NtfsCommonRead ( PIRP_CONTEXT IrpContext, PIRP Irp, BOOLEAN AcquireScb)
{
    NTSTATUS Status;
    PIO_STACK_LOCATION IrpSp;
    PFILE_OBJECT FileObject;
    TYPE_OF_OPEN TypeOfOpen;
    PVCB Vcb;
    PFCB Fcb;
    PSCB Scb;
    PCCB Ccb;
    ATTRIBUTE_ENUMERATION_CONTEXT AttrContext;
    EOF_WAIT_BLOCK EofWaitBlock;
    PFSRTL_ADVANCED_FCB_HEADER Header;
    PTOP_LEVEL_CONTEXT TopLevelContext;
    VBO StartingVbo;
    LONGLONG ByteCount;
    LONGLONG ByteRange;
    ULONG RequestedByteCount;
    PCOMPRESSION_SYNC CompressionSync = ((void *)0);
    BOOLEAN FoundAttribute = 0;
    BOOLEAN PostIrp = 0;
    BOOLEAN OplockPostIrp = 0;
    BOOLEAN ScbAcquired = 0;
    BOOLEAN ReleaseScb;
    BOOLEAN PagingIoAcquired = 0;
    BOOLEAN DoingIoAtEof = 0;
    BOOLEAN Wait;
    BOOLEAN PagingIo;
    BOOLEAN NonCachedIo;
    BOOLEAN SynchronousIo;
    BOOLEAN CompressedIo = 0;

__try {
    NtfsPrePostIrp( IrpContext, Irp );
    if (( (((Fcb->FcbState) & ((0x00000004)))) ) && ( (((Scb->ScbState) & ((0x00000010)))) )) {
        FsRtlPostPagingFileStackOverflow( IrpContext, Event, NtfsStackOverflowRead );
    } else {
        FsRtlPostStackOverflow( IrpContext, Event, NtfsStackOverflowRead );
    }
    (void) KeWaitForSingleObject( Event, Executive, KernelMode, 0, ((void *)0) );
    Status = ((NTSTATUS)0x00000103L);
} __finally {
    if (Resource != ((void *)0)) {
        (ExReleaseResourceLite(Resource));
    }
    ExFreeToNPagedLookasideList( &NtfsKeventLookasideList, Event );
}
} else {
    if (Irp->Tail.Overlay.AuxiliaryBuffer != ((void *)0)) {
        IrpContext->Union.AuxiliaryBuffer = (PFSRTL_AUXILIARY_BUFFER)Irp->Tail.Overlay.AuxiliaryBuffer;
        if (!( (((IrpContext->Union.AuxiliaryBuffer->Flags) & (0x00000001))) )) {
            Irp->Tail.Overlay.AuxiliaryBuffer = ((void *)0);
        }
    }
    Status = NtfsCommonRead( IrpContext, Irp, 1 );
}
break;
}

__except (NtfsExceptionFilter( IrpContext, (struct _EXCEPTION_POINTERS *)_exception_info() )) {
    NTSTATUS ExceptionCode;
    ExceptionCode = _exception_code();
    if (ExceptionCode == ((NTSTATUS)0xC0000123L)) {
        IrpContext->ExceptionStatus = ExceptionCode = ((NTSTATUS)0xC0000011L);
        Irp->IoStatus.Information = 0;
    }
}

Try Region Graph – asynchronous lifetimes

[Figure: try region graph with nodes ROOT, TRY, EXCEPT, TRY, FINALLY, shown twice; in the second copy one TRY node is annotated "= x" and the other "x ="]

int x, y;

_try {
    _try {
        x =
    } _finally {
    }
    = x + …
    y =
} _except (filter()) {
    = y
}

Recall …Compiler dev. primary concern

C++ Core Technologies
• Code size / stack size / data alignment
• Vectorization/Parallelization of existing C++
• Security
• Parallelizing C++ control flow
• Alias analysis

C++ Compiler - Auto Parallelism

Vector - all loads before all stores

xmm0:  B[0]         B[1]         B[2]         B[3]
xmm1:  A[0]         A[1]         A[2]         A[3]
“addps xmm1, xmm0”
xmm1:  A[0] + B[0]  A[1] + B[1]  A[2] + B[2]  A[3] + B[3]

Simple vector add loop - unaligned

for (i = 0; i < 1000/4; i++) {
    movups xmm0, [ecx]
    movups xmm1, [eax]
    addps  xmm0, xmm1
    movups [edx], xmm0
}

for (i = 0; i < 1000; i++)
    A[i] = B[i] + C[i];
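The 4-wide loop above can be mimicked in portable C++ as a sketch. `vectorAdd` is a hypothetical helper: the inner 4-element body stands in for one load/add/store group of SSE instructions, and the trailing scalar loop handles trip counts that are not a multiple of 4. It assumes A does not overlap B or C.

```cpp
#include <cstddef>

// Adds B and C into A four elements at a time, then finishes the leftovers.
// Assumes A does not overlap B or C.
void vectorAdd(float* A, const float* B, const float* C, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float lane[4];
        for (int k = 0; k < 4; ++k) lane[k] = B[i + k] + C[i + k];  // loads + add
        for (int k = 0; k < 4; ++k) A[i + k] = lane[k];             // stores
    }
    for (; i < n; ++i) A[i] = B[i] + C[i];  // scalar cleanup loop
}
```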

Compiler looks across loop iterations !

Auto Parallelism/Vectorization for C++

For ( iv1 = 0; iv1 <= U1; iv1++) {
  For ( iv2 = 0; iv2 <= U2; iv2++) {
    ...
      For ( ivn = 0; ivn <= Un; ivn++) {
        t13 = OPLOAD [ a1*iv1 + a2*iv2 + ... + an*ivn + sym_expression ]
      }
  }
}

Math in the compiler - Legal to vectorize ?

FOR ( j = 2; j <= 5; j++)
    A( j ) = A( j-1 ) + A( j+1 )

Not Equal !!

A(2:5) = A(1:4) + A(3:6)

A(3) = ?

Vector Semantics
ALL loads before ALL stores

A(2:5) = A(1:4) + A(3:6)

VR1 = LOAD(A(1:4))
VR2 = LOAD(A(3:6))
VR3 = VR1 + VR2          // A(3) = F( A(2), A(4) )
STORE(A(2:5)) = VR3

Vector Semantics
Instead - load store load store ...

FOR ( j = 2; j <= 257; j++)
    A( j ) = A( j-1 ) + A( j+1 )

A(2) = A(1) + A(3)
A(3) = A(2) + A(4)       // A(3) = F( A(1), A(2), A(3), A(4) )
A(4) = A(3) + A(5)
A(5) = A(4) + A(6)
…
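The difference between the two semantics is easy to reproduce. In this sketch (function names are illustrative), both functions apply A(j) = A(j-1) + A(j+1) over the same range. One lets every store feed the next load, and the other reads only original values.

```cpp
#include <vector>

// Sequential semantics: each store is visible to the next iteration's load.
std::vector<int> stencilSequential(std::vector<int> a, int lo, int hi) {
    for (int j = lo; j <= hi; ++j) a[j] = a[j - 1] + a[j + 1];
    return a;
}

// Vector semantics: every load reads the original array, then all stores happen.
std::vector<int> stencilVector(std::vector<int> a, int lo, int hi) {
    const std::vector<int> old = a;
    for (int j = lo; j <= hi; ++j) a[j] = old[j - 1] + old[j + 1];
    return a;
}
```

For A = {0, 1, 2, 3, 4, 5, 6, 7} and j = 2..5, the sequential version gives A(3) = 8 while the vector version gives A(3) = 6, which is why this loop is not legal to vectorize as written.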

Doubled the optimizer

A ( a1 * I + c1 ) ?= A ( a2 * I’ + c2)
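One classical way to answer the question above is the GCD test. The slides do not say which dependence tests the compiler uses, so this is an illustrative stand-in with hypothetical names. The equation a1*I + c1 = a2*I' + c2 has integer solutions only if gcd(a1, a2) divides c2 - c1, so when it does not, the two references can never touch the same element.

```cpp
#include <cstdlib>

// Greatest common divisor of two non-negative integers.
long gcdOf(long a, long b) {
    while (b != 0) {
        long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Returns false only when A[a1*i + c1] and A[a2*j + c2] can never be the same
// element for any integers i, j. True means "may overlap" (conservative:
// loop bounds are ignored).
bool mayDepend(long a1, long c1, long a2, long c2) {
    long g = gcdOf(std::labs(a1), std::labs(a2));
    if (g == 0) return c1 == c2;  // both subscripts are constants
    return (c2 - c1) % g == 0;
}
```

For example, `mayDepend(2, 0, 2, 1)` is false (even versus odd elements), while `mayDepend(1, -1, 1, 1)`, the A(j-1) versus A(j+1) pair, is true.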

for (size_t j = 0; j < numBodies; j++) {
    D3DXVECTOR4 r;

    r.x = A[j].pos.x - pos.x;
    r.y = A[j].pos.y - pos.y;
    r.z = A[j].pos.z - pos.z;

    float distSqr = r.x*r.x + r.y*r.y + r.z*r.z;
    distSqr += softeningSquared;

    float invDist = 1.0f / sqrt(distSqr);
    float invDistCube = invDist * invDist * invDist;
    float s = fParticleMass * invDistCube;

    acc.x += r.x * s;
    acc.y += r.y * s;
    acc.z += r.z * s;
}

Complex C++ Not just arrays!

Legal math ?

void foo(int n, float *a, float *b, float *c) {
    for (int j=0; j<n; j++) {
        *a++ = *b++ + *c++;
    }
}

Legal ? Where’s the base of the array?

void transform1(int * first1, int * last1, int * first2, int * result) {
    while (first1 != last1) {
        *result++ = *first1++ + *first2++;
    }
}

   

…and where’s the IV?

STL – source code

A ( a1 * I + c1 ) ?= A ( a2 * I’ + c2)

Parallelizing C++ requires transformation to analyze

int synthetic_i;
int synthetic_upper = (last1 - first1 + 4)/4;

for (synthetic_i = 0; synthetic_i < synthetic_upper; synthetic_i++) {
    result[synthetic_i] = first1[synthetic_i] + first2[synthetic_i];
}

STL – source code

while (first1 != last1) {
    *result++ = *first1++ + *first2++;
}

Now …C++ vector code gen

• We don’t know if the array bases overlap
• We don’t know what the target ISA is
• We don’t know if the trip count is divisible by 4

j = 0
if ( ! overlap(result, first1) && ! overlap(result, first2)) {
    if (_ISA_AVAILABLE(AVX2)) {
        for (i = 0; i + 3 < synthetic_upper; i += 4) {   // Vector + Parallel Loop
            result[i : i + 3] = first1[i : i + 3] + first2[i : i + 3];
        }
        j = i
    }
}
for ( ; j < synthetic_upper; j++) {                      // Sequential or cleanup loop
    result[j] = first1[j] + first2[j];
}
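The versioning scheme above can be sketched in portable C++. This is an illustration under stated assumptions, not the compiler's generated code: `overlap` and `transformVersioned` are hypothetical helpers, the ISA check is omitted, and the 4-element inner body stands in for one vector instruction group. When the output overlaps an input, only the scalar loop runs, so the original sequential behavior is preserved.

```cpp
#include <cstddef>
#include <cstdint>

// True when [a, a+n) and [b, b+n) share at least one element.
// Comparing addresses as integers assumes a flat address space.
bool overlap(const int* a, const int* b, std::size_t n) {
    auto pa = reinterpret_cast<std::uintptr_t>(a);
    auto pb = reinterpret_cast<std::uintptr_t>(b);
    std::uintptr_t bytes = n * sizeof(int);
    return pa < pb + bytes && pb < pa + bytes;
}

void transformVersioned(const int* first1, const int* last1,
                        const int* first2, int* result) {
    std::size_t n = static_cast<std::size_t>(last1 - first1);
    std::size_t i = 0;
    if (!overlap(result, first1, n) && !overlap(result, first2, n)) {
        for (; i + 4 <= n; i += 4) {  // "vector" loop: all loads, then all stores
            int lane[4];
            for (int k = 0; k < 4; ++k) lane[k] = first1[i + k] + first2[i + k];
            for (int k = 0; k < 4; ++k) result[i + k] = lane[k];
        }
    }
    for (; i < n; ++i) {              // sequential or cleanup loop
        result[i] = first1[i] + first2[i];
    }
}
```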

Vector
Vector + Parallel
SPMD

Maps C++ to all forms of Parallelism

Don’t BSOD… it’s all about lifestyle choices

C++ Core Technologies
• Code size / stack size / data alignment
• Vectorization/Parallelization of existing C++
• Security
• Parallelizing C++ control flow
• Alias analysis

Heap overflow vulnerability

HRESULT CDocManager::IsValidWMToolsStream(bool* pfValid) {
    long cbSize;
    if(FAILED(hr = ExtractDataSize(strPath, &cbSize)))
        return S_OK;

    CSmartPtr<BYTE> pBuffer = new BYTE[cbSize];
    ExtractData(strPath, pBuffer, cbSize);
    long dwCheckSum = DwChecksumFromLpvCb(0, pBuffer, cbSize);
    long dwStreamCnt = GetStreamCount(m_pVisitedTree);
    if(FAILED(hr = ExtractDataSize(kszCheckSumStream, &cbSize))) {
        return S_OK;
    }

    //ExtractData(kszCheckSumStream, pBuffer, cbSize);
    for(int i=0; i<cbSize; i++) {
        *pBuffer++ = *kszCheckSumStream++;
    }
}

1. cbSize assigned: 4470
2. allocate buffer with 4470 bytes
3. cbSize re-assigned: 4496

Heap Overflow! Leads to Hijack
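The bug comes from the copy length and the allocation size drifting apart after cbSize is re-assigned. As a source-level illustration only (this is not the compiler-generated check the talk goes on to discuss), a copy routine can tie the two together. `copyChecked` is a hypothetical helper.

```cpp
#include <cstddef>
#include <vector>

// Copies `count` bytes from `src` into `dst`, growing `dst` first if needed,
// so the copy length and the allocation size can never disagree.
void copyChecked(std::vector<unsigned char>& dst,
                 const unsigned char* src, std::size_t count) {
    if (dst.size() < count) dst.resize(count);
    for (std::size_t i = 0; i < count; ++i) dst[i] = src[i];
}
```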

IE Aurora - Dangling pointer vulnerability

<html>
<head>
<script>
var e1;
function f1(evt) {
    e1 = document.createEventObject(evt);
    document.getElementById("sp").innerHTML = "";
    window.setInterval(f2, 50);
}
function f2() {
    var t = e1.srcElement;
}
</script>
</head>
<body>
<span id="sp">
    <img src="any.gif" onload="f1(evt)">
</span>
</body>
</html>

1. Pass onload event (evt) to f1
2. Copy evt, but fail to AddRef on CTreeNode!
3. Destroy img tag in span, leading to a free when evt falls out of scope
4. Call f2 async so evt goes out of scope

Hijack! Vtable call via freed CTreeNode
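The missing AddRef in step 2 can be illustrated with standard smart pointers. This is an analogy under assumed names (`TreeNode`, `CountedEvent`, `ObservingEvent`), not the browser's actual object model: a counted reference keeps the node alive, while a non-owning reference can at least detect that the node is gone instead of calling through freed memory.

```cpp
#include <memory>

struct TreeNode {
    int id;
};

// An event that holds a counted reference keeps its node alive
// (the "AddRef" case).
struct CountedEvent {
    std::shared_ptr<TreeNode> src;
};

// An event that only observes the node can tell when it has been freed.
struct ObservingEvent {
    std::weak_ptr<TreeNode> src;
    bool nodeAlive() const { return !src.expired(); }
};
```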

• Red is C++ called from javascript

Vulnerability: “use after free”

[Figure: a pointer into the heap; a vtable with entries function_1 and function_2; heap regions labeled "attack code" and "attack data"]

Illegal - flow or writes
What if the C++ compiler generated code to check?

• It would have to always be on
• NOT degrade performance !!

Example for : Hardware + Language + Compiler co-design

C++ Core Technologies
• Code size / stack size / data alignment
• Vectorization/Parallelization of existing C++
• Security
• Parallelizing C++ control flow
• Alias analysis

Control flow 12% win spec2k6\libquantum

quantum_reg_node *node = reg->node;

for (int i=0; i<reg->size; i++) {
    if (node[i].state & ((MAX_UNSIGNED) 1 << control1)) {
        if (node[i].state & ((MAX_UNSIGNED) 1 << control2)) {
            node[i].state ^= ((MAX_UNSIGNED) 1 << target);
        }
    }
}

Nested Control flow - 300% win NumericalRecipes

for (k=1;k<=nn;k++) {
    if (yy[k] > y) {
        xx[k] > x ? ++na : ++nb;
    } else {
        xx[k] > x ? ++nd : ++nc;
    }
}
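A first step toward vectorizing nested control flow like this is to replace the branches with arithmetic on comparison results. This is a minimal sketch with hypothetical names (`QuadCounts`, `countQuadrants`), using 0-based indexing rather than the 1-based loop above. It is not a claim about how the compiler obtains the quoted speedup.

```cpp
#include <cstddef>

struct QuadCounts {
    int na, nb, nc, nd;
};

// Branch-free form of the nested if/else: each comparison becomes 0 or 1,
// and every counter is updated unconditionally.
QuadCounts countQuadrants(const float* xx, const float* yy, std::size_t n,
                          float x, float y) {
    QuadCounts q{0, 0, 0, 0};
    for (std::size_t k = 0; k < n; ++k) {
        int above = yy[k] > y;  // 1 or 0
        int right = xx[k] > x;  // 1 or 0
        q.na += above & right;
        q.nb += above & (1 - right);
        q.nd += (1 - above) & right;
        q.nc += (1 - above) & (1 - right);
    }
    return q;
}
```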

Vectorizing C++ Control Flow

for ( int i = 0; i < 1000; i++) {
    if ( cond[ i ] )
        Lhs1[ i ] = Rhs1[ i ];
    else
        Lhs2[ i ] = Rhs2[ i ];
}

Bistry et al. 1997

Vectorizing C++ Control Flow

for ( int i = 0; i < 1000; i++) {
    if ( c[ i ] )
        Lhs1[ i ] = Rhs1[ i ];
    else
        Lhs2[ i ] = Rhs2[ i ];
}

Bistry et al. 1997

G[0:3] = bit_mask( c[0:3] )
Lhs[0:3] = (Lhs[0:3] & ! G[0:3]) | (Rhs1[0:3] & G[0:3])

G[0:3] = bit_mask( a[i] == b[i] )

xmm0:  27          13          2029        55
xmm1:  27          125         7           55
“pcmpeqd xmm1, xmm0”
xmm1:  0xffffffff  0x00000000  0x00000000  0xffffffff

(Lhs[0:3] & ! G[0:3])

xmm1:  0xffffffff  0x00000000  0x00000000  0xffffffff
xmm0:  Lhs[0]      Lhs[1]      Lhs[2]      Lhs[3]
“pandn xmm1, xmm0”
xmm1:  0x00000000  Lhs[1]      Lhs[2]      0x00000000

(Rhs[0:3] & G[0:3])

xmm3:  0xffffffff  0x00000000  0x00000000  0xffffffff
xmm2:  Rhs[0]      Rhs[1]      Rhs[2]      Rhs[3]
“pand xmm3, xmm2”
xmm3:  Rhs[0]      0x00000000  0x00000000  Rhs[3]

= (Lhs[0:3] & ! G[0:3]) | (Rhs[0:3] & G[0:3])

xmm3:  Rhs[0]      0x00000000  0x00000000  Rhs[3]
xmm1:  0x00000000  Lhs[1]      Lhs[2]      0x00000000
“por xmm3, xmm1”
xmm3:  Rhs[0]      Lhs[1]      Lhs[2]      Rhs[3]

STORE

xmm3:  Rhs[0]      Lhs[1]      Lhs[2]      Rhs[3]
“movups [esi], xmm3”
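The compare, and-not, and, or, store sequence walked through above amounts to a per-lane select. The sketch below spells it out on four 32-bit lanes in portable C++. `laneMask` and `blend4` are hypothetical names, and the scalar loop stands in for the SSE instructions.

```cpp
#include <cstdint>

// All-ones when the lanes are equal, all-zeros otherwise
// (what a packed compare yields per lane).
std::uint32_t laneMask(std::uint32_t a, std::uint32_t b) {
    return a == b ? 0xffffffffu : 0x00000000u;
}

// Per lane: lhs = (lhs & ~G) | (rhs & G), where G = mask(a == b).
void blend4(std::uint32_t* lhs, const std::uint32_t* rhs,
            const std::uint32_t* a, const std::uint32_t* b) {
    for (int k = 0; k < 4; ++k) {
        std::uint32_t g = laneMask(a[k], b[k]);
        lhs[k] = (lhs[k] & ~g) | (rhs[k] & g);
    }
}
```

With a = {27, 13, 2029, 55} and b = {27, 125, 7, 55}, lanes 0 and 3 take the Rhs value and lanes 1 and 2 keep Lhs, matching the figure.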

New Fact of Life
The system must never invent a write to a variable that wouldn’t be written to in an SC execution.

Q: Why?
If you the programmer can’t see all the variables that get written to, you can’t possibly know what locks to take.

Herb Sutter, C++11 Memory Model
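This rule is what makes the blend-and-store form of if-conversion problematic: storing all four lanes writes to elements the source loop never touches. The sketch below replays one unlucky interleaving by hand in a single thread. `blendAll` and `lostUpdateDemo` are hypothetical names, and the "other writer" is simulated rather than run concurrently.

```cpp
#include <array>

using Lanes = std::array<int, 4>;

// If-converted update: computes every lane so that all four can be stored back.
Lanes blendAll(const Lanes& loaded, const Lanes& rhs,
               const std::array<bool, 4>& cond) {
    Lanes out = loaded;
    for (int k = 0; k < 4; ++k) {
        if (cond[k]) out[k] = rhs[k];
    }
    return out;
}

// Another writer updates lane 1 between the vector load and the vector store.
// The source loop never writes lane 1, yet the full-width store overwrites it.
Lanes lostUpdateDemo() {
    Lanes lhs = {1, 2, 3, 4};
    const Lanes rhs = {10, 20, 30, 40};
    const std::array<bool, 4> cond = {true, false, false, true};

    Lanes loaded = lhs;                 // vector load of all lanes
    lhs[1] = 99;                        // simulated write by another thread
    lhs = blendAll(loaded, rhs, cond);  // invented write to lanes 1 and 2
    return lhs;
}
```

After `lostUpdateDemo()`, lane 1 holds the stale value 2 and the other writer's 99 is lost, even though that writer needed no lock according to the source program.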

Vectorizing Control Flow

- Hardware – design load/store instructions
- C++ Language – defines semantics
- Compiler’s vectorizer - Herb to Jim, “wait”

Example for EARLIER: Hardware + Language + Compiler co-design

C++ Core Technologies
• Code size / stack size / data alignment
• Vectorization/Parallelization of existing C++
• Security
• Parallelizing C++ control flow
• Alias analysis

Alias analysis
• Affects ALL compiler functionality
• Example - Security
• Optimization for eliminating r
• Hardware design

Compiler++ “Evolving the compiler” • How we work

• Core Technologies

• Where we are going

Alias analysis

*p = 70;
*q = …
n = *p + 30

*p = 70
*q = …
n = 100

Points_To {p} ?= Points_To{q}
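Whether `n = *p + 30` can be folded to `n = 100` depends entirely on whether p and q can point to the same object. A small sketch makes the two cases concrete (`storeThenRead` is a hypothetical function mirroring the slide).

```cpp
// Mirrors the slide: store 70 through p, store through q, then read *p + 30.
int storeThenRead(int* p, int* q, int qValue) {
    *p = 70;
    *q = qValue;     // may or may not overwrite *p
    return *p + 30;  // 100 only if p and q do not alias
}
```

Called with two distinct ints the result is 100. Called with the same address for both and `qValue = 5` it is 35, so the fold is legal only when the points-to sets are proven disjoint.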

C++ Alias analysis

*p = 70;
(*fptr) (a,b)
…
n = *p + 30

*p = 70
(*fptr)(a,b)
…
n = 100

Points_To {fptr} ?= Points_To{q}

C++ Alias analysis – double indirection

Point3d** Fubar (void) {
    Point3d *p, **x;

    p = new Point3d;
    x = &p;
    …
    *x = new Base;    // change the type of p
}

Visual – “out from underneath you”

[Figure: memory diagram with p (holding 0x12345678) and x; *x = new Base retargets p from a Point3D object to a Base object]

Types and alias analysis – “wicked cycle”

void Main ( ) {
    Shape **p, r;
    DerivedShape *q;
    q = new DerivedShape;
    p = &q;
    …
    *p = &r       // Need alias <*p, q>: “q is now made type-of (r)”
    q->foo();     // De-virtualizing this call depends on type-of (q)
    …
}
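The cycle can be seen in a compilable variant of the slide's example. The types here are adjusted so the code is valid C++ (q is declared as `Shape*` so that `Shape** p = &q` type-checks), and `callThroughAlias` is a hypothetical name. The store through `*p` silently changes which `foo` the call on q reaches.

```cpp
struct Shape {
    virtual ~Shape() = default;
    virtual int foo() const { return 1; }
};

struct DerivedShape : Shape {
    int foo() const override { return 2; }
};

// Returns which foo() the call on q reaches, with or without the hidden
// update of q through the double indirection p.
int callThroughAlias(bool retarget) {
    Shape r;
    DerivedShape d;
    Shape* q = &d;
    Shape** p = &q;
    if (retarget) *p = &r;  // q now points at r, though q is never named here
    return q->foo();        // devirtualizing needs to know what q points to
}
```

Without alias information linking `*p` to q, a compiler that devirtualized `q->foo()` to `DerivedShape::foo` would be wrong whenever `retarget` is true.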

Subset of C++ - at compile time
What if pointer indirections “restricted”…

“a pointer cannot be aliased to another pointer.”

No hidden updates!

Reject double indirection through a pointer that’s had its address taken.

Affects _all_ core technologies we covered

• Code size / stack size / data alignment
• Vectorization/Parallelization of existing C++
• Security
• Parallelizing C++ control flow
• Alias analysis

• FOR ALL HARDWARE & RUNTIMES!!

Compiler++ “Evolving the compiler” • How we work

• Core Technologies

• Where we are going

Processors

[Figure: processor roadmap. Process nodes: 32nm, 22nm, 22nm, 14nm, 10nm. Microarchitectures: Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell, Skylake, Skymont. Vector widths: 128 bit SSE, 256 bit AVX, 256 bit AVX(2). "You are here" marks the 3D tri-gate transistors.]

Summary

True size & scope of compiling C++ at Microsoft.

Programmers - Some core technologies

Hardware & System designers - Maybe work directly with the C++ compiler team