Compiler++ Evolving the compiler - C2.DLL

Jim Radigan - Architect C++ Optimizer

Mission: Evolving the C++ compiler

1. ~Absolute Correctness
2. Compiler throughput
3. Code size
4. Code quality

$87.7 B

$100.0 B +

Evolve the red arrow

3,100,000 Transistors

Ivy Bridge

1.4 Billion Transistors

TEGRA 3 - 5 cores / 128 bit vector instructions

Haswell C++

Built with C++

Windows SQL Office

Mission critical correctness and compile time

Compiler++ “Evolving the compiler”

• How we work

• Core Technologies

• Where we are going

Full compile, test build Windows – N hours
24 cores + 32 GB memory, 3 RAID 0 drives

… if you’re in a hurry – 40 cores

X86, ARM, X64 - retail and checked

N Applications - then stress a compiler’s build

Compiler developer – bad day

Win8 improved – but still a work/life balance thing

“Compiler Business”

• Absolutely NO new compiler optimization switches

• Each switch would cost millions $$

Core Technologies
• Code size / stack size / data alignment
• Vectorization/Parallelization of existing C++
• Security
• Parallelizing C++ control flow
• Alias analysis

• FOR ALL HARDWARE & RUNTIMES!!

Code Size / Stack Size

Foo (int p1, int p2, int p3) {
    int w, x, y, z
    ….
    if (flag) {
        w =
        x = w + z
        …
        return x
    } else {
        y =
    }

[ebp+10] Parameter 3
[ebp+0C] Parameter 2
[ebp+08] Parameter 1
[ebp+04] Return address
[ebp+00] Old ebp
[ebp-04] Local 1 // w
[ebp-08] Local 2 // x
[ebp-0C] Local 3 // z or y
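The slot sharing in the layout above (z and y both live in Local 3) can be sketched as a greedy assignment over lifetimes. This is only an illustration under simplifying assumptions: `Lifetime` and `pack_slots` are hypothetical names, and each lifetime is modeled as a single instruction range, whereas a production compiler works from liveness on the control-flow graph.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical model: a local is live from instruction `first` to `last`, inclusive.
struct Lifetime {
    int first;
    int last;
};

// Greedy packing: each local takes the lowest-numbered stack slot whose
// previous occupant is already dead when the local becomes live.
// Returns the slot index of every local, in input order.
std::vector<int> pack_slots(const std::vector<Lifetime>& locals) {
    std::vector<std::size_t> order(locals.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    // Visit locals in order of first use; stable so ties keep input order.
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return locals[a].first < locals[b].first;
                     });

    std::vector<int> slot_of(locals.size(), -1);
    std::vector<int> busy_until;  // last live instruction per slot
    for (std::size_t idx : order) {
        int chosen = -1;
        for (std::size_t s = 0; s < busy_until.size(); ++s) {
            if (busy_until[s] < locals[idx].first) {
                chosen = static_cast<int>(s);
                break;
            }
        }
        if (chosen < 0) {  // no reusable slot: grow the frame
            chosen = static_cast<int>(busy_until.size());
            busy_until.push_back(locals[idx].last);
        } else {
            busy_until[chosen] = locals[idx].last;
        }
        slot_of[idx] = chosen;
    }
    return slot_of;
}
```

With w and x live throughout, z live only in the `if` arm and y only in the `else` arm, the sketch assigns three slots for four locals, which matches the "z or y" slot on the slide.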

Stack Packing

?Bind_DeterminePinned@CBase@@UAEXXZ:
638643E0: 8B FF       mov  edi,edi
638643E2: 53          push ebx
638643E3: 56          push esi
638643E4: 8B F1       mov  esi,ecx
638643E6: 8B 5E 18    mov  ebx,dword ptr [esi+18h]
638643E9: 8B 46 04    mov  eax,dword ptr [esi+4]
638643EC: F6 C3 01    test bl,1
638643EF: 74 08       je   638643F9
638643F1: 3B 46 08    cmp  eax,dword ptr [esi+8]
638643F4: 76 1E       jbe  63864414
638643F6: 5E          pop  esi
638643F7: 5B          pop  ebx
638643F8: C3          ret
MORE COLD CODE

No Stack Packing (R1 – R5 reasons for bad code)

?Bind_DeterminePinned@CBase@@UAEXXZ:
639E2840: 8B FF       mov  edi,edi
639E2842: 55          push ebp                       #R1
639E2843: 8B EC       mov  ebp,esp
639E2845: 51          push ecx                       #R2
639E2846: 53          push ebx
639E2847: 56          push esi
639E2848: 8B F1       mov  esi,ecx
639E284A: 57          push edi                       #R3
639E284B: 8B 5E 18    mov  ebx,dword ptr [esi+18h]
639E284E: 8B 46 04    mov  eax,dword ptr [esi+4]
639E2851: F6 C3 01    test bl,1
639E2854: 74 0C       je   639E2862
639E2856: 3B 46 08    cmp  eax,dword ptr [esi+8]
639E2859: 76 3F       jbe  639E289A
639E285B: 5F          pop  edi                       #R4
639E285C: 5E          pop  esi
639E285D: 5B          pop  ebx
639E285E: 8B E5       mov  esp,ebp                   #R5
639E2860: 5D          pop  ebp
639E2861: C3          ret
MORE COLD CODE

It's all about…

CACHE LINES

NTSTATUS
NtfsCommonRead (
    PIRP_CONTEXT IrpContext,
    PIRP Irp,
    BOOLEAN AcquireScb
    )
{
    NTSTATUS Status;
    PIO_STACK_LOCATION IrpSp;
    PFILE_OBJECT FileObject;
    TYPE_OF_OPEN TypeOfOpen;
    PVCB Vcb;
    PFCB Fcb;
    PSCB Scb;
    PCCB Ccb;
    ATTRIBUTE_ENUMERATION_CONTEXT AttrContext;
    EOF_WAIT_BLOCK EofWaitBlock;
    PFSRTL_ADVANCED_FCB_HEADER Header;
    PTOP_LEVEL_CONTEXT TopLevelContext;
    VBO StartingVbo;
    LONGLONG ByteCount;
    LONGLONG ByteRange;
    ULONG RequestedByteCount;
    PCOMPRESSION_SYNC CompressionSync = ((void *)0);
    BOOLEAN FoundAttribute = 0;
    BOOLEAN PostIrp = 0;
    BOOLEAN OplockPostIrp = 0;
    BOOLEAN ScbAcquired = 0;
    BOOLEAN ReleaseScb;
    BOOLEAN PagingIoAcquired = 0;
    BOOLEAN DoingIoAtEof = 0;
    BOOLEAN Wait;
    BOOLEAN PagingIo;
    BOOLEAN NonCachedIo;
    BOOLEAN SynchronousIo;
    BOOLEAN CompressedIo = 0;

__try {
    NtfsPrePostIrp( IrpContext, Irp );
    if (( (((Fcb->FcbState) & ((0x00000004)))) ) && ( (((Scb->ScbState) & ((0x00000010)))) )) {
        FsRtlPostPagingFileStackOverflow( IrpContext, Event, NtfsStackOverflowRead );
    } else {
        FsRtlPostStackOverflow( IrpContext, Event, NtfsStackOverflowRead );
    }
    (void) KeWaitForSingleObject( Event, Executive, KernelMode, 0, ((void *)0) );
    Status = ((NTSTATUS)0x00000103L);

} __finally {
    if (Resource != ((void *)0)) {
        (ExReleaseResourceLite(Resource));
    }
    ExFreeToNPagedLookasideList( &NtfsKeventLookasideList, Event );
}
} else {
    if (Irp->Tail.Overlay.AuxiliaryBuffer != ((void *)0)) {
        IrpContext->Union.AuxiliaryBuffer = (PFSRTL_AUXILIARY_BUFFER)Irp->Tail.Overlay.AuxiliaryBuffer;
        if (!( (((IrpContext->Union.AuxiliaryBuffer->Flags) & (0x00000001))) )) {
            Irp->Tail.Overlay.AuxiliaryBuffer = ((void *)0);
        }
    }
    Status = NtfsCommonRead( IrpContext, Irp, 1 );
}
break;
}

__except (NtfsExceptionFilter( IrpContext, (struct _EXCEPTION_POINTERS *)_exception_info() )) {
    NTSTATUS ExceptionCode;
    ExceptionCode = _exception_code();
    if (ExceptionCode == ((NTSTATUS)0xC0000123L)) {
        IrpContext->ExceptionStatus = ExceptionCode = ((NTSTATUS)0xC0000011L);
        Irp->IoStatus.Information = 0;
    }
}

Try Region Graph – asynchronous lifetimes

[Figure: region graph with nodes ROOT, TRY, EXCEPT, TRY, FINALLY, shown twice. In the second copy one TRY node is annotated "= x" and the other "x =".]

int x, y;

_try {
    _try {
        x =
    } _finally {
    }
    = x + …
    y =
} _except (filter()) {
    = y
}

Recall … Compiler dev. primary concern

C++ Core Technologies
• Code size / stack size / data alignment
• Vectorization/Parallelization of existing C++
• Security
• Parallelizing C++ control flow
• Alias analysis

C++ Compiler - Auto Parallelism

Vector - all loads before all stores

[Figure: “addps xmm1, xmm0” adds registers xmm0 and xmm1 lane by lane (+) and leaves the sums in xmm1.]
B[0]          B[1]          B[2]          B[3]
A[0]          A[1]          A[2]          A[3]
A[0] + B[0]   A[1] + B[1]   A[2] + B[2]   A[3] + B[3]

Simple vector add loop - unaligned

for (i = 0; i < 1000/4; i++) {
    movups xmm0, [ecx]
    movups xmm1, [eax]
    addps  xmm0, xmm1
    movups [edx], xmm0
}

for (i = 0; i < 1000; i++) A[i] = B[i] + C[i];

Compiler looks across loop iterations !
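What the vectorizer produces for the loop above can be approximated by hand with SSE intrinsics. This is a sketch, not the compiler's actual output: `vector_add` and the `HAVE_SSE` macro are names made up here, the unaligned load/store intrinsics are used because the slide's loop is the unaligned case, and a scalar fallback covers targets without SSE.

```cpp
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
#define HAVE_SSE 1  // hypothetical macro, local to this sketch
#endif

// A[i] = B[i] + C[i], four floats per iteration where SSE is available.
void vector_add(float* A, const float* B, const float* C, std::size_t n) {
    std::size_t i = 0;
#ifdef HAVE_SSE
    for (; i + 4 <= n; i += 4) {
        __m128 b = _mm_loadu_ps(B + i);           // unaligned load of B[i..i+3]
        __m128 c = _mm_loadu_ps(C + i);           // unaligned load of C[i..i+3]
        _mm_storeu_ps(A + i, _mm_add_ps(b, c));   // four adds, one store
    }
#endif
    for (; i < n; ++i) A[i] = B[i] + C[i];        // remainder elements
}
```

The trailing scalar loop handles trip counts that are not a multiple of four, a point the later slides return to.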

Auto Parallelism/Vectorization for C++

For ( iv1 = 0; iv1 <= U1; iv1++) {
    For ( iv2 = 0; iv2 <= U2; iv2++) {
        ...
        For ( ivn = 0; ivn <= Un; ivn++) {
            t13 = OPLOAD [ a1*iv1 + a2*iv2 + ... + an*ivn + sym_expression ]
        }
    }
}

Math in the compiler - Legal to vectorize ?

FOR ( j = 2; j <= 5; j++) A( j ) = A (j-1) + A (j+1)

Not Equal !!

A (2:5) = A (1:4) + A (3:6)

A(3) = ?

Vector Semantics
ALL loads before ALL stores

A (2:5) = A (1:4) + A (3:6)

VR1 = LOAD(A(1:4))
VR2 = LOAD(A(3:6))
VR3 = VR1 + VR2        // A(3) = F( A(2), A(4) )
STORE(A(2:5)) = VR3

Vector Semantics
Instead - load store load store ...

FOR ( j = 2; j <= 257; j++)
    A( j ) = A( j-1 ) + A( j+1 )

A(2) = A(1) + A(3)
A(3) = A(2) + A(4)     // A(3) = F( A(1), A(2), A(3), A(4) )
A(4) = A(3) + A(5)
A(5) = A(4) + A(6)
…
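The difference between the two semantics is easy to check on a small array. The sketch below uses hypothetical function names and 1-based indices as on the slides (element 0 is unused); it runs the j = 2..5 loop both ways.

```cpp
#include <array>

using Arr = std::array<int, 8>;  // index 0 unused so indices match the slides

// Scalar semantics: load, store, load, store ...
// Each iteration sees the value stored by the previous one.
Arr scalar_loop(Arr A) {
    for (int j = 2; j <= 5; ++j) A[j] = A[j - 1] + A[j + 1];
    return A;
}

// Vector semantics: ALL loads happen before ALL stores.
Arr vector_semantics(Arr A) {
    int vr1[4], vr2[4];
    for (int k = 0; k < 4; ++k) {
        vr1[k] = A[1 + k];  // LOAD A(1:4)
        vr2[k] = A[3 + k];  // LOAD A(3:6)
    }
    for (int k = 0; k < 4; ++k) A[2 + k] = vr1[k] + vr2[k];  // STORE A(2:5)
    return A;
}
```

Starting from A(1..7) = 1..7, the scalar loop computes A(3) from the freshly written A(2), while the vector form uses the old A(2), so the results differ from A(3) onward. That loop-carried dependence is what makes this loop illegal to vectorize as written.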

Doubled the optimizer

A ( a1 * I + c1 ) ?= A ( a2 * I’ + c2)
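One classical way to answer the question above for linear subscripts is the GCD test: a1*I + c1 = a2*I' + c2 has an integer solution only if gcd(a1, a2) divides c2 - c1. The slides do not say which dependence tests the optimizer uses, so the following is a textbook sketch with hypothetical names, and it ignores loop bounds.

```cpp
#include <cstdlib>

// Euclid's algorithm on absolute values.
long gcd_abs(long a, long b) {
    a = std::labs(a);
    b = std::labs(b);
    while (b != 0) {
        long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Can A(a1*I + c1) and A(a2*I' + c2) name the same element for some
// integers I and I'? Loop bounds are ignored, so `true` means
// "may depend", never "definitely depends".
bool may_depend(long a1, long c1, long a2, long c2) {
    long g = gcd_abs(a1, a2);
    if (g == 0) return c1 == c2;   // both subscripts are constants
    return (c2 - c1) % g == 0;
}
```

For example, A(2*I) and A(2*I' + 1) can never overlap (even versus odd elements), whereas A(j-1) and A(j) may.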

for (size_t j = 0; j < numBodies; j++) {
    D3DXVECTOR4 r;

    r.x = A[j].pos.x - pos.x;
    r.y = A[j].pos.y - pos.y;
    r.z = A[j].pos.z - pos.z;

    float distSqr = r.x*r.x + r.y*r.y + r.z*r.z;
    distSqr += softeningSquared;

    float invDist = 1.0f / sqrt(distSqr);
    float invDistCube = invDist * invDist * invDist;
    float s = fParticleMass * invDistCube;

    acc.x += r.x * s;
    acc.y += r.y * s;
    acc.z += r.z * s;
}

Complex C++ Not just arrays!

Legal math ?

void foo(int n, float *a, float *b, float *c) {
    for (int j=0; j<n; j++) {
        *a++ = *b++ + *c++;
    }
}

Legal ? Where’s the base of the array?

void transform1(int * first1, int * last1, int * first2, int * result) {
    while (first1 != last1) {
        *result++ = *first1++ + *first2++;
    }
}

…and where’s the IV?

STL – source code

A ( a1 * I + c1 ) ?= A ( a2 * I’ + c2)

Parallelizing C++ requires transformation to analyze

int synthetic_i;
int synthetic_upper = (last1 - first1 + 4)/4;

for (synthetic_i = 0; synthetic_i < synthetic_upper; synthetic_i++) {
    result[synthetic_i] = first1[synthetic_i] + first2[synthetic_i];
}

STL – source code

while (first1 != last1) { *result++ = *first1++ + *first2++; }
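The rewrite from a pointer-walking loop to an indexed loop can be checked by writing both forms side by side. In this sketch the function names are hypothetical, and the trip count is taken as the plain pointer difference `last1 - first1`, which in C++ is already measured in elements; that is an assumption of the sketch and is not the same expression as the synthetic upper bound shown on the slide.

```cpp
#include <cstddef>

// STL-style form: no visible array base and no induction variable.
void transform_ptr(const int* first1, const int* last1,
                   const int* first2, int* result) {
    while (first1 != last1) *result++ = *first1++ + *first2++;
}

// Equivalent indexed form with a synthesized induction variable `i`.
// Array bases are now explicit, so subscripts have the shape a*i + c.
void transform_indexed(const int* first1, const int* last1,
                       const int* first2, int* result) {
    const std::ptrdiff_t trip_count = last1 - first1;
    for (std::ptrdiff_t i = 0; i < trip_count; ++i)
        result[i] = first1[i] + first2[i];
}
```

Once the loop is in the second form, the dependence question A(a1*I + c1) ?= A(a2*I' + c2) can be asked of `result`, `first1`, and `first2` directly.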

Now …C++ vector code gen

• We don’t know if the array bases overlap
• We don’t know what the target ISA is
• We don’t know if the trip count is divisible by 4

j = 0;
if ( ! overlap(result, first1) && ! overlap(result, first2)) {
    if (_ISA_AVAILABLE(AVX2)) {
        for (i = 0; i + 3 < synthetic_upper; i += 4) {   // Vector + Parallel Loop
            result[i : i + 3] = first1[i : i + 3] + first2[i : i + 3];
        }
        j = i;
    }
}
for ( ; j < synthetic_upper; j++) {   // Sequential or cleanup loop
    result[j] = first1[j] + first2[j];
}
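The runtime versioning in the pseudocode above can be written out in plain C++. This is a sketch under stated assumptions: `overlap` here takes an explicit element count (the slide's helper does not show one), the ISA check is omitted, and the four-wide body is emulated with ordinary scalar code that performs all loads before all stores.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: do [a, a+n) and [b, b+n) share any element?
bool overlap(const int* a, const int* b, std::size_t n) {
    auto lo_a = reinterpret_cast<std::uintptr_t>(a);
    auto lo_b = reinterpret_cast<std::uintptr_t>(b);
    auto bytes = n * sizeof(int);
    return lo_a < lo_b + bytes && lo_b < lo_a + bytes;
}

// result[k] = first1[k] + first2[k] for k in [0, n).
// Returns how many elements went through the four-wide path.
std::size_t add_versioned(int* result, const int* first1,
                          const int* first2, std::size_t n) {
    std::size_t i = 0;
    if (!overlap(result, first1, n) && !overlap(result, first2, n)) {
        for (; i + 4 <= n; i += 4) {
            int lane[4];
            for (int k = 0; k < 4; ++k)            // all loads first
                lane[k] = first1[i + k] + first2[i + k];
            for (int k = 0; k < 4; ++k)            // then all stores
                result[i + k] = lane[k];
        }
    }
    const std::size_t vectorized = i;
    for (; i < n; ++i)                             // sequential or cleanup loop
        result[i] = first1[i] + first2[i];
    return vectorized;
}
```

When the buffers overlap, the whole range falls through to the sequential loop, so the original load-store-load-store semantics are preserved.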

Vector / Vector + Parallel / SPMD

Maps C++ to all forms of Parallelism

Don’t BSOD… it’s all about lifestyle choices

Heap overflow vulnerability

HRESULT CDocManager::IsValidWMToolsStream(bool* pfValid) {
    long cbSize;
    if(FAILED(hr = ExtractDataSize(strPath, &cbSize)))
        return S_OK;

    CSmartPtr<BYTE> pBuffer = new BYTE[cbSize];
    ExtractData(strPath, pBuffer, cbSize);
    long dwCheckSum = DwChecksumFromLpvCb(0, pBuffer, cbSize);
    long dwStreamCnt = GetStreamCount(m_pVisitedTree);
    if(FAILED(hr = ExtractDataSize(kszCheckSumStream, &cbSize))) {
        return S_OK;
    }

    //ExtractData(kszCheckSumStream, pBuffer, cbSize);
    for(int i=0; i<cbSize; i++) {
        *pBuffer++ = *kszCheckSumStream++;
    }
}

1. cbSize assigned: 4470
2. allocate buffer with 4470 bytes
3. cbSize re-assigned: 4496

Heap Overflow! Leads to Hijack
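The bug pattern is that the copy length comes from a size that was re-assigned after the allocation. A defensive version keeps the allocated capacity with the buffer and checks every copy against it. The sketch below is only an illustration with hypothetical names; it is not the fix that shipped for the code on the slide.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Copies `count` bytes into `buffer` only if they fit in what was allocated.
// Returns false, and copies nothing, when the requested size exceeds the
// capacity (for example 4496 bytes into a 4470-byte buffer).
bool copy_checked(std::vector<unsigned char>& buffer,
                  const unsigned char* src, std::size_t count) {
    if (count > buffer.size()) return false;
    if (count != 0) std::memcpy(buffer.data(), src, count);
    return true;
}
```

Because `std::vector` carries its own size, the capacity cannot silently drift away from the allocation the way the reused `cbSize` variable does.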

IE Aurora - Dangling pointer vulnerability

<html>
<head>
<script>
var e1;
function f1(evt) {
    e1 = document.createEventObject(evt);
    document.getElementById("sp").innerHTML = "";
    window.setInterval(f2, 50);
}
function f2() {
    var t = e1.srcElement;
}
</script>
</head>
<body>
<span id="sp">
    <img src="any.gif" onload="f1(evt)">
</span>
</body>
</html>

1. Pass onload event (evt) to f1
2. Copy evt, but fail to AddRef on CTreeNode!
3. Destroy img tag in span, leading to a free when evt falls out of scope
4. Call f2 async so evt goes out of scope

Hijack! Vtable call via freed CTreeNode
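The missing AddRef in step 2 can be modeled with standard smart pointers. In this sketch `TreeNode`, `EventCopy`, and `EventObserver` are hypothetical stand-ins, not the browser's real types: an owning `std::shared_ptr` plays the role of a copy that did take a reference, and a `std::weak_ptr` shows how a non-owning copy can at least detect that the node is gone instead of calling through freed memory.

```cpp
#include <memory>

struct TreeNode {             // hypothetical stand-in for CTreeNode
    int id = 7;
};

struct EventCopy {            // a copy that takes a reference (the "AddRef")
    std::shared_ptr<TreeNode> src_element;
};

struct EventObserver {        // a copy that does not own the node
    std::weak_ptr<TreeNode> src_element;
};

// Returns the node id if the node is still alive, or -1 if it was destroyed.
int read_src(const EventObserver& e) {
    if (auto node = e.src_element.lock()) return node->id;
    return -1;
}
```

As long as some `EventCopy` holds the node, destroying the original owner does not free it. Once the last owning reference is dropped, `read_src` reports -1 rather than dereferencing a dangling pointer.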

• Red is C++ called from javascript

Vulnerability: “use after free”

[Figure: a heap pointer leads to a vtable with entries function_1 and function_2, shown alongside blocks of attack code and attack data.]

Illegal - flow or writes
What if the C++ compiler generated code to check?

• It would have to always be on
• NOT degrade performance !!

Example for : Hardware + Language + Compiler co-design

Control flow 12% win spec2k6\libquantum

quantum_reg_node *node = reg->node;

for (int i=0; i<reg->size; i++) {
    if (node[i].state & ((MAX_UNSIGNED) 1 << control1)) {
        if (node[i].state & ((MAX_UNSIGNED) 1 << control2)) {
            node[i].state ^= ((MAX_UNSIGNED) 1 << target);
        }
    }
}

Nested Control flow - 300% win NumericalRecipes

for (k=1;k<=nn;k++) {
    if (yy[k] > y) {
        xx[k] > x ? ++na : ++nb;
    } else {
        xx[k] > x ? ++nd : ++nc;
    }
}
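One way to see why nested control flow like this can be vectorized is to rewrite it without branches: each comparison becomes a 0/1 value that is added unconditionally. The sketch below uses hypothetical names and 0-based indexing, and shows a hand-written if-conversion rather than the compiler's actual transformation.

```cpp
#include <cstddef>

struct QuadCounts {
    int na = 0, nb = 0, nc = 0, nd = 0;
};

// Branchy form, as on the slide.
QuadCounts count_branchy(const float* xx, const float* yy, std::size_t n,
                         float x, float y) {
    QuadCounts c;
    for (std::size_t k = 0; k < n; ++k) {
        if (yy[k] > y) {
            xx[k] > x ? ++c.na : ++c.nb;
        } else {
            xx[k] > x ? ++c.nd : ++c.nc;
        }
    }
    return c;
}

// If-converted form: no data-dependent branches in the loop body.
QuadCounts count_branch_free(const float* xx, const float* yy, std::size_t n,
                             float x, float y) {
    QuadCounts c;
    for (std::size_t k = 0; k < n; ++k) {
        const int up = yy[k] > y;       // bool converts to 0 or 1
        const int right = xx[k] > x;
        c.na += up & right;
        c.nb += up & (1 - right);
        c.nd += (1 - up) & right;
        c.nc += (1 - up) & (1 - right);
    }
    return c;
}
```

Both functions produce the same four counts; the second has a loop body that is the same straight-line code on every iteration, which is the shape a vectorizer needs.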

Vectorizing C++ Control Flow

for ( int i = 0; i < 1000; i++) {
    if ( c[ i ] ) {
        Lhs1[ i ] = Rhs1[ i ]
    } else {
        Lhs2[ i ] = Rhs2[ i ]
    }
}

Bistry et al. 1997

G[0:3] = bit_mask( c[0:3] )
Lhs[0:3] = (Lhs[0:3] & ! G[0:3]) | (Rhs1[0:3] & G[0:3])

G[0:3] = bit_mask(a[i] == b[i] )

[Figure: “pcmpeq xmm1, xmm0” compares xmm0 and xmm1 lane by lane (==) and leaves the mask in xmm1.]
27          13          2029        55
27          125         7           55
0xffffffff  0x00000000  0x00000000  0xffffffff

(Lhs[0:3] & ! G[0:3])

[Figure: “pandn xmm1, xmm0” (&!) clears the lanes of Lhs that the mask selects; result in xmm1.]
Lhs[0]      Lhs[1]      Lhs[2]      Lhs[3]
0x00000000  Lhs[1]      Lhs[2]      0x00000000

(Rhs[0:3] & G[0:3])

[Figure: “pand xmm3, xmm2” (&) keeps only the lanes of Rhs that the mask selects; Rhs is in xmm2, result in xmm3.]
Rhs[0]      Rhs[1]      Rhs[2]      Rhs[3]
Rhs[0]      0x00000000  0x00000000  Rhs[3]

= (Lhs[0:3] & ! G[0:3]) | (Rhs[0:3] & G[0:3])

[Figure: “por xmm1, xmm3” (or) merges the two masked registers.]
Rhs[0]      0x00000000  0x00000000  Rhs[3]
0x00000000  Lhs[1]      Lhs[2]      0x00000000
Rhs[0]      Lhs[1]      Lhs[2]      Rhs[3]

STORE

[Figure: “movups [esi], xmm3” stores the blended register.]
Rhs[0]      Lhs[1]      Lhs[2]      Rhs[3]
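The compare, and-not, and, or sequence in the figures above can be emulated on four 32-bit lanes in portable C++. The names `Lanes`, `compare_eq_mask`, and `blend` are hypothetical; the sketch mirrors the arithmetic of the SSE instructions without using them.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Lanes = std::array<std::uint32_t, 4>;

// G = bit_mask(a == b): all ones where the lanes are equal, zero otherwise.
Lanes compare_eq_mask(const Lanes& a, const Lanes& b) {
    Lanes g{};
    for (std::size_t i = 0; i < g.size(); ++i)
        g[i] = (a[i] == b[i]) ? 0xFFFFFFFFu : 0u;
    return g;
}

// (lhs & ~G) | (rhs & G): take rhs where the mask is set, lhs elsewhere.
Lanes blend(const Lanes& lhs, const Lanes& rhs, const Lanes& g) {
    Lanes out{};
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = (lhs[i] & ~g[i]) | (rhs[i] & g[i]);
    return out;
}
```

With the values from the compare figure, lanes 0 and 3 match, so the blended result takes Rhs in lanes 0 and 3 and keeps Lhs in lanes 1 and 2.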

New Fact of Life
The system must never invent a write to a variable that wouldn’t be written to in an SC execution.

Q: Why?
If you the programmer can’t see all the variables that get written to, you can’t possibly know what locks to take.

Herb Sutter, C++11 Memory Model
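The rule matters for vectorized control flow because a blend-and-store writes every lane, including lanes the source program never assigns. The sketch below, with hypothetical function names, contrasts the two strategies by counting the elements each one writes; it models the behavior in scalar code and does not show real vector hardware.

```cpp
#include <cstddef>

// Blend-style store: every element of lhs is written. Where cond is false
// the old value is written back, which is an invented write.
// Returns the number of elements written.
std::size_t store_blended(int* lhs, const int* rhs, const bool* cond,
                          std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        lhs[i] = cond[i] ? rhs[i] : lhs[i];
    return n;
}

// Masked store: only elements the source program would write are touched.
std::size_t store_masked(int* lhs, const int* rhs, const bool* cond,
                         std::size_t n) {
    std::size_t writes = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (cond[i]) {
            lhs[i] = rhs[i];
            ++writes;
        }
    }
    return writes;
}
```

Both leave the same final values in a single-threaded run. The difference shows only when another thread updates an element whose condition is false: the blended store can overwrite that update with a stale value.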

Vectorizing Control Flow

- Hardware – design load/store instructions
- C++ Language – defines semantics
- Compiler’s vectorizer - Herb to Jim, “wait”

Example for EARLIER: Hardware + Language + Compiler co-design

Alias analysis

• Affects ALL compiler functionality
• Example - Security
• Optimization for eliminating r
• Hardware design

Alias analysis

Original:
*p = 70;
*q = …
n = *p + 30

Optimized:
*p = 70
*q = …
n = 100

Points_To {p} ?= Points_To{q}
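Whether `n = *p + 30` may be folded to `n = 100` depends entirely on whether the store through q can hit the same location as p. A small function (the name is made up for this sketch, and the value stored through q is an arbitrary choice) makes the two outcomes concrete.

```cpp
// The compiler may replace the return expression with 100 only if it can
// prove that p and q never point to the same int.
int store_then_read(int* p, int* q) {
    *p = 70;
    *q = 5;           // may or may not overwrite *p
    return *p + 30;   // 100 when p and q are distinct, 35 when they alias
}
```

Called with two different variables the result is 100; called with the same address twice the result is 35, so folding the expression without alias information would be a correctness bug.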

C++ Alias analysis

Original:
*p = 70;
(*fptr) (a,b)
…
n = *p + 30

Optimized:
*p = 70
(*fptr)(a,b)
…
n = 100

Points_To {fptr} ?= Points_To{q}

C++ Alias analysis – double indirection

Point3d** Fubar (void) {
    Point3d *p, **x;

    p = new Point3d;
    x = &p;
    …
    *x = new Base;   // change the type of p
}

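Visual – “out from underneath you”

[Figure: memory diagram with cells labeled p and x and the value 0x12345678; after "*x = new Base" the object reached through p is a Base rather than a Point3D.]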

Types and alias analysis – “wicked cycle”

void Main ( ) {
    Shape **p, r;
    DerivedShape *q;
    q = new DerivedShape;
    p = &q;
    …
    *p = &r;      // Need alias <*p, q>: “q is now made type-of (r)”
    q->foo();     // De-virtualizing this call depends on type-of (q)
    …
}
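The cycle can be reproduced in a few lines of standard C++. The sketch below adapts the slide's example so that it type-checks: q is declared as `Shape*` instead of `DerivedShape*`, which is an assumption of this illustration, and `foo` returns an integer so the chosen override is observable.

```cpp
struct Shape {
    virtual ~Shape() = default;
    virtual int foo() const { return 1; }
};

struct DerivedShape : Shape {
    int foo() const override { return 2; }
};

// q's target can be changed through p, a pointer to q. Which foo() runs
// depends on whether that hidden update happened.
int call_after_hidden_update(bool retarget) {
    Shape r;
    DerivedShape d;
    Shape* q = &d;
    Shape** p = &q;
    if (retarget) *p = &r;   // hidden update: q now points to a plain Shape
    return q->foo();         // devirtualizing needs the alias <*p, q>
}
```

To replace `q->foo()` with a direct call the compiler must know the dynamic type behind q, and to know that it must first resolve what `*p` aliases: type information and alias information each depend on the other.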

Subset of C++ - at compile time
What if pointer indirections “restricted”…

“a pointer cannot be aliased to another pointer.”

No hidden updates!

Reject double indirection through a pointer that’s had its address taken.

Affects _all_ core technologies we covered

• Code size / stack size / data alignment
• Vectorization/Parallelization of existing C++
• Security
• Parallelizing C++ control flow
• Alias analysis

• FOR ALL HARDWARE & RUNTIMES!!

Processors

[Figure: processor roadmap.]
Process nodes: 32nm, 22nm, 22nm, 14nm, 10nm
Microarchitectures: Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell, Skylake, Skymont
Vector widths: 128 bit SSE, 256 bit AVX, 256 bit AVX(2)
You are here (3D tri-gate transistors)

Summary

True size & scope of compiling C++ at Microsoft.

Programmers - Some core technologies

Hardware & System designers - Maybe work directly with the C++ compiler team
