Compiler++ Evolving the compiler - C2.DLL

Jim Radigan - Architect C++ Optimizer

Mission: Evolving the C++ compiler

1. ~Absolute Correctness
2. Compiler throughput
3. Code size
4. Code quality

$87.7 B

$100.0B+

Evolve the red arrow

3,100,000 Transistors

Ivy Bridge

1.4 Billion Transistors

TEGRA 3 - 5 cores / 128 bit vector instructions

Haswell C++

Built with C++

Windows SQL Office

Mission critical correctness and compile time

Compiler++ “Evolving the compiler”

• How we work

• Core Technologies

• Where we are going

Full compile, test build Windows – N hours
24 cores + 32 GB memory, 3 RAID 0 drives

… if you’re in a hurry – 40 cores

X86, ARM, X64 - retail and checked

N Applications - then stress a compiler’s build

Compiler developer – bad day

Win8 improved – but still a work/life balance thing

“Compiler Business”

• Absolutely NO new compiler optimization switches

• Each switch would cost millions $$

Core Technologies
• Code size / stack size / data alignment
• Vectorization/Parallelization of existing C++
• Security
• Parallelizing C++ control flow
• Alias analysis

• FOR ALL HARDWARE & RUNTIMES!!

Code Size / Stack Size

Foo (int p1, int p2, int p3) {
    int w,x,y,z
    ….
    if (flag) {
        w =
        x = w + z
        …
        return x
    } else {
        y =
    }

[ebp+10] Parameter 3
[ebp+0C] Parameter 2
[ebp+08] Parameter 1
[ebp+04] Return address
[ebp+00] Old ebp
[ebp-04] Local 1 // w
[ebp-08] Local 2 // x
[ebp-0C] Local 3 // z or y
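The shared "z or y" slot can be illustrated with a small packing sketch. This is only an illustration under stated assumptions: the `Lifetime` struct, the program-point ranges and the greedy first-fit strategy are all hypothetical, and nothing here claims to be the algorithm C2.DLL uses.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model: each local is live over a half-open range
// [begin, end) of program points.
struct Lifetime {
    std::string name;
    int begin;
    int end;
};

static bool overlaps(const Lifetime& a, const Lifetime& b) {
    return a.begin < b.end && b.begin < a.end;
}

// Greedy first-fit packing: place each local in the first slot whose
// current occupants are never live at the same time as it.
// Returns one slot index per local.
std::vector<int> pack_stack_slots(const std::vector<Lifetime>& locals) {
    std::vector<std::vector<std::size_t>> slots;  // slot -> indices of locals
    std::vector<int> assignment(locals.size(), -1);
    for (std::size_t i = 0; i < locals.size(); ++i) {
        for (std::size_t s = 0; s < slots.size() && assignment[i] < 0; ++s) {
            bool fits = true;
            for (std::size_t other : slots[s]) {
                if (overlaps(locals[i], locals[other])) {
                    fits = false;
                    break;
                }
            }
            if (fits) {
                slots[s].push_back(i);
                assignment[i] = static_cast<int>(s);
            }
        }
        if (assignment[i] < 0) {  // no existing slot fits: open a new one
            slots.push_back(std::vector<std::size_t>{i});
            assignment[i] = static_cast<int>(slots.size()) - 1;
        }
    }
    return assignment;
}
```

With w and x live across the whole function and z and y live only in opposite branches, the sketch yields three slots instead of four, matching the layout above.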

Stack Packing

?Bind_DeterminePinned@CBase@@UAEXXZ:
638643E0: 8B FF      mov   edi,edi
638643E2: 53         push  ebx
638643E3: 56         push  esi
638643E4: 8B F1      mov   esi,ecx
638643E6: 8B 5E 18   mov   ebx,dword ptr [esi+18h]
638643E9: 8B 46 04   mov   eax,dword ptr [esi+4]
638643EC: F6 C3 01   test  bl,1
638643EF: 74 08      je    638643F9
638643F1: 3B 46 08   cmp   eax,dword ptr [esi+8]
638643F4: 76 1E      jbe   63864414
638643F6: 5E         pop   esi
638643F7: 5B         pop   ebx
638643F8: C3         ret
MORE COLD CODE

No Stack Packing (R1 – R5 reasons for bad code)

?Bind_DeterminePinned@CBase@@UAEXXZ:
639E2840: 8B FF      mov   edi,edi
639E2842: 55         push  ebp                       #R1
639E2843: 8B EC      mov   ebp,esp
639E2845: 51         push  ecx                       #R2
639E2846: 53         push  ebx
639E2847: 56         push  esi
639E2848: 8B F1      mov   esi,ecx
639E284A: 57         push  edi                       #R3
639E284B: 8B 5E 18   mov   ebx,dword ptr [esi+18h]
639E284E: 8B 46 04   mov   eax,dword ptr [esi+4]
639E2851: F6 C3 01   test  bl,1
639E2854: 74 0C      je    639E2862
639E2856: 3B 46 08   cmp   eax,dword ptr [esi+8]
639E2859: 76 3F      jbe   639E289A
639E285B: 5F         pop   edi                       #R4
639E285C: 5E         pop   esi
639E285D: 5B         pop   ebx
639E285E: 8B E5      mov   esp,ebp                   #R5
639E2860: 5D         pop   ebp
639E2861: C3         ret
MORE COLD CODE

It’s all about…

CACHE LINES

NTSTATUS
NtfsCommonRead (
    PIRP_CONTEXT IrpContext,
    PIRP Irp,
    BOOLEAN AcquireScb
    )
{
    NTSTATUS Status;
    PIO_STACK_LOCATION IrpSp;
    PFILE_OBJECT FileObject;
    TYPE_OF_OPEN TypeOfOpen;
    PVCB Vcb;
    PFCB Fcb;
    PSCB Scb;
    PCCB Ccb;
    ATTRIBUTE_ENUMERATION_CONTEXT AttrContext;
    EOF_WAIT_BLOCK EofWaitBlock;
    PFSRTL_ADVANCED_FCB_HEADER Header;
    PTOP_LEVEL_CONTEXT TopLevelContext;
    VBO StartingVbo;
    LONGLONG ByteCount;
    LONGLONG ByteRange;
    ULONG RequestedByteCount;
    PCOMPRESSION_SYNC CompressionSync = ((void *)0);
    BOOLEAN FoundAttribute = 0;
    BOOLEAN PostIrp = 0;
    BOOLEAN OplockPostIrp = 0;
    BOOLEAN ScbAcquired = 0;
    BOOLEAN ReleaseScb;
    BOOLEAN PagingIoAcquired = 0;
    BOOLEAN DoingIoAtEof = 0;
    BOOLEAN Wait;
    BOOLEAN PagingIo;
    BOOLEAN NonCachedIo;
    BOOLEAN SynchronousIo;
    BOOLEAN CompressedIo = 0;

__try {
    NtfsPrePostIrp( IrpContext, Irp );
    if (( (((Fcb->FcbState) & ((0x00000004)))) ) && ( (((Scb->ScbState) & ((0x00000010)))) )) {
        FsRtlPostPagingFileStackOverflow( IrpContext, Event, NtfsStackOverflowRead );
    } else {
        FsRtlPostStackOverflow( IrpContext, Event, NtfsStackOverflowRead );
    }
    (void) KeWaitForSingleObject( Event, Executive, KernelMode, 0, ((void *)0) );
    Status = ((NTSTATUS)0x00000103L);

} __finally {
    if (Resource != ((void *)0)) {
        (ExReleaseResourceLite(Resource));
    }
    ExFreeToNPagedLookasideList( &NtfsKeventLookasideList, Event );
}
} else {
    if (Irp->Tail.Overlay.AuxiliaryBuffer != ((void *)0)) {
        IrpContext->Union.AuxiliaryBuffer = (PFSRTL_AUXILIARY_BUFFER)Irp->Tail.Overlay.AuxiliaryBuffer;
        if (!( (((IrpContext->Union.AuxiliaryBuffer->Flags) & (0x00000001))) )) {
            Irp->Tail.Overlay.AuxiliaryBuffer = ((void *)0);
        }
    }
    Status = NtfsCommonRead( IrpContext, Irp, 1 );
}
break;
}

__except (NtfsExceptionFilter( IrpContext, (struct _EXCEPTION_POINTERS *)_exception_info() )) {
    NTSTATUS ExceptionCode;
    ExceptionCode = _exception_code();
    if (ExceptionCode == ((NTSTATUS)0xC0000123L)) {
        IrpContext->ExceptionStatus = ExceptionCode = ((NTSTATUS)0xC0000011L);
        Irp->IoStatus.Information = 0;
    }
}

EXCEPT

FINALLY

Try Region Graph – asynchronous lifetimes

TRY = x

EXCEPT

TRYX =

FINALLY

int x, y;

_try {
    _try {
        x =
    } _finally {
    }
    = x + …
    y =
} _except (filter()) {
    = y
}

Recall …Compiler dev. primary concern

C++ Core Technologies
• Code size / stack size / data alignment
• Vectorization/Parallelization of existing C++
• Security
• Parallelizing C++ control flow
• Alias analysis

C++ Compiler - Auto Parallelism

Vector - all loads before all stores

B[0] B[1] B[2] B[3]

A[0] A[1] A[2] A[3]

A[0] + B[0] A[1] + B[1] A[2] + B[2] A[3] + B[3]

“addps xmm1, xmm0”

Simple vector add loop - unaligned

for (i = 0; i < 1000/4; i++) {
    movps xmm0, [ecx]
    movps xmm1, [eax]
    addps xmm0, xmm1
    movps [edx], xmm0
}

for (i = 0; i < 1000; i++) A[i] = B[i] + C[i];

Compiler looks across loop iterations !

Auto Parallelism/Vectorization for C++

For ( iv1 = 0; iv1 <= U1; iv1++) {
    For ( iv2 = 0; iv2 <= U2; iv2++) {
        ...
        For ( ivn = 0; ivn <= Un; ivn++) {
            t13 = OPLOAD [ a1*iv1 + a2*iv2 + ... + an*ivn + sym_expression ]
        }
    }
}

Math in the compiler - Legal to vectorize ?

FOR ( j = 2; j <= 5; j++) A( j ) = A (j-1) + A (j+1)

Not Equal !!

A (2:5) = A (1:4) + A (3:6)

A(3) = ?

Vector Semantics
ALL loads before ALL stores

A (2:5) = A (1:4) + A (3:6)

VR1 = LOAD(A(1:4))
VR2 = LOAD(A(3:6))
VR3 = VR1 + VR2        // A(3) = F (A(2) A(4))
STORE(A(2:5)) = VR3

Vector Semantics
Instead - load store load store ...

FOR ( j = 2; j <= 257; j++)
    A( j ) = A( j-1 ) + A( j+1 )

A(2) = A(1) + A(3)
A(3) = A(2) + A(4)     // A(3) = F ( A(1) A(2) A(3) A(4) )
A(4) = A(3) + A(5)
A(5) = A(4) + A(6)
…
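The difference between the two semantics can be checked with a small sketch. The function names `sequential_update` and `vector_update` are made up for illustration, and the arrays use index 0 as padding so that the indices match the 1-based A(j) notation.

```cpp
#include <cstddef>
#include <vector>

// Sequential semantics: load, store, load, store ...
// Iteration j reads the A[j-1] that iteration j-1 has just written.
std::vector<int> sequential_update(std::vector<int> a,
                                   std::size_t lo, std::size_t hi) {
    for (std::size_t j = lo; j <= hi; ++j) {
        a[j] = a[j - 1] + a[j + 1];
    }
    return a;
}

// Vector semantics: ALL loads before ALL stores.
std::vector<int> vector_update(std::vector<int> a,
                               std::size_t lo, std::size_t hi) {
    const std::size_t n = hi - lo + 1;
    std::vector<int> vr1(n), vr2(n);
    for (std::size_t k = 0; k < n; ++k) {
        vr1[k] = a[lo - 1 + k];  // LOAD A(lo-1 : hi-1)
        vr2[k] = a[lo + 1 + k];  // LOAD A(lo+1 : hi+1)
    }
    for (std::size_t k = 0; k < n; ++k) {
        a[lo + k] = vr1[k] + vr2[k];  // STORE A(lo : hi)
    }
    return a;
}
```

For A(1..7) = 1..7 and j = 2..5, the sequential loop leaves A(3) = 8 while the vector form leaves A(3) = 6, which is why this loop is not legal to vectorize as written.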

Doubled the optimizer

A ( a1 * I + c1 ) ?= A ( a2 * I’ + c2)
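One classic, conservative way to answer this question is the GCD test: a1*I + c1 = a2*I’ + c2 has an integer solution only if gcd(a1, a2) divides c2 - c1. The sketch below is a textbook illustration, not a description of what the Visual C++ optimizer implements; `may_depend_gcd` is a hypothetical name and loop bounds are ignored.

```cpp
#include <cstdlib>

// Euclid's algorithm on non-negative inputs.
static long gcd_nonneg(long x, long y) {
    while (y != 0) {
        const long t = x % y;
        x = y;
        y = t;
    }
    return x;
}

// Conservative GCD dependence test for A[a1*i + c1] versus A[a2*i2 + c2].
// Returns false only when no integers (i, i2) can make the two subscripts
// equal; true means "may touch the same element".
bool may_depend_gcd(long a1, long c1, long a2, long c2) {
    const long g = gcd_nonneg(std::labs(a1), std::labs(a2));
    const long diff = c2 - c1;
    if (g == 0) {
        return diff == 0;  // both strides are zero: two fixed elements
    }
    return diff % g == 0;
}
```

For example, A[2*i] and A[2*i’ + 1] can never collide (even versus odd elements), so `may_depend_gcd(2, 0, 2, 1)` reports no dependence.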

for (size_t j = 0; j < numBodies; j++) {
    D3DXVECTOR4 r;

    r.x = A[j].pos.x - pos.x;
    r.y = A[j].pos.y - pos.y;
    r.z = A[j].pos.z - pos.z;

    float distSqr = r.x*r.x + r.y*r.y + r.z*r.z;
    distSqr += softeningSquared;

    float invDist = 1.0f / sqrt(distSqr);
    float invDistCube = invDist * invDist * invDist;
    float s = fParticleMass * invDistCube;

    acc.x += r.x * s;
    acc.y += r.y * s;
    acc.z += r.z * s;
}

Complex C++ Not just arrays!

Legal math ?

void foo(int n, float *a, float *b, float *c) {
    for (int j=0; j<n; j++) {
        *a++ = *b++ + *c++;
    }
}

Legal ? Where’s the base of the array?

void transform1(int * first1, int * last1, int * first2, int * result) {
    while (first1 != last1) {
        *result++ = *first1++ + *first2++;
    }
}

…and where’s the IV?

STL – source code

A ( a1 * I + c1 ) ?= A ( a2 * I’ + c2)

Parallelizing C++ requires transformation to analyze

int synthetic_i;
int synthetic_upper = (last1 - first1 + 4)/4;

for (synthetic_i = 0; synthetic_i < synthetic_upper; synthetic_i++) {
    result[synthetic_i] = first1[synthetic_i] + first2[synthetic_i];
}

STL – source code

while (first1 != last1) {
    *result++ = *first1++ + *first2++;
}

Now …C++ vector code gen

• We don’t know if the array bases overlap
• We don’t know what the target ISA is
• We don’t know if the trip count is divisible by 4

j = 0;
if ( ! overlap(result, first1) && ! overlap(result, first2)) {
    if (_ISA_AVAILABLE(AVX2)) {
        for (i = 0; i + 3 < synthetic_upper; i += 4) {   // Vector + Parallel Loop
            result[i : i+3] = first1[i : i+3] + first2[i : i+3];
        }
        j = i;
    }
}
for ( ; j < synthetic_upper; j++) {   // Sequential or cleanup loop
    result[j] = first1[j] + first2[j];
}
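The versioned loop can be written out as portable C++ to show the shape of the generated code. This is a sketch under assumptions: `ranges_overlap` and `add_arrays_versioned` are hypothetical names, the ISA check is left out, and the 4-wide step is emulated with a small lane array rather than real vector instructions.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical helper: true when [a, a+n) and [b, b+n) share any element.
// std::less gives a well-defined order even for unrelated pointers.
static bool ranges_overlap(const int* a, const int* b, std::size_t n) {
    std::less<const int*> before;
    return before(a, b + n) && before(b, a + n);
}

// Computes result[k] = first1[k] + first2[k] for k in [0, n).
// Returns how many elements were handled by the 4-wide path.
std::size_t add_arrays_versioned(const int* first1, const int* first2,
                                 int* result, std::size_t n) {
    std::size_t j = 0;
    if (!ranges_overlap(result, first1, n) &&
        !ranges_overlap(result, first2, n)) {
        // "Vector" version: four lanes per step, all loads before all stores.
        for (; j + 4 <= n; j += 4) {
            int lanes[4];
            for (std::size_t l = 0; l < 4; ++l) {
                lanes[l] = first1[j + l] + first2[j + l];
            }
            for (std::size_t l = 0; l < 4; ++l) {
                result[j + l] = lanes[l];
            }
        }
    }
    const std::size_t vectorized = j;
    // Sequential version, which also serves as the cleanup loop.
    for (; j < n; ++j) {
        result[j] = first1[j] + first2[j];
    }
    return vectorized;
}
```

With 10 elements and no overlap, 8 go through the 4-wide path and 2 through the cleanup loop; when `result` overlaps an input, everything falls back to the sequential loop so the original load/store order is preserved.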

Vector
Vector + Parallel
SPMD

Maps C++ to all forms of Parallelism

Don’t BSOD… it’s all about lifestyle choices

Heap overflow vulnerability

HRESULT CDocManager::IsValidWMToolsStream(bool* pfValid) {
    long cbSize;
    if (FAILED(hr = ExtractDataSize(strPath, &cbSize)))
        return S_OK;

    CSmartPtr<BYTE> pBuffer = new BYTE[cbSize];
    ExtractData(strPath, pBuffer, cbSize);
    long dwCheckSum = DwChecksumFromLpvCb(0, pBuffer, cbSize);
    long dwStreamCnt = GetStreamCount(m_pVisitedTree);
    if (FAILED(hr = ExtractDataSize(kszCheckSumStream, &cbSize))) {
        return S_OK;
    }

    //ExtractData(kszCheckSumStream, pBuffer, cbSize);
    for (int i=0; i<cbSize; i++) {
        *pBuffer++ = *kszCheckSumStream++;
    }
}

1. cbSize assigned

2. allocate buffer with 4470 bytes

3. cbSize re-assigned

Heap Overflow!
Leads to Hijack
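The overflow comes from copying with a size that was re-assigned after the allocation. A minimal sketch of the missing check follows; `copy_checked` is a hypothetical helper, not the actual fix shipped for this bug, and it assumes the buffer tracks its own capacity.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: the buffer carries its own capacity, and the copy is
// refused when the (re-assigned) size does not fit the original allocation.
bool copy_checked(std::vector<unsigned char>& buffer,
                  const unsigned char* source, std::size_t source_size) {
    if (source == nullptr || source_size > buffer.size()) {
        return false;  // would write past the end of the allocation
    }
    for (std::size_t i = 0; i < source_size; ++i) {
        buffer[i] = source[i];
    }
    return true;
}
```

The design point is that the bound used by the copy is the size of the destination, never a size variable that other code can overwrite between the allocation and the copy.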

IE Aurora - Dangling pointer vulnerability

<html>
<head>
<script>
var e1;
function f1(evt) {
  e1 = document.createEventObject(evt);
  document.getElementById("sp").innerHTML = "";
  window.setInterval(f2, 50);
}
function f2() {
  var t = e1.srcElement;
}
</script>
</head>
<body>
<span id="sp">
  <img src="any.gif" onload="f1(evt)">
</span>
</body>
</html>

1. Pass onload event (evt) to f1

2. Copy evt, but fail to AddRef on CTreeNode!

3. Destroy img tag in span, leading to a free when evt falls out of scope

4. Call f2 async so evt goes out of scope

Hijack! Vtable call via freed CTreeNode
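The missing AddRef in step 2 can be modeled with standard smart pointers. This is an analogy only: `TreeNode`, `EventNoAddRef` and `EventWithAddRef` are hypothetical stand-ins, the real bug involved COM-style reference counts and a raw pointer, and a `std::weak_ptr` is safer than that raw pointer because it can detect that the node is gone.

```cpp
#include <memory>

// Hypothetical stand-in for the DOM node; not the real CTreeNode.
struct TreeNode {
    int tag = 42;
};

// An event copy that does NOT take a reference: it only observes the node.
struct EventNoAddRef {
    std::weak_ptr<TreeNode> src;
};

// An event copy that takes a reference (the AddRef the bug skipped).
struct EventWithAddRef {
    std::shared_ptr<TreeNode> src;
};

bool node_still_alive(const EventNoAddRef& e) {
    return !e.src.expired();
}

bool node_still_alive(const EventWithAddRef& e) {
    return e.src != nullptr;
}
```

Destroying the owner (step 3) frees the node under `EventNoAddRef`, while `EventWithAddRef` keeps it alive until the event copy itself is released.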

• Red is C++ called from JavaScript

Vulnerability: “use after free”

[Figure labels: pointer, heap, vtable, function_1, function_2, attack code, attack data]

Illegal - flow or writes
What if the C++ compiler generated code to check?

• It would have to always be on
• NOT degrade performance !!

Example for : Hardware + Language + Compiler co-design

Control flow 12% win spec2k6\libquantum

quantum_reg_node *node = reg->node;

for (int i=0; i<reg->size; i++) {
    if (node[i].state & ((MAX_UNSIGNED) 1 << control1)) {
        if (node[i].state & ((MAX_UNSIGNED) 1 << control2)) {
            node[i].state ^= ((MAX_UNSIGNED) 1 << target);
        }
    }
}

Nested Control flow - 300% win NumericalRecipes

for (k=1;k<=nn;k++) {
    if (yy[k] > y) {
        xx[k] > x ? ++na : ++nb;
    } else {
        xx[k] > x ? ++nd : ++nc;
    }
}
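One way to see why nested control flow like this can be parallelized is to rewrite the branches as arithmetic on 0/1 comparison results. The sketch below is an illustration, not the transformation the compiler performs: `QuadrantCounts`, `count_branchy` and `count_branchless` are hypothetical names, and indexing is 0-based rather than the 1-based loop on the slide.

```cpp
#include <cstddef>

struct QuadrantCounts {
    int na = 0, nb = 0, nc = 0, nd = 0;
};

// Branchy form, following the loop above.
QuadrantCounts count_branchy(const float* xx, const float* yy,
                             std::size_t n, float x, float y) {
    QuadrantCounts q;
    for (std::size_t k = 0; k < n; ++k) {
        if (yy[k] > y) {
            xx[k] > x ? ++q.na : ++q.nb;
        } else {
            xx[k] > x ? ++q.nd : ++q.nc;
        }
    }
    return q;
}

// Branch-free form: each comparison becomes a 0/1 value that is added, so
// every iteration executes the same straight-line code.
QuadrantCounts count_branchless(const float* xx, const float* yy,
                                std::size_t n, float x, float y) {
    QuadrantCounts q;
    for (std::size_t k = 0; k < n; ++k) {
        const int up = yy[k] > y ? 1 : 0;
        const int right = xx[k] > x ? 1 : 0;
        q.na += up & right;
        q.nb += up & (1 - right);
        q.nd += (1 - up) & right;
        q.nc += (1 - up) & (1 - right);
    }
    return q;
}
```

Both forms produce the same four counts; the second has no data-dependent branches inside the loop body.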

Vectorizing C++ Control Flow

for ( int i = 0; i < 1000; i++) {
    if ( cond[ i ] ) {
        Lhs1[ i ] = Rhs1[ i ]
    } else
        Lhs2[ i ] = Rhs2[ i ]
}

Bistry et al. 1997

Vectorizing C++ Control Flow

for ( int i = 0; i < 1000; i++) {
    if ( c[ i ] ) {
        Lhs1[ i ] = Rhs1[ i ]
    } else
        Lhs2[ i ] = Rhs2[ i ]
}

Bistry et al. 1997

G[0:3] = bit_mask( c[0:3] )
Lhs[0:3] = (Lhs[0:3] & ! G[0:3]) | (Rhs1[0:3] & G[0:3])

G[0:3] = bit_mask(a[i] == b[i] )

27 13 2029 55

27 125 7 55

0xffffffff 0x00000000 0x00000000 0xffffffff

“pcmpeq xmm1, xmm0”

(Lhs[0:3] & ! G[0:3])

Lhs[0] Lhs[1] Lhs[2] Lhs[3]

0x00000000 Lhs[1] Lhs[2] 0x00000000

“pandn xmm1, xmm0”

(Rhs[0:3] & G[0:3])

Rhs[0] Rhs[1] Rhs[2] Rhs[3]

Rhs[0] 0x00000000 0x00000000 Rhs[3]

“pand xmm1, xmm0”

= (Lhs[0:3] & ! G[0:3]) | (Rhs[0:3] & G[0:3])

Rhs[0] 0x00000000 0x00000000 Rhs[3]

0x00000000 Lhs[1] Lhs[2] 0x00000000

Rhs[0] Lhs[1] Lhs[2] Rhs[3]

“por xmm1, xmm3”

Rhs[0] Lhs[1] Lhs[2] Rhs[3]

“movups [esi], xmm3”
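The compare, mask and merge steps above can be emulated lane by lane in portable C++. This sketch uses a 4-element `std::array` in place of an XMM register; `Lanes`, `compare_equal_mask` and `blend` are hypothetical names, and real generated code would use the SSE instructions quoted on the slides.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Lanes = std::array<std::uint32_t, 4>;

// G = bit_mask(a == b): all ones where the lanes are equal, else all zeros.
Lanes compare_equal_mask(const Lanes& a, const Lanes& b) {
    Lanes g{};
    for (std::size_t i = 0; i < g.size(); ++i) {
        g[i] = (a[i] == b[i]) ? 0xffffffffu : 0x00000000u;
    }
    return g;
}

// (lhs & !G) | (rhs & G): keep lhs where the mask is clear, take rhs where
// the mask is set.
Lanes blend(const Lanes& lhs, const Lanes& rhs, const Lanes& g) {
    Lanes out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = (lhs[i] & ~g[i]) | (rhs[i] & g[i]);
    }
    return out;
}
```

With the values from the slides, comparing {27, 13, 2029, 55} against {27, 125, 7, 55} sets the mask in lanes 0 and 3, so the blend yields Rhs[0], Lhs[1], Lhs[2], Rhs[3].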

New Fact of Life
The system must never invent a write to a variable that wouldn’t be written to in an SC execution.

Q: Why?
If you the programmer can’t see all the variables that get written to, you can’t possibly know what locks to take.

Herb Sutter, C++11 Memory Model
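The mask-and-merge form of a conditional store runs into exactly this rule: it writes every element, including the ones the source program never touches. The sketch below counts stores to make that visible; `Cell`, `conditional_store` and `if_converted_store` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <vector>

// A variable that counts how many times it is written.
struct Cell {
    int value = 0;
    int writes = 0;
    void store(int v) {
        value = v;
        ++writes;
    }
};

// Source semantics: only elements with cond[i] set are written.
void conditional_store(std::vector<Cell>& lhs, const std::vector<int>& rhs,
                       const std::vector<bool>& cond) {
    for (std::size_t i = 0; i < lhs.size(); ++i) {
        if (cond[i]) {
            lhs[i].store(rhs[i]);
        }
    }
}

// If-converted form: every element is written, sometimes with its own old
// value. The final values match, but the writes were invented.
void if_converted_store(std::vector<Cell>& lhs, const std::vector<int>& rhs,
                        const std::vector<bool>& cond) {
    for (std::size_t i = 0; i < lhs.size(); ++i) {
        lhs[i].store(cond[i] ? rhs[i] : lhs[i].value);
    }
}

int total_writes(const std::vector<Cell>& cells) {
    int total = 0;
    for (const Cell& c : cells) {
        total += c.writes;
    }
    return total;
}
```

In a single thread the two forms are indistinguishable by value. Another thread that legitimately owns an element with `cond[i]` false could, however, have its update overwritten by the invented store.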

Vectorizing Control Flow

- Hardware – design load/store instructions
- C++ Language – defines semantics
- Compiler’s vectorizer - Herb to Jim, “wait”

Example for EARLIER: Hardware + Language + Compiler co-design

Alias analysis

• Affects ALL compiler functionality
• Example - Security
• Optimization for eliminating r
• Hardware design

Alias analysis

*p = 70;
*q = …
n = *p + 30

*p = 70
*q = …
n = 100

Points_To {p} ?= Points_To{q}
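Why the answer to this question matters can be shown directly: folding n to 100 is only valid when p and q point to different objects. In the sketch below, `store_then_load` is a hypothetical function that mirrors the three statements above, with the elided right-hand side of `*q = …` passed in as a parameter.

```cpp
// Mirrors the sequence above: *p = 70; *q = ...; n = *p + 30.
// q_value stands in for the elided right-hand side of the store through q.
int store_then_load(int* p, int* q, int q_value) {
    *p = 70;
    *q = q_value;
    int n = *p + 30;
    return n;
}
```

Called with two distinct objects the result is 100, but called with the same address for both pointers the store through q changes *p, and with `q_value` equal to 5 the result is 35. The compiler can replace the load with the constant only if it proves the two points-to sets are disjoint.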

C++ Alias analysis

*p = 70;
(*fptr) (a,b)
…
n = *p + 30

*p = 70
(*fptr)(a,b)
…
n = 100

Points_To {fptr} ?= Points_To{q}

C++ Alias analysis – double indirection

Point3d** Fubar (void) {
    Point3d *p, **x;

    p = new Point3d;
    x = &p;
    …
    *x = new Base;   // change the type of p
}

Visual – “out from underneath you”

0x12345678

*x = new Base

Point3D

void Main ( ) {
    Shape **p, r;
    DerivedShape *q;
    q = new DerivedShape;
    p = &q;
    …
    *p = &r
    q->foo();
    …
}

Types and alias analysis – “wicked cycle”

// Need alias <*p, q> “q is now made type-of (r)”
// De-virtualizing this call depends on type-of (q)

Subset of C++ - at compile time
What if pointer indirections “restricted”…

“a pointer cannot be aliased to another pointer.”

No hidden updates!

Reject double indirection through a pointer that’s had its address taken.

Affects _all_ core technologies we covered

• Code size / stack size / data alignment
• Vectorization/Parallelization of existing C++
• Security
• Parallelizing C++ control flow
• Alias analysis

• FOR ALL HARDWARE & RUNTIMES!!

Processors

Process: 32nm, 22nm, 22nm, 14nm, 10nm

Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell, Skylake, Skymont

Vector width: 256 bit AVX(2), 256 bit AVX, 128 bit SSE

You are here (3D tri-gate transistors)

Summary

True size & scope of compiling C++ at Microsoft.

Programmers - Some core technologies

Hardware & System designers - Maybe work directly with the C++ compiler team
