Compiler++ Evolving the compiler - C2.DLL
Jim Radigan - Architect C++ Optimizer
Mission: Evolving the C++ compiler
1. ~Absolute Correctness
2. Compiler throughput
3. Code size
4. Code quality
$87.7 B
$100.0 B +
Evolve the red arrow
3,100,000 Transistors
Ivy Bridge
1.4 Billion Transistors
TEGRA 3 - 5 cores / 128 bit vector instructions
Haswell C++
Built with C++
Windows SQL Office
Mission critical correctness and compile time
Compiler++ “Evolving the compiler” • How we work
• Core Technologies
• Where we are going
Full compile, test build Windows – N hours
24 cores + 32 GB memory, 3 RAID 0 drives
… if you’re in a hurry – 40 cores
X86, ARM, X64 - retail and checked
N Applications - then stress a compiler’s build
Compiler developer – bad day
Win8 improved – but still a work/life balance thing
Compiler++ “Evolving the compiler” • How we work
• Core Technologies
• Where we are going
“Compiler Business”
• Absolutely NO new compiler optimization switches
• Each switch would cost millions $$
Core Technologies
• Code size / stack size / data alignment
• Vectorization/Parallelization of existing C++
• Security
• Parallelizing C++ control flow
• Alias analysis
• FOR ALL HARDWARE & RUNTIMES!!
Code Size / Stack Size
Foo (int p1, int p2, int p3) {
    int w,x,y,z ….
    if (flag) {
        w = x = w + z …
        return x
    } else {
        y =
    }
[ebp+10] Parameter 3
[ebp+0C] Parameter 2
[ebp+08] Parameter 1
[ebp+04] Return address
[ebp+00] Old ebp
[ebp-04] Local 1 // w
[ebp-08] Local 2 // x
[ebp-0C] Local 3 // z or y
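The layout above has z and y sharing one slot because they are never live at the same time. A minimal sketch of that idea follows. It assumes each local is described by a live range, and `LiveRange` and `pack_stack_slots` are hypothetical names for illustration, not the compiler's own data structures or algorithm.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one local: it is live over the half-open
// range of instruction indices [first_def, last_use).
struct LiveRange {
    int first_def;
    int last_use;
};

// Greedy slot assignment: visit locals in order of first definition and
// reuse the first slot whose previous occupant is already dead.
// Returns the slot index chosen for each input local.
std::vector<int> pack_stack_slots(const std::vector<LiveRange>& locals) {
    std::vector<std::size_t> order(locals.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return locals[a].first_def < locals[b].first_def;
                     });

    std::vector<int> slot_free_at;               // slot -> index at which it frees up
    std::vector<int> slot_of(locals.size(), -1);
    for (std::size_t idx : order) {
        const LiveRange& r = locals[idx];
        assert(r.first_def <= r.last_use);
        int chosen = -1;
        for (std::size_t s = 0; s < slot_free_at.size(); ++s) {
            if (slot_free_at[s] <= r.first_def) {
                chosen = static_cast<int>(s);
                break;
            }
        }
        if (chosen < 0) {                        // no dead slot: grow the frame
            chosen = static_cast<int>(slot_free_at.size());
            slot_free_at.push_back(0);
        }
        slot_free_at[chosen] = r.last_use;
        slot_of[idx] = chosen;
    }
    return slot_of;
}
```

With w and x live across the whole function, z live only in the first branch, and y live only in the second, the sketch gives z and y the same slot, so the frame needs three slots instead of four.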
Stack Packing
?Bind_DeterminePinned@CBase@@UAEXXZ:
638643E0: 8B FF     mov edi,edi
638643E2: 53        push ebx
638643E3: 56        push esi
638643E4: 8B F1     mov esi,ecx
638643E6: 8B 5E 18  mov ebx,dword ptr [esi+18h]
638643E9: 8B 46 04  mov eax,dword ptr [esi+4]
638643EC: F6 C3 01  test bl,1
638643EF: 74 08     je 638643F9
638643F1: 3B 46 08  cmp eax,dword ptr [esi+8]
638643F4: 76 1E     jbe 63864414
638643F6: 5E        pop esi
638643F7: 5B        pop ebx
638643F8: C3        ret
MORE COLD CODE
No Stack Packing (R1 – R5 reasons for bad code)
?Bind_DeterminePinned@CBase@@UAEXXZ:
639E2840: 8B FF     mov edi,edi
639E2842: 55        push ebp      #R1
639E2843: 8B EC     mov ebp,esp
639E2845: 51        push ecx      #R2
639E2846: 53        push ebx
639E2847: 56        push esi
639E2848: 8B F1     mov esi,ecx
639E284A: 57        push edi      #R3
639E284B: 8B 5E 18  mov ebx,dword ptr [esi+18h]
639E284E: 8B 46 04  mov eax,dword ptr [esi+4]
639E2851: F6 C3 01  test bl,1
639E2854: 74 0C     je 639E2862
639E2856: 3B 46 08  cmp eax,dword ptr [esi+8]
639E2859: 76 3F     jbe 639E289A
639E285B: 5F        pop edi       #R4
639E285C: 5E        pop esi
639E285D: 5B        pop ebx
639E285E: 8B E5     mov esp,ebp   #R5
639E2860: 5D        pop ebp
639E2861: C3        ret
MORE COLD CODE
It's all about…
CACHE LINES
NTSTATUS
NtfsCommonRead (
    PIRP_CONTEXT IrpContext,
    PIRP Irp,
    BOOLEAN AcquireScb
    )
{
    NTSTATUS Status;
    PIO_STACK_LOCATION IrpSp;
    PFILE_OBJECT FileObject;
    TYPE_OF_OPEN TypeOfOpen;
    PVCB Vcb;
    PFCB Fcb;
    PSCB Scb;
    PCCB Ccb;
    ATTRIBUTE_ENUMERATION_CONTEXT AttrContext;
    EOF_WAIT_BLOCK EofWaitBlock;
    PFSRTL_ADVANCED_FCB_HEADER Header;
    PTOP_LEVEL_CONTEXT TopLevelContext;
    VBO StartingVbo;
    LONGLONG ByteCount;
    LONGLONG ByteRange;
    ULONG RequestedByteCount;
    PCOMPRESSION_SYNC CompressionSync = ((void *)0);
    BOOLEAN FoundAttribute = 0;
    BOOLEAN PostIrp = 0;
    BOOLEAN OplockPostIrp = 0;
    BOOLEAN ScbAcquired = 0;
    BOOLEAN ReleaseScb;
    BOOLEAN PagingIoAcquired = 0;
    BOOLEAN DoingIoAtEof = 0;
    BOOLEAN Wait;
    BOOLEAN PagingIo;
    BOOLEAN NonCachedIo;
    BOOLEAN SynchronousIo;
    BOOLEAN CompressedIo = 0;
__try {
    NtfsPrePostIrp( IrpContext, Irp );
    if (( (((Fcb->FcbState) & ((0x00000004)))) ) && ( (((Scb->ScbState) & ((0x00000010)))) )) {
        FsRtlPostPagingFileStackOverflow( IrpContext, Event, NtfsStackOverflowRead );
    } else {
        FsRtlPostStackOverflow( IrpContext, Event, NtfsStackOverflowRead );
    }
    (void) KeWaitForSingleObject( Event, Executive, KernelMode, 0, ((void *)0) );
    Status = ((NTSTATUS)0x00000103L);
} __finally {
    if (Resource != ((void *)0)) {
        (ExReleaseResourceLite(Resource));
    }
    ExFreeToNPagedLookasideList( &NtfsKeventLookasideList, Event );
}
} else {
    if (Irp->Tail.Overlay.AuxiliaryBuffer != ((void *)0)) {
        IrpContext->Union.AuxiliaryBuffer = (PFSRTL_AUXILIARY_BUFFER)Irp->Tail.Overlay.AuxiliaryBuffer;
        if (!( (((IrpContext->Union.AuxiliaryBuffer->Flags) & (0x00000001))) )) {
            Irp->Tail.Overlay.AuxiliaryBuffer = ((void *)0);
        }
    }
    Status = NtfsCommonRead( IrpContext, Irp, 1 );
}
break;
}
__except (NtfsExceptionFilter( IrpContext, (struct _EXCEPTION_POINTERS *)_exception_info() )) {
    NTSTATUS ExceptionCode;
    ExceptionCode = _exception_code();
    if (ExceptionCode == ((NTSTATUS)0xC0000123L)) {
        IrpContext->ExceptionStatus = ExceptionCode = ((NTSTATUS)0xC0000011L);
        Irp->IoStatus.Information = 0;
    }
}
[Diagram labels: TRY, EXCEPT, TRY, FINALLY, ROOT]
Try Region Graph – asynchronous lifetimes
[Diagram labels: ROOT, TRY (= x), EXCEPT, TRY (x =), FINALLY]
int x, y;
_try {
    _try {
        x =
    } _finally {
    }
    = x + …
    y =
} _except (filter()) {
    = y
}
Recall …Compiler dev. primary concern
C++ Core Technologies
• Code size / stack size / data alignment
• Vectorization/Parallelization of existing C++
• Security
• Parallelizing C++ control flow
• Alias analysis
C++ Compiler - Auto Parallelism
Vector - all loads before all stores
[Diagram: 4-lane vector add in registers xmm0 and xmm1 using "addps xmm1, xmm0"]
B[0]         B[1]         B[2]         B[3]
A[0]         A[1]         A[2]         A[3]
+
A[0] + B[0]  A[1] + B[1]  A[2] + B[2]  A[3] + B[3]   (result in xmm1)
Simple vector add loop - unaligned
for (i = 0; i < 1000/4; i++) {
    movups xmm0, [ecx]
    movups xmm1, [eax]
    addps xmm0, xmm1
    movups [edx], xmm0
}
for (i = 0; i < 1000; i++) A[i] = B[i] + C[i];
Compiler looks across loop iterations !
Auto Parallelism/Vectorization for C++
For ( iv1 = 0; iv1 <= U1; iv1++) {
    For ( iv2 = 0; iv2 <= U2; iv2++) {
        ...
        For ( ivn = 0; ivn <= Un; ivn++) {
            t13 = OPLOAD [ a1*iv1 + a2*iv2 + ... + an*ivn + sym_expression ]
        }
    }
}
Math in the compiler - Legal to vectorize ?
FOR ( j = 2; j <= 5; j++) A( j ) = A (j-1) + A (j+1)
Not Equal !!
A (2:5) = A (1:4) + A (3:6)
A(3) = ?
Vector Semantics
ALL loads before ALL stores
A (2:5) = A (1:4) + A (3:6)
VR1 = LOAD(A(1:4))
VR2 = LOAD(A(3:6))
VR3 = VR1 + VR2 // A(3) = F (A(2) A(4))
STORE(A(2:5)) = VR3
Vector Semantics
Instead - load store load store ...
FOR ( j = 2; j <= 257; j++)
    A( j ) = A( j-1 ) + A( j+1 )
A(2) = A(1) + A(3)
A(3) = A(2) + A(4) // A(3) = F ( A(1) A(2) A(3) A(4) )
A(4) = A(3) + A(5)
A(5) = A(4) + A(6)
…
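To see concretely why the two orderings are "Not Equal", the small sketch below runs the j = 2..5 loop both ways on the same data. The function names and the fixed-size array are assumptions made for illustration only.

```cpp
#include <array>
#include <cassert>

// Index 0 is unused so that the indices match A(1)..A(7) from the slides.
using Arr = std::array<int, 8>;

// Sequential semantics: load, store, load, store ...
// Iteration j reads the value that iteration j-1 just wrote to A(j-1).
Arr run_sequential(Arr a) {
    for (int j = 2; j <= 5; ++j) a[j] = a[j - 1] + a[j + 1];
    return a;
}

// Vector semantics: ALL loads before ALL stores.
// Both operands are read from the original contents of the array.
Arr run_vector(Arr a) {
    std::array<int, 4> vr1{}, vr2{};
    for (int k = 0; k < 4; ++k) {
        vr1[k] = a[1 + k];   // LOAD A(1:4)
        vr2[k] = a[3 + k];   // LOAD A(3:6)
    }
    for (int k = 0; k < 4; ++k) a[2 + k] = vr1[k] + vr2[k];  // STORE A(2:5)
    return a;
}
```

Starting from A(1..7) = 1..7, the sequential version yields A(3) = 8 because it uses the updated A(2), while the vector version yields A(3) = 6 from the original A(2). The loop-carried dependence is what makes naive vectorization illegal here.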
Doubled the optimizer
A ( a1 * I + c1 ) ?= A ( a2 * I’ + c2)
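One classic way to answer this question conservatively is the GCD test: a1*I + c1 = a2*I' + c2 can only have integer solutions when gcd(a1, a2) divides c2 - c1. The sketch below is a textbook illustration, not a description of what the optimizer actually implements. It ignores loop bounds, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>

// Euclid's algorithm on absolute values.
long gcd_long(long a, long b) {
    a = std::labs(a);
    b = std::labs(b);
    while (b != 0) {
        long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Returns false only when A(a1*I + c1) and A(a2*I' + c2) can be proven to
// never touch the same element. Returning true means "may depend": the
// test is conservative and does not take loop bounds into account.
bool gcd_test_may_depend(long a1, long c1, long a2, long c2) {
    long g = gcd_long(a1, a2);
    if (g == 0) return c1 == c2;      // both subscripts are constants
    return (c2 - c1) % g == 0;
}
```

For example, A(2*I) and A(2*I' + 1) never collide (even versus odd elements), while A(I - 1) and A(I' + 1) may, which matches the loop-carried dependence in the earlier A(j) = A(j-1) + A(j+1) example.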
for (size_t j = 0; j < numBodies; j++) {
    D3DXVECTOR4 r;

    r.x = A[j].pos.x - pos.x;
    r.y = A[j].pos.y - pos.y;
    r.z = A[j].pos.z - pos.z;

    float distSqr = r.x*r.x + r.y*r.y + r.z*r.z;
    distSqr += softeningSquared;

    float invDist = 1.0f / sqrt(distSqr);
    float invDistCube = invDist * invDist * invDist;
    float s = fParticleMass * invDistCube;

    acc.x += r.x * s;
    acc.y += r.y * s;
    acc.z += r.z * s;
}
Complex C++ Not just arrays!
Legal math ?
void foo(int n, float *a, float *b, float *c) {
    for (int j=0; j<n; j++) {
        *a++ = *b++ + *c++;
    }
}
Legal ? Where’s the base of the array?
void transform1(int * first1, int * last1, int * first2, int * result) {
    while (first1 != last1) {
        *result++ = *first1++ + *first2++;
    }
}
…and where’s the IV?
STL – source code
A ( a1 * I + c1 ) ?= A ( a2 * I’ + c2)
Parallelizing C++ requires transformation to analyze
int synthetic_i;
int synthetic_upper = (last1 - first1 + 4)/4;

for (synthetic_i = 0; synthetic_i < synthetic_upper; synthetic_i++) {
    result[synthetic_i] = first1[synthetic_i] + first2[synthetic_i];
}
STL – source code
while (first1 != last1) {
    *result++ = *first1++ + *first2++;
}
Now …C++ vector code gen
• We don’t know if the array bases overlap
• We don’t know what the target ISA is
• We don’t know if the trip count is divisible by 4
i = 0;
if (!overlap(result, first1) && !overlap(result, first2)) {
    if (_ISA_AVAILABLE(AVX2)) {
        for (; i + 3 < synthetic_upper; i += 4) {   // Vector + Parallel loop
            result[i : i + 3] = first1[i : i + 3] + first2[i : i + 3];
        }
    }
}
for (j = i; j < synthetic_upper; j++) {   // Sequential or cleanup loop
    result[j] = first1[j] + first2[j];
}
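The versioning scheme above can be written out by hand in plain C++. The sketch below is only an illustration of the shape of the generated code: `ranges_overlap` and `transform_add` are hypothetical names, the ISA check is omitted, and the "vector" body is four scalar operations standing in for one 4-wide instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical runtime overlap check on two n-element int ranges.
bool ranges_overlap(const int* a, const int* b, std::size_t n) {
    auto pa = reinterpret_cast<std::uintptr_t>(a);
    auto pb = reinterpret_cast<std::uintptr_t>(b);
    std::uintptr_t bytes = n * sizeof(int);
    return pa < pb + bytes && pb < pa + bytes;
}

// Hand-written version of the loop the slides derive from the STL-style
// while loop: a guarded 4-wide main loop plus a sequential cleanup loop.
void transform_add(const int* first1, const int* last1,
                   const int* first2, int* result) {
    assert(first1 <= last1);
    const std::size_t n = static_cast<std::size_t>(last1 - first1);
    std::size_t i = 0;
    if (!ranges_overlap(result, first1, n) &&
        !ranges_overlap(result, first2, n)) {
        for (; i + 4 <= n; i += 4) {
            // Stand-in for one vector operation: all loads, then all stores.
            int t0 = first1[i] + first2[i];
            int t1 = first1[i + 1] + first2[i + 1];
            int t2 = first1[i + 2] + first2[i + 2];
            int t3 = first1[i + 3] + first2[i + 3];
            result[i] = t0;
            result[i + 1] = t1;
            result[i + 2] = t2;
            result[i + 3] = t3;
        }
    }
    // Cleanup loop: handles the remainder, or everything if the ranges overlap.
    for (; i < n; ++i) result[i] = first1[i] + first2[i];
}
```

When the output overlaps an input, the guard fails and the whole range runs through the sequential loop, so the result matches the original while loop in both cases.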
Vector
Vector + Parallel
SPMD
Maps C++ to all forms of Parallelism
Don’t BSOD… it’s all about lifestyle choices
C++ Core Technologies
• Code size / stack size / data alignment
• Vectorization/Parallelization of existing C++
• Security
• Parallelizing C++ control flow
• Alias analysis
Heap overflow vulnerability
HRESULT CDocManager::IsValidWMToolsStream(bool* pfValid) {
    long cbSize;
    if(FAILED(hr = ExtractDataSize(strPath, &cbSize)))
        return S_OK;

    CSmartPtr<BYTE> pBuffer = new BYTE[cbSize];
    ExtractData(strPath, pBuffer, cbSize);
    long dwCheckSum = DwChecksumFromLpvCb(0, pBuffer, cbSize);
    long dwStreamCnt = GetStreamCount(m_pVisitedTree);
    if(FAILED(hr = ExtractDataSize(kszCheckSumStream, &cbSize))) {
        return S_OK;
    }

    //ExtractData(kszCheckSumStream, pBuffer, cbSize);
    for(int i=0; i<cbSize; i++) {
        *pBuffer++ = *kszCheckSumStream++;
    }
}
1. cbSize assigned: 4470
2. allocate buffer with 4470 bytes
3. cbSize re-assigned: 4496
Heap Overflow! Leads to Hijack
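The root cause is that the copy length is taken from a variable that no longer matches the allocation. A minimal sketch of the missing check is shown below. `checked_copy` is a hypothetical helper, and tying the bound to the buffer itself is an illustrative choice rather than the fix that was actually shipped.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copies `count` bytes from `src` into `dst` only when
// they fit. The bound comes from the buffer itself, so re-assigning a
// separate size variable cannot silently widen the copy.
bool checked_copy(std::vector<unsigned char>& dst,
                  const unsigned char* src, std::size_t count) {
    if (count > dst.size()) return false;   // would write past the allocation
    for (std::size_t i = 0; i < count; ++i) dst[i] = src[i];
    return true;
}
```

With the numbers from the slide, a 4470-byte buffer rejects a 4496-byte copy instead of overrunning the heap block by 26 bytes.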
IE Aurora - Dangling pointer vulnerability
<html>
<head>
<script>
var e1;
function f1(evt) {
    e1 = document.createEventObject(evt);
    document.getElementById("sp").innerHTML = "";
    window.setInterval(f2, 50);
}
function f2() {
    var t = e1.srcElement;
}
</script>
</head>
<body>
<span id="sp">
    <img src="any.gif" onload="f1(evt)">
</span>
</body>
</html>
1. Pass onload event (evt) to f1
2. Copy evt, but fail to AddRef on CTreeNode!
3. Destroy img tag in span, leading to a free when evt falls out of scope
4. Call f2 async so evt goes out of scope
Hijack! Vtable call via freed CTreeNode
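The role of the missing AddRef can be sketched with standard smart pointers. This is an analogy only: `TreeNode` is a hypothetical stand-in for CTreeNode, `std::shared_ptr` plays the part of a counted reference, and `std::weak_ptr` is used to observe, safely, whether the node was freed.

```cpp
#include <cassert>
#include <memory>

struct TreeNode {        // hypothetical stand-in for CTreeNode
    int id;
};

// Returns true if the node is still alive after its original owner lets go.
// `take_reference` models whether the event copy did its AddRef.
bool node_survives_release(bool take_reference) {
    auto owner = std::make_shared<TreeNode>(TreeNode{42});
    std::shared_ptr<TreeNode> counted;
    if (take_reference) counted = owner;      // AddRef analogue
    std::weak_ptr<TreeNode> observer = owner; // the event copy's view of the node
    owner.reset();                            // "destroy img tag in span"
    return !observer.expired();
}
```

Without the extra reference the node is gone by the time the timer callback runs. A raw pointer in that position would dangle, which is the situation the vtable hijack exploits.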
• Red is C++ called from JavaScript
[Diagram labels: pointer, heap, vtable, function_1, function_2]
Vulnerability: “use after free”
[Diagram labels: attack code, attack data]
Illegal - flow or writes
What if the C++ compiler generated code to check?
• It would have to always be on
• NOT degrade performance !!
Example for : Hardware + Language + Compiler co-design
C++ Core Technologies
• Code size / stack size / data alignment
• Vectorization/Parallelization of existing C++
• Security
• Parallelizing C++ control flow
• Alias analysis
Control flow 12% win spec2k6\libquantum
quantum_reg_node *node = reg->node;
for (int i=0; i<reg->size; i++) {
    if (node[i].state & ((MAX_UNSIGNED) 1 << control1)) {
        if (node[i].state & ((MAX_UNSIGNED) 1 << control2)) {
            node[i].state ^= ((MAX_UNSIGNED) 1 << target);
        }
    }
}
Nested Control flow - 300% win NumericalRecipes
for (k=1;k<=nn;k++) {
    if (yy[k] > y) {
        xx[k] > x ? ++na : ++nb;
    } else {
        xx[k] > x ? ++nd : ++nc;
    }
}
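One way to picture what removing the nested control flow means is to rewrite the quadrant count so that every iteration performs the same straight-line work. The sketch below is a hand-written illustration under that assumption, not the compiler's actual transformation. `QuadCounts` and both function names are hypothetical, and indexing is 0-based instead of the 1-based original.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct QuadCounts {
    int na, nb, nc, nd;
};

// Direct transcription of the branchy loop.
QuadCounts count_branchy(const std::vector<double>& xx,
                         const std::vector<double>& yy, double x, double y) {
    assert(xx.size() == yy.size());
    QuadCounts c{0, 0, 0, 0};
    for (std::size_t k = 0; k < xx.size(); ++k) {
        if (yy[k] > y) {
            xx[k] > x ? ++c.na : ++c.nb;
        } else {
            xx[k] > x ? ++c.nd : ++c.nc;
        }
    }
    return c;
}

// Same result with the branches turned into 0/1 predicates, so each
// iteration is identical straight-line code.
QuadCounts count_branchless(const std::vector<double>& xx,
                            const std::vector<double>& yy, double x, double y) {
    assert(xx.size() == yy.size());
    QuadCounts c{0, 0, 0, 0};
    for (std::size_t k = 0; k < xx.size(); ++k) {
        int above = yy[k] > y ? 1 : 0;
        int right = xx[k] > x ? 1 : 0;
        c.na += above & right;
        c.nb += above & (1 - right);
        c.nd += (1 - above) & right;
        c.nc += (1 - above) & (1 - right);
    }
    return c;
}
```

Once the loop body has no branches, the comparisons and additions map naturally onto packed compare and add instructions.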
Vectorizing C++ Control Flow
for ( int i = 0; i < 1000; i++) {
    if ( cond[ i ] ) {
        Lhs1[ i ] = Rhs1[ i ];
    } else {
        Lhs2[ i ] = Rhs2[ i ];
    }
}
Bistry et al. 1997
Vectorizing C++ Control Flow
for ( int i = 0; i < 1000; i++) {
    if ( c[ i ] ) {
        Lhs1[ i ] = Rhs1[ i ];
    } else {
        Lhs2[ i ] = Rhs2[ i ];
    }
}
Bistry et al. 1997
G[0:3] = bit_mask( c[0:3] )
Lhs[0:3] = (Lhs[0:3] & ! G[0:3]) | (Rhs1[0:3] & G[0:3])
G[0:3] = bit_mask(a[i] == b[i] )
[Diagram: packed compare of registers xmm0 and xmm1 using "pcmpeqd xmm1, xmm0"]
27          13          2029        55
27          125         7           55
==
0xffffffff  0x00000000  0x00000000  0xffffffff   (result in xmm1)
(Lhs[0:3] & ! G[0:3])
[Diagram: registers xmm0 and xmm1 combined using "pandn xmm1, xmm0"]
0xffffffff  0x00000000  0x00000000  0xffffffff
Lhs[0]      Lhs[1]      Lhs[2]      Lhs[3]
&!
0x00000000  Lhs[1]      Lhs[2]      0x00000000   (result in xmm1)
(Rhs[0:3] & G[0:3])
[Diagram: registers xmm2 and xmm3 combined using "pand xmm3, xmm2"]
0xffffffff  0x00000000  0x00000000  0xffffffff
Rhs[0]      Rhs[1]      Rhs[2]      Rhs[3]
&
Rhs[0]      0x00000000  0x00000000  Rhs[3]       (result in xmm3)
= (Lhs[0:3] & ! G[0:3]) | (Rhs[0:3] & G[0:3])
[Diagram: registers xmm1 and xmm3 combined using "por xmm3, xmm1"]
Rhs[0]      0x00000000  0x00000000  Rhs[3]
0x00000000  Lhs[1]      Lhs[2]      0x00000000
or
Rhs[0]      Lhs[1]      Lhs[2]      Rhs[3]       (result in xmm3)
STORE
Rhs[0]      Lhs[1]      Lhs[2]      Rhs[3]       (xmm3)
"movups [esi], xmm3"
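The compare, and-not, and, or sequence can be modelled lane by lane in portable C++. The sketch below emulates a 4-lane register with an array so it runs anywhere. `Lanes`, `compare_equal_mask`, and `blend` are hypothetical names, and a real build would use the SSE instructions shown on the slides instead.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Lanes = std::array<std::uint32_t, 4>;   // stand-in for one xmm register

// Packed-compare analogue: all ones in a lane where a == b, else all zeros.
Lanes compare_equal_mask(const Lanes& a, const Lanes& b) {
    Lanes g{};
    for (int i = 0; i < 4; ++i) g[i] = (a[i] == b[i]) ? 0xffffffffu : 0u;
    return g;
}

// (lhs & !g) | (rhs & g): keep lhs where the mask is clear, take rhs where set.
Lanes blend(const Lanes& lhs, const Lanes& rhs, const Lanes& g) {
    Lanes out{};
    for (int i = 0; i < 4; ++i) {
        std::uint32_t kept = lhs[i] & ~g[i];    // pandn step
        std::uint32_t taken = rhs[i] & g[i];    // pand step
        out[i] = kept | taken;                  // por step
    }
    return out;
}
```

With the values from the slides, lanes 0 and 3 compare equal, so the blended result takes Rhs in lanes 0 and 3 and keeps Lhs in lanes 1 and 2. Note that the final store writes all four lanes, including the two that were logically left unchanged.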
New Fact of Life
The system must never invent a write to a variable that wouldn’t be written to in an SC execution.
Q: Why?
If you the programmer can’t see all the variables that get written to, you can’t possibly know what locks to take.
Herb Sutter, C++11 Memory Model
Vectorizing Control Flow
- Hardware – design load/store instructions
- C++ Language – defines semantics
- Compiler’s vectorizer - Herb to Jim, “wait”
Example for EARLIER: Hardware + Language + Compiler co-design
C++ Core Technologies
• Code size / stack size / data alignment
• Vectorization/Parallelization of existing C++
• Security
• Parallelizing C++ control flow
• Alias analysis
Alias analysis
• Affects ALL compiler functionality
• Example - Security
• Optimization for eliminating r
• Hardware design
Compiler++ “Evolving the compiler” • How we work
• Core Technologies
• Where we are going
Alias analysis
*p = 70;
*q = …
n = *p + 30

*p = 70;
*q = …
n = 100

Points_To {p} ?= Points_To{q}
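Whether n can be folded to 100 depends entirely on whether q may point to the same object as p. The small sketch below makes that visible at run time. `store_then_read` is a hypothetical function written for illustration.

```cpp
#include <cassert>

// Stores through p, then through q, then reads *p back.
// The compiler may replace the return value with 100 only if it can prove
// that q never points to the same int as p.
int store_then_read(int* p, int* q, int q_value) {
    *p = 70;
    *q = q_value;
    return *p + 30;
}
```

Called with two distinct objects the function returns 100. Called with `p == q` and a stored value of 5 it returns 35, so folding the constant without the points-to answer would change the program's behavior.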
C++ Alias analysis
*p = 70;
(*fptr) (a,b) …
n = *p + 30

*p = 70;
(*fptr)(a,b) …
n = 100

Points_To {fptr} ?= Points_To{q}
C++ Alias analysis – double indirection
Point3d** Fubar (void) {
    Point3d *p, **x;
    p = new Point3d;
    x = &p;
    …
    *x = new Base; //change the type of p
}
Visual – “out from underneath you”
[Diagram labels: p, x, address 0x12345678, objects Point3D and Base, and the assignment *x = new Base]
Types and alias analysis – “wicked cycle”
void Main ( ) {
    Shape **p, r;
    DerivedShape *q;
    q = new DerivedShape;
    p = &q;
    …
    *p = &r      // Need alias <*p, q> “q is now made type-of (r)”
    q->foo();    // De-virtualizing this call depends on type-of (q)
    …
}
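The hazard can be reproduced in a few lines of standard C++. The sketch below is an illustrative variant of the slide's example: the class bodies, the `name` method, and `call_after_hidden_update` are assumptions, and `q` is declared as a `Shape*` so the code is well-typed.

```cpp
#include <cassert>
#include <string>

struct Shape {
    virtual ~Shape() = default;
    virtual std::string name() const { return "Shape"; }
};

struct DerivedShape : Shape {
    std::string name() const override { return "DerivedShape"; }
};

// q starts out pointing at a DerivedShape. If the store through p happens,
// q is rebound to a plain Shape without q ever being named in the store.
std::string call_after_hidden_update(bool update_through_alias) {
    Shape base;
    DerivedShape derived;
    Shape* q = &derived;
    Shape** p = &q;                          // *p aliases q
    if (update_through_alias) *p = &base;    // hidden update of q
    return q->name();                        // virtual call target depends on the alias
}
```

A compiler that devirtualized `q->name()` to `DerivedShape::name` without knowing that `*p` aliases `q` would return the wrong answer on the updating path, which is why type information and alias information have to be solved together.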
Subset of C++ - at compile time
What if pointer indirections “restricted”…
“a pointer cannot be aliased to another pointer.”
No hidden updates!
Reject double indirection through a pointer that’s had its address taken.
Affects _all_ core technologies we covered
• Code size / stack size / data alignment
• Vectorization/Parallelization of existing C++
• Security
• Parallelizing C++ control flow
• Alias analysis
• FOR ALL HARDWARE & RUNTIMES!!
Compiler++ “Evolving the compiler” • How we work
• Core Technologies
• Where we are going
Processors
32nm 22nm 22nm 14nm 10nm
Nehalem, Westmere
Sandy Bridge, Ivy Bridge
Haswell, Broadwell
Skylake, Skymont
256 bit AVX(2), 256 bit AVX, 128 bit SSE
You are here (3D tri-gate transistors)
Summary
True size & scope of compiling C++ at Microsoft.
Programmers - Some core technologies
Hardware & System designers Maybe work directly with the C++ compiler team