faster password recovery
| Faster Password Recovery with Modern GPUs | June 14, 2011
WHO ARE WE
§Founded in 1990
§Privately owned
§Doing password recovery (software) since 1998
§HQ and development in Moscow, Russia
§Brought GPUs to password recovery in 2007
§5 US patents issued, more in queue–2 are about GPU-accelerated password recovery
WHO NEEDS PASSWORD RECOVERY?
§Ordinary users–Passwords of their own
§IT Departments–Passwords of the employees
§Security auditors, consultants and penetration testers–Customer/contractor passwords
§Law enforcement & government agencies–Passwords of suspects
§Hackers usually don’t!
WHY DOES SPEED COUNT?
§Users and IT Departments:–«We needed those passwords yesterday»
§Auditors, consultants and pentesters:–«Time is Money»
§Law Enforcement and investigators–Legal time limits
The slow part
PASSWORD RECOVERY | The Loop
[Flowchart: Generate trial password → Transform password (compute hash or encryption key) → Validate hash/key → Success; on Failure → Try next password and repeat]
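The loop on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: a single SHA-1 stands in for the real transform, and the names `transform` and `recover` are made up for the example.

```python
import hashlib
from itertools import product

def transform(password: str) -> bytes:
    # Stand-in transform (assumption): one SHA-1 over the password.
    # Real formats usually iterate a hash many times.
    return hashlib.sha1(password.encode("utf-8")).digest()

def recover(target: bytes, charset: str, max_len: int):
    for length in range(1, max_len + 1):
        for chars in product(charset, repeat=length):  # generate trial password
            candidate = "".join(chars)
            if transform(candidate) == target:         # transform, then validate
                return candidate                       # success
            # failure: fall through and try the next password
    return None

print(recover(transform("cab"), "abc", 3))
```

Nearly all of the running time is spent inside `transform`, which is why the rest of the talk concentrates on that step.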
PASSWORD RECOVERY | The Slow Part
§Designed to be slow–50ms verification time has no impact on usability but HUGE impact on password recovery performance
§Usually designed around well-known hash functions–MD5 (old days)–SHA-1 (most popular so far)–SHA-2 (still exotic)
§Thousands to millions of hash computations per password
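As a rough illustration of "designed to be slow", the sketch below times PBKDF2-HMAC-SHA1 from Python's standard `hashlib` at several iteration counts. The helper names, the salt and the sample passwords are assumptions for the example, and timings will differ between machines.

```python
import hashlib
import time

def derive_key(password: bytes, salt: bytes, iterations: int) -> bytes:
    # PBKDF2 with HMAC-SHA1; the iteration count multiplies the cost of every guess.
    return hashlib.pbkdf2_hmac("sha1", password, salt, iterations, dklen=20)

def seconds_per_guess(iterations: int, samples: int = 3) -> float:
    start = time.perf_counter()
    for i in range(samples):
        derive_key(b"guess%d" % i, b"salt", iterations)
    return (time.perf_counter() - start) / samples

for n in (1, 1_000, 10_000):
    print(n, "iterations:", seconds_per_guess(n), "seconds per guess")
```

A legitimate user pays this cost once per login, while a recovery tool pays it once per candidate password.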
FAST PASSWORD RECOVERY | The CPU Way
Before GPGPU era most optimizations focused on:
§SIMD (MMX, SSE, AVX)
§Multi-core
§Distributed computing (think distributed.net)–Communication overhead–Difficult to manage–Not power-efficient
Done by GPU
FAST PASSWORD RECOVERY | The GPU Way
§Password recovery is an “embarrassingly parallel” workload
§Each processing unit verifies its own password, independently of the other processing units
§Linear scalability in practice
[Flowchart: Generate trial passwords → several "Transform password" units running in parallel → Validate hashes/keys → Success; on Failure → Try next password and repeat]
FAST PASSWORD RECOVERY | The GPU Way
[Diagram: CPU generates trial passwords → Passwords[] sent over PCIe → GPU computes keys from passwords → Keys[] sent back over PCIe → CPU validates keys]
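The data flow in this diagram can be sketched as batch processing. In this hedged example `compute_keys` is a CPU stand-in for the GPU kernel, a single SHA-1 stands in for the real transform, and all function names are hypothetical.

```python
import hashlib
from itertools import islice, product

def generate_batches(charset, length, batch_size):
    # CPU side: produce Passwords[] arrays of at most batch_size candidates.
    candidates = ("".join(p) for p in product(charset, repeat=length))
    while True:
        batch = list(islice(candidates, batch_size))
        if not batch:
            return
        yield batch

def compute_keys(passwords):
    # Stand-in for the GPU kernel: one key per password, each independent.
    return [hashlib.sha1(p.encode()).digest() for p in passwords]

def validate(passwords, keys, target):
    # CPU side: scan the returned Keys[] array for the target.
    for password, key in zip(passwords, keys):
        if key == target:
            return password
    return None

def recover_batched(target, charset, length, batch_size=4):
    for passwords in generate_batches(charset, length, batch_size):
        keys = compute_keys(passwords)   # Passwords[] out, Keys[] back
        hit = validate(passwords, keys, target)
        if hit is not None:
            return hit
    return None

print(recover_batched(hashlib.sha1(b"ba").digest(), "ab", 2))
```

Both arrays cross the bus on every batch, which is the cost the next slide discusses.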
LIMITATIONS
§Works well for “slow” algorithms
§For “fast” algorithms PCIe becomes the bottleneck–e.g. for SHA-1 the theoretical limit is 8 GB/s / (20 bytes in + 20 bytes out) ≈ 214 million passwords per second
§Need to offload everything to the GPU–password generation and key validation on GPU are bigger challenges than crypto itself–especially so without OpenCL
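The 214 million figure can be reproduced with a short calculation if the link rate is read as 8 gigabytes per second (taken here as 8 GiB/s, which is an assumption) and each password costs 20 bytes in each direction:

```python
# Assumption: the link moves 8 GiB per second in total.
PCIE_BYTES_PER_SECOND = 8 * 2**30
# From the slide: 20 bytes in plus 20 bytes out per password.
BYTES_PER_PASSWORD = 20 + 20

limit = PCIE_BYTES_PER_SECOND / BYTES_PER_PASSWORD
print(f"{limit / 1e6:.1f} million passwords per second")  # prints 214.7 million passwords per second
```

With 8 gigabits per second instead, the same formula would give only about 25 million, so the byte-based reading is the one that matches the slide's result.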
ALTERNATIVE WAY
[Diagram: CPU sends only the initial password over PCIe → GPU generates trial passwords (Passwords[]), computes keys from passwords (Keys[]) and validates keys → only the Result is sent back to the CPU over PCIe]
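One common way to generate passwords on the device from a single starting point is to treat each candidate as a number in base `len(charset)`, so every work item can derive its own password from the start index plus its own id. The sketch below shows that mapping in Python; it is an assumed technique for illustration, and the slides do not say how the presenters' generator works.

```python
def index_to_password(index: int, charset: str, length: int) -> str:
    # Interpret index as a fixed-width number in base len(charset).
    base = len(charset)
    chars = []
    for _ in range(length):
        index, digit = divmod(index, base)
        chars.append(charset[digit])
    return "".join(reversed(chars))

def work_item_password(start_index: int, work_item_id: int,
                       charset: str, length: int) -> str:
    # Each work item needs only the shared start index and its own id,
    # so no password array has to be transferred.
    return index_to_password(start_index + work_item_id, charset, length)

print([work_item_password(3, i, "abc", 2) for i in range(4)])
```

Only the start index goes to the device and only a match (if any) comes back, so bus traffic no longer grows with the number of candidates.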
PASSWORD RECOVERY
[Diagram: CPU generates trial passwords → Passwords[] sent over PCIe → GPU computes keys from passwords → Keys[] sent back over PCIe → CPU validates keys]
OVERLAPPING CPU AND GPU
§In a straightforward implementation it may look like this:
CPU: Gen → (wait) → Vfy, Gen → (wait) → Vfy, Gen → (wait) → Vfy
GPU: (idle) → Compute → (idle) → Compute → (idle) → Compute
§But CPU and GPU can work simultaneously, so overlap their operations:
CPU: Gen → Vfy, Gen → Vfy, Gen → Vfy, Gen
GPU: (idle) → Compute → Compute → Compute
Profit!
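The overlap can be sketched with a single worker thread standing in for the GPU: while the "device" computes batch N, the host verifies batch N-1 and prepares batch N+1. The sketch shows only the ordering of operations. The names are hypothetical, SHA-1 stands in for the real transform, and Python threads will not show a real speedup on such small inputs.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def compute_keys(batch):
    # Stand-in for the GPU kernel.
    return [hashlib.sha1(p.encode()).digest() for p in batch]

def verify(batch, keys, target):
    for password, key in zip(batch, keys):
        if key == target:
            return password
    return None

def recover_overlapped(batches, target):
    with ThreadPoolExecutor(max_workers=1) as device:
        in_flight = None                       # (batch, future) being computed
        for batch in batches:                  # Gen: the next batch is ready
            submitted = (batch, device.submit(compute_keys, batch))
            if in_flight is not None:          # Vfy the previous batch meanwhile
                prev_batch, prev_future = in_flight
                hit = verify(prev_batch, prev_future.result(), target)
                if hit is not None:
                    return hit
            in_flight = submitted
        if in_flight is not None:              # drain the last batch
            prev_batch, prev_future = in_flight
            return verify(prev_batch, prev_future.result(), target)
    return None

batches = [["a", "b"], ["c", "d"], ["e"]]
print(recover_overlapped(batches, hashlib.sha1(b"d").digest()))
```

Passing a generator as `batches` makes the generation step itself run on the host while the previous batch is still being computed.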
PERFORMANCE | PBKDF2-SHA1 x 10,000
[Bar chart, computations per second, axis 0K to 60K]
Intel i7-970: 3120
NVIDIA GTX 590: 23500
AMD HD 6990: 50300
HEY, WHY NO 100X SPEEDUP?
Be fair!
§CPUs are not single core any more–Even Atoms are not
§Extended instruction sets were introduced for performance reasons–So why ignore them?
§Will usually get ~10x on comparable hardware for well-suited compute-bound tasks
CPU LAYOUT
§1.2 billion transistors–Most are L3/L2 caches
§Less than 10% are in execution and/or ALU units
[Die diagram: memory controller, two I/O & QPI blocks, queue, shared L3 cache, six cores]
GPU LAYOUT
§3 billion transistors (2.5x)
§About 30% are execution and/or ALU units (3x)
§7.5x more transistors dedicated to execution units
§Core frequency is lower (~0.4x)
§3x estimated speedup
In a fair real-world comparison this GPU is 4x faster than the CPU on a compute-bound task
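The back-of-the-envelope estimate on this slide multiplies three ratios. The sketch below redoes the arithmetic with the slide's figures; treating "less than 10%" as exactly 10% is an assumption made to get a number.

```python
transistor_ratio = 3.0 / 1.2         # GPU vs CPU transistor count, about 2.5x
execution_share_ratio = 0.30 / 0.10  # share in execution units, about 3x (assumes 10% for the CPU)
clock_ratio = 0.4                    # GPU core clock relative to the CPU

execution_transistor_ratio = transistor_ratio * execution_share_ratio
estimated_speedup = execution_transistor_ratio * clock_ratio

print(round(execution_transistor_ratio, 2), round(estimated_speedup, 2))
```

The estimate (about 3x) and the measured figure (4x) are of the same order, which is the point of the slide: a fair comparison gives single-digit factors, not 100x.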
HEY, WHY NO 100X SPEEDUP?
Be fair!
§CPUs are not single core any more–Even Atoms are not
§Extended instruction sets were introduced for performance reasons–So why ignore them?
§Will usually get ~10x on comparable hardware for well-suited tasks
In our case:
§SSE2 code + processor-specific compiler optimizations
§12 threads to fully utilize 6 cores + HT
§16x over high-end CPU
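The 16x figure is consistent with the PBKDF2-SHA1 chart if its values are read as 3120 for the Intel i7-970, 23500 for the NVIDIA GTX 590 and 50300 for the AMD HD 6990. That mapping is an assumption, inferred from the slide text rather than stated in it.

```python
# Computations per second; the device-to-value mapping is an assumption.
rates = {
    "Intel i7-970": 3120,
    "NVIDIA GTX 590": 23500,
    "AMD HD 6990": 50300,
}

baseline = rates["Intel i7-970"]
speedups = {name: rate / baseline for name, rate in rates.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

Under that reading the AMD card comes out at roughly 16x and the NVIDIA card at roughly 7.5x over the already optimized, fully threaded CPU code.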
WHY IS AMD SO FAST?
§Most password transformations are bound by integer performance–AMD cards exhibit awesome integer performance
§Many password transformations (=crypto) make heavy use of bit rotations (=cyclic shifts)–There is a special instruction for this!–Cyclic shift in 1 instruction instead of 3, up to 30% overall speedup in practice
§GPU code written in IL–Utilize all GPU devices under Windows–(Recent APP SDK versions allow this with OpenCL)
PERFORMANCE | bitalign
§AMD IL Specification, section 7.13:
Aligns bit data for video. This is a special instruction for multi-media video.
bitalign dst, src0, src1, src2
dst = (src0 >> src2.x) | (src1 << (32-src2.x))
§Can be used to implement cyclic bit shift in 1 instruction–VERY useful for many crypto algorithms
§Introduced in Evergreen
§Exposed at the IL level
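To see why `bitalign` gives a one-instruction rotate, the formula from the IL specification can be modelled on 32-bit integers in Python. Passing the same value as both sources turns it into a rotate right; the three-operation version (two shifts and an OR) is shown for comparison. The function names are illustrative and inputs are assumed to fit in 32 bits.

```python
MASK32 = 0xFFFFFFFF

def bitalign(src0: int, src1: int, shift: int) -> int:
    # dst = (src0 >> shift) | (src1 << (32 - shift)), truncated to 32 bits
    return ((src0 >> shift) | (src1 << (32 - shift))) & MASK32

def rotr_bitalign(x: int, n: int) -> int:
    # Same value in both sources: the bits shifted out on the right
    # re-enter on the left, i.e. a cyclic shift.
    return bitalign(x, x, n)

def rotl_three_ops(x: int, n: int) -> int:
    # Conventional rotate left: shift, shift, OR.
    return ((x << n) | (x >> (32 - n))) & MASK32

x = 0x80000001
print(hex(rotr_bitalign(x, 1)))        # 0xc0000000
print(hex(rotl_three_ops(x, 31)))      # 0xc0000000, the same rotation
```

A rotate left by n is a rotate right by 32 - n, so the rotations in SHA-1 and MD5 map onto the single instruction directly.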
PERFORMANCE | Bitfield Insert
§AMD Evergreen ISA Reference, page 9-61:
BFI_INT dst, src0, src1, src2
dst = (src1 & src0) | (src2 & ~src0)
§This is a vector bit select
dst_i = (mask_i != 0) ? arg1_i : arg2_i
§Very useful for accelerating various crypto algorithms–And especially for breaking them
§Introduced in Evergreen
§NOT exposed at the IL level–OpenCL bitselect() is not using it either–No documented way to emit this instruction directly
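The sketch below models the instruction as a bit select, reading the slide's formula as `src2` masked by the bitwise complement of `src0`, and shows how the SHA-1 round functions Ch and Maj each reduce to one such select. The helper names are illustrative, and the Maj rewrite is a well-known identity rather than something taken from the slides.

```python
MASK32 = 0xFFFFFFFF

def bfi_int(src0: int, src1: int, src2: int) -> int:
    # Per bit: take src1 where the mask src0 is 1, src2 where it is 0.
    return ((src1 & src0) | (src2 & ~src0)) & MASK32

def sha1_ch(x: int, y: int, z: int) -> int:
    # Ch(x, y, z) = (x & y) | (~x & z): x selects between y and z.
    return bfi_int(x, y, z)

def sha1_maj(x: int, y: int, z: int) -> int:
    # Where x and y differ, z casts the deciding vote; where they agree, take y.
    return bfi_int(x ^ y, z, y)

print(hex(sha1_ch(0xF0F0F0F0, 0xAAAAAAAA, 0x55555555)))  # 0xa5a5a5a5
```

Written with plain AND/OR/XOR/NOT, Ch takes three or four operations and Maj four or five, so a single select per round function adds up over the 80 rounds of SHA-1.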
WHY INTERMEDIATE LANGUAGE
§We chose IL over Brook+–OpenCL did not exist yet–Brook+ programming model was not quite suited for password recovery–ISA provided no significant benefit over IL
§“Early” OpenCL support couldn’t compete with IL either–Limited support for binary (pre-compiled) kernels–Limited support for multi-GPU in OpenCL–(Those issues seem to be fixed in APP SDK 2.4)
§AMD is going to deprecate CAL in next SDK (2.5)–IL will almost certainly be deprecated altogether–This is very bad news for us–Need to decide whether to go up (OpenCL) or down (ISA)–Morning keynote mentioned FSAIL, which seems like a great alternative!
WRITING IN INTERMEDIATE LANGUAGE
§IL doesn’t seem to be designed to be human-friendly–Use scripting languages to generate IL code–And handle platform-specific optimizations (e.g. emulate bitalign on older GPUs)
§Compile kernels at program build time–Avoids runtime compilation–Solves (partially) the IP problem: no source code needs to be distributed–Need to provide new binaries for new devices
§Use CAL at runtime to load, configure and launch pre-compiled kernel
SCALABILITY
§Not all GPUs are equally powerful
§Program should scale nicely with number of processing cores in installed GPU–Query number of processors at runtime–Partition task proportionally to number of processors–Helps to reduce UI update “freezes”–Also helps to avoid TDR
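One way to read "partition task proportionally to number of processors" is to size each kernel launch as a fixed amount of work per processor, so a launch takes roughly the same time on a small and a large GPU. The sketch below follows that reading; the function name and the per-processor figure are assumptions.

```python
def launch_ranges(total_passwords: int, processors: int,
                  passwords_per_processor: int):
    # Yield (start, count) for each kernel launch. The launch size grows with
    # the processor count queried at runtime, so launch duration stays bounded.
    per_launch = processors * passwords_per_processor
    start = 0
    while start < total_passwords:
        count = min(per_launch, total_passwords - start)
        yield start, count
        start += count

print(list(launch_ranges(10, processors=2, passwords_per_processor=2)))
# prints [(0, 4), (4, 4), (8, 2)]
```

Keeping every launch short is what prevents UI freezes and keeps the driver's timeout detection (TDR) from resetting the device.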
SCALABILITY
§8 GPUs are not uncommon today
§Program should scale nicely with number of GPUs–Query number of devices in system–Spawn thread for each device–Partition task as appropriate
§Speedup should be linear unless you hit PCIe bandwidth limits
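The multi-GPU recipe (query devices, spawn a thread per device, partition the task) can be sketched with plain Python threads. The device weights and the `worker` callback are hypothetical placeholders; a real program would take the weights from the runtime's device query.

```python
import threading

def split_range(total: int, weights):
    # Split [0, total) into contiguous (start, count) slices proportional to weights.
    slices, start, acc = [], 0, 0
    weight_sum = sum(weights)
    for w in weights:
        acc += w
        end = total * acc // weight_sum
        slices.append((start, end - start))
        start = end
    return slices

def run_on_devices(total: int, weights, worker):
    # One host thread per device; worker(device_id, start, count) does the work.
    results = [None] * len(weights)

    def run(device_id, start, count):
        results[device_id] = worker(device_id, start, count)

    threads = [threading.Thread(target=run, args=(i, s, c))
               for i, (s, c) in enumerate(split_range(total, weights))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(split_range(100, [1, 1, 2]))  # [(0, 25), (25, 25), (50, 50)]
```

Because the slices are independent, throughput should grow with the number of devices until the shared bus becomes the limit.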