Case Study of the Unexplained
Uploaded by shannomc
![Page 1: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/1.jpg)
![Page 2: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/2.jpg)
Case Study of Programmer Nightmares
Shannon’s Edition 20120624
![Page 3: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/3.jpg)
What is the talk about
• Inspired by Mark Russinovich’s presentation
– Case of the Unexplained
– http://technet.microsoft.com/en-us/sysinternals/bb963887
• Here are my cases
– Mainly fixing programming problems
– Mostly C++, some interop & cross-platform.
– Most are from my bad memory.
– Sorry about the boring slides.
![Page 4: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/4.jpg)
Steps to Debug a problem
1. There is no step 1
2. See Step 1
3. ???
4. Profit
![Page 5: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/5.jpg)
General Guidelines
• Reproducible test case.
• Learn the tools.
• Make a Wild A** Guess (WAG) on the source.
• Be persistent.
– Grind through it.
• Ask someone else to handle it. (NOT ME)
![Page 6: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/6.jpg)
Case: WOMM
![Page 7: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/7.jpg)
Problem: Debug vs Release
• Program does not draw the circle around the cursor, but does draw it where the user clicks.
• Same class does both drawings, at different locations.
• Did work previously.
[Figure: Debug build vs. Optimized build]
![Page 8: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/8.jpg)
Causes of Optimization Problems
1. Undefined behaviors
   1. Uninitialized memory
   2. Overflows/underflows
2. Thread problems.
3. Code or data is wrong.
…
999. Compiler bug (not likely, see #1)
1000. Hardware/OS/driver bug.
![Page 9: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/9.jpg)
Steps I took
• What’s changed?
– Major merge with other branch.
– Massive file and project settings changes.
• Build optimized with debug symbols & debug it
– Execution could jump around a lot
– Local variables will be missing or wrong*
– this pointer only valid on member function entry.
• Compare working/non-working objects
![Page 10: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/10.jpg)
Found the Function
• Formula:
$$I = \frac{A \cdot x + B \cdot y + C}{D \cdot w + E \cdot h + F + 1}$$
• D, E, F = 0, so have this, and verified all inputs.
$$I = A \cdot x + B \cdot y + C$$
![Page 11: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/11.jpg)
Next Trick: Binary Search
• Turning on/off optimizations
– Per library
– Per file
– Per function
– Per optimization
• Found: Global Optimization “caused” the problem
– Last merge turned it on.
– Turned it off. Everything works.
![Page 12: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/12.jpg)
Extremely Important Rule
• Unless you understand why the problem is fixed, it’s not fixed. The problem is likely still there, just hidden better.
![Page 13: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/13.jpg)
Missed something important
• Formula:
$$I = \frac{A \cdot x + B \cdot y + C}{D \cdot w + E \cdot h + F + 1}$$
• D, E, F = 0, so have this, and verified all inputs
$$I = \frac{A \cdot x + B \cdot y + C}{1}$$
![Page 14: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/14.jpg)
Let’s Talk This Out
• w & h were uninitialized, but that can’t be it. (FALSE)
• 0 times any number is 0. (TRUE)
• w & h are numbers. (MAYBE)
– Double precision IEEE 754
• IEEE 754 only contains numbers. (FALSE)
– Contains ±0, ±INF, … NaN
![Page 15: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/15.jpg)
NaN is weird.
• Any operation with NaN results in NaN
– *, +, -, /, sin, etc.
• Most comparisons with NaN are false.
– <, <=, >, ==, etc., so NaN == NaN is false
• Not equals is always true.
– NaN != NaN is true.
• Multiple types
– QNaN, SNaN
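The NaN rules above can be checked directly. The following is a minimal sketch, not the code from the project in the talk: `zero_times` and `is_nan_by_self_compare` are hypothetical helper names, and the NaN is created explicitly because reading genuinely uninitialized memory is undefined behavior.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// "0 times any number is 0" holds only if w really is a number.
double zero_times(double w) {
    return 0.0 * w;
}

// NaN is the only double value that compares unequal to itself.
bool is_nan_by_self_compare(double v) {
    return v != v;
}

// Small self-check of the rules listed on the slide.
void nan_self_check() {
    const double nan = std::numeric_limits<double>::quiet_NaN();
    assert(zero_times(5.0) == 0.0);       // ordinary number: result is 0
    assert(std::isnan(zero_times(nan)));  // NaN input: result is NaN, not 0
    assert(!(nan == nan));                // == with NaN is false
    assert(nan != nan);                   // != with NaN is true
    assert(!(nan < 1.0) && !(nan > 1.0)); // ordered comparisons are false
}
```

Calling `nan_self_check()` from `main` should pass silently on an IEEE 754 platform, provided the compiler is not run with options such as fast-math that relax NaN handling.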
![Page 16: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/16.jpg)
Case Closed
• Should have trusted 1st guess.
• Gave up too soon with a quick wrong fix.
![Page 17: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/17.jpg)
Case: Works Everywhere Else
![Page 18: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/18.jpg)
Problem
• 6-8 high priority bugs from FAT.
• All bugs had the same pattern.
– Only occurred on the Windows 2000 box.
– Displayed wrong converted values.
– Works on XP and 2003.
• It’s cross-platform, so assign it to me.
![Page 19: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/19.jpg)
Steps I took
• Start debug build of integration branch.
• Get the release, and try to reproduce the bug.
– Grabbed it from the build NFS share.
– Didn’t “fail”
• Try the test box.
– It “fails”, but can’t debug.
– Copy it to dev box
– It fails on my box.
![Page 20: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/20.jpg)
WAG time
• Cosmic rays corrupted the executable.
– No: replacing the files with the debug build still had the bug.
![Page 21: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/21.jpg)
What Could It Be
• Diff installed files w/ what should be there.
– Should be no differences
• Massive differences.
• Install CD didn’t have any differences.
• I know what happened.
AAAAAAHH!!! Stupid Tester
![Page 22: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/22.jpg)
Here is What Happened
• Tester skipped using the Win2K install CD.
– Didn’t want to walk to other end of hall.
• TAR-ed up the NFS install share.
• FTP-ed it over.
• Used WinZip to untar the file.
![Page 23: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/23.jpg)
Why is WinZip Bad?
![Page 24: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/24.jpg)
Case Closed
• The “table.dat” file was converted to Windows newlines.
– Doesn’t work properly like that.
• All testers: follow proper procedures.
• Don’t take shortcuts.
– Especially during FAT.
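To illustrate why a newline conversion can break a data file, here is a small sketch. It only simulates the LF to CR LF rewrite; the slides do not say whether table.dat was text or binary, and `lf_to_crlf` and `contains_crlf` are hypothetical helpers, not code from the product.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simulates a newline "fix-up": every LF not already preceded by CR becomes CR LF.
std::string lf_to_crlf(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\n' && (i == 0 || in[i - 1] != '\r')) {
            out.push_back('\r');
        }
        out.push_back(in[i]);
    }
    return out;
}

// Cheap sanity check a loader could run before trusting a data file.
bool contains_crlf(const std::string& data) {
    return data.find("\r\n") != std::string::npos;
}

// Self-check: a 3-byte record that happens to contain byte 0x0A grows to 4 bytes,
// so every fixed offset after that byte is now wrong.
void newline_self_check() {
    const std::string record("\x01\x0A\x02", 3);
    const std::string converted = lf_to_crlf(record);
    assert(record.size() == 3);
    assert(converted.size() == 4);
    assert(contains_crlf(converted));
}
```

A check like `contains_crlf` on load would have turned the six to eight mysterious bugs into one clear "file was altered" error.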
![Page 25: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/25.jpg)
Case: Psychic Debugging
![Page 26: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/26.jpg)
Problem: Phone Call
1. Got a phone call
2. Developer described the problem and the steps taken to track it down.
3. Answered with the root cause and how to fix it.
Now it’s time for the interactive part of this talk.
Pretend you are me, ….
![Page 27: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/27.jpg)
Real Problem
• File parsing code incorrectly errors out.
– Worked on the following
  • Windows 32/64-bit debug/release
  • Irix 32/64-bit debug/release
  • Solaris SPARC 32/64-bit debug/release
  • Linux 64-bit debug/release, 32-bit debug
– Fails on Linux 32-bit x86 gcc optimized
![Page 28: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/28.jpg)
What does the code do?
• Reads a text-like file
– Contains repeated floating point numbers.
– Lots of other data between the repeated numbers.
• Parses data into native types (int, double)
• Validates that the data is sane
– Numbers are within spec.
– Repeated doubles are the same, using a != check.
• This step failed.
![Page 29: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/29.jpg)
Code
    double lat1 = atof(buff1);
    …
    double lat2 = atof(buff2);
    …
    if (lat1 != lat2) return -1;
![Page 30: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/30.jpg)
I’m 95% certain of the problem.
Write down your answer now.
More info from the developer
![Page 31: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/31.jpg)
Additional Q&A with Developer
• Did they check the input file is valid? YES
• How did the developer track it down?
– Printf debugging: numbers printed the same, but the check failed.
• Did adding/moving additional printfs make the problem go away? YES
– This confirmed that I guessed right
![Page 32: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/32.jpg)
Your Turn
• Failed on 32-bit x86 optimized Linux
• Deals with C++ native double types
– uses != to compare them.
• Adding some printfs made the problem go away.
Who knows what happened?
![Page 33: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/33.jpg)
Additional Slide If No One Knows
• Root cause is the 486
• Specifically the math co-processor (x87 FPU)
• C++ doubles are 64 bits in memory
• 486 math registers are 80 bits
• Can’t store 80 bits in 64 bits
• Doubles are rounded when copied into memory.
• Optimizer will speed up code
– Will attempt to reduce the # of memory copies.
• Wait here until some guesses.
![Page 34: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/34.jpg)
Here is what happened
• Function converted 1st string to 80-bit double
• Compiler moved result into 64 bits on the stack
• Function converted 2nd string to 80-bit double
• Compiler got smart and kept it in 80 bits.
• Loaded 1st 64-bit double into 80-bit register.
• 2nd number has more precision so it didn’t match.
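The store-then-reload effect described above can be imitated in portable C++ by using `long double` as a stand-in for the 80-bit register. This is only a sketch under that assumption: `survives_store_and_reload` and `kHasExtendedPrecision` are hypothetical names, and on platforms where `long double` is the same as `double` nothing is lost and the effect does not appear.

```cpp
#include <cassert>
#include <limits>

// True when long double carries more mantissa bits than double (for example the
// 80-bit x87 format); false where the two types are identical.
constexpr bool kHasExtendedPrecision =
    std::numeric_limits<long double>::digits > std::numeric_limits<double>::digits;

// Rounds a wide value to 64 bits (the "store to the stack" step), widens it again
// (the "load back into a register" step), and compares it with the untouched value.
bool survives_store_and_reload(long double wide) {
    volatile double stored = static_cast<double>(wide);  // volatile forces a real 64-bit store
    const long double reloaded = stored;
    return reloaded == wide;
}

// Self-check: values exactly representable in 64 bits always survive the round trip.
void precision_self_check() {
    assert(survives_store_and_reload(0.5L));
    assert(survives_store_and_reload(1024.0L));
}
```

A value such as `1.0L / 3.0L` needs every available mantissa bit, so it only survives the round trip when `kHasExtendedPrecision` is false. That is the same mismatch the optimized build hit when it compared a reloaded 64-bit value against one still held at 80 bits.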
![Page 35: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/35.jpg)
Optimized ASM Code
    call atof                 ; buff1 in eax
    fstor [sp+20], ST(0)
    …
    …
    call atof                 ; buff2 in eax
    fload ST(1), [sp+20]
    fcmp ST(0), ST(1)         ; compare 80 w/ 64-bits
    jmpe +8                   ; skip over next line if ==
    ret                       ; error
![Page 36: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/36.jpg)
Case Closed
• Changed to use strcmp instead.
• Never directly compare doubles without a tolerance.
• Rounding errors will cause the mathematically impossible to happen.
• Stupid 80-bits.
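The actual fix in this case was strcmp on the original text. For code that must compare parsed doubles, a tolerance check along the following lines is one common option. This is a sketch: `nearly_equal` and its default tolerances are assumptions chosen for illustration and would need tuning for real data.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Returns true when a and b differ by no more than a relative tolerance
// (scaled by the larger magnitude) or a small absolute tolerance near zero.
// Any comparison involving NaN returns false.
bool nearly_equal(double a, double b, double rel_tol = 1e-9, double abs_tol = 1e-12) {
    assert(rel_tol >= 0.0 && abs_tol >= 0.0);
    const double diff = std::fabs(a - b);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(abs_tol, rel_tol * scale);
}
```

With this helper, the failing check would read `if (!nearly_equal(lat1, lat2)) return -1;`, which no longer depends on whether one operand was kept in a wider register.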
![Page 37: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/37.jpg)
Case: Shoot Self in Foot
![Page 38: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/38.jpg)
Problem: Crash with No Reason
• Newly developed code
• Crashed on Solaris while calling a constructor
• No “obvious” problem with the code
![Page 39: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/39.jpg)
Code
    class A {
        …
        A(A *d) { *this = d; }
        …
        A& operator=(const A &d) { … return *this; }
    };
![Page 40: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/40.jpg)
Steps I took
• Build code on Windows.
– Visual Studio Debugger is 10x nicer
• Got a helpful warning
– warning C4717: ‘A::A’ : recursive on all control paths, function will cause runtime stack overflow
![Page 41: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/41.jpg)
Code Again
    class A {
        A(A *d) { *this = d; }
        A& operator=(const A &d) {…}
    };
![Page 42: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/42.jpg)
What the Compiler Does
    class A {
        A(A *d) {
            A __tempA(d);               // implicit conversion A* -> A calls A(A*) again
            this->operator=(__tempA);
        }
        A& operator=(const A &d) {…}
    };
![Page 43: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/43.jpg)
Solution #1
    class A {
        A(A *d) { *this = *d; }
        A& operator=(const A &d) {…}
    };
![Page 44: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/44.jpg)
Problem With Solution #1
• What does the following code do?
    A d = NULL;
• The compiler does the following:
    A d = A(NULL);
• Which crashes.
• “A d = 0” also crashes.
![Page 45: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/45.jpg)
Solution #2
    class A {
        explicit A(A *d) { *this = *d; }
        A& operator=(const A &d) {…}
    };
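The effect of `explicit` can be verified at compile time. The sketch below uses a hypothetical class `Thing` as a stand-in for the slides' class A; the null check and the `copy_from` helper are additions for illustration, not part of the original code.

```cpp
#include <cassert>
#include <type_traits>

// "Thing" is a hypothetical stand-in for the slides' class A.
class Thing {
public:
    Thing() : value(0) {}

    // explicit: a pointer (or NULL/0) can no longer silently become a Thing.
    // The null check is an addition for this sketch.
    explicit Thing(const Thing* src) : value(src ? src->value : 0) {}

    int value;
};

// Implicit conversion from a pointer is rejected by the compiler...
static_assert(!std::is_convertible<Thing*, Thing>::value,
              "Thing* must not convert implicitly");
// ...while deliberate, direct construction still works.
static_assert(std::is_constructible<Thing, Thing*>::value,
              "direct construction from a pointer still works");

// Helper that makes the non-null requirement visible at the call site.
Thing copy_from(const Thing* src) {
    assert(src != nullptr);
    return Thing(src);
}
```

With this class, `Thing d = 0;` fails to compile instead of crashing at run time, while `Thing b(&a);` still copies from an existing object.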
![Page 46: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/46.jpg)
C++ “Rule of 3” Solution
    class A {
        A(const A &d) {…}
        ~A() {…}
        A& operator=(const A &d) {…}
    };
![Page 47: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/47.jpg)
C++11 “Rule of 3,4, or 5” Solution
    class A {
        A(const A &d) {…}
        A(A &&d) {…}
        ~A() {…}
        A& operator=(const A &d) {…}
        A& operator=(A &&d) {…}
    };
![Page 48: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/48.jpg)
Case Closed
• Pay attention to compiler warnings.
– This particular warning appeared in 3 other places.
• Use a compiler that gives better warnings.
– CLANG/LLVM has the best errors/warnings.
![Page 49: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/49.jpg)
Case: “Random” Crashes
![Page 50: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/50.jpg)
Problem: GUI randomly crashes
Java
Automagic JNI Junk
C++
![Page 51: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/51.jpg)
Steps I took
• Build debug
– Debug runtimes make it crash faster due to checks
• Use 2 debuggers: Visual Studio & JBuilder
• 4 hours of persistence.
![Page 52: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/52.jpg)
Track it down, but no clue
• Java had a valid pointer to the C++ object.
• Pressed a button, & the pointer was no longer valid
• Trick time.
![Page 53: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/53.jpg)
Data Breakpoint.
• x86 has 4 hardware data breakpoints
– Program runs at full speed.
– 1 is reserved by OS
• Must take the following form. (Old info)
– Memory address, length (must be 4).
– 0x12345678,4
![Page 54: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/54.jpg)
How to do it VS2010
• Step 1
![Page 55: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/55.jpg)
How to do it VS2010
• Step 2
![Page 56: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/56.jpg)
How to do it VS2010
• Step 3 Done
![Page 57: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/57.jpg)
How to do it VS2010
• Step 4 See Results
![Page 58: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/58.jpg)
BAM Data Changed
• Java GC → finalizer → Automagic JNI junk → delete object
• Why? Leaky abstraction.
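The slides do not show the generated binding code, so the following is only a hypothetical model of the failure in plain C++. A `Wrapper` stands in for the Java-side proxy, its destructor plays the role of the finalizer, and `NativeThing` is an invented C++ object. The sketch shows one way to make the lifetime rule explicit: if the native container shares ownership, the object outlives the wrapper; if it does not, the object dies with the wrapper.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

struct NativeThing { int id; };  // hypothetical C++ object

// Stand-in for a generated Java wrapper; its destructor plays the finalizer.
class Wrapper {
public:
    explicit Wrapper(std::shared_ptr<NativeThing> obj) : obj_(std::move(obj)) {
        assert(obj_ != nullptr);
    }
    // Non-owning view used to observe whether the native object still exists.
    std::weak_ptr<NativeThing> watch() const { return obj_; }
private:
    std::shared_ptr<NativeThing> obj_;
};

// Returns true if the native object is still alive after the wrapper is "finalized".
bool alive_after_finalize(bool array_also_owns) {
    std::vector<std::shared_ptr<NativeThing>> native_array;
    std::weak_ptr<NativeThing> watcher;
    {
        auto obj = std::make_shared<NativeThing>(NativeThing{42});
        if (array_also_owns) {
            native_array.push_back(obj);
        }
        Wrapper wrapper(obj);
        watcher = wrapper.watch();
    }  // wrapper and obj go out of scope here: the "GC + finalizer" step
    return !watcher.expired();
}
```

In the real bug a raw pointer was left dangling. Here the `weak_ptr` makes the same event observable without undefined behavior.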
![Page 59: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/59.jpg)
Here is What Happened.
[Diagram: Java side vs. C++ side; labels: ARRAY, AMJJArray, AMJJThing]
![Page 60: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/60.jpg)
Case Closed
• Data breakpoints rule.
• All abstractions leak.
– Know how before proceeding.
![Page 61: Case Study of the Unexplained](https://reader031.fdocuments.us/reader031/viewer/2022012922/55d55308bb61eb08598b474f/html5/thumbnails/61.jpg)
That’s all for Now
Questions, comments, etc.