Bytecode Decompilation: Typing
Etienne M. Gagnon, Laurie J. Hendren and Guillaume Marceau
McGill University
COMP 621: Static Analysis & Transformations
Presented by Alexandre Beaulieu
March 29, 2012
Preliminaries Type Inference Three Stage Algorithm Conclusion
Outline
1 Preliminaries
2 Type Inference
3 Three Stage Algorithm
4 Conclusion
Etienne M. Gagnon et al Bytecode Decompilation: Typing 03/29/2012 2 / 46
Previously on Dava
Dava: Compiler-agnostic Java bytecode decompiler
Produces very clean, human readable high level output
Executes efficiently (Under 5 seconds per method decompilation)
Optimizes output for human readability
Handles obfuscated bytecode
Dava at a Glance
1 Bytecode
2 Jimple
3 Grimp
4 Control Flow Graph
5 Structure Encapsulation Tree
6 Abstract Syntax Tree
7 Java
Java Bytecode: A Refresher
Important IR (Lots of JIT compilers and interpreters for it. Popular)
Supported by modern web browsers
A lot of languages compile down to it (Ada, ML, Scheme, Eiffel, Perl, ...)
Verifiable bytecode has interesting properties
  Guaranteed to be well-behaved (not well-typed)
  Contains some basic type information (Method signatures, Class hierarchy)
However, bytecode has some negative aspects
  Not ideal for program analysis and optimization (Expression Stack!)
  Does not work so well for register allocation (Expression Stack!)
  Not easy to understand (Low-level representation)
Decompiling Goals (Specific to Dava)
Compiler-Agnosticism: Any verifiable bytecode should decompile properly
Efficiency: Decompiling should be done within reasonable time
Readability: Code should be easy to read for humans
Correctness: Code should be correct and preserve original behaviour
Intermediate Representations
In order to facilitate decompilation, Dava works with multiple IRs
Need some useful type information in order to generate accurate output
Type information in bytecode is insufficient
We need some powerful type inference
Jimple: three-address code representation
This paper focuses on a static type inference algorithm for Jimple
Grimp: Aggregated Jimple, Dava’s input.
A Closer Look at Jimple
Three-address-code
Transforms stack based operations into variable based operations
Preserves all the type information provided by the bytecode
Program analyses are much easier to run on Jimple
Makes it an ideal candidate for static type inference
And from Bytecode, He Created Jimple
Transforming bytecode to Jimple is very straightforward. Here is the magical recipe:
1 Compute stack depth at each program point (Those of you who took COMP 520 are free to feel nostalgic now)
2 Introduce a new local variable for each stack depth
3 Rewrite the instruction stream using the shiny new local variables
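The recipe above can be sketched in a few lines of Java. This is a minimal illustration, not Soot's actual implementation: the class name StackToLocals is hypothetical, only iload_N, istore_N and iadd are handled, and the code is assumed to be straight-line (with branches, the stack depth at each program point would come from a dataflow pass).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: rewrites a tiny subset of stack bytecode into
// three-address statements, using one local s_d per stack depth d.
final class StackToLocals {
    static List<String> translate(List<String> bytecode) {
        List<String> out = new ArrayList<>();
        int depth = 0; // stack depth before the current instruction
        for (String insn : bytecode) {
            if (insn.startsWith("iload_")) {
                // push: the new top of stack lives in s_(depth + 1)
                depth++;
                out.add("s_" + depth + " = l_" + insn.substring("iload_".length()));
            } else if (insn.startsWith("istore_")) {
                // pop: the stored value comes from s_depth
                out.add("l_" + insn.substring("istore_".length()) + " = s_" + depth);
                depth--;
            } else if (insn.equals("iadd")) {
                // pop two operands, push the result one slot lower
                out.add("s_" + (depth - 1) + " = s_" + (depth - 1) + " + s_" + depth);
                depth--;
            } else {
                throw new IllegalArgumentException("unsupported instruction: " + insn);
            }
        }
        return out;
    }
}
```

Calling StackToLocals.translate on iload_1, iload_2, iadd, istore_1 yields the four statements s_1 = l_1, s_2 = l_2, s_1 = s_1 + s_2, l_1 = s_1.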
Transformation For the Visual People
Bytecode (comments show stack depth before -> after):
iload_1     // 0 -> 1
iload_2     // 1 -> 2
iadd        // 2 -> 1
istore_1    // 1 -> 0

Three-address form:
s_1 = l_1
s_2 = l_2
s_1 = s_1 + s_2
l_1 = s_1
Type Inference: The General Idea
Using the limited type information available from bytecode, collect type constraints for each identifier in the program
Using those constraints, build a constraint problem
Formulate the problem as a graph problem using the constraints and known type hierarchy
  Variable types are called soft nodes
  Nodes belonging to the type hierarchy are called hard nodes
Find a coalescing of the graph such that there is only one hard node per group
Use the found coalescing of the graph to assign static types to variables
A Simple Example:
public java.lang.String f() {
    ? a;
    ? b;
    ? c;
    c = new C();
    b = new B();
    if (...)
        a = c;
    else
        a = b;
    s = a.toString();
    return s;
}
A Simple Example: Solution
public java.lang.String f() {
    A a;
    B b;
    C c;
    String s;
    c = new C();
    b = new B();
    if (...)
        a = c;
    else
        a = b;
    s = a.toString();
    return s;
}
Of course, it’s not that simple
Bytecode verification is program point specific
Multiple Inheritance due to interfaces makes type inference hairy
Arrays are not straightforward to correctly type
Solving the constraint problem is NP-hard in general
Example: Multiple Definition and Use Points
class A extends Object { f() {} ... }
class B extends Object { g() {} ... }

class Multi extends Object {
    void hard() {
        ? x;
        if (...) {
            x = new A(); x.f();
        } else {
            x = new B(); x.g();
        }
        x.toString();
    }
}
Example: Interfaces
class Hardest {
    IC getC() { return new C(); }
    ID getD() { return new D(); }

    void hardest() {
        ? oops;
        if (...)
            oops = getC();
        else
            oops = getD();
        oops.f(); // IA.f
        oops.g(); // IB.g
    }
}
Solution Outline
Polynomial run-time multi-stage algorithm
Bypass the complexity by using program transforms to simplify hard cases
Algorithm preserves program semantics (One would hope so)
Algorithm uses two transformations (Stage 2 and 3, respectively)
  1 Variable splitting at object creation sites
  2 Insertion of type casts that are guaranteed to succeed at runtime (Why?)
Step by Step Outline
1 Produce Bare Jimple (Jimple containing only type information from the bytecode)
2 Compute DU/UD chains (as seen in class)
3 Split all local variables (one per DU/UD web) (Why?)
4 Run the three-stage type inference algorithm
5 Clean up the code generated by DU/UD splitting using Copy Propagation and Elimination
Example: DU/UD Splitting
Before splitting:
s_1 = l_1
s_2 = l_2
s_1 = s_1 + s_2
l_1 = s_1

After splitting:
s_1_0 = l_1_0
s_2_0 = l_2_0
s_1_1 = s_1_0 + s_2_0
l_1_1 = s_1_1
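For straight-line code like this example, the splitting can be sketched as a simple renaming pass. This is an illustrative simplification, not the deck's algorithm: real DU/UD webs require reaching-definitions information across control flow, and the class name StraightLineSplitter and the array encoding of statements are assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: splits variables in straight-line code only.
final class StraightLineSplitter {
    // Each statement is {target, operand...},
    // e.g. {"s_1", "s_1", "s_2"} encodes s_1 = s_1 + s_2.
    static List<String[]> split(List<String[]> stmts) {
        Map<String, Integer> version = new HashMap<>();
        List<String[]> out = new ArrayList<>();
        for (String[] stmt : stmts) {
            String[] renamed = new String[stmt.length];
            // Rename uses first: they read the version live before this statement.
            for (int i = 1; i < stmt.length; i++) {
                int v = version.computeIfAbsent(stmt[i], k -> 0);
                renamed[i] = stmt[i] + "_" + v;
            }
            // A definition starts a new web: version 0 if unseen, else the next one.
            Integer old = version.get(stmt[0]);
            int next = (old == null) ? 0 : old + 1;
            version.put(stmt[0], next);
            renamed[0] = stmt[0] + "_" + next;
            out.add(renamed);
        }
        return out;
    }
}
```

Fed the four statements of the example, the pass renames the second definition of s_1 to s_1_1 and the definition of l_1 to l_1_1, so each definition gets its own variable and can receive its own type.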
Algorithm Overview
boolean, byte, char, short and int are all treated as int (integer types are inferred separately)
GOAL: Find a static type assignment for each local variable that satisfies all of the use constraints
Each stage is run in order. Either it yields a solution, or the algorithm moves to the next stage
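The "run each stage in order until one yields a solution" control flow can be sketched as follows. The class name StagedSolver and the use of Supplier and Optional to model a stage are assumptions for illustration; the deck does not prescribe an interface.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical driver: a stage either yields a typing or an empty Optional.
final class StagedSolver {
    // Runs the stages in order and returns the first solution found.
    static <T> Optional<T> solve(List<Supplier<Optional<T>>> stages) {
        for (Supplier<Optional<T>> stage : stages) {
            Optional<T> result = stage.get();
            if (result.isPresent()) {
                return result; // later stages are never run
            }
        }
        return Optional.empty();
    }
}
```

Because later stages are only invoked on failure, the more invasive transformations (variable splitting, cast insertion) are applied only to the methods that need them.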
Stage 1 Overview
1 Construct directed graph of program constraints
2 Merge connected components in the graph
3 Remove transitive constraints
4 Merge single constraints
Stage 1: Building the Constraint Graph
The constraint graph contains the following elements:
hard node: Represents an explicit type
soft node: Represents a type variable
directed edge: Represents a constraint between two nodes.
a ← b: b is assignable to a according to Java assignment rules
Constraint Graph: Some Examples
Statement          Constraints
a = b              T(a) ← T(b)
a = b + 3          T(a) ← T(b), T(a) ← int, int ← T(b)
a = b.equals(c)    java.lang.Object ← T(b), java.lang.Object ← T(c), T(a) ← int
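The three statement forms above can be turned into constraints mechanically. The sketch below is a minimal illustration: the class ConstraintCollector and its method names are hypothetical, constraints are stored as plain strings "left <- right", and only these three statement shapes are covered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical collector: records "left <- right", read "right is assignable to left".
final class ConstraintCollector {
    final List<String> constraints = new ArrayList<>();

    private static String t(String var) { return "T(" + var + ")"; }

    private void add(String left, String right) {
        constraints.add(left + " <- " + right);
    }

    // a = b
    void copy(String a, String b) {
        add(t(a), t(b));
    }

    // a = b + <int constant>
    void addIntConstant(String a, String b) {
        add(t(a), t(b));
        add(t(a), "int");
        add("int", t(b));
    }

    // a = b.equals(c); the boolean result is an int at this stage
    void equalsCall(String a, String b, String c) {
        add("java.lang.Object", t(b));
        add("java.lang.Object", t(c));
        add(t(a), "int");
    }
}
```

Hard nodes (int, java.lang.Object) appear as literal type names, while soft nodes appear as T(x); this distinction is what the later merging steps rely on.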
Stage 1: Merging Connected Components
There are three cases for merging connected components
All soft nodes ⇒ naive merging of all soft nodes into a single one
The component has a single hard node ⇒ Merge all soft nodes into the hard node. (Verify constraints and fail if not satisfied)
More than one hard node in the component ⇒ fail and skip to stage 2
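A minimal sketch of the three cases, under one assumption the slide does not state explicitly: a "connected component" is taken to mean a cycle in the directed constraint graph (nodes that can reach each other), since nodes on a cycle must all receive the same type. The class ComponentMerger and its methods are hypothetical, the quadratic reachability search is chosen for brevity, and the constraint verification step for the single-hard-node case is omitted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for stage 1 component merging.
final class ComponentMerger {
    // Nodes reachable from start by following edges (start included).
    static Set<String> reachable(Map<String, List<String>> edges, String start) {
        Set<String> seen = new HashSet<>();
        Deque<String> work = new ArrayDeque<>();
        seen.add(start);
        work.push(start);
        while (!work.isEmpty()) {
            for (String next : edges.getOrDefault(work.pop(), List.of())) {
                if (seen.add(next)) work.push(next);
            }
        }
        return seen;
    }

    // Component of n: every node m with a path n -> m and a path m -> n.
    static Set<String> component(Map<String, List<String>> edges, String n) {
        Set<String> comp = new HashSet<>();
        for (String m : reachable(edges, n)) {
            if (reachable(edges, m).contains(n)) comp.add(m);
        }
        return comp;
    }

    // Node the component collapses into, or null when stage 1 must fail.
    static String mergeTarget(Set<String> comp, Set<String> hardNodes) {
        List<String> hard = new ArrayList<>();
        for (String n : comp) {
            if (hardNodes.contains(n)) hard.add(n);
        }
        if (hard.size() > 1) return null;         // several hard nodes: fail
        if (hard.size() == 1) return hard.get(0); // merge soft nodes into the hard node
        return Collections.min(comp);             // all soft: any representative works
    }
}
```

A null result corresponds to the "fail and skip to stage 2" case; a production implementation would more likely use a linear-time strongly-connected-components algorithm.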
Stage 1: Removing Transitive Constraints
Transitivity: A constraint x ← y is said to be transitive if there exists another constraint p ← y such that p ≠ x and there exists a path from p to x in the directed graph.
We eliminate any such transitive edge regardless of node type, except in the case of hard-hard constraints. We also take this opportunity to merge primitive types.
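The definition translates fairly directly into code. The sketch below assumes the graph is acyclic (cycles having been merged in the previous step) and stores, for each node y, the set of nodes x with a constraint x ← y; the class TransitiveReducer is hypothetical and the primitive-type merging mentioned above is not shown.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: succ.get(y) holds every x with a constraint x <- y.
final class TransitiveReducer {
    static boolean reaches(Map<String, Set<String>> succ, String from, String to) {
        Set<String> seen = new HashSet<>();
        Deque<String> work = new ArrayDeque<>();
        seen.add(from);
        work.push(from);
        while (!work.isEmpty()) {
            String n = work.pop();
            if (n.equals(to)) return true;
            for (String m : succ.getOrDefault(n, Set.of())) {
                if (seen.add(m)) work.push(m);
            }
        }
        return false;
    }

    // Returns a copy of the graph without transitive edges (hard-hard edges are kept).
    static Map<String, Set<String>> reduce(Map<String, Set<String>> succ, Set<String> hard) {
        Map<String, Set<String>> out = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : succ.entrySet()) {
            String y = e.getKey();
            Set<String> kept = new HashSet<>();
            for (String x : e.getValue()) {
                boolean transitive = false;
                for (String p : e.getValue()) {
                    // x <- y is transitive if some other p <- y has a path from p to x
                    if (!p.equals(x) && reaches(succ, p, x)) {
                        transitive = true;
                        break;
                    }
                }
                boolean hardHard = hard.contains(x) && hard.contains(y);
                if (!transitive || hardHard) kept.add(x);
            }
            out.put(y, kept);
        }
        return out;
    }
}
```

For example, with T(a) ← T(b), java.lang.Object ← T(b) and java.lang.Object ← T(a), the edge java.lang.Object ← T(b) is dropped because it is implied by the path through T(a).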
Stage 1: Merging Single Constraints
Single Parent Constraint: A node x has a single parent constraint to y if y ← x and for any p ≠ y there is no constraint p ← x
Single Child Constraint: A node x has a single child constraint to y if x ← y and for any p ≠ y there is no constraint x ← p
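Both definitions amount to checking that a node appears on one side of exactly one distinct constraint. A minimal sketch, assuming constraints are stored as two-element arrays {left, right} read "left ← right"; the class SingleConstraints is hypothetical.

```java
import java.util.List;

// Hypothetical helper: each constraint is {left, right}, read "left <- right".
final class SingleConstraints {
    // Returns y if x has a single parent constraint to y (y <- x), else null.
    static String singleParent(List<String[]> constraints, String x) {
        String parent = null;
        for (String[] c : constraints) {
            if (c[1].equals(x)) {
                if (parent != null && !parent.equals(c[0])) return null; // second parent
                parent = c[0];
            }
        }
        return parent;
    }

    // Returns y if x has a single child constraint to y (x <- y), else null.
    static String singleChild(List<String[]> constraints, String x) {
        String child = null;
        for (String[] c : constraints) {
            if (c[0].equals(x)) {
                if (child != null && !child.equals(c[1])) return null; // second child
                child = c[1];
            }
        }
        return child;
    }
}
```

A non-null result identifies the node that x could be merged with; the order in which such merges are attempted is a separate policy decision.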
Stage 1: Single Constraints Priority
1 Merge single child constraints
2 Merge with least common ancestor
3 Merge single soft parent constraints
4 Merge remaining single parent constraints
Stage 2 Overview
1 Apply variable splitting transformations (Only known case: x = new A())
2 Run Stage 1
Stage 2: Applying Variable Splitting
class A extends Object {}
class B extends Object {}
class Multi extends Object {
    void java() {
        Object y;
        if (...)
            y = new A();
        else
            y = new B();
        y.toString();
    }
}
Stage 2: Variable Splitting (cont)
void three_addr() {
    ? y;
    if (...) {
        y = new A();
        y.[A.<init>()]();
    } else {
        y = new B();
        y.[B.<init>()]();
    }
    y.toString();
}
Stage 2: Variable Splitting (cont)
void three_addr_split() {
    ? y, y1, y2;
    if (...) {
        y1 = new A();
        y = y1;
        y1.[A.<init>()]();
    } else {
        y2 = new B();
        y = y2;
        y2.[B.<init>()]();
    }
    y.toString();
}
Stage 3 Overview
1 Construct constraint graph with only variable definition constraints
2 Ignore use constraints and assume all interfaces inherit from java.lang.Object
3 Solve the system using the least common ancestor of classes and interfaces
4 Add typecasts according to use constraints (Why can we do this safely?)
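The least common ancestor in step 3 is easy to compute for classes, because each class has exactly one superclass. The sketch below is an illustration restricted to classes: the ClassHierarchy helper and its map-based representation are assumptions, and interfaces are not modelled beyond falling back to java.lang.Object.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical single-inheritance hierarchy: class name -> superclass name.
final class ClassHierarchy {
    private final Map<String, String> superOf = new HashMap<>();

    void addClass(String name, String superName) {
        superOf.put(name, superName);
    }

    String leastCommonAncestor(String a, String b) {
        // Collect a and all of its superclasses.
        Set<String> ancestors = new HashSet<>();
        for (String c = a; c != null; c = superOf.get(c)) {
            ancestors.add(c);
        }
        // Walk up from b; the first hit is the closest common ancestor.
        for (String c = b; c != null; c = superOf.get(c)) {
            if (ancestors.contains(c)) return c;
        }
        // Stage 3 treats every type as a subtype of Object.
        return "java.lang.Object";
    }
}
```

With A and B both extending java.lang.Object, a variable defined from both new A() and new B() is typed java.lang.Object, and any use that needs A or B then receives a cast.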
Handling Arrays
A ↦ B: A is an array of B
Represented in constraint graph with dashed lines
Java property that says: (A[] ← B[]) ⇔ (A ← B) and (A ← B[]) ⇔ (A ∈ {Object, Serializable, Cloneable})
Build graph without array constraints
Solve normally
Use that solution to give arrays types
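The two array assignability rules can be checked directly against the Java runtime with Class.isAssignableFrom, which is standard API. The wrapper class ArrayAssignability is only a hypothetical name for this demonstration.

```java
import java.io.Serializable;

// Demonstrates Java's array assignability rules using reflection.
final class ArrayAssignability {
    // True when a value of type source can be assigned to a variable of type target.
    static boolean assignable(Class<?> target, Class<?> source) {
        return target.isAssignableFrom(source);
    }

    public static void main(String[] args) {
        // Arrays of reference types are covariant in their element type.
        System.out.println(assignable(Object[].class, String[].class));  // true
        System.out.println(assignable(String[].class, Object[].class));  // false
        // A non-array type accepts an array only if it is one of these three.
        System.out.println(assignable(Object.class, String[].class));       // true
        System.out.println(assignable(Cloneable.class, String[].class));    // true
        System.out.println(assignable(Serializable.class, String[].class)); // true
        System.out.println(assignable(Comparable.class, String[].class));   // false
    }
}
```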
Inferring Integer Types
Two phase sub-algorithm that infers the proper types
Stage 1 (Fixed Point Computation)
  Constraint Collection
  Merge connected components (may fail)
  Merge single relations until fixed point is reached
Stage 2 (Similar, different constraints)
Results
16,492 methods extracted from JDK 1.1 were typed without ever resorting to type casting (stage 3)
Out of those 16,492, only 29 required variable splitting (stage 2)
99.8% of methods typed successfully with stage 1
0.2% of methods typed successfully with stage 2
Short Questions
What is the reason we are doing DU/UD web splitting before running the type inference algorithm?
Stack positions and local variables in the bytecode can store different types of values at different program points. Splitting along DU/UD webs ensures that this won't cause an issue for typing.
What enables us to typecast to the appropriate types without being worried about runtime casting exceptions in stage 3 of the algorithm?
We are working under the assumption that the bytecode passed verification. Because of that, we have a guarantee that the types we will be casting to are valid subtypes at runtime.
Assignment Question
You are tasked with typing the following method, given the class hierarchy. (Next slide)
Show your type constraint list
Show your type constraint graph
Show your final graph reduction
Show the typed output that the algorithm yielded.
Note: You do not have to run Integer Type Inference
Assignment Question
public class Topping {
    ? id;
    ? price;
    int getId() { return id; }
    int getPrice() { return price; }
    Topping(int id, int price) {
        this.id = id;
        this.price = price;
    }
}
Assignment Question
public class Pizza {
    ? t1, t2, t3;
    Pizza(Topping a, Topping b, Topping c) {
        t1 = a; t2 = b; t3 = c;
    }
    int buy() {
        return t1.getPrice()
             + t2.getPrice()
             + t3.getPrice();
    }
}
Bibliography
Benjamin Bellamy. Efficient local type inference. 3rd year project report, Magdalen College, Trinity Term.
Etienne M. Gagnon, Laurie J. Hendren, and Guillaume Marceau. Efficient inference of static types for Java bytecode, 2000.