Transcript
Page 1: Compilers Are Databases

Compilers Are Databases

JVM Languages Summit

Martin Odersky, TypeSafe and EPFL

Page 2: Compilers Are Databases

Compilers...

Page 3: Compilers Are Databases

Compilers and Data Bases

Page 4: Compilers Are Databases

Compilers are Data Bases?

Put a square peg in a round hole?

Page 5: Compilers Are Databases

This Talk ...

... reports on a new compiler architecture for dsc, the Dotty Scala Compiler.

• It has a mostly functional architecture, but uses a lot of low-level tricks for speed.

• Some of its concepts are inspired by functional databases.

Page 6: Compilers Are Databases

My Early Involvement in Compilers

80s: Pascal, Modula-2. Single pass, following the school of Niklaus Wirth.

95-96: Espresso, the 2nd Java compiler -> E Compiler -> Borland’s JBuilder

used an OO AST with one class per node and all processing distributed between methods on these nodes.

96-99: Pizza -> GJ -> javac (1.3+) -> scalac (1.x)

replaced OO AST with pattern matching.

Page 7: Compilers Are Databases

Current Scala Compiler

2004-12 nsc compiler for Scala (2.0-2.10)

Made (some) use of functional capabilities of Scala

Added:

– REPL
– presentation compiler for IDEs (Eclipse, Ensime)
– run-time meta programming with toolboxes

It’s the codebase for the official scalac compiler for 2.11, 2.12 and beyond.

Page 8: Compilers Are Databases

Next Generation Scala Compiler

2012 – now: Dotty

• Rethink compiler architecture from the ground up.

• Introduce some language changes with the aim of better regularity.

• Status:
– Close to bootstrap
– But still rough around the edges

Page 9: Compilers Are Databases

Compilers – Traditional View

Page 10: Compilers Are Databases

Compilers – Traditional View

Page 11: Compilers Are Databases

Add Separate Compilation

Page 12: Compilers Are Databases

Challenges

A compiler for a language like Scala faces quite a few challenges.

Among the most important are:

» Complexity
» Speed
» Latency
» Reusability

Page 13: Compilers Are Databases

Challenge: Complex Transformations

• Input language (Scala) is complicated.
• Output language (JVM) is also complicated.
• Semantic gap between the two is large.

Compare with compilers to simple low-level languages such as System F or SSA.

Page 14: Compilers Are Databases

Deep Transformation Pipeline

[Diagram: Source -> phase pipeline -> Bytecode]

Phases named in the diagram: Parser, Typer, FirstTransform, ValueClasses, Mixin, LazyVals, Memoize, CapturedVars, Constructors, LambdaLift, Flatten, ElimStaticThis, RestoreScopes, GenBCode, RefChecks, ElimRepeated, NormalizeFlags, ExtensionMethods, TailRec, PatternMatcher, ExplicitOuter, ExpandSAMs, Splitter, SeqLiterals, InterceptedMeths, Literalize, Getters, ClassTags, ElimByName, AugmentS2Traits, ResolveSuper, Erasure.

To achieve reliability, need

– excellent modularity
– minimized side effects

Functional code rules!

Page 15: Compilers Are Databases

Challenge: Speed

• Current scalac achieves 500-700 loc/sec on idiomatic Scala code.

• Can be much lower, depending on input.
• Everyone would like it to be faster.
• But this is very hard to achieve.
– FP does have costs.
– Optimizations are ineffective.
– No hotspots, costs are smeared out widely.

Page 16: Compilers Are Databases

Challenge: Latency

• Some applications require fast turnaround for small changes more than high throughput.

• Examples:
– REPL
– Worksheet
– IDE Presentation Compiler

Need to keep things loaded (program + data)

Page 17: Compilers Are Databases

Challenge: Reusability

• A compiler has many clients:
– Command line
– Build tools
– IDEs
– REPL
– Meta-programming

Abstractions must not leak.

(FP helps)

Page 18: Compilers Are Databases

A Question

Every compiler has to answer questions like this: Say I have a class

    class C[T] {
      def f(x: T): T = ...
    }

At some point I change it to:

    class C[T] {
      def f(x: T)(y: T): T = ...
    }

What is the type signature of C.f?

Clearly, it depends on the time when the question is asked!

Page 19: Compilers Are Databases

Time-Varying Answers

Initially: (x: T): T

After erasure: (x: Any): Any

After the edit: (x: T)(y: T): T

After uncurry: (x: T, y: T): T

After erasure: (x: Any, y: Any): Any

Page 20: Compilers Are Databases

Naive Functional Approach

World_1 -> IR_1,1 -> ... -> IR_n,1 -> Output_1

World_2 -> IR_1,2 -> ... -> IR_n,2 -> Output_2

...

World_k -> IR_1,k -> ... -> IR_n,k -> Output_k

How big is the world?

Page 21: Compilers Are Databases

A More Practical Strategy

Taking Inspiration from FRP and Functional Databases:

• Treat every value as a time-varying function.
• So the question is not:

“What is the signature of C.f?”

but:

“What is the signature of C.f at a given point in time?”

Need to index every piece of information with the time where it holds.

Page 22: Compilers Are Databases

Time in dsc

Period = (RunID, PhaseID)

• RunID is incremented for each compiler run
• PhaseID ranges from 1 (parser) to ~ 50 (backend)
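The pair above can be written down directly as a small value type. This is only a sketch of the idea: the names `Period`, `runId`, `phaseId` and `nextRun`, and the concrete phase ids used in the comments, are assumptions made for illustration and are not taken from dsc.

```scala
object PeriodSketch {
  // Hypothetical encoding of Period = (RunID, PhaseID); not dsc's actual representation.
  final case class Period(runId: Int, phaseId: Int) {
    require(runId >= 1 && phaseId >= 1, "run and phase ids start at 1")

    // The same phase in the following compiler run.
    def nextRun: Period = copy(runId = runId + 1)
  }

  // Periods are ordered by run first, then by phase within the run,
  // so every period of run 1 precedes every period of run 2.
  implicit val ordering: Ordering[Period] =
    Ordering.by((p: Period) => (p.runId, p.phaseId))
}
```

With this ordering, a late phase of run 1 still compares as earlier than the parser phase of run 2, which is the property needed to treat periods as points on a single time line.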

[Diagram: time line divided into Run1, Run2, Run3]

Page 23: Compilers Are Databases

Time-Indexed Values

sig(C.f, (Run 1, parser)) = (x: T): T

sig(C.f, (Run 1, erasure)) = (x: Any): Any

sig(C.f, (Run 2, parser)) = (x: T)(y: T): T

sig(C.f, (Run 2, uncurry)) = (x: T, y: T): T

sig(C.f, (Run 2, erasure)) = (x: Any, y: Any): Any
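One way to read these entries is as a finite map from (member, period) to signature. The sketch below is purely illustrative: `Period`, `sigAt` and the use of strings for members, phases and signatures are assumptions, and the first entry of run 2 is taken to be the parser phase, matching the "after the edit" step.

```scala
object SigSketch {
  // Hypothetical period: run number plus phase name (dsc uses numeric phase ids).
  final case class Period(run: Int, phase: String)

  // Signatures indexed by (member, period).
  val sig: Map[(String, Period), String] = Map(
    ("C.f", Period(1, "parser"))  -> "(x: T): T",
    ("C.f", Period(1, "erasure")) -> "(x: Any): Any",
    ("C.f", Period(2, "parser"))  -> "(x: T)(y: T): T",
    ("C.f", Period(2, "uncurry")) -> "(x: T, y: T): T",
    ("C.f", Period(2, "erasure")) -> "(x: Any, y: Any): Any"
  )

  // Asking for a signature always requires saying *when*.
  def sigAt(member: String, period: Period): Option[String] =
    sig.get((member, period))
}
```

The same member yields different answers for different periods, and a period for which nothing has been computed yields no answer at all.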

Page 24: Compilers Are Databases

Task of the Compiler

• Compute all values needed for analysis and code generation over all periods where they are relevant.

• Problem: The graph of this function is humongous!

• More work is needed to make it efficiently explorable.

• But for a start it looks like the right model.

Page 25: Compilers Are Databases

Core Data Types

• Abstract Syntax Trees
• Types
• References
• Denotations
• Symbols

Page 26: Compilers Are Databases

Abstract Syntax Trees

• For instance, for x * 2:

Page 27: Compilers Are Databases

Tree Attributes

What about tree attributes?

In dsc, we simplified as much as we could. We were left with just two attributes:

– Position (intrinsic)
– Type

The job of the type checker is to transform untyped to typed trees.

Page 28: Compilers Are Databases

Typed Abstract Syntax Trees

For instance, for x * 2:

The distinction between typed and untyped trees is important enough to merit being reflected in the type of the AST itself.

Page 29: Compilers Are Databases

From Untyped to Typed Trees

Idea: parameterize the type Tree of ASTs with the attribute info it carries.

Typed tree: tpd.Tree = Tree[Type]

Untyped tree: untpd.Tree = Tree[Nothing]

This leads to the following class:

    class Tree[T] {
      def tpe: T
      def withType(t: Type): Tree[Type]
    }

Page 30: Compilers Are Databases

Question of Variance

• Question: Which of the following two subtype relationships should hold?

tpd.Tree <: untpd.Tree

untpd.Tree <: tpd.Tree ?

• What is the more useful relationship? (the first)

• What relationship do the variance rules imply? (the second)

    class Tree[? T] {
      def tpe: T
      ...
    }

Page 31: Compilers Are Databases

Fixing class Tree

    class Tree[-T] {
      def tpe: T @uncheckedVariance
      def withType(t: Type): Tree[Type]
    }

Interesting exception to the variance rules related to the bottom type Nothing.

What can go “wrong” here? Given an untpd.Tree, I expect Nothing, but I might get a Type.

Shows that it’s good to have an escape hatch in the form of @uncheckedVariance.
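A minimal, self-contained sketch of this trick follows. Because `tpe` returns `T`, the type parameter occurs in covariant position, so the variance rules alone would only admit `+T` (making untyped trees a subtype of typed ones). Declaring `-T` and annotating the result with `@uncheckedVariance` gives the more useful direction. Everything except `Tree`, `tpe`, `withType` and `@uncheckedVariance` is an assumption for illustration: `Type` is a stand-in case class, and storing the type in an `Option` is a simplification rather than how dsc represents it.

```scala
import scala.annotation.unchecked.uncheckedVariance

object TreeSketch {
  // Stand-in for the compiler's Type class (hypothetical).
  final case class Type(show: String)

  // Contravariant in T, although tpe returns T: the annotation switches
  // off the variance check for that one occurrence.
  class Tree[-T](private val myTpe: Option[Type]) {
    def tpe: T @uncheckedVariance = myTpe match {
      case Some(t) => t.asInstanceOf[T]
      case None    => throw new UnsupportedOperationException("untyped tree")
    }
    def withType(t: Type): Tree[Type] = new Tree[Type](Some(t))
  }

  type TypedTree   = Tree[Type]    // corresponds to tpd.Tree
  type UntypedTree = Tree[Nothing] // corresponds to untpd.Tree

  val untyped: UntypedTree = new Tree[Nothing](None)
  val typed: TypedTree = untyped.withType(Type("Int"))

  // Accepted by the compiler because Tree[Type] <: Tree[Nothing] under -T.
  val viewedAsUntyped: UntypedTree = typed
}
```

The last line is the payoff: a typed tree can be passed wherever an untyped tree is expected. The price is the one described above: calling `tpe` through the untyped view has static type `Nothing` even though a `Type` is stored, so code should not rely on it.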

Page 32: Compilers Are Databases

Types

• Types carry most of the essential information of trees and symbols.

• Two kinds of types.
– Value types: Int, Int => Int, (Boolean, String)
– Types of definitions: (x: Int)Int, Lo..Hi, Class(...)

• Represented as subtypes of the same type “Type” for convenience.

Page 33: Compilers Are Databases

References

    case class Select(qual: Tree, name: Name) {
      // what is its tpe?
    }

    case class Ident(name: Name) {
      // what is its tpe?
    }

• Normally, these tree nodes would carry a “symbol”, which acts as a reference to some definition.

• But there are no symbol attributes in dsc, for good reason.

Page 34: Compilers Are Databases

Traditional Scheme

That’s not very functional!

Page 35: Compilers Are Databases

A Question of Meaning

Question: What is the meaning of obj.fun?

It depends on the period!

Does that mean that obj.fun has different types, depending on period?

No, trees are immutable!

Page 36: Compilers Are Databases

References

• A reference is a type.
• It contains (only)
– a name
– potentially a prefix
• References are immutable; they exist forever.

Page 37: Compilers Are Databases

What about Overloads?

The name of a TermRef may be shared by several overloaded members of a class.

How do we determine which member is meant?

(In a nutshell, that’s why overloading is so universally hated by compiler writers)

Trick: Allow “signature” as part of term names.
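A rough sketch of what "signature as part of the name" could look like: if a term name optionally carries a signature, two overloads stop sharing a name and a reference can identify one of them without any mutable symbol link. All names below (`Signature`, `TermName`, `TermRef`, `resolve`, the `Printer` class and its `print` overloads) are hypothetical, and types are represented as plain strings.

```scala
object SignedNameSketch {
  // Hypothetical signature: parameter and result type names.
  final case class Signature(params: List[String], result: String)

  // A term name, optionally qualified by a signature.
  final case class TermName(base: String, sig: Option[Signature] = None)

  // A reference contains only a (possible) prefix and a name.
  final case class TermRef(prefix: Option[String], name: TermName)

  // Two overloads of `print` in an imagined class Printer.
  val printInt: TermName = TermName("print", Some(Signature(List("Int"), "Unit")))
  val printStr: TermName = TermName("print", Some(Signature(List("String"), "Unit")))

  val members: Map[TermName, String] = Map(
    printInt -> "def print(x: Int): Unit",
    printStr -> "def print(x: String): Unit"
  )

  // Resolution is a plain lookup by (signed) name.
  def resolve(ref: TermRef): Option[String] = members.get(ref.name)
}
```

A reference built from the signed name picks exactly one overload, while the bare name `print` does not match any single member in this sketch.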

Page 38: Compilers Are Databases

What Does A Reference Reference?

Surely, a symbol?

No!

References capture more than a symbol

And sometimes they do not refer to a unique symbol at all.

Page 39: Compilers Are Databases

References capture more than a symbol.

Consider:

    class C[T] { def f(x: T): T }
    val prefix = new C[Int]

Then prefix.f:

– resolves to C’s f
– but at type (Int)Int, not (T)T

Both pieces of information are part of the meaning of prefix.f.

Page 40: Compilers Are Databases

References

Sometimes references point to no symbol at all. We have already seen overloading. Here’s another example using union types, which are newly supported by dsc:

    class A { def f: Int }
    class B { def f: Int }
    val prefix: A | B = if (...) new A else new B
    prefix.f

What symbol is referenced by prefix.f ?

Page 41: Compilers Are Databases

Denotations

The meaning of a reference is a denotation.

Non-overloaded denotations carry symbols (maybe) and types (always).
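The "maybe a symbol, always a type" shape can be sketched as a small data type. The class names, the string representation of types, and the `unionMember` helper are assumptions for illustration. In particular, the helper only handles the case where both sides have the same type, whereas a real compiler would have to combine differing member types.

```scala
object DenotationSketch {
  // Hypothetical symbol: a declaration identified by owner and name.
  final case class Symbol(owner: String, name: String)

  sealed trait Denotation

  // Non-overloaded: a symbol (maybe) and a type (always; a string here).
  final case class SingleDenotation(symbol: Option[Symbol], info: String) extends Denotation

  // Overloaded: several alternatives under one name.
  final case class MultiDenotation(alternatives: List[SingleDenotation]) extends Denotation

  // Member of a union type such as A | B. Simplifying assumption: the member
  // exists only if both sides agree on its type. The symbol survives only if
  // both sides refer to the very same declaration.
  def unionMember(left: SingleDenotation, right: SingleDenotation): Option[SingleDenotation] =
    if (left.info == right.info) {
      val sym = if (left.symbol == right.symbol) left.symbol else None
      Some(SingleDenotation(sym, left.info))
    } else None
}
```

Applied to the members `f` of two unrelated classes `A` and `B`, both of type `Int`, this yields a denotation with type `Int` and no symbol, which is the situation of `prefix.f` on a prefix of type `A | B`.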

Page 42: Compilers Are Databases

What Then Is A Symbol?

A symbol represents a declaration in some source file.

It “lives” as long as the source file is unchanged.

It has a denotation depending on the period.

Page 43: Compilers Are Databases

Denotation Transformers

• How do we compute new denotations from old ones?
• For references pre.f: Can recompute the member at new phase.
• For symbols?

uncurry.transDenot(<(x: A)(y: B): C>) = <(x: A, y: B): C>
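The uncurry example can be sketched as a pure function on method signatures. Only the name `transDenot` comes from the line above. `Param`, `MethodSig` and `show` are hypothetical helpers, and the function works on a bare signature rather than a full denotation.

```scala
object UncurrySketch {
  final case class Param(name: String, tpe: String)

  // A method signature with possibly several parameter lists.
  final case class MethodSig(paramLists: List[List[Param]], result: String) {
    def show: String =
      paramLists
        .map(_.map(p => s"${p.name}: ${p.tpe}").mkString("(", ", ", ")"))
        .mkString + ": " + result
  }

  object Uncurry {
    // Merges all parameter lists into one. The input is left untouched and a
    // new value is returned, so earlier periods keep their old signature.
    def transDenot(sig: MethodSig): MethodSig =
      MethodSig(List(sig.paramLists.flatten), sig.result)
  }
}
```

Because the transformer returns a new value instead of updating the old one, the curried signature remains available for any period before uncurry.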

Page 44: Compilers Are Databases

Caching Denotations

Symbols are memoized functions: Period => Denotation

Keep all denotations of a symbol at different phases as a ring.
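A small sketch of a symbol as a memoized function from periods to denotations. For brevity it stores results in a mutable map rather than the ring mentioned above, so it illustrates the memoization only, not the data structure. `compute` is a hypothetical stand-in for running the denotation transformers, and `computations` exists only so the caching can be observed.

```scala
import scala.collection.mutable

object MemoSketch {
  final case class Period(runId: Int, phaseId: Int)
  final case class Denotation(info: String)

  // `compute` is an assumption standing in for the real transformation work.
  final class Symbol(val name: String, compute: Period => Denotation) {
    private val cache = mutable.Map.empty[Period, Denotation]

    // Counts how often the underlying function actually ran.
    var computations: Int = 0

    // Behaves like a pure function of the period; the cache is invisible
    // to callers apart from the saved work.
    def denot(at: Period): Denotation =
      cache.getOrElseUpdate(at, { computations += 1; compute(at) })
  }
}
```

From the outside the symbol still looks purely functional: the same period always gives the same denotation, and repeated questions for one period are answered without recomputation.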

Page 45: Compilers Are Databases

Putting it all Together

• ER diagram of core compiler architecture:

Page 46: Compilers Are Databases

Lessons Learned

(Not done yet, still learning)

• Think databases for modeling.
• Think FP for transformations.
• Get efficiency through low-level techniques (caching)
• But take care not to compromise the high-level semantics.

Page 47: Compilers Are Databases

To Find Out More

Page 48: Compilers Are Databases

How to make it Fast

• Caching
– Symbols cache last denotation
– NamedTypes do the same
– Caches are stamped with validity interval (current period until the next denotation transformer kicks in).
– Need to update only if outside of validity period
– Member lookup caches denotation

• Not yet tried: Parallelization.
– Could be hard (similar to chess programs)
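The validity-interval idea can be sketched as follows, restricted to a single run so that a period is just a phase id. The names `validFrom`, `validUntil`, `denotAt` and `lookup`, and the counter `recomputations`, are assumptions for illustration and not dsc's API.

```scala
object ValiditySketch {
  // A denotation stamped with the phase interval in which it holds.
  final case class Denotation(info: String, validFrom: Int, validUntil: Int) {
    def isValidAt(phaseId: Int): Boolean =
      validFrom <= phaseId && phaseId <= validUntil
  }

  // `lookup` stands in for the expensive computation of a denotation.
  final class Symbol(lookup: Int => Denotation) {
    private var last: Option[Denotation] = None
    var recomputations: Int = 0

    def denotAt(phaseId: Int): Denotation = last match {
      // Fast path: the cached denotation still covers the requested phase.
      case Some(d) if d.isValidAt(phaseId) => d
      // Slow path: outside the validity interval, so recompute and re-stamp.
      case _ =>
        recomputations += 1
        val d = lookup(phaseId)
        last = Some(d)
        d
    }
  }
}
```

As long as successive questions fall between two denotation transformers, only the interval check runs. The cache is refreshed only when a request lands outside the stamped interval.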

Page 49: Compilers Are Databases

Many forms of Caches

• Lazy vals
• Memoization
• LRU Caches

• Rely on
– Purely functional semantics
– Access to low-level imperative implementation code.
– Important to keep the levels of abstractions apart!

Page 50: Compilers Are Databases

Optimization: Phase Fusion

• For modularity reasons, phases should be small. Each phase should do one self-contained transform.

• But that means we end up with many phases.
• Problem: Repeated tree rewriting is a performance killer.
• Solution: Automatically fuse phases into one tree traversal.
– Relies on design pattern and some small amount of introspection.
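A toy illustration of fusion, under simplifying assumptions: each small phase is a node-local rewrite that is applied after the children of a node have been transformed, and fusing simply chains the rewrites at each node of one bottom-up traversal. The tree type, the two phases, `fuse` and the visit counter are all hypothetical. The sketch does not show the design pattern or introspection that dsc relies on, and fusing is only equivalent to running the phases one after another when, as here, a later phase does not depend on an earlier phase having finished the whole tree.

```scala
object FusionSketch {
  sealed trait Tree
  final case class Num(value: Int) extends Tree
  final case class Add(lhs: Tree, rhs: Tree) extends Tree
  final case class Mul(lhs: Tree, rhs: Tree) extends Tree

  // A small phase: rewrites one node whose children are already transformed.
  type MiniPhase = Tree => Tree

  // Counts node visits so that traversals can be compared.
  var nodesVisited: Int = 0

  // One bottom-up traversal applying `f` at every node.
  def traverse(tree: Tree, f: MiniPhase): Tree = {
    nodesVisited += 1
    tree match {
      case Add(l, r) => f(Add(traverse(l, f), traverse(r, f)))
      case Mul(l, r) => f(Mul(traverse(l, f), traverse(r, f)))
      case leaf      => f(leaf)
    }
  }

  // Phase 1: x * 2 becomes x + x.
  val mulByTwoToAdd: MiniPhase = {
    case Mul(x, Num(2)) => Add(x, x)
    case t              => t
  }

  // Phase 2: fold additions of two constants.
  val foldConstants: MiniPhase = {
    case Add(Num(a), Num(b)) => Num(a + b)
    case t                   => t
  }

  // Fusion: apply all phases in order at each node of a single traversal.
  def fuse(phases: MiniPhase*): MiniPhase =
    phases.foldLeft((t: Tree) => t)((f, g) => f andThen g)
}
```

Running the two phases separately on `Mul(Num(3), Num(2))` walks the tree twice, while `traverse(tree, fuse(mulByTwoToAdd, foldConstants))` produces the same result in one walk.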
