Extensibility Study Report: Source-Level Instrumentation Adam Leko 9/24/2005 UPC Group HCS Research...

Extensibility Study Report:

Source-Level Instrumentation

Adam Leko

9/24/2005

UPC Group

HCS Research Laboratory

University of Florida

Instrumentation Levels (Review)

- Source
  - Most flexible, can retain source-code correlation
  - Tends to be least accurate, can impede compiler optimizations
- Binary (object code)
  - More accurate, might not require any recompilation for user
  - Hard to do, tends to raise platform-specific issues, hard to relate data back to source code
- Compiler/runtime
  - Potentially most accurate, compiler can do program transformations and still correctly instrument
  - Requires lots of cooperation with compiler developers
- Operating system
  - Not useful unless code relies on lots of system calls

Need for Source Instrumentation

- Ideal case
  - Avoid source instrumentation
  - Keep all possible optimizations (binary and compiler/runtime instrumentation)
  - Still relate data to source code at function and line-number levels
- However,
  - Compiler/runtime instrumentation (UPC) will take time to get going
  - Instrumenting libraries only (SHMEM) limits data that can be collected
  - Binary instrumentation impossible on some platforms
- Need source-level instrumentation
  - Can get higher-level source information
  - Serves as a preliminary instrumentation technique for UPC+SHMEM code now

Automatic Source Instrumentation

- Based on our tool evaluations, found that
  - Automatic instrumentation necessary for tools
  - Problems with automatic instrumentation can cause frustration and decrease confidence in tool (SvPablo)
- Need a high-quality “preprocessor” that can
  - Take UPC/SHMEM code
  - Instrument it based on what analysis a user wants
  - Hand instrumented code to compiler
- Can either write a parser from scratch, or use an existing system
- Requirements for instrumentation system
  - Accurate instrumentation
  - Works with any valid C99/UPC program
- Shouldn’t this be easy?

Source Instrumentation Challenges

- Possible simple route: scan source code for tokens that look like shared accesses, add instrumentation
- Problems
  - Macro expansion (can pass through cpp before)
  - Scoping rules (need a good symbol table)
  - “Implicit” communication (shared variables aren’t treated any differently in UPC syntax)
  - if/else statements without braces, varargs with printf, interaction with gotos and case statements, …
- Trying to instrument code without a full-blown parser will result in buggy code!
  - C syntax cannot be described with a simple finite state machine (regular expressions, etc.)
  - Need a context-free grammar (parser) to correctly interpret and instrument source code
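The scoping problem listed above can be made concrete with a small sketch. It assumes a toy UPC fragment in which a private local `x` shadows a shared global `x`; the scanner and the names `identifiers`, `naive_hits`, and `example` are hypothetical and exist only for illustration.

```ocaml
(* Hypothetical illustration: a naive scanner that flags every occurrence of
   an identifier known to be shared, with no notion of scope. *)

(* True if character c can appear inside a C identifier. *)
let is_ident_char c =
  (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
  || (c >= '0' && c <= '9') || c = '_'

(* Split a source string into runs of identifier characters,
   discarding everything else (operators, punctuation, whitespace). *)
let identifiers (src : string) : string list =
  let n = String.length src in
  let rec scan i acc =
    if i >= n then List.rev acc
    else if is_ident_char src.[i] then begin
      let j = ref i in
      while !j < n && is_ident_char src.[!j] do incr j done;
      scan !j (String.sub src i (!j - i) :: acc)
    end else scan (i + 1) acc
  in
  scan 0 []

(* Number of sites a token scanner would treat as accesses to [name]. *)
let naive_hits (name : string) (src : string) : int =
  List.length (List.filter (fun tok -> tok = name) (identifiers src))

(* Toy fragment: the global x is shared, but inside f a private x shadows it. *)
let example =
  "shared int x;\n\
   void f(void) { int x = 0; x = x + 1; }\n\
   void g(void) { x = 2; }\n"

let () =
  Printf.printf "naive scanner flags %d sites\n" (naive_hits "x" example)
```

The scanner flags five sites, yet only the assignment in `g` is a shared access: one hit is the declaration itself and three refer to the private local in `f`. Telling these apart requires the symbol table that only a real parser provides.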

Writing Parser From Scratch

- Tools that can make this easier
  - flex/bison/yacc
  - ANTLR
  - (many more)
- Writing grammar for C relatively easy
  - A few ambiguities in C grammar, like expression vs. declaration problem
  - Can get around these with GLR parsers, or other tricks
- Grammar isn’t everything…
  - Once you have parse tree, you still need to correctly interpret it
  - Reporting user-friendly parse errors is also difficult unless you have a recursive-descent parser, which takes a long time to write
  - Supporting compiler extensions to C syntax can be difficult
- Should avoid writing our own parser!
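The expression vs. declaration problem mentioned above is usually shown with a statement such as `T * x;`, which declares a pointer if `T` names a type and is a multiplication otherwise. A minimal sketch of the extra information a parser needs, assuming a hypothetical `classify` helper and a set of typedef names collected while parsing:

```ocaml
(* Hypothetical sketch: the statement "T * x;" cannot be classified from its
   tokens alone; the parser must know which names are typedefs. *)
module StringSet = Set.Make (String)

type stmt_kind =
  | Declaration  (* "T * x;" declares x as a pointer to T *)
  | Expression   (* "a * b;" multiplies a by b and discards the result *)

(* Classify a statement of the shape "<first> * <second>;" given the set of
   names currently bound by typedef. *)
let classify (typedefs : StringSet.t) (first : string) : stmt_kind =
  if StringSet.mem first typedefs then Declaration else Expression

let () =
  let typedefs = StringSet.of_list [ "T" ] in
  assert (classify typedefs "T" = Declaration);
  assert (classify typedefs "a" = Expression)
```

Real parsers feed this kind of table back into the lexer or resolve the ambiguity after parsing; either way the grammar alone is not sufficient.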

Some Real-World Observations

“If you do not use CIL and want instead to use just a C parser and analyze programs expressed as abstract-syntax trees then your analysis will have to handle a lot of ugly corners of the language (let alone the fact that parsing C itself is not a trivial task).”

-- authors of CIL (parser for C99 that supports GNU and MS extensions)

http://manju.cs.berkeley.edu/cil/

Some Real-World Observations [2]

“When I (George) started to write CIL I thought it was going to take two weeks. Exactly a year has passed since then and I am still fixing bugs in it. This gross underestimate was due to the fact that I thought parsing and making sense of C is simple. You probably think the same. What I did not expect was how many dark corners this language has, especially if you want to parse real-world programs such as those written for GCC or if you are more ambitious and you want to parse the Linux or Windows NT sources (both of these were written without any respect for the standard and with the expectation that compilers will be changed to accommodate the program).”

-- authors of CIL (parser for C99 that supports GNU and MS extensions)

http://manju.cs.berkeley.edu/cil/

Some Real-World Observations [3]

“I’d rather not touch the translator [for adding instrumentation for the UPC perf. tool interface]. Many of the bugs in the Berkeley UPC compiler are in the translator.”

-- loosely paraphrased, Dan Bonachea (talking about the UPC performance tool interface at the ’05 UPC workshop)

[can check at http://upc-bugs.lbl.gov/bugzilla/ by searching for translator in bug description]

Some Real-World Observations [4]

“Parser development is still a black art.”

-- Paul Klint et al., “Towards an engineering discipline for GRAMMARWARE,” in ACM TOSEM, May 2005.

Some Real-World Observations [5]

89% of the development is directly or indirectly related to writing instrumentation software.

--paraphrase, Luiz DeRose, Bernd Mohr and Kevin London, “Performance Tools 101: Principles of Experimental Performance Measurement and Analysis”, SC2003 Tutorial M-11

Re-Use Compiler Frontend?

- Observation: compilers can correctly parse and analyze C, what if we re-use a compiler frontend?
- Several candidates
  - GCC/GCC-UPC
  - Open64 (used by Berkeley UPC), Trimaran/IMPACT (uses EDG frontend), Zephyr (uses EDG frontend)
  - EDG frontend
- Biggest argument against: complexity
  - GCC-UPC compiler: ~650kloc
  - EDG frontend: ~700kloc
  - Berkeley UPC: takes up 1GB when compiled(!)
- Have spent a lot of time looking at EDG, GCC, and Berkeley frontends
  - Are all very reliable (especially EDG)
  - But, very difficult to modify (too heavyweight)

Re-Use Compiler Frontend? [2]

- Have been some other efforts to reuse compiler frontends
  - GCC-XML (missing function declarations though)
  - BisonXML/gccXfront (have bison output parses in XML)
  - gcc -fdump-translation-unit (used by g4re, is supported by GCC-UPC)
- These methods generally produce extremely large intermediate files
  - For nontrivial code, can be as large as several hundred MBs, even after a reduction phase
  - Example: g4re with Fluxbox source (30kloc): ~500MB intermediate files
- Drawbacks
  - Still need to translate these intermediary files back to C/UPC
  - Intermediate format might change between versions of compiler
  - Format might also depend on names used for grammar terminals and nonterminals (BisonXML)
- Not a very attractive alternative for source instrumentation

Quick Review of Other Options

- PDToolkit (most obvious choice)
  - Used by KOJAK and TAU for source instrumentation
  - Relies on EDG frontend (high quality, robust parser)
- Some disadvantages
  - Current version doesn’t support UPC
  - After parsing by PDToolkit, still relies on scripts to read .PDB file and correctly place instrumentation in user code
  - .PDB files can get large
  - .PDB files alone probably not enough to correctly instrument complicated UPC expressions
  - PDToolkit is a large download (~38MB)
  - Also shares other problems for any project that relies on EDG (will discuss later)

Quick Review of Other Options [2]

- Keystone C++ parser
  - C++-specific parser
  - Has some problems parsing real C++ code (GCC header files), might not work with all C99 code
  - Large code base
- SUIF/SUIF2
  - Older source-to-source compiler infrastructure from Stanford
  - Uses EDG frontend to parse C code
  - Project not updated in several years
- Sage++
  - Older source-to-source compiler infrastructure
  - Project seems to have been abandoned (last update: 1997)
  - Unlikely to support new versions of C (C99 as required by UPC spec)
  - Was used by TAU, but deprecated (PDToolkit now used)

Top 3 Candidates

- After examining many systems and reading many papers, have come up with
  - Cetus
  - EDG frontend
  - CIL
- Will discuss each in more detail in following slides

Cetus

- Source-to-source compilation system written by researchers at Purdue
  - Uses ANTLR parser-generator
  - C grammar is a modified/(similar?) version of the C grammar provided by ANTLR
    http://www.codetransform.com/gcc.html
- Advantages
  - Cetus is written in Java
  - Project seems to be under active development, geared specifically towards source-to-source transformations
- Disadvantages
  - No built-in support for UPC (but UPC grammar a simple extension of C99 grammar)
  - Not clear how robust C parser is (copyright date 1997, probably doesn’t support C99)
  - Java support needed for all platforms (Cray X1 javac?)

Interesting Comments by Cetus Authors

“Documentation for GCC is abundant. The difficulty is that the sheer amount easily overwhelms the user. Generally, we have found that there is a very steep learning curve in modifying GCC, with a big time investment to implement even trivial transformations.”

“Both SUIF and Cetus fall into the category of extensible source-to-source compilers, so at first SUIF looked like the natural choice for our infrastructure. Three main reasons eliminated our pursuit of this option. The first was the perception that the project is no longer active - the last major release was in 2001 and does not appear to have been updated recently. …”

http://paramount.www.ecn.purdue.edu/ParaMount/Cetus/manual/ch07.html

EDG Frontend

- Benefits
  - Contains full-featured C, C++, and UPC parser(!)
  - High-quality commercial front-end for compilers
    - Recursive descent
    - Gives good messages on syntax errors
  - Can understand several compiler extensions (e.g., GCC extensions)
- To use for instrumentation
  - Basic workflow
    - User code is parsed by frontend executable
    - Executable creates intermediary representation in memory (can also store to file)
    - Intermediate format (IL) is converted to executable code or source code by backend
  - So need to do
    - Source -> IL
    - Instrument IL (probably in memory)
    - Instrumented IL -> UPC/C-generating backend

EDG Frontend: Drawbacks

- Frontend is intellectual property of EDG
  - Redistribution of source code not allowed
  - We can only redistribute compiled versions for each platform
  - Implies we cannot support (at all) any platforms that we cannot compile EDG’s frontend on
- Can only be used for noncommercial projects
  - Would be nice to allow vendors to bundle our performance tool along with their UPC compilers
- Code is extremely complex
  - About 700kloc of ANSI C code
  - Manual describing code and IL is over 500 pages long
  - Have to “pay” for added complexity because frontend also supports C++ (which we don’t need)

Best Candidate: CIL

- Source-level analysis and transformation framework for ANSI and C99 C code with GNU and MS compiler extensions
- Heavily tested
  - Successfully parses SPECINT95 benchmarks, the Linux kernel, GIMP, many others
  - Fails only 23 of the GCC torture tests (GCC itself fails 19, out of over 900 tests)
- Advantages
  - Pretty compact code (only about 40kloc for entire system)
  - Ideal for adding static analyses to our performance tool
  - Robust, heavily tested C parser
  - Released under BSD license
- Disadvantages
  - Written in OCaml
  - Have to add UPC extensions to grammar files

Best Candidate: CIL [2]

- Have already made simple modifications to code
  - Added upc_forall statement support
  - Was very easy to modify grammar
  - We should be able to re-use much of GCC-UPC’s YACC grammar
- OCaml easy to learn
  - Modern functional language with elements of imperative languages
  - According to Wikipedia, OCaml commonly used for writing compilers
  - Seems like a language well-suited to the task
- OCaml compiler/interpreter supported on many platforms
  - Consists of ~2.4MB (gzipped) ANSI C code
  - I have compiled and run the compiler and interpreter on our 32-bit Linux, 64-bit Linux (including Altix and Opteron), and Tru64 systems
  - Known to run on FreeBSD, OpenBSD, NetBSD, HPUX, IRIX, Solaris, and many other platforms
    http://caml.inria.fr/ocaml/portability.en.html
  - Ironically seems more portable than Java!
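The upc_forall addition described on this slide amounts to a new statement form in the grammar plus a matching node in the syntax tree. A minimal sketch of that shape, assuming a toy statement type (this is not CIL's actual CABS definition, and `to_source` is a hypothetical printer):

```ocaml
(* Hypothetical toy statement type; NOT CIL's real CABS data structures. *)
type stmt =
  | Expr of string                                   (* expression statement, kept as text *)
  | For of string * string * string * stmt           (* init, condition, step, body *)
  | Upc_forall of string * string * string * string * stmt
      (* init, condition, step, affinity, body: one field more than For *)

(* Print a statement back as C/UPC source text. *)
let rec to_source (s : stmt) : string =
  match s with
  | Expr e -> e ^ ";"
  | For (init, cond, step, body) ->
      Printf.sprintf "for (%s; %s; %s) %s" init cond step (to_source body)
  | Upc_forall (init, cond, step, affinity, body) ->
      Printf.sprintf "upc_forall (%s; %s; %s; %s) %s"
        init cond step affinity (to_source body)

let () =
  print_endline
    (to_source
       (Upc_forall ("i = 0", "i < N", "i++", "&a[i]", Expr "a[i] = 0")))
```

Because upc_forall differs from for only by its fourth (affinity) field, the corresponding grammar rule is a small variation on the existing for rule, which is consistent with the change being easy to make.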

Best Candidate: CIL [3]

- CIL brief overview
  - Parser is a heavily-modified version of FrontC, uses ocamllex and ocamlyacc
  - C code is parsed to a simple intermediate format, CABS, that contains type information, code structure, etc.
  - CABS is converted to simpler subset of C code, CIL, or back to C code
  - Analyses are done with simplified CIL
  - Code can be converted back to C and fed into a C compiler
  - The “cilly” driver script manages this whole process
- How to use CIL
  - Need to retain original source code structure as much as possible (keep optimizations)
  - So, should do parse -> CABS -> instrumented CABS -> C code
  - Can keep around other code (CIL reduction, etc.) if we want to do static analysis later
  - Might also want to investigate keeping CIL reductions…
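The parse -> tree -> instrumented tree -> C code route can be sketched on a toy syntax tree. This is an illustration under stated assumptions only: the `stmt` type stands in for CABS and is not CIL's API, and `probe_begin`/`probe_end` are hypothetical instrumentation calls.

```ocaml
(* Hypothetical stand-in for a parsed syntax tree; NOT CIL's CABS types. *)
type stmt =
  | Expr of string                        (* expression statement, kept as text *)
  | If of string * stmt * stmt option     (* condition, then-branch, optional else *)
  | Block of stmt list

(* Wrap every expression statement in hypothetical probe calls. *)
let rec instrument (s : stmt) : stmt =
  match s with
  | Expr e -> Block [ Expr "probe_begin()"; Expr e; Expr "probe_end()" ]
  | If (cond, then_s, None) -> If (cond, instrument then_s, None)
  | If (cond, then_s, Some else_s) ->
      If (cond, instrument then_s, Some (instrument else_s))
  | Block ss -> Block (List.map instrument ss)

(* Print the tree back as C, always bracing if/else branches so that added
   statements stay under the branch they belong to. *)
let rec to_c (s : stmt) : string =
  match s with
  | Expr e -> e ^ ";"
  | Block ss -> "{ " ^ String.concat " " (List.map to_c ss) ^ " }"
  | If (cond, then_s, None) -> "if (" ^ cond ^ ") " ^ braced then_s
  | If (cond, then_s, Some else_s) ->
      "if (" ^ cond ^ ") " ^ braced then_s ^ " else " ^ braced else_s

and braced (s : stmt) : string =
  match s with
  | Block _ -> to_c s
  | _ -> "{ " ^ to_c s ^ " }"

(* Source form: if (x > 0) y = x; else y = 0;   (no braces) *)
let example = If ("x > 0", Expr "y = x", Some (Expr "y = 0"))

let () = print_endline (to_c (instrument example))
```

Working on the tree avoids the unbraced if/else pitfall: inserting probe calls as text next to `y = x;` would detach the else branch, whereas the tree-based pass emits each branch as a braced block.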

Some OCaml Examples

- Hello, world

    print_endline "Hello world!";;

- Factorial

    (* Assumes n >= 0; a negative argument would recurse forever. *)
    let rec fact = function
      | 0 -> 1
      | n -> n * fact (n - 1);;

- Quicksort

    (* Uses the head of the list as the pivot. *)
    let rec quicksort = function
      | [] -> []
      | head :: tail ->
          let left, right = List.partition (function x -> x < head) tail in
          (quicksort left) @ head :: (quicksort right);;

- Full tutorial available at http://www.ocaml-tutorial.org/

Conclusions

- Can’t afford to shortchange our source instrumentor
  - Instrumentation system vital to our tool’s success
  - Source instrumentation necessary until we can get binary and library/compiler instrumentation going at full speed
- Writing source instrumentation system is challenging
  - Must work on any C99 C or UPC code (compiler extensions nice)
  - Must not have bugs
    - Bugs == aggravated users
  - Empirical evidence shows we should not take this task lightly
- Should reuse existing systems as much as possible
  - Don’t want to waste all our time writing a good source parser and analyzer!
  - CIL looks like best bet right now
  - EDG is a good fallback option if CIL gives us problems (but preliminary experiences have been very positive)
