Extensibility Study Report:
Source-Level Instrumentation
Adam Leko
9/24/2005
UPC Group
HCS Research Laboratory
University of Florida
2
Instrumentation Levels (Review)
  Source
    Most flexible, can retain source-code correlation
    Tends to be least accurate, can impede compiler optimizations
  Binary (object code)
    More accurate, might not require any recompilation for user
    Hard to do, tends to raise platform-specific issues, hard to relate data back to source code
  Compiler/runtime
    Potentially most accurate, compiler can do program transformations and still correctly instrument
    Requires lots of cooperation with compiler developers
  Operating system
    Not useful unless code relies on lots of system calls
3
Need for Source Instrumentation
  Ideal case
    Avoid source instrumentation
    Keep all possible optimizations (binary and compiler/runtime instrumentation)
    Still relate data to source code at function and line-number levels
  However,
    Compiler/runtime instrumentation (UPC) will take time to get going
    Instrumenting libraries only (SHMEM) limits data that can be collected
    Binary instrumentation impossible on some platforms
  Need source-level instrumentation
    Can get higher-level source information
    Serves as a preliminary instrumentation technique for UPC+SHMEM code now
4
Automatic Source Instrumentation
  Based on our tool evaluations, found that
    Automatic instrumentation necessary for tools
    Problems with automatic instrumentation can cause frustration and decrease confidence in tool (SvPablo)
  Need a high-quality “preprocessor” that can
    Take UPC/SHMEM code
    Instrument it based on what analysis a user wants
    Hand instrumented code to compiler
  Can either write a parser from scratch, or use an existing system
  Requirements for instrumentation system
    Accurate instrumentation
    Works with any valid C99/UPC program
  Shouldn’t this be easy?
5
Source Instrumentation Challenges
  Possible simple route: scan source code for tokens that look like shared accesses, add instrumentation
  Problems
    Macro expansion (can pass through cpp before)
    Scoping rules (need a good symbol table)
    “Implicit” communication (shared variables aren’t treated any differently in UPC syntax)
    if/else statements without braces, varargs with printf, interaction with gotos and case statements, …
  Trying to instrument code without a full-blown parser will result in buggy code!
    C syntax cannot be described with a simple finite state machine (regular expressions, etc.)
    Need a context-free grammar (parser) to correctly interpret and instrument source code
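The brace-less if/else problem can be made concrete with a small sketch. A token scanner that pastes a probe in front of `x = 1;` inside `if (i == 0) x = 1; else y = 2;` produces `if (i == 0) probe("x"); x = 1; else y = 2;`, which no longer compiles. Working on a tree avoids this, because the probe and the original statement can be wrapped in a block. Everything below is a toy under stated assumptions: the `stmt` type, the `probe` call, and the list of shared variable names are hypothetical stand-ins for a real C/UPC parse tree and symbol table.

```ocaml
(* Toy statement tree; hypothetical and far simpler than real C/UPC. *)
type stmt =
  | Assign of string * string          (* lhs = rhs; *)
  | Probe of string                    (* inserted call: probe("name"); *)
  | If of string * stmt * stmt option  (* condition, then-branch, optional else *)
  | Block of stmt list                 (* { ... } *)

(* Put a probe in front of every assignment to a name in [shared].
   The probe and the assignment are wrapped in a Block, so a brace-less
   if/else branch stays a single statement. *)
let rec instrument shared s =
  match s with
  | Assign (v, _) when List.mem v shared -> Block [Probe v; s]
  | Assign _ | Probe _ -> s
  | If (c, t, e) ->
      If (c, instrument shared t, Option.map (instrument shared) e)
  | Block ss -> Block (List.map (instrument shared) ss)

(* Print the tree back out as C text on one line. *)
let rec to_c = function
  | Assign (v, e) -> v ^ " = " ^ e ^ ";"
  | Probe v -> "probe(\"" ^ v ^ "\");"
  | If (c, t, None) -> "if (" ^ c ^ ") " ^ to_c t
  | If (c, t, Some e) -> "if (" ^ c ^ ") " ^ to_c t ^ " else " ^ to_c e
  | Block ss -> "{ " ^ String.concat " " (List.map to_c ss) ^ " }"

(* if (i == 0) x = 1; else y = 2;   with x assumed to be shared *)
let example = If ("i == 0", Assign ("x", "1"), Some (Assign ("y", "2")))

let () = print_endline (to_c (instrument ["x"] example))
(* prints: if (i == 0) { probe("x"); x = 1; } else y = 2; *)
```

The sketch sidesteps the other listed problems (macros, scoping, implicit communication) by taking the set of shared names as an argument; in a real tool that information has to come from the parser's symbol table.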
6
Writing Parser From Scratch
  Tools that can make this easier
    flex/bison/yacc
    ANTLR
    (many more)
  Writing grammar for C relatively easy
    A few ambiguities in C grammar, like expression vs. declaration problem
    Can get around these with GLR parsers, or other tricks
  Grammar isn’t everything…
    Once you have parse tree, you still need to correctly interpret it
    Reporting user-friendly parse errors is also difficult unless you have a recursive-descent parser, which takes a long time to write
    Supporting compiler extensions to C syntax can be difficult
  Should avoid writing our own parser!
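The expression vs. declaration problem can be illustrated with a short sketch. In C, the statement `a * b;` declares `b` as a pointer to `a` when `a` names a typedef in scope, and is otherwise a multiplication whose result is discarded, so the grammar alone cannot decide between the two. The `reading` type and `classify` function below are hypothetical names used only for illustration; a real parser would consult its symbol table while parsing, or use a GLR parser and resolve the ambiguity afterwards.

```ocaml
(* Toy illustration only: decide how to read the C statement "a * b;".
   The type and function names here are hypothetical. *)
module Names = Set.Make (String)

type reading =
  | Declaration of string * string     (* declares b as a pointer to type a *)
  | Multiplication of string * string  (* multiplies variable a by b *)

(* The same three tokens parse differently depending on whether [a]
   is a typedef name currently in scope. *)
let classify typedefs a b =
  if Names.mem a typedefs then Declaration (a, b) else Multiplication (a, b)

let () =
  let typedefs = Names.of_list ["size_t"] in
  assert (classify typedefs "size_t" "p" = Declaration ("size_t", "p"));
  assert (classify typedefs "n" "p" = Multiplication ("n", "p"))
```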
7
Some Real-World Observations
“If you do not use CIL and want instead to use just a C parser and analyze programs expressed as abstract-syntax trees then your analysis will have to handle a lot of ugly corners of the language (let alone the fact that parsing C itself is not a trivial task).”
-- authors of CIL (parser for C99 that supports GNU and MS extensions)
http://manju.cs.berkeley.edu/cil/
8
Some Real-World Observations [2]
“When I (George) started to write CIL I thought it was going to take two weeks. Exactly a year has passed since then and I am still fixing bugs in it. This gross underestimate was due to the fact that I thought parsing and making sense of C is simple. You probably think the same. What I did not expect was how many dark corners this language has, especially if you want to parse real-world programs such as those written for GCC or if you are more ambitious and you want to parse the Linux or Windows NT sources (both of these were written without any respect for the standard and with the expectation that compilers will be changed to accommodate the program).”
-- authors of CIL (parser for C99 that supports GNU and MS extensions)
http://manju.cs.berkeley.edu/cil/
9
Some Real-World Observations [3]
“I’d rather not touch the translator [for adding instrumentation for the UPC perf. tool interface]. Many of the bugs in the Berkeley UPC compiler are in the translator.”
-- loosely paraphrased, Dan Bonachea (talking about the UPC performance tool interface at the ’05 UPC workshop)
[can check at http://upc-bugs.lbl.gov/bugzilla/ by searching for translator in bug description]
10
Some Real-World Observations [4]
“Parser development is still a black art.”
-- Paul Klint et al., “Towards an engineering discipline for GRAMMARWARE,” in ACM TOSEM, May 2005.
11
Some Real-World Observations [5]
89% of the development is directly or indirectly related to writing instrumentation software.
--paraphrase, Luiz DeRose, Bernd Mohr and Kevin London, “Performance Tools 101: Principles of Experimental Performance Measurement and Analysis”, SC2003 Tutorial M-11
12
Re-Use Compiler Frontend?
  Observation: compilers can correctly parse and analyze C, what if we re-use a compiler frontend?
  Several candidates
    GCC/GCC-UPC
    Open64 (used by Berkeley UPC), Trimaran/IMPACT (uses EDG frontend), Zephr (uses EDG frontend)
    EDG frontend
  Biggest argument against: complexity
    GCC-UPC compiler: ~650kloc
    EDG frontend: ~700kloc
    Berkeley UPC: takes up 1GB when compiled(!)
  Have spent a lot of time looking at EDG, GCC, and Berkeley frontends
    Are all very reliable (especially EDG)
    But, very difficult to modify (too heavyweight)
13
Re-Use Compiler Frontend? [2]
  Have been some other efforts to reuse compiler frontends
    GCC-XML (missing function declarations though)
    BisonXML/gccXfront (have bison output parses in XML)
    gcc -fdump-translation-unit (used by g4re, is supported by GCC-UPC)
  These methods generally produce extremely large intermediate files
    For nontrivial code, can be as large as several hundred MBs, even after a reduction phase
    Example: g4re with Fluxbox source (30kloc): ~500MB intermediate files
  Drawbacks
    Still need to translate these intermediary files back to C/UPC
    Intermediate format might change between versions of compiler
    Format might also depend on names used for grammar terminals and nonterminals (BisonXML)
  Not a very attractive alternative for source instrumentation
14
Quick Review of Other Options
  PDToolkit (most obvious choice)
    Used by KOJAK and TAU for source instrumentation
    Relies on EDG frontend (high quality, robust parser)
    Some disadvantages
      Current version doesn’t support UPC
      After parsing by PDToolkit, still relies on scripts to read .PDB file and correctly place instrumentation in user code
      .PDB files can get large
      .PDB files alone probably not enough to correctly instrument complicated UPC expressions
      PDToolkit is a large download (~38MB)
    Also shares the problems of any project that relies on EDG (will discuss later)
15
Quick Review of Other Options [2]
  Keystone C++ parser
    C++-specific parser
    Has some problems parsing real C++ code (GCC header files), might not work with all C99 code
    Large code base
  SUIF/SUIF2
    Older source-to-source compiler infrastructure from Stanford
    Uses EDG frontend to parse C code
    Project not updated in several years
  Sage++
    Older source-to-source compiler infrastructure
    Project seems to have been abandoned (last update: 1997)
    Unlikely to support new versions of C (C99 as required by UPC spec)
    Was used by TAU, but deprecated (PDToolkit now used)
16
Top 3 Candidates
  After examining many systems and reading many papers, have come up with
    Cetus
    EDG frontend
    CIL
  Will discuss each in more detail in following slides
17
Cetus
  Source-to-source compilation system written by researchers at Purdue
  Uses ANTLR parser-generator
    C grammar is a modified/(similar?) version of the C grammar provided by ANTLR
    http://www.codetransform.com/gcc.html
  Advantages
    Cetus is written in Java
    Project seems to be under active development, geared specifically towards source-to-source transformations
  Disadvantages
    No built-in support for UPC (but UPC grammar a simple extension of C99 grammar)
    Not clear how robust C parser is (copyright date 1997, probably doesn’t support C99)
    Java support needed for all platforms (Cray X1 javac?)
18
Interesting Comments by Cetus Authors
“Documentation for GCC is abundant. The difficulty is that the sheer amount easily overwhelms the user. Generally, we have found that there is a very steep learning curve in modifying GCC, with a big time investment to implement even trivial transformations.”
“Both SUIF and Cetus fall into the category of extensible source-to-source compilers, so at first SUIF looked like the natural choice for our infrastructure. Three main reasons eliminated our pursuit of this option. The first was the perception that the project is no longer active - the last major release was in 2001 and does not appear to have been updated recently. …”
http://paramount.www.ecn.purdue.edu/ParaMount/Cetus/manual/ch07.html
19
EDG Frontend
  Benefits
    Contains full-featured C, C++, and UPC parser(!)
    High-quality commercial front-end for compilers
      Recursive descent
      Gives good messages on syntax errors
    Can understand several compiler extensions (e.g., GCC extensions)
  To use for instrumentation
    Basic workflow
      User code is parsed by frontend executable
      Executable creates intermediary representation in memory (can also store to file)
      Intermediate format (IL) is converted to executable code or source code by backend
    So need to do
      Source -> IL
      Instrument IL (probably in memory)
      Instrumented IL -> UPC/C-generating backend
20
EDG Frontend: Drawbacks
  Frontend is intellectual property of EDG
    Redistribution of source code not allowed
      We can only redistribute compiled versions for each platform
      Implies we cannot support (at all) any platforms that we cannot compile EDG’s frontend on
    Can only be used for noncommercial projects
      Would be nice to allow vendors to bundle our performance tool along with their UPC compilers
  Code is extremely complex
    About 700kloc of ANSI C code
    Manual describing code and IL is over 500 pages long
    Have to “pay” for added complexity because frontend also supports C++ (which we don’t need)
21
Best Candidate: CIL
  Source-level analysis and transformation framework for ANSI and C99 C code with GNU and MS compiler extensions
  Heavily tested
    Successfully parses SPECINT95 benchmarks, the Linux kernel, GIMP, many others
    Fails only 23 of the 900+ GCC torture tests (GCC itself fails 19)
  Advantages
    Pretty compact code (only about 40kloc for entire system)
    Ideal for adding static analyses to our performance tool
    Robust, heavily tested C parser
    Released under BSD license
  Disadvantages
    Written in OCaml
    Have to add UPC extensions to grammar files
22
Best Candidate: CIL [2]
  Have already made simple modifications to code
    Added upc_forall statement support
    Was very easy to modify grammar
    We should be able to re-use much of GCC-UPC’s YACC grammar
  OCaml easy to learn
    Modern functional language with elements of imperative languages
    According to Wikipedia, OCaml commonly used for writing compilers
    Seems like a language well-suited to the task
  OCaml compiler/interpreter supported on many platforms
    Consists of ~2.4MB (gzipped) ANSI C code
    I have compiled and run the compiler and interpreter on our 32-bit Linux, 64-bit Linux (including Altix and Opteron), and Tru64 systems
    Known to run on FreeBSD, OpenBSD, NetBSD, HPUX, IRIX, Solaris, and many other platforms
      http://caml.inria.fr/ocaml/portability.en.html
    Ironically seems more portable than Java!
23
Best Candidate: CIL [3]
  CIL brief overview
    Parser is a heavily-modified version of FrontC, uses ocamllex and ocamlyacc
    C code is parsed to a simple intermediate format, CABS, that contains type information, code structure, etc.
    CABS is converted to simpler subset of C code, CIL, or back to C code
    Analyses are done with simplified CIL
    Code can be converted back to C and fed into a C compiler
    The “cilly” driver script manages this whole process
  How to use CIL
    Need to retain original source code structure as much as possible (keep optimizations)
    So, should do parse -> CABS -> instrumented CABS -> C code
    Can keep around other code (CIL reduction, etc.) if we want to do static analysis later
    Might also want to investigate keeping CIL reductions…
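The "parse -> CABS -> instrumented CABS -> C code" route can be sketched with a toy tree. This is not CIL's actual CABS data type or API: the `stmt` and `func` types, and the `tool_enter`/`tool_exit` probe names, are hypothetical and exist only to show the shape of a tree-to-tree instrumentation pass followed by printing back to C. The hand-built value `work` stands in for what a parser would produce.

```ocaml
(* Toy stand-in for a parsed function; NOT CIL's real CABS types. *)
type stmt =
  | Stmt of string   (* any non-return statement, kept as text *)
  | Return           (* return; *)

type func = { name : string; body : stmt list }

(* Instrumentation pass: add an entry probe at the top and an exit probe
   before every return (and at the end if the body does not end in one).
   tool_enter/tool_exit are hypothetical measurement calls. *)
let instrument f =
  let enter = Stmt ("tool_enter(\"" ^ f.name ^ "\");") in
  let exit_ = Stmt ("tool_exit(\"" ^ f.name ^ "\");") in
  let body =
    List.concat_map (function Return -> [exit_; Return] | s -> [s]) f.body
  in
  let ends_with_return =
    match List.rev f.body with Return :: _ -> true | _ -> false
  in
  { f with body = (enter :: body) @ (if ends_with_return then [] else [exit_]) }

(* Printer: turn the (instrumented) tree back into C text. *)
let stmt_to_c = function Stmt s -> s | Return -> "return;"

let to_c f =
  "void " ^ f.name ^ "(void) { "
  ^ String.concat " " (List.map stmt_to_c f.body)
  ^ " }"

(* Stands in for the parser output of: void work(void) { x = 1; return; } *)
let work = { name = "work"; body = [Stmt "x = 1;"; Return] }

let () = print_endline (to_c (instrument work))
(* prints: void work(void) { tool_enter("work"); x = 1; tool_exit("work"); return; } *)
```

Because the pass only adds statements and leaves the rest of the tree alone, the printed code keeps the original structure, which is the reason given above for instrumenting before any reduction to simplified CIL.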
24
Some OCaml Examples
  Hello, world
    print_endline "Hello world!";;
  Factorial
    let rec fact = function
      | 0 -> 1
      | n -> n * fact (n - 1);;
  Quicksort
    let rec quicksort = function
      | [] -> []
      | head :: tail ->
          let left, right = List.partition (function x -> x < head) tail in
          (quicksort left) @ head :: (quicksort right);;
  Full tutorial available at http://www.ocaml-tutorial.org/
25
Conclusions
  Can’t afford to shortchange our source instrumentor
    Instrumentation system vital to our tool’s success
    Source instrumentation necessary until we can get binary and library/compiler instrumentation going at full speed
  Writing source instrumentation system is challenging
    Must work on any C99 C or UPC code (compiler extensions nice)
    Must not have bugs
      Bugs == aggravated users
    Empirical evidence shows we should not take this task lightly
  Should reuse existing systems as much as possible
    Don’t want to waste all our time writing a good source parser and analyzer!
    CIL looks like best bet right now
    EDG is a good fallback option if CIL gives us problems (but preliminary experiences have been very positive)