
Transcript of Semantic Scaffolds for Pseudocode-to-Code...


Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2283–2295, July 5–10, 2020. ©2020 Association for Computational Linguistics


Semantic Scaffolds for Pseudocode-to-Code Generation

Ruiqi Zhong  Mitchell Stern  Dan Klein
Computer Science Division
University of California, Berkeley
{ruiqi-zhong,mitchell,klein}@berkeley.edu

Abstract

We propose a method for program generation based on semantic scaffolds, lightweight structures representing the high-level semantic and syntactic composition of a program. By first searching over plausible scaffolds then using these as constraints for a beam search over programs, we achieve better coverage of the search space when compared with existing techniques. We apply our hierarchical search method to the SPoC dataset for pseudocode-to-code generation, in which we are given line-level natural language pseudocode annotations and aim to produce a program satisfying execution-based test cases. By using semantic scaffolds during inference, we achieve a 10% absolute improvement in top-100 accuracy over the previous state-of-the-art. Additionally, we require only 11 candidates to reach the top-3000 performance of the previous best approach when tested against unseen problems, demonstrating a substantial improvement in efficiency.

1 Introduction

Systems that can map from natural language descriptions of tasks or programs to executable code have the potential for great societal impact, helping to bridge the gap between non-expert users and basic automation or full-fledged software development. Accordingly, this area of research has garnered significant interest in recent years, with systems being devised for the translation of natural language specifications into database queries (Wang et al., 2018), if-then programs (Chen et al., 2016), game elements (Ling et al., 2016), and more.

While much of the prior work in executable semantic parsing involves short descriptions being mapped into single-line programs, some tasks have recently been proposed that involve multiple natural language utterances on the input side and full programs on the output side, often reaching tens of

Line  Pseudocode               Code
1     in function main         int main() {
2     n is a long integer 0    long n = 0;
3     while n is less than o   while (n < 'o') {
4     ...                      ...
5     close while scope        }

Other wrong candidates for line 3: "while (n < o) {" (error: use of undeclared identifier 'o'); "while (n < 'o')" (error: missing '{').

Figure 1: Pseudocode is translated to code for each line and combined to form a valid program. Certain combinations are invalid due to syntactic and semantic constraints.

lines in length and including non-trivial state manipulation. Examples include the Magic the Gathering and Hearthstone datasets (Ling et al., 2016) derived from trading cards and Java or Python classes implementing their behavior in a game engine, the CONCODE dataset (Iyer et al., 2018) consisting of Java documentation strings and method bodies, and the NAPS and SPoC datasets (Zavershynskyi et al., 2018; Kulal et al., 2019) consisting of pseudocode annotations and source code for programming competition problems.

Past approaches to these large-scale language-to-code tasks have typically employed sequence-based models (Ling et al., 2016) that do not account for structure on the output side, or tree-based models (Allamanis et al., 2015; Rabinovich et al., 2017a; Yin and Neubig, 2017; Hayati et al., 2018; Iyer et al., 2019) that incorporate the syntax but not the semantics of the output domain. However, if we want to generate programs that can be executed successfully, the inclusion of both syntactic and semantic constraints is crucial. As shown in Figure 1, while multiple program fragments may be syntactically correct and represent plausible translations of the corresponding pseudocode, not all of them will lead to executable programs.

To address this, we propose a search procedure based on semantic scaffolds, lightweight


summaries of higher-level program structure that include both syntactic information as well as semantic features such as variable declarations and scope constraints. See Section 3 for a more formal definition. While these do not encode the full spectrum of constraints used in some formal program synthesis tools (Solar-Lezama, 2009; Gulwani et al., 2017), they strike a balance between utility, speed, and ease of use, offering substantial improvements in system performance without a significant increase in complexity.

In this work we focus on the Search-based Pseudocode to Code (SPoC) dataset (Kulal et al., 2019) due to its challenging multiline programs and availability of input-output test suites to evaluate denotation accuracy. The dataset contains line-level pseudocode annotations for 18,356 C++ programs provided by crowdsource workers from Amazon Mechanical Turk. As in the approach of Kulal et al. (2019), we first obtain candidate code fragments for each line using an off-the-shelf neural machine translation system. We then aim to find the highest-scoring combination of fragments that results in a valid program. Although finding the optimal program under this setting is NP-hard when variable usage constraints are introduced (see Section A.3), we can approximate it with a hierarchical beam search. Our algorithm first searches for semantic scaffolds for the program, then assembles fragments together conditioned on these scaffolds. This hierarchical approach speeds up search, produces higher quality variations, and leads to substantial improvements in our system's final accuracy.

We achieve a new state-of-the-art by solving 55.1% of the test cases within 100 attempts. This represents a 10.4% absolute improvement over the previous best (Kulal et al., 2019), and reaches 81% of our model's oracle performance. When tested against unseen problems (or crowd-workers), our top 11 (or top 52, respectively) candidates have the same performance as their top 3000 candidates, demonstrating marked gains in efficiency.

We complement our results with a discussion of specific cases in which our semantic scaffolds use global program context to resolve ambiguities in the pseudocode. We also conduct a manual error analysis of 200 failures to better characterize the limitations of our method and suggest possible extensions for future work.

Our contributions are summarized as follows:

• We propose the use of semantic scaffolds to add semantic constraints to models for long-form language-to-code generation tasks.

• We introduce a hierarchical beam search algorithm that incorporates these constraints, resulting in heightened efficiency, better coverage of the search space, and stronger performance when compared with the standard approach.

• We achieve a new state-of-the-art accuracy of 55.1% on the SPoC pseudocode-to-code dataset.

2 Pseudocode-to-Code Task

In this work, we focus on the SPoC dataset introduced by Kulal et al. (2019).

2.1 Data

This dataset consists of C++ solutions to problems from Codeforces, a competitive programming website, along with the input-output test cases used for each problem to evaluate correctness. It contains 18,356 programs in total with 14.7 lines per program on average. Each line is annotated with a natural language pseudocode description given by a crowd worker from Amazon Mechanical Turk. On average, there are 7.86 tokens per line of code and 9.08 tokens per pseudocode annotation. From the full dataset, 1,752 programs with annotations from unseen crowd workers and 1,820 programs for unseen problems are held out for evaluation. More details can be found in Kulal et al. (2019).

2.2 Task

Suppose the target program has L lines. For each line l ∈ [L], we are given a natural language pseudocode annotation x_l and an indentation level i_l. Our goal is to find a candidate program y based on (x_1, i_1), . . . , (x_L, i_L) that can solve the given problem (i.e. pass all the test cases) using as few submission attempts as possible. The search efficiency of an algorithm is calculated as the fraction of problems it can solve using a budget of B attempts per problem, where an attempt includes both compiling a candidate program and running the test cases.

As in Kulal et al. (2019), for each pseudocode line x_l, we use an off-the-shelf neural machine translation system to obtain a set of C candidate code pieces Y_l = {y_lc | c ∈ [C]}, where candidate code piece y_lc has probability p_lc. A full candidate


program y is a concatenation of candidate code pieces, one per line, and has score p(y):

y = \mathrm{concat}_{l=1}^{L}\, y_{lc_l}, \qquad p(y) = \prod_{l=1}^{L} p_{lc_l}.   (1)

We aim to find valid high-scoring programs in our search procedure.
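To make Equation 1 concrete, here is a minimal sketch in Python (our own illustration, not part of the SPoC release; the candidate probabilities below are made up) that scores a full program assembled by picking one candidate index per line:

    from typing import List

    def program_score(line_probs: List[List[float]], choice: List[int]) -> float:
        # p(y) = product over lines l of p_{l, c_l} (Equation 1)
        score = 1.0
        for l, c in enumerate(choice):
            score *= line_probs[l][c]
        return score

    # Example: 3 lines, 2 candidates per line; pick candidates 0, 1, 0.
    probs = [[0.6, 0.4], [0.7, 0.3], [0.9, 0.1]]
    print(program_score(probs, [0, 1, 0]))  # 0.6 * 0.3 * 0.9 = 0.162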

3 Combination Constraints

Kulal et al. (2019) propose best-first search as a baseline, which enumerates all complete candidate programs in descending order by score. Using a priority queue, this algorithm can efficiently find the exact top B highest scoring candidates in time O(L log(BL)) per candidate.
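The following is a minimal sketch of such a best-first enumeration (our own illustration rather than the authors' implementation; it assumes the candidates for each line are sorted in descending order of probability, and uses log-probabilities to avoid numerical underflow):

    import heapq
    import math

    def best_first(line_probs, budget):
        # Enumerate index tuples (one candidate per line) in descending score order.
        L = len(line_probs)

        def cost(idx):
            return -sum(math.log(line_probs[l][c]) for l, c in enumerate(idx))

        start = (0,) * L
        heap = [(cost(start), start)]
        seen = {start}
        out = []
        while heap and len(out) < budget:
            _, idx = heapq.heappop(heap)
            out.append(idx)
            for l in range(L):  # successors: demote one line to its next-best candidate
                if idx[l] + 1 < len(line_probs[l]):
                    nxt = idx[:l] + (idx[l] + 1,) + idx[l + 1:]
                    if nxt not in seen:
                        seen.add(nxt)
                        heapq.heappush(heap, (cost(nxt), nxt))
        return out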

However, this approach ignores any dependence between different lines. For example, any of the code piece candidates in Figure 1 could potentially be used in a valid program, but if we naively combine certain subsets of candidates together, the resulting program will be invalid due to the use of undeclared variables or mismatching braces. To solve this problem, we propose to enforce certain syntactic and semantic constraints when combining candidate code pieces.

3.1 Syntactic Constraints

The candidate program should adhere to the grammatical specification of the target language. However, since incorporating the complete set of C++ grammatical constraints would require significant engineering effort, we instead restrict our attention to the set of "primary expressions" consisting of high-level control structures such as if, else, for loops, function declarations, etc. As shown in Figure 2, we parse the candidate code pieces for each line into a list of primary expression symbols. In order for code pieces from consecutive lines to be used together, there must exist a grammatical derivation that combines their respective symbols. The complete list of primary expressions can be found in the appendix; see Tables 6 and 7.

Additionally, some production rules are associated with the start or end of a variable scope block. We require that the number of open scope blocks equals the indentation level i_l for each line l.
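As a rough sketch of what this extraction step might look like (our own simplified illustration; the released implementation uses a more careful parser over the grammar in Tables 6 and 7):

    import re

    def primary_symbols(code_piece):
        # Map a code piece to a coarse list of primary expression symbols.
        symbols = []
        head = code_piece.strip()
        body = head.strip("{} \t")  # content other than scope braces
        if body:
            for kw in ("for", "while", "else if", "if", "else", "do"):
                if re.match(rf"{re.escape(kw)}\b", body):
                    symbols.append(kw)
                    break
            else:
                symbols.append("terminal_stmt")
        # Opening/closing braces start/end variable scopes; the number of open
        # scopes must match the annotated indentation level i_l.
        symbols += ["start"] * head.count("{") + ["end"] * head.count("}")
        return symbols

    print(primary_symbols("for (int i = 0; i < n; i++) {"))  # ['for', 'start']
    print(primary_symbols("long n = 0;"))                    # ['terminal_stmt']
    print(primary_symbols("}"))                              # ['end']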

3.2 Symbol Table Constraints

Each scope block is associated with a symbol table (Aho et al., 1986) keeping track of the variables that have been declared within that scope or any

containing scopes. We extract the variable names used or declared by each code piece (Figure 3) and ensure that (1) undeclared variables are not used, and (2) variables are not redeclared within the same scope. After checking these constraints, any variables declared by a given code piece will be added to the symbol table associated with the current scope.
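A minimal sketch of this check (our own illustration; extracting the declared and used names from a code piece is assumed to happen elsewhere):

    def symtable_ok(scopes, declared, used):
        # scopes: list of sets of declared variable names, innermost scope last.
        # Returns whether a code piece declaring `declared` and using `used` is
        # valid, and if so records the new declarations in the innermost scope.
        visible = set().union(*scopes)
        if any(v not in visible and v not in declared for v in used):
            return False                 # (1) no use of undeclared variables
        if any(v in scopes[-1] for v in declared):
            return False                 # (2) no redeclaration within the same scope
        scopes[-1].update(declared)
        return True

    scopes = [set()]                                             # file scope
    print(symtable_ok(scopes, declared={"n"}, used=set()))       # long n = 0;    -> True
    print(symtable_ok(scopes, declared=set(), used={"n", "o"}))  # while (n < o)  -> False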

These symbol table constraints are based on the semantic information of code pieces and are fundamentally different from previous AST-based syntactic constraints for code generation (Rabinovich et al., 2017b; Yin and Neubig, 2017). Formally, any context free grammar that specifies the same constraints requires at least exponential description complexity. We provide a proof adapted from Ellul et al. (2005) in Appendix A.2.

3.3 Syntactic and Semantic Scaffolds

We note two properties of the aforementioned constraints. First, we can efficiently compute whether a program prefix can possibly lead to a full program that satisfies the constraints by using an incremental parser (Ghezzi and Mandrioli, 1979) and checking the symbol tables. Secondly, not all information from a code piece is necessary to verify the constraints. Accordingly, when multiple code piece candidates have the same primary expression symbols and variable declarations and usage, swapping between them would not affect the satisfiability of the constraints. For example, changing from a += 1 to a -= 1 will not change a compilable program into a non-compilable one, or vice versa. These two properties will help motivate the hierarchical beam search algorithm introduced in the next section.

More formally, we take the configuration φ(y_lc) of a line y_lc to be the minimal set of features required to verify the above constraints. The prefix scaffold S_{y,l} = [φ(y_{1c_1}), φ(y_{2c_2}), . . . , φ(y_{lc_l})] of a program y then contains all the information needed to verify the constraints for the first l lines. We can efficiently compute whether S_{y,l} is a valid prefix scaffold when l < L and whether S_{y,L} is a valid scaffold for a full program when l = L.¹

¹ To keep notation uncluttered, we sometimes use φ to denote a configuration, we ignore the subscript y of S when we refer to a general scaffold that is not necessarily associated with a specific program, and we ignore the subscript l = L of S when we refer to the scaffold of a full program.
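A minimal sketch of how a configuration might be represented in code (our own illustration; the exact feature set in the released implementation may differ):

    from dataclasses import dataclass
    from typing import FrozenSet, Tuple

    @dataclass(frozen=True)
    class Config:
        # phi(y_lc): the minimal information needed to check the constraints
        primary_exprs: Tuple[str, ...]   # e.g. ("while", "start")
        declared: FrozenSet[str]         # variables declared by this code piece
        used: FrozenSet[str]             # variables used by this code piece

    # Two code pieces with identical configurations are interchangeable with
    # respect to the constraints, e.g. "a += 1;" and "a -= 1;".
    c1 = Config(("terminal_stmt",), frozenset(), frozenset({"a"}))
    c2 = Config(("terminal_stmt",), frozenset(), frozenset({"a"}))
    assert c1 == c2
    # A prefix scaffold is then simply a list of Configs, one per line.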


[Figure 2: Example primary expression grammar. (a) Code pieces are parsed into primary expression symbols; (b) production rules of the primary expression grammar; (c) abstract syntax tree of the code piece combination. Subscripts "start/end" refer to starting/ending variable scopes.]

[Figure 3: Extracting variables used or declared at each scope for a given code piece to verify the symbol table constraints. Example: given the program prefix "const int N = 35; int main() { int n, h[N], count;" and the next line "for (int i = 0; i < n; i++) count++;", variable i is declared in the third scope and variables i, n, count are used in the third scope; the symbol tables per scope are: file scope {N}, main() scope {n, h, count}, for() scope {i}.]

4 Constrained Search

4.1 Beam Search

Our goal is to find the top B highest-scoring candidate programs that satisfy the aforementioned constraints. Unfortunately, finding whether even one solution exists is NP-hard (proof given in Section A.3). One way we can approximate the solution is to use a standard beam search. The beam maintains a list of hypothesis program prefixes along with their respective scores. We extend the beam by adding the candidate code pieces from the next line to each candidate program prefix if they form valid combinations under the constraints, then prune the hypotheses with scores outside of the top W. The algorithm ends after L steps, returning all the valid hypotheses in the final beam.
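A minimal sketch of this constrained beam search (our own illustration; extend_ok is a hypothetical stand-in for the incremental syntactic and symbol table checks described in Section 3):

    def beam_search(candidates, extend_ok, W):
        # candidates[l]: list of (code_piece, prob) for line l.
        # extend_ok(prefix, piece): incremental syntactic + symbol table check.
        beam = [([], 1.0)]                        # (program prefix, score)
        for line in candidates:
            extended = [(prefix + [piece], score * p)
                        for prefix, score in beam
                        for piece, p in line
                        if extend_ok(prefix, piece)]
            beam = sorted(extended, key=lambda h: -h[1])[:W]  # keep the top W prefixes
        return beam                               # valid full programs with scores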

4.2 Scaffold Search

Although beam search can approximate the top B solutions, the time complexity of beam search grows quadratically with the beam width W. Finding the top B candidates requires that W ≥ B, and hence each candidate takes Ω(BL) (amortized) time to generate, which can become intractable if B is on the order of thousands. Even worse, beam search is often biased towards variations at the end of the program due to its greedy decisions, and can waste its budget on candidates that are unlikely to be the correct solution.

This is in direct contrast to the computationally lighter baseline which generates the exact (unbiased) top candidates independently for each line without constraint. Can we combine the advantages of both algorithms? A key observation is that the assumption of independent scoring across different lines allows fast and unbiased full program candidate generation, while an expensive beam search is inevitably needed to deal with the inherent dependence between lines.

Therefore, we propose a hierarchical beam search method that first uses beam search with a smaller beam width W to find likely scaffolds, including only the minimum dependency information between lines to satisfy the constraints, then scores candidates independently for each line conditioned on the scaffold. We assign probability p(φ_{lγ}) to configuration φ_{lγ} by marginalizing all code piece candidates at line l with configuration φ_{lγ}, and assign probability p(S) to scaffold S by multiplying the configuration probabilities from each line:

p(\phi_{l\gamma}) = \sum_{\phi(y_{lc}) = \phi_{l\gamma}} p_{lc}, \qquad p(S) = \prod_{i=1}^{L} p(S[i]).   (2)

Using this scoring function, we run a scaffold beam search with size W, then select the top K highest scoring scaffolds S_1, S_2, . . . , S_K.

Next, to generate program candidates from a given scaffold S, we filter out all code pieces in Y_l that do not have the configuration specified by S; in other words, the new set of candidate code pieces for each line l is

Y^S_l = \{ y_{lc} \in Y_l \mid \phi(y_{lc}) = S[l] \}.   (3)

As a result, conditioned on a fixed scaffold S, code pieces from each line can be chosen independently and the resulting full program will be guaranteed to satisfy the aforementioned constraints.
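A minimal sketch of the two steps around Equations 2 and 3 (our own illustration; phi is a hypothetical function mapping a code piece to its configuration):

    from collections import defaultdict

    def config_probs(line_candidates, phi):
        # Marginalize code piece probabilities over shared configurations (Equation 2).
        probs = defaultdict(float)
        for piece, p in line_candidates:
            probs[phi(piece)] += p
        return probs

    def filter_by_scaffold(candidates, scaffold, phi):
        # Keep only code pieces whose configuration matches the scaffold (Equation 3);
        # lines can then be chosen independently, as in the unconstrained baseline.
        return [[(piece, p) for piece, p in line if phi(piece) == scaffold[l]]
                for l, line in enumerate(candidates)]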


Given K candidate scaffolds, we enumerate the top full program candidate from each scaffold and choose the highest scoring one. This takes time O(K + L log(BL)) per candidate. In practice, we pick relatively small K and the running time has only logarithmic dependence on B.

4.3 Tradeoffs in Early Detection

An alternative view on beam search is that it front-loads the computation to reject invalid programs that do not satisfy the constraints earlier in the search process. A brute force alternative is to generate the next highest scoring candidates from the unconstrained baseline and reject invalid ones. This method is guaranteed to produce top-scoring solutions, but it might need arbitrarily many candidates to find a valid one. We need to compare the computational efficiency between these two methods.

The most computationally expensive operation in constraint verification is to verify whether the next line is valid given the program prefix. Therefore, we count how many times this verifier function is called as a proxy to measure computational efficiency. We allow the brute force method to use as large a verifier function call quota as our "active" beam search method: it can validate/reject a program candidate until the quota is used up.
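A minimal sketch of the quota-matched brute-force baseline (our own illustration; candidate_stream stands in for an unconstrained enumerator such as best-first search, and verify_line is the per-line constraint check being counted):

    def brute_force_with_quota(candidate_stream, verify_line, quota):
        # Accept unconstrained candidates in descending score order, rejecting
        # invalid ones, until the verifier-call quota is exhausted.
        accepted, calls = [], 0
        for program in candidate_stream:
            valid = True
            for l in range(len(program)):
                calls += 1
                if calls > quota:
                    return accepted
                if not verify_line(program[:l], program[l]):
                    valid = False
                    break
            if valid:
                accepted.append(program)
        return accepted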

Section 6.4 compares our scaffold search method against this brute force approach. The latter needs thousands of times more computation to attain the same level of performance as the former.

5 Implementation²

Empty Pseudocode  Around 26% of the lines in the data set do not have pseudocode annotations. They usually correspond to lines of code that do not have semantically meaningful information, such as "int main() {", "{", "}", etc. Kulal et al. (2019) replaced these empty pseudocode lines with the ground truth code, effectively giving this information away to the search algorithm. We did not use the gold code pieces for these lines, which makes our task more challenging.

Model Training  We use OpenNMT (Klein et al., 2017) with its default settings to translate pseudocode into code piece candidates. Our model is a two-layer LSTM seq2seq model with hidden size 512, an attention mechanism (Bahdanau et al., 2014) and copy pointers (Vinyals et al., 2015).

² Our implementation is available at https://github.com/ruiqi-zhong/SemanticScaffold.

We estimate the fraction of problems solvable given infinite search budget and 100 candidates per line as in Kulal et al. (2019) to obtain an oracle bound on performance. Due to slight differences in hyperparameters and tokenization method, our model has a higher ceiling: on the unseen worker (problems) test set, the oracle performance³ is 74.4% (60.5%), compared to 71.4% (55.2%) in previous work. Across all test examples, the oracle performance is 68%.

Parsing Code Pieces  Since no off-the-shelf C++ parser extracts the information we need from code pieces, we implement our own primary expression parser to extract high level control information. We rely on the following heuristic assumptions to parse the code pieces generated by the model: (1) a code piece belongs to only one variable scope; (2) the generation of every primary expression terminal symbol lies in one line. Our parser fails on less than 0.01% of the code pieces in the dataset. While selecting the candidates for each line, we immediately reject the ungrammatical pieces we cannot parse. Without deliberate implementation optimization, this parsing operation takes on average 2.6 seconds to process all the top 100 code pieces for a problem – approximately the same wallclock time as 1 compilation attempt.

Search Algorithm Hyperparameters  As in Kulal et al. (2019), we consider the top C = 100 code pieces for each line. Unless otherwise mentioned, our default beam width W is 50 for scaffold search and we keep the top K = 20 scaffolds for the subsequent generation.

6 Search Performance

6.1 Metrics

We evaluate a search algorithm A by computing the fraction of problems it can solve on the test set given evaluation budget B per problem, which we denote as f_A(B). We plot f_A against B and evaluate it at B = 1, 10, 100, 1000 for each algorithm A to compare performance.

We note that the difference of f values between two algorithms becomes smaller and less informative as B increases. With infinite code piece candidates and budget, a brute force search can

³ The oracle performance here is not a universal property of the data, but depends on the model used to generate the code pieces.


[Figure 4: (a) Candidate code pieces and their syntactic/SymTable configurations for each line; (b) use beam search to find the highest scoring valid scaffolds, marginalizing over code pieces that share a configuration; (c) given a scaffold, select the code pieces that have the same configuration for each line; (d) combine code pieces to form a full program.]

enumerate all possible programs, find the right solution and f converges to 1. Direct comparison on f values hence becomes meaningless as B increases. To address this deficiency, we define a lead metric l_{A1,A2}(B) equal to the extra budget X needed by algorithm A2 to reach the same level of performance as A1 given budget B. Formally,

l_{A_1,A_2}(B) = \inf\{X \mid f_{A_2}(B + X) \ge f_{A_1}(B)\}.   (4)
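A minimal sketch of how the lead metric can be computed from two empirical f curves (our own illustration; the curves are assumed to be given as arrays indexed by budget):

    def lead(f1, f2, B):
        # Smallest extra budget X with f2(B + X) >= f1(B) (Equation 4).
        # f1, f2: lists where f[i] is the fraction of problems solved with budget i.
        target = f1[B]
        for X in range(len(f2) - B):
            if f2[B + X] >= target:
                return X
        return float("inf")  # A2 never catches up within the measured budgets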

A visualization can be seen in Figure 5(c). We report our algorithms' performance on the held-out test set with annotations from unseen crowd workers and with unseen problems separately.

6.2 Comparison of Constraints

We compare four settings:

• No Constraints: the best-first search method that scores lines independently.

• Syntactic Constraints: the constraints on the primary expression and indentation level as described in Section 3.1.

• Symbol Table Constraints: both the syntactic constraints and the symbol table constraints described in Section 3.2. We abbreviate this as SymTable.

• Backoff: sometimes hierarchical beam search with the SymTable constraints fails to return any valid scaffold. We back off to just the Syntactic constraints if this happens.

Figure 5: (a), (b) Comparison of f performance under different constraints; (c) a zoomed-in visualization of the definition of the lead metric; (d) lead of the SymTable constraint over the Syntactic constraint on different test sets.

Additionally, we compare with the previous state-of-the-art reported by Kulal et al. (2019).

The results can be seen in Figure 5 and Table 1, where we use the constraint type as a shorthand for the search algorithm under this constraint. Without constraints, the baseline algorithm performs especially poorly because it needs syntactic context to select relevant code pieces for the 26% of the lines with empty pseudocode.

SymTable outperforms Syntactic. As shown in


Test Against Unseen Workers
Hierarchical Search (H), Beam Width W = 50
Constraint   B=1      B=10     B=10^2   B=10^3
None         0.0%     8.1%     29.2%    44.3%
Previous     30.7%    44.4%    53.7%    58.6%
Syntactic    42.8%    51.9%    59.3%    65.9%
SymTable     45.8%    55.1%    62.6%    67.3%
Backoff      46.0%    55.3%    62.8%    67.6%

Test Against Unseen Problems
Constraint   B=1      B=10     B=10^2   B=10^3
None         0.0%     3.0%     11.5%    21.8%
Previous     17.8%    28.4%    34.2%    38.3%
Syntactic    27.5%    35.4%    42.1%    47.8%
SymTable     31.0%    39.2%    46.0%    49.3%
Backoff      31.2%    39.4%    46.1%    49.6%

Table 1: Comparison of the fraction of programs that pass when B = 10^0, 10^1, 10^2, 10^3 under different constraints; the constraints are satisfied by hierarchical beam search with the default hyper-parameters mentioned in Section 5. "Previous" refers to the previous state-of-the-art model.

Figure 5(d), the lead of SymTable on Syntactic grows linearly: the more these two algorithms search, the more budget is needed by Syntactic to reach the same level as SymTable. Syntactic needs nearly 600 more budget to have comparable performance with SymTable that uses 400 budget.

We notice that all of our constrained search methods outperform the previous state-of-the-art. Averaged across all test examples, Backoff can solve 55.1% of the problems within 100 budget, which is ≈ 10% higher than the previous work. On unseen workers (problems), the top 11 (top 52) candidates of Backoff solve the same fraction of problems as the top 3000 candidates of the best performing algorithm in Kulal et al. (2019).

6.3 Regular vs. Hierarchical Beam Search

We use regular beam search with beam width W = 200 to generate B = 100 valid candidate full programs. We did not experiment with B = 1000 because beam search with W ≥ B ≥ 1000 is computationally intractable. For hierarchical beam search we experiment with W = 10, 25, 50 for scaffold search and keep the top K = min(W, 20) scaffolds for subsequent searches.

Table 2 compares the performance of hierarchical beam search against regular beam search with different beam sizes under Syntactic and SymTable constraints. We find that if hierarchical beam search is used, even dropping the beam width

Test Against Unseen Workers, Syntactic
Method, Width   B=1      B=10     B=10^2
H, W=10         42.8%    51.7%    59.1%
H, W=25         42.8%    51.8%    59.3%
H, W=50         42.8%    51.9%    59.3%
R, W=200        42.4%    51.3%    58.2%

Test Against Unseen Workers, SymTable
Method, Width   B=1      B=10     B=10^2
H, W=10         45.4%    54.3%    61.0%
H, W=25         45.6%    54.7%    61.9%
H, W=50         45.8%    55.1%    62.6%
R, W=200        45.6%    54.9%    61.9%

Table 2: Comparison of different beam sizes with Syntactic and SymTable constraints when tested against unseen workers. H/R refers to hierarchical/regular beam search and W is the beam width. The same results on unseen problems can be seen in the appendix (Table 4).

from 50 to 10 leads to a negligible change in performance. In contrast, even with a large beam width W = 200, the regular beam search method cannot efficiently search for the solution and leads to a noticeable drop in performance.

We observe a similar trend for SymTable: regular beam search with beam width W = 200 underperforms hierarchical search with beam width W = 25. However, if we further decrease the hierarchical beam search width from 25 to 10 in this setting, we observe a significant drop in performance, possibly because there are more variable usage variations than syntactic variations.

6.4 Scaffold Search vs. Brute Force Method

We now compare scaffold search to the brute force algorithm as described in Section 4.3. We make B = 50,000 attempts for the brute force method so that its performance can match at least the top 10 candidates of our constrained approach and make the lead metrics meaningful. To save computation and avoid compiling all 50,000 programs, we early reject every candidate that does not fulfill our constraints.

The lead of our approaches against the brute force algorithm is shown in Figure 6. After being adjusted for the constraint checking quota used, the lead of our approach is tens of thousands ahead of the unconstrained approach. Scaffold search saves a lot of computation by incurring a little overhead earlier in the search process.


Figure 6: Lead of SymTable and Syntactic constraints on the non-constrained approach with equal quota on the test set with unseen (a) workers and (b) problems.

7 Analysis

7.1 Program Candidate Variations

Beam search has the problem of producing fewer variations at the beginning of the search. Such a weakness might be tolerable if we only care about the top 1 candidate, but becomes disastrous in a search setting where we want the top B candidates, whose variation is typically spread across the entire program.

We describe the following procedure to formally define this intuition. We first aggregate the code piece choices of each line for all the top B programs. As shown in Figure 8(a), we construct a matrix such that each column corresponds to a full program candidate; the number r in the ith row and jth column means that on line i, the jth full program candidate chooses the rth code piece candidate (i.e. y_{ic_i} = y_{ir}). Then we can build a prefix tree (Figure 8(b)) by treating each column as a string, where each traversal from the root to a leaf is a complete candidate program y. We define the representative branch/program as a traversal from the root to a leaf that always chooses the child that contains the most leaves (with ties being broken randomly). For each of the remaining B − 1 programs/traversals, we find the smallest line number where it starts to diverge from the representative branch. Among these B − 1 programs, we count the fraction of divergences that take place in the first/second half of the lines. For example, in Figure 8(b), 0% of the divergences occur in the first half.
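A minimal sketch of this procedure (our own illustration, operating on the per-candidate choice sequences, i.e. the columns of the matrix in Figure 8(a)):

    from collections import Counter

    def first_half_divergence(choices):
        # choices[j]: per-line code piece ranks for candidate program j.
        L = len(choices[0])
        alive = list(range(len(choices)))  # candidates still on the representative branch
        rep = []
        for l in range(L):
            # The child with the most leaves is the most common choice among `alive`.
            pick = Counter(choices[j][l] for j in alive).most_common(1)[0][0]
            rep.append(pick)
            alive = [j for j in alive if choices[j][l] == pick]
        diverging, first_half = 0, 0
        for cand in choices:
            for l in range(L):
                if cand[l] != rep[l]:      # first line where this candidate diverges
                    diverging += 1
                    first_half += int(l < L / 2)
                    break
        return first_half / max(diverging, 1)

Applied to the example matrix in Figure 8(a), this returns 0.0, matching the observation that none of the divergences occur in the first half.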

We compare hierarchical vs. regular beam search under syntactic constraints with different beam widths W: hierarchical W = 10, 50 and regular W = 50, 200. We group the programs by length L, consider the top B = 25 attempted programs for each problem and report the fraction of divergences that occur in the first half of the program length for each group.

Length L    H 10     H 50     R 50     R 200
(0, 10]     45.4%    45.5%    43.6%    45.5%
(10, 20]    63.2%    63.4%    58.2%    63.4%
(20, 30]    63.6%    63.6%    56.8%    63.6%
(30, 40]    67.2%    67.3%    58.2%    67.3%
(40, ∞)     69.4%    68.8%    56.8%    68.8%

Table 3: Fraction of divergences in the first half of the program, grouped by program length L. In the column headers, H/R represents Hierarchical/Regular beam search under the Syntactic constraint, and the number represents the beam width W. The R 50 column has the lowest fraction in every row.

The results can be seen in Table 3. For regular beam search, a moderate beam width W = 50 consistently brings fewer variations in the first half of the program, and it needs a larger W = 200 to fix this problem. In contrast, a small W for hierarchical beam search produces the same amount of variations in the first half of the program. The same statistics under SymTable constraints can be seen in the appendix (Table 5) and the conclusion holds similarly.

7.2 Rejection by Constraints

In this section we give representative examples of what program candidates are rejected by our syntactic and symbol table constraints.

Syntactic Constraints  As mentioned in Section 5, about 26% of the lines do not have pseudocode. They may correspond to "{", "int main() {", "}", "return 0;", ";" or "};". These lines need contextual information to select valid code pieces, and naively combining the top 1 candidate from each line independently will always produce grammatically invalid programs. Syntactic constraints also rule out stylistic ambiguities. For example, when there is only one statement within an if statement, the programmer can optionally include a curly brace. However, the pseudocode does not contain such detailed information about style. Both "if(...) {" and "if(...)" might be valid, but only one of them can be correct given the context of a program. Our syntactic constraints, which contain a curly brace constraint, can help us select the right code piece.

Symbol Table (SymTable) Constraints  Pseudocode annotations are sometimes implicit about variable declarations. Given the instruction "set N to 222222", both code pieces (1) "int N =


Reason (percentage)                     Pseudocode                                              Gold Solution                                Model Generation
(a) Generation wrong (47.5%)            let value1, value2, val, a, b be integers with val = 0  int value1, value2, val = 0, a, b;           int value1, value2, val, a, b = 0;
(b) Needs type disambiguation (12.5%)   s[occur[i][j] + k] = letter - a + A                     s[occur[i][j] + k] = letter - 'a' + 'A';     s[occur[i][j] + k] = 'letter' - a + A;
(c) Needs syntax disambiguation (0.5%)  else if dB is less than dW                              else if (dB < dW)                            else if (dB < dW)
(d) Variable name typos (15.0%)         if lfg = 1                                              if (flg == 1)                                if (lfg == 1)
(e) Pseudocode wrong (23.5%)            set ans = 25*length of s                                ans += (25 * s.length());                    int ans = 25 * s.length();

Figure 7: Categorized error analysis for lines where no generated code piece is functionally equivalent to the gold. The percentage in the parentheses refers to the fraction of this category out of the 200 samples.

[Figure 8: (a) A matrix that represents each candidate's choices of code pieces for each line; rows are line numbers and columns are full program candidates ranked 1–6. For example, the 4th full program candidate picked the rank-0 code piece in line 6, and 3 branches diverge from the representative branch at line 5.

    line \ candidate   1  2  3  4  5  6
    1                  0  0  0  0  0  0
    2                  1  1  1  1  1  1
    3                  0  0  0  0  0  0
    4                  0  0  0  0  0  0
    5                  0  0  0  1  2  1
    6                  0  1  3  0  1  2

(b) A prefix tree constructed by treating each column as a string; the representative branch is the second column.]

222222;" and (2) "N = 222222;" are potentially valid. We might disambiguate this case with a SymTable constraint: if the variable is declared before in the same scope, then we know this code piece should not contain a repeated declaration and hence we should choose candidate (2); otherwise we should choose (1) to avoid using undeclared variables. SymTable constraints are also helpful when the pseudocode does not put quotation marks around string/character literals. Consider the instruction "if lucky is A then do the following" with the ground truth code piece "if (lucky == 'A') {". The model might misunderstand A as a variable name and generate "if (lucky == A) {". This error can be ruled out by the SymTable constraint if variable A is undeclared.
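A minimal sketch of the declaration disambiguation rule described above (our own illustration; the two candidate strings are the hypothetical code pieces from the example):

    def pick_declaration_variant(scopes, var, with_decl, without_decl):
        # Choose between e.g. "int N = 222222;" and "N = 222222;" depending on
        # whether `var` is already visible in an enclosing scope.
        visible = any(var in s for s in scopes)
        return without_decl if visible else with_decl

    scopes = [{"N"}]  # N already declared in the current scope chain
    print(pick_declaration_variant(scopes, "N", "int N = 222222;", "N = 222222;"))
    # -> "N = 222222;"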

However, SymTable constraints do not preclude all errors related to declarations. Consider the following generation where the last line is wrong:

    int now = -1, cnt = 0;
    for (int i = 0; i < n; ++i) {
        ...                          // some lines omitted
        // cnt = 1, now = v[i];      // gold
        int cnt = 1, now = v[i];     // pred
    }

A programmer will usually not declare new variables in the last line of a variable scope. However, technically this is not an invalid statement and the SymTable constraint fails to reject this wrong candidate. Extra modelling is needed to take into account programming conventions and common sense.

7.3 Code Piece Error Analysis

So far we have focused on combining independent candidates from each line together to search for the target program. This heavily depends on the underlying model to generate potentially correct code pieces. However, in 32% of the programs at least one "hard" line has no generated code piece that is functionally equivalent to the solution, thus indicating plenty of room for improvement. To help the readers understand the bottleneck for code piece generation and point out important future directions, we randomly sampled 200 "hard" lines and manually analyzed why the generation fails by looking at the top 1 candidate of the model. The error analysis is available on our GitHub.

We group the failures into the following categories, giving a detailed breakdown and examples in Figure 7. (a) The model generation is wrong despite clear pseudocode; this typically happens when the gold code piece is long or highly compositional. (b, c) The pseudocode contains ambiguity; the model generation is reasonable but either needs (b) variable type clarification or (c) syntactic context. This requires incorporating contextual information of the program into the code piece generation process. (d, e) The pseudocode either (d) consists of variable name typos or (e) is completely wrong.

References

Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. 1986. Compilers, Principles, Techniques. Addison Wesley, 7(8):9.

Miltiadis Allamanis, Daniel Tarlow, Andrew D. Gordon, and Yi Wei. 2015. Bimodal modelling of source code and natural language. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, pages 2123–2132. JMLR.org.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Xinyun Chen, Chang Liu, Richard Shin, Dawn Song, and Mingcheng Chen. 2016. Latent attention for if-then program synthesis. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 4581–4589, USA. Curran Associates Inc.

Keith Ellul, Bryan Krawetz, Jeffrey Shallit, and Ming-wei Wang. 2005. Regular expressions: New results and open problems.

Carlo Ghezzi and Dino Mandrioli. 1979. Incremental parsing. ACM Transactions on Programming Languages and Systems (TOPLAS), 1(1):58–70.

Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. 2017. Program synthesis. Foundations and Trends in Programming Languages, 4(1-2):1–119.

Shirley Anugrah Hayati, Raphael Olivier, Pravalika Avvaru, Pengcheng Yin, Anthony Tomasic, and Graham Neubig. 2018. Retrieval-based neural code generation. arXiv preprint arXiv:1808.10025.

Srinivasan Iyer, Alvin Cheung, and Luke Zettlemoyer. 2019. Learning programmatic idioms for scalable semantic parsing. arXiv preprint arXiv:1904.09086.

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping language to code in programmatic context. arXiv preprint arXiv:1808.09588.

G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. ArXiv e-prints.

Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, and Percy Liang. 2019. SPoC: Search-based pseudocode to code. arXiv preprint arXiv:1906.04908.

Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky, Andrew Senior, Fumin Wang, and Phil Blunsom. 2016. Latent predictor networks for code generation. arXiv preprint arXiv:1603.06744.

Maxim Rabinovich, Mitchell Stern, and Dan Klein. 2017a. Abstract syntax networks for code generation and semantic parsing. arXiv preprint arXiv:1704.07535.

Maxim Rabinovich, Mitchell Stern, and Dan Klein. 2017b. Abstract syntax networks for code generation and semantic parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1139–1149, Vancouver, Canada. Association for Computational Linguistics.

Armando Solar-Lezama. 2009. The sketching approach to program synthesis. In Proceedings of the 7th Asian Symposium on Programming Languages and Systems, APLAS '09, pages 4–13, Berlin, Heidelberg. Springer-Verlag.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700.

Chenglong Wang, Po-Sen Huang, Alex Polozov, Marc Brockschmidt, and Rishabh Singh. 2018. Execution-guided neural program decoding. CoRR, abs/1807.03100.

Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. In The 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, Canada.

Maksym Zavershynskyi, Alex Skidanov, and Illia Polosukhin. 2018. NAPS: Natural program synthesis dataset. arXiv preprint arXiv:1807.03168.


A Appendices

A.1 Primary Expressions

Table 6 contains the grammar we use for the syntactic constraint and Table 7 defines the generation of terminal symbols.

A.2 CFG Description Size of SymTable

We show that we cannot specify the SymTable constraint in a context free grammar without exponential description complexity w.r.t. the number of variables declared. The intuition is that, since repeated declarations of a variable are not allowed, we need to keep track of all the variables that have been declared every time we verify whether the next line is valid; however, a CFG, when transformed into a pushdown automaton, is only allowed to peek at the top of the stack to decide the state transition. This means the symbol on the top of the stack, the state, or the transition rule needs to have full information about whether each variable has been declared, which contains exponentially many possibilities w.r.t. the number of variables.

Our proof is an adaptation of Ellul et al. (2005), which proves this property for the language that accepts all the permutations of a fixed number of variables. We refer the readers to this paper if more details of the proof are needed. To formalize, we consider a simple grammar of K characters {v_1, . . . , v_K}, where v_i means, semantically, declaring the variable v_i, and the language L consists of all the possible sequences of declarations that have no repetition.

L = \{\mathrm{concat}_{j=1}^{k}\, v_{a_j} \mid a_{j_1} \neq a_{j_2} \text{ if } j_1 \neq j_2,\; k \leq K\}   (5)

We prove that

Theorem 1  L has at least Ω(1.37^K) description complexity⁴ as a context free grammar.

Intuitively, it means that if we want to use a CFG to specify L, we need the sum of the total length of the production rules and the number of symbols to be at least exponential.

Proof: Since we can convert any CFG with size B to Chomsky Normal Form (CNF) with size O(B²), the above statement would be implied if we prove that L needs Ω(1.37^{2K}) = Ω(1.89^K) description size in Chomsky Normal Form.

We use Lemma 31 from Ellul et al. (2005):

⁴ Ω ignores all the poly(K) multiplicative factors.

Lemma 2  Let S be the start symbol of the CFG. Then for all w ∈ L, there exists a symbol A with

S \Rightarrow^{*} \alpha A \beta \Rightarrow^{*} w   (6)

such that if A yields y in w (i.e. w = αyβ), then (1/3)|w| ≤ |y| ≤ (2/3)|w|.

In other words, for any member of the language, we can find a symbol in the derivation responsible for between 1/3 and 2/3 of the final yield.

Let P_K be all sequences of permutations of the K variables and thus P_K ⊂ L. Then by Lemma 2, for every permutation π ∈ P_K we can find a yield y_π that is yielded by a single symbol such that (1/3)K ≤ |y_π| ≤ (2/3)K. Now we consider two permutations π_1 and π_2. If y_{π_1} and y_{π_2} are yielded by the same symbol, then they must have the same length (this is the part where the proof is slightly different from Ellul et al. (2005)): suppose the contrary, and w.l.o.g. let |y_{π_1}| > |y_{π_2}|. By the definition of a context free grammar, we can replace the sub-string y_{π_2} in π_2 by y_{π_1} to create a new string y′_{π_2} which is still a member of L. We have |y′_{π_2}| = K − |y_{π_2}| + |y_{π_1}| > K by assumption. However, there are in total K variables; by the pigeonhole principle there must be a variable that is declared twice, and hence y′_{π_2} ∉ L and we obtain a contradiction.

Then all the assumptions needed by Theorem 30 in Ellul et al. (2005) hold, so L has description complexity Ω(1.89^K) in CNF and hence L has description complexity Ω(1.89^{K/2}) = Ω(1.37^K).

A.3 Hardness of Satisfying SymTable

We show that combining code pieces from each line under the SymTable constraint is NP-hard in general. We first remind the readers of the set packing problem:

Definition 3  Assume the universe to be V, and suppose we are given a family of subsets S from the power set of V, i.e. P(V) = {S | S ⊆ V} and S ⊆ P(V). We want to determine whether we can find a packing K ⊆ S for which all sets in K are pairwise disjoint and with size |K| ≥ L for some fixed L > 0. This problem is called the set packing problem, and is known to be NP-complete.

Following the notation in Section A.2, for each line l ∈ [L], we construct the C = |S| code piece candidates y_{lS} for S ∈ S as

y_{lS} = \mathrm{concat}_{v \in S}\, v.   (7)


Test Against Unseen Problems, Syntactic
Method, Width   B=1      B=10     B=10^2
H, W=10         27.4%    35.3%    42.0%
H, W=25         27.5%    35.4%    42.1%
H, W=50         27.5%    35.4%    42.1%
R, W=200        27.1%    34.7%    41.0%

Test Against Unseen Problems, SymTable
Method, Width   B=1      B=10     B=10^2
H, W=10         30.3%    38.1%    43.1%
H, W=25         30.9%    39.2%    45.7%
H, W=50         31.0%    39.2%    45.9%
R, W=200        30.7%    38.9%    45.4%

Table 4: Comparison of different beam sizes with Syntactic and SymTable constraints when tested against unseen problems. H/R refers to hierarchical/regular beam search and W is the beam width. This table is structured similarly to Table 2.

Length L    H 25     H 50     R 50     R 200
(0, 10]     40.7%    41.5%    39.4%    41.5%
(10, 20]    60.9%    59.8%    54.3%    61.3%
(20, 30]    62.2%    61.3%    54.2%    61.3%
(30, 40]    66.0%    66.1%    56.8%    66.1%
(40, ∞)     69.0%    68.7%    57.9%    68.7%

Table 5: Fraction of divergences in the first half of the program, grouped by program length L. In the column headers, H/R represents Hierarchical/Regular beam search under the SymTable constraint, and the number represents the beam width W.

We easily see that there is a set packing of size L if and only if there is a valid code piece combination under the SymTable constraint (declarations need to be disjoint for each line). Hence we finish our reduction proof.

A.4 Beam Search on Unseen Problems

Table 4 contains similar information as Table 2, except that the results are obtained by testing on unseen problems. The exact same conclusion holds: for regular beam search, a small beam size hurts performance, but hierarchical beam search can solve this problem.

A.5 Variation under SymTable Constraints

Table 5 contains similar information as Table 3, but for SymTable constraints. The same trend holds: regular beam search with a small beam size has fewer variations in the first half of the program.


Symbol            Production Rules
program           stmt program
                  function program
stmt              for_stmt
                  if_stmt
                  while_stmt
                  dowhile_stmt
                  terminal_stmt ;
X*                X* X
                  X
                  <EMPTY>
function          return_type function_name ( args ) start stmt* end
                  return_type function_name ( type* ) ;
args              <EMPTY>
                  arg , args
arg               type arg_name
for_stmt          for_start terminal_parentheses terminal_stmt_end ;
                  for_start terminal_parentheses stmt*_end
while_stmt        while_start terminal_parentheses terminal_stmt_end ;
                  while_start terminal_parentheses stmt*_end
dowhile_stmt      do_start stmt* while terminal_parentheses_end ;
                  do_start terminal_stmt while terminal_parentheses_end ;
if_stmt           single_if_stmt elif_stmt* else_stmt
                  single_if_stmt elif_stmt*
single_if_stmt    if_start terminal_parentheses terminal_stmt_end ;
                  if_start terminal_parentheses stmt*_end
elif_stmt         elif_start terminal_parentheses terminal_stmt_end ;
                  elif_start terminal_parentheses stmt*_end
else_stmt         else_start terminal_stmt_end ;
                  else_start stmt*_end

Table 6: The full primary expression grammar we are using. Each line is a production rule. X is a generic symbol. Subscripts start/end mark the start/end of a variable scope.

Terminal                       Implementation
terminal_parentheses           a string that has matching parentheses and starts with a parenthesis
terminal_stmt                  a string that does not contain ";", "for", "if", "else", "while", "do"
for, if, else, while, do       reserved keywords
function_name, arg_name        function name and function argument name
return_type, type              type in C++

Table 7: The definition of the terminals appearing in Table 6.