Automated Conversion of Table-based Websites to Structured Stylesheets Using Table Recognition and Clone Detection

Andy Y. Mao, James R. Cordy, Thomas R. Dean

School of Computing, Queen’s University

Kingston, Ontario, Canada

{mao, cordy, dean}@cs.queensu.ca

Abstract

Web standards such as XHTML and CSS are rapidly coming into practice and have many advantages, including compatibility, consistency across browsers, and increased ease of maintenance. Unfortunately, large numbers of existing websites still use the deprecated table-based layout style in which page style is unique to each page. Existing tools for automating the transition to stylesheets provide little help, converting page-by-page using a flattened structure and local inline styles rather than a common CSS stylesheet. This approach ignores hierarchical structure and defeats the main purpose of moving to the new standard, losing all of the advantages.

In this work we present an automated method for converting table-based layout websites to standards-compliant modern CSS stylesheet-based websites using a two-step process. Pages of the site are first converted page-by-page using table recognition technology to preserve hierarchical structure and layout semantics in local styles. Software clone detection technology is then utilized to recognize common layout styles in the pages and extract and minimize them to a common CSS stylesheet for the site. The result is a maintainable, efficient modern standards-compliant website with the same look and feel as the original but with all the maintenance advantages of a custom programmed new site.

Copyright © 2007 Andy Y. Mao, James R. Cordy and Thomas R. Dean. Permission to copy is hereby granted provided the original copyright notice is reproduced in copies made.

1 Introduction

Long before the maturity of World Wide Web standards, websites implemented standard layouts and look-and-feel of pages using table-based layouts that are copied from one page to another, often because the original sites were generated by early website editors such as Claris Home Page. Many of these websites are still alive and actively maintained, and indeed a large number of popular websites still use traditional table-based layouts. Now that web standards designed for expressing and maintaining common layout and style such as XHTML, DIV layout and separate CSS stylesheets have matured, it is highly desirable to migrate existing legacy websites to the new technology.

The use of a separate common CSS stylesheet for a site is an example of the clear advantages offered by such a conversion. Suppose, for example, that we wished to change an entire website from left-handed logo form to right-handed, as shown in Figure 1. To make this change to a traditional table-based layout site, every single page of the site would have to be hand edited to move table elements between columns one by one using copy-paste. The amount of work and level of tedium in implementing even this simple change would almost certainly lead to errors and anomalies. By contrast, implementing this change in a common CSS stylesheet version of the same site would involve changing only one style in the stylesheet file, leaving all pages untouched and vastly reducing the effort and chances for error.

Figure 1: Left- to Right-handed Logo Example

In addition to the clear advantage of a consistent common style across a site, web standards also offer many other advantages, including compatibility with modern website editing and searching tools, greater browser independence, and enhanced ease of maintenance. Ideally every website should be redesigned to use these new standards from scratch, but in practice the effort to do so for large websites with substantial investment can be prohibitive.

Existing automation for migrating table-based legacy websites to the new web standards, such as that offered by Adobe Dreamweaver [1], is at best cursory, preserving layout of individual pages separately by absolute pixel placement and localized inline DIV styles, thus losing all hierarchical structure and commonality of style. The result is a conversion equivalent to per-page plotting of page elements (Figure 2), yielding a website that is actually less maintainable than the original and defeating the whole purpose of moving to the new standards.

In this paper we propose a more realistic and ambitious automated conversion, leveraging table analysis and source transformation technology already proven in the document recognition and software reengineering domains. Our conversion recognizes and preserves hierarchical structure and commonality of style across

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">

<html> <head>

<title>Table test</title>

<style type="text/css">

div { border:1px red solid;}

</style>

</head>

<body>

<div id="Layer1" style="position:absolute;

left:15px; top:18px; width:244px; height:22px;

z-index:1; vertical-align:middle">content 1</div>

<div id="Layer2" style="position:absolute;

left:286px; top:18px; width:294px; height:46px;

z-index:2; vertical-align:middle">content 2</div>

<div id="Layer3" style="position:absolute;

left:610px; top:18px; width:161px; height:22px;

z-index:3; vertical-align:middle">content 3</div>

<div id="Layer4" style="position:absolute;

left:15px; top:42px; width:244px; height:22px;

z-index:4; vertical-align:middle">content 4</div>

. . .

</body>

</html>

Figure 2: Example Dreamweaver Conversion
Positions are absolute, style attributes are embedded and all table hierarchy is lost.

the entire site. The result of this automated conversion is a site that preserves the look and feel of the pages of the original site while preserving hierarchical layout structure. A common CSS style file which is essentially identical to one that would be authored in a disciplined hand-crafted migration is inferred from style similarity (Figure 3).

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html> <head>
<title>Table test</title>
</head>
<style>
#top_left {float:left; margin:auto; width:250px}
#top_left_container_1 {float:left; margin:0; width:250px}
#top_left_content_1 {float:left; margin:auto; width:250px; border:1px red solid;}
. . .
</style>
<body>
<div id="top_left">
  <div id="top_left_container_1">
    <div id="top_left_content_1">
      content 1
    </div>
    <br clear="both"/>
  </div>
  <div id="top_left_container_1">
    <div id="top_left_content_1">
      content 4
    </div>
    <br clear="both"/>
  </div>
</div>
. . .
</body>
</html>

Figure 3: Example Conversion by Our Process
Positions relative, style attributes in a separate stylesheet and nesting hierarchy preserved.

Our method utilizes a four-step approach, in which web pages are first converted from HTML to XHTML using a source transformation based on robust parsing [8]. The table structure of each page is then analyzed using table recognition methods [22] to separate layout table structure from intentional data tables and to elucidate implicit hierarchy represented by row-spanning (ROWSPAN) and column-spanning (COLSPAN) attributes as explicit nested tables. The augmented explicit table layout structure is then converted page-by-page to the web standard DIV-based layout with a separate CSS style file for each page, preserving the original look and feel. Finally, software clone detection technology [14] is utilized to recognize common styles and synthesize a single minimized CSS stylesheet file for the site, converting each page to use the common stylesheet. The entire process architecture can be visualized as shown in Figure 4.

The remainder of this paper is organized as follows. Section 2 outlines the phases of our process in detail and gives small examples of each transformation on example HTML code. Section 3 demonstrates the entire process using our experience in converting a real entire table-based legacy website to modern XHTML / CSS standards. Section 4 relates our work to that of others, and Section 5 discusses limitations and future extensions of our process. Finally, Section 6 summarizes and concludes the paper.

Figure 4: Conceptual Process
Our process consists of two major steps. In Step 1, table recognition is used to infer and preserve hierarchical structure in a conversion from tables to DIVs, separating style information into a stylesheet for each page. In Step 2, clone detection is used to unify and minimize styles into a single consistent stylesheet file for the entire site.

2 Approach

The overall purpose of our approach is to provide an automated transformation system that converts the layout structure of the pages of a website, specified by HTML table layout, into a nested hierarchical DIV structure while retaining the original look and feel, and that recognizes and extracts DIV styles to a single unified, minimized, maintainable CSS style file for the site. The first part is achieved by converting each page separately, creating a CSS stylesheet for it independent of the others. The second part uses clone detection techniques to recognize and minimize common styles into a single unified stylesheet used by all the pages. Figure 4 shows a conceptual view of our method.

All steps of our approach are implemented as source transformations in the TXL source transformation language [6]. Each phase consists of a number of source transformations, strung together to achieve the result (Figure 5). In this section we outline the details of these transformations, using small examples to demonstrate the technique.

2.1 XHTML Conversion

In the first transformation, web pages are independently converted from HTML to XHTML using a TXL grammar that utilizes robust parsing [2] to correct for HTML exceptions to XML form. Robust parsing is a method that attempts to parse each page as an XHTML document, adapting to exceptions where, for example, closing tags are missing. The exceptions are isolated into special nonterminal forms that are then targeted for correction by TXL source transformation rules, resulting in a valid XHTML page. Figure 6 shows an example of this transformation.
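The repair idea can be illustrated outside TXL. The following is only a minimal Python sketch, not the paper's grammar-based robust parser: it uses the standard library `html.parser` and repairs three kinds of exception seen in Figure 6 (unquoted or valueless attributes, a missing `</p>`, and end tags omitted before an enclosing end tag). The names `XhtmlFixer`, `to_xhtml` and the `VOID` set are hypothetical, and comments and doctypes are simply dropped.

```python
from html import escape
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "link", "meta"}  # elements with no content (assumed subset)

class XhtmlFixer(HTMLParser):
    """Re-emits a page as well-formed markup, repairing a few common exceptions."""

    def __init__(self):
        super().__init__()
        self.out = []     # output fragments
        self.stack = []   # currently open elements

    def handle_starttag(self, tag, attrs):
        # Exception: a new <p> while a <p> is still open implies the missing </p>.
        if tag == "p" and self.stack and self.stack[-1] == "p":
            self._close()
        # Exception: unquoted or valueless attributes; always emit name="value".
        rendered = "".join(
            f' {name}="{escape(name if value is None else value)}"'
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append(f"<{tag}{rendered} />")
        else:
            self.out.append(f"<{tag}{rendered}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: drop it
        # Exception: close any inner elements whose end tags were omitted.
        while self.stack[-1] != tag:
            self._close()
        self._close()

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def _close(self):
        self.out.append(f"</{self.stack.pop()}>")

def to_xhtml(page):
    fixer = XhtmlFixer()
    fixer.feed(page)
    fixer.close()            # flush any buffered text
    while fixer.stack:       # close whatever is still open at end of input
        fixer._close()
    return "".join(fixer.out)
```

Calling `to_xhtml` on a fragment such as `<td width=250><p>one<p>two</td>` yields markup that an XML parser accepts. Cases such as an omitted `</td>` or `</tr>` are not handled by this sketch.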

2.2 Table Recognition

To carry the hierarchical layout semantics of the original table layout pages over into nested DIV sections, we must first understand what the intended structure is. In HTML table layout, some of the intended structure is encoded in the ROWSPAN and COLSPAN attributes of table elements (Figure 7). Since DIV sections have no such corresponding feature, we must first make this implicit substructure explicit by transforming the original pages to eliminate ROWSPAN and COLSPAN while retaining the layout semantics they imply.

In order to do this we have adapted ideas from the table recognition literature in pattern analysis and machine intelligence research [22]. The methods we have adapted are called projection and partitioning. The basic idea is that a table cell that spans two or more other cells in a row or column implies a projected or nested structure on parallel rows or columns that embeds their corresponding cells in a sub-table. Table recognition methods such as Handley [9] and Hu et al. [11] use this idea in analysis of higher level structure of tables in documents.

In our case we have implemented this idea using a table partitioning and nesting transformation which forms a part of our conversion to DIV structures. To assist in the analysis, we compute an approximate layout for each page by assigning position information to every table cell using custom attributes. This information is used in the analysis only; the final result is constrained only by the page’s original style attributes. The conversion proceeds by partitioning ROWSPAN and COLSPAN structures into nested tables, separating them from parallel unspanned rows or columns and reducing all table structures to simple ones without any spans. Nesting of tables retains the relationship between the elements so that layout is not lost. Figure 8 shows the result of table partitioning the example of Figure 7.
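The position assignment and the grouping of parallel cells can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the TXL transformation itself: cells are given as hypothetical `(name, rowspan, colspan)` tuples, positions are grid coordinates rather than the pixel-like custom attributes the paper uses, and `assign_grid_positions` and `parallel_groups` are invented names.

```python
def assign_grid_positions(rows):
    """Place each cell on a grid, honouring rowspan and colspan.

    rows: list of table rows, each a list of (name, rowspan, colspan).
    Returns {name: (row, col, rowspan, colspan)}.
    """
    occupied = set()   # grid slots already covered by some cell
    placed = {}
    for r, row in enumerate(rows):
        c = 0
        for name, rowspan, colspan in row:
            # Skip slots taken by a cell spanning down from an earlier row.
            while (r, c) in occupied:
                c += 1
            placed[name] = (r, c, rowspan, colspan)
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied.add((r + dr, c + dc))
            c += colspan
    return placed

def parallel_groups(placed, spanning):
    """Cells in the rows covered by a row-spanning cell, split by side.

    Each side is a candidate for embedding in its own nested sub-table.
    """
    r0, c0, rowspan, _ = placed[spanning]
    left, right = [], []
    for name, (r, c, _, _) in placed.items():
        if name == spanning or not (r0 <= r < r0 + rowspan):
            continue
        (left if c < c0 else right).append(name)
    return left, right

# The table of Figure 6: content 2 spans two rows, content 7 spans two columns.
figure6 = [
    [("c1", 1, 1), ("c2", 2, 1), ("c3", 1, 1)],
    [("c4", 1, 1), ("c5", 1, 1)],
    [("c6", 1, 1), ("c7", 1, 2)],
]
placed = assign_grid_positions(figure6)
left, right = parallel_groups(placed, "c2")
```

For the Figure 6 table, the cells to the left of the row-spanning cell (contents 1 and 4) form one group, which corresponds to the nested `top_left` structure shown in Figure 3.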

In some table layout sites, there could of course be a conflict between rowspans, as shown in Figure 9. Just as such cases cause problems for table recognition algorithms, they lead to an ambiguity for our method, and our prototype conversion system requires hand intervention to handle such (relatively rare) cases.

2.3 Table Identification

While the same HTML "TABLE" feature is used, not every table in a web page represents layout information; some tables are intended to actually be data tables. As part of our conversion, we must identify which tables should be converted to DIV structures and which should not. In our prototype, this decision is made on a very simple criterion: if the HTML table structure has a table header (THEAD) or table footer (TFOOT) tag (i.e., if it has labeled columns or rows), then the table is assumed to represent a real data table and is not converted.

The identification of tables to be converted is done using another TXL source transformation that marks layout tables to be converted to DIV using a custom XML tag that also gives each table a unique name for use in attaching CSS styles to it later. Figure 10 shows part of the result of table identification on an example page.
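The criterion is simple enough to sketch in a few lines. The Python below is an illustration only, assuming the header and footer markers are the standard `thead` and `tfoot` elements; the class `TableClassifier`, the function `layout_table_ids`, and the `tableN` naming scheme are hypothetical stand-ins for the paper's TXL marking transformation.

```python
from html.parser import HTMLParser

class TableClassifier(HTMLParser):
    """Names every table in document order and flags those that look like data tables."""

    def __init__(self):
        super().__init__()
        self.tables = []        # one record per table, in document order
        self.open_tables = []   # indices of tables currently open (handles nesting)

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append({"id": f"table{len(self.tables) + 1}", "is_data": False})
            self.open_tables.append(len(self.tables) - 1)
        elif tag in ("thead", "tfoot") and self.open_tables:
            # A header or footer section marks the innermost open table as data.
            self.tables[self.open_tables[-1]]["is_data"] = True

    def handle_endtag(self, tag):
        if tag == "table" and self.open_tables:
            self.open_tables.pop()

def layout_table_ids(page):
    """Return the generated names of the tables that should be converted to DIVs."""
    classifier = TableClassifier()
    classifier.feed(page)
    classifier.close()
    return [t["id"] for t in classifier.tables if not t["is_data"]]

page = (
    "<table><tr><td>"
    "<table><thead><tr><th>Year</th></tr></thead><tr><td>2007</td></tr></table>"
    "</td></tr></table>"
    "<table><tr><td>sidebar</td></tr></table>"
)
layout_ids = layout_table_ids(page)
```

Tracking a stack of open tables keeps a data table nested inside a layout table from marking its parent as data.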

Figure 5: Implementation Architecture

2.4 Conversion to DIVs with Local Styles

Following table partitioning and identification, identified layout tables in pages are converted from table structures to DIV partitions for each table element, each with its own individual local inline style preserving the style attributes of the original table cell. Figure 11 shows part of the result of the DIV conversion of an example page. Relative positioning implied by table rows is maintained in the result using the FLOAT="LEFT" style attribute.

Like all of our stages, the conversion to DIV is done using TXL source transformation rules. Figure 12 shows the main transformation rule replaceTableByDIV for identified tables. In the usual TXL style, this one rule automatically searches to match and convert every identified table in the input. As part of the transformation, it uses the transformation subrule replaceTrByDIV to convert each row of the table, and so on. The generated DIVs are uniquely identified by the table id generated in the table identification step. These ids will attach each DIV to its corresponding style in the next step.
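For readers unfamiliar with TXL, the shape of this conversion can be approximated in Python. The sketch below is not a translation of replaceTableByDIV; it assumes a span-free XHTML table held in an `xml.etree.ElementTree` element, uses a hypothetical function name `table_to_div`, follows the id naming seen in Figure 11, and does not recurse into nested tables.

```python
import xml.etree.ElementTree as ET

def table_to_div(table, table_id):
    """Convert one span-free layout table element into nested DIVs.

    Attributes of the table and its cells are copied unchanged onto the
    generated DIVs, so style information stays inline at this stage.
    """
    div = ET.Element("div", {"id": table_id, **table.attrib})
    for i, tr in enumerate(table.findall("tr"), start=1):
        row_div = ET.SubElement(div, "div", {"id": f"{table_id}_tr{i}"})
        for j, td in enumerate(tr.findall("td"), start=1):
            cell = ET.SubElement(
                row_div, "div", {"id": f"{table_id}_tr{i}_td{j}", **td.attrib}
            )
            cell.text = td.text      # leading text of the cell
            cell.extend(list(td))    # any child elements of the cell
        # Each row ends with a clearing break, as in Figure 11.
        ET.SubElement(row_div, "br", {"clear": "both"})
    return div

xhtml = (
    '<table float="left">'
    '<tr><td width="250">content 1</td></tr>'
    '<tr><td width="250">content 4</td></tr>'
    "</table>"
)
converted = table_to_div(ET.fromstring(xhtml), "table1")
```

Serializing `converted` with `ET.tostring(converted, encoding="unicode")` gives markup with the same nesting as the first block of Figure 11.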

A web-based human interface allows the operator of the conversion process to choose more meaningful names for the generated DIVs at this stage (Figure 13). As the operator enters new names, the interface automatically changes the XHTML source of the generated DIVs to correspond. Figure 14 shows the converted DIV example of Figure 11 after renaming.

2.5 Separation of CSS Style Files

Following the conversion to DIV form with inline local styles, another transformation gathers and converts all styles in each page into an individual CSS style file for the page, using the table ids of the previous step. This is done

<html><body>
<table width=100% align=left>
  <tr>
    <td width=250>
      <p> Content 1 has two paragraphs.
      <p> This is the second one.
    </td>
    <td rowspan=2 width=300> Content 2 is a row-spanning entry. </td>
    <td> <p> But Content 3 has one. </td>
  </tr>
  <tr>
    <td width=250> Content 4 also has two paragraphs.
      <p> This is the second.
    </td>
    <td width=150> Content 5. </td>
  </tr>
  <tr>
    <td> Content 6. </td>
    <td colspan=2> Content 7 is a column-spanning entry. </td>
  </tr>
</table></html>

<html><body>
<table width="100%" align="left">
  <tr>
    <td width="250">
      <p> Content 1 has two paragraphs. </p>
      <p> This is the second one. </p>
    </td>
    <td rowspan="2" width="300"> Content 2 is a row-spanning entry. </td>
    <td> <p> But Content 3 has one. </p> </td>
  </tr>
  <tr>
    <td width="250"> Content 4 also has two paragraphs.
      <p> This is the second. </p>
    </td>
    <td width="150"> Content 5. </td>
  </tr>
  <tr>
    <td> Content 6. </td>
    <td colspan="2"> Content 7 is a column-spanning entry. </td>
  </tr>
</table></body></html>

Figure 6: Conversion to XHTML

Figure 7: Row- and Column-spans in Table Layout
The layout specified by the table in Figure 6. (Borders shown to make the layout visible.)

using a TXL transformation that extracts the local style parameters such as font and alignment from each DIV and creates a CSS style for it, named using the DIV’s unique table id. As part of this transformation, the order of parameters in each generated style is normalized by sorting into alphabetical order so that similar styles are more easily detected in the next phase.
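The effect of the alphabetical normalization is easy to see in a small sketch. The Python below is illustrative only; the function name `inline_style_to_css` and the dictionary representation of a DIV's style parameters are assumptions, not part of the paper's TXL implementation.

```python
def inline_style_to_css(div_id, attributes):
    """Build a CSS rule for one DIV from its inline style parameters.

    Declarations are sorted by property name so that two DIVs with the
    same parameters always produce textually identical rule bodies.
    """
    declarations = sorted(attributes.items())
    body = "\n".join(f"  {name}: {value};" for name, value in declarations)
    return f"#{div_id} {{\n{body}\n}}"

rule_a = inline_style_to_css(
    "top_left_content_1", {"width": "250", "float": "left", "margin": "auto"}
)
rule_b = inline_style_to_css(
    "top_left_content_2", {"margin": "auto", "float": "left", "width": "250"}
)
```

Apart from the selector, `rule_a` and `rule_b` are identical even though their parameters were supplied in different orders, which is what lets an exact-match comparison find them in the next phase.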

Figure 15 shows a portion of the corresponding extracted CSS style file for the example of Figure 11. Following this step all pages of the site have been converted to DIV layout, each page with its own individual CSS style file.

2.6 Clone Detection on Styles

While the DIV conversion results in a completely migrated XHTML, DIV and CSS-based web standard website, it still has the undesirable property that each page has its own CSS style file. The remaining problem is the integration of these styles to a single uniform stylesheet for the entire website. To achieve this result we employ clone detection technology [14] borrowed from our previous software re-engineering work [7] to recognize similar styles across pages and integrate them into a single global CSS stylesheet file.

The process of CSS style clone detection is

Figure 8: Table Partitioning to Eliminate Row and Column Spanning
Table partitioning converts COLSPAN and ROWSPAN attributes to equivalent nested table structures, reducing all layout tables to simple ones. (Borders shown to make the layout visible.)

Figure 9: Conflictual Rowspan Example
Our prototype is not yet able to automatically handle row partitioning for this example, and hand assistance is required. (Borders shown to make the layout visible.)

<tag id="table1">

<table float="left">

<tag id="table1_tr1">

<tr>

<tag id="table1_tr1_td1">

<td width="250" widthi="266">

content 1

</td>

</tag>

</tr>

</tag>

<tag id="table1_tr2">

<tr>

<tag id="table1_tr2_td1">

<td width="250" widthi="266">

content 4

</td>

</tag>

</tr>

</tag>

</table>

</tag>

Figure 10: Table Identification Example
XML custom tags (“<tag>”) mark and uniquely name each component of tables identified as layout tables.

achieved by two linked source transformations, one which works on the pages’ CSS files to detect and unify style clones and another that

<div id="table1" float="left">

<div id="table1_tr1">

<div id="table1_tr1_td1" width="250"

widthi="266">

content 1

</div>

<br clear="both"/>

</div>

<div id="table1_tr2">

<div id="table1_tr2_td1" width="250"

widthi="266">

content 4

</div>

<br clear="both"/>

</div>

</div>

<div id="table2" float="left">

<div id="table2_tr1">

<div id="table2_tr1_td1" width="300"

widthi="266">

content 2

</div>

<br clear="both"/>

</div>

</div>

Figure 11: DIV Conversion Example
Style attributes such as WIDTH remain inline at this stage. The WIDTHI style attribute is an artifact of the layout stage of our process and will be removed later.

rule replaceTableByDIV
    replace [html_interesting_element]
        <tag 'id=TableIDParam [stringlit]>
            <table RptTableParams [repeat html_any_tag_parameter]>
                RptTableContents [repeat html_table_content]
            </table>
        </tag>
    construct TrtoDIV [repeat div_tag]
        _ [replaceTrByDIV each RptTableContents]
    by
        <div 'id=TableIDParam RptTableParams>
            TrtoDIV
        </div>
end rule

Figure 12: Main TXL Transformation Rule for DIV Conversion

Figure 13: Interface for Hand Renaming

works on the pages themselves to update the style references in DIV sections to refer to the new unified style names.

In the first transformation, all of the CSS files for individual pages are concatenated into one merged file. A simple TXL pattern-matching rule searches the merged file for exact clones of each CSS style. Subsequent clones are marked with the name of the original style and a table of equivalences is output as a list to a clone table file (Figure 16). Once the clone table file has been output, the merged CSS file is optimized by removing all marked clones to yield a minimal CSS style file for the entire website.
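Exact clone detection over the merged styles amounts to grouping rules by their normalized bodies. The following Python sketch illustrates that idea under simplifying assumptions: each style is a hypothetical `(name, declarations)` pair rather than parsed CSS text, and `detect_style_clones` is an invented name, not the paper's TXL rule.

```python
def detect_style_clones(styles):
    """Find exact clones in a merged list of styles.

    styles: list of (name, declarations-dict) in merged-file order.
    Returns (clone_table, minimal), where clone_table lists
    (original, clone) pairs and minimal keeps only the first
    occurrence of each distinct style body.
    """
    first_seen = {}    # normalized body -> name of the original style
    clone_table = []
    minimal = []
    for name, declarations in styles:
        body = tuple(sorted(declarations.items()))
        if body in first_seen:
            clone_table.append((first_seen[body], name))
        else:
            first_seen[body] = name
            minimal.append((name, declarations))
    return clone_table, minimal

merged = [
    ("top_left", {"float": "left", "margin": "auto"}),
    ("top_left_container_1", {"float": "left", "margin": "auto"}),
    ("top_left_content_1", {"float": "left", "margin": "auto", "width": "250", "widthi": "266"}),
    ("top_left_content_2", {"float": "left", "margin": "auto", "width": "250", "widthi": "266"}),
]
clone_table, minimal = detect_style_clones(merged)
```

The `(original, clone)` pairs correspond to the `"top_left" -> "top_left_container_1"` lines of Figure 16.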

<div id="top_left" float="left">

<div id="top_left_container_1">

<div id="top_left_content_1" width="250"

widthi="266">

content 1

</div>

<br clear="both"/>

</div>

<div id="top_left_container_2">

<div id="top_left_content_2" width="250"

widthi="266">

content 4

</div>

<br clear="both"/>

</div>

</div>

<div id="top_middle" float="left">

<div id="top_middle_container">

<div id="top_middle_content" width="300"

widthi="266">

content 2

</div>

<br clear="both"/>

</div>

</div>

Figure 14: DIV Conversion Example After Renaming

The second transformation is then run on every page of the site, updating style references of each DIV according to the clone table. Each style reference is looked up in the clone table and changed to the name of the style of which it is a clone. The final result is a website with a single merged, optimized CSS file used by all pages of the site, as if the site had been hand crafted to the modern web standard.
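The lookup-and-rename step can be sketched with a regular expression over the page source. This Python sketch is illustrative only: the function name `resolve_clones` is hypothetical, the pattern assumes double-quoted `id` attributes on `div` tags as in the figures, and a real implementation would work on the parsed page as the TXL transformation does.

```python
import re

def resolve_clones(page, clone_table):
    """Rewrite DIV style references so every clone refers to its original style.

    clone_table: list of (original, clone) pairs, as written to the clone table file.
    """
    original_of = {clone: original for original, clone in clone_table}

    def swap(match):
        name = match.group(2)
        # Names that are not clones are left unchanged.
        return f'{match.group(1)}{original_of.get(name, name)}"'

    return re.sub(r'(<div\b[^>]*?\bid=")([^"]*)"', swap, page)

page = (
    '<div id="top_left"><div id="top_left_container_1">'
    '<div id="top_left_content_2">content 4</div></div></div>'
)
table = [("top_left", "top_left_container_1"), ("top_left_content_1", "top_left_content_2")]
resolved = resolve_clones(page, table)
```

After resolution several DIVs share the same name, as in the converted pages of Figures 3 and 17.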

2.7 Hand Tuning

The final step in the process is the hand tuning of the generated CSS styles to exactly match minor details of the original look and feel. Typically this involves adding a bit of extra margin space to the styles for some DIV blocks and removing an occasional redundant attribute. This step usually requires only a few minutes of web programmer time to complete.

3 Experience

Our method has been tested on a number of example websites with varying levels of table layout complexity ranging from simple layout

#top_left {

float: left;

margin: auto;

}

#top_left_container_1 {

float: left;

margin: auto;

}

#top_left_container_2 {

float: left;

margin: auto;

}

#top_left_content_1 {

float: left;

margin: auto;

width: 250;

widthi: 266;

}

#top_left_content_2 {

float: left;

margin: auto;

width: 250;

widthi: 266;

}

#top_middle {

float: left;

margin: auto;

}

#top_middle_container {

float: left;

margin: auto;

}

#top_middle_content {

float: left;

margin: auto;

width: 300;

widthi: 266;

}

Figure 15: Example Extracted CSS Style File
As style attributes are extracted to a CSS style file for each DIV converted page, they are removed from the DIVs in the page so that all style information appears only in the style file.

to complex ROWSPAN and COLSPAN structures in order to validate our table recognition algorithms and the ability of our method to preserve look and feel. In addition, two real entire table-based legacy websites, one with simple table layout and one with complex, one originally generated using Claris Home Page and one with MS Front Page, have been converted to test our clone detection and CSS generation methods. This section outlines our experiences with some of these examples, first with two simple layout sites and then two complex.

"top_left" -> "top_left_container_1"

"top_left" -> "top_left_container_2"

"top_left" -> "top_middle"

"top_left" -> "top_middle_container"

"top_left" -> "top_right"

"top_left" -> "top_right_container_1"

"top_left" -> "top_right_container_2"

"top_left" -> "a_top_left"

"top_left" -> "a_top_left_container_1"

"top_left" -> "a_top_left_container_2"

"top_left" -> "a_top_middle"

"top_left" -> "a_top_middle_container"

"top_left" -> "a_top_right"

"top_left" -> "a_top_right_container_1"

"top_left" -> "a_top_right_container_2"

"top_left" -> "b_top_left"

"top_left" -> "b_top_left_container_1"

"top_left" -> "b_top_left_container_2"

"top_left" -> "b_top_middle"

"top_left" -> "b_top_middle_container"

"top_left" -> "b_top_right"

"top_left" -> "b_top_right_container_1"

"top_left" -> "b_top_right_container_2"

"top_left_content_1" -> "top_left_content_2"

"top_left_content_1" -> "a_top_left_content_1"

"top_left_content_1" -> "a_top_left_content_2"

"top_left_content_1" -> "b_top_left_content_1"

"top_left_content_1" -> "b_top_left_content_2"

"top_middle_content" -> "a_top_middle_content"

"top_middle_content" -> "b_top_middle_content"

Figure 16: Partial Example Clone Table
As well as an integrated CSS style file, clone detection generates a clone equivalence table for use by the clone resolution transformation.

3.1 Queen’s School of Computing Home Page

The home page of the School of Computing’s website was authored and is maintained by hand in HTML using table layout, but with pre-existing CSS styles that must be retained in the result, making it an interesting and different kind of challenge for our method. The layout is relatively simple, involving no ROWSPANs in the tables. Figure 17 shows the result of converting the front page of this site using our method, preserving existing style references (such as "class=wong") while introducing our own for the newly generated DIVs. New styles generated by our process are concatenated to the existing CSS style file for the site, yielding an identical look and feel (Figure 18).

Figure 18: School of Computing Home Page Before and After Conversion

<div id="topcontainer">

<!--H1 -->

<div id="sep1">

<div id="topcontainercontent">

. . .

</div>

</div>

<div id="sep1">

<div id="leftsep1" class="wong">

<!-- -->

</div>

<div id="rightsep1">

. . .

</div>

<br clear="both"/>

</div>

<div id="sep1">

<div id="sep2bg">

<!-- -->

</div>

</div>

</div>

Figure 17: School of Computing Home Page Following Conversion

3.2 IEEE Kingston Website

The IEEE Kingston Section website is small, consisting of only seven HTML pages and 2,781 lines of code. It is a simple layout site, with no ROWSPANs in its table structures, but uses a highly complex hierarchical set of nested table structures to lay out its components. Due to this complexity, conversion of the site initially generated seven CSS files totalling 5,598 lines of style specifications, posing a challenge for our clone detection and style minimization steps. Clone detection found 769 cloned styles that could be minimized and removed (see the example in Figure 19), reducing the final CSS style file for

#topnavitem1 { float: left; margin: auto; widthi: 57; }
#topnavitem2 { float: left; margin: auto; widthi: 57; }
#topnavitem3 { float: left; margin: auto; widthi: 57; }
#topnavitem4 { float: left; margin: auto; widthi: 57; }
#topnavitem5 { float: left; margin: auto; widthi: 57; }
#topnavitem6 { float: left; margin: auto; widthi: 57; }
#topnavitem7 { float: left; margin: auto; widthi: 57; }

. . .

#topnavitem1 { float: left; margin: auto; widthi: 57; }

Figure 19: A Portion of the CSS Style File for the IEEE Kingston Website Before and After Clone Detection

the site to only 323 lines. Due to the large number of similar generated styles, the renaming stage for this site used a significant amount of human interaction time. This points to a potential limitation of our method that may need to be addressed in future work.

3.3 James Cordy’s Home Page

The home page of the second author’s website uses one ROWSPAN in the table layout

<table cellspacing="0" cellpadding="0" width="100%">

<tr>

<td rowspan="4" width="190">

. . .

</td>

<td nowrap>

. . .

</td>

<td>

. . .

</td>

</tr>

<tr>

<td>

. . .

</td>

<td>

. . .

</td>

</tr>

<tr>

<td colspan="2">

. . .

</td>

</tr>

<tr>

<td colspan="2">

. . .

</td>

</tr>

. . .

</table>

Figure 20: Portion of Original Table Layout of James Cordy’s Home Page

structures and controls layout using WIDTH specifications with both absolute and percent sizes as well as NOWRAP attributes, which are not available in CSS styles (Figure 20). This is typical of many legacy sites and makes a good example for our method. Following transformation, table partitioning has separated the ROWSPAN cells and maintained relative position using the FLOAT:LEFT style in the resulting CSS and DIV structure. Although there is no style corresponding to the NOWRAP attribute in the web standard, our transformation introduces a special NOBR DIV class, which is supported by the major browsers, to maintain the NOWRAP behaviour in the result (Figure 21).

3.4 TXL.ca Website

The TXL website was originally created using the early website authoring tool Claris Home

<div id="topleft">

<div id="topleftbody">

<div id="topleftcontent">

. . .

</div>

</div>

</div>

<div id="topright">

<div id="toprightbody1">

<div id="toprightboycontent">

<div class="nobr">

. . .

</div >

</div>

<div id="toprightboycontent">

. . .

</div>

</div>

<div id="toprightbody1">

<div id="toprightboycontent">

. . .

</div>

<div id="toprightboycontent">

. . .

</div>

</div>

<div id="toprightbody1">

<div id="toprightbody3content">

. . .

</div>

</div>

<div id="toprightbody1">

<div id="toprightbody3content">

. . .

</div>

</div>

</div>

<br clear="both"/>

Figure 21: Corresponding Portion of Generated DIV Layout of James Cordy’s Home Page

Page, which predates modern web standards and uses table layout and complex inline style attributes to achieve a pleasing result. Since the HTML code was originally automatically generated but is now hand maintained, this site is a prime candidate for our conversion. The site contains 38 HTML pages comprising approximately 4,996 lines along with two CGI scripts with dynamic page generation. Like many legacy retail sites, it includes pages with dynamic order forms and responses and makes a challenging realistic example.

The TXL site has a very complex table-based layout including multiple ROWSPANs at the same level. These pose a particular difficulty for table partitioning in that there is no unique solution. Our conversion resolves such ambiguities by choosing the ROWSPAN with the largest span to begin the partitioning, and then recursively repartitioning the next largest until all ROWSPANs have been resolved, giving a top-down implicit hierarchy.
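The largest-span-first ordering can be expressed as a simple sort. This Python fragment is only a sketch of the ordering heuristic, with hypothetical `(name, rowspan)` pairs and an invented function name `partition_order`; the actual repartitioning of the table after each step is not shown.

```python
def partition_order(spanning_cells):
    """Order row-spanning cells for partitioning, largest span first.

    spanning_cells: list of (name, rowspan) for cells at the same level.
    Python's sort is stable, so cells with equal spans keep source order.
    """
    ranked = sorted(spanning_cells, key=lambda cell: cell[1], reverse=True)
    return [name for name, _ in ranked]

order = partition_order([("sidebar", 2), ("banner", 4), ("menu", 3), ("footer", 2)])
```

Partitioning in this order puts the widest-reaching span at the top of the inferred hierarchy, with smaller spans nested beneath it.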

The realistic size of this production website provided an interesting challenge for our clone detection and style minimization algorithms as well. The site has two different kinds of pages which use two quite different look-and-feel styles. Before the clone detection and removal process, the page-by-page conversion to DIV yielded 38 CSS files with 19,028 lines of style code. However, after consolidation, detection and removal of 2,850 style clones, the final sitewide CSS file was reduced to only 647 lines, a maintainable and reasonably sized stylesheet for such a complicated website (Figure 22).
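The size reductions reported for the two converted sites can be put on a common scale with a little arithmetic. The Python below uses only the line counts stated in the text (19,028 to 647 lines for TXL.ca, 5,598 to 323 lines for IEEE Kingston); the helper name `reduction_percent` is hypothetical.

```python
def reduction_percent(lines_before, lines_after):
    """Percentage of style lines removed by clone detection and minimization."""
    return 100.0 * (lines_before - lines_after) / lines_before

txl_reduction = reduction_percent(19028, 647)    # TXL.ca website
ieee_reduction = reduction_percent(5598, 323)    # IEEE Kingston website
```

Both values come out above 90 percent, meaning that in each case the generated per-page styles were overwhelmingly duplicates of one another.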

The result of the conversion was a clearly maintainable web standards-compliant new website which is virtually indistinguishable from the original in look and feel (Figure 23) and has all the advantages of a hand crafted modern standard website in compatibility, maintainability and efficiency.

3.5 Other Examples

Several other websites have been used as examples for our process, including the IEEE ICCBSS conference website, which poses the new problem of multiple ROWSPANs at multiple simultaneous levels. At present our system cannot automatically partition such sites (which are relatively unusual), but it can still be applied, using a few minutes of hand partitioning for the pages with this problem. The conversion of this site yielded a home page which is literally indistinguishable from the original, while being completely migrated to the XHTML, DIV and CSS web standards.

3.6 Performance

Our process is very fast, requiring for example only two or three minutes of transformation time for the entire TXL website conversion on a standard PC. The two human intervention steps, tag renaming and final style tuning, require a skilled web programmer to be done efficiently. Even so, each of these steps took only 10 to 15 minutes of programmer time in the TXL website conversion, for a total of less than 30 minutes elapsed time to convert the 38-page table-based legacy site to a maintainable modern web standards-compliant result.

#topcontainer {
  padding: 0;
  height: auto;
  margin: auto;
  width: 798px;
  text-align: center;
}
#sep1 {
  float: left;
  margin: auto;
}
#topcontainercontent {
  text-align: left;
  width: 798px;
  float: left;
  margin: auto;
}
#leftsep1 {
  text-align: left;
  float: left;
  height: 24px;
  margin: auto;
  width: 675px;
}
#rightsep1 {
  text-align: right;
  background-color: #ffcc00;
  float: left;
  height: 24px;
  margin: auto;
  vertical-align: top;
  width: 123px;
}
#sep2bg {
  text-align: left;
  background-color: #ffffff;
  float: left;
  height: 1px;
  margin: auto;
  width: 798px;
}

Figure 22: Part of Generated CSS Stylesheet for the TXL.ca Website

4 Related Work

Some commercial tools offer a rudimentary form of automated web standards conversion. For example, Adobe Dreamweaver [1] has a conversion that can automatically convert pages from table layout to DIV-based layout. However, these tools use only local CSS styles on a per-page basis and do not recognize commonality, with the result that the converted page is no more maintainable than the original. In the case of Dreamweaver, the situation is even worse since in order to maintain an identical look the tool resorts to converting to absolute positions and sizes for all page elements, losing all structure and making it impossible to recognize style clones even by hand.

Figure 23: TXL.ca Website Main Page Before and After Conversion (Rendered in different browser window sizes.)

Many other researchers have worked in the broader area of web technology migrations, including several that have exploited the same source transformation engine, TXL, that underlies our work. Hassan et al. [10] have used TXL transformations to migrate web applications from ASP to the NSP web framework while preserving the original code structure and commenting. Xu et al. [20] have used TXL to modernize embedded Java code in JSP pages to custom-tag based JSP applications.

In other work, Jiang et al. [13] have used pattern matching techniques to automatically analyze server generated pages to aid in migrating web applications to web services. Ping et al. [15] have described an approach to migrating web applications from IBM Net.Data to JSP while separating database aspects from presentation logic. Ricca et al. [16] have used clustering techniques to identify static pages that can be transformed to dynamic pages.

Our method for clone detection is based on our previous work with Synytskyy [7], which addressed the problem of identifying near-miss clones in static and dynamic web pages. Our method adopts only exact clone detection from that work, but could be extended to parameterized styles using near-miss clone detection. Many other clone detection techniques could be used as well, for example Baxter’s method based on abstract syntax trees [3], which also handles near-miss clones.
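To make the distinction between exact and near-miss style clones concrete, here is a small sketch. It is an assumption-laden illustration, not the technique of [7] or [3]: it treats two rules as near-miss clones when they set the same properties and differ in at most `max_diff` values. The function name `near_miss_pairs` and the `#leftsep2` rule are hypothetical.

```python
from itertools import combinations

def near_miss_pairs(rules, max_diff=1):
    """Find pairs of rules with identical properties but a few differing values.

    rules maps a selector to a {property: value} dict. Exact clones
    (no differing values) are deliberately excluded from the result.
    """
    pairs = []
    for (a, da), (b, db) in combinations(sorted(rules.items()), 2):
        if da.keys() != db.keys():
            continue  # different property sets: not candidates in this sketch
        diff = sorted(p for p in da if da[p] != db[p])
        if 0 < len(diff) <= max_diff:
            pairs.append((a, b, diff))
    return pairs

rules = {
    "#leftsep1": {"text-align": "left", "float": "left", "height": "24px",
                  "margin": "auto", "width": "675px"},
    # Hypothetical rule differing from #leftsep1 only in its width.
    "#leftsep2": {"text-align": "left", "float": "left", "height": "24px",
                  "margin": "auto", "width": "600px"},
    "#sep1": {"float": "left", "margin": "auto"},
}
print(near_miss_pairs(rules))
```

Each reported pair names the properties that would become parameters of a shared style, here only `width`.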

Our simple table partitioning strategy is only roughly based on the wide range of literature in table recognition [22], and relates most closely to methods such as Handley’s [9] which address the higher-level logical structure of tables. Full table recognition is a much more ambitious task which includes table detection, functional and structural analysis, and finally table interpretation [12]. One piece of work that relates well to ours because it applies similarity-based reasoning to detect cells in tables is that of Chen et al. [4], which has been used to detect and analyze tables in airline information web pages. Yoshida et al. [21] have described a method to integrate HTML tables based on the category of object in the table. However, in these cases the focus has been on the contents of true tables rather than on table-based layout.

Finally, we have used the TXL transformation language [6] in implementing our source transformations. Any other modern source transformation system, for example ASF+SDF [18] or Stratego [19], could serve as well. Advantages of TXL over XSLT [5] in our application include its ability to handle malformed input using robust parsing [2], its ability to express complex nonstructural patterns [8], and previous experience in applying it to table recognition and clone detection tasks [23, 7].

5 Summary & Future Work

We have presented an automated process for migrating legacy websites using table-based layout to modern, maintainable XHTML, DIV and CSS web standards-based websites with all the advantages of a hand-crafted modern replacement. Our process uses table recognition and software clone detection technology with multi-pass source transformation to yield a high quality result that preserves the hierarchical and layout structure of web pages while synthesizing a minimized common stylesheet for the entire site. The look and feel of converted websites is virtually identical to the original.

Our method has a number of limitations that bear further work. As noted in Section 3.2, websites with large numbers of small similar elements can make the renaming task an issue. This could be addressed with better name inference for generated DIVs, or better clues to the programmer indicating which styles are likely to be eliminated as clones. Multiple ROWSPANs at multiple simultaneous levels pose problems that currently require partitioning by hand. While this limitation can largely be overcome with better table analysis, to some extent it derives from the limited expressiveness of the DIV feature itself, and may always need some hand tuning. At present our system does not handle web applications written in server-side languages such as ASP and JSP. This can potentially be addressed by adopting the multilingual parsing methods of Synytskyy et al. [17] to separate page factors using robust parsing. And of course the method needs to be validated on a larger number of legacy websites (of which there is no lack of examples).

Acknowledgements

This work is supported by the Natural Sciences and Engineering Research Council of Canada.

About the Authors

Andy Mao is an IT specialist in the global business services division of the IBM Pacific development center in Vancouver, BC. Prior to joining IBM, Andy was a web developer at Blast Radius Inc. from 2003-05. He holds a B.A. in Information Technology from York University and recently completed his Master’s in Computing at Queen’s University under the supervision of James Cordy and Thomas Dean. His research interests include source transformation, software re-engineering and e-commerce web applications.

James Cordy is a Professor and recent former Director of the School of Computing at Queen’s University. From 1995 to 2000 he was Vice President and Chief Research Scientist at Legasys Corporation, a software technology company specializing in legacy software system analysis and renovation. Dr. Cordy is a founding member of the Software Technology Laboratory at Queen’s and winner of the 1994 ITRC Innovation Excellence award and the 1995 ITRC Chair’s Award for Entrepreneurship in Technology Innovation. He is a member of the ACM, a senior member of the IEEE and an IBM CAS Faculty Fellow.

Thomas Dean is an Associate Professor in the Department of Electrical and Computer Engineering at Queen’s University and an Adjunct Associate Professor at the Royal Military College of Kingston. His background includes research in air traffic control systems, language formalization and five and a half years as a Sr. Research Scientist at Legasys Corporation where he worked on advanced software transformation and evolution techniques in an industrial setting. His current research interests are software transformation, web site evolution and the security of network applications.

References

[1] Adobe. Dreamweaver CS3. http://www.adobe.com/products/dreamweaver/.

[2] D.T. Barnard and R.C. Holt. Hierarchic Syntax Error Repair for LR Grammars. Int. J. Computing and Information Sciences, 11(4):231–258, 1982.

[3] I.D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier. Clone Detection using Abstract Syntax Trees. In IEEE 14th Int. Conf. on Software Maintenance, pages 368–377, 1998.

[4] H. Chen, S. Tsai, and J. Tsai. Mining Tables from Large Scale HTML Texts. In 18th Conf. on Computational Linguistics, pages 166–172, 2000.

[5] J. Clark. XSL Transformations (XSLT) v1.0. http://www.w3.org/TR/xslt, 1999.

[6] J.R. Cordy. The TXL Source Transformation Language. Science of Computer Programming, 61(3):190–210, 2006.

[7] J.R. Cordy, T.R. Dean, and N. Synytskyy. Practical Language-independent Detection of Near-miss Clones. In CASCON ’04: 2004 Conf. of the Center for Advanced Studies on Collaborative Research, pages 1–12, 2004.

[8] T.R. Dean, J.R. Cordy, A.J. Malton, and K.A. Schneider. Agile parsing in TXL. J. Automated Software Engineering, 10(4):311–336, 2003.

[9] J. Handley. Table Analysis for Multi-line Cell Identification. In Document Recognition and Retrieval VIII, volume 4307, pages 34–43, 2001.

[10] A.E. Hassan and R.C. Holt. Migrating Web frameworks using Water Transformations. In IEEE 27th Int. Conf. on Computer Software and Applications, pages 296–303, 2003.

[11] J. Hu, R. Kashi, D. Lopresti, and G. Wilfong. Table Structure Recognition and its Evaluation. In Document Recognition and Retrieval VIII, volume 4307, pages 44–55, 2001.

[12] M. Hurst. Layout and Language: Challenges for Table Understanding on the Web. In 1st Int. Workshop on Web Document Analysis, pages 27–30, 2001.

[13] Y. Jiang and E. Stroulia. Towards Reengineering Web Sites to Web-services Providers. In 8th European Conf. on Software Maintenance and Reengineering, pages 296–308, 2004.

[14] R. Koschke. A Survey of Research on Software Clones. In Duplication, Redundancy, and Similarity in Software, number 06301 in Dagstuhl Seminar Proceedings, 2007.

[15] Y. Ping, J. Lu, T.C. Lau, K. Kontogiannis, T. Tong, and B. Yi. Migration of Legacy Web Applications to Enterprise Java Environments net.data to JSP Transformation. In CASCON ’03: 2003 Conf. of the Center for Advanced Studies on Collaborative Research, pages 223–237, 2003.

[16] F. Ricca and P. Tonella. Using Clustering to Support the Migration from Static to Dynamic Web Pages. In 11th IEEE Int. Workshop on Program Comprehension, pages 207–216, 2003.

[17] N. Synytskyy, J.R. Cordy, and T.R. Dean. Robust Multilingual Parsing using Island Grammars. In CASCON ’03: 2003 Conf. of the Center for Advanced Studies on Collaborative Research, pages 266–278, 2003.

[18] M.G.J. van den Brand, J. Heering, P. Klint, and P.A. Olivier. Compiling Language Definitions: the ASF+SDF Compiler. ACM Trans. on Prog. Languages and Systems, 24(4):334–368, 2002.

[19] E. Visser. Program Transformation with Stratego/XT: Rules, Strategies, Tools, and Systems in Stratego/XT 0.9. In Domain-Specific Program Generation, volume 3016 of LNCS, pages 216–238, 2004.

[20] S. Xu and T.R. Dean. Transforming Embedded Java to Custom Tags. In IEEE 5th Int. Workshop on Source Code Analysis and Manipulation, pages 173–182, 2005.

[21] M. Yoshida, K. Torisawa, and J. Tsujii. A method to integrate tables of the world wide web. In 1st Int. Workshop on Web Document Analysis, pages 31–34, 2001.

[22] R. Zanibbi, D. Blostein, and J.R. Cordy. A Survey of Table Recognition: Models, Observations, Transformations, and Inferences. Int. J. Document Analysis and Recognition, 7(1):1–16, 2004.

[23] R. Zanibbi, D. Blostein, and J.R. Cordy. The Recognition Strategy Language. In ICDAR ’05: 8th Int. Conf. on Document Analysis and Recognition, pages 565–569 Vol. 2, 2005.