XLIFF Localisation for Joomla!
Translator-oriented localisation of CMS-based websites

Jesús Torres del Rey
Emilio Rodríguez Vázquez de Aldana
Faculty of Translation and Documentation
http://diarium.usal.es/codex


Agenda

Introduction
– Motivation
– Multilingual management & interchange

Our Research/Experiments
– Analysis of other tools
– Application Workflow
– XLIFF 1.2, XML+its1.0
– Behaviour in CAT tools

Translation-Oriented L10n

Future Work

Jesús Torres del Rey & Emilio Rodríguez Vázquez de Aldana 1

Motivation: chronology

2009: Request for translation of Faculty’s website (Joomla 1.5, multilingual Joomfish)
• Html download > use of CAT > paste on Joomla html editor

2010-11: How to teach localisation of dynamic websites to our UG students?
– Full localisation of static websites taught
  • Filetypes and technologies (html, js, css, graphics…)
  • Super-, Macro-, Hyper-, Micro- structures
  • Directory structures, relative links…
  • Link/Web management (Ms Expression, Adobe DW…)
  • Automatisation via Search/Replace, regular expressions…

2012: Multilingual extensions for Joomla 2.5
– Falang (also for Joomla 3), Josetta, Joomfish, Jolomea

2013: Research with other CMSs (Drupal, WordPress, TYPO3)

Motivation: T&R philosophy

Translator/Localiser-oriented approach
– Integration with CAT/Localisation tools
– Empowerment through control of
  • Processes, lifecycle
    – From request to publication, update, multilingualisation…
  • Visual/Relational/Functional Context, Global meaning, Negotiation of communication needs
– Standardisation, XLIFF, ITS
– Acquisition of basic knowledge of Nature and Mechanics of Dynamic, CMS-based websites
  • (On top of nature and mechanics of static websites)
  • Filetypes, Databases and technologies
  • Server – Client infrastructure
  • Composition of Dynamic active pages
  • Front-end, Back-end, interface, content…


Motivation: (dis)empowerment

Static html-based websites (full localisation) vs. dynamic CMS-based websites (patchy translation):

• Visual and functional context
  – CMS: partial webpage/separate translation environment
• Use of functionality, quality tools (CAT/L)
  – CMS: texts “locked” in DB -> export/import (for interchange, batch quality/analysis/term extraction processes)
• Capable of multilingual re-structuring
  – CMS: only if administrative rights for CMS and multilanguage module installed
• Publication-ready deliverables
  – CMS: only if write-access rights; partial, patchy publication


L10n, I18n, Multilanguage: evolution in CMSs

Specific, often third-party, modules to make multilingual websites easy to set up and manage.
– Automatic duplication of structure/pages
– Taking advantage of simplified CMS editing environments

At the same time, translatable data export/import modules to csv, po, xml and, increasingly, XLIFF
» Drupal XLIFF Tools, WordPress WPML, TYPO3 l10nmgr, Joomla JDiction (since early 2013)...

Combination of multilingual management and XLIFF et al. export/import
» WordPress WPML, Joomla JDiction (since early 2013)...


Our experiments: overview

Application: Falang2XLIFF (beta)
» http://diarium.usal.es/codex/desarrollo

– Java Client
  • (compiled to 1.7)
  • Handy experimental tool with our limited resources
  • Not embedded into CMS as a module: access rights to DB?
– Uses Falang multilingual DB structure for Joomla
  • Potentially applicable to other DB structures, like Josetta, Jfish…
– Main purpose: to experiment with data to be extracted, XLIFF and whole L10n process, and to use it for our UG L10n course for translator training

Other tools:
• JDiction (XLIFF tool added since March)
• For other CMSs: XLIFF Tools (Drupal)


Experiments: L10n objects

CMS objects for L10n:
– Editor/Administration interface
  • php, asp, or externalised to ini, po…
– Dependent, linked files (pdf, epub, graphics, video, audio…)
– Database elements
  • Article/page
  • Modules (e.g. calendar…)
  • Categories (e.g. for thematically grouping blog posts)
  • Smaller user interaction elements (weblinks, etc.)


Experiments: L10n objects

In Database tables, L10n elements:
1. Structural/Interface text strings
   – menus, article titles, sections…
2. Longer (x)html article contents
3. Parameters for the above elements
   – metakey, metadesc, menu params…

– All in text fields in DB*


Experiments: other extraction strategies (JDiction)

Titles: <![CDATA[ TEXT ]]>

HTML (tag & text): <![CDATA[ TEXT ]]>

Parameters: state->translated! (Drupal: final status)

Experiments: other extraction>CAT (JDiction>Virtaal)

• Tags are visually marked, probably by means of a regex such as <[^>]+/?>
• However, the tags are left unprotected
• CAT tools could integrate a WYSIWYG html editor if XLIFF 1.2 datatype="htmlbody"


Experiments: other extraction>CAT (Jdiction>MemoQ 5)

• Filters not always versatile enough
• Segments should be shorter and regularly segmented for better matches and TM leverage


Experiments: Summarising JDiction

Multilingual management + export/import
– Some multilingual management problems:
  • Translation editor:
    – separate environment (not integrated in target-language page)
    – does not show original in parallel
– Some export/import problems:
  • Indiscriminate bulk export, irrespective of newness or update/translated state
  • CDATA export of (x)html content
    » No different from csv export
    » Whole article/item, without structure
– XHTML should be processed with XML processors, rather than with regular expressions
– HTML text should be carried to CAT tool not as plain text but as html tags and text (as Drupal XLIFF Tools does)
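
The point about XML processors versus regular expressions can be illustrated with a small sketch. This is not JDiction or Falang2XLIFF code: it is a Python illustration using only the standard library, and the fragment and function names are invented for the example. It compares the tag pattern quoted on the Virtaal slide with a real XML parser on an attribute value that contains ">".

```python
import re
import xml.etree.ElementTree as ET

# Tag pattern quoted on the Virtaal slide.
TAG_RE = re.compile(r"<[^>]+/?>")

# Made-up XHTML fragment; ">" is legal inside an attribute value.
FRAGMENT = (
    '<p>See <img alt="a > b" src="x.png"/> and '
    "<strong>Joomla!</strong> &amp; more</p>"
)


def tags_by_regex(text):
    """Return whatever the regex believes to be tags."""
    return TAG_RE.findall(text)


def tags_by_parser(text):
    """Return element names as seen by an XML parser."""
    return [element.tag for element in ET.fromstring(text).iter()]


if __name__ == "__main__":
    # The regex cuts the <img> tag short at the ">" inside alt="...".
    print(tags_by_regex(FRAGMENT))
    # The parser sees three elements and the full attribute value.
    print(tags_by_parser(FRAGMENT))
    print(ET.fromstring(FRAGMENT).find("img").get("alt"))
```

With a parser, tags and text arrive as a tree, so an export can carry inline markup as structure instead of as an opaque CDATA string.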


Experiments: Application Workflow (export)

[Diagram: export workflow between Joomla! with Falang (DB) and the Falang2Xliff tool]

1. In Falang, element selection
2. DB Connection
3. DB Extraction (new & updated)
4. XML Generation
5. XLIFF Generation (xml2xliff.xsl of XliffRoundTrip Tool)

Files shown in the diagram: Simple XML (temporary), XML+its1.0, XLIFF 1.2


Application: Workflow (1/6)
1. In Falang: Element Selection

1.1 Falang: PM with admin rights…
1.2 …selects elements one by one! and…
1.3 Copy Source!


Application: Workflow (2/6)
2. Database Connection

– Only standard TCP/IP connections to SQL server
  • Only in network security zone or localhost
– Joomla DB prefix needed
– Read-access permission for export
  • Falang tables but also Original content tables, to check newness & update status


Application: Workflow (3/6)
3. Database Extraction (new & updated)

• New: established as translatable by PM by using “Copy Source”
• Updated: translatable text whose source content has been edited (original content tables checked via MD5 hash)
– Info from attributes title, text, introtext, name, fulltext, description & content in tables categories, content, menu, modules and weblinks
• Parameters not extracted to prevent DB corruption.
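
The new/updated distinction above can be sketched as follows. This is a hedged Python illustration, not the tool's Java code. It assumes that each translation row keeps an MD5 hash of the source text as it was when the row was created, and that "Copy Source" leaves an untouched copy of the source in the translation field. The class, field and function names are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TranslationRow:
    """Hypothetical view of one row in the translation table."""
    reference_field: str  # e.g. "title" or "introtext"
    value: str            # translation text (a copy of the source after "Copy Source")
    stored_md5: str       # assumed: hash of the source when the row was created


def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def classify(row, current_source):
    """Return "updated", "new" or "translated" for one row."""
    if md5_hex(current_source) != row.stored_md5:
        return "updated"      # source was edited after the row was created
    if row.value == current_source:
        return "new"          # still an untouched copy of the source
    return "translated"       # nothing to export


if __name__ == "__main__":
    source = "Using Joomla!"
    row = TranslationRow("title", source, md5_hex(source))
    print(classify(row, source))
    print(classify(row, "Using Joomla! 3"))
```

Only rows classified as "new" or "updated" would be sent for translation, which avoids the indiscriminate bulk export criticised earlier.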


Application: Workflow (3/6)
3. Database Extraction (new & updated):

– The Joomla! html editor typically rewrites HTML fragments as XHTML
– But are we certain that it is correct XHTML?
  • We have rechecked (Jericho HTML Parser) and rewritten data if necessary
    – XML entities, closing attribute quotes, checking and correcting node hierarchy
    » Some current limitations: e.g. unpaired <tag> <tag/>
– XHTML elements should be stored in DB as XMLElements
  » ISO/IEC 9075-14:2011, Part 14: XML-Related Specifications (SQL/XML)
  » XML support low in MySQL; high in PostgreSQL
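
A minimal sketch of the kind of check described above, in Python rather than with the Jericho HTML Parser the authors used. It only detects fragments that are not well-formed XML (the repair step is not shown), and the function name and the wrapping `<root>` element are assumptions of this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML once wrapped in a root element."""
    try:
        ET.fromstring("<root>" + fragment + "</root>")
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    samples = [
        "<p>Tom &amp; Jerry</p>",   # escaped ampersand
        "<p>Tom & Jerry</p>",       # bare ampersand
        "<p>line<br></p>",          # unpaired tag
        "<p>line<br/></p>",         # self-closed tag
        '<p class="x>text</p>',     # attribute quote never closed
    ]
    for sample in samples:
        print(is_well_formed(sample), sample)
```

Fragments that fail the check are the ones that would need rewriting before they can be embedded as XML elements in the export file.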


Application: Workflow (4/6)
4. XML Generation

<value_falang>Usando Joomla! &amp; …</value_falang>

<value_falang><p> <img …/>… </p></value_falang>


Presenter's notes: it would be desirable for databases to store XHTML in XML fields.

Application: Workflow (4/6)
4. XML Generation (temporary file to be converted to XLIFF)

<registros_falang>: Root
<registro_falang>: Attributes contain info for correct back import to DB
<value_falang>: Contains translatable content (can include html elements)

[Figure labels: Text, XHTML]
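
The structure of the temporary file can be sketched as below. The three element names come from the slide. The attribute names (`id`, `reference_table`, `reference_field`), the row dictionaries and the function name are hypothetical, since the slide does not list the attributes actually used. The sketch is Python, not the tool's Java.

```python
import xml.etree.ElementTree as ET


def build_temp_xml(rows):
    """Build <registros_falang> with one <registro_falang> per extracted row."""
    root = ET.Element("registros_falang")
    for row in rows:
        record = ET.SubElement(root, "registro_falang", {
            "id": str(row["id"]),                 # hypothetical attribute names,
            "reference_table": row["table"],      # kept so that the import can
            "reference_field": row["field"],      # find the row again
        })
        value = ET.SubElement(record, "value_falang")
        if row["is_xhtml"]:
            # Parse the fragment so that tags become real child elements.
            wrapper = ET.fromstring("<w>" + row["content"] + "</w>")
            value.text = wrapper.text
            value.extend(list(wrapper))
        else:
            value.text = row["content"]           # escaped on serialisation
    return root


if __name__ == "__main__":
    rows = [
        {"id": 1, "table": "content", "field": "title",
         "is_xhtml": False, "content": "Usando Joomla! & …"},
        {"id": 2, "table": "content", "field": "introtext",
         "is_xhtml": True, "content": '<p><img src="logo.png" alt="Logo"/> Welcome</p>'},
    ]
    print(ET.tostring(build_temp_xml(rows), encoding="unicode"))
```

Plain text fields end up as escaped character data, while XHTML fields keep their markup as elements that later steps can address with XPath.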

Application: Workflow (4/6)
4. Generation of XML+its1.0

Global, Embedded ITS rules. Features:
• Translate
• Elements Within Text

W3C WG (2008): Best Practices for XML Internationalization. 5.1.4 Associating existing XHTML markup with ITS

ITS 1.0 supports XPath 1.0 (which does not support regex)
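
As an illustration of the two ITS 1.0 features named above, the following Python sketch builds a global `its:rules` element with a Translate rule pair and an Elements Within Text rule. The namespace and the rule element names are those of ITS 1.0. The selectors, the choice of inline elements and the function name are assumptions of this example, not the rules shipped with Falang2XLIFF.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
ET.register_namespace("its", ITS_NS)


def build_its_rules(inline_tags):
    """Return an <its:rules> element for the temporary XML file."""
    rules = ET.Element(f"{{{ITS_NS}}}rules", {"version": "1.0"})
    # Assumed rule pair: nothing is translatable except <value_falang> content.
    ET.SubElement(rules, f"{{{ITS_NS}}}translateRule",
                  {"selector": "/registros_falang", "translate": "no"})
    ET.SubElement(rules, f"{{{ITS_NS}}}translateRule",
                  {"selector": "//value_falang", "translate": "yes"})
    # Inline XHTML elements stay inside the surrounding segment.
    selector = " | ".join("//" + tag for tag in inline_tags)
    ET.SubElement(rules, f"{{{ITS_NS}}}withinTextRule",
                  {"selector": selector, "withinText": "yes"})
    return rules


if __name__ == "__main__":
    print(ET.tostring(build_its_rules(["strong", "em", "span", "a"]),
                      encoding="unicode"))
```

Because the selectors are plain XPath 1.0, inline elements have to be listed by name; there is no way to match them with a regular expression.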

Application: Workflow (5/6)
5. Generation of XLIFF 1.2

– Schnabel’s xml2xliff.xsl adapted so that the source language is a variable
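
For orientation, a minimal XLIFF 1.2 skeleton with the source language passed in as a parameter can be sketched in Python as follows. This stands in for the XSL transformation and does not reproduce xml2xliff.xsl. The function name, the `original` file name and the unit ids are invented for the example.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
ET.register_namespace("", XLIFF_NS)


def q(name):
    """Qualify a local name with the XLIFF 1.2 namespace."""
    return f"{{{XLIFF_NS}}}{name}"


def build_xliff(units, source_lang, original="falang-export.xml"):
    """Return an XLIFF 1.2 document (as text) with one trans-unit per (id, text)."""
    xliff = ET.Element(q("xliff"), {"version": "1.2"})
    file_el = ET.SubElement(xliff, q("file"), {
        "original": original,            # hypothetical file name
        "source-language": source_lang,  # the variable mentioned on the slide
        "datatype": "xml",
    })
    body = ET.SubElement(file_el, q("body"))
    for unit_id, text in units:
        unit = ET.SubElement(body, q("trans-unit"), {"id": unit_id})
        ET.SubElement(unit, q("source")).text = text
    return ET.tostring(xliff, encoding="unicode")


if __name__ == "__main__":
    print(build_xliff([("1", "Usando Joomla!")], "es-ES"))
```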


Application: Workflow
Generation of XML+its1.0 and XLIFF


Application: Workflow (import)
From XLIFF/XML+its back to DB (import)

[Diagram: import workflow between the Falang2Xliff tool and Joomla! with Falang (DB)]

1. XML Generation (xliff2xml.xsl of XliffRoundTrip)
2. SQL Generation

Files shown in the diagram: XLIFF 1.2, XML + ITS 1.0, XML (temporary), SQL
Optional online update

• XLIFF encoding: UTF-8 without BOM
• Translation states (e.g. “needs-translation”, etc.) not taken into account
• XML to SQL via XQuery processor (http://xmlbeans.apache.org/index.html)
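
The import direction can be sketched as follows. The authors generate SQL with an XQuery processor; this Python sketch swaps in the standard-library ElementTree parser instead. It assumes, purely for illustration, that each trans-unit id is the id of a row in a `<prefix>falang_content` table with a `value` column. As on the slide, translation states are ignored.

```python
import re
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def xliff_to_updates(xliff_text, table_prefix):
    """Return (sql, params) pairs, one per trans-unit that has a <target>."""
    if not re.fullmatch(r"[A-Za-z0-9_]+", table_prefix):
        raise ValueError("unexpected characters in table prefix")
    # Hypothetical table and column names; values are passed as parameters.
    sql = f"UPDATE {table_prefix}falang_content SET value = %s WHERE id = %s"
    updates = []
    for unit in ET.fromstring(xliff_text).iter(f"{{{XLIFF_NS}}}trans-unit"):
        target = unit.find(f"{{{XLIFF_NS}}}target")
        if target is None:
            continue  # nothing translated for this unit
        updates.append((sql, ("".join(target.itertext()), unit.get("id"))))
    return updates


if __name__ == "__main__":
    sample = (
        '<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2">'
        '<file original="x" source-language="es" target-language="en" datatype="xml">'
        '<body><trans-unit id="12"><source>Hola</source><target>Hello</target>'
        "</trans-unit></body></file></xliff>"
    )
    print(xliff_to_updates(sample, "jos_"))
```

Keeping the values as parameters, rather than pasting them into the statement, leaves quoting to the database driver that executes the update.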


Application: Workflow (import) From XLIFF/XML+its back to DB (import)


Application: XLIFF generated

xliffRoundTrip XSL
– For regular XML structures
– Limitation:
  • Attributes (translatable) must be post-processed

XHTML source (numbers mark the resulting segments):
1  <p><img alt="…" … </p>
2  <p><span>… <a title ="…" >… </a>… </span></p>
3  <ul><li><span> … <strong>… </strong>…</span></li>
4  <li><span> … <strong>… </strong>… <em>…</em>…</span></li></ul>

XLIFF generated:
1  <trans-unit><x/>……</trans-unit>
2  <group><trans-unit>… <g id="" >… </g>… </trans-unit></group>
3  <group> <group><trans-unit> … <g id="">… </g>…</trans-unit></group>
4  <group><trans-unit> … <g id="">… </g>… <g id=""> … </g>… </trans-unit></group> </group>

Tags: <group> (without text), <trans-unit> (with text), <g> </g>, <x/> (within text/inline)
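
The inline mapping can be sketched in Python as below. This illustrates the idea of paired inline elements becoming `<g>` and empty ones becoming `<x/>`. It is not the XliffRoundTrip stylesheet, and the function names and id numbering are assumptions. Like the XSL, it drops attributes such as `title` or `alt`, which is why translatable attributes need post-processing.

```python
import itertools
import xml.etree.ElementTree as ET


def _copy_inline(src, dst, ids):
    """Copy text from src to dst, replacing child elements with <g> or <x/>."""
    dst.text = src.text
    for child in src:
        is_empty = len(child) == 0 and not (child.text or "").strip()
        inline = ET.SubElement(dst, "x" if is_empty else "g",
                               {"id": str(next(ids))})
        if not is_empty:
            _copy_inline(child, inline, ids)
        inline.tail = child.tail


def to_source(block):
    """Return an XLIFF <source> element for one text-bearing XHTML block."""
    source = ET.Element("source")
    _copy_inline(block, source, itertools.count(1))
    return source


if __name__ == "__main__":
    paragraph = ET.fromstring(
        '<p>Click <a title="Home">here</a> or <img alt="logo"/> now</p>'
    )
    print(ET.tostring(to_source(paragraph), encoding="unicode"))
```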


Experiments: XLIFF>CAT

Translation Units segmented at paragraph level


Experiments: XML+its1.0>CAT

Support for ITS in CAT?
• SDL Trados Studio:
  – Global & Embedded rules for features: Translate, Elements Within Text
• Okapi Rainbow
  – (For Global:) Translate, Elements Within Text, LocNote
• XTM
  – (Linked File)

Experiments: html overtagging > XLIFF

Many reformatting actions (on the html editor) produce html overtagging

Before (segments 3 and 4):
<ul> <li><span> … <strong>… </strong>…</span></li> <li><span> … <strong>… </strong>… <em>… </em>… </span> </li> </ul>

After:
<ul> <li><span> … <strong>… </strong>…</span></li> <li><span style=""> … </span><strong style="">… </strong><span style="">…</span>… <em style="">…</em><span style="">… </span> </li> </ul>

Previous Segment 4 becomes 4, 5, 6, 7, 8

Therefore, one trans-unit for each <tag></tag> pair
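
The effect can be reproduced with a toy segmenter. The rule used here is an assumption made for the sketch, not the XSL's actual logic: an element becomes one segment as soon as it holds text directly, and otherwise each of its children is examined in turn. The two list items are invented examples.

```python
import xml.etree.ElementTree as ET


def has_direct_text(element):
    """True if the element itself, not only its descendants, carries text."""
    if (element.text or "").strip():
        return True
    return any((child.tail or "").strip() for child in element)


def segments(element):
    """Return the text of each segment found under element."""
    if has_direct_text(element):
        return ["".join(element.itertext()).strip()]
    found = []
    for child in element:
        found.extend(segments(child))
    return found


CLEAN = "<li><span>Install <strong>Falang</strong> and <em>export</em> it</span></li>"
OVERTAGGED = (
    '<li><span style="">Install </span><strong style="">Falang</strong>'
    '<span style=""> and </span><em style="">export</em>'
    '<span style=""> it</span></li>'
)

if __name__ == "__main__":
    print(segments(ET.fromstring(CLEAN)))       # one segment
    print(segments(ET.fromstring(OVERTAGGED)))  # five fragments
```

In the clean item the span holds the text and keeps its inline children together. After overtagging, the list item holds no text of its own, so every sibling turns into a separate unit.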

Experiments: html overtagging > XLIFF > CAT

• Html overtagging by CMS html editors produces oversegmentation when converting to XLIFF (following XSL’s logical segmentation strategy)
• CMS editors’ Clean-html function seldom helps!


Experiments: html overtagging > ITS > XLIFF > CAT

Okapi Rainbow-generated XLIFF from XML+its 1.0

XML+its 1.0 converted to SDLXLIFF by CAT tool


Translator/Localiser needs

[Diagram: the Translator/Localiser in relation to the CMS, CAT/L tools and the PM. Labels: Communication Structure; Agent/Doc/Kn Interaction; Global Meaning & Function; Intratextual relations; Purpose; Form, layout, expression; Quality, Consistency, Adherence to conventions, leverage, format, language/knowledge building; Exchange]

Translator/Localiser needs

Meaningful, (dynamically) coherent whole that needs to attract, keep & direct attention
– Translation as just a matter of words, just a language problem?!
– Localisation/Translation as adaptation, communication, cultural/professional mediation
– Articles/Items are coherently, cohesively integrated in
  • General/Particular communicative/performative purpose
  • Sometimes bigger articles
  • Regions in the webpage, & relative positions
  • Hyperlink/Interaction relationships
  • Structure/sitemap relationships (internal and external: menus, etc.)
  • Potentially indexed search results
  • Type of article/element/module categories
  • Usability/Accessibility needs/alternatives


Translator/Localiser needs

CMS<xliff/its>CAT/L TOOLS
– Exported units must behave properly and efficiently in CAT/L tools
  • Segmentation
  • XHTML structure, function, meaning of tags
– Preview? Visual/functional contextualisation
  • Link to published webpage, highlighted translated elements
  • Zielinski & Beuster (memoQfest 2012): DB>html>CATpreview
– Control of new elements, updates, trans status, etc.
– Interchange (batch extraction, revision, etc.)
– Other
  • Possibility of placeable adaptation?
    – E.g. specific/global localisable links (href attribute)


Translator/Localiser desiderata

CMS4L10n
– CMS managing content … taking translators/localisers/PMs into account
  • Separating content from layout & function but showing interrelationships
    – XLIFF with linked XSL/CSS? (in xliff 2.0 L10n kit/portfolio?)
    – Preview, link to published page?
  • Classifying elements in a standard way, semantics?
    – Types of articles/pages
    – Types of modules
    – Relations between constituents
  • Possibility of PM preprocessing for translation
    » CMS User profiles: localisation PM, localiser…
    – E.g. specific/global localisable links (href attribute)
    – Including various articles, entities, elements (e.g. flash, graphics, etc.) of a page in an XLIFF file/group element, marking which for translation, others translated/for context…
    – Generating html skeleton?

Future work

In-depth analysis of export/import tools in different CMSs and other Joomla! multilingual managers.
– Josetta, new Joomfish version

Extraction of contextual, preview information
• Links to published page containing translatable articles…

Analysis of object types & relationships in web CMSs + Accessibility needs
