XLIFF Localisation for Joomla! Translator-oriented localisation of CMS-based websites
Jesús Torres del Rey
Emilio Rodríguez Vázquez de Aldana
Faculty of Translation and Documentation
http://diarium.usal.es/codex
Agenda
Introduction – Motivation – Multilingual management & interchange
Our Research/Experiments – Analysis of other tools – Application Workflow – XLIFF 1.2, XML+its1.0 – Behaviour in CAT tools
Translation-Oriented L10n Future Work
Jesús Torres del Rey & Emilio Rodríguez Vázquez de Aldana 1
Motivation: chronology
2009: Request for translation of Faculty’s website (Joomla 1.5, multilingual Joomfish)
• HTML download > use of CAT > paste into Joomla HTML editor
2010-11: How to teach localisation of dynamic websites to our UG students? – Full localisation of static websites taught
• Filetypes and technologies (html, js, css, graphics…) • Super-, Macro-, Hyper-, Micro- structures • Directory structures, relative links… • Link/Web management (Ms Expression, Adobe DW…) • Automatisation via Search/Replace, regular expressions…
2012: Multilingual extensions for Joomla 2.5 – Falang (also for Joomla 3), Josetta, Joomfish, Jolomea
2013: Research with other CMSs (Drupal, WordPress, TYPO3)
Motivation: T&R philosophy
Translator/Localiser-oriented approach – Integration with CAT/Localisation tools – Empowerment through control of
• Processes, lifecycle – From request to publication, update, multilingualisation…
• Visual/Relational/Functional Context, Global meaning, Negotiation of communication needs
– Standardisation, XLIFF, ITS
– Acquisition of basic knowledge of the nature and mechanics of dynamic, CMS-based websites • (On top of nature and mechanics of static websites) • Filetypes, databases and technologies • Server–client infrastructure • Composition of dynamic active pages • Front-end, back-end, interface, content…
Motivation: (dis)empowerment
Static HTML-based websites: Full localisation | Dynamic CMS-based websites: Patchy translation
Visual and functional context | CMS partial webpage/separate translation environment
Use of functionality, quality tools (CAT/L) | Texts “locked” in DB -> export/import (for interchange, batch quality/analysis/term extraction processes)
Capable of multilingual re-structuring | Only if administrative rights for CMS and multilanguage module installed
Publication-ready deliverables | Only if write-access rights; partial, patchy publication
L10n, I18n, Multilanguage: evolution in CMSs
Specific (often third-party) modules to make multilingual websites easy to set up and manage. – Automatic duplication of structure/pages – Taking advantage of simplified CMS editing environments
At the same time, translatable data export/import modules to csv, po, xml and, increasingly, XLIFF
» Drupal XLIFF Tools, WordPress WPML, TYPO3 l10nmgr, Joomla JDiction (since early 2013)...
Combination of multilingual management and XLIFF et al. export/import
» WordPress WPML, Joomla JDiction (since early 2013)...
Our experiments: overview
Application: Falang2XLIFF (beta) » http://diarium.usal.es/codex/desarrollo
– Java Client • (compiled to 1.7) • Handy experimental tool with our limited resources • Not embedded into CMS as a module: access rights to DB?
– Uses Falang multilingual DB structure for Joomla • Potentially applicable to other DB structures, like Josetta, Joomfish…
– Main purpose: to experiment with data to be extracted, XLIFF and whole L10n process, and to use it for our UG L10n course for translator training
Other tools: • JDiction (XLIFF tool added since March) • For other CMSs: XLIFF Tools (Drupal)
Experiments: L10n objects
CMS objects for L10n:
– Editor/Administration interface • php, asp, or externalised to ini, po…
– Dependent, linked files (pdf, epub, graphics, video, audio…)
– Database elements • Article/page • Modules (e.g. calendar…) • Categories (e.g. for thematically grouping blog posts) • Smaller user interaction elements (weblinks, etc.)
Experiments: L10n objects
In database tables, L10n elements:
1. Structural/Interface text strings – menus, article titles, sections…
2. Longer (x)html article contents
3. Parameters for the above elements – metakey, metadesc, menu params…
– All in text fields in DB*
Experiments: other extraction strategies (JDiction)
Titles <![CDATA[ TEXT ]]>
HTML: TAG & TEXT <![CDATA[ TEXT ]]>
Parameters: state->translated! (Drupal: final status)
Experiments: other extraction>CAT (JDiction>Virtaal)
Tags are visually marked, probably via a regex such as <[^>]+/?>
However, the tags are unprotected
CAT tools could integrate a WYSIWYG HTML editor if XLIFF 1.2 datatype = "htmlbody"
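As a minimal sketch of the regex-based marking guessed at above, the following Python snippet splits a fragment into tag and text pieces using that same pattern. The function name and sample strings are hypothetical, and this is not Virtaal's actual implementation.

```python
import re

# Pattern quoted on the slide. The optional "/" is redundant, since
# [^>]+ already consumes it, but it is kept as shown.
TAG_PATTERN = re.compile(r"<[^>]+/?>")


def split_tags_and_text(fragment):
    """Return (kind, value) pairs, where kind is 'tag' or 'text'."""
    pieces = []
    position = 0
    for match in TAG_PATTERN.finditer(fragment):
        if match.start() > position:
            pieces.append(("text", fragment[position:match.start()]))
        pieces.append(("tag", match.group()))
        position = match.end()
    if position < len(fragment):
        pieces.append(("text", fragment[position:]))
    return pieces


sample = "<p>Using <strong>Joomla!</strong><br/></p>"
pieces = split_tags_and_text(sample)

# A ">" inside an attribute value cuts the tag short, which is one reason
# why regex-only marking is fragile.
broken = split_tags_and_text('<a title="a > b">x</a>')
```

Marking of this kind only highlights tags; nothing prevents a translator from editing or deleting them.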
Experiments: other extraction>CAT (JDiction>MemoQ 5)
Filters not always versatile enough
Segments should be shorter and regularly segmented for better matches and TM leverage
Experiments: Summarising JDiction
Multilingual management + export/import
– Some multilingual management problems: • Translation editor: – separate environment (not integrated in target-language page) – does not show original in parallel
– Some export/import problems: • Indiscriminate bulk export, irrespective of newness or update/translated state • CDATA export of (x)html content » No different from csv export » Whole article/item, without structure
– XHTML should be processed with XML processors, rather than with regular expressions
– HTML text should be carried to CAT tool not as plain text but as HTML tags and text (Drupal XLIFF Tools does)
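To illustrate the point about XML processors, here is a minimal Python sketch that walks a well-formed XHTML fragment with the standard-library parser, so tags, attributes and text runs arrive as structure rather than as one opaque string. The function name and sample fragment are hypothetical.

```python
import xml.etree.ElementTree as ET


def tags_and_text(fragment):
    """List tags and text runs of a well-formed fragment in document order."""
    events = []

    def walk(element):
        events.append(("open", element.tag, dict(element.attrib)))
        if element.text and element.text.strip():
            events.append(("text", element.text))
        for child in element:
            walk(child)
            # Text that follows a child element is stored as its "tail".
            if child.tail and child.tail.strip():
                events.append(("text", child.tail))
        events.append(("close", element.tag))

    walk(ET.fromstring(fragment))
    return events


# The ">" inside the title attribute is handled correctly by the parser.
sample = '<p>See <a title="a > b" href="#">this link</a> now</p>'
events = tags_and_text(sample)
text_runs = [event[1] for event in events if event[0] == "text"]
```

Because the parser knows where each element starts and ends, inline elements can later be mapped to protected placeholders instead of being exported inside CDATA.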
Experiments: Application Workflow (export)
[Workflow diagram: DB (Joomla! with Falang) > Falang2Xliff > XLIFF 1.2]
1. In Falang, element selection
2. DB Connection
3. DB Extraction (new & updated)
4. XML Generation > Simple XML (Temporary), XML+its1.0
5. XLIFF Generation, via xml2xliff.xsl of XliffRoundTrip Tool > XLIFF 1.2
Application: Workflow (1/6) 1. In Falang: Element Selection
1.1 Falang: PM with admin rights…
1.2 …selects elements one by one! and…
1.3 Copy Source!
Application: Workflow (2/6) 2. Database Connection
– Only standard TCP/IP connections to SQL server • Only in network security zone or localhost
– Joomla DB prefix needed
– Read-access permission for export
• Falang tables but also Original content tables, to check newness & update status
Application: Workflow (3/6) 3. Database Extraction (new & updated)
• New: Established as translatable by PM by using “Copy Source”
• Updated: translatable text whose source content has been edited (original content tables checked via MD5 hash)
– Info from attributes title, text, introtext, name, fulltext, description & content in tables categories, content, menu, modules and weblinks
• Parameters not extracted to prevent DB corruption.
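The new/updated distinction above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the record keys `value` and `original_md5` are hypothetical, and treating an unmodified "Copy Source" value as the marker for "new" is a guess at the mechanism, not the tool's confirmed logic.

```python
import hashlib


def md5_of(text):
    """Hex MD5 digest of a text field, encoded as UTF-8."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def classify(record, current_source):
    """Return 'updated', 'new' or 'translated' for one translatable field.

    record["original_md5"]: hash of the source when the translation
    record was created (hypothetical key).
    record["value"]: current content of the translation field
    (hypothetical key).
    """
    if md5_of(current_source) != record["original_md5"]:
        return "updated"      # source edited since the record was made
    if record["value"] == current_source:
        return "new"          # still the untouched "Copy Source" text
    return "translated"       # nothing to export


source = "<p>Welcome to the Faculty</p>"
new_record = {"value": source, "original_md5": md5_of(source)}
done_record = {"value": "<p>Bienvenidos</p>", "original_md5": md5_of(source)}
stale_record = {"value": "<p>Bienvenidos</p>", "original_md5": md5_of("<p>Old</p>")}

states = [classify(r, source) for r in (new_record, done_record, stale_record)]
```

Only records classified as new or updated would then be passed on to XML generation.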
Application: Workflow (3/6) 3. Database Extraction (new & updated):
– The Joomla! HTML editor typically rewrites HTML fragments as XHTML
– But are we certain that it is correct XHTML? • We have rechecked (Jericho HTML Parser) and rewritten data if necessary – XML entities, closing attribute quotes, checking and correcting node hierarchy » Some current limitations: e.g. unpaired <tag> <tag/>
– XHTML elements should be stored in DB as XMLElements
» ISO/IEC 9075-14:2011, Part 14: XML-Related Specifications (SQL/XML)
» XML support low in MySQL; high in PostgreSQL
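A minimal Python sketch of the kind of well-formedness check described above. It only detects problems with the standard-library XML parser; it does not repair them, and it is not the Jericho-based routine the slides refer to. The sample fragments are hypothetical.

```python
import xml.etree.ElementTree as ET


def check_fragment(fragment):
    """Return None if the fragment is well-formed XML, else the parser error."""
    try:
        # Wrap in a dummy root so fragments with several top-level
        # elements can still be parsed.
        ET.fromstring("<root>" + fragment + "</root>")
    except ET.ParseError as error:
        return str(error)
    return None


samples = {
    "escaped ampersand": "<p>Tom &amp; Jerry</p>",
    "raw ampersand": "<p>Tom & Jerry</p>",
    "unclosed img": '<p><img src="a.png"></p>',
    "missing quote": '<p><a href="x.html>link</a></p>',
}

for label, fragment in samples.items():
    problem = check_fragment(fragment)
    print(label, "->", "ok" if problem is None else problem)
```

Fragments that fail the check would need rewriting (entities, attribute quotes, node hierarchy) before they can be embedded as real XML nodes.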
Application: Workflow (4/6) 4. XML Generation
<value_falang>Usando Joomla! & …</value_falang>
<value_falang><p> <img …/>… </p></value_falang>
Application: Workflow (4/6) 4. XML Generation (temporary file to be converted to XLIFF)
<registros_falang>: Root
<registro_falang>: Attributes contain info for correct back import to DB
<value_falang>: Contains translatable content, text or XHTML (can include html elements)
Application: Workflow (4/6) 4. Generation of XML+its1.0
Global, Embedded ITS rules. Features: • Translate • Elements Within Text
W3C WG (2008): Best Practices for XML Internationalization. 5.1.4 Associating existing XHTML markup with ITS
ITS1.0 supports XPath 1.0 (which does not support regex)
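As a hedged illustration of global ITS 1.0 rules embedded in the temporary XML, the Python snippet below holds a small sample document and reads its rules back. The element names `registros_falang`, `registro_falang` and `value_falang` come from the slides; the `reference_id` attribute, the selectors and the sample content are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"

# Sample document with ITS 1.0 global rules embedded in the host XML.
# translateRule: make selected attributes translatable (attributes are
# not translatable by default in ITS 1.0).
# withinTextRule: declare inline elements so they do not split segments.
DOCUMENT = """<registros_falang xmlns:its="http://www.w3.org/2005/11/its">
  <its:rules version="1.0">
    <its:translateRule selector="//img/@alt | //a/@title" translate="yes"/>
    <its:withinTextRule selector="//strong | //em | //a | //span" withinText="yes"/>
  </its:rules>
  <registro_falang reference_id="7">
    <value_falang><p>Hello <strong>world</strong></p></value_falang>
  </registro_falang>
</registros_falang>"""

root = ET.fromstring(DOCUMENT)
rules = root.find("{%s}rules" % ITS_NS)

# Summarise each rule as (rule name, XPath selector).
summary = [(rule.tag.split("}")[1], rule.get("selector")) for rule in rules]
```

Since selectors are XPath 1.0 expressions, elements have to be listed by name or position; pattern matching on names is not available.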
Application: Workflow (5/6) 5. Generation of XLIFF 1.2
– Schnabel’s xml2xliff.xsl adapted so that the source language is a variable
Application: Workflow Generation of XML+its1.0 and XLIFF
Application: Workflow (import) From XLIFF/XML+its back to DB (import)
[Workflow diagram: XLIFF 1.2 > Falang2Xliff > DB (Joomla! with Falang)]
1. XML Generation, via xliff2xml.xsl (XliffRoundTrip): XLIFF 1.2 > XML + ITS 1.0, XML (Temporary)
2. SQL Generation > SQL > Optional online update of DB
• XLIFF encoding (UTF-8 without BOM)
• Translation states (e.g. “needs-translation”, etc.) not taken into account
• XML to SQL via XQuery processor (http://xmlbeans.apache.org/index.html)
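The slides describe SQL generation through an XQuery processor. The sketch below swaps in a simpler technique, plain Python with the standard-library XML parser, only to make the import step concrete. The table name `jos_falang_content`, the `id` attribute and the `%s` placeholder style (as used by common MySQL drivers) are assumptions.

```python
import codecs
import xml.etree.ElementTree as ET


def strip_bom(raw_bytes):
    """Drop a leading UTF-8 byte order mark, if present."""
    if raw_bytes.startswith(codecs.BOM_UTF8):
        return raw_bytes[len(codecs.BOM_UTF8):]
    return raw_bytes


def sql_updates(xml_bytes, table="jos_falang_content"):
    """Build (statement, parameters) pairs from the temporary XML."""
    root = ET.fromstring(strip_bom(xml_bytes))
    statements = []
    for registro in root.iter("registro_falang"):
        value = registro.find("value_falang")
        # Serialise the inner content (leading text plus child markup).
        inner = (value.text or "") + "".join(
            ET.tostring(child, encoding="unicode") for child in value
        )
        statement = "UPDATE " + table + " SET value = %s WHERE id = %s"
        statements.append((statement, (inner, registro.get("id"))))
    return statements


sample = codecs.BOM_UTF8 + (
    '<registros_falang><registro_falang id="7"><value_falang>'
    "<p>Hola <strong>mundo</strong>!</p>"
    "</value_falang></registro_falang></registros_falang>"
).encode("utf-8")

statements = sql_updates(sample)
```

Keeping the values as parameters rather than pasting them into the SQL string avoids quoting problems with apostrophes and markup in translated text.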
Application: XLIFF generated
xliffRoundTrip XSL – For regular XML structures – Limitation: • Attributes (translatable) – must be post-processed
XHTML source:
1 <p><img alt="…" … </p>
2 <p><span>… <a title="…">… </a>… </span></p>
3 <ul><li><span> … <strong>… </strong>…</span></li>
4 <li><span> … <strong>… </strong>… <em>…</em>…</span></li></ul>
XLIFF generated:
1 <trans-unit><x/>……</trans-unit>
2 <group><trans-unit>… <g id="">… </g>… </trans-unit></group>
<group>
3 <group><trans-unit> … <g id="">… </g>…</trans-unit></group>
4 <group><trans-unit> … <g id="">… </g>… <g id=""> … </g>… </trans-unit></group>
</group>
Tags: <group> (without text) <trans-unit> (with text) <g> </g>, <x/> (within text/inline)
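The tag mapping listed above can be sketched in Python as follows. This is a simplified stand-in for the XSL logic, not the xliffRoundTrip stylesheet itself: the rule used here (an element with direct text, or with only empty children, becomes a trans-unit; otherwise it becomes a group) is an assumption inferred from the examples, and translatable attributes such as alt are ignored, as in the limitation noted on the slide.

```python
import itertools
import xml.etree.ElementTree as ET


def has_direct_text(element):
    """True if the element itself (not a descendant) carries text."""
    if element.text and element.text.strip():
        return True
    return any(child.tail and child.tail.strip() for child in element)


def is_empty(element):
    """True for elements such as <img/> or <br/>."""
    return len(element) == 0 and not (element.text and element.text.strip())


def to_inline(element, target, ids):
    """Copy text into target, mapping child elements to <g> or <x/>."""
    target.text = element.text
    for child in element:
        if is_empty(child):
            inline = ET.SubElement(target, "x", {"id": str(next(ids))})
        else:
            inline = ET.SubElement(target, "g", {"id": str(next(ids))})
            to_inline(child, inline, ids)
        inline.tail = child.tail


def to_xliff(element, parent, ids):
    """Map one XHTML element to a <trans-unit> or a <group>."""
    if has_direct_text(element) or all(is_empty(c) for c in element):
        unit = ET.SubElement(parent, "trans-unit", {"id": str(next(ids))})
        to_inline(element, ET.SubElement(unit, "source"), ids)
    else:
        group = ET.SubElement(parent, "group")
        for child in element:
            to_xliff(child, group, ids)


sample = (
    '<div><p><img alt="logo" src="l.png"/></p>'
    "<ul><li><span>One <strong>bold</strong> item</span></li></ul></div>"
)
body = ET.Element("body")
to_xliff(ET.fromstring(sample), body, itertools.count(1))
print(ET.tostring(body, encoding="unicode"))
```

With this input, the paragraph holding only an image yields a trans-unit containing `<x/>`, while the list yields nested groups around a single trans-unit with one `<g>` pair.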
Experiments: XLIFF>CAT
Translation Units segmented at paragraph level
Experiments: XML+its1.0>CAT
Support for ITS in CAT?
• SDL Trados Studio: • Global & Embedded rules for features: • Translate • Elements Within Text
• Okapi Rainbow • (For Global:) Translate, Elements Within Text, LocNote
• XTM • (Linked File)
Experiments: html overtagging > XLIFF
Many reformatting actions (on the html editor) produce html overtagging
Before (segments 3 and 4):
<ul> <li><span> … <strong>… </strong>…</span></li> <li><span> … <strong>… </strong>… <em>… </em>… </span> </li> </ul>
After (segment 3 unchanged; previous segment 4 becomes 4, 5, 6, 7, 8):
<ul> <li><span> … <strong>… </strong>…</span></li> <li><span style=""> … </span><strong style="">… </strong><span style="">…</span>… <em style="">…</em><span style="">… </span> </li> </ul>
Therefore, one trans-unit for each <tag></tag> pair
Experiments: html overtagging > XLIFF > CAT
Html overtagging by CMS html editors produces oversegmentation when converting to XLIFF (following XSL’s logical segmentation strategy)
CMS editors’ Clean-html function seldom helps!
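The effect can be reproduced with a short Python sketch that counts trans-units before and after overtagging. The counting rule (an element with direct text is one unit, otherwise recurse into its children) is an assumed simplification of the XSL strategy, and both sample lists are hypothetical.

```python
import xml.etree.ElementTree as ET


def count_units(element):
    """Count trans-units under the assumed segmentation rule."""
    direct_text = bool(element.text and element.text.strip()) or any(
        child.tail and child.tail.strip() for child in element
    )
    if direct_text or len(element) == 0:
        return 1
    return sum(count_units(child) for child in element)


clean = (
    "<ul><li><span>First <strong>bold</strong> item</span></li>"
    "<li><span>Second <strong>bold</strong> and <em>emphasised</em> item</span></li></ul>"
)

# Same visible text, but the editor has closed and reopened <span>
# around every inline element in the second list item.
overtagged = (
    "<ul><li><span>First <strong>bold</strong> item</span></li>"
    '<li><span style="">Second </span><strong style="">bold</strong>'
    '<span style=""> and </span><em style="">emphasised</em>'
    '<span style=""> item</span></li></ul>'
)

clean_count = count_units(ET.fromstring(clean))
overtagged_count = count_units(ET.fromstring(overtagged))
```

The second list item goes from one unit to five, which fragments the sentence in the CAT tool and weakens TM matching.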
Experiments: html overtagging > ITS > XLIFF > CAT
Okapi Rainbow-generated XLIFF from XML+its 1.0
XML+its 1.0 converted to SDLXLIFF by CAT tool
Translator/Localiser needs
[Diagram labels: CMS; CAT/L; Translator/Localiser; Exchange; PM; Communication Structure; Agent/Doc/Kn Interaction; Global Meaning & Function; Intratextual relations; Purpose; Form, layout, expression; Quality, Consistency, Adherence to conventions, leverage, format, language/knowledge building]
Translator/Localiser needs
Meaningful, (dynamically) coherent whole that needs to attract, keep & direct attention
– Translation as just a matter of words, just a language problem?!
– Localisation/Translation as adaptation, communication, cultural/professional mediation
– Articles/Items are coherently, cohesively integrated in • General/Particular communicative/performative purpose • Sometimes bigger articles • Regions in the webpage, & relative positions • Hyperlink/Interaction relationships • Structure/sitemap relationships (internal and external – menus, etc.) • Potentially indexed search results • Type of article/element/module categories • Usability/Accessibility needs/alternatives
Translator/Localiser needs
CMS<xliff/its>CAT/L TOOLS
– Exported units must behave properly and efficiently in CAT/L tools • Segmentation • XHTML structure, function, meaning of tags
– Preview? Visual/functional contextualisation • Link to published webpage, highlighted translated elements • Zielinski & Beuster (memoQfest 2012): DB>html>CATpreview
– Control of new elements, updates, trans status, etc.
– Interchange (batch extraction, revision, etc.)
– Other • Possibility of placeable adaptation? – E.g. specific/global localisable links (href attribute)
Translator/Localiser desiderata
CMS4L10n – CMS managing content … taking translators/localisers/PMs into account
• Separating content from layout & function but showing interrelationships – XLIFF with linked XSL/CSS? (in XLIFF 2.0 L10n kit/portfolio?) – Preview, link to published page?
• Classifying elements in a standard way, semantics? – Types of articles/pages – Types of modules – Relations between constituents
• Possibility of PM preprocessing for translation » CMS user profiles: localisation PM, localiser… – E.g. specific/global localisable links (href attribute) – Including various articles, entities, elements (e.g. flash, graphics, etc.) of a page in an XLIFF file/group element, marking which are for translation, others translated/for context… – Generating html skeleton?
Future work
In-depth analysis of export/import tools in different CMSs and other Joomla! multilingual managers – Josetta, new Joomfish version
Extraction of contextual, preview information • Links to published page containing translatable articles…
Analysis of object types & relationships in web CMSs + Accessibility needs