Digital Information and Heritage INFuture Zagreb, 7-9.11.2007. Sentence Alignment as the Basis For...
-
Upload
brianna-fisher -
Category
Documents
-
view
214 -
download
0
Transcript of Digital Information and Heritage INFuture Zagreb, 7-9.11.2007. Sentence Alignment as the Basis For...
INFutureZagreb, 7-9.11.2007.
Future 2007The Future of Information Sciences
7-9 November 2007Zagreb, Croatia
IN
Digital Information and Heritage
Sentence Alignment as the Basis For Translation Memory Database
Sanja SeljanFaculty of Humanities and Social Sciences – University of Zagreb
Department of Information [email protected]
Angelina GašparSOA Centre Split
Damir PavunaIntegra d.o.o.
INFutureZagreb, 7-9.11.2007.
Future 2007The Future of Information Sciences
7-9 November 2007Zagreb, Croatia
IN
Digital Information and Heritage
Overview
I Introduction II When to use TMs?
Text preparationIII Corpus used
Text characteristicsIV Research
Tools usedAutomatic and manual alignementComparison of TMsResults
V Conclusion
INFutureZagreb, 7-9.11.2007.
Future 2007The Future of Information Sciences
7-9 November 2007Zagreb, Croatia
IN
Digital Information and Heritage
Sentence alignment (SA)
• basis for computer-assisted translation (CAT)• terminology management• term extraction• word alignment • cross-linguistic information retrieval
Sentence alignment (SA) -> translation memory (TM)basis for further research
in translation equivalencies
INFutureZagreb, 7-9.11.2007.
Future 2007The Future of Information Sciences
7-9 November 2007Zagreb, Croatia
IN
Digital Information and Heritage
Problems in automatic SA:
• robustness
• discrepancies in layout and omissions
• -> influence on accuracy and TM
INFutureZagreb, 7-9.11.2007.
Future 2007The Future of Information Sciences
7-9 November 2007Zagreb, Croatia
IN
Digital Information and Heritage
Research:
• SA on Cro-Eng parallel texts (laws, regulations, acts, decisions)
• alignment tool WinAlign 7.5.0 by SDL Trados 2006 Professional
INFutureZagreb, 7-9.11.2007.
Future 2007The Future of Information Sciences
7-9 November 2007Zagreb, Croatia
IN
Digital Information and HeritageAim:
• impact of SA process on the creation of TM
• comparison of 3 types of TMs
• Differences:
– in levels of expert intervention in set up of the alignment program
– in preparation of the source text for the segmentation
INFutureZagreb, 7-9.11.2007.
Future 2007The Future of Information Sciences
7-9 November 2007Zagreb, Croatia
IN
Digital Information and Heritage
II When to use TMs?
• Fast and consistent translation (e.g. EU, multinational agencies)
• Voluminous texts • Highly repetitive types of texts • Use of specialized and consistent terminology• Several languages
• Sharing of common resources (cooperation)• Time-saving (Speed up the translation process)• Cost-saving• Consistent translation
INFutureZagreb, 7-9.11.2007.
Future 2007The Future of Information Sciences
7-9 November 2007Zagreb, Croatia
IN
Digital Information and Heritage
Directly through Directly through translationtranslation
Use of Use of already already translated materialtranslated material (alignment process)(alignment process)
Creation of TM
INFutureZagreb, 7-9.11.2007.
Future 2007The Future of Information Sciences
7-9 November 2007Zagreb, Croatia
IN
Digital Information and Heritage
III Corpus used
• 9 parallel legislative Croatian-English texts or bitexts related to: acts, laws, regulations, decisions and ordinances;
• The sake of uniformity: standard presentation and standard formulas;
• 33.15% - percentage ratio for word count in English translations;
INFutureZagreb, 7-9.11.2007.
Future 2007The Future of Information Sciences
7-9 November 2007Zagreb, Croatia
IN
Digital Information and Heritage
• Reasons:– English-an analytic type of language, use of
passive voice,– Croatian - a highly flective system, use of
active voice,
• Repetitive legal terms, phrases, sentences
• A regulation main components: the title, preamble, enacting terms, addresee, place, date and signature.
INFutureZagreb, 7-9.11.2007.
Future 2007The Future of Information Sciences
7-9 November 2007Zagreb, Croatia
IN
Digital Information and Heritage
• Enacting terms - strict rules of presentation: - subject matter and scope, - definitions, - provisions conferring implementing power, - penalties or legal remedies, - transitional and final provisions.
• Standard form prescribes the layout on the page: spacing, paragraphing, punctuation and even typographic characteristics (capitalisation, typeface, boldface and italics)
INFutureZagreb, 7-9.11.2007.
Future 2007The Future of Information Sciences
7-9 November 2007Zagreb, Croatia
IN
Digital Information and Heritage
• Use of verbs in enacting terms- Binding Croatian legislation:
- declarative terms (definitions, amendments) - and imperative terms (commands, prohibitions)
- English “shall”= Croatian present tense, modals (morati , trebati)
- English “may” for prohibition, permission and authorisation = Croatian present tense (“ne može se”, “može se”).
INFutureZagreb, 7-9.11.2007.
Future 2007The Future of Information Sciences
7-9 November 2007Zagreb, Croatia
IN
Digital Information and Heritage
• Bitexts similarities : – punctuation, numbers, dates, foreign words;
• Differences: – capital letters, hyphens, compound words,
synonyms (avoided in target language);
• Common points: – consistent terminology, a uniform manner,
gender-neutral language;
INFutureZagreb, 7-9.11.2007.
Future 2007The Future of Information Sciences
7-9 November 2007Zagreb, Croatia
IN
Digital Information and Heritage
IV Alignment research
• Texts:– Croatian legislative acts translations Cr->En;
• Tools: – AnyCount 4.0 (version 405) – for document
structure analysis – SDL Trados 2006 Professional (WinAlign
7.5.0.) – for alignment process;
INFutureZagreb, 7-9.11.2007.
Future 2007The Future of Information Sciences
7-9 November 2007Zagreb, Croatia
IN
Digital Information and Heritage
Alignment research
• PREPARATORY ACTIVITIES:– comparison of the source and target texts (whether all
text is translated)– defining set up of end and skip rules (delimiters,
creating abbreviation user list)– preparation of the source text for better segmentation
(spelling, automatic bullets and numbering, deleting of soft returns, hyphens, certain punctuation, tables created with tabs and revision marks)
– modification of set up rules– verification of the alignment (especially 1:2 and 2:1
pairs and commitment of pairs) – creation of translation memory and verification
INFutureZagreb, 7-9.11.2007.
Future 2007The Future of Information Sciences
7-9 November 2007Zagreb, Croatia
IN
Digital Information and Heritage
Alignment research
• Automatic alignment
WinAlign has language independent algorithms that count:– the quality of translation units which can
have tree levels (low, medium, high)– translation units aligning 1:2 or 2:1 pairs– unconnected target segments
INFutureZagreb, 7-9.11.2007.
Future 2007The Future of Information Sciences
7-9 November 2007Zagreb, Croatia
IN
Digital Information and Heritage
Alignment research
• Manual alignment– source text corresponds to translated target
segment (Aligned TM)– set up of the alignment program (Aligned TM
+ set up rules, e.g. segment and skip rules, abbreviation user list)
– segmentation of the source text (e.g. changes of soft returns, check of colon segmentation)
INFutureZagreb, 7-9.11.2007.
Future 2007The Future of Information Sciences
7-9 November 2007Zagreb, Croatia
IN
Digital Information and Heritage
Alignment research
INFutureZagreb, 7-9.11.2007.
Future 2007The Future of Information Sciences
7-9 November 2007Zagreb, Croatia
IN
Digital Information and Heritage
Alignment research
Raw TM Aligned TM + Setup rules
++ Segmented source
100% 121 106 112 120
95%-99% 0 0 0 0
85%-94% 2 5 0 0
75%-84% 2 2 1 0
50%-74% 1 1 2 0
No match 6 18 11 0
Total 132 132 126 120
Percent 91.67% 80.30% 88.89% 100%
INFutureZagreb, 7-9.11.2007.
Future 2007The Future of Information Sciences
7-9 November 2007Zagreb, Croatia
IN
Digital Information and Heritage
Alignment research
• Conclusion– The translation memories created in this study
out of different types of the alignment processes give different results regarding the quality of the translated material.
– The results show necessary interventions of an expert when defining the set up rules, in preparation activities for the source text segmentation and in the verification of suggested translation units.