Terminology management made easier —
a TBX-compliant terminology repository for a translation agency
Dave Calvert, TransForm Gesellschaft für Sprachen- und Mediendienste mbH
Wolf-Dietrich von Loeffelholz

Who’s who
• TransForm GmbH
  • Established 1994
  • Specializes in corporate image and science and technology
  • EN 15038 certified
• Wolf Dietrich von Loeffelholz
  • Freelance software development and maintenance

Problem
• Terminological data
  • In-house legacy format: MultiTerm 5.5
  • In-house legacy format: intranet database
  • Freelancers' preferred format: Wordfast
  • Customers' terminology
    • All imaginable formats and conditions
• Restricted interoperability
• Need to run concurrent, incompatible systems
• Pressure to upgrade to extremely expensive server-based solutions

Concept
• Application to store and maintain terminological data
• Future-proof data format
• Import and export of the file formats currently in use at TransForm
• Define other import/export formats without the need for substantial programming
• Access via existing intranet
→ Web-based terminology repository with non-proprietary data format

Why TBX?
• Substantial advantages to the user
  • Open standard: effectively future-proof
  • Open standard: pressure on tool vendors to support the format
  • Clearly defined
  • XML, so relatively easy to work with
  • Available for use without licensing fees

TBX-Basic
• TBX for small and medium-sized language industry applications
  • Full TBX offers more than most LSP applications need
  • TBX-Basic: lightweight version of TBX
  • Developed by the LISA Terminology Special Interest Group
  • Specifically aimed at small and medium-sized language industry applications
  • Fully compliant with TBX
  • Restricted subset of TBX features

Our answer
• Store terminological data as TBX
  • Ensures future compatibility
  • Use of a standard will boost the quality of the data in the medium term
  • TBX capability ensures TBX-Basic capability
  • Future changes to the terminological markup language (TML) are possible within the constraints of TBX

Now a mapping problem
• Mapping legacy database terminological data formats to TBX
  • TBX has a three-level concept structure
    • Concept
    • Language
    • Term
  • Information on all levels is constrained in terms of what may and what must be stored
  • Both explicit and implicit information must be handled

Implicit terminological information
• Glossary stored as:
  • M:\Customers\LN\Leistungselektronik_2
• Contains entry:
  • Regelkreis    control loop    en.wikipedia.org
• Source term, target term and the target term's source are explicitly recorded
• Customer and project must be derived from path and filename
• Languages are implied
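
A minimal sketch of how the implicit fields might be recovered from the glossary location shown above. It assumes the folder directly above the file names the customer and the filename names the project; the function name `implicit_metadata` and the language codes `de`/`en` are illustrative, not part of the original system.

```python
from pathlib import PureWindowsPath


def implicit_metadata(glossary_path, source_lang, target_lang):
    """Collect the fields a glossary file does not state explicitly.

    Assumes the layout <drive>/Customers/<customer>/<project file>.
    Languages cannot be read from the path, so the caller supplies them.
    """
    path = PureWindowsPath(glossary_path)
    if path.parent.parent.name != "Customers":
        raise ValueError(f"unexpected glossary location: {glossary_path}")
    return {
        "customer": path.parent.name,  # folder directly above the file
        "project": path.stem,          # filename without any extension
        "source_lang": source_lang,    # implied, so passed in by hand
        "target_lang": target_lang,
    }


meta = implicit_metadata(r"M:\Customers\LN\Leistungselektronik_2", "de", "en")
print(meta["customer"], meta["project"])  # prints: LN Leistungselektronik_2
```

Raising an error for an unexpected location is one possible design choice; a real import would more likely ask the user to enter the missing values.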

Handling implicit information
• Input templates must define the implicit information to be captured
  • Wordfast requires more data to be entered at import time
  • Intranet database records permit lookup of information

TBX-Basic Structure
• Three levels
  • Concept “termEntry”
    • Subject
    • Definition and its source
    • Cross-reference and/or image
  • Language “langSet”
    • Definition and its source
  • Term “tig”
    • Term notes, linguistic usage labels
    • Context and its source
    • Term source, administrative usage labels
  • Any level
    • Administrative / transactional information
    • Notes
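
The three levels can be illustrated with a short sketch that builds one entry using Python's standard `xml.etree.ElementTree`. The element names `termEntry`, `langSet` and `tig` come from the slide; the `descrip type="subjectField"` and `admin type="source"` spellings follow my reading of TBX-Basic and should be checked against the specification. The helper `build_term_entry`, the id `c1` and the subject value are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree spells the xml:lang attribute with the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_term_entry(entry_id, subject, terms):
    """Build one concept; terms maps language code -> (term, source or None)."""
    entry = ET.Element("termEntry", {"id": entry_id})            # concept level
    ET.SubElement(entry, "descrip", {"type": "subjectField"}).text = subject
    for lang, (term, source) in terms.items():
        lang_set = ET.SubElement(entry, "langSet", {XML_LANG: lang})  # language level
        tig = ET.SubElement(lang_set, "tig")                     # term level
        ET.SubElement(tig, "term").text = term
        if source:
            ET.SubElement(tig, "admin", {"type": "source"}).text = source
    return entry


entry = build_term_entry(
    "c1",
    "power electronics",  # hypothetical subject field value
    {"de": ("Regelkreis", None), "en": ("control loop", "en.wikipedia.org")},
)
print(ET.tostring(entry, encoding="unicode"))
```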

Compliance Issues
• Structural or syntactic compliance
  • Check using a validation program, e.g. tbxcheck: http://sourceforge.net/projects/tbxutil/
• Content compliance
  • Can depend on the purpose of the data
    • Machine processing requires Part of speech (TBX-Basic)
    • Human use does not if either a Definition or a Context is provided (TBX-Basic)
  • TransForm data was collected without consideration of these issues
  • Full compliance with TBX-Basic is only possible for new data
  • Careful use of implicit information will help to mitigate these issues
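
The content rule above (part of speech for machine processing; otherwise a definition or a context is enough) could be checked roughly as follows. This is a sketch, not a replacement for a validator such as tbxcheck: the function `content_issues` is hypothetical, the `type` attribute values are my reading of TBX-Basic, and the German context sentence is invented for the example.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<termEntry id="c1">
  <langSet xml:lang="de">
    <tig>
      <term>Regelkreis</term>
      <descrip type="context">Der Regelkreis wird geschlossen.</descrip>
    </tig>
  </langSet>
  <langSet xml:lang="en">
    <tig><term>control loop</term></tig>
  </langSet>
</termEntry>
"""


def content_issues(term_entry, machine_processing=False):
    """Return one message per term that fails the content rule."""
    issues = []
    # A definition at concept level covers every term in the entry.
    entry_def = (
        term_entry.find("descrip[@type='definition']") is not None
        or term_entry.find("descripGrp/descrip[@type='definition']") is not None
    )
    for lang_set in term_entry.iter("langSet"):
        has_def = entry_def or lang_set.find(".//descrip[@type='definition']") is not None
        for tig in lang_set.iter("tig"):
            term = tig.findtext("term", default="?")
            has_pos = tig.find("termNote[@type='partOfSpeech']") is not None
            has_ctx = tig.find(".//descrip[@type='context']") is not None
            if machine_processing:
                if not has_pos:
                    issues.append(f"{term}: part of speech missing")
            elif not (has_pos or has_def or has_ctx):
                issues.append(f"{term}: needs part of speech, definition or context")
    return issues


entry = ET.fromstring(SAMPLE.strip())
human_issues = content_issues(entry)
machine_issues = content_issues(entry, machine_processing=True)
```

With the sample entry, only "control loop" fails the human-use rule, while both terms fail the stricter machine-processing rule.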

What we intend to do with it
• Import existing terminological data from:
  • MultiTerm 5.5 databases
  • Wordfast glossaries
  • Intranet-based system
  • Customers
• Maintain existing data
• Replace existing terminology collection back end
  • Terminology captured directly in TBX format
• Export project-specific and customer-specific terminology in the form of:
  • Dictionaries
  • Glossaries
  • Databases

MultiTerm data format
• Last file-based version of MultiTerm
• Flexible concept-oriented system
  • Index fields: defined as languages and contain terms
  • System fields
  • Attribute and text fields
• Order, number and relationships of attribute and text fields are not constrained

Wordfast glossaries
• Tab-delimited text glossaries
• Simple
• Open
• User-definable fields
• Fine for the translator

Wordfast glossaries: TransForm
• Source term
• Target term
• Note
• Term source
• Context sentence
• Context source
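
A tab-delimited glossary with these six fields can be read with Python's standard `csv` module. The sketch assumes the columns appear in exactly the order listed and that trailing empty columns may be missing; the field keys and the function `read_glossary` are illustrative names.

```python
import csv
import io

# Column order as listed on the slide (assumed to match the file layout).
FIELDS = ["source_term", "target_term", "note",
          "term_source", "context", "context_source"]


def read_glossary(text):
    """Parse tab-delimited glossary text into one dict per entry."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for cells in reader:
        if not cells or not cells[0].strip():
            continue  # skip blank lines and rows without a source term
        # Pad short rows so every entry has all six fields.
        cells = cells[:len(FIELDS)] + [""] * (len(FIELDS) - len(cells))
        rows.append(dict(zip(FIELDS, (c.strip() for c in cells))))
    return rows


rows = read_glossary("Regelkreis\tcontrol loop\t\ten.wikipedia.org\n")
print(rows[0]["term_source"])  # prints: en.wikipedia.org
```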

Intranet terminology capture
• Term entry screen
  • Simple term entry structure
  • To be expanded by the addition of a context and its source

How it works: customers’ data
• Excel data → convert to tab-delimited text → tab-delimited text (Wordfast style) → import in similar way to Wordfast glossary → TBX

Implementation
• Wolf Dietrich von Loeffelholz
  • Freelance software development and maintenance

System
• Web server with PHP 5 and Java
• Database to store metadata for search and management purposes
• Flex application to help with management of terminological data
• TBX storage on the file system, with back matter

System: PHP 5
• PDO as database abstraction layer
  • MySQL, Oracle, MS SQL, PostgreSQL, etc.
• DOM as document object model
  • Handles all XML processing
• PEAR HTML template engine

Converter
• The import function needs XML data as input
  • Conversion to UTF-8 to conform to XML requirements
  • MultiTerm glossary (5.5 and earlier)
    • Tagged format
  • Wordfast glossary or any tab-delimited format
    • Definition of language combination and attribute fields
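
The UTF-8 step might look like the sketch below. It assumes the legacy files are in Windows-1252, which the slides do not state, and additionally drops characters that XML 1.0 does not allow. The function name `to_xml_safe_utf8` is hypothetical.

```python
import re

# Everything outside the XML 1.0 "Char" production is illegal in a document.
_XML_ILLEGAL = re.compile(
    "[^\t\n\r\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def to_xml_safe_utf8(raw, legacy_encoding="cp1252"):
    """Decode legacy bytes, remove XML-illegal characters, re-encode as UTF-8.

    legacy_encoding is an assumption; use whatever the source tool wrote.
    """
    text = raw.decode(legacy_encoding)
    text = _XML_ILLEGAL.sub("", text)
    return text.encode("utf-8")


# "Größe" in Windows-1252, followed by a vertical tab that XML 1.0 forbids.
converted = to_xml_safe_utf8(b"Gr\xf6\xdfe\x0b")
print(converted.decode("utf-8"))  # prints: Größe
```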

Import function
• Import of XML data of unknown format
• Mapping filters
  • XML import filter
  • TBX mapping filter
  • TBX template

Import function: categories
• Definable based on XML import filter
  • Required information
    • Term and language
  • Expected information
    • Admin, user and date information
  • Optional information
• Error logging
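
One way to picture the three categories is a filter that maps each field to required, expected or optional and logs what is missing. This is a sketch under assumptions: the field names, the `IMPORT_FILTER` dictionary and `check_record` are invented for illustration and do not describe the actual filter format.

```python
import logging

log = logging.getLogger("tbx_import")

REQUIRED, EXPECTED, OPTIONAL = "required", "expected", "optional"

# Hypothetical import filter: field name -> category.
IMPORT_FILTER = {
    "term": REQUIRED,
    "language": REQUIRED,
    "created_by": EXPECTED,
    "created_on": EXPECTED,
    "note": OPTIONAL,
}


def check_record(record, import_filter=IMPORT_FILTER):
    """Return (errors, warnings) for one record and log both."""
    errors, warnings = [], []
    for field, category in import_filter.items():
        if record.get(field):
            continue  # field is present and non-empty
        if category == REQUIRED:
            errors.append(f"missing required field: {field}")
        elif category == EXPECTED:
            warnings.append(f"missing expected field: {field}")
        # Missing optional fields are silently accepted.
    for message in errors:
        log.error(message)
    for message in warnings:
        log.warning(message)
    return errors, warnings


errors, warnings = check_record({"term": "Regelkreis", "language": "de"})
```

A record with errors would be rejected, while one with only warnings could still be imported and flagged for later completion.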

Import function: import
• Import filter
• Import file
• Concept grouping
• TBX header information
• TBX back matter information
• Fully automated import
• Step-by-step import with user validation

Administrative tools
• Grouping of terms
  • Fixed grouping during import: so-called term id
  • User-defined grouping into concept
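
The two kinds of grouping can be sketched as follows: rows sharing a term id are grouped automatically at import, and a user can later merge two groups into one concept. The row layout, the ids and the extra term "feedback loop" are assumptions made for the example.

```python
from collections import defaultdict


def group_by_term_id(rows):
    """Fixed grouping at import: rows with the same term id form one concept."""
    concepts = defaultdict(list)
    for row in rows:
        concepts[row["term_id"]].append((row["lang"], row["term"]))
    return dict(concepts)


def merge_concepts(concepts, keep_id, absorb_id):
    """User-defined grouping: move all terms of absorb_id into keep_id."""
    concepts[keep_id].extend(concepts.pop(absorb_id))
    return concepts


rows = [
    {"term_id": "17", "lang": "de", "term": "Regelkreis"},
    {"term_id": "17", "lang": "en", "term": "control loop"},
    {"term_id": "18", "lang": "en", "term": "feedback loop"},  # hypothetical
]
concepts = group_by_term_id(rows)
merged = merge_concepts(dict(concepts), "17", "18")
```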

Administrative tools
• RIA to help with management of concepts and terms
• Access to stored terminological data using search masks
• One-click copying of concepts and all associated terminology into an export group

Export function
• Export of export groups
  • TBX
  • TMX
  • Any mapped XML format
  • MultiTerm 5.5 tagged
  • Wordfast
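
As an illustration of the simplest export target, the sketch below writes an export group as a two-column, tab-delimited glossary. It assumes an export group is a list of `termEntry` elements and takes the first term of each language; the function `export_wordfast` is hypothetical.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """
<termEntry id="c1">
  <langSet xml:lang="de"><tig><term>Regelkreis</term></tig></langSet>
  <langSet xml:lang="en"><tig><term>control loop</term></tig></langSet>
</termEntry>
"""


def export_wordfast(entries, source_lang, target_lang):
    """Return tab-delimited 'source<TAB>target' lines for the given entries."""
    lines = []
    for entry in entries:
        # First term per language; entries lacking either language are skipped.
        terms = {
            lang_set.get(XML_LANG): lang_set.findtext("tig/term")
            for lang_set in entry.iter("langSet")
        }
        if terms.get(source_lang) and terms.get(target_lang):
            lines.append(f"{terms[source_lang]}\t{terms[target_lang]}\n")
    return "".join(lines)


export_group = [ET.fromstring(SAMPLE.strip())]
glossary = export_wordfast(export_group, "de", "en")
```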

Features
• LDAP authentication
• DAV
  • Quick import of TBX files
  • Quick access to TBX files
• RSS feed
  • Subscribe to and observe inserts, updates and deletes at
    • User level
    • Concept, term and/or comment level

Current status
• Working title: TBX-Transform
• HTTPS site with self-signed certificate
• In beta release
• Full integration into intranet system in progress
• External beta testers by year-end

Immediate objectives
• Fitness for productive use
  • Including work on user interface
• Complete integration into intranet system
• Start migrating existing data

Future strategies
• Import existing data, convert it and store it as TBX
  • Step-by-step validation where necessary
  • Consolidation where appropriate
• Future-proofing
  • Additional export forms for glossaries and dictionaries