

Exploring XML-based Technologies and Procedures for Quality Evaluation from a Real-life Case Perspective

Folkert de Vriend (1) & Giulio Maltese (2)

(1) Speech Processing Expertise Centre (SPEX), Radboud University Nijmegen, The Netherlands

(2) IBM Rome Solutions Lab and Voice Technology Development, Rome, Italy

Future work

XML-based technologies and procedures are a promising, and in principle preferable, alternative to validating XML with text stream processing procedures that are not truly XML-aware.

Drawbacks:

- Some of the technologies discussed are still in development or have an unclear status. XSLT 1.0 for instance does not provide a “random” function needed for sample-based validation while implementations fully supporting XSLT 2.0 – which does have a “random” function – are still scarce.

- Since a large variety of checks is performed in the validation of SLRs, one will probably have to “tie together” different XML-based technologies oneself. The ISO DSDL project should improve on this by offering a framework for the diversity of schema and transformation languages used for validation. However, DSDL is also still very much a “work in progress”.

If proven to be more efficient, XML-based technologies and procedures will be used in future projects SPEX is involved in.

Exploring XML-based technologies and procedures for validation

Current validation in LC-STAR

LC-STAR

LC-STAR develops the following resources for applications to be integrated into speech-driven interfaces embedded in mobile appliances and network servers:

- Bilingual corpora for Speech-to-Speech Translation applications.

- Lexica for automatic speech recognition and text-to-speech synthesis.

Example fragment of LC-STAR lexicon

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE LEXICA SYSTEM "NewLexica10.dtd">
<LEXICA xml:lang="it">
  <ENTRYGROUP orthography="Abatantuono">
    <ENTRY>
      <NOM class="PER" gender="invariant" number="invariant"/>
      <LEMMA>Abatantuono</LEMMA>
      <PHONETIC>a - b a - t a n - " t u O - n o</PHONETIC>
    </ENTRY>
  </ENTRYGROUP>
  <ENTRYGROUP orthography="Abbadia_Cerreto">
    <ENTRY> </ENTRY>
  </ENTRYGROUP>

- The only XML-based technology currently used is the Document Type Definition (DTD). But DTD has a weak datatyping system: it offers little control over element or attribute content.

- For most of the validation, text stream processing procedures (Perl scripts) were used, with checks on:

- Orthography

- Part of Speech (POS)

- Lemma

- Phonetic transcription

- Special software was written (in Perl) to select samples for manual validation of certain aspects of orthography, phonetic transcription and POS.
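The sample selection step can be sketched in a few lines. This is only an illustration in Python, not the Perl software used in LC-STAR: the function name `sample_entry_groups`, the fixed `seed` parameter and the miniature `LEXICON` string (the third entry is invented) are assumptions made for the example.

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical miniature lexicon modelled on the example fragment above.
LEXICON = """<LEXICA xml:lang="it">
  <ENTRYGROUP orthography="Abatantuono">
    <ENTRY><LEMMA>Abatantuono</LEMMA></ENTRY>
  </ENTRYGROUP>
  <ENTRYGROUP orthography="Abbadia_Cerreto">
    <ENTRY><LEMMA>Abbadia_Cerreto</LEMMA></ENTRY>
  </ENTRYGROUP>
  <ENTRYGROUP orthography="Abbiategrasso">
    <ENTRY><LEMMA>Abbiategrasso</LEMMA></ENTRY>
  </ENTRYGROUP>
</LEXICA>"""


def sample_entry_groups(xml_text, k, seed=None):
    """Return up to k randomly chosen ENTRYGROUP elements.

    Passing a seed makes the selection reproducible, so that producer and
    validator can refer to the same sample.
    """
    root = ET.fromstring(xml_text)
    groups = root.findall("ENTRYGROUP")
    rng = random.Random(seed)
    # Sampling without replacement; never ask for more items than exist.
    return rng.sample(groups, min(k, len(groups)))


if __name__ == "__main__":
    for group in sample_entry_groups(LEXICON, 2, seed=1):
        print(group.get("orthography"))
```

Because the parser hands back elements rather than lines of text, the sample is guaranteed to consist of complete entry groups.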

Alternatives for technologies and procedures currently used in LC-STAR

XML Schema

- Far more control over element and attribute content than DTD because of datatyping functionality.

- Datatypes can also be specified with regular expressions:

<xsd:simpleType name="orthography">
  <xsd:restriction base="xsd:token">
    <!-- A pattern must match the whole value, so "+" is needed to allow more than one character -->
    <xsd:pattern value="[a-zA-Z_]+"/>
  </xsd:restriction>
</xsd:simpleType>

- A basic Schema can be automatically generated from the DTD. The generated rules can then be made more stringent for validation purposes.
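To make concrete what such a pattern restriction checks, the following sketch applies an equivalent test to the orthography attribute outside a schema processor. It is a Python illustration only: the function name `invalid_orthographies` and the sample data are assumptions, and the expression assumes the intended rule is "one or more ASCII letters or underscores". `re.fullmatch` is used because an XML Schema pattern has to match the entire value.

```python
import re
import xml.etree.ElementTree as ET

# Assumed rule: one or more ASCII letters or underscores.
ORTHOGRAPHY = re.compile(r"[a-zA-Z_]+")

# Hypothetical sample; the last entry group is deliberately invalid.
SAMPLE = """<LEXICA>
  <ENTRYGROUP orthography="Abatantuono"/>
  <ENTRYGROUP orthography="Abbadia_Cerreto"/>
  <ENTRYGROUP orthography="Abb4dia"/>
</LEXICA>"""


def invalid_orthographies(xml_text):
    """Return the orthography values that do not match the pattern."""
    root = ET.fromstring(xml_text)
    values = (g.get("orthography", "") for g in root.iter("ENTRYGROUP"))
    # fullmatch mirrors the whole-value matching of an XML Schema pattern.
    return [v for v in values if ORTHOGRAPHY.fullmatch(v) is None]


if __name__ == "__main__":
    print(invalid_orthographies(SAMPLE))
```

In an actual schema-based workflow this check is performed by the validating parser itself; the sketch only shows the rule that the pattern encodes.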

XSL Transformations (XSLT)

- For taking samples out of the XML-encoded data and marking them up directly in HTML for online manual validation.

- Regular expressions are supported by XSLT 2.0.
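The kind of transformation meant here, from lexicon entries to an HTML page for manual inspection, can be illustrated as follows. This is a Python stand-in rather than an XSLT stylesheet, and the function name `entries_to_html`, the table layout and the sample data are assumptions made for the example.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical one-entry lexicon modelled on the example fragment.
SAMPLE = """<LEXICA>
  <ENTRYGROUP orthography="Abatantuono">
    <ENTRY>
      <LEMMA>Abatantuono</LEMMA>
      <PHONETIC>a - b a - t a n - " t u O - n o</PHONETIC>
    </ENTRY>
  </ENTRYGROUP>
</LEXICA>"""


def entries_to_html(xml_text):
    """Render every ENTRY as one row of an HTML table."""
    root = ET.fromstring(xml_text)
    rows = ["<tr><th>Orthography</th><th>Lemma</th><th>Phonetic</th></tr>"]
    for group in root.iter("ENTRYGROUP"):
        for entry in group.iter("ENTRY"):
            cells = (
                group.get("orthography", ""),
                entry.findtext("LEMMA", default=""),
                entry.findtext("PHONETIC", default=""),
            )
            # Escape the content so that characters such as the stress
            # mark (") cannot break the generated markup.
            tds = "".join(f"<td>{html.escape(c)}</td>" for c in cells)
            rows.append(f"<tr>{tds}</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"


if __name__ == "__main__":
    print(entries_to_html(SAMPLE))
```

An XSLT stylesheet would express the same mapping declaratively, with one template per element.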

New possibilities for validation

Easy character set validation

- Part seven of the “Document Schema Definition Languages” (DSDL) framework aims at standardising a schema language specifically for checking that element and attribute content belongs to a specific subset of Unicode (Cyrillic or ISO 8859-1 for instance). “Character Repertoire Validation for XML” (CRVX) is a proposal for part seven:

<context path="ENTRY/LEMMA">
  <restrict charrep="\p{IsCyrillic}"/>
</context>

- CRVX uses XSLT 2.0 for implementation.
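The idea behind a character repertoire check can be sketched independently of CRVX. The Python below is an illustration, not an implementation of CRVX or of DSDL part seven: the helper names `outside_repertoire`, `is_latin1` and `is_cyrillic` are assumptions, the Cyrillic test covers only the basic Unicode Cyrillic block (U+0400 to U+04FF), and the sample data is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with one Latin and one Cyrillic lemma.
SAMPLE = """<LEXICA>
  <ENTRYGROUP orthography="Abatantuono">
    <ENTRY><LEMMA>Abatantuono</LEMMA></ENTRY>
  </ENTRYGROUP>
  <ENTRYGROUP orthography="Moskva">
    <ENTRY><LEMMA>Москва</LEMMA></ENTRY>
  </ENTRYGROUP>
</LEXICA>"""


def is_latin1(ch):
    """True for code points U+0000 to U+00FF, the range of ISO 8859-1."""
    return ord(ch) <= 0xFF


def is_cyrillic(ch):
    """True for characters in the basic Unicode Cyrillic block."""
    return 0x0400 <= ord(ch) <= 0x04FF


def outside_repertoire(xml_text, path, allowed):
    """Return (text, offending characters) for each element selected by
    path whose content contains characters rejected by allowed."""
    root = ET.fromstring(xml_text)
    problems = []
    for elem in root.findall(path):
        text = elem.text or ""
        bad = sorted({ch for ch in text if not allowed(ch)})
        if bad:
            problems.append((text, bad))
    return problems


if __name__ == "__main__":
    for text, bad in outside_repertoire(SAMPLE, ".//ENTRY/LEMMA", is_latin1):
        print(text, bad)
```

The `path` argument plays the role of the `context` element in the CRVX fragment, and the `allowed` predicate that of the `restrict` element.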

Remote validation

Production/validation cycles can be supplemented with an XML Schema that is available on-line. Producers can use it for continuous self-monitoring, while the validation centre remains responsible for maintaining it.

Introduction

The project “Lexica and Corpora for Speech-to-Speech Translation Components” (LC-STAR) is an example of a project that uses Extensible Markup Language (XML) to encode its Spoken Language Resources (SLR). The Speech Processing Expertise Centre (SPEX) is responsible for part of the quality evaluation (hereafter “validation”) in LC-STAR.

Text stream processing procedures that are not truly XML-aware are not ideal for efficient validation of XML-encoded resources. Therefore SPEX explores XML-based validation technologies and procedures. This is done using the XML-encoded phonetic lexica developed in the LC-STAR project as a test bed.

[Figure: diagram with the labels Production, Validation and Schema monitoring, distinguishing off-line and on-line steps.]