Making Multi-Structured Documents
[email protected] - http://liris.cnrs.fr/~peportie
[email protected] - http://liris.cnrs.fr/~scalabre
Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS), UMR 5205 CNRS / INSA de Lyon / Université Claude Bernard Lyon 1 / Université Lumière Lyon 2 / Ecole Centrale de Lyon
Université Claude Bernard Lyon 1, bâtiment Nautibus, 43, boulevard du 11 novembre 1918, F-69622 Villeurbanne cedex
http://liris.cnrs.fr
Lyon - 25/11/2008
Multi-structured documents
Modeling and creation
The MSD Problem
Several specific uses ⇒ several structure types
e.g. physical, logical, semantic, poetic, linguistic
A recurrent problem in the Digital Humanities
TEI recommendations and overlapping hierarchies
Example queries:
Find all damaged words that contain damaged characters only.
Indicate, for each word containing restored characters, the location of the corresponding line.
Medieval Manuscript (1)
Transcription of old manuscripts
Medieval Manuscript (2)
Physical structure
Medieval Manuscript (3)
Lexical structure
Medieval Manuscript (4)
Damaged characters structure
Medieval Manuscript (5)
Image regions structure
Medieval Manuscript (6)
Relations between structures
[Figure: the physical, lexical, damaged characters and text regions structures, linked by the relations "transcription", "lines localization", "broken words localization" and "damaged characters localization".]
A multi-structured document is a document having multiple structures linked together through shared content or other inter-structural relations.
Modern Manuscript (1)
Modern manuscript of J.T. Desanti
Modern Manuscript (2)
Physical structure: lines
Modern Manuscript (3)
Idiomatic structure
Modern Manuscript (4)
Alterations structure
Existing works (1)
(Too) specific “models”
[Radar chart, scale 0 to 3. Axes: model expressivity, model genericity, implementation, usability of XML tools, query mechanisms, structures and data changes.]
Approaches compared:
TEI Guidelines: redundant encoding
TEI Guidelines: empty elements
TEI Guidelines: virtual elements
TEI Guidelines: stand-off markup
CONCUR
MuLaX
MECS / TexMECS
LMNL
MonetDB
Existing works (2)
Generic models
[Radar chart, scale 0 to 3. Axes: model expressivity, model genericity, implementation, usability of XML tools, query mechanisms, structures and data changes.]
Approaches compared:
Delay Nodes
Annotation Graphs
RDF (RDFTEF)
MCT
MSXD
GODDAG
MSDM / MultiX
Multi-Structured Document Model
MSDM
MSDM (2)
Relations between structures
[Figure: the physical, lexical, damaged characters and text region structures, together with a base structure, linked by the relations "transcription", "localization of lines", "localization of broken words" and "localization of damaged characters".]
Stand-off markup
MultiX; XInclude; etc.
MultiX (1)
Base Structure
MultiX (2)
Composition for a line of the physical structure
<msd:comp id="C1" idrefs="F1 F2 F3=F4 F5 F6 F7" />
<line n="1"><msd:clink target="BS" label="text content" to="C1"/></line>
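The slide shows only the composition node and the line that links to it. As a rough illustration of how such a link could be followed back to text, here is a minimal Python sketch. It assumes a base structure made of `msd:frag` elements that carry an `id` attribute; the namespace URI, the `msd:doc` and `msd:base` wrappers, the fragment contents and both helper functions are hypothetical and are not taken from MultiX.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the slides do not give the real MultiX one.
MSD_NS = "http://example.org/msd"
NS = {"msd": MSD_NS}

# Hypothetical document: a base structure "BS" with three fragments, one
# composition node, and one physical line linking to that composition.
DOC = f"""<msd:doc xmlns:msd="{MSD_NS}">
  <msd:base id="BS">
    <msd:frag id="F1">He had </msd:frag>
    <msd:frag id="F2">seven coats </msd:frag>
    <msd:frag id="F3">on when he came,</msd:frag>
    <msd:comp id="C1" idrefs="F1 F2 F3"/>
  </msd:base>
  <line n="1"><msd:clink target="BS" label="text content" to="C1"/></line>
</msd:doc>"""

def resolve_composition(root, comp_id):
    """Concatenate the fragments listed, in order, by composition `comp_id`."""
    frags = {f.get("id"): f.text or "" for f in root.iterfind(".//msd:frag", NS)}
    comps = {c.get("id"): c.get("idrefs", "").split()
             for c in root.iterfind(".//msd:comp", NS)}
    return "".join(frags[fid] for fid in comps[comp_id])

def line_text(root, n):
    """Follow the msd:clink of physical line number `n` and rebuild its text."""
    for line in root.iterfind(".//line"):
        if line.get("n") == str(n):
            link = line.find("msd:clink", NS)
            return resolve_composition(root, link.get("to"))
    raise KeyError(n)

root = ET.fromstring(DOC)
print(line_text(root, 1))
```

The point of the indirection is that several structures can reference the same fragments without duplicating the text.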
MultiX (3)
Querying MultiX documents: XQuery functions
rebuild ($elem-seq as element()*) as element()*
share-content ($e as element()) as xs:boolean
share-content-with ($e as element(), $str_name as xs:string) as element()*
share-fragments ($e1 as element(), $e2 as element()) as xs:boolean
get-shared-fragments ($e1 as element(), $e2 as element()) as element(msd:frag)*
includes-fragments-of ($e1 as element(), $e2 as element()) as xs:boolean
Etc.
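The slides give signatures only, not semantics. Assuming each element ultimately maps to an ordered list of base-structure fragment identifiers, three of these functions might behave like the set operations below. This is a Python sketch of that presumed behavior, not the MultiX XQuery implementation; the `frags` table and the element names in it are hypothetical.

```python
# Hypothetical table: element name -> ordered ids of the fragments composing it.
frags = {
    "line1": ["F1", "F2", "F3"],
    "word1": ["F1"],
    "word2": ["F2", "F3"],
    "damaged1": ["F3", "F4"],
}

def share_fragments(e1, e2):
    """True when the two elements have at least one fragment in common."""
    return not set(frags[e1]).isdisjoint(frags[e2])

def get_shared_fragments(e1, e2):
    """Fragments common to both elements, in the order they appear in e1."""
    other = set(frags[e2])
    return [f for f in frags[e1] if f in other]

def includes_fragments_of(e1, e2):
    """True when every fragment of e2 is also a fragment of e1 (assumed reading)."""
    return set(frags[e2]) <= set(frags[e1])

print(share_fragments("word2", "damaged1"))
print(get_shared_fragments("word2", "damaged1"))
print(includes_fragments_of("line1", "word2"))
```

Because elements from different structures are compared only through fragment identifiers, the test works even when the elements overlap without nesting.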
MultiX (4)
Find all damaged words that contain damaged characters only.
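The slide's actual query is not preserved in this transcript. As a hedged illustration of the logic, the Python sketch below assumes that words and damaged-character elements are both expressed as lists of base-structure fragment identifiers, and that fragments are fine enough for damaged-character boundaries to coincide with fragment boundaries. All identifiers and the `fully_damaged_words` helper are hypothetical.

```python
# Hypothetical lexical structure: word id -> fragment ids.
words = {"w1": ["F1", "F2"], "w2": ["F3"], "w3": ["F4", "F5"]}

# Hypothetical damaged-characters structure: element id -> fragment ids.
damaged_chars = {"d1": ["F2"], "d2": ["F3"], "d3": ["F4"], "d4": ["F5"]}

def fully_damaged_words(words, damaged_chars):
    """Return the words whose fragments are all covered by damaged characters."""
    damaged = {f for ids in damaged_chars.values() for f in ids}
    return [w for w, ids in words.items()
            if ids and all(f in damaged for f in ids)]

print(fully_damaged_words(words, damaged_chars))
```

Here `w1` is excluded because its fragment `F1` is not damaged, while `w2` and `w3` qualify.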
MultiX (5)
Creation and evolution of MultiX documents
A parser (MXP) creates an internal representation from separate structures
Useful when the structures are known a priori
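The slides do not describe how MXP works internally. One plausible way to derive a shared base structure from separately encoded structures is to cut the text at every boundary used by any structure, so that each element becomes a sequence of whole fragments. The Python sketch below illustrates that idea only; the `build_base` function, the span format and the sample data are assumptions.

```python
def build_base(text, structures):
    """Derive fragments and compositions from structures given as offset spans.

    `structures` maps a structure name to a list of (element, start, end)
    spans over `text`, with `end` exclusive.
    """
    # Every boundary used by any structure becomes a cut point.
    cuts = {0, len(text)}
    for spans in structures.values():
        for _, start, end in spans:
            cuts.update((start, end))
    bounds = sorted(cuts)
    # Consecutive cut points delimit the fragments of the base structure.
    fragments = [(f"F{i + 1}", a, b)
                 for i, (a, b) in enumerate(zip(bounds, bounds[1:]))]
    # Each element is composed of the fragments lying inside its span.
    comps = {
        (sname, ename): [fid for fid, a, b in fragments
                         if start <= a and b <= end]
        for sname, spans in structures.items()
        for ename, start, end in spans
    }
    return fragments, comps

# Hypothetical data: word2 straddles the line break, so the two hierarchies overlap.
text = "abcdefgh"
structures = {
    "physical": [("line1", 0, 5), ("line2", 5, 8)],
    "lexical": [("word1", 0, 3), ("word2", 3, 8)],
}
fragments, comps = build_base(text, structures)
print(fragments)
print(comps[("lexical", "word2")])
```

In this example `word2` is composed of one fragment from `line1` and one from `line2`, which is exactly the kind of overlap a single XML tree cannot express.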
Creation and Evolution of MSD
Little or no a priori knowledge about the structures
A common situation for scholars in the humanities
E.g. transcription of a poem found in a manuscript, using the vocabulary defined by the TEI schema:
The loss of his clothes hardly mattered, because
He had seven coats on when he came,
With three pairs of boots—but the worst of it was,
He had wholly forgotten his name.

He would answer to “Hi!” or to any loud cry,
Such as “Fry me!” or “Fritter my wig!”
To “What-you-may-call-um!” or “What-was-his-name!”
But especially “Thing-um-a-jig!”
Before restructuring
[Figure: the two stanzas of the poem with three structures (stanzas, sentences, verses) drawn over the base structure, its composition nodes and its fragments.]
Restructuring is necessary
[Figure: the two stanzas of the poem with three structures (stanzas, sentences, verses) drawn over the base structure, its composition nodes and its fragments.]
Automatic restructuring
[Figure: the two stanzas of the poem with the structures labelled, in order, stanzas, sentences, verses.]
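The slides do not spell out the restructuring procedure. One step it plausibly involves is splitting a base-structure fragment when a new structure introduces a boundary inside it, then patching every composition that referenced the old fragment. The Python sketch below shows that single step under those assumptions; `split_fragment`, the identifiers and the data are hypothetical.

```python
def split_fragment(fragments, comps, frag_id, offset, new_id):
    """Split fragment `frag_id` at character `offset` and patch all compositions.

    `fragments` maps fragment id -> text; `comps` maps composition id -> list
    of fragment ids. Both inputs are left unmodified.
    """
    text = fragments[frag_id]
    if not 0 < offset < len(text):
        raise ValueError("offset must fall strictly inside the fragment")
    # Rebuild the fragment table in order, replacing one entry with two.
    new_fragments = {}
    for fid, content in fragments.items():
        if fid == frag_id:
            new_fragments[fid] = text[:offset]
            new_fragments[new_id] = text[offset:]
        else:
            new_fragments[fid] = content
    # A composition that used the old fragment now lists both halves in order.
    new_comps = {
        cid: [x for fid in ids
              for x in ((fid, new_id) if fid == frag_id else (fid,))]
        for cid, ids in comps.items()
    }
    return new_fragments, new_comps

# Hypothetical state: one verse stored as a single fragment.
fragments = {"F1": "He had seven coats on when he came,"}
comps = {"verse2": ["F1"]}

# A new structure needs a boundary after "He had ", i.e. at offset 7.
new_fragments, new_comps = split_fragment(fragments, comps, "F1", 7, "F1b")
print(new_fragments)
print(new_comps)
```

The text rebuilt from each composition is unchanged by the split, so existing structures remain valid while the new one gains the boundary it needs.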
User intervention in restructuring
[Figure: the two stanzas of the poem with the structures labelled, in order, stanzas, verses, sentences.]
Perspectives
Shared responsibilities
Who is responsible for each document structure?
What is the life cycle of newly created document structures?
Use of formal knowledge
Formal knowledge, namely the tree structure of well-formed XML documents, made automatic restructuring possible.
It seems necessary to find simple formal conditions for restructuring times …