Page 1: Unified File Container: Homogeneous Content Storage with ...courses.cecs.anu.edu.au/courses/CS_PROJECTS/10S2/Reports/Sand… · To test the viability of such a concept we also introduce

Unified File Container: Homogeneous Content Storage with Heterogeneous Display

Sanders King – u4651271

Australian National University

Comp8750 - Computer Systems Project

Final Report

Abstract – File formats have a tendency to proliferate and the tools required to convert between formats tend to lag behind. In addition, file conversion is often inadequate and lossy. The Unified File Container (UFC) is a framework for allowing existing file formats to share content, structure, and presentation without necessarily using the same underlying file format. Any proposed framework clearly needs to allow for modern functionality such as collaborative editing, undo/redo, version control, compound documents, object linking and embedding, document preservation, format obsolescence, cascading styles, and more. The UFC attempts to address or bypass most of these issues.

1. Introduction

There are numerous file formats in common use today, and at least an equal number of format conversion utilities. Much file conversion is inadequate or at least insufficient. This paper intends to examine whether a way can be found to mitigate the problems of file conversion by allowing existing formats to share content without format conversion.

The basic approach is to provide a framework which can interact with existing applications and formats, and take over some of the content storage and manipulation tasks. This functionality would be provided in a consistent and shareable way.

To test the viability of such a concept we also introduce a small suite of applications demonstrating the concept. This artefact is not intended to show why such a concept is advantageous; however, this paper will discuss why such a concept might be compelling.

Some aspects of developing such a concept merit further attention, so some of the issues will be discussed, outlining good, bad, and alternative approaches to overcoming the problems.

2. The Issue

There are two parts to the problem that this paper is trying to address.

The first is that of document conversion. Arthur wants to work on his documents using his favourite commercial word processor, but Martha always uses her favourite FOSS word processor. Whenever Arthur and Martha send documents to each other they have to convert from one format to the other. That conversion is often lossy and Arthur and Martha cannot really collaborate on a document because the constant conversions ruin the formatting of the document. Also, the embedded vector drawings never really convert well at all. And so far we haven't discussed Tom, who wants to make the document available on-line in HTML. He has to laboriously convert and rework the HTML every time Arthur and Martha send him a new version of the document.

The second part of the problem is that of a single document that is to be viewed in several contexts. Arthur has been working on his extensive magnum opus for a long time and now he wants to give a slide-show presentation of his thesis. The slide-show will have approximately the same outline as the thesis; however, he needs a summarised version. Arthur takes a copy of his thesis and reworks it into a slide-show but, critically, any changes he now makes to his original thesis will not be reflected in the slide-show. Any spelling mistakes will have to be fixed twice, and if he reworks the order of his argument he'll have to extensively rework two documents.

Still not convinced? Consider the following extract from an article about a slide show application called SlideRocket[1]: “Users can now post comments on the presentation itself, either as virtual sticky note to the presentation author, who can receive alerts about them, or as part of a broader group having an online conversation within the slide itself.” [2]

This is a major feature in SlideRocket; however, this paper proposes a system that would include features like this out of the box.

2.1 Previous Approaches to Solve this Issue

Two main approaches have been used in the past to solve this problem.

The first approach is to invent a 'better' or more 'adaptable' or more 'flexible' format and ask everyone to use that format. Ian Barnes[3] discusses various document formats suitable for long term preservation and concludes that “On these criteria, only XML is any good, but what XML?”. Barnes goes on to explain the USQ ICE[4] template and the Scholar's Workbench. [5] He would like all academics to agree on and use a (his) single document template (DTD).

I like the take of Coombs et al. on this idea:

“Others have attempted to eliminate the need for meta-markup by providing complete referential and descriptive vocabularies, but such efforts are contrary to the spirit of human creativity.”1 [6]

1 The highlighting has been added; it is not a feature of the original text.

We are unlikely to get any significant portion of the population to agree on and use a single format, let alone a single DTD.

Additionally, no format can possibly expect to cover every display or formatting possibility, especially when some requirements are mutually exclusive. Documents come in several main types: paged documents (normal word-processor documents), slide-shows (like PowerPoint), flexible layout (such as HTML in web-pages), and more. Can any file format be made flexible enough to cope with any layout, and cope with them all simultaneously? Note that the Portable Document Format (PDF) does a very credible job of this but is primarily intended as a read-only viewing and printing version of a document.

The second approach was discussed as early as 1988 when Gomez, Pratt, and Buckley saw the problem this way:

“...much effort is being put toward the goal of universal exchange, it appears to us that it can go only so far. What we do expect to happen is a flourishing of translation products and services as the interchange problems become more manageable but not go away.” [7]

File conversion has become much better but no conversion program can expect to successfully convert a paged document into a slide show that may not even require the same structure or exact content.

What a final solution should encompass

There are several aspects that any solution to this problem should encompass, including:

• Provide for multiple views of the same content.

• Allow for modern document editor functionality such as undo/redo, version control, and unlimited formatting options.

• Allow incorporation of all the advanced features provided by existing file formats.

• Allow for the integration of new, as yet unthought-of file formats and features.

• Allow for collaborative editing.

• Allow 'nested containment'. [8]

• Be open (non-proprietary), deal with the issue of format obsolescence, and be human readable in any case.

• Be adaptable to differing viewing environments such as computer screens versus mobile phone screens.

• Blur the lines between a document, the web-browsing experience, and a desktop GUI application experience.

• Be suitable to be stored in a database, and catalogued and indexed and searched.

3. Literature Search

As this paper is primarily about creating interoperability between file formats, the literature search started with an overview of existing technologies. Wikipedia is a convenient (if informal) source of information about many encoding formats, file formats, and 'format decorators' including:

• SGML, XML, LaTeX, HTML, XSL, XSLT, CSS,

• Open Document Format (ODF),

• OOXML (Microsoft),

• DocBook, TEI,

• Document Type Definition (DTD),

• W3C XML Schema (supersedes DTDs),

• MIME Types,

• Object Linking and Embedding (OLE), CORBA,

• Google Wave, and

• many more.

It became apparent that existing technologies seem to break down into some basic groups.

Descriptive formats

PostScript and Portable Document Format are examples of descriptive formats. The entire document is encoded as a stream of instructions 'describing' the content.

Mark-up formats

LaTeX, XML and their descendants such as HTML, DocBook, and TEI hold the structure of the document while the mark-up contains information about how to format or display the content. LaTeX styles or XML Style Sheets may further describe the formatting of the content.

Compound formats

Some formats such as ODF, OOXML, and PDF contain several components and embedded resource files. For example, ODF and OOXML are both zipped file formats containing a manifest file (as XML), the document text file (as XML), style information in another file, and various other resource files. PDF can also store embedded fonts to make the file display similarly on any platform.

Nested Containment

ActiveX, OLE, and CORBA are examples of technologies that allow nested containment. A document of one type may be embedded into a document of another type. In some cases the host application will allow editing of the contained content, and in some cases this editing, while appearing to occur inside the host application, is actually being performed by another application.

Viewing and non-editable formats

Many formats are designed as view-only formats. PDF, PS, and DVI all fall into this category. They tend to be designed to be device independent.

Collaborative Technologies

Google Wave [9] and Google Docs [10] are the primary examples of working collaborative technologies. They allow more than one person to edit a document simultaneously. The underlying functionality behind these technologies is provided by operational transforms [11], but equivalent functionality can be obtained using WOOT [12] or causal trees. [13]

Adaptable Content

HotMedia [14] and Chua's paper on 'Web-page Adaptation' [15] are both about ways to automatically deliver slightly different content depending on the target device, e.g. a desktop PC or a mobile phone.

Preservation

The next theme is articles about the importance of choosing formats suitable for long-term preservation and archival. The Victorian Electronic Records Strategy (VERS) [19] and the Scholar's Workbench [5] both emphasise the importance of long-term archive formats. This is important, and certainly influences some of the design of the UFC.

Multiple Views?

None of these technologies are about document formats that can be displayed or edited differently depending on the display or user context. The ODF and OOXML formats are designed to be used as page presentation, slide, or drawing formats but only one at a time.

So what other ways might there be to provide interoperability between display formats?

XML in SQL

There are several ways to store XML files in an SQL database when the DTD is not defined. Several articles tackle this issue. [16][17][18] However, none of them tackle the issue of storing several XML structures for the same content, as is the aim of this paper.

Collaborative Editing

Another area of interest is the state of collaborative text editing. It seems sensible that any design for a document store should include an API to allow collaborative access to the document content. Bieber and Isakowitz [20] describe the basics of text editing. Oster et al [12] describe an approach to group editing of documents without using operational transforms.

At this point we realise the need to see how existing applications implement the core document/text handling functionality. SharpDevelop [21] has a core text structure that implements an interface very like the one described by 'The Implementation and Experiences of a Structure-Oriented Text Editor'. [22] These provide the starting point for designing a content store.

Document Breakdown

In 1987, Coombs et al. [6] wrote an interesting, more grass-roots article about early approaches to document mark-up. They usefully identified different mark-up methods as 'punctuational', 'presentational', 'procedural', and 'descriptive'. They showed remarkable clarity in their analysis of the basis of mark-up systems. They summarise their view like this:

“Presentational markup is designed for reading. Procedural markup is designed for formatting, but usually only by a single program. Descriptive markup is moderately well suited for reading, but primarily designed to support an open class of applications (e.g., information retrieval).”

In the context of allowing documents to have multiple views, perhaps a content store needs to include a concept of 'structural' mark-up.

4. The Unified File Container (UFC)

At this point it is apparent that the solution cannot just be a new file format. It needs to be a document management framework within which document editors can function, and it needs to provide an in-memory data structure to store the live document currently being worked on. This framework is known as the Unified File Container or UFC, and the in-memory content store is known as the Hierarchical Content Store or HCS.

4.1 The Grand Vision

In the interests of world domination, the vision for the UFC could incorporate many concrete and abstract ideas.

The line between an application and a document is steadily blurring with the increased flexibility of HTML and web pages. The PDF format allows for data entry into forms. HTML links allow you to open new windows. In that sense a form in a desktop GUI application is just another document, and the printable report launched from that same desktop GUI application is just another document. In fact, if the GUI form is, for example, a data entry form for a customer record, and the report is the print-out of that same customer information, they are, in effect, multiple views of the same document.

So, in this light, a dream for the UFC and the HCS at its core would be to make it flexible enough to be a form in a GUI application, a web page, a data entry form, or your letter to Gran about what you did last week.

To achieve all this, the UFC and the HCS would have to be remarkably flexible. Fortunately this is not the expectation. Existing client applications, be they a GUI application, a web site, a data entry form, or an e-mail client, allow the HCS to manage the content for them and then they supply the database connectivity, access authentication, record storage, and all the other flexibility required.

Refer to Appendix G for some of the other ideas that could be incorporated into a UFC/HCS implementation.

4.2 What is in scope?

World domination aside, the goal of this paper is merely to design a version of the UFC and the HCS core that allows for multiple views, and content shared by multiple client applications.

4.3 The Design

First of all, the proposed UFC framework is not a new file format, or even yet another DTD. A brief look at Wikipedia will suffice to see that there already exist a multitude of file formats and MIME types. Each of these formats has its own strengths and weaknesses, and purpose. No format can do everything.

UFC is not yet another OLE style solution [23]; although it does intend to aid such content integration.

UFC is a set of rules to allow existing formats to interact and share the same content. UFC is a container to allow existing formats and client applications to share the same file space. UFC is a set of tools to allow users to select from various views of their content and edit or display that content using the appropriate client application.

The HCS stores some of the very basic formatting; all the advanced formatting is maintained by the client application and stored in the file format preferred by the client application.

Background

A document can be thought of as having three distinct components. The first is the content. This is the text, images, sound bites, and other such elements that make up a document. The second is the structure of the document. This defines the order of the elements, whether they are floating, or anchored, footnotes, and so on. The third is the presentation or decoration: Is the text bold or italic? Is there a background image? What styles are applied to headings and body text?

Existing document formats mix these components in various ways. An XML document has the content and structure in the XML part of the document and the presentation is provided by a combination of the DTD/Schema and a style sheet. (Noting of course that the XML component contains the mark-up.)

Other formats issue a stream of instructions describing all aspects of a document in an interpretable language. The PDF format mostly works this way. The content, structure, and presentation are inextricably mixed.

During the literature search for this paper, not a single format was found that completely separates these three components.

Conventional Document Editors

Examining some existing document editors such as OpenOffice [24], the core editor inside SharpDevelop [21], and the text view control in GTK# [25], we see a model for how document editors work internally. The document is loaded from the on-disk format and converted into an in-memory structure. The in-memory structure is then edited and when saved it is converted back into the on-disk structure to be written to disk.

UFC Framework – Co-operative Editors

The UFC framework allows for two models of use. The first, less preferred, is the 'co-operative' model. Client applications are adapted to use the HCS to store as much content as possible. For example, an HTML document might use place-holders instead of text and replace these with content from the HCS when editing or displaying the document.
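The place-holder idea can be illustrated with a minimal sketch. It assumes a hypothetical `{{hcs:ID}}` place-holder syntax and uses a plain dictionary to stand in for the HCS; neither is specified in this paper, and the function name `render` is invented for the illustration.

```python
import re

# Hypothetical place-holder syntax; the paper does not define one.
PLACEHOLDER = re.compile(r"\{\{hcs:([A-Za-z0-9_-]+)\}\}")


def render(template, store):
    """Replace each place-holder with the content held in the store.

    Unknown identifiers are left untouched so missing content stays visible.
    """
    def lookup(match):
        return store.get(match.group(1), match.group(0))

    return PLACEHOLDER.sub(lookup, template)


# A dictionary stands in for the HCS content store (an assumption).
hcs_content = {"title": "My Thesis", "intro": "File formats proliferate."}
html_view = "<h1>{{hcs:title}}</h1><p>{{hcs:intro}}</p><p>{{hcs:missing}}</p>"
rendered = render(html_view, hcs_content)
```

A second view, such as a slide-show template, could reference the same identifiers, so a correction made once in the store would appear in both.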

HCS Enabled Editors

The second, preferred, model is the 'HCS enabled' model. In this case the client application is adapted to use the HCS as its in-memory structure. The HCS has been specifically designed to allow for this type of use.

Please see Appendix F for a diagram comparing the 'HCS Enabled' and conventional models.

[Figure: block diagrams of three editor models. Panels: HCS Enabled Document Editor; Conventional Document Editor; Co-operative Document Editor. Labelled components include: User Interface, File Load/Save Service, In-Memory Structure, HCS In-memory Structure (ITextNode, ITextRange), Content Store Service (IStore), UFC Service, Disk, MyFile.ext, MyFile.ext (with place-holders), MyFile.UFC, content.hcs.]

In either case the client application still maintains a document in the original format of that application (e.g. .odt, .ppt, .doc). It is by using that original document format that the client application retains all the power and functionality provided by that document format.

The client application also uses the UFC to store any additional files that might be required: manifest files, image files, style sheets, meta-data, etc. The UFC is actually an uncompressed ZIP file, much like the ODF [26] and OOXML [27] formats.
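As a rough illustration of an uncompressed ZIP container, the following sketch uses Python's standard `zipfile` module with `ZIP_STORED`. The member name `content.hcs` appears in the figure above; the `views/...` layout and the helper names are assumptions made for the example, not the prototype's actual layout.

```python
import io
import zipfile


def build_container(members):
    """Pack named byte strings into an uncompressed ZIP archive."""
    buffer = io.BytesIO()
    # ZIP_STORED writes each member verbatim, without compression.
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_STORED) as archive:
        for name, data in members.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def list_members(container):
    """Map each member name to the compression method recorded for it."""
    with zipfile.ZipFile(io.BytesIO(container)) as archive:
        return {info.filename: info.compress_type for info in archive.infolist()}


# Hypothetical layout: one shared content store plus one file per view.
ufc_bytes = build_container({
    "content.hcs": b"shared content",
    "views/thesis/MyFile.odt": b"view-specific document",
    "views/slides/MyFile.ppt": b"view-specific document",
})
members = list_members(ufc_bytes)
```

Because nothing is compressed, the stored bytes remain directly visible inside the container, which is in keeping with the stated goal of staying human readable.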

Note that while the UFC and HCS could be implemented for each client application, the complexity and the high number of corner cases suggest that they should be implemented as an add-on to the host operating system and be made available to client applications as a service.

The Framework

The UFC launcher is launched when a user first tries to open or create a UFC file. The launcher intercepts the request and displays an appropriate dialog. When you create a new UFC file you need to select a name for the first view, and select the client application to use to edit this view. The prototype artefact uses a dialog like this:

When opening an existing UFC document, the user must choose which view they wish to open, or elect to create a new view. Here is the prototype screen-shot for opening an existing view:

And here is a screen-shot for creating a new view:

The HCS Service

At this point the client applications start to access the HCS service using the ITextRange and ITreeNode interfaces. These interfaces provide the sort of functionality required by an in-memory store. Read about the genesis of these interfaces in the APIs section of this paper.

The client application defines the scope of document changes. The scope can apply to all current views, in which case all content and structural changes apply to all views of the document. Otherwise, the scope can apply to just the current view, in which case all changes are visible only to the current view.
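One way to picture the two scopes is a store that keeps shared content alongside per-view overrides. This is only a sketch: the names `Scope` and `ScopedStore` are hypothetical, and the rule that a view-local change shadows the shared content is an assumption rather than documented HCS behaviour.

```python
from enum import Enum


class Scope(Enum):
    ALL_VIEWS = "all"
    CURRENT_VIEW = "current"


class ScopedStore:
    """Toy content store with shared text and per-view overrides."""

    def __init__(self):
        self._shared = {}      # element id -> text seen by every view
        self._overrides = {}   # (view, element id) -> text seen by one view

    def write(self, view, element_id, text, scope):
        if scope is Scope.ALL_VIEWS:
            self._shared[element_id] = text
        else:
            self._overrides[(view, element_id)] = text

    def read(self, view, element_id):
        # Assumption: a view-local change shadows the shared content.
        return self._overrides.get((view, element_id), self._shared.get(element_id))


store = ScopedStore()
store.write("thesis", "p1", "The full argument.", Scope.ALL_VIEWS)
store.write("slides", "p1", "Summary.", Scope.CURRENT_VIEW)
thesis_text = store.read("thesis", "p1")
slides_text = store.read("slides", "p1")
```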

4.4 Providing Modern Functionality

The beauty of the UFC solution is that it bypasses most problems by leaving them in the realm of the client application and the file format used by that application.

However, the HCS does provide some common services that client applications can elect to use.

Twins

A twin is a section of a document that appears in more than one place in a document. If a twin is edited in one section of a document the other twins are also changed. Twins are possible because the content of a document is not stored in the element stream but in 'blips'2 associated with elements in the stream. Each blip can be associated with more than one element.
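The sharing that makes twins possible can be sketched with two small classes. The class names `Blip` and `Element` follow the terminology above, but their fields and the `render` helper are assumptions for the illustration only.

```python
class Blip:
    """A run of text; several stream elements may refer to the same blip."""

    def __init__(self, text):
        self.text = text


class Element:
    """A stream element that points at a blip rather than owning the text."""

    def __init__(self, blip):
        self.blip = blip


def render(stream):
    return " ".join(element.blip.text for element in stream)


notice = Blip("Terms apply.")
stream = [
    Element(Blip("Intro.")),
    Element(notice),            # first twin
    Element(Blip("Body.")),
    Element(notice),            # second twin shares the same blip
]
before = render(stream)

# Editing through one twin changes the shared blip, so both twins change.
stream[1].blip.text = "Terms and conditions apply."
after = render(stream)
```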

Fields

Field elements are ranges of content that appear in the normal content flow but, like anchored objects, are not selected when placing the cursor. Like anchored ranges, they must be explicitly selected to be edited. Field elements contain data about the value but the displayed value does not necessarily match the stored value. There are several standard fields that the HCS understands such as: date, time, URL, citation, float, and MIME types. The client application can also store custom fields. The field can contain the field data or it can contain a reference to the field data, where the field data might be stored in the UFC or might be an external reference. All fields have default place-holder text which is displayed if a display engine for the field type is not installed or if remote content has not yet been downloaded.
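The distinction between stored value, displayed value, and place-holder text can be sketched as follows. The `Field` record, the `DISPLAY_ENGINES` registry, and the date format are all hypothetical; the paper does not describe how display engines are registered.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Field:
    field_type: str    # e.g. "date" or "citation"
    value: str         # stored value, not necessarily what is displayed
    placeholder: str   # shown when no display engine is available


def format_date(value):
    # Stored as ISO 8601, displayed as day/month/year.
    parsed = date.fromisoformat(value)
    return f"{parsed.day:02d}/{parsed.month:02d}/{parsed.year}"


# Hypothetical registry of installed display engines, keyed by field type.
DISPLAY_ENGINES = {"date": format_date}


def display(field):
    engine = DISPLAY_ENGINES.get(field.field_type)
    return engine(field.value) if engine else field.placeholder


shown_date = display(Field("date", "2010-10-29", "[date]"))
shown_citation = display(Field("citation", "barnes-preservation", "[citation]"))
```

Here the date field is rendered by its engine, while the citation field falls back to its place-holder text because no engine for that type is installed.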

2 A 'blip' is a section of text that has all the same attributes.

Anchors

Anchors, sometimes paradoxically called 'floating frames', are sections of a document that are tied to a location in the document. They might be tied to a paragraph, like a picture, or tied to a page like a footnote. When a cursor is inserted into a range, anchored ranges are ignored in the calculation of where the cursor should be inserted. Anchored ranges must be explicitly selected before the content can be edited.

Attributes

Attributes are kept for all elements in the stream. Client applications can store any key-value pair they wish. Attributes are prefixed with a type: some are for the exclusive use of the HCS, some are for a particular view, and some standard attributes are shared among all views. The only requirement is that shared attributes must be rendered in a consistent way by all client applications.
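A minimal sketch of prefix-based filtering follows. The prefixes `hcs:`, `view:<name>:`, and `shared:` are invented for the example; the paper states only that attributes carry a type prefix, not what the prefixes look like.

```python
def visible_attributes(attributes, view):
    """Return the attributes a client application for `view` should see."""
    visible = {}
    for key, value in attributes.items():
        prefix, _, rest = key.partition(":")
        if prefix == "shared":
            visible[rest] = value
        elif prefix == "view":
            owner, _, name = rest.partition(":")
            if owner == view:
                visible[name] = value
        # Anything else (here "hcs:") is treated as internal and not exposed.
    return visible


element_attributes = {
    "shared:bold": "true",
    "view:slides:bullet": "disc",
    "view:thesis:margin": "2cm",
    "hcs:blip": "17",
}
slides_attributes = visible_attributes(element_attributes, "slides")
thesis_attributes = visible_attributes(element_attributes, "thesis")
```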

Versions

The HCS is primarily designed to allow multiple views of the document. It is therefore trivial to allow multiple views of bits of the document, hence versions. As sections of a document are edited, copies of those sections are stored as 'divergent paths' in the stream. The client application can view and select those divergent paths, and reinsert them back into the primary element stream.

Undo/Redo

Modifying the content of the HCS is done using a string of commands. Each of these commands is non-destructive. For example, even when a client application 'deletes' a section of a document, that section is not deleted, it is merely hidden. Therefore it is possible for the HCS to retain a stack of operations (the 'OpStack'), and allow for easy undo and redo operations.
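The hide-instead-of-delete idea can be sketched with a toy store that supports a single operation. The class and method names are hypothetical, and a real OpStack would need to record many more kinds of command than the one shown here.

```python
class HidingStore:
    """Toy store in which 'delete' only hides a piece of text."""

    def __init__(self, pieces):
        self.pieces = [{"text": text, "hidden": False} for text in pieces]
        self.undo_stack = []   # stands in for the 'OpStack' of applied operations
        self.redo_stack = []

    def delete(self, index):
        self.pieces[index]["hidden"] = True
        self.undo_stack.append(index)
        self.redo_stack.clear()   # a new edit invalidates pending redos

    def undo(self):
        if self.undo_stack:
            index = self.undo_stack.pop()
            self.pieces[index]["hidden"] = False
            self.redo_stack.append(index)

    def redo(self):
        if self.redo_stack:
            index = self.redo_stack.pop()
            self.pieces[index]["hidden"] = True
            self.undo_stack.append(index)

    def text(self):
        return "".join(p["text"] for p in self.pieces if not p["hidden"])


doc = HidingStore(["Hello, ", "cruel ", "world"])
doc.delete(1)
after_delete = doc.text()
doc.undo()
after_undo = doc.text()
doc.redo()
after_redo = doc.text()
```

Because the hidden piece is never discarded, undo only has to flip a flag; nothing needs to be reconstructed.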

5. The Artefact

The purpose of the artefact is to demonstrate that the HCS concept has the flexibility to become the core in-memory representation of a document for various types of document editors. Unfortunately, the purpose is not to show how the concept of the UFC framework could be useful. The implemented 'document editors' are far too simple to highlight the benefits of the UFC.

5.1 The UFC/HCS Prototype

The artefact consists of a functional but incomplete prototype implementation of the HCS (HCS-P), a simple prototype implementation of the UFC framework (UFC-P), and two extremely simple HCS-P enabled 'document editors'. These two editors do not use the HCS-P as their primary in-memory store, but they do immediately reflect all content changes from the editor to the HCS-P.

The HCS-P works in most cases and is fairly bug free; however, any implementation will have to deal with a large number of corner cases, and the prototype may function unexpectedly when encountering them.

The document editors consist firstly of a simple text editor in which you can set bold and italic text, and change the font to 'Purisa'. The second editor allows the user to create a tree of nodes with text. In each, the user can change the scope of the editing between all views and the current view.

A UFC-P document can be edited with either or both editors, changing the editing scope as required. They serve to indicate that the HCS-P can maintain multiple views of the same document.

The prototype UFC framework and prototype HCS are written using C# and Gtk+ on the Mono.Net platform. [28] Importantly C#.Net includes automatic garbage collection; some features of the HCS-P would be quite difficult to achieve in a non-garbage collected language. So, this choice was driven by the need to develop the prototype quickly, and it is not likely to be an ideal choice for a production artefact.

Hierarchical Content vs Ranges

The HCS was originally designed with the fundamental belief that the core could be implemented using a hierarchical tree of nodes (like XML). After all, many of the modern document formats are based on XML. In fact, HCS originally stood for “Hierarchical Content Store”, and retains that name as it still can fulfil that purpose.

The original idea was to have a 'core' document tree and have a 'view' tree per view that linked to the core tree as required. However, as is exemplified in Appendix E, a tree structure is not sufficient to cleanly represent overlapping ranges of attributes. It is not legal XML to say:

<bold>This is bold,<italic> this is italic and bold,</bold></italic> and this is italic only.

The correct XML representation might look like this:

<bold>This is bold,<italic> this is italic and bold,</italic>

</bold><italic> and this is italic only.</italic>

The problem with the correct XML representation is that the <italic> tag is repeated. LaTeX handles overlapping ranges like this: [29]

\series bold This is bold, \shape italic this is italic and bold, \series default and this is italic only. \shape default

'Element Stream'

The HCS attempts to handle both these patterns by including two concepts: 'nodes' for the XML style representation, and 'ranges' for the LaTeX style representation.

The result is a deceptively simple multi-linked list 'stream' of elements. Every element has two links to its neighbours. Range elements have an additional link to the other end of the range, and node elements, which inherit from range elements, have yet another link to their parent node.
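The linking scheme described above can be sketched in a few lines. This is a minimal illustration only: it is written in Python rather than the prototype's C#, and the class names (Element, RangeElement, NodeElement) and helper functions are assumptions made for the example, not the HCS-P types.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Optional


@dataclass(eq=False)
class Element:
    """Base element: every element links to its two neighbours."""
    text: str = ""
    prev: Optional["Element"] = None
    next: Optional["Element"] = None


@dataclass(eq=False)
class RangeElement(Element):
    """One end of a range; 'partner' links to the other end of the range."""
    attributes: Dict[str, object] = field(default_factory=dict)
    is_start: bool = True
    partner: Optional["RangeElement"] = None


@dataclass(eq=False)
class NodeElement(RangeElement):
    """A node is a range element with one more link, to its parent node."""
    parent: Optional["NodeElement"] = None


def link(*elements: Element) -> Element:
    """Chain elements into a doubly linked stream and return the head."""
    for left, right in zip(elements, elements[1:]):
        left.next, right.prev = right, left
    return elements[0]


def make_range(cls=RangeElement, **attributes):
    """Create a start/end pair whose ends point at each other."""
    start = cls(attributes=attributes, is_start=True)
    end = cls(attributes=attributes, is_start=False)
    start.partner, end.partner = end, start
    return start, end


def walk(head: Optional[Element]) -> Iterator[Element]:
    """Follow the 'next' links from head to the end of the stream."""
    while head is not None:
        yield head
        head = head.next


# Overlapping bold and italic ranges: legal here, although not legal as XML.
bold_start, bold_end = make_range(bold=True)
italic_start, italic_end = make_range(italic=True)
stream = link(
    bold_start, Element("This is bold,"),
    italic_start, Element(" this is italic and bold,"),
    bold_end, Element(" and this is italic only."),
    italic_end,
)
plain_text = "".join(e.text for e in walk(stream))
```

Because the ends of a range are ordinary stream elements, the bold range can close inside the italic range without either tag being repeated.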


Each range contains attributes including per view visibility. The resulting 'element stream' can retain a hierarchical node structure, a structure with overlapping ranges, attribute information for each range, version information for ranges of text (undo/redo, conflict management, version control), and can retain different structural and content information for multiple views. The most important factor is that an element stream can contain all this information simultaneously for all the views. Views can share structure and attribute information, or not share this information, with nearly unlimited granularity.3

A client application that expects a hierarchical store of data uses the ITreeNode interface which exposes functionality similar to that of the 'Document Object Model' (DOM). [30]

A client application that expects a flat view of the document can iterate through the element stream. At each iteration the client application receives part of the document content and the attributes that apply to that content. This is delivered via the ITextRange interface.

Of course, any sophisticated document editor is likely to use a combination of these approaches.

When implementing the HCS-P element stream and the stream iterator, it became necessary to deal with the situation where part of a range is hidden to the particular view iterating through the document stream. Two possible approaches became apparent.

'Skipping Path'

The first approach has been dubbed 'Skipping Path'. In skipping path the iterator looks at every element in the stream but 'skips' over document content that is hidden. However, the iterator does not skip over attribute changes. This is important: consider the situation where the attribute 'Bold=True' occurred in a visible section of the document but the attribute 'Bold=False' occurred in a hidden section of the document. If the iterator skipped 'hidden' attributes the document text would remain bold when it should have reverted to not bold.
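The 'Bold=True' / 'Bold=False' situation can be made concrete with a small sketch. The tuple encoding of the stream and the function name skipping_path are assumptions made for this illustration (in Python rather than the prototype's C#); they are not the HCS-P data structures.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical stream encoding, for illustration only:
#   ("text", content)       document content
#   ("attr", name, value)   an attribute change such as Bold on/off
#   ("skip", views)         start of a section hidden from the given views
#   ("endskip",)            end of that hidden section


def skipping_path(stream: List[tuple], view: str) -> Iterator[Tuple[str, Dict[str, object]]]:
    """Visit every element; skip hidden content but never attribute changes."""
    attributes: Dict[str, object] = {}
    hidden: List[bool] = []  # one entry per open skip section
    for item in stream:
        kind = item[0]
        if kind == "skip":
            hidden.append(view in item[1])
        elif kind == "endskip":
            hidden.pop()
        elif kind == "attr":
            # Applied even inside a hidden section.
            attributes[item[1]] = item[2]
        elif kind == "text" and not any(hidden):
            yield item[1], dict(attributes)


stream = [
    ("attr", "Bold", True),
    ("text", "Mary "),
    ("skip", {"red"}),
    ("text", "had a little lamb,"),
    ("attr", "Bold", False),  # occurs inside the section hidden from 'red'
    ("endskip",),
    ("text", " and everywhere..."),
]
red_view = list(skipping_path(stream, "red"))
green_view = list(skipping_path(stream, "green"))
```

For the 'red' view the hidden text is not returned, yet the final block still arrives with Bold set to False because the attribute change inside the hidden section was honoured.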

'Divergent Path'

The second approach is the 'Divergent Path' method. Using this method, when a range of the document is 'hidden', the range must be searched for any attribute changes that are not entirely encompassed within the range. These attribute changes must then be duplicated at the beginning or end of the hidden section. Additionally, these attribute changes must then apply explicitly to the views for which the range of text is hidden.
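The search for attribute changes that are 'not entirely encompassed' by a hidden range might look like the following sketch. The marker encoding and the names boundary_markers and divergent_path are hypothetical and chosen only for this example; the HCS-P itself was built with 'Skipping Path', so nothing here reflects prototype code.

```python
from typing import List, Set, Tuple

# Hypothetical encoding: ("start", range_id), ("end", range_id) or ("text", content).
Marker = Tuple[str, str]


def boundary_markers(section: List[Marker]) -> List[Marker]:
    """Return the range starts/ends whose other end lies outside the section."""
    starts = {value for kind, value in section if kind == "start"}
    ends = {value for kind, value in section if kind == "end"}
    enclosed = starts & ends  # ranges that open and close inside the section
    return [
        (kind, value)
        for kind, value in section
        if kind in ("start", "end") and value not in enclosed
    ]


def divergent_path(section: List[Marker], hidden_from: Set[str]) -> dict:
    """Build the alternative path followed by the views that cannot see the section."""
    return {"views": set(hidden_from), "elements": boundary_markers(section)}


hidden_section = [
    ("text", "had a little lamb,"),
    ("end", "bold"),    # the bold range started before the hidden section
    ("start", "note"),  # this range starts and ends inside the hidden section
    ("text", " its fleece was white as snow,"),
    ("end", "note"),
]
path_for_red = divergent_path(hidden_section, {"red"})
```

Only the end of the bold range is copied onto the divergent path; the 'note' range is wholly contained in the hidden text and can be left out.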

'Skipping Path' vs 'Divergent Path'

The prototype of the HCS was developed using the 'Skipping Path' method; however, it would have been approximately as easy to develop the HCS-P using 'Divergent Path'. With either method the issues of overlapping ranges must be considered and overcome.

Each has advantages. 'Skipping Path' doesn't need to worry about range starts and range ends (attributes being added and removed) because it doesn't ignore these as it skips over a hidden bit of text. However, it does need to step through (and ignore) each part of the hidden path purely because it must identify range starts and ends.

3* The only limit is that which is deliberately implemented by a client application to make a functional user interface.

In 'Divergent Path', when a different path is created it must identify all the range starts and ends (that don't start and end within the divergent path) and make sure to include them in the new path.

It is expected that sections of documents known as 'versions' (discussed above) would be stored as 'divergent paths'.

For an enlightening analogy further explaining 'Divergent Path' versus 'Skipping Path', please refer to Appendix B, and also visit Appendix C.

5.2 APIs

IUFC

The IUFC interface is the interface used to manage a UFC document. It includes the functionality to open and close documents, change to different document views, and store and retrieve resources (files) used by client applications. For more detail refer to Appendix H.

This interface also provides access to the content through the HCS interfaces, ITextRange and ITreeNode.

ITextRange

The ITextRange interface is designed to provide a simple way to edit the content of the HCS as if it were a simple text array, or an array of lines of text. It was adapted from the IBuffer interface found in SharpDevelop [21] and compared to the model described in 'A Logic Model for Text Editing'. [20] The interface to the Gtk+ 'TextView', which was used in the prototype artefact, also influenced the design of ITextRange. [31]

The ITextRange interface had to differ from all of these models because of a feature of the HCS. The HCS allows the same 'blip' to appear in more than one place in the document. (This is a 'twin', see above.) Therefore a simple Insert(offset, value) instruction that was inserting into the second occurrence of a 'blip' would increase the cursor offset by twice the length of 'value' instead of the expected length of 'value'. To solve this problem the concept of a cursor location was introduced. The cursor location is always within the current most local range element. Inserts, deletions, and other requests to the HCS are offset from the current cursor location. The cursor location only needs to be recalculated when the user actually moves the cursor to a new location on the screen. Other than that, the current cursor location remains at the end of the most recently inserted or deleted text.

This interface is the usual way to edit a non-hierarchical document. To load the content, the entire document is presented as a range that can be iterated through. Each iteration returns the next block of text or content, and tracks and returns what attributes should be applied to that block.
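The 'twice the length' effect, and the way a cursor location local to an element avoids it, can be illustrated as follows. The Blip and Cursor classes and both helper functions are hypothetical stand-ins written in Python for this example; they are not the ITextRange members.

```python
class Blip:
    """Stand-in for a blip: a piece of text that may appear more than once."""

    def __init__(self, text: str) -> None:
        self.text = text


class Cursor:
    """Hypothetical cursor location: a blip plus an offset local to that blip."""

    def __init__(self, blip: Blip, offset: int) -> None:
        self.blip = blip
        self.offset = offset


def insert_at_cursor(cursor: Cursor, value: str) -> None:
    """Insert relative to the cursor; the cursor ends up after the new text."""
    blip = cursor.blip
    blip.text = blip.text[:cursor.offset] + value + blip.text[cursor.offset:]
    cursor.offset += len(value)


def flat_offset(document, occurrence: int, local_offset: int) -> int:
    """Document-wide offset of a position inside the given occurrence."""
    return sum(len(b.text) for b in document[:occurrence]) + local_offset


twin = Blip("lamb")
# The same blip object appears twice: the second occurrence is the 'twin'.
document = [Blip("Mary had a "), twin, Blip(" and another "), twin]

cursor = Cursor(twin, 0)  # the user clicks at the start of the second occurrence
before = flat_offset(document, 3, cursor.offset)
insert_at_cursor(cursor, "little ")
after = flat_offset(document, 3, cursor.offset)
rendered = "".join(b.text for b in document)
```

The document-wide offset moves by 14 characters for a 7 character insert, because the text also grew in the first occurrence, while the cursor's local offset moves by exactly 7.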

As the user moves the cursor about the screen the SetCursor() function determines where in the text any edits should apply. The location is cached and updated so that this calculation doesn't have to be made each time the user enters some text or content.

The client application can of course insert, remove, and move content, and apply attributes to selected ranges of content.

ITreeNode

The ITreeNode interface is concerned with manipulating nodes and the tree of nodes. It was adapted from the W3 DOM core specification. [30] It was also checked against the interface in 'The Implementation and Experiences of a Structure-Oriented Text Editor' [22] to be sure that it included basic editing commands that may not have been required for an interface like the DOM, which had a specific purpose.

The ITreeNode interface is first and foremost designed to provide the primitives to allow a DOM wrapper to be written around it, thus making the HCS XML compatible.

The HCS does differ from the DOM in that it needs to maintain its own element types. Therefore any element types created using the DOM interface are in fact stored as 'DOM type' attributes inside the HCS.

ITreeNode is not in itself a full implementation of DOM. It merely provides the capability for an external wrapper to use the HCS to provide a full implementation of DOM. For example, the DOM 'Element' interface has a command 'removeAttributeNode'. The standard states that when this command is invoked, “If the removed attribute has a default value it is immediately replaced.” The HCS does not provide this functionality. A DOM wrapper around the HCS would implement this functionality.
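A wrapper of the kind described might restore default values as in the sketch below. HcsNode and DomElementWrapper are hypothetical names written in Python for illustration; they are neither the ITreeNode interface nor a real DOM binding, and the defaults table is assumed to be supplied by the wrapper's own schema.

```python
class HcsNode:
    """Stand-in for an HCS node: it stores attributes and knows nothing of defaults."""

    def __init__(self) -> None:
        self.attributes = {}

    def remove_attribute(self, name: str) -> None:
        self.attributes.pop(name, None)


class DomElementWrapper:
    """Hypothetical DOM-style wrapper that layers default handling over the node."""

    def __init__(self, node: HcsNode, defaults: dict) -> None:
        self._node = node
        self._defaults = dict(defaults)  # assumed to come from the wrapper's schema

    def get_attribute(self, name: str):
        return self._node.attributes.get(name)

    def set_attribute(self, name: str, value) -> None:
        self._node.attributes[name] = value

    def remove_attribute(self, name: str) -> None:
        self._node.remove_attribute(name)
        # The store simply removes; the wrapper puts any default value back.
        if name in self._defaults:
            self._node.attributes[name] = self._defaults[name]


element = DomElementWrapper(HcsNode(), defaults={"align": "left"})
element.set_attribute("align", "right")
element.set_attribute("id", "intro")
element.remove_attribute("align")  # reverts to the default
element.remove_attribute("id")     # no default, so it is simply gone
```

Keeping this behaviour in the wrapper leaves the store itself free of DOM-specific rules.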

The DOM also defines 'attribute nodes'. These could be stored as invisible nodes or as referenced objects within the HCS. The HCS does not directly understand the concept of 'attribute nodes'.

Please note that the DOM specification is large. It has not been exhaustively checked for compatibility with the HCS.

IW00t

There is in fact no IW00t interface [32]. IW00t refers to the requirement to support collaborative editing.

The HCS-Collaborative Protocol (HCS-C) was designed to allow collaborative editing within HCS enabled client applications. HCS-C is a preliminary draft protocol, designed merely to indicate that the UFC framework is flexible enough to cope with collaborative editing of documents. HCS-C was not implemented in the proof of concept artefact.

There seem to be two main ways to implement collaborative editing. The first is 'Operational Transformation' (OT) [11], used by Google Wave [9] and Google Docs [10]; the second is the family of 'Post-OT' methods such as 'WithOut Operational Transformation' (WOOT). [12]

As no more than a cursory analysis of existing collaborative models was carried out, HCS-C is merely a design of the simplest system that could work within the HCS. Any serious design of a collaborative system for the HCS would have to include reference to Grishchenko's work using causal trees. [13]

How does HCS-C compare to OT and WOOT?

• OT and WOOT attempt to allow collaborative editing down to the character level. The HCS only allows collaborative editing down to the element level: with the twist that placing the cursor will usually create a new version of the element.

• OT and WOOT don't directly attempt to deal with text formatting or range attributes. The HCS does.

• The HCS-C approach has a similarity to the WOOT approach in that when text is deleted it is merely marked as 'deleted' but not actually removed. The HCS does this at the element level while WOOT does it at the character level.

• WOOT doesn't require vector clocks. OT does (various types). For HCS-C it's a bit grey. The element IDs become a sort of vector clock. The HCS server and HCS client translate between server and client element IDs. Everything below a cursor lock is an HCS client element ID and will ultimately be changed to an element ID supplied by the HCS server.

• WOOT uses an all-peers arrangement. OT and HCS-C require a 'master' copy of the document.

• HCS-C is simpler than both OT and WOOT. It tries to be the simplest solution that will work in most situations without any particular guarantees. Partly it can remain simple because in the case of a conflict it stores the conflict for later resolution by the user; a type of optimistic replication. [33]

For greater detail about the design of the HCS-C (IW00t) protocol please refer to Appendix A.

Other Interfaces

There are several other interfaces for client application access to the UFC and HCS. Please refer to Appendix H.

5.3 Code Highlights

Most of the artefact code is fairly straightforward. All elements inherit from a base element as they share many attributes and functionality. Node elements inherit from range elements.

The only notably complex code was the code to provide range iteration. The code complexity increased as all the corner cases were discovered. Ultimately the complexity was resolved by creating a very simple iteration class (RangeIterator) and wrapping that class in progressively more functional classes (TextRangeIterator, VisibleRangeIterator). Classes such as TreeNode then use these iterators and add additional logic to return the desired result.
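The layering idea can be sketched with three small generators, each wrapping the previous one. The function names echo the class names above, but the bodies are an illustrative reconstruction in Python under an assumed dictionary encoding of the stream; they are not the prototype's C# classes.

```python
def range_iterator(stream):
    """Innermost layer: yield every element of the stream, in order."""
    yield from stream


def text_range_iterator(elements):
    """Middle layer: fold attribute changes into each block of text."""
    attributes = {}
    for element in elements:
        if element["kind"] == "attr":
            attributes[element["name"]] = element["value"]
        else:
            hidden_from = element.get("hidden_from", frozenset())
            yield element["text"], dict(attributes), hidden_from


def visible_range_iterator(blocks, view):
    """Outer layer: drop the blocks that are hidden from the given view."""
    for text, attributes, hidden_from in blocks:
        if view not in hidden_from:
            yield text, attributes


stream = [
    {"kind": "attr", "name": "Bold", "value": True},
    {"kind": "text", "text": "Mary "},
    {"kind": "text", "text": "had a little lamb,", "hidden_from": {"red"}},
    {"kind": "attr", "name": "Bold", "value": False},
    {"kind": "text", "text": " and everywhere..."},
]
red_blocks = list(
    visible_range_iterator(text_range_iterator(range_iterator(stream)), "red")
)
```

Because visibility is applied only in the outermost layer, attribute tracking in the middle layer still sees every element, which is what the 'Skipping Path' method requires.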


6. Discussion

The first purpose of this paper, as set out at the beginning, was to reduce the need for file conversions, allow documents to have multiple views, and allow for all the variety of functionality and purpose provided by the many file formats already in existence, and any that might come along in the future.

The proposed solution does not eliminate the requirement for document conversion but it does save the need to convert a document multiple times. Also, a converted document can be 'tweaked' and have the changes reflected in the original version of the document.

The combination of the UFC framework and the HCS mostly allows for multiple views and variety of functionality by declining to take responsibility for most of these requirements. These requirements are mostly the responsibility of the 'host' document formats. Those requirements that are the responsibility of the UFC/HCS have been discussed and designed to some extent. They have not been designed to the level of detail that would be required before an implementation was attempted. There are many 'holes' in the argument, and many corner cases that have not been discussed. This comes under the scope of future work.

The second aim was to create an artefact that demonstrates the viability of the replacement for the in-memory document representation used by client applications; i.e. the Hierarchical Content Store, the HCS. The artefact successfully achieves this goal. Documents can be created with multiple views, and those views can contain different content.

The artefact does begin to highlight some of the user interface issues that would come with implementing a UFC/HCS solution. An application user interface would be complicated by the inclusion of multiple cut and paste options, some sort of text 'versions' interface, and too many choices about how to open and edit a document. For these reasons the proposed solution may be beyond the average users of the average application.

7. Conclusion

In this paper, we have seen a way that we might be able to have existing client applications and file formats share document content without compromising the functionality and advantages of those existing formats. The concept would mitigate but not remove the need for file format conversion.

The Unified File Container (UFC) provides a credible framework in which existing client applications and formats could exist. The complementary software artefact has demonstrated the viability of providing a general and reusable in-memory document store which client applications could use, and the framework enables documents to share document content.

We have also seen some interesting and perhaps novel approaches to in-memory document editing, and ways to provide multiple views of a document.


8. Appendices

8.1 Appendix A – HCS-C

In HCS-C collaborative editing comes in two flavours: 'remote' and 'off-line'.

Remote Collaborative Editing

Remote editing is the process used when the 'master' copy of the document is available via an HCS-C server. The client HCS may be able to expect very quick response from the HCS-C server, or the response may be delayed. The protocol works slightly differently in these cases.

Editing an Element:

The client HCS gets a cursor lock when an element is to be changed. The leaf element is converted to a parent range and the new text is inserted into a new element created at the cursor. The new element is cursor locked. The client HCS periodically sends an update to the server HCS. At that point the locked element is updated and a new locked element is created. The client HCS does not have to wait for this; it can just assume that a new locked element will be created and assign it its ID when a response is received from the server HCS. When the cursor is moved the client HCS sends a final element update and an UnlockCursor signal to the server HCS for the locked element.

Note that the client HCS has not had to wait for a response from the server before it allows the user to start editing the new locked element. Also note that if the server HCS doesn't hear from a client HCS for a specified time-out (t) it can commit and unlock a locked element. The client HCS must assume that a time-out has occurred after a shorter interval (t/n, where n > 1) to remove any possibility that the client HCS keeps sending updates to an element that is no longer locked.
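The client-side time-out rule can be sketched as below, on the assumption (our reading of the text) that the client divides the server time-out t by some factor greater than one. The class name ClientCursorLock, the default divisor of 2 and the use of plain numeric timestamps are all choices made for this Python illustration; HCS-C was not implemented in the artefact.

```python
class ClientCursorLock:
    """Client-side view of a cursor lock that expires before the server's copy."""

    def __init__(self, server_timeout: float, divisor: float = 2.0) -> None:
        if divisor <= 1:
            raise ValueError("divisor must be greater than 1")
        # The client gives up earlier than the server would (t / n with n > 1).
        self.client_timeout = server_timeout / divisor
        self.last_update = None

    def record_update(self, now: float) -> None:
        """Call when an element update has been sent to the server HCS."""
        self.last_update = now

    def may_send_update(self, now: float) -> bool:
        """True while the client may still treat the element as locked."""
        if self.last_update is None:
            return False
        return (now - self.last_update) < self.client_timeout


lock = ClientCursorLock(server_timeout=30.0, divisor=3.0)
lock.record_update(now=100.0)
still_locked = lock.may_send_update(now=105.0)   # 5 s since the last update
assumed_lost = not lock.may_send_update(now=111.0)  # 11 s exceeds the 10 s limit
```

With a 30 second server time-out and a divisor of 3, the client stops sending updates after 10 seconds of silence, well before the server could have unlocked the element.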

If a second client HCS attempts to edit the same element the same pattern is repeated. The second client HCS can be allocated its own locked element; however the second client HCS will not be allowed to send updates to the locked element. The length of the locked element will depend on how frequently element updates are being sent to the server HCS.

Range Changes:

The client HCS requests a lock on the lowest parent that encompasses the range. When the change has been made the client HCS sends the changes to the server HCS and they are committed and the range is unlocked.

A second client HCS can attempt to obtain a range lock. If the network is responsive then it will be told that the selected range already has a lock. The client application can choose to wait for the lock to be granted, or can continue, in which case the final commit/unlock will be marked as a 'conflict'.

Note again that the client HCS does not have to wait for a range lock to be granted by the server HCS. It can continue in the belief that a lock has been granted. If duplicate locks are granted, when a client HCS sends a commit/unlock, the ranges are duplicated and marked as 'conflicts'.

Conflicts:

It may be that the server HCS can merge many types of conflicts. For example, if the only changes were to range attributes and none of the range attributes were in conflict then the range changes could perhaps be merged safely.

Conflicts are available to the HCS-C implementation. Each GetConflicts() function returns a list of ITextRange that conflict with the committed range. Any client HCS can send a message to the server to commit a conflicted range. If two clients simultaneously commit different conflicted ranges then the server HCS just marks them as conflicted again.

Off-line Collaborative Editing

Off-line mode works identically to 'remote' mode except of course that the client HCS should not expect to get a response from the server HCS when requesting an element or range lock. Commits and unlocks are delayed until the client HCS is again online. It is to be expected that there would be more conflicted elements when working off-line.

Like WOOT, HCS-C will over time create a great many small elements and ranges. For example, every time a user moves the cursor a new element may be created. However, periodical host HCS commits executed when there are no connected HCS clients can clean up the excess elements and ranges.


HCS-C Protocol Messages: (initial draft)

Note that all messages are assumed to have embedded the client ID, the document URL, and a request ID. For that reason the messages below don't show these parameters. The type 'XML' refers to the string value that HCS elements and other constructs can import and export. Also note that messages can be delivered in 'batch' mode to reduce traffic between client and server.

Client → Server

| Message | Response | Notes |
| --- | --- | --- |
| ReqViewList() | RespViewList(XML view_list) | |
| ReqView(int view_id) | RespView(XML view) | |
| ReqCursorLock(int element_id, int offset) | RespCursorLock(int element_id, boolean granted, optional int locked_element_id) | |
| SendElementChange(int element_id, XML element) | RespElementChange(int element_id, boolean conflict, optional XML conflicts) | 1, 2 |
| SendElementCommit(int element_id, XML element) | RespElementCommit(int element_id, boolean conflict, optional XML conflicts) | 3, 4 |
| ReqElement(int element_id) | RespElement(int element_id, XML element) | 5 |
| ReqResolveConflict(int element_id, int conflict_element_id) | RespResolveConflict(int element_id, boolean success, optional XML conflicts) | 6 |
| ReqSetElementVersion(int element_id, int version_element_id) | RespElementVersion(int element_id, boolean success, optional XML conflicts) | 7 |
| ReqPingResponse() | RespPingRequest() | |

Server → Client

| Message | Response | Notes |
| --- | --- | --- |
| SendElementChange(int element_id, XML element) | RespElementChange(element_id) | 8 |
| ReqPingResponse() | RespPingRequest() | |

1. If the client was granted a cursor lock then they send back the newly created locked element. If not, they send back the smallest encapsulating range of the new locked element. It is in this case where there may be an element conflict.

2. If the client does not receive a response to an element change request it should not immediately resend the request. It is not necessary: either the change request is sent in the next periodic client → server update, or when the element is committed.

3. If a client does not receive a response to a commit it must continue to resend the commit until it gets a response. At that time the client can unlock the local element.

4. If a client updates an element for which it has not received a lock, it may create an element 'conflict'. The client application can choose to show the conflicts to the user who can select the winner from them. As each view can only be held open by one client at a time, if a client tries to open a view that is already open they are given a copy of the original view. If a conflict occurs, the version created by each client will be the default version seen by that client until the conflict has been resolved.

5. The ReqElement message will usually be sent for the root range. The returned element stream overwrites the local stream except where the local tree is marked 'locked'.

6. It is not important that the client receives a resolve conflict response. If a conflict remains it will appear the next time the client refreshes the local copy of the document. The response may indicate that the resolve conflict was not successful in the case where two users selected two different versions of the conflict at the same time. Again this is not a problem: the conflict can be resolved at a later time.

7. A set element version request is very similar to a resolve conflict; however, if there is a conflict the version requests are converted to conflicts.

8. In the initial hand-shake the client and server can negotiate for the server to use a push model to update the client. This reduces the need for the client to constantly ask for updates.
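As a rough illustration of how the draft messages might be carried, the sketch below wraps the embedded client ID, document URL and request ID in an envelope and handles a batch of cursor lock requests. The Envelope and ToyServer classes, the lock-granting rule and the element ID counter are all assumptions made for this Python example; the draft protocol does not define them, and HCS-C was not implemented.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Envelope:
    """Parameters assumed to be embedded in every message."""
    client_id: str
    document_url: str
    request_id: int


@dataclass(frozen=True)
class ReqCursorLock:
    envelope: Envelope
    element_id: int
    offset: int


@dataclass(frozen=True)
class RespCursorLock:
    envelope: Envelope
    element_id: int
    granted: bool
    locked_element_id: Optional[int] = None


class ToyServer:
    """Simplified server: one cursor lock holder per element (an assumption)."""

    def __init__(self) -> None:
        self.locks: Dict[int, str] = {}  # element_id -> client_id holding the lock
        self._next_element_id = 1000     # hypothetical server-side ID counter

    def handle(self, request: ReqCursorLock) -> RespCursorLock:
        holder = self.locks.get(request.element_id)
        if holder is not None and holder != request.envelope.client_id:
            return RespCursorLock(request.envelope, request.element_id, granted=False)
        self.locks[request.element_id] = request.envelope.client_id
        self._next_element_id += 1
        return RespCursorLock(
            request.envelope, request.element_id, True, self._next_element_id
        )

    def handle_batch(self, requests: List[ReqCursorLock]) -> List[RespCursorLock]:
        """'Batch' mode: several messages processed in one round trip."""
        return [self.handle(r) for r in requests]


server = ToyServer()
alice = ReqCursorLock(Envelope("alice", "doc.ufc", 1), element_id=7, offset=3)
bob = ReqCursorLock(Envelope("bob", "doc.ufc", 1), element_id=7, offset=5)
alice_resp, bob_resp = server.handle_batch([alice, bob])
```

Echoing the envelope in each response lets a client match replies to requests even when they arrive in a batch.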


8.2 Appendix B – The 'Horror Movie' and 'Zoo Trip' Analogies

'Skipping Path' – The 'Horror Movie' Analogy

The 'Skipping Path' is analogous to watching a romantic/horror/drama movie. You want to watch the horror and the drama but you don't like the romance, so you close your eyes (not leave the room) during the romantic bits. Your friend doesn't like the horror bits so he closes his eyes during the scary bits. You have both sat through the entire movie, but you have each seen only the bits that are to your taste. The movie is equivalent to a document: you have both iterated through the entire document, but you have each seen the bits of the document that suit your taste.

'Divergent Path' – The 'Zoo Trip' Analogy

The 'Divergent Path' option is analogous to a day trip to the zoo. You want to see the 'dangerous' animals, and your friend wants to see the 'Australian Native' animals. You check the sign at the entrance and see that you should follow the 'red' path, and your friend should follow the 'blue' path. At each intersection you will see a sign-post indicating which way to go for your chosen colour. At some intersections the sign-post will have several colours, and at some stops the sign-post will only have one colour and one option (presumably the colour that you are following).

Each sign-post has the following options; you check each option, starting at the top, to find what applies to you:

1. Red/Yellow → take this path (text exclusive to these views)

2. Everybody else except Blue → take this path (text hidden for Blue)

3. Everybody else → take this path (you may or may not see this text depending on the previous intersections)

4. Old pathways → nobody go this way (old versions, deleted text)

Whichever route you each take, you will all end up at the exit gate.

The 'Zoo' is the document. The red and yellow paths are different views of that document depending on the interests of your reader.


8.3 Appendix C – 'Skipping Path' vs 'Divergent Path' - A Diagrammatic View

The 'Skipping' and 'Divergent' path approaches can also be seen in the following diagram. The 'Divergent Path' elements 'had', ' lamb', '</b>', and 'and' each have a signpost indicating which way green and red should go. Similarly, the 'Skipping Path' has '<skip>' and '</skip>' elements indicating sections of the text that green and red should skip over. Note that even though the bold end element '</b>' appears during a section of text that is being skipped by red, red does not skip it. Attribute changes are always recognised, even if they appear in hidden sections. In this case, if the '</b>' element was skipped, red's text would incorrectly continue to be displayed as bold.

Document - View 1: Mary had a little lamb, its fleece was white as snow, and everywhere...

Document - View 2: Mary had an evil lamb with fangs as sharp as knives, and everywhere...

[Figure: the two views above drawn first as a 'Divergent Path' element stream and then as a 'Skipping Path' element stream. The stream elements include 'Mary', '<b>', 'had', 'lamb,', '</b>' and 'and everywhere...', together with the view-specific text; the Skipping Path stream brackets hidden text with '<skip>' and '</skip>' elements.]


8.4 Appendix D – 'Virtual Tree' Model (superseded)

The following diagram depicts the 'Virtual Tree' model. In this model the 'core' document content is stored in a hierarchical tree and each view of the document has a virtual tree which links to the core tree at some nodes. This model proves inadequate to properly support overlapping attribute ranges. The problems with this method are discussed in 'Hierarchical Content vs Ranges', and depicted in Appendix E.


8.5 Appendix E – Node Restructure: Multiple Parent Nodes (superseded)

This diagram depicts the node restructuring required when applying attributes to a content range that spans multiple parent nodes. This is a fairly simple example but it can become quite complex to implement when ranges start and finish at different levels in the hierarchy. As discussed in 'Hierarchical Content vs Ranges', this approach was superseded by the 'Element Stream' approach, which does not suffer from the same problems.

[Figure: node restructuring for the instruction AddRangeAttributes(A, B, {Colour == Magenta}).

Initial State: the rhyme 'Mary had a little lamb.¶Its fleece was white as snow.¶And everywhere that Mary went,¶that lamb was sure to go.¶It followed her...' is held as leaf nodes under two parent nodes (Node ID == 1, Bold == True, and Node ID == 5, Italic == True). Cursor A sits before 'Its fleece' and Cursor B sits after 'was sure to go.¶'.

Final State: nodes 1 and 5 remain, and two new nodes (ID == 10 and ID == 12), each with Colour == Magenta, have been created to cover the text between the cursors. The diagram notes that the range attributes are repeated.]


8.6 Appendix F – Existing Document Editors vs HCS Enabled Editors

[Figure: two architectures shown side by side.

Existing Document Editors: the Microsoft PowerPoint Editor and OpenOffice Writer each have their own User Interface, In-Memory Structure, and File Load/Save Service, and each works with its own file on disk (MyFile.PPT and MyFile.ODT). A Format Conversion Utility (.PPT → .ODT) sits between the two files.

HCS Enabled Document Editors: an HCS Enabled Slide Editor and an HCS Enabled Page Editor each keep their own User Interface and use a wrapper (PowerPoint/HCS Wrapper, Writer/HCS Wrapper) over the IUFC, ITreeNode, and ITextRange interfaces. Both connect to a single UFC Service, which provides the In-Memory Structure and the File Load/Save Service for one file on disk, MyFile.UFC, shown with content.HCS, MyFile.ODT, and MyFile.PPT.]


8.7 Appendix G – More about the UFC and the HCS

Objects

The HCS element stream only stores text, fields, and references to objects; it does not store the objects themselves within the stream. The objects must be stored as resources within the UFC file. It is up to the client application to ensure that the object references match the available resources.

The 'OpStack'

Every element in the stream maintains the stack of the operations that are performed. This provides undo and redo capability. As operations are performed within the 'all views' context or within a single view context, the undo and redo also maintain the context. The undo/redo stack cannot be maintained by a client application because two client applications might be working within the same element or range using the 'all views' context.

The Undo/Redo (OpStack) functionality has not been designed. It is anticipated that such a design would not be overly complex or difficult; however, there are likely to be some issues when documents are edited collaboratively.

Undo/Redo

The Undo/Redo capability also means that changes to the tree are not finalised. For example, a node that is deleted will not be removed from the tree; it will merely be marked as hidden even if there is no view conflict. Therefore we need the 'commit' functionality.

There is a known issue with Undo/Redo when using collaborative editing. Undo/Redo will apply to edits made by all users, however the server HCS can only apply changes when the client HCS sends updates, so the order of Undo/Redo may not be as expected, and may in some cases produce strange results.

Commits

It is expected that 'Commits' would be run either when opening or when saving a document. This is client application specific: commits must be requested. The commit function takes a parameter to indicate how many previous operations not to commit in order to maintain at least a little change history. The 'Commit' function can also avoid committing deletes and inserts in order to provide the option of a visible change history to the client application.
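Since the OpStack has not been designed, the following is only one possible shape for a per-element operation stack with a commit that keeps the most recent operations undoable. The class layout, the insert-only operations and the keep parameter name are all assumptions made for this Python sketch.

```python
class OpStack:
    """Hypothetical per-element operation stack (insert operations only)."""

    def __init__(self, text: str = "") -> None:
        self.base = text   # committed content; no longer undoable
        self.ops = []      # applied but uncommitted (offset, value) inserts
        self.undone = []   # operations available for redo

    @staticmethod
    def _apply(text: str, op) -> str:
        offset, value = op
        return text[:offset] + value + text[offset:]

    def insert(self, offset: int, value: str) -> None:
        self.ops.append((offset, value))
        self.undone.clear()  # a new edit invalidates the redo history

    def undo(self) -> None:
        if self.ops:
            self.undone.append(self.ops.pop())

    def redo(self) -> None:
        if self.undone:
            self.ops.append(self.undone.pop())

    def text(self) -> str:
        """Current content: the committed base plus every uncommitted operation."""
        result = self.base
        for op in self.ops:
            result = self._apply(result, op)
        return result

    def commit(self, keep: int = 0) -> None:
        """Finalise all but the newest 'keep' operations."""
        cut = max(len(self.ops) - keep, 0)
        for op in self.ops[:cut]:
            self.base = self._apply(self.base, op)
        self.ops = self.ops[cut:]


stack = OpStack("Mary")
stack.insert(4, " had")
stack.insert(8, " a lamb")
stack.commit(keep=1)   # ' had' becomes permanent; ' a lamb' stays undoable
stack.undo()           # removes ' a lamb'
stack.undo()           # nothing left to undo: ' had' was committed
after_undo = stack.text()
stack.redo()
after_redo = stack.text()
```

A real design would also need delete and attribute operations and, as the appendix notes, a way to preserve the 'all views' or single view context of each operation.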

Automatic Views

The HCS always saves text-only copies of all views whenever the document is saved. If any view of any document has ever been saved as a read-only version (typically PDF), then that read-only version will either be deleted or updated when the document itself is saved.

Persistent Ranges

HCS-C has a concept of locking ranges and elements while performing collaborative updates. However, within the HCS a client application can also persist a range. A persistent range is a range that is guaranteed not to be removed by the HCS during a commit cycle. A client application would persist a range or element when it retains an external reference to that element, perhaps within its own client document format.
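
The interaction between a commit, the retained change history, and persistent ranges can be illustrated with a minimal sketch. It assumes that every operation carries a sequence number and that a deleted element records the number of the operation that deleted it; the names Element, deleted_by_op, and last_op are hypothetical and not part of the HCS design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    text: str
    deleted_by_op: Optional[int] = None  # hypothetical: sequence number of the delete
    persistent: bool = False


def commit(elements: list, last_op: int, version_history: int) -> list:
    """Return the elements that survive a commit.

    Deletes made by the most recent `version_history` operations are kept
    (still hidden) so that they can be undone; older deletes are finalised.
    """
    cutoff = last_op - version_history  # operations numbered <= cutoff are committed
    kept = []
    for el in elements:
        finalised = el.deleted_by_op is not None and el.deleted_by_op <= cutoff
        if finalised and not el.persistent:
            continue  # physically remove the element
        kept.append(el)  # persistent elements are never removed by a commit
    return kept
```

With version_history set to zero every non-persistent deleted element is removed; a larger value leaves the most recent deletes in place as a short change history.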

Punctuational Mark-up

A production implementation of the HCS should consider field types dedicated to punctuational mark-up. Coombs et al. [6] make a strong case for separating punctuation from the content.

Remote Content

A field can also reference remote HCS content. Such content is inserted into the content stream in place of the field. Such content may or may not be dynamic. All fields have a default text description; this text is inserted into the stream until remote content has become available to the HCS. Remote content may or may not be editable, depending on the authentication arrangements, but the local client application can mark such content as hidden, allowing the user to customise remote content for their local view. This concept is much like web-site 'mash-ups' [35].
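
The fallback to default text can be sketched as follows. This is an illustration only: RemoteField and render are hypothetical names, and the stream is modelled as a flat list of strings and fields rather than the HCS element tree.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RemoteField:
    default_text: str
    remote_content: Optional[str] = None  # None until the remote HCS responds


def render(stream: list) -> str:
    """Flatten a stream of plain strings and RemoteField items to text."""
    parts = []
    for item in stream:
        if isinstance(item, RemoteField):
            # Fall back to the default text while remote content is unavailable.
            if item.remote_content is None:
                parts.append(item.default_text)
            else:
                parts.append(item.remote_content)
        else:
            parts.append(item)
    return "".join(parts)
```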

Multiple Language Support

As ranges of content can have multiple versions, those versions could also be other language versions of the same content. A single document could contain all the available translations within that same document. Editing text in any language version could mark the other language versions as 'out-of-date' or 'invalid' until they have been updated and verified.
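
One way this marking could work is sketched below. The class TranslatedRange and its 'stale' set are hypothetical; the sketch simply assumes that a range keeps one text version per language code.

```python
from dataclasses import dataclass, field


@dataclass
class TranslatedRange:
    """Hypothetical range holding one text version per language code."""
    versions: dict = field(default_factory=dict)  # language code -> text
    stale: set = field(default_factory=set)       # languages needing review

    def edit(self, language: str, text: str) -> None:
        # Editing one language marks every other language version out-of-date.
        self.versions[language] = text
        self.stale = {lang for lang in self.versions if lang != language}

    def verify(self, language: str) -> None:
        # A translator confirms that this version matches the latest edit.
        self.stale.discard(language)
```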

8.7 Appendix H – The APIs

ITextRange

The ITextRange interface is concerned with editing the text content of documents.

| Method Name | Parameters (Type Name) | Return | Notes | Implemented? |
|---|---|---|---|---|
| GetCursor | Cursors {A, B, C, D} cursor | ICursor | | Yes |
| SetCursor | Cursors cursor; Integer offset | ICursor | SetCursor can be called on the root node or any node in the tree. Each node can maintain its own set of cursors, but these cursors cannot extend outside the range of that node. If SetCursor() is called with an offset greater than the length of the node, the instruction is ignored and returns NULL. For a simple text document the SetCursor() instruction would usually be issued to the root node. | Yes |
| ClearCursor | Cursors cursor | void | | Yes |
| SetLineCursor | Cursors cursor; Integer line_offset; Integer column_offset | ICursor | | No |
| MoveCursor | Cursors cursor; CursorDirections {Up, Down, Left, Right} direction; Integer distance | Boolean | 'Up' and 'Down' movements will retain the same offset in the new line as in the previous line if possible; however, they do not take into account current fonts, character widths, etc. These must be maintained by the client application. A return of 'false' means that the cursor was not set or could not be moved the full distance. | Partially |
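
The cursor rules in the notes above (an out-of-range SetCursor is ignored and returns NULL; MoveCursor returns false when the cursor is not set or cannot travel the full distance) can be sketched as follows. The class TextNode is hypothetical, only 'Left' and 'Right' are modelled, and rejecting negative offsets and clamping a partial move to the node boundary are assumptions of this sketch.

```python
from typing import Optional


class TextNode:
    """Hypothetical node holding text and a set of named cursors (A to D)."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.cursors = {}  # cursor name -> offset within this node

    def set_cursor(self, name: str, offset: int) -> Optional[int]:
        # An offset outside the node is ignored and None (NULL) is returned.
        if offset < 0 or offset > len(self.text):
            return None
        self.cursors[name] = offset
        return offset

    def move_cursor(self, name: str, direction: str, distance: int) -> bool:
        if direction not in ("Left", "Right"):
            raise ValueError("only 'Left' and 'Right' are sketched here")
        if name not in self.cursors:
            return False  # the cursor was never set
        step = -distance if direction == "Left" else distance
        target = self.cursors[name] + step
        clamped = max(0, min(len(self.text), target))
        self.cursors[name] = clamped
        return clamped == target  # False if the full distance was not covered
```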

| Method Name | Parameters (Type Name) | Return | Notes | Implemented? |
|---|---|---|---|---|
| MakeNode | Cursors start; Cursors end | ITextNode | MakeNode() ensures that the range (e.g. CursorB → CursorA) is a separate node, and then returns the node. If the range covers multiple parent nodes, the nodes from the later parent will be moved to be part of the earlier parent (and thus their attributes may change). | No |
| MakeRange | Cursors start; Cursors end | ITextRange | MakeRange is similar to MakeNode in that it ensures that the range can be referenced. For example, attributes can be applied to a range. However, unlike nodes, ranges can overlap. Many instructions will call MakeRange() before making their changes (SetRangeScope(), DeleteRange(), TwinRange(), etc). | Yes |
| SetScope | Boolean exclusive | ITextRange | Instructions to the HCS either apply to all views of the document or only to the current view context. This is a global value and applies to all changes made anywhere within a view. | Yes |
| SetRangeScope | Cursors start; Cursors end; Boolean this_view_only | ITextRange | SetRangeScope finds or creates a range, applies the scope and returns it. | No |
| GetCursorScope | | Boolean | Returns the scope of the node pointed to by the given cursor. True = this_view_only, False = all_views. | No |
| InsertText | Cursors cursor; String text; IList<IAttribute> attributes | | Where the new text is the first insert in the element since the cursor was set, Insert moves the text of the element into a child leaf node and copies the text into another 'version' element. The element becomes a 'version' parent. A version parent is a special case of a parent element that cannot be 'seen' by client applications. It contains a list of elements and a pointer to which of those elements is the current version (allowing Undo and Redo). Where the various inserts create a new element, that new element is returned. As appropriate the cursor is also automatically updated. | Partially |
| InsertXML | Cursors cursor; XML node | ITextRange | InsertXML is for pasting elements from the clipboard or other sources that are not within the HCS. | No |
| InsertField | Cursors cursor; FieldTypes field_type; String value; String default_text | ITextRange | Fields represent common formatted objects such as dates, BibTex references, floating point numbers, and more. The contents are stored in the node. | No |
| InsertObject | Cursors cursor; ObjectTypes object_type; String object_ref; String default_text | ITextNode | More complex objects such as bitmap images, vector drawings, or sound files are stored in the node as references to files or byte arrays stored in the HCS as resources, or stored elsewhere. 'ObjectTypes' includes all MIME types as well as a 'custom' option to be interpreted by the client application. The default text is returned to client applications that do not interpret fields or objects; however, such clients cannot edit the field/object or its default text. | No |
| DeleteChar | Cursors cursor; CursorDirections direction | void | DeleteChar deletes a single character in the direction indicated. 'Up' and 'Down' are treated as 'Left' and 'Right' respectively. See 'InsertText' to see what happens when an element is first edited. | No |
| DeleteRange | Cursors start; Cursors end | XML | Ranges are not actually deleted. They are just marked as invisible for the current scope. They may ultimately be removed completely when the 'Commit' function is invoked. | Yes (as DeleteText) |
| PasteRange | Cursors from_start; Cursors from_end; Cursors to; PasteOptions {Move, Copy, Twin} paste_option | ITextNode | PasteRange makes the pasted range visible to all views, i.e. it inherits its visibility from its parent element. So normal practice is to do a DeleteRange followed by a PasteRange. If you want the source range to remain until it is pasted, issue a MakeRange followed by a PasteRange. The 'Twin' paste option copies a reference to the source range to the new location. Editing a twinned blip in any location will result in it being changed in all locations. Each twin maintains its own set of attributes. For a 'Move' range the HCS may copy the source range and then delete it, depending on whether the HCS is in HCS-C (collaborative) mode and the exact implementation of the Undo/Redo functionality. | No |
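
The difference between the 'Copy' and 'Twin' paste options described above can be illustrated with a small sketch. SharedText and RangeRef are hypothetical names; the sketch assumes that a twin starts with a copy of the source attributes, which the notes do not specify.

```python
class SharedText:
    """Text content shared by every twin of a range."""

    def __init__(self, text):
        self.text = text


class RangeRef:
    """One placement of a range: shared content plus its own attributes."""

    def __init__(self, content, attributes=None):
        self.content = content
        self.attributes = dict(attributes or {})

    def twin(self):
        # A twin references the same content object but keeps its own
        # attribute set, so edits propagate while formatting does not.
        return RangeRef(self.content, self.attributes)

    def copy(self):
        # A copy duplicates the content, so later edits do not propagate.
        return RangeRef(SharedText(self.content.text), self.attributes)
```

Editing the text through any twin changes it in every location, whereas a copy diverges from its source after the paste.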

| Method Name | Parameters (Type Name) | Return | Notes | Implemented? |
|---|---|---|---|---|
| StartIterator | | | The range iterators keep track of the attributes as they are added and removed while iterating through the range. Each iteration returns a section of text that has all the same attributes. The iterators act per view, automatically hiding content that is not visible for a view. | Yes (as StartRangeIterator) |
| GetNextIteration | OUT String text; OUT IList<IAttribute> attributes | | | Yes |
| CloseIterator | | | | Yes (as CloseRangeIterator) |
| GetAttributes | Boolean in_context | IList<IAttribute> | These functions work on ranges that have been created using 'MakeRange' or accessed using a range iterator or some other method. Cursors are not required as they apply to the entire range. GetAttributes() will retrieve the attributes of the element. The 'in_context' flag will cause the entire document to be scanned so that the attribute list is in the context of the entire document, not just the range. | No |
| SetAttributes | IList<IAttribute> attributes | | | No |
| AddAttribute | String key; String value | | | Yes |
| RemoveAttribute | String key | | | No |
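
The behaviour of the range iterators described above (each iteration returns a run of text sharing one attribute set, with content hidden from the current view skipped) can be sketched as follows. The flat per-character input format and the function name iterate_runs are assumptions made for the illustration; the HCS itself stores a tree of elements.

```python
from itertools import groupby


def iterate_runs(chars, view_id):
    """Yield (text, attributes) runs for one view.

    `chars` is a list of (character, attributes, hidden_in_views) tuples,
    where attributes is a frozenset of (key, value) pairs and
    hidden_in_views is the set of view IDs that must not see the character.
    """
    visible = [(ch, attrs) for ch, attrs, hidden_in in chars
               if view_id not in hidden_in]
    # groupby merges consecutive characters whose attribute sets are equal.
    for attrs, group in groupby(visible, key=lambda item: item[1]):
        yield "".join(ch for ch, _ in group), dict(attrs)
```

Because hidden content is filtered before grouping, two runs that are separated only by content invisible to a view are returned to that view as a single run.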

ITreeNode

The ITreeNode interface is concerned with manipulating nodes and the tree of nodes.

| Method Name | Parameters (Type Name) | Return | Notes | Implemented? |
|---|---|---|---|---|
| GetNodeByID | Integer node_id | ITextNode | | No |
| GetNodeAtCursor | Cursors cursor | ITextNode | | No |
| GetAttributes | Boolean recursive | IList<IAttribute> | The 'recursive' flag retrieves all the attributes of the parent nodes recursively until the root node is reached. Attributes found lower in the tree will not be overwritten by attributes higher in the tree. Where a list of attributes is returned it always represents the current attributes of the node. | No |
| SetAttributes | IList<IAttribute> attributes | | | |
| AddAttribute | String key; String value | | | |
| RemoveAttribute | String key | | | |
| ClearAttributes | | | | |
| GetParentNode | | ITextNode | These methods are useful for navigating the tree. | No |
| GetNextSibling | | ITextNode | | |
| GetPreviousSibling | | ITextNode | | |
| GetChildNodes | | IList<ITextNode> | | Yes (as ChildNodes) |
| GetText | | String | This returns the text content of leaf nodes or the text content of all the children of a parent node. In the latter case, field and object nodes will return the 'default text' associated with the field or object. | Yes |
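
The 'recursive' attribute lookup described above, in which attributes found lower in the tree take precedence over those higher up, can be sketched as follows. TreeNode is a hypothetical stand-in for an HCS node and attributes are modelled as a plain dictionary.

```python
class TreeNode:
    """Hypothetical node with a parent link and a dictionary of attributes."""

    def __init__(self, attributes=None, parent=None):
        self.attributes = dict(attributes or {})
        self.parent = parent

    def get_attributes(self, recursive=False):
        result = dict(self.attributes)
        node = self.parent
        # Walk up to the root; setdefault never overwrites a key that was
        # already supplied by a node lower in the tree.
        while recursive and node is not None:
            for key, value in node.attributes.items():
                result.setdefault(key, value)
            node = node.parent
        return result
```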

| Method Name | Parameters (Type Name) | Return | Notes | Implemented? |
|---|---|---|---|---|
| GetNodeText | | ITextRange | GetText returns a string with no applicable attributes. GetNodeText returns ITextRange which includes formatting and other attributes. | Yes (as NodeText) |
| SetText | String text | ITextNode | The existing content of the node will be inserted into a new child node and marked as deleted. The new text is also put into a new child node. This new child is returned. | Yes |
| ExportXML | | XML<node> | The contents of the node are exported in the XML format used and understood by the HCS. | No |
| ImportXML | XML node | ITextNode | This is similar to 'SetText'. The existing contents of the node are archived, marked as deleted, and the XML node is inserted. The new node is returned. Compare this to 'InsertXML'. | No |
| AddChildNode | NodeTypes node_type; String text | ITextNode | 'Add' and 'Insert' are intended for hierarchical editors. | Yes |
| InsertChildNode | Integer index | ITextNode | | No |
| DeleteChildNode | Integer index | XML<node> | | No |
| DeleteNode | | | 'DeleteNode' is merely a convenience that asks the node's parent to call 'DeleteChildNode'. | No |
| SetNodeScope | Boolean this_view_only | void | The node scope determines whether it is visible to all views or just to the current view. | No |
| GetNodeScope | | Boolean | | |
| SetPersistent | Boolean persistent | void | A 'persistent' node is guaranteed not to be deleted or merged by the HCS. Any client application or structure (e.g. the DOM wrapper) that refers to the node by node ID knows that it won't unexpectedly disappear. For a 'persistent' node to go away, either 'SetPersistent(false)' must be called explicitly, or a 'DeleteNode', 'DeleteChildNode()', or 'DeleteRange()' call must encompass the entire node. | |
| GetPersistent | | Boolean | | |
| SetAnchored | Boolean anchored | void | | |

| Method Name | Parameters (Type Name) | Return | Notes | Implemented? |
|---|---|---|---|---|
| GetAnchored | | Boolean | Child anchor nodes are invisible to the IText interface. Their length is counted as zero for SetCursor() instructions, and they are ignored for 'Attributes' instructions. An anchor node must be individually selected as an INode and then instructions performed on that INode. Note that a node can be anchored in one view but not anchored in another view, and the same applies to 'twinned' nodes. The purpose of anchor nodes is to help client applications implement headers, footers, floating frames, and similar. | |
| GetVersions | Boolean get_other_views | IList<ITextNode> | 'GetVersions' returns a list of old versions of the node, including any 'conflict' versions created by other users, and old or deleted versions, and optionally versions only visible to other views. A returned version can be one of Deleted, Version Conflict, Snapshot Version, or Other View. It is up to the client application to determine how to represent to the user the possible multitude of versions and versions within different hierarchies. | |
| SetVersion | Integer node_id | ?? | SetVersion archives and deletes the contents of the current node and replaces it with the selected version of the node. The node ID does not change. | |
| MakeVersion | | | 'MakeVersion' makes a snapshot version of the node. | |
| GetNodeID | | Integer | The node ID is guaranteed to be unique across the HCS. | |

IUFC

This interface is concerned with the management of the UFC and the resources stored within it.

| Method Name | Parameters (Type Name) | Return | Notes | Implemented? |
|---|---|---|---|---|
| IUFC (File) | | | | |
| SaveDocument | | Boolean | It is up to the client application to take care of asking the user for a file name and location. The HCS just saves the document with few extra smarts. If 'Save' is called on a new document, the HCS will just return 'false': not saved. These methods save the HCS only; any other files must be stored within the HCS by the client application. | Yes |
| SaveDocumentAs | String file_spec | Boolean | | Yes |
| Commit | Integer version_history | Boolean | A 'Commit' cleans up the stream of elements. It removes empty elements and merges elements that have identical attribute sets. It does not touch 'locked' elements and will not remove 'persistent' elements. Refer to the main report for more information. | No |
| IUFC (Views) | | | | |
| CreateInitialView | String description; ClientTypes client_type | IView | 'CreateView' can only be called on a new document that doesn't have any existing views. All subsequent views must be copies of previous views. The client type specifies the type of client application that is expected to edit this view of the document. Exactly which client application is launched to edit or show this view of the document will depend on the file associations of the host operating system. The description is free form text entered by the client application or the user. This description is used when a UFF file is being opened and the user must select which view to open. | Yes |
| AddView | Integer view_id; String description; ClientTypes client_type | IView | | Yes |
| GetViews | | IList<IView> | | Yes |
| SetViewContext | Integer view_id | | | Yes |

| Method Name | Parameters (Type Name) | Return | Notes | Implemented? |
|---|---|---|---|---|
| IUFC (Views) cont. | | | | |
| GetViewContext | | Integer | | Yes |
| GetRootNode | | ITreeNode | The root node is the starting point of all views of the document. | |
| IUFC (Resources) | | | The UFF is merely an uncompressed ZIP folder of files. The HCS file specification is a path that specifies the directory and file names of the resource within the UFF. The HCS takes no other responsibility for these resources. To store byte array resources the HCS merely reads them to and from a file as a convenience to the client application. The exact interface will depend largely on the language and environment in which the HCS is implemented. | No |
| GetResources | | IList<IResource> | | |
| StoreFile | String os_file_spec; String hcs_file_spec; String description | Boolean | | |
| RetrieveFile | String hcs_file_spec; String os_file_spec | Boolean | | |
| StoreBytes | Integer num_bytes; Byte[] bytes; String description | Boolean | | |
| RetrieveBytes | String hcs_file_spec; Integer return_num_bytes; Byte[] bytes | Boolean | | |
| DeleteResource | String hcs_file_spec | Boolean | | |
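
Because the UFF is described above as an uncompressed ZIP folder, resource storage can be illustrated with Python's standard zipfile module. This is a sketch only: the function names mirror the table but their signatures are simplified, an in-memory buffer stands in for the UFF file, and refusing to overwrite an existing resource is an assumption.

```python
import io
import zipfile


def store_bytes(container: io.BytesIO, hcs_file_spec: str, data: bytes) -> bool:
    """Add `data` to the archive under the path `hcs_file_spec`."""
    # ZIP_STORED writes members without compression.
    with zipfile.ZipFile(container, mode="a", compression=zipfile.ZIP_STORED) as zf:
        if hcs_file_spec in zf.namelist():
            return False  # this sketch refuses to overwrite a resource
        zf.writestr(hcs_file_spec, data)
    return True


def retrieve_bytes(container: io.BytesIO, hcs_file_spec: str):
    """Return the stored bytes, or None if the resource does not exist."""
    with zipfile.ZipFile(container, mode="r") as zf:
        if hcs_file_spec not in zf.namelist():
            return None
        return zf.read(hcs_file_spec)


def get_resources(container: io.BytesIO) -> list:
    """List the paths of all resources held in the archive."""
    with zipfile.ZipFile(container, mode="r") as zf:
        return zf.namelist()
```

The path given as hcs_file_spec doubles as the directory and file name inside the archive, which matches the description of the HCS file specification in the notes.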

Other Interfaces

| Method Name | Parameters (Type Name) | Return | Notes | Implemented? |
|---|---|---|---|---|
| IList<T> | | | The IList interface is a generic way to return read-only collections of nodes, attributes, and any other HCS constructs. The exact implementation of IList is dependent upon the language environment being supported by the HCS. This interface is intended to be descriptive of usage, not prescriptive. The T is a place-holder for whatever type is being stored within the IList. | |
| Enumerator | | Platform Dependent | | Yes |
| ItemAt | Integer index | T | | Yes |
| ItemByKey | String Key | T | | Yes |
| Count | | Integer | | Yes |
| IAttribute | | | Attributes come in several types. 'HCS' attributes are used by the HCS and are only visible to the HCS. 'DOM' attributes are used by wrappers that implement the Document Object Model (DOM) interface. 'Client' attributes are shared across all client applications that use the HCS. They include a basic set that all client applications are expected to implement in a similar way. 'Temp' attributes are removed by the HCS during a 'Commit'. They are a convenience for client applications to store temporary state. Extremely long or non-text attributes can be stored as a resource and referenced via a 'Ref' attribute. | |
| GetAttributeType | | AttributeTypes | | Yes |
| GetKey | | String | | Yes |
| GetValue | | String | | Yes |
| IView | | | | |
| GetViewID | | Integer | | Yes |
| GetDescription | | String | | Yes |

| Method Name | Parameters (Type Name) | Return | Notes | Implemented? |
|---|---|---|---|---|
| IResource | | | | |
| GetPath | | String | | |
| GetDescription | | String | | |
| ITextNode | | | The ITextNode interface concatenates both the ITextRange and ITreeNode interfaces. It is more convenient than casting nodes between ITextRange and ITreeNode depending on the context. Note that functions with identical signatures may have 'Range' or 'Node' inserted into the name to avoid conflicts. | |

Enumerations

| Enumeration | Values | Notes |
|---|---|---|
| AttributeTypes | NotSet, Client_Temp, Client, HCS, DOM, Ref | Refer to 'IAttribute.GetAttributeType' for more information about attribute types. |
| CursorDirections | Up, Down, Left, Right | Used by 'MoveCursor' and 'DeleteChar'. |
| PasteOptions | Move, Copy, Twin | Refer to 'ITextRange.PasteRange' for more information about these options. |
| ClientTypes | Yet to be determined | Refer to 'IUFC.CreateView' for more information. |
| FieldTypes | Date, Time, DateTime, URL, BibTex, Float, MIMEType, etc | |
| ObjectTypes | All_MIME_Types, Custom | |

References

[1] “Online Presentation Software | PowerPoint Online | Web Presentation | SlideRocket.” [Online]. Available: http://www.sliderocket.com/. [Accessed: 05-Oct-2010].

[2] “SlideRocket raises bar in online presentations as startups challenge PowerPoint | ZDNet.” [Online]. Available: http://www.zdnet.com/blog/btl/sliderocket-raises-bar-in-online-presentations-as-startups-challenge-powerpoint/400012?tag=nl.e539. [Accessed: 05-Oct-2010].

[3] I. Barnes, “The Digital Scholar’s Workbench (Slides),” 13-Jun-2007.

[4] “ICE: The Integrated Content Environment.” [Online]. Available: http://ice.usq.edu.au/. [Accessed: 04-Oct-2010].

[5] I. Barnes, “The Digital Scholar’s Workbench,” in Proceedings ELPUB2007 Conference on Electronic Publishing – Vienna, Austria, 2007.

[6] J. H. Coombs, A. H. Renear, and S. J. DeRose, “Markup systems and the future of scholarly text processing,” Communications of the ACM, vol. 30, no. 11, Nov. 1987.

[7] L. M. Gomez, D. F. Pratt, and M. R. Buckley, “Is universal document exchange in our future?,” in SIGDOC '88: Proceedings of the 6th annual international conference on Systems documentation, pp. 69–74, 1988.

[8] R. M. Adler, “Emerging standards for component software,” Computer, vol. 28, no. 3, pp. 68–77, Mar. 1995.

[9] “Google Wave - Wikipedia, the free encyclopedia.” [Online]. Available: http://en.wikipedia.org/wiki/Google_wave. [Accessed: 05-Oct-2010].

[10] “Google Docs - Wikipedia, the free encyclopedia.” [Online]. Available: http://en.wikipedia.org/wiki/Google_docs. [Accessed: 05-Oct-2010].

[11] “Operational transformation - Wikipedia, the free encyclopedia.” [Online]. Available: http://en.wikipedia.org/wiki/Operational_transformation. [Accessed: 05-Oct-2010].

[12] G. Oster, P. Urso, P. Molli, and A. Imine, Real time group editors without Operational transformation. INRIA, 2005, p. 24.

[13] V. Grishchenko, “Deep hypertext with embedded revision control implemented in regular expressions,” in WikiSym '10: Proceedings of the 6th International Symposium on Wikis and Open Collaboration, pp. 1–10, 2010.

[14] K. G. Kumar et al., “The HotMedia architecture: progressive and interactive rich media for the Internet,” Multimedia, IEEE Transactions on, vol. 3, no. 2, pp. 253–267, Jun. 2001.

[15] H. N. Chua, “Web-page adaptation framework for PC & mobile device co-browsing,” University of Nottingham, 2005.

[16] S. M. Chung and S. B. Jesurajaiah, “Schemaless XML document management in object-oriented databases,” in Information Technology: Coding and Computing, 2005. ITCC 2005. International Conference on, vol. 1, pp. 261–266, 2005.

[17] E. Bertino and B. Catania, “Integrating XML and databases,” Internet Computing, IEEE, vol. 5, no. 4, pp. 84–88, Jul. 2001.

[18] M. Yoshikawa and T. Amagasa, “XRel: a path-based approach to storage and retrieval of XML documents using relational databases,” ACM Trans. Internet Technol., vol. 1, no. 1, pp. 110–141, 2001.

[19] “Victorian Electronic Records Strategy - Forever Digital.” [Online]. Available: http://www.prov.vic.gov.au/vers/standard/. [Accessed: 09-Oct-2010].

[20] M. Bieber and T. Isakowitz, “A logic model for text editing,” in System Sciences, 1989. Vol.III: Decision Support and Knowledge Based Systems Track, Proceedings of the Twenty-Second Annual Hawaii International Conference on, vol. 3, pp. 543–552, 1989.

[21] C. Holm, M. Kruger, and B. Spuida, Dissecting a C# Application: Inside SharpDevelop, 1st ed. Wrox Press, 2003.

[22] O. Strömfors and L. Jonesjö, “The implementation and experiences of a structure-oriented text editor,” SIGPLAN Not., vol. 16, no. 6, pp. 22–27, 1981.

[23] “Object Linking and Embedding - Wikipedia, the free encyclopedia.” [Online]. Available: http://en.wikipedia.org/wiki/Object_linking_and_embedding. [Accessed: 06-Oct-2010].

[24] “Writer/Core And Layout - OpenOffice.org Wiki.” [Online]. Available: http://wiki.services.openoffice.org/wiki/Writer/Core_And_Layout. [Accessed: 04-Oct-2010].

[25] “gtkmm: Gtk::TextView Class Reference.” [Online]. Available: http://library.gnome.org/devel/gtkmm/unstable/classGtk_1_1TextView.html. [Accessed: 06-Oct-2010].

[26] “OpenDocument - Wikipedia, the free encyclopedia.” [Online]. Available: http://en.wikipedia.org/wiki/OpenDocument. [Accessed: 07-Oct-2010].

[27] “Office Open XML - Wikipedia, the free encyclopedia.” [Online]. Available: http://en.wikipedia.org/wiki/Docx. [Accessed: 07-Oct-2010].

[28] “Mono (software) - Wikipedia, the free encyclopedia.” [Online]. Available: http://en.wikipedia.org/wiki/Mono_(software). [Accessed: 07-Oct-2010].

[29] M. Ettrich, LyX. 2009.

[30] “Document Object Model Core.” [Online]. Available: http://www.w3.org/TR/2000/REC-DOM-Level-2-Core-2000010113/core.html. [Accessed: 04-Oct-2010].

[31] “GtkTextView.” [Online]. Available: http://library.gnome.org/devel/gtk/stable/GtkTextView.html. [Accessed: 08-Oct-2010].

[32] “w00t - Wikipedia, the free encyclopedia.” [Online]. Available: http://en.wikipedia.org/wiki/W00t. [Accessed: 05-Oct-2010].

[33] “Optimistic replication - Wikipedia, the free encyclopedia.” [Online]. Available: http://en.wikipedia.org/wiki/Optimistic_replication. [Accessed: 05-Oct-2010].

[34] X. Su, B. S. Prabhu, C. Chu, and R. Gadh, “Middleware for multimedia mobile collaborative system,” in Wireless Telecommunications Symposium, 2004, pp. 112–119, 2004.

[35] “Mashup (web application hybrid) - Wikipedia, the free encyclopedia.” [Online]. Available: http://en.wikipedia.org/wiki/Mashup_(web_application_hybrid). [Accessed: 09-Oct-2010].

[36] “Common Object Request Broker Architecture - Wikipedia, the free encyclopedia.” [Online]. Available: http://en.wikipedia.org/wiki/CORBA. [Accessed: 09-Oct-2010].

[37] “MIME - Wikipedia, the free encyclopedia.” [Online]. Available: http://en.wikipedia.org/wiki/MIME. [Accessed: 09-Oct-2010].