order

Peter Dobcsanyi p.dobcsanyi at designtheory.org
Thu Jul 24 14:06:06 BST 2003


                                                    "Order! Order!"

In this post I make an attempt to find some common ground for looking at
the problem of order in ext-rep documents.  Probably my terminology is
not very clear cut, all I try to do is drawing your attention to a few
problems.  Admittedly, I raise more questions than I answer :-)

Of course, we already have various orderings imposed on some subtrees.
Questions remain, however: should we order this or that; is the
semantics of the current orderings is the right one, is it sound; is the
related notation consistent and unambiguous etc.

An XML document is a representation of a hierarchical tree.  The actual
text of the document represents this tree by listing its nodes in the
order one would encounter with them by traversing the tree starting from
its root, going depth first from left to right, printing a start tag
when one first encounters with the particular node and printing an
end-tag after one finished with all children of the node.

    XML: <a> <b/> <b/> <c> <d/> <d/> </c> </a>

    XML pretty printed:

        <a>
            <b/>
            <b/>
            <c>
                <d/>
                <d/>
                <d/>
            </c>
        </a>

    tree:
                (a)
               / | \
              /  |  \
            (b) (b) (c)
                   / | \
                  /  |  \
                (d) (d) (d)

An XML document is *well-formed* if it conforms to some syntactical
rules and can be obtained by traversing a hierarchical tree with one
single root in the way outlined above.

An XML grammar, like our Relax NG schema "blockdesign.rnc", defines the
structure of this tree roughly by specifying what are the valid tags and
what is the strict [see notes [1] at the end) relative order of the
subtrees. An XML document is *valid* in respect to a given grammar if it
is well-formed and matches the structure defined by the grammar.

    a possible schema:

        a = element a {
            b+,
            c
        }

        c = element c {
            d*
        }

We say that two nodes of the tree have the same type if they have the
same tag and the same parent (they are siblings with the same name).
This defines an equivalence relation on the nodes of the tree. In a way,
the equivalence classes are the types, so we will use the two terms
interchangeably.

Let's distinguish two kind of types and call a type a multi type if the
corresponding equivalence class can have more than one element in a
valid document and call a type a simple type otherwise.

Strictly speaking the schema provides an order among types (in the
example above: [a], [b], [c], [d]) but says nothing about how to order
nodes within an equivalence class.  This latter ordering is up to the
implementation.

Without loss of generality, when comparing nodes we ignore attributes
and their values and are concerned only with element names. How we
compare nodes with the same names for ordering purposes we will come to
it later.

Some leaves are not XML elements but character data (CDATA)
representing, for example, numbers.  Note, that this kind of nodes are
always of simple types since they are the only children of their
parent. Nevertheless, we can compare such nodes by their "natural order"
depending what their represent (integers, strings etc.) and this comes
handy later.

We say that a subtree, identified with its root node, is "ordered" if
the root's children are ordered, including possible multi types.
If all of these children are of simple types the order is given in any
valid document and determined by the schema. So the interesting cases
when a subtree has multi types among its nodes on the first level.

Assume that we have a subtree with multi types, how we order it?
Example:

        <blocks>                       blocks
            <block>                   /   |  \
                <n>10</n>            /    |   \
                <n>11</n>           /     |    \
            </block>             block  block  block
            <block>             / |     / | \   /|\
                <n>1</n>       n  n    n  n n  n n n
                <n>6</n>       |  |    |  | |  | | |
                <n>9</n>      10 11    1  6 9  2 4 9
            </block>
            <block>
                <n>2</n>
                <n>4</n>
                <n>9</n>
            </block>
        </blocks>

We order blocks from bottom up. On the lowest level we have simple
types, we are done by default. On the next level we have a multi type
<n> in all <block>-s. First we compare them by the number of their
children, if it is equal by comparing their children's represented value
lexicographically.  And so on...

Hmm .... I am getting lost here, but it must be clear to anybody what I
mean :-). The ext-rep document already contains a description of our
"canonical" ordering in terms of nested lists.

Note, if a subtree contains at least one multi types on any level then it
can be ordered only if the whole subtree is ordered from bottom up.


When facing multi types, the implementor has several choices:

    1) Don't care about order of nodes of multi types at all.

    2) Care about the order of nodes in some multi types but not in
       other multi types.

    3) Impose order on nodes in all multi types
       (completely ordering the tree.)

We cannot take 1) since we are using functions on indices for simplicity
and compactness reasons and that requires to have ordered domains.

We cannot take 3) since the protocol nature of ext-rep and/or practical
computational limits can create situations when we could not provide
ordering even if we wanted to. (*can't order*)

That's left us with 2), the partial ordering. However, we took this road
probably not by the above reasoning but by common sense since there are
subtrees whose ordering wouldn't make much sense or in which we are not
interested.

Taking 2) the implementor has different ways of dealing with the
situation depending, among other things, on what he sees as
semantically important:

    2/a) Require that all types he cares about must be ordered in all
         documents. That implicitly assumes, of course, that all these
         multi types can be ordered in all valid documents.

                            order?
                           /     \
                        care   don't care
                         / 
                      ordered

    2/b) Allow that types he cares about are not guaranteed
         to be in order in all documents, however he want to indicate
         what the case exactly is.

                            order?
                           /     \
                        care    don't care
                       /   \
                      /     \
                ordered    no guarantee
                            for order

    2/c) He cares only about the fact whether order is guaranteed or not

                            order?
                           /     \
                          /       \
                      ordered    no guarantee
                                  for order

Note, that the (*can't order*) situation plays in one way or in another
in all 2/a-c).

Whatever is the case it is preferable that the Reader of an ext-rep
document can establish as much as possible about the order situation
directly from the document without "consulting" with the protocol
description.


If we choose 2/a then to satisfy this requirement is easy: simple add an
attribute "ordered" to the roots of all ordered subtrees.
Alternatively, we can just state in the protocol doc which subtrees are
ordered -- but that would be against explicitness.

Note, to go with 2/a we must restrict ourself to subtrees which we can
and want to order under any circumstances.


2/c seems to be similar in that its "decision tree" has two leaves.
However I feel it is a bit messy conceptually, since what is ordered
would be moving around from document to document without a clear policy
reflected explicitly in the documents.


2/b is what we have (more or less) so far for some subtrees. The
notational convention we use (or should use) is that subtrees whose
orders matters have a logical attribute "ordered" indicating the state
of ordering and subtrees which we don't care about don't have such
attribute.
Note, that this convention is not consistently applied at the moment
(for details se below).

    Recursive Ordered attribute (RO)

Having the "ordered" attribute in the root of a subtree could indicate
that everything below is ordered so we don't have to repeat it over and
over again in multi types. I am not sure how consistently this can be
applied to all situations (see the list of designs case below).  What if
we have an "insulating" simple types only layer in the subtree so it is
already ordered by the schema so there is no need to go deeper. However
below it there can be unordered multi types.

There is yet another subtle detail lurking here.  Consider the following
grammar:

    a = element a {
	b+,
	c,
	d*
    }

If 'a' has "ordered" attribute it is not so obvious what its value
should be in a document in which, say, 'b' is ordered but 'd' is not.
One could say it should be False since subtree 'a' is not completely
ordered. But what if 'd' declared of no interest regarding ordering.
Is this status of 'd' is visible (explicit) from the document? Because
of (RO) neither b or c would have "ordered" attribute.

The current schema has an interesting property: any node has at most one
multi type among its children. Let's call that "dominance property".  If
a grammar has this property the above problem disappears.


Current state (probably not exhausting):

    We specified an order for list of block designs but this element do
    not have "ordered" attribute. Should we add it?  But how about (RO)?
    Treat it as a special exception to the rule since it is the root
    node?

    <blocks> are ordered from bottom up but have no "ordered" attribute.
    Although, in this case, we probably don't want ordered attributes
    deeper down.

    <map_entry>-s when are ordered by their <preimage>-s in the
    <function_on_xxx> subtrees when they are functions, accordingly
    <function_on_xxx> have ordered attribute. We may want to extend
    order for quasi functions and state some order for their preimages.
    mixing original and collapsed preimages within a quasi function is
    still pending since, I believe, I have received no comments on this
    matter to my "functions - summary" post.

    A similar ordering could be defined for <resolutions> subtree by
    ordering its multi type <resolution> entries by their
    <function_on_indices>'s comparing two functions (already ordered!)
    based on their map entries using our "canonical" ordering method.

    Anything else should be here?


[1]

There is some exception to strict relative order among branches
since Relax NG allows arbitrary order with the "&" construct.
We use this in <indicators>, <optimality_criteria_values> and
<automorphism_group_properties> (maybe more). The idea behind this was
to make lief easier for Expander (Read/Process/Write) application, so
when it computes new elements in these subtrees it does not need to care
about where to insert them.

There are disadvantages however:

    - our schema cannot be transformed into DTD therefore, for the time
      being, we cannot write real validating Readers.

    - it messes up this "clear" :-) picture on the principles of ordering.

I lean toward replacing "&" with "," that is to have strict order among
simple types. Any feelings on that?

--             ,
    Peter Dobcsanyi




More information about the Developers mailing list