XML I/O
Peter Dobcsanyi
p.dobcsanyi at designtheory.org
Fri Jul 11 01:12:52 BST 2003
This post lays down some basic terminology related to XML I/O
operations on data in the standard External Representation (Ext-Rep for
short). It also gives a general specification for the XML I/O modules
currently being developed in parallel in GAP, R, and Python.
Introduction
============
All components in DTRS, and any external entity communicating with
DTRS, exchange information in the Ext-Rep standard format, which is an
XML-based language defined in "blockdesign.rnc" using the Relax NG
schema language.
"blockdesign.rnc" version beta is due to be published online on our
web-site by 2003-08-10.
Therefore all these software components must implement the necessary
XML input and output operations for Ext-Rep.
Fundamental XML I/O components
==============================
XML output -- the Writer
While programs are free to use whatever internal representation of
designs they like, they are required to produce output in Ext-Rep
standard format. The software component (procedure(s), module(s),
package(s) etc.) providing this output service is called an "Ext-Rep
Writer" or just "Writer" for short. The Writer takes a collection of
designs represented in memory by some data structure and produces
a <list_of_designs> XML document in Ext-Rep format. This document is
written to an external medium such as a file or a pipe.
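To make the Writer idea concrete, here is a minimal sketch in Python
using the standard library's streaming XMLGenerator. Only
<list_of_designs> comes from the text above; the namespace URI, the
"design" and "block" element names, and the in-memory representation
(a design as a list of blocks, a block as a list of integers) are
placeholders I made up for illustration, not what blockdesign.rnc
actually defines.

```python
import io
from xml.sax.saxutils import XMLGenerator

# Hypothetical placeholder; the real namespace is whatever blockdesign.rnc declares.
EXT_REP_NS = "http://example.org/ext-rep"


def write_designs(designs, out):
    """Write designs to the text stream `out` as one <list_of_designs> document.

    `designs` is an assumed in-memory form: a list of designs, each a list
    of blocks, each block a list of integers.
    """
    gen = XMLGenerator(out, encoding="UTF-8")
    gen.startDocument()  # emits the XML declaration (version and encoding)
    gen.startElement("list_of_designs", {"xmlns": EXT_REP_NS})
    for blocks in designs:
        gen.startElement("design", {})        # hypothetical element name
        for block in blocks:
            gen.startElement("block", {})     # hypothetical element name
            gen.characters(" ".join(str(p) for p in block))
            gen.endElement("block")
        gen.endElement("design")
    gen.endElement("list_of_designs")
    gen.endDocument()


# Any text stream will do: an open file, a pipe such as sys.stdout, or a buffer.
buf = io.StringIO()
write_designs([[[0, 1, 2], [0, 3, 4]]], buf)
xml_text = buf.getvalue()
```

Because the generator writes each element as soon as it is started,
the Writer never needs to hold the whole output document in memory.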
XML input -- the Reader
Programs are also required to provide a suitable XML input
component, called the "Ext-Rep Reader" or just "Reader". The Reader
takes a <list_of_designs> XML document in Ext-Rep format from an
external source (file, socket etc.) and in a multilayered process
eventually transforms it into a memory representation of designs
suitable for further processing by the program.
The Reader provides two basic functionalities:
"parsing"
That is, parsing the input XML document and checking its syntactical
and structural correctness. Parsing is provided by a "standard" XML
parser library, so we don't have to write our own parser. R, GAP,
and Python all have such modules.
If the parser checks the document against a schema definition
like our blockdesign.rnc, it is called a "validating" parser.
Basically there are two kinds of parsers. SAX is an event-driven
parser: it reads the XML document as a stream, and every time it
recognizes a syntactical entity (for example a tag or a piece of
character data) it hands it to the application for further
processing through call-back functions.
A DOM parser reads in the entire document and rebuilds the XML
document tree in memory. The application can then traverse
this tree using procedures provided by the DOM parser.
"transformation"
This is the part we need to write "on top of the parser
library". The parser provides the bits and pieces of the XML
document as character strings; the Reader needs to convert these
into the appropriate binary format and, using the results, build
the internal representation of the designs.
In SAX based Readers the parsing and transformation are concurrent,
while in a DOM based Reader they are consecutive.
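The difference between the two styles can be sketched in Python with
the standard xml.sax and xml.dom.minidom modules. The sample document
and the "design" and "block" element names are stand-ins I invented;
a real Reader would follow blockdesign.rnc. Both functions return the
same assumed internal representation, a list of designs where each
design is a list of integer blocks.

```python
import xml.sax
from xml.dom import minidom

# Tiny stand-in document; "design" and "block" are hypothetical element names.
SAMPLE = (b"<list_of_designs><design>"
          b"<block>0 1 2</block><block>0 3 4</block>"
          b"</design></list_of_designs>")


class DesignHandler(xml.sax.ContentHandler):
    """SAX: the parser calls back while reading, so transformation is interleaved."""

    def __init__(self):
        super().__init__()
        self.designs = []
        self._text = None  # collects character data while inside a <block>

    def startElement(self, name, attrs):
        if name == "design":
            self.designs.append([])
        elif name == "block":
            self._text = []

    def characters(self, content):
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "block":
            # Transformation step: character strings become integers.
            points = [int(t) for t in "".join(self._text).split()]
            self.designs[-1].append(points)
            self._text = None


def read_sax(data):
    handler = DesignHandler()
    xml.sax.parseString(data, handler)
    return handler.designs


def read_dom(data):
    """DOM: the whole tree is built first, then traversed and transformed."""
    doc = minidom.parseString(data)
    return [
        [[int(t) for t in block.firstChild.data.split()]
         for block in design.getElementsByTagName("block")]
        for design in doc.getElementsByTagName("design")
    ]


sax_designs = read_sax(SAMPLE)
dom_designs = read_dom(SAMPLE)
```

Note that the SAX handler buffers character data until the end tag,
since a parser may deliver the text of one element in several pieces.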
Specification for XML I/O modules
=================================
What follows is an outline of a gradual implementation of
Reader/Writer modules with the corresponding requirements. The
implementation process is broken down into three phases: implementing
the mandatory requirements, a partial implementation by category, and
the full implementation.
Requirements for a Writer
-------------------------
Mandatory:
- A Writer must produce a valid XML document with a proper XML
version header and namespace declaration.
- As an absolute minimum, a Writer must implement the mandatory
structures for a <list_of_designs> document as defined
in blockdesign.rnc.
General optional requirements and guidelines:
- A Writer does not need to implement indentation. If it does,
then indentation must be optional and the default must be "off".
- Find a reasonable balance between writing the whole document
on one single line and putting each word on a separate
line. Avoid unnecessary white space as far as is
reasonable. While compactness is important, keep in mind
that from time to time humans will need to look at
these files.
I will provide a pretty printer which can be used both on files
and as filter on pipes.
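As an illustration of the "optional, default off" rule for
indentation, here is a small Python sketch using
xml.etree.ElementTree. The serialize() function and its indent
parameter are names of my own choosing, "design" and "block" are
hypothetical element names, and ET.indent() needs Python 3.9 or later.
The sketch deals only with layout and leaves out the XML declaration
and namespace that a real Writer must emit.

```python
import copy
import xml.etree.ElementTree as ET


def serialize(root, indent=False):
    """Return the document as a string; indentation is optional and off by default."""
    if indent:
        # ET.indent() modifies the tree in place, so work on a copy.
        root = copy.deepcopy(root)
        ET.indent(root, space="  ")  # requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


root = ET.Element("list_of_designs")
design = ET.SubElement(root, "design")           # hypothetical element name
ET.SubElement(design, "block").text = "0 1 2"    # hypothetical element name

compact = serialize(root)               # default: no added white space
pretty = serialize(root, indent=True)   # opt-in: one element per line
```

The compact form is what other programs would normally receive; the
indented form is only for the occasional human reader.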
Categorized:
There are three categories and a "utility" layer. These are
loose definitions; the implementor is free to choose which to
support. The more the merrier :-).
- Combinatorial components, like <block_concurrencies>.
- Group theoretical, like <automorphism_group>.
- Statistical, like <optimality_criteria_values>.
- Utility layer, things used all over, like <functions_on...>.
Full implementation:
What can I say, it's a never-ending story :-)
Requirements for a Reader
-------------------------
Mandatory:
- A Reader must be able to parse any valid Ext-Rep document,
including a "full blown" one.
- Any Reader must provide transformation at least for the
mandatory subtree of an Ext-Rep document.
- A Reader should provide as much semantic checking as
possible for all Ext-Rep components for which it provides
transformation.
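What counts as a semantic check depends on the component, but the
following Python sketch shows the general shape one might take once
the transformation has produced numbers. The check_design() function,
its parameters v (number of points) and b (declared number of
blocks), and the convention that points are numbered 0..v-1 are all
assumptions made for this example only.

```python
def check_design(v, b, blocks):
    """Return a list of problems found; an empty list means the checks passed.

    v      -- assumed number of points, numbered 0..v-1
    b      -- assumed declared number of blocks
    blocks -- list of blocks, each a list of integer points
    """
    problems = []
    if len(blocks) != b:
        problems.append(f"declared {b} blocks, found {len(blocks)}")
    for i, block in enumerate(blocks):
        out_of_range = [p for p in block if not 0 <= p < v]
        if out_of_range:
            problems.append(
                f"block {i} has points outside 0..{v - 1}: {out_of_range}")
    return problems


ok = check_design(5, 2, [[0, 1, 2], [0, 3, 4]])
broken = check_design(5, 3, [[0, 1, 2], [0, 3, 7]])
```

Collecting all problems instead of stopping at the first one makes the
error report more useful to whoever produced the faulty document.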
Note, at the moment validation is not a requirement, but it might
become one in the near future. The problem is that we currently
don't have a validating parser that works with Relax NG, and our
Ext-Rep is too complex to be transformed into a DTD.
Optional:
- Transformation of various subtrees, see categories above, up
to the complete Ext-Rep tree.
General guidelines and remarks:
- Considering the sizes of our current and future Ext-Rep
documents, DOM based parsers are not really an option for
most of our applications. They are slow and they are memory hogs.
- Once a minimal Reader is working, turning it into a full
implementation is not as big a deal as it seems.
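For Python, one way to avoid building the whole tree is the
incremental ElementTree.iterparse() interface from the standard
library. This is my suggestion rather than a prescribed method, and it
is not SAX: it builds small subtrees and lets the application throw
them away as it goes. The iter_designs() helper and the "design" and
"block" element names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_designs(source, tag="design"):  # "design" is a hypothetical element name
    """Yield each completed design element, then empty it to limit memory use."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            # Drop the children and text that have already been processed.
            # The emptied element itself stays attached to the root.
            elem.clear()


# A stand-in document with 1000 small designs.
doc = io.BytesIO(
    b"<list_of_designs>"
    + b"<design><block>0 1 2</block></design>" * 1000
    + b"</list_of_designs>"
)
block_count = sum(len(d.findall("block")) for d in iter_designs(doc))
```

Each design is fully available while it is being processed, which
keeps the transformation code close to the DOM style, but only one
design's worth of content is held in memory at a time.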
Notes on development
====================
General
-------
Run "cvs update" frequently in your sandbox. The Ext-Rep definition is
still changing.
Testing:
--------
To test an Ext-Rep document against the blockdesign.rnc schema, go to
the xml sub-tree in your sandbox on DTRS. Copy your whatever.xml
files into the test sub-dir, then (still in the xml/ dir) type:
make test
All .xml files in test/ will be checked. This procedure also tests
the syntax and structure of blockdesign.rnc. Should you discover
any problem with the schema itself please send me the error
messages.
Note, this testing works only on DTRS since it uses quite a complex
infrastructure behind the scenes.
-- ,
Peter Dobcsanyi