XML I/O

Peter Dobcsanyi p.dobcsanyi at designtheory.org
Fri Jul 11 01:12:52 BST 2003


This post lays down some basic terminology for XML I/O operations on
data in the standard External Representation (Ext-Rep for short).  It
also gives a general specification for the XML I/O modules currently
being developed in parallel in GAP, R, and Python.

Introduction
============

    All components in DTRS, and any external entity communicating with
    DTRS, exchange information in the Ext-Rep standard format, which is
    an XML-based language defined by the Relax NG schema
    "blockdesign.rnc".

    The beta version of "blockdesign.rnc" is due to be published on our
    web-site by 2003-08-10.

    Therefore all these software components must implement the necessary
    XML input and output operations for Ext-Rep.


Fundamental XML I/O components
==============================

  XML output -- the Writer

    While programs are free to use any internal representation of
    designs, they are required to produce output in Ext-Rep standard
    format. The software component (procedure(s), module(s), package(s)
    etc.) providing this output service is called an "Ext-Rep Writer" or
    just "Writer" for short. The Writer takes a collection of designs
    represented in memory by some data structure and produces
    a <list_of_designs> XML document in Ext-Rep format. This document is
    written to an external medium such as a file or a pipe.
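
    A minimal sketch of such a Writer in Python, using only the standard
    library. The in-memory representation (a list of designs, each a
    list of blocks of point indices) and the element names <design> and
    <block> are assumptions made for illustration; the real names and
    the name-space header come from blockdesign.rnc and are not
    reproduced here.

```python
import io
import xml.etree.ElementTree as ET

def write_designs(designs, out):
    """Write `designs` as a <list_of_designs> document to the text stream `out`.

    `designs` is a list of designs; each design is a list of blocks and
    each block is a list of point indices (an assumed representation).
    """
    root = ET.Element("list_of_designs")
    for design in designs:
        d = ET.SubElement(root, "design")      # hypothetical element name
        for block in design:
            b = ET.SubElement(d, "block")      # hypothetical element name
            b.text = " ".join(str(p) for p in block)
    out.write('<?xml version="1.0"?>\n')       # XML version header
    out.write(ET.tostring(root, encoding="unicode"))
    out.write("\n")

# The same function serves files and pipes: pass an open file or sys.stdout.
buf = io.StringIO()
write_designs([[[0, 1, 2], [0, 3, 4]]], buf)
print(buf.getvalue())
```

    Writing to a stream object rather than to a file name keeps the
    Writer usable on both files and pipes.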

  XML input -- the Reader

    Programs are also required to provide a suitable XML input
    component, called the "Ext-Rep Reader" or just "Reader". The Reader
    takes a <list_of_designs> XML document in Ext-Rep format from an
    external source (file, socket etc.) and in a multilayered process
    eventually transforms it into a memory representation of designs
    suitable for further processing by the program.

    The Reader provides two basic functionalities:

    "parsing"

	That is, reading the input XML document and checking its
	syntactical and structural correctness. Parsing is provided by
	a "standard" XML parser library, so we don't have to write our
	own parser, just use the library.  R, GAP, and Python all have
	such modules.

	If the parser checks the document against a schema definition
	like our blockdesign.rnc, it is called a "validating" parser.

	Basically there are two kinds of parsers. SAX is an event-driven
	parser: it reads the XML document as a stream, and every time it
	recognizes a syntactical entity (for example a tag or a number)
	it passes it to the application for further processing through
	call-back functions.
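
	As an illustration, here is a small sketch using Python's
	standard xml.sax module. The handler only records the events it
	receives; the sample document and the element names inside
	<list_of_designs> are made up for the example.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Record each event in the order the parser reports it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        # Text may be delivered in several chunks; skip pure white space.
        if content.strip():
            self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))

# Hypothetical sample; <design> and <block> are placeholder names.
SAMPLE = b"<list_of_designs><design><block>0 1 2</block></design></list_of_designs>"

handler = EventLogger()
xml.sax.parseString(SAMPLE, handler)
for event in handler.events:
    print(event)
```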

	A DOM parser reads in the entire document and rebuilds the XML
	document tree in memory. The application can then traverse
	this tree using procedures provided by the DOM parser.
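
	For comparison, a sketch of the DOM style using Python's
	standard xml.dom.minidom module. The whole tree is built first
	and only then queried. The sample document and the <design> and
	<block> names are again placeholders, not taken from the schema.

```python
from xml.dom import minidom

# Hypothetical sample; <design> and <block> are placeholder names.
SAMPLE = (
    "<list_of_designs><design>"
    "<block>0 1 2</block><block>0 3 4</block>"
    "</design></list_of_designs>"
)

doc = minidom.parseString(SAMPLE)      # the entire tree is now in memory
root = doc.documentElement

# Traverse the tree with the procedures the DOM library provides.
blocks = [node.firstChild.data for node in root.getElementsByTagName("block")]
print(root.tagName, blocks)
```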

    "transformation"

	This is the part we need to write "on top of the parser
	library".  The parser provides the bits and pieces of the XML
	document as character strings; the Reader needs to convert these
	into the appropriate binary format and, using the results, build
	the internal representation of the designs.

    In a SAX based Reader parsing and transformation are concurrent,
    while in a DOM based Reader they are consecutive.
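
    A sketch of a SAX based Reader in Python in which the transformation
    runs concurrently with parsing: each block is converted from a
    character string to integers as soon as its end tag arrives. The
    <design> and <block> element names and the list-of-lists memory
    representation are assumptions for illustration only.

```python
import xml.sax

class DesignBuilder(xml.sax.ContentHandler):
    """Build designs as lists of blocks while the parser streams events."""

    def __init__(self):
        super().__init__()
        self.designs = []
        self._text = None          # text chunks of the block being read

    def startElement(self, name, attrs):
        if name == "design":       # hypothetical element name
            self.designs.append([])
        elif name == "block":      # hypothetical element name
            self._text = []

    def characters(self, content):
        # Collect chunks; the parser may split one text node into several.
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "block":
            # Transformation step: character strings -> integers.
            points = [int(tok) for tok in "".join(self._text).split()]
            self.designs[-1].append(points)
            self._text = None

def read_designs(data):
    """Parse `data` (bytes) and return the designs found in it."""
    builder = DesignBuilder()
    xml.sax.parseString(data, builder)
    return builder.designs

SAMPLE = (
    b"<list_of_designs><design>"
    b"<block>0 1 2</block><block>0 3 4</block>"
    b"</design></list_of_designs>"
)
print(read_designs(SAMPLE))
```

    Only the block currently being read is held as text, so memory use
    does not grow with the XML document itself, only with the designs
    built from it.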


Specification for XML I/O modules
=================================

    What follows is an outline of a gradual implementation of
    Reader/Writer modules with the corresponding requirements.  The
    implementation process is broken down into three phases: mandatory
    requirements, categorized requirements (a partial implementation),
    and full implementation.


  Requirements for a Writer
  -------------------------

    Mandatory:

	- A Writer must produce a valid XML document with proper XML
	  version and name-space headers.

	- As an absolute minimum, a Writer must implement the mandatory
	  structures for a <list_of_designs> document as it is defined
	  in blockdesign.rnc.
    
    General optional requirements and guidelines:

	- A Writer does not need to implement indentation. If it does,
	  then indentation must be optional and off by default.

	- Find a reasonable balance between writing the whole document
	  on one single line and putting each word on a separate line.
	  Avoid unnecessary white space as much as is reasonable. While
	  compactness is important, keep in mind that from time to time
	  humans will need to look into these files.

	I will provide a pretty printer which can be used both on files
	and as a filter on pipes.
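
	One possible way to make indentation optional and off by
	default in a Python Writer is sketched below. The function name
	serialize and its indent parameter are hypothetical, and
	ET.indent needs Python 3.9 or later.

```python
import copy
import xml.etree.ElementTree as ET

def serialize(root, indent=False):
    """Return the document as a string; indentation is off by default."""
    if indent:
        root = copy.deepcopy(root)     # leave the caller's tree compact
        ET.indent(root, space="  ")    # needs Python 3.9 or later
    return ET.tostring(root, encoding="unicode")

# Hypothetical miniature document; <design> and <block> are placeholders.
root = ET.fromstring(
    "<list_of_designs><design><block>0 1 2</block></design></list_of_designs>"
)
print(serialize(root))                 # compact, no extra white space
print(serialize(root, indent=True))    # indented, for human readers
```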
    
    Categorized:

	There are three categories and a "utility" layer. These are
	loose definitions; the implementor is free to choose which to
	support. The more the merrier :-).

	- Combinatorial components, like <block_concurrencies>.
	
	- Group theoretical, like <automorphism_group>.

	- Statistical, like <optimality_criteria_values>.

	- Utility layer, things used all over, like <functions_on...>.

    Full implementation:

	What can I say, it's a never-ending story :-)


  Requirements for a Reader
  -------------------------

    Mandatory:

	- A Reader must be able to parse any valid Ext-Rep document,
	  including a "full blown" one.
	
	- Any Reader must provide transformation at least for the
	  mandatory subtree of an Ext-Rep document.

	- A Reader should provide as much semantic checking as
	  possible for all Ext-Rep components for which it provides
	  transformation.
	
	Note that at the moment validation is not a requirement, but it
	might become one in the near future. The problem is that we
	currently don't have a validating parser that works with Relax
	NG, and our Ext-Rep is too complex to be transformed into a DTD.

    Optional:

	- Transformation of various subtrees, see categories above, up
	  to the complete Ext-Rep tree.
    
    General guidelines and remarks:

	- Considering the sizes of our current and future Ext-Rep
	  documents, DOM based parsers are not really an option for
	  most of our applications.  They are slow and they are memory
	  hogs.

	- Once a minimal Reader is working, turning it into a full
	  implementation is not as big a deal as it seems.


Notes on development
====================

General
-------

    Run cvs update frequently in your sandbox.  The Ext-Rep definition
    is still changing.

Testing
-------

    To test an Ext-Rep document against the blockdesign.rnc schema, go
    to the xml sub-tree in your sandbox on DTRS. Copy your whatever.xml
    files into the test sub-dir, then (still in the xml/ dir) type:

	make test

    All .xml files in test/ will be checked.  This procedure also tests
    the syntax and structure of blockdesign.rnc.  Should you discover
    any problem with the schema itself please send me the error
    messages.

    Note, this testing works only on DTRS since it uses quite a complex
    infrastructure behind the scenes.

--             ,
    Peter Dobcsanyi



