Satine

XML Data binding for Python

Version 0.99

Table of Contents

Introduction
A quick example
About xlists
The XML regular expression format (xre)
Queries
Validation
Streams

Introduction

Satine is a Python library that makes XML management easy and complete. Satine converts XML documents to Python lists with attributes (xlists). This technology lets you:

translate documents with namespaces, both in elements and in attributes
translate documents with or without an XMLSchema. If the XMLSchema is available, the document can be easily validated.
access XML documents randomly and partially
work very fast, because the data binding technology is coded in C.

The Satine WS module is a simple HTTP server that supports both normal HTTP and SOAP requests. Hence Satine WS is a web service that supports a human interface, too.

A quick example

The following example shows how to convert a simple XML document to an xlist, then modify and print it.

			Figure 1

	1:	import satine
	2:	doc = """<message from="bill">
	3:	    hello world
	4:	</message>"""
	5:	py = xml2py(doc)
	6:	py["from"] = "linus"    # "from" is a Python keyword, so py.from would be a syntax error
	7:	py[0] = "HELLO WORLD"
	8:	print py

Line 1 imports the Satine namespace; that is, it extends the Python __builtin__ namespace with the new type xlist and some functions. Among these functions is xml2py (line 5), which converts an XML document to an xlist. Once the object py has been created, you can change its attributes or items, as lines 6 and 7 do.

About xlists

XML documents are composed of XML elements. An XML element is a piece of information that is difficult to describe in usual programming languages: it has named attributes, like an object, but it also has items, like a list. Moreover, an XML element has a type identifier (the tag) and a namespace. No native Python type has all these properties, which suggests introducing a new type with the required features. Satine defines such a type, named xlist. An xlist inherits from the native type list the capability of containing items, along with methods such as append, remove and index.

An xlist adds an important property to its base type list: it supports attributes, as Python objects do. Figure 2 shows an xlist definition and some assignment operations.

			Figure 2

	1:	import satine
	2:	py = xlist()
	3:	py.language = "english"
	4:	py.append("HELLO WORLD")

	1:	<satine:xlist language="english">
	2:	    'HELLO WORLD'
	3:	</satine:xlist>

At the top of the figure, line 3 adds the attribute language to the xlist, and at line 4 the method append adds a new item. An xlist thus behaves both as a Python list and as a Python object. The bottom of the figure shows the representation of the xlist. This representation suggests a very simple binding between XML elements and xlists: element attributes correspond to xlist attributes, and nested element items correspond to xlist items. If an attribute has the same identifier as a method (such as append or sort), you can access it with the [] operator. For example, to set the attribute count, use the expression py["count"] = 12.
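The idea of a list that also carries attributes can be sketched in plain Python. The class below is a hypothetical stand-in named AttrList, written for Python 3; it is not Satine's xlist (which is coded in C) and only illustrates how items and attributes can coexist and how a representation similar to the one in figure 2 could be produced.

```python
class AttrList(list):
    """Hypothetical stand-in for xlist: a list whose instances accept attributes."""

    def __repr__(self):
        # Attributes live in the instance __dict__, items in the list itself.
        attrs = "".join(' %s="%s"' % (name, value)
                        for name, value in sorted(vars(self).items()))
        tag = type(self).__name__
        body = " ".join(repr(item) for item in self)
        return "<%s%s> %s </%s>" % (tag, attrs, body, tag)


py = AttrList()
py.language = "english"      # behaves like an object
py.append("HELLO WORLD")     # behaves like a list
print(py)
```

A subclass of list gets an instance dictionary for free, which is why no extra bookkeeping is needed for the attributes in this sketch.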

An XML element has not only attributes and items, but also a tag and a namespace. An xlist has a default tag, "xlist", and a default uri, "http://satine.sourceforge.net/schemas/kernel". To change the default tag and uri, the developer creates a new class that extends the xlist type. Figure 3 shows how to define a new type with tag Envelope and namespace "http://schemas.xmlsoap.org/soap/envelope/".

			Figure 3

	1:	import satine
	2:
	3:	class Envelope(xlist): pass
	4:	xspace(soap="http://schemas.xmlsoap.org/soap/envelope/")

	1:	<soap:Envelope>
	2:	</soap:Envelope>

Usually the tag of the instances is the class name. If the tag is a Python keyword, it can be given with the class variable __tag__. The uri is not a class property but a module property: the function xspace (line 4) defines the uri for all types that extend xlist in the Python module. The function also defines the default prefix used in the representation. In the example, the uri for type Envelope is "http://schemas.xmlsoap.org/soap/envelope/", and the default prefix is soap. The signature for the function is:

xspace(<default_prefix> = "<uri>")

A call to this function must always come after the class definitions. Different modules can share the same uri. Before converting a document that uses xlist types you have defined through inheritance, you must import the module that contains the class definitions.
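The naming rules above (tag from the class name, optional __tag__ override, prefix and uri attached to the module) can be illustrated with a small Python 3 sketch. The registry NAMESPACES, the function register_namespace and the base class Element are hypothetical names introduced only for this illustration; they do not reproduce how xspace works internally.

```python
NAMESPACES = {}   # hypothetical registry: module name -> (prefix, uri)


def register_namespace(module_name, prefix, uri):
    """Stand-in for xspace: one prefix and one uri per module."""
    NAMESPACES[module_name] = (prefix, uri)


class Element(list):
    """Hypothetical base class playing the role of xlist."""

    @classmethod
    def tag(cls):
        # The class name is the tag unless __tag__ overrides it.
        return getattr(cls, "__tag__", cls.__name__)

    @classmethod
    def qualified_name(cls):
        prefix, _uri = NAMESPACES[cls.__module__]
        return "%s:%s" % (prefix, cls.tag())


class Envelope(Element):
    pass


class Import(Element):
    __tag__ = "import"   # "import" is a keyword, so it cannot be a class name


# Registered after the class definitions, as the text requires for xspace.
register_namespace(Envelope.__module__, "soap",
                   "http://schemas.xmlsoap.org/soap/envelope/")

print(Envelope.qualified_name())
print(Import.qualified_name())
```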

The type xlist extends the native type list and hence provides its parent's operations: append, remove, count and so on. Beyond these operations, there are a few other methods and attributes:

copy creates a complete copy of the xlist, including all of its nested structure
iter([<criteria> [, <style>]]) returns an iterator that satisfies some criteria (see xre and queries).
query(<criteria> [, <style>]) searches for some patterns in the xlist (see xre and queries).
visit(<patterns_and_handlers>) searches for some patterns in the xlist and calls the appropriate handlers (see xre and queries)
validate() checks if the xlist is compliant with its schema (see validation)
next_item() returns the list of the types that are valid items according to the schema and the current xlist content (see validation)
__repr__ returns an xlist representation. It is used by repr and by the print statement. The representation looks like XML, but it is not really compliant. For a compliant representation, use the function py2xml.
__tag__ is the xlist tag
__uri__ is the xlist uri
__dict__ is the dictionary with xlist attributes
[key] manages attributes or items. If key is an integer, it returns a reference to the key-th element among items; if key is a string, it refers to an attribute; if key is a tuple, it refers to an attribute with namespace.
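The dispatch performed by the [] operator can be sketched as follows in Python 3. KeyedList is a hypothetical class, not Satine's implementation, and the (uri, name) order used for tuple keys is an assumption: the text only says that a tuple refers to an attribute with a namespace.

```python
class KeyedList(list):
    """Hypothetical sketch of the [key] rules described above."""

    def __init__(self, items=()):
        list.__init__(self, items)
        self.attrs = {}                      # attribute storage for this sketch

    def __getitem__(self, key):
        if isinstance(key, (str, tuple)):    # attribute, possibly namespaced
            return self.attrs[key]
        return list.__getitem__(self, key)   # integer (or slice): item access

    def __setitem__(self, key, value):
        if isinstance(key, (str, tuple)):
            self.attrs[key] = value
        else:
            list.__setitem__(self, key, value)


XML_NS = "http://www.w3.org/XML/1998/namespace"

el = KeyedList(["first item"])
el["count"] = 12               # string key: plain attribute
el[(XML_NS, "lang")] = "en"    # tuple key: attribute with namespace (assumed order)
el[0] = "FIRST ITEM"           # integer key: item
```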

The library provides some important features as built-in functions:

xml2py(<doc: string>[, <suggested_prefixes: dict>]) converts an XML fragment to an xlist. If the developer has defined appropriate classes for the elements in the document, these classes are used to instantiate the xlists; otherwise, new classes (with no validation support) are created at run-time. The optional parameter suggested_prefixes defines the uri for each prefix in the document. If the parameter is absent and the document does not contain an xmlns attribute, the uri for each prefix is chosen according to the default prefixes. The function removes repeated whitespace and special characters (\n, \t...).
xml2seq(<doc: string>[, <suggested_prefixes: dict>]) converts an XML fragment to a sequence of empty xlists and strings. This function makes it possible to convert parts of an XML document.
py2xml(<obj: xlist>[, <suggested_prefixes: dict>]) converts an xlist to an XML document. The parameter suggested_prefixes defines the corresponding prefix for each uri. The parameter is unnecessary if the prefixes are the default ones (as in most cases).
seq2xml(<obj: xlist>[, <suggested_prefixes: dict>]) converts a sequence of empty xlists to an XML document. NOTE: at the moment it doesn't work.
py2seq(<obj: xlist>) converts an xlist from the normal to the sequential format (see streams)
seq2py(<obj: xlist>) converts from the sequential format to the normal one (see streams)
xspace(<prefix> = <uri>) assigns the uri and the default prefix for the classes defined in the module where the function is called
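To give a feeling for what a conversion such as xml2py involves, here is a minimal Python 3 sketch built on the standard library module xml.etree.ElementTree. The class Node and the function to_node are hypothetical names; the sketch ignores namespaces, prefixes and user-defined classes, so it is only a rough approximation of the behaviour described above.

```python
import xml.etree.ElementTree as ET


class Node(list):
    """Hypothetical stand-in for xlist: items in the list, attributes in .attrs."""

    def __init__(self, tag, attrs):
        list.__init__(self)
        self.tag = tag
        self.attrs = dict(attrs)


def to_node(element):
    """Convert an ElementTree element into a nested Node."""
    node = Node(element.tag, element.attrib)

    def add_text(text):
        # Collapse runs of whitespace and drop empty strings.
        text = " ".join((text or "").split())
        if text:
            node.append(text)

    add_text(element.text)
    for child in element:
        node.append(to_node(child))
        add_text(child.tail)      # text that follows the child element
    return node


doc = """<message from="bill">
   hello   world
</message>"""
py = to_node(ET.fromstring(doc))
print(py.tag, py.attrs, list(py))
```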

Other functions in the satine module are:

satine.is_xlist(<obj>) returns 1 if the parameter is an xlist
satine.strip(<sequence: list>) removes repeated whitespace and special characters (\n, \t...) from a sequence of xlists and strings.
satine.xre(<xre_expr: string>) converts an XRE expression to its deterministic finite automaton (DFA).

The XML regular expression format (XRE)

Regular expressions are a powerful method for searching text documents. Satine provides a similar format for retrieving data from an xlist: the XML regular expression (XRE) format. An XRE is a document that contains XML elements and some special characters. The XML elements are the atomic items, like normal characters in regular expressions. Special characters indicate repetitions or well-known patterns. In particular:

'*'	causes the resulting XRE to match 0 or more repetitions of the preceding XRE, as many repetitions as possible. <table><tr>* will match '<table>', '<table><tr>', or '<table>' followed by any number of '<tr>'s.
'+'	causes the resulting XRE to match 1 or more repetitions of the preceding XRE. <table><tr>+ will match '<table>' followed by any non-zero number of '<tr>'s; it will not match just '<table>'. (not supported at the moment)
'?'	causes the resulting XRE to match 0 or 1 repetitions of the preceding XRE. <table><tr>? will match either '<table>' or '<table><tr>'.
'.'	causes the resulting XRE to match any single item, whether an XML element or a string
'|'	A|B creates an XRE that will match either A or B
'$'	causes the resulting XRE to match a string.
(...)	groups a sequence of XREs into a single one

Creating an XRE is very simple. For example, the SOAP element Envelope accepts, as items, an optional element <soap:Header> and a mandatory element <soap:Body>. The XRE that describes this pattern is "<soap:Header>?<soap:Body>". An XML element in an XRE matches the elements (xlists) in which every attribute present in the XRE has the same value. For example, the XRE "<xhtml:table bgcolor="#212121">" will match "<xhtml:table bgcolor="#212121" width="80%">", but it will not match either "<xhtml:table bgcolor="#211111">" or "<xhtml:table>".

Some examples might be as follows:

<a>.<b>	matches <a>hello<b>, <a><c><b>...
(<a><b>)*<c>	matches <c>, <a><b><c>, <a><b><a><b><c>...
<a>$	matches <a>Hello, <a>World
.*	matches anything
<a>|<b><c>	matches <a><c> and <b><c>
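One way to experiment with patterns of this kind is to translate them into ordinary Python regular expressions over a sequence of items. The Python 3 sketch below does that under several loudly stated assumptions: TagPattern is a hypothetical class, elements are written as strings such as "<a>" and anything else counts as text, attributes in the pattern are ignored, and operator precedence is the one of Python's re module, which may differ from Satine's (so the alternation example above is not reproduced here). It is not the DFA construction that Satine performs.

```python
import re

STRING, OTHER = "\ue000", "\ue001"   # codes for "a string" and "an unknown tag"
TOKEN = re.compile(r"<([^<>\s]+)[^<>]*>|[.$*+?|()]")


class TagPattern:
    """Hypothetical XRE-like matcher built on the re module."""

    def __init__(self, xre):
        self.codes = {}
        parts = []
        for match in TOKEN.finditer(xre):
            tag, token = match.group(1), match.group(0)
            if tag:
                # One private-use character per distinct tag in the pattern.
                code = self.codes.setdefault(tag, chr(0xE002 + len(self.codes)))
                parts.append(code)
            elif token == "$":
                parts.append(STRING)
            else:
                parts.append(token)   # . * + ? | ( ) keep their regex meaning
        self.regex = re.compile("".join(parts), re.DOTALL)

    def _encode(self, item):
        if item.startswith("<"):
            tag = item[1:-1].split()[0]
            return self.codes.get(tag, OTHER)
        return STRING

    def matches(self, items):
        encoded = "".join(self._encode(item) for item in items)
        return self.regex.fullmatch(encoded) is not None


pattern = TagPattern("(<a><b>)*<c>")
print(pattern.matches(["<a>", "<b>", "<c>"]))
print(pattern.matches(["<a>", "<c>"]))
```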

Queries

XML documents often contain information that must be retrieved. Satine provides a powerful technology to perform queries once an XML document has been converted to an xlist. There are three methods in the xlist type that support queries:

xlist.query(<criteria> [, <style>]) finds the matching items and returns them in a list
xlist.iter(<criteria> [, <style>]) returns an iterator that retrieves the matching items.
xlist.visit(<patterns_and_handlers>) executes a call-back function for each matching item

The query and iter methods have the same signature. The first parameter, criteria, defines the matching rule. If the optional parameter style is missing, criteria is an XRE expression followed by an extraction pattern:

<XRE_Expression>|[attr1[, attr2...[,attrN]...]]

The XRE expression doesn't apply to the sequence of items that an xlist contains; it applies to the sequence of nested items starting from the root. Figure 4 shows an xlist representation against which the expression "<addressbook><person name="linus">|" is matched.

			Figure 4

	1:	<addressbook>
	2:	    <updated> 02-15-2003 </updated>
	3:	    <person name="bill" surname="gates">
	4:	        <email>
	5:	            gates@msn.com
	6:	        </email>
	7:	    </person>
	8:	    <person name="linus" surname="tolvald">
	9:	        <email>
	10:	            gates@msn.com
	11:	        </email>
	12:	    </person>
	13:	</addressbook>

The result of the query is the xlist from line 8 to line 12. If you need only some attributes from the result, you can list them in the extraction pattern. For example, if you need only the surname of people with name "linus", the parameter criteria is "<addressbook><person name="linus">|surname".

Both query and iter support two other styles. The first is the tag style. A query with tag style retrieves all the items in the xlist whose tag is equal to criteria. For example, the expression query("person", style="tag") retrieves the two xlists from line 3 to line 12.
The other style is named pyfun. This style calls the function passed as the parameter criteria for each first-level item in the xlist. The result contains only the items for which the function returns true. For example, the expression query(lambda item: item.__tag__=="person" and item.name=="linus", style="pyfun") retrieves the same result as the first XRE in the example.
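The two alternative styles amount to simple filters, as the following Python 3 sketch suggests. Node and the free function query are hypothetical stand-ins for xlist and its query method; the sketch assumes that both styles look only at first-level items, which the text states explicitly only for pyfun.

```python
class Node(list):
    """Hypothetical stand-in for xlist with a tag and attributes."""

    def __init__(self, tag, items=(), **attrs):
        list.__init__(self, items)
        self.tag = tag
        self.__dict__.update(attrs)


def query(node, criteria, style):
    """Filter the first-level items of node according to the chosen style."""
    if style == "tag":
        def test(item):
            return isinstance(item, Node) and item.tag == criteria
    elif style == "pyfun":
        test = criteria
    else:
        raise ValueError("unsupported style: %r" % style)
    return [item for item in node if test(item)]


book = Node("addressbook", [
    Node("updated", ["02-15-2003"]),
    Node("person", [Node("email", ["gates@msn.com"])], name="bill", surname="gates"),
    Node("person", [Node("email", ["gates@msn.com"])], name="linus", surname="tolvald"),
])

people = query(book, "person", style="tag")
linus = query(book,
              lambda item: item.tag == "person" and item.name == "linus",
              style="pyfun")
print(len(people), linus[0].surname)
```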

The method iter returns an iterator that can move within the result set. The iterator provides two methods, tell and seek. The former returns the index of the current matching item in the original xlist. The latter changes the iterator position.
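An iterator with tell and seek can be sketched in a few lines of Python 3. MatchIterator is a hypothetical class; in particular, the assumption that seek takes an index into the original list is mine, since the text does not say what the argument means.

```python
class MatchIterator:
    """Hypothetical iterator over the items that satisfy a predicate."""

    def __init__(self, items, predicate):
        self._items = items
        self._predicate = predicate
        self._position = 0      # index of the next item to examine
        self._current = None    # index of the last matching item returned

    def __iter__(self):
        return self

    def __next__(self):
        while self._position < len(self._items):
            index = self._position
            self._position += 1
            if self._predicate(self._items[index]):
                self._current = index
                return self._items[index]
        raise StopIteration

    def tell(self):
        """Index of the current matching item in the original list."""
        return self._current

    def seek(self, index):
        """Restart the scan from the given index of the original list."""
        self._position = index


items = ["a", 1, "b", 2, "c"]
it = MatchIterator(items, lambda item: isinstance(item, str))
first = next(it)
second = next(it)
position = it.tell()
it.seek(0)
again = next(it)
```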

The method visit has a different goal: it doesn't return any result, but it provides a call-back analyzer similar to SAX. This method scans the xlist content looking for the patterns that satisfy some XREs, and it calls the corresponding function for each matching pattern. It accepts a single parameter, which is a list of XREs and their call-back functions; each XRE and its function are grouped in a tuple:

xlist.visit([(xre1,fun1),(xre2,fun2),...(xreN,funN)])

Figure 5 shows how this method might print the name of people in figure 4.

Figure 5

	1:	import satine
	2:	def call_back(item):
	3:	    print "person", item.name, item.surname, \
	4:	          "is in the addressbook"
	5:
	6:	addressbook = xml2py("""<addressbook><updated>...""")
	7:	addressbook.visit([("<addressbook><person>",call_back)])
	8:
	9:	dfa = satine.xre(".<person>")
	10:	addressbook.visit([(dfa,call_back)])

Lines 2 to 4 define the call-back function; its parameter is the item that matches an XRE. Lines 6 and 7 create the xlist and call the method visit. Lines 9 and 10 show a possible variation of line 7. The function satine.xre converts an XRE to a deterministic finite automaton (DFA). This object can be used in place of a string in the method visit. Converting an XRE into a DFA is mandatory before any search and, moreover, it is very slow. If the method visit is always invoked with the same XREs, a good optimization is to create the DFA only once. Another tip is to replace the first XML element in the XRE with a dot; at least in this example, and in general when the visit always concerns xlists with the same root, the XREs "<addressbook><person>" and ".<person>" are equivalent.
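The "compile once, visit many times" advice can be illustrated with a Python 3 sketch that uses the standard re module as a stand-in for XREs and DFAs. Here nodes are plain dictionaries, the function visit is a hypothetical re-implementation, and patterns are matched against the path of tags from the root; none of this reflects Satine's internals.

```python
import re


def visit(node, rules, path=""):
    """Call each handler whose compiled pattern matches the path of tags."""
    path = path + "/" + node["tag"]
    for pattern, handler in rules:
        if pattern.fullmatch(path):
            handler(node)
    for item in node["items"]:
        if isinstance(item, dict):      # strings are leaves, dicts are elements
            visit(item, rules, path)


book = {"tag": "addressbook", "attrs": {}, "items": [
    {"tag": "person", "attrs": {"name": "bill", "surname": "gates"}, "items": []},
    {"tag": "person", "attrs": {"name": "linus", "surname": "tolvald"}, "items": []},
]}

found = []


def call_back(item):
    found.append(item["attrs"]["name"])


# Compiled a single time and reused; "[^/]+" plays the role of the leading dot.
PERSON = re.compile(r"/[^/]+/person")
visit(book, [(PERSON, call_back)])
visit(book, [(PERSON, call_back)])
print(found)
```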

Validation

XML schemas are documents that define the structure and the elements of an XML namespace. A technology for XML data binding can really take advantage of XML schemas. In fact, while translating XML to objects, the converter can check whether the document is compliant with its schemas. In case of violation, the converter stops and reports the error. The process of checking whether a document is compliant with its schemas is named XML validation.

Currently Satine doesn't perform any validation check during conversion from XML, but it provides a method validate for any xlist instance. This method makes the xlist and its items compliant with the validation rules: it converts xlist attributes to the types defined in the rules and assigns default values to attributes that are not present. In case of violation, the method returns the fault and its exact position in the xlist.
The validation rules are described in the classes that define xlists. Figure 6 shows a simplified definition of some SOAP elements.

Figure 6

	1:	import satine
	2:
	3:	class Envelope(xlist):
	4:	    __items__ = """<soap:Header>?<soap:Body>"""
	5:
	6:	class Header(xlist):
	7:	    __attrs__ = """<xsd:boolean>mandatory"""
	8:	    ....
	9:
	10:	xspace(soap="http://schemas.xmlsoap.org/soap/envelope/")

Validation rules are specified by two class fields, __attrs__ and __items__, which apply respectively to the attributes and to the items of the xlist. The first field is a sequence of pairs, each made of an XML element and a string. The string is the name of the attribute the rule applies to. The XML element denotes a Python class that extends the xlist type and supports the two methods xml2py and py2xml. Through these methods, the validation engine converts any attribute from the string type to the correct Python type and vice versa. For example, line 7 states that the attribute mandatory has to be converted through an xlist with prefix xsd and tag boolean. Classes for most XML datatypes are available in the module satine.dt. The XML element accepts an attribute default, which gives the default value for the attribute, and a boolean attribute required, which states whether the attribute is mandatory.

The second field of the class definition, __items__, is an XRE that restricts the content. The validation engine checks whether the items match the XRE. For example, line 4 indicates that, in xlists with tag Envelope, the content has to be an optional SOAP header followed by a mandatory SOAP body.
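The attribute side of validation (type conversion, default values, required attributes) can be sketched in Python 3 as below. The function validate, the rule format (converter, default, required) and the helper to_bool are hypothetical; Satine expresses the same information through __attrs__ and datatype classes, whose exact behaviour is not reproduced here.

```python
def to_bool(text):
    """Convert an XML boolean written as text to a Python bool."""
    value = text.strip().lower()
    if value in ("true", "1"):
        return True
    if value in ("false", "0"):
        return False
    raise ValueError("not a boolean: %r" % text)


def validate(attrs, rules):
    """Return (converted_attrs, None) on success or (None, fault) on violation."""
    result = {}
    for name, (convert, default, required) in rules.items():
        if name in attrs:
            try:
                result[name] = convert(attrs[name])
            except ValueError as error:
                return None, "attribute %r: %s" % (name, error)
        elif required:
            return None, "missing required attribute %r" % name
        elif default is not None:
            result[name] = convert(default)
    return result, None


# name -> (converter, default value as text, required flag)
RULES = {
    "mandatory": (to_bool, "false", False),
    "count": (int, None, True),
}

print(validate({"count": "12"}, RULES))
print(validate({"mandatory": "true"}, RULES))
```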

This validation support has two limitations: it doesn't use XMLSchema documents, and it doesn't provide all the features that XMLSchema does. The first issue is addressed by the satine.schemaclib module and by the tool schemac, which generates a Python module with validating classes from an XMLSchema document. The second issue is still open; anyway, I believe that in most cases the validation supplied by Satine is more than enough.

Streams

Satine provides an interesting technology for handling files. Through streams, developers can parse fragments of XML documents or move randomly within files, as they usually do with text documents. This feature is innovative because XML tools such as DOM and SAX usually process the whole document. The streams support is also very simple. The satine.stream module provides two classes, reader and writer, which allow respectively reading and writing a file. The constructor parameter is the Python stream where the XML information is stored. Once an object has been instantiated, an appropriate iterator can be retrieved through the method iter, as for all xlists. The iterator reads data from the file only when it is required, and it supports random access through the methods tell and seek, as the section about queries describes. The data returned by the iterator is not in the usual data binding format. Instead it is in a serialized format where all xlists are empty and where closing tags are explicit and represented by the Python object None. This is the same format used by the xml2seq and seq2xml functions. This serialized approach is required in order to support partial and sequential access to the stream. Figure 7 shows an XML document and its corresponding serialized form.

			Figure 7

	1:	<addressbook>
	2:	    <updated>
	3:	        02-15-2003
	4:	    </updated>
	5:	    <person name="bill" surname="gates">
	6:	        <email>
	7:	            gates@msn.com
	8:	        </email>
	9:	    </person>
	10:	    <person name="linus" surname="tolvald">
	11:	        <email>
	12:	            gates@msn.com
	13:	        </email>
	14:	    </person>
	15:	</addressbook>

	1:	<addressbook/>
	2:	<updated/>
	3:	02-15-2003
	4:	None
	5:	<person name="bill" surname="gates"/>
	6:	<email/>
	7:	gates@msn.com
	8:	None
	9:	None
	10:	<person name="linus" surname="tolvald"/>
	11:	<email/>
	12:	gates@msn.com
	13:	None
	14:	None
	15:	None
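The relation between the nested format and the serialized one can be made concrete with a Python 3 sketch. Node, flatten and rebuild are hypothetical names that mimic what py2seq and seq2py are described as doing; they are not Satine's code.

```python
class Node(list):
    """Hypothetical stand-in for xlist with a tag and an attribute dictionary."""

    def __init__(self, tag, items=(), **attrs):
        list.__init__(self, items)
        self.tag = tag
        self.attrs = attrs


def flatten(node):
    """Nested -> sequential: an empty node, its items, then None as close marker."""
    sequence = [Node(node.tag, **node.attrs)]
    for item in node:
        if isinstance(item, Node):
            sequence.extend(flatten(item))
        else:
            sequence.append(item)
    sequence.append(None)
    return sequence


def rebuild(sequence):
    """Sequential -> nested, using a stack of the currently open nodes."""
    stack, root = [], None
    for item in sequence:
        if item is None:
            root = stack.pop()              # the last node closed is the root
        elif isinstance(item, Node):
            node = Node(item.tag, **item.attrs)
            if stack:
                stack[-1].append(node)
            stack.append(node)
        else:
            stack[-1].append(item)
    return root


book = Node("addressbook", [
    Node("updated", ["02-15-2003"]),
    Node("person", [Node("email", ["gates@msn.com"])], name="bill", surname="gates"),
    Node("person", [Node("email", ["gates@msn.com"])], name="linus", surname="tolvald"),
])

sequence = flatten(book)
labels = [item.tag if isinstance(item, Node) else item for item in sequence]
print(labels)
```

The stack in rebuild mirrors the nesting depth: every None closes the most recently opened node, which is why the serialized form can be consumed one item at a time.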

Figure 8 shows a program that analyses the data from a file. Line 1 creates a stream from a Python file. Once the stream is created, it is accessed like the normal objects created by the function xml2seq. But, unlike normal xlists, data is loaded into memory only when it is required.

			Figure 8

	1:	stream = reader(open("addressbook.xml"))
	2:	for element in stream:
	3:	    if is_xlist(element):
	4:	        if element.__tag__ == "addressbook":
	5:	            print "starting..."
	6:	        elif element.__tag__ == "person":
	7:	            print "person",element.name,element.surname
	8:	    elif element is None:
	9:	        print "---------------------"
	10:	    else:
	11:	        print "info:", element

At the moment there is no support for read-and-write streams. Also streams are a bit unstable.
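For readers without access to satine.stream, a comparable lazy, sequential read can be approximated in Python 3 with the standard library function xml.etree.ElementTree.iterparse. The generator read_sequence below is a hypothetical sketch: it yields (tag, attributes) tuples instead of empty xlists, offers no tell or seek, and assumes that text appears only inside leaf elements (text mixed with child elements would be emitted after the children).

```python
import io
import xml.etree.ElementTree as ET


def read_sequence(fileobj):
    """Lazily yield (tag, attrs) for open tags, text, and None for close tags."""
    for event, element in ET.iterparse(fileobj, events=("start", "end")):
        if event == "start":
            yield (element.tag, dict(element.attrib))
        else:
            text = (element.text or "").strip()
            if text:
                yield text
            yield None
            element.clear()     # release the content that was already reported


doc = b"""<addressbook>
  <updated> 02-15-2003 </updated>
  <person name="bill" surname="gates"><email> gates@msn.com </email></person>
  <person name="linus" surname="tolvald"><email> gates@msn.com </email></person>
</addressbook>"""

for item in read_sequence(io.BytesIO(doc)):
    if item is None:
        print("---------------------")
    elif isinstance(item, tuple):
        print("element:", item[0], item[1])
    else:
        print("info:", item)
```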