APIs specific to lxml

lxml tries to follow established APIs wherever possible. Sometimes, however, the need to expose a feature in an easy way led to the invention of a new API.

Contents

lxml.etree
Other Element APIs
Trees and Documents
Iteration
Parsers
iterparse and iterwalk
Error handling on exceptions
Python unicode strings
XPath
XSLT
RelaxNG
XMLSchema
xinclude
write_c14n on ElementTree

lxml.etree

lxml.etree tries to follow the ElementTree API wherever it can. There are, however, some incompatibilities (see compatibility). The extensions are documented here.

If you need to know which version of lxml is installed, you can access the lxml.etree.LXML_VERSION attribute to retrieve a version tuple. Note, however, that it did not exist before version 1.0, so you will get an AttributeError in older versions. The versions of libxml2 and libxslt are available through the attributes LIBXML_VERSION and LIBXSLT_VERSION.
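
As a minimal sketch of such a version check (written for Python 3 and a current lxml release, which is an assumption; the examples in this document use Python 2 syntax), the tuples can be compared directly. The name has_lxml_1_0 is purely illustrative:

```python
from lxml import etree

# Version tuples compare element-wise, so a minimum-version check is a
# plain tuple comparison.  getattr() guards against the AttributeError
# described above for releases that predate LXML_VERSION.
lxml_version = getattr(etree, "LXML_VERSION", None)
has_lxml_1_0 = lxml_version is not None and lxml_version >= (1, 0)

print("lxml:   ", lxml_version)
print("libxml2:", etree.LIBXML_VERSION)
print("libxslt:", etree.LIBXSLT_VERSION)
```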

The following examples usually assume this to be executed first:

>>> from lxml import etree
>>> from StringIO import StringIO

Other Element APIs

While lxml.etree itself uses the ElementTree API, it is possible to replace the Element implementation by custom element subclasses. This has been used to implement well-known XML APIs on top of lxml. The lxml.elements package contains examples. Currently, there is a data-binding implementation called objectify, which is similar to the Amara bindery tool.

Additionally, the lxml.elements.classlookup module provides a number of different schemes to customize the mapping between libxml2 nodes and the Element classes used by lxml.etree.
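
The following sketch shows what a custom Element class can look like. It assumes Python 3 and a current lxml release, where the lookup classes are exposed by lxml.etree itself rather than by the module named above. CountingElement and child_tags() are hypothetical names invented for this illustration:

```python
from lxml import etree

class CountingElement(etree.ElementBase):
    """Hypothetical Element subclass that adds one convenience method."""

    def child_tags(self):
        # Iterating an Element yields its direct children.
        return [child.tag for child in self]

# Ask the parser to use the custom class for every element it creates.
lookup = etree.ElementDefaultClassLookup(element=CountingElement)
parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

root = etree.fromstring(b"<root><a/><b/></root>", parser)
print(type(root).__name__, root.child_tags())
```

Other lookup schemes may fit better when only some elements need special behaviour. This one simply applies the same class everywhere.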

Trees and Documents

Compared to the original ElementTree API, lxml.etree has an extended tree model. It knows about parents and siblings of elements:

>>> root = etree.Element("root")
>>> a = etree.SubElement(root, "a")
>>> b = etree.SubElement(root, "b")
>>> c = etree.SubElement(root, "c")
>>> d = etree.SubElement(root, "d")
>>> e = etree.SubElement(d,    "e")
>>> b.getparent() == root
True
>>> print b.getnext().tag
c
>>> print c.getprevious().tag
b

Elements always live within a document context in lxml. This implies that there is also a notion of an absolute document root. You can retrieve an ElementTree for the root node of a document from any of its elements:

>>> tree = d.getroottree()
>>> print tree.getroot().tag
root

Note that this is different from wrapping an Element in an ElementTree. You can use ElementTrees to create XML trees with an explicit root node:

>>> tree = etree.ElementTree(d)
>>> print tree.getroot().tag
d
>>> print etree.tostring(tree)
<d><e/></d>

All operations that you run on such an ElementTree (like XPath, XSLT, etc.) will understand the explicitly chosen root as root node of a document. They will not see any elements outside the ElementTree. However, ElementTrees do not modify their Elements:

>>> element = tree.getroot()
>>> print element.tag
d
>>> print element.getparent().tag
root
>>> print element.getroottree().getroot().tag
root

The rule is that all operations that are applied to Elements use either the Element itself as reference point, or the absolute root of the document that contains this Element (e.g. for absolute XPath expressions). All operations on an ElementTree use its explicit root node as reference.
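
A short sketch of this rule, using absolute XPath expressions (Python 3 syntax and a current lxml release are assumed here; the variable names are illustrative only):

```python
from lxml import etree

root = etree.Element("root")
d = etree.SubElement(root, "d")
e = etree.SubElement(d, "e")

subtree = etree.ElementTree(d)      # ElementTree with explicit root "d"

# On the ElementTree, absolute paths start at the explicit root ...
found_d = subtree.xpath("/d")
found_root_in_subtree = subtree.xpath("/root")

# ... while on the Element they start at the real document root.
found_root_from_d = d.xpath("/root")

print(found_d == [d], found_root_in_subtree, found_root_from_d == [root])
```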

Iteration

The ElementTree API makes Elements iterable to support iteration over their children. Using the tree defined above, we get:

>>> [ el.tag for el in root ]
['a', 'b', 'c', 'd']

Tree traversal is commonly based on the element.getiterator() method:

>>> [ el.tag for el in root.getiterator() ]
['root', 'a', 'b', 'c', 'd', 'e']

lxml.etree also supports this, but additionally features an extended API for iteration over the children, following/preceding siblings, ancestors and descendants of an element, as defined by the respective XPath axis:

>>> [ el.tag for el in root.iterchildren() ]
['a', 'b', 'c', 'd']
>>> [ el.tag for el in root.iterchildren(reversed=True) ]
['d', 'c', 'b', 'a']
>>> [ el.tag for el in b.itersiblings() ]
['c', 'd']
>>> [ el.tag for el in c.itersiblings(preceding=True) ]
['b', 'a']
>>> [ el.tag for el in e.iterancestors() ]
['d', 'root']
>>> [ el.tag for el in root.iterdescendants() ]
['a', 'b', 'c', 'd', 'e']

Note how element.iterdescendants() does not include the element itself, as opposed to element.getiterator(). The latter effectively implements the 'descendant-or-self' axis in XPath.
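
The axis iterators combine well with ordinary Python code. As an illustration (Python 3 syntax; ancestor_path() is a hypothetical helper, not part of lxml), iterancestors() can be used to spell out where an element sits in its document:

```python
from lxml import etree

def ancestor_path(element):
    """Join the tags from the document root down to `element`."""
    # iterancestors() yields the parent first and the root last,
    # so the collected tags have to be reversed.
    tags = [element.tag] + [ancestor.tag for ancestor in element.iterancestors()]
    return "/".join(reversed(tags))

root = etree.Element("root")
d = etree.SubElement(root, "d")
e = etree.SubElement(d, "e")

print(ancestor_path(e))   # root/d/e
```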

All of these iterators support an additional tag keyword argument that filters the generated elements by tag name:

>>> [ el.tag for el in root.iterchildren(tag='a') ]
['a']
>>> [ el.tag for el in d.iterchildren(tag='a') ]
[]
>>> [ el.tag for el in root.iterdescendants(tag='d') ]
['d']
>>> [ el.tag for el in root.getiterator(tag='d') ]
['d']

See also the section on the utility functions iterparse() and iterwalk() below.

Parsers

One of the differences from ElementTree is the parser. There is support for both XML and (broken) HTML. Both are based on libxml2 and therefore only support options that are backed by the library. Parsers take a number of keyword arguments. The following is an example for namespace cleanup during parsing, first with the default parser, then with a parametrized one:

>>> xml = '<a xmlns="test"><b xmlns="test"/></a>'

>>> et     = etree.parse(StringIO(xml))
>>> print etree.tostring(et.getroot())
<a xmlns="test"><b xmlns="test"/></a>

>>> parser = etree.XMLParser(ns_clean=True)
>>> et     = etree.parse(StringIO(xml), parser)
>>> print etree.tostring(et.getroot())
<a xmlns="test"><b/></a>

HTML parsing is similarly simple. The parsers have a recover keyword argument that the HTMLParser sets by default. It lets libxml2 try its best to return something usable without raising an exception. You should use libxml2 version 2.6.21 or newer to take advantage of this feature:

>>> broken_html = "<html><head><title>test<body><h1>page title</h3>"

>>> parser = etree.HTMLParser()
>>> et     = etree.parse(StringIO(broken_html), parser)

>>> print etree.tostring(et.getroot())
<html><head><title>test</title></head><body><h1>page title</h1></body></html>

lxml has an HTML function, similar to the XML shortcut known from ElementTree:

>>> html = etree.HTML(broken_html)
>>> print etree.tostring(html)
<html><head><title>test</title></head><body><h1>page title</h1></body></html>

The support for parsing broken HTML depends entirely on libxml2's recovery algorithm. It is not the fault of lxml if you find documents that are so heavily broken that the parser cannot handle them. There is also no guarantee that the resulting tree will contain all data from the original document. The parser may have to drop seriously broken parts when struggling to keep parsing. Especially misplaced meta tags can suffer from this, which may lead to encoding problems.
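
Because recovery may drop data, it can be worth checking that the parts you rely on actually made it into the tree. The following is only a sketch under assumptions (Python 3, a current lxml release, and a deliberately mild kind of breakage); the check itself is illustrative and not an lxml feature:

```python
from lxml import etree

broken = "<p>one<p>two"        # unclosed paragraphs, no html/body wrapper
root = etree.HTML(broken)      # the HTML parser recovers by default

paragraphs = [p.text for p in root.iter("p")]
print(root.tag, paragraphs)

# Verify that the recovered tree contains what the caller needs.
if len(paragraphs) != 2:
    raise ValueError("expected two paragraphs after recovery")
```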

The use of the libxml2 parsers makes some additional information available at the API level. Currently, ElementTree objects can access the DOCTYPE information provided by a parsed document, as well as the XML version and the original encoding:

>>> pub_id  = "-//W3C//DTD XHTML 1.0 Transitional//EN"
>>> sys_url = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
>>> doctype_string = '<!DOCTYPE html PUBLIC "%s" "%s">' % (pub_id, sys_url)
>>> xml_header = '<?xml version="1.0" encoding="ascii"?>'
>>> xhtml = xml_header + doctype_string + '<html><body></body></html>'

>>> tree = etree.parse(StringIO(xhtml))
>>> docinfo = tree.docinfo
>>> print docinfo.public_id
-//W3C//DTD XHTML 1.0 Transitional//EN
>>> print docinfo.system_url
http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
>>> docinfo.doctype == doctype_string
True

>>> print docinfo.xml_version
1.0
>>> print docinfo.encoding
ascii

iterparse and iterwalk

As known from ElementTree, the iterparse() utility function returns an iterator that generates parser events for an XML file (or file-like object), while building the tree. The values are tuples (event-type, object). The event types are 'start', 'end', 'start-ns' and 'end-ns'.

The 'start' and 'end' events represent opening and closing elements and are accompanied by the respective element. By default, only 'end' events are generated:

>>> xml = '''\
... <root>
...   <element key='value'>text</element>
...   <element>text</element>tail
...   <empty-element xmlns="testns" />
... </root>
... '''

>>> context = etree.iterparse(StringIO(xml))
>>> for action, elem in context:
...     print action, elem.tag
end element
end element
end {testns}empty-element
end root

The resulting tree is available through the root property of the iterator:

>>> context.root.tag
'root'

The other types can be activated with the events keyword argument:

>>> events = ("start", "end")
>>> context = etree.iterparse(StringIO(xml), events=events)
>>> for action, elem in context:
...     print action, elem.tag
start root
start element
end element
start element
end element
start {testns}empty-element
end {testns}empty-element
end root

You can modify the element and its descendants when handling the 'end' event. To save memory, for example, you can remove subtrees that are no longer needed:

>>> context = etree.iterparse(StringIO(xml))
>>> for action, elem in context:
...     print len(elem),
...     elem.clear()
0 0 0 3
>>> context.root.getchildren()
[]

WARNING: During the 'start' event, the descendants and following siblings are not yet available and should not be accessed. During the 'end' event, the element and its descendants can be freely modified, but its following siblings should not be accessed. During either of the two events, you must not modify or move the ancestors (parents) of the current element. You should also avoid moving or discarding the element itself. The golden rule is: do not touch anything that will have to be touched again by the parser later on.

If you have elements with a long list of children in your XML file and want to save more memory during parsing, you can clean up the preceding siblings of the current element:

>>> for event, element in etree.iterparse(StringIO(xml)):
...     # ... do something with the element
...     element.clear()                        # clean up children
...     if element.getprevious() is not None:  # clean up preceding siblings
...         del element.getparent()[0]

You can use while instead of if when you skipped siblings using the tag keyword argument. The more selective your tag is, however, the more thought you will have to put into finding the right way to clean up the elements that were skipped. Therefore, it is sometimes easier to traverse all elements and do the tag selection by hand in the event handler code.
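
A sketch of the while variant together with the tag keyword argument follows. It assumes Python 3 and a current lxml release, and the sample document is made up for the illustration. Note that the result of getprevious() is compared against None explicitly, because an element without children is false in a boolean context:

```python
from io import BytesIO
from lxml import etree

# Small stand-in for a large file with many sibling elements.
xml = b"<root>" + b"".join(b"<item>%d</item>" % i for i in range(5)) + b"</root>"

total = 0
for event, element in etree.iterparse(BytesIO(xml), tag="item"):
    total += int(element.text)          # do something with the element
    element.clear()                     # clean up children, text and tail
    # Drop every sibling that precedes the current element.
    while element.getprevious() is not None:
        del element.getparent()[0]

print(total)   # 10
```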

The 'start-ns' and 'end-ns' events notify about namespace declarations and generate tuples (prefix, URI):

>>> events = ("start-ns", "end-ns")
>>> context = etree.iterparse(StringIO(xml), events=events)
>>> for action, obj in context:
...     print action, obj
start-ns ('', 'testns')
end-ns None

It is common practice to use a list as namespace stack and pop the last entry on the 'end-ns' event.
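
Such a namespace stack might look like the following sketch (Python 3 and a current lxml release are assumed; the sample document and the names ns_stack and in_scope are illustrative):

```python
from io import BytesIO
from lxml import etree

xml = b'<root><a xmlns:x="urn:x"><x:b/></a><c/></root>'

ns_stack = []     # (prefix, URI) pairs that are currently in scope
in_scope = {}     # tag -> prefixes visible when the element was opened

events = ("start-ns", "end-ns", "start")
for event, obj in etree.iterparse(BytesIO(xml), events=events):
    if event == "start-ns":
        ns_stack.append(obj)            # obj is a (prefix, URI) tuple
    elif event == "end-ns":
        ns_stack.pop()                  # obj is None here
    else:
        in_scope[obj.tag] = [prefix for prefix, uri in ns_stack]

print(in_scope)
```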

lxml.etree supports two extensions compared to ElementTree. It accepts a tag keyword argument just like element.getiterator(tag). This restricts events to a specific tag or namespace.

>>> context = etree.iterparse(StringIO(xml), tag="element")
>>> for action, elem in context:
...     print action, elem.tag
end element
end element

>>> events = ("start", "end")
>>> context = etree.iterparse(StringIO(xml), events=events, tag="{testns}*")
>>> for action, elem in context:
...     print action, elem.tag
start {testns}empty-element
end {testns}empty-element

The second extension is the iterwalk() function. It behaves exactly like iterparse(), but works on Elements and ElementTrees:

>>> root = context.root
>>> context = etree.iterwalk(root, events=events, tag="element")
>>> for action, elem in context:
...     print action, elem.tag
start element
end element
start element
end element

Error handling on exceptions

libxml2 provides error messages for failures, be it during parsing, XPath evaluation or schema validation. Whenever an exception is raised, you can retrieve the errors that occurred and "might have" led to the problem:

>>> etree.clearErrorLog()
>>> broken_xml = '<a>'
>>> try:
...   etree.parse(StringIO(broken_xml))
... except etree.XMLSyntaxError, e:
...   pass # just put the exception into e
>>> log = e.error_log.filter_levels(etree.ErrorLevels.FATAL)
>>> print log
<string>:1:FATAL:PARSER:ERR_TAG_NOT_FINISHED: Premature end of data in tag a line 1

This might look a little cryptic at first, but it is the information that libxml2 gives you. At least the message at the end should give you a hint about what went wrong, and you can see that the fatal error (FATAL) happened during parsing (PARSER) in line 1 of a string (<string>, or the filename if available). Here, PARSER is the so-called error domain; see lxml.etree.ErrorDomains for the available domains. You can get it from a log entry like this:

>>> entry = log[0]
>>> print entry.domain_name, entry.type_name, entry.filename
PARSER ERR_TAG_NOT_FINISHED <string>

There is also a convenience attribute last_error that returns the last error or fatal error that occurred:

>>> entry = e.error_log.last_error
>>> print entry.domain_name, entry.type_name, entry.filename
PARSER ERR_TAG_NOT_FINISHED <string>
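
The error log can also be iterated to inspect every entry rather than only the last one. The sketch below assumes Python 3 syntax and a current lxml release; the exact messages depend on the libxml2 version, so none are shown:

```python
from lxml import etree

entries = []
try:
    etree.fromstring(b"<a><b></a>")       # mismatched tags
except etree.XMLSyntaxError as exc:
    # Each log entry carries its position, level, domain, type and message.
    entries = [
        (entry.line, entry.level_name, entry.domain_name,
         entry.type_name, entry.message)
        for entry in exc.error_log
    ]

for line, level, domain, type_name, message in entries:
    print("line %d [%s/%s] %s: %s" % (line, level, domain, type_name, message))
```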

Alternatively, lxml.etree supports logging libxml2 messages to the Python stdlib logging module. This is done through the etree.PyErrorLog class. It disables the error reporting from exceptions and forwards log messages to a Python logger. To use it, see the descriptions of the function etree.useGlobalPythonLog and the class etree.PyErrorLog for help. Note that this does not affect the local error logs of XSLT, XMLSchema, etc. which are described in their respective sections below.

Python unicode strings

lxml.etree has broader support for Python unicode strings than the ElementTree library. First of all, where ElementTree would raise an exception, the parsers in lxml.etree can handle unicode strings straight away. This is most helpful for XML snippets embedded in source code using the XML() function:

>>> uxml = u'<test> \uf8d1 + \uf8d2 </test>'
>>> uxml
u'<test> \uf8d1 + \uf8d2 </test>'
>>> root = etree.XML(uxml)

This requires, however, that unicode strings do not specify a conflicting encoding themselves and thus lie about their real encoding:

>>> etree.XML(u'<?xml version="1.0" encoding="ASCII"?>\n' + uxml)
Traceback (most recent call last):
  ...
ValueError: Unicode strings with encoding declaration are not supported.

Similarly, you will get errors when you try the same with HTML data in a unicode string that specifies a charset in a meta tag of the header. You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error-prone.

To serialize the result, you would normally use the tostring module function, which serializes to plain ASCII by default or a number of other encodings if asked for:

>>> etree.tostring(root)
'<test> &#63697; + &#63698; </test>'

>>> etree.tostring(root, 'UTF-8', xml_declaration=False)
'<test> \xef\xa3\x91 + \xef\xa3\x92 </test>'

As an extension, lxml.etree has a new tounicode() function that you can call on XML tree objects to retrieve a Python unicode representation:

>>> etree.tounicode(root)
u'<test> \uf8d1 + \uf8d2 </test>'

>>> el = etree.Element("test")
>>> etree.tounicode(el)
u'<test/>'

>>> subel = etree.SubElement(el, "subtest")
>>> etree.tounicode(el)
u'<test><subtest/></test>'

>>> et = etree.ElementTree(el)
>>> etree.tounicode(et)
u'<test><subtest/></test>'

The result of tounicode() can be treated like any other Python unicode string and then passed back into the parsers. However, if you want to save the result to a file or pass it over the network, you should use write() or tostring() with an encoding argument (typically UTF-8) to serialize the XML. The main reason is that unicode strings returned by tounicode() never have an XML declaration and therefore do not specify their encoding. These strings are most likely not parsable by other XML libraries.

In contrast, the tostring() function automatically adds a declaration as needed that reflects the encoding of the returned string. This makes it possible for other parsers to correctly parse the XML byte stream. Note that using tostring() with UTF-8 is also considerably faster in most cases.
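
The difference between the two kinds of output can be sketched as follows. This assumes Python 3 and a current lxml release, where tostring() with encoding="unicode" plays the role of tounicode(); that spelling is not taken from this document:

```python
from lxml import etree

root = etree.XML("<test> \uf8d1 + \uf8d2 </test>")

# Bytes with an XML declaration: suitable for files and network transfer.
data = etree.tostring(root, encoding="UTF-8", xml_declaration=True)

# Text without a declaration: convenient inside Python only.
text = etree.tostring(root, encoding="unicode")

# The byte form carries its encoding and parses back to the same content.
reparsed = etree.fromstring(data)
print(data.startswith(b"<?xml"), reparsed.text == root.text)
```"
assert reparsed.text == root.text
</test>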

XPath

lxml.etree supports the simple path syntax of the findall() etc. methods on ElementTree and Element, as known from the original ElementTree library. As an extension, these classes also provide an xpath() method that supports expressions in the complete XPath syntax.

There are also specialized XPath evaluator classes that are more efficient for frequent evaluation: XPath and XPathEvaluator. See the performance comparison to learn when to use which. Their semantics when used on Elements and ElementTrees are the same as for the xpath() method described here.
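
As a rough illustration of the two evaluator classes (Python 3 syntax; the documents and variable names are made up), XPath compiles an expression once for use on many documents, while XPathEvaluator is bound to one document and accepts many expressions:

```python
from lxml import etree

doc1 = etree.XML("<list><item/><item/></list>")
doc2 = etree.XML("<list><item/></list>")

# One compiled expression, applied to several documents.
count_items = etree.XPath("count(//item)")
print(count_items(doc1), count_items(doc2))    # 2.0 1.0

# One document, queried with several expressions.
evaluate = etree.XPathEvaluator(doc1)
items = evaluate("//item")
print(len(items), evaluate("name(/*)"))        # 2 list
```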

For ElementTree, the xpath method performs a global XPath query against the document (if absolute) or against the root node (if relative):

>>> f = StringIO('<foo><bar></bar></foo>')
>>> tree = etree.parse(f)

>>> r = tree.xpath('/foo/bar')
>>> len(r)
1
>>> r[0].tag
'bar'

>>> r = tree.xpath('bar')
>>> r[0].tag
'bar'

When xpath() is used on an element, the XPath expression is evaluated against the element (if relative) or against the root tree (if absolute):

>>> root = tree.getroot()
>>> r = root.xpath('bar')
>>> r[0].tag
'bar'

>>> bar = root[0]
>>> r = bar.xpath('/foo/bar')
>>> r[0].tag
'bar'

>>> tree = bar.getroottree()
>>> r = tree.xpath('/foo/bar')
>>> r[0].tag
'bar'

Optionally, you can provide a namespaces keyword argument, which should be a dictionary mapping the namespace prefixes used in the XPath expression to namespace URIs:

>>> f = StringIO('''\
... <a:foo xmlns:a="http://codespeak.net/ns/test1"
...       xmlns:b="http://codespeak.net/ns/test2">
...    <b:bar>Text</b:bar>
... </a:foo>
... ''')
>>> doc = etree.parse(f)
>>> r = doc.xpath('/t:foo/b:bar', {'t': 'http://codespeak.net/ns/test1',
...                                'b': 'http://codespeak.net/ns/test2'})
>>> len(r)
1
>>> r[0].tag
'{http://codespeak.net/ns/test2}bar'
>>> r[0].text
'Text'

There is also an optional extensions argument which is used to define extension functions in Python that are local to this evaluation.
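
A minimal sketch of such a local extension function follows, assuming Python 3 and a current lxml release. The function shout(), the prefix ex and the namespace URI urn:example:ext are hypothetical and chosen only for this example. Extension functions receive an evaluation context as their first argument, followed by the XPath arguments:

```python
from lxml import etree

def shout(context, text):
    """Hypothetical extension function: upper-case its string argument."""
    return text.upper()

doc = etree.XML("<a><b>Text</b></a>")

result = doc.xpath(
    "ex:shout(string(b))",
    namespaces={"ex": "urn:example:ext"},
    # Keys are (namespace URI, function name) pairs.
    extensions={("urn:example:ext", "shout"): shout},
)
print(result)   # TEXT
```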

The return values of XPath evaluations vary, depending on the XPath expression used:

True or False, when the XPath expression has a boolean result
a float, when the XPath expression has a numeric result (integer or float)
a (unicode) string, when the XPath expression has a string result.
a list of items, when the XPath expression has a list as result. The items may include elements, strings and tuples. Text nodes and attributes in the result are returned as strings (the text node content or attribute value). Comments are also returned as strings, enclosed by the usual <!-- --> markers. Namespace declarations are returned as tuples of strings: (prefix, URI).
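
The common result types can be seen side by side in this short sketch (Python 3 syntax; the sample document is illustrative):

```python
from lxml import etree

doc = etree.XML('<a><b id="1">x</b><b id="2">y</b></a>')

is_many = doc.xpath("count(//b) > 1")    # boolean result
how_many = doc.xpath("count(//b)")       # numeric result, always a float
first = doc.xpath("string(//b)")         # string result
ids = doc.xpath("//b/@id")               # list of attribute values
elems = doc.xpath("//b")                 # list of elements

print(is_many, how_many, first, ids, [el.tag for el in elems])
```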

A related convenience method of ElementTree objects is getpath(element), which returns a structural, absolute XPath expression to find that element:

>>> a  = etree.Element("a")
>>> b  = etree.SubElement(a, "b")
>>> c  = etree.SubElement(a, "c")
>>> d1 = etree.SubElement(c, "d")
>>> d2 = etree.SubElement(c, "d")

>>> tree = etree.ElementTree(c)
>>> print tree.getpath(d2)
/c/d[2]
>>> tree.xpath(tree.getpath(d2)) == [d2]
True

XSLT

lxml.etree introduces a new class, lxml.etree.XSLT. The class can be given an ElementTree object to construct an XSLT transformer:

>>> f = StringIO('''\
... <xsl:stylesheet version="1.0"
...     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...     <xsl:template match="/">
...         <foo><xsl:value-of select="/a/b/text()" /></foo>
...     </xsl:template>
... </xsl:stylesheet>''')
>>> xslt_doc = etree.parse(f)
>>> transform = etree.XSLT(xslt_doc)

You can then run the transformation on an ElementTree document by simply calling it, and this results in another ElementTree object:

>>> f = StringIO('<a><b>Text</b></a>')
>>> doc = etree.parse(f)
>>> result = transform(doc)

The result object can be accessed like a normal ElementTree document:

>>> result.getroot().text
'Text'

but, as opposed to normal ElementTree objects, can also be turned into an (XML or text) string by applying the str() function:

>>> str(result)
'<?xml version="1.0"?>\n<foo>Text</foo>\n'

The result is always a plain string, encoded as requested by the xsl:output element in the stylesheet. If you want a Python unicode string instead, you should set this encoding to UTF-8 (unless the ASCII default is sufficient). This allows you to call the builtin unicode() function on the result:

>>> unicode(result)
u'<?xml version="1.0"?>\n<foo>Text</foo>\n'

You can use other encodings at the cost of multiple recoding steps. Encodings that are not supported by Python will result in an error:

>>> xslt_tree = etree.XML('''\
... <xsl:stylesheet version="1.0"
...     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...     <xsl:output encoding="UCS4"/>
...     <xsl:template match="/">
...         <foo><xsl:value-of select="/a/b/text()" /></foo>
...     </xsl:template>
... </xsl:stylesheet>''')
>>> transform = etree.XSLT(xslt_tree)

>>> result = transform(doc)
>>> unicode(result)
Traceback (most recent call last):
  [...]
LookupError: unknown encoding: UCS4

It is possible to pass parameters, in the form of XPath expressions, to the XSLT template:

>>> xslt_tree = etree.XML('''\
... <xsl:stylesheet version="1.0"
...     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...     <xsl:template match="/">
...         <foo><xsl:value-of select="$a" /></foo>
...     </xsl:template>
... </xsl:stylesheet>''')
>>> transform = etree.XSLT(xslt_tree)
>>> f = StringIO('<a><b>Text</b></a>')
>>> doc = etree.parse(f)

The parameters are passed as keyword parameters to the transform call. First let's try passing in a simple string expression:

>>> result = transform(doc, a="'A'")
>>> str(result)
'<?xml version="1.0"?>\n<foo>A</foo>\n'

Let's try a non-string XPath expression now:

>>> result = transform(doc, a="/a/b/text()")
>>> str(result)
'<?xml version="1.0"?>\n<foo>Text</foo>\n'
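
Since parameters are XPath expressions, plain text has to be quoted, which becomes awkward when the text itself contains quotes. Current lxml releases offer XSLT.strparam() for this case; the sketch below assumes Python 3 and such a release, and it declares the parameter with xsl:param, which the stylesheet above does not do:

```python
from lxml import etree

stylesheet = etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:param name="a"/>
    <xsl:template match="/">
        <foo><xsl:value-of select="$a" /></foo>
    </xsl:template>
</xsl:stylesheet>''')
transform = etree.XSLT(stylesheet)
doc = etree.XML("<a><b>Text</b></a>")

raw = "it's"     # arbitrary text, here containing a quote character
# strparam() wraps the text so that it is passed as a string value
# instead of being evaluated as an XPath expression.
result = transform(doc, a=etree.XSLT.strparam(raw))
print(str(result))
```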

There's also a convenience method on the tree object for doing XSL transformations. This is less efficient if you want to apply the same XSL transformation to multiple documents, but is shorter to write for one-shot operations, as you do not have to instantiate a stylesheet yourself:

>>> result = doc.xslt(xslt_tree, a="'A'")
>>> str(result)
'<?xml version="1.0"?>\n<foo>A</foo>\n'

By default, XSLT supports all extension functions from libxslt and libexslt as well as Python regular expressions through EXSLT. Note that some extensions enable style sheets to read and write files on the local file system. See the document loader documentation on how to deal with this.

If you want to know how your stylesheet performed, pass the profile_run keyword to the transform:

>>> result = transform(doc, a="/a/b/text()", profile_run=True)
>>> profile = result.xslt_profile

The value of the xslt_profile property is an ElementTree with profiling data about each template, similar to the following:

<profile>
  <template rank="1" match="/" name="" mode="" calls="1" time="1" average="1"/>
</profile>

Note that this is a read-only document. You must not move any of its elements to other documents. Please deep-copy the document if you need to modify it. If you want to free it from memory, just do:

>>> del result.xslt_profile

RelaxNG

lxml.etree introduces a new class, lxml.etree.RelaxNG. The class can be given an ElementTree object to construct a Relax NG validator:

>>> f = StringIO('''\
... <element name="a" xmlns="http://relaxng.org/ns/structure/1.0">
...  <zeroOrMore>
...     <element name="b">
...       <text />
...     </element>
...  </zeroOrMore>
... </element>
... ''')
>>> relaxng_doc = etree.parse(f)
>>> relaxng = etree.RelaxNG(relaxng_doc)

You can then validate some ElementTree document against the schema. You'll get back True if the document is valid against the Relax NG schema, and False if not:

>>> valid = StringIO('<a><b></b></a>')
>>> doc = etree.parse(valid)
>>> relaxng.validate(doc)
1

>>> invalid = StringIO('<a><c></c></a>')
>>> doc2 = etree.parse(invalid)
>>> relaxng.validate(doc2)
0

Calling the schema object has the same effect as calling its validate method. This is sometimes used in conditional statements:

>>> invalid = StringIO('<a><c></c></a>')
>>> doc2 = etree.parse(invalid)
>>> if not relaxng(doc2):
...     print "invalid!"
invalid!

If you prefer getting an exception when validating, you can use the assert_ or assertValid methods:

>>> relaxng.assertValid(doc2)
Traceback (most recent call last):
  [...]
DocumentInvalid: Document does not comply with schema

>>> relaxng.assert_(doc2)
Traceback (most recent call last):
  [...]
AssertionError: Document does not comply with schema

Starting with version 0.9, lxml has a simple API to report the errors generated by libxml2. If you want to find out why the validation failed in the second case, you can look up the error log of the validation process and check it for relevant messages:

>>> log = relaxng.error_log
>>> print log.last_error
<string>:1:ERROR:RELAXNGV:ERR_LT_IN_ATTRIBUTE: Did not expect element c there

You can see that the error (ERROR) happened during RelaxNG validation (RELAXNGV). The message then tells you what went wrong. Note that this error log is local to the RelaxNG object. It will only contain log entries that appeared during the validation. The DocumentInvalid exception raised by the assertValid method above provides access to the global error log (like all other lxml exceptions).
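
One way to use the local error log is to collect its messages whenever validation fails. The following is a sketch under assumptions (Python 3, a current lxml release); validation_messages() is a hypothetical helper and the message texts depend on the libxml2 version:

```python
from io import BytesIO
from lxml import etree

schema = etree.RelaxNG(etree.parse(BytesIO(b'''\
<element name="a" xmlns="http://relaxng.org/ns/structure/1.0">
  <zeroOrMore>
    <element name="b"><text/></element>
  </zeroOrMore>
</element>''')))

def validation_messages(validator, doc):
    """Return [] for a valid document, else the validator's log messages."""
    if validator.validate(doc):
        return []
    return [entry.message for entry in validator.error_log]

good = etree.parse(BytesIO(b"<a><b></b></a>"))
bad = etree.parse(BytesIO(b"<a><c></c></a>"))

print(validation_messages(schema, good))
print(validation_messages(schema, bad))
```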

Similar to XSLT, there's also a less efficient but easier shortcut method to do one-shot RelaxNG validation:

>>> doc.relaxng(relaxng_doc)
1
>>> doc2.relaxng(relaxng_doc)
0

XMLSchema

lxml.etree also has XML Schema (XSD) support, using the class lxml.etree.XMLSchema. This support is very similar to the Relax NG support. The class can be given an ElementTree object to construct an XMLSchema validator:

>>> f = StringIO('''\
... <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
... <xsd:element name="a" type="AType"/>
... <xsd:complexType name="AType">
...   <xsd:sequence>
...     <xsd:element name="b" type="xsd:string" />
...   </xsd:sequence>
... </xsd:complexType>
... </xsd:schema>
... ''')
>>> xmlschema_doc = etree.parse(f)
>>> xmlschema = etree.XMLSchema(xmlschema_doc)

You can then validate some ElementTree document with this. As with RelaxNG, you'll get back true if the document is valid against the XML schema, and false if not:

>>> valid = StringIO('<a><b></b></a>')
>>> doc = etree.parse(valid)
>>> xmlschema.validate(doc)
1

>>> invalid = StringIO('<a><c></c></a>')
>>> doc2 = etree.parse(invalid)
>>> xmlschema.validate(doc2)
0

Calling the schema object has the same effect as calling its validate method. This is sometimes used in conditional statements:

>>> invalid = StringIO('<a><c></c></a>')
>>> doc2 = etree.parse(invalid)
>>> if not xmlschema(doc2):
...     print "invalid!"
invalid!

If you prefer getting an exception when validating, you can use the assert_ or assertValid methods:

>>> xmlschema.assertValid(doc2)
Traceback (most recent call last):
  [...]
DocumentInvalid: Document does not comply with schema

>>> xmlschema.assert_(doc2)
Traceback (most recent call last):
  [...]
AssertionError: Document does not comply with schema

Error reporting works as it does for the RelaxNG class:

>>> log = xmlschema.error_log
>>> error = log.last_error
>>> print error.domain_name
SCHEMASV
>>> print error.type_name
SCHEMAV_ELEMENT_CONTENT

If you were to print this log entry, you would get something like the following. Note that the error message depends on the libxml2 version in use:

<string>:1:ERROR::SCHEMAV_ELEMENT_CONTENT: Element 'c': This element is not expected. Expected is ( b ).

Similar to XSLT and RelaxNG, there's also a less efficient but easier shortcut method to do XML Schema validation:

>>> doc.xmlschema(xmlschema_doc)
1
>>> doc2.xmlschema(xmlschema_doc)
0

xinclude

Simple XInclude support exists. You can let lxml process xinclude statements in a document by calling the xinclude() method on a tree:

>>> data = StringIO('''\
... <doc xmlns:xi="http://www.w3.org/2001/XInclude">
... <foo/>
... <xi:include href="doc/test.xml" />
... </doc>''')

>>> tree = etree.parse(data)
>>> tree.xinclude()
>>> etree.tostring(tree.getroot())
'<doc xmlns:xi="http://www.w3.org/2001/XInclude">\n<foo/>\n<a xml:base="doc/test.xml"/>\n</doc>'

write_c14n on ElementTree

The lxml.etree.ElementTree class has a method write_c14n, which takes a file object as argument. This file object will receive a UTF-8 representation of the canonicalized form of the XML, following the W3C C14N recommendation. For example:

>>> f = StringIO('<a><b/></a>')
>>> tree = etree.parse(f)
>>> f2 = StringIO()
>>> tree.write_c14n(f2)
>>> f2.getvalue()
'<a><b></b></a>'
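
Canonicalization is mainly useful for comparing documents that differ only in serialization details. The sketch below assumes Python 3, where the output is written as bytes, and a current lxml release; canonical() is a hypothetical helper:

```python
from io import BytesIO
from lxml import etree

def canonical(xml_bytes):
    """Return the C14N form of an XML byte string."""
    tree = etree.parse(BytesIO(xml_bytes))
    out = BytesIO()
    tree.write_c14n(out)
    return out.getvalue()

# Attribute order and empty-element syntax are normalized.
first = canonical(b'<a z="1" b="2"><b/></a>')
second = canonical(b"<a b='2' z='1'><b></b></a>")

print(first)             # b'<a b="2" z="1"><b></b></a>'
print(first == second)   # True
```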