lxml.etree versus ElementTree
=============================
A lot of care has been taken to ensure compatibility between etree and
ElementTree. Nonetheless some differences and incompatibilities exist:
* Importing etree is obviously different; etree uses a lower case
package name, while ElementTree a combination of upper-case and
lower case in imports::
# etree
from lxml.etree import Element
# ElementTree
from elementtree.ElementTree import Element
# ElementTree in the Python 2.5 standard library
from xml.etree.ElementTree import Element
When switching over code from ElementTree to lxml.etree, and you're using
the package name prefix 'ElementTree', you can do the following::
# instead of
from elementtree import ElementTree
# use
from lxml import etree as ElementTree
* Some minor parts of the API of ElementTree have not yet been implemented and
are thus missing in lxml.etree. Feel free to help out!
* Then again, lxml.etree offers a lot more functionality, such as
XPath, XSLT, Relax NG, and XML Schema support, which (c)ElementTree
does not offer.
* etree has a different idea about Python unicode strings than ElementTree.
In most parts of the API, ElementTree uses plain strings and unicode strings
as what they are. This includes Element.text, Element.tail and many other
places. However, the ElementTree parsers assume by default that any string
(`str` or `unicode`) contains ASCII data. They raise an exception if
strings do not match the expected encoding.
etree has the same idea about plain strings (`str`) as ElementTree. For
unicode strings, however, etree assumes throughout the API that they are
Python unicode encoded strings rather than byte data. This includes the
parsers. It is therefore perfectly correct to pass XML unicode data into
the etree parsers in form of Python unicode strings. It is an error, on the
other hand, if unicode strings specify an encoding in their XML declaration,
as this conflicts with the characteristic encoding of Python unicode
strings.
* ElementTree allows you to place an Element in two different trees at the
same time. Thus, this::
a = Element('a')
b = SubElement(a, 'b')
c = Element('c')
c.append(b)
will result in the following tree a::
and the following tree c::
In lxml, this behavior is different, because lxml is built on top of a tree
that maintains parent relationships for elements (like W3C DOM). This means
an element can only exist in a single tree at the same time. Adding an
element in some tree to another tree will cause this element to be moved.
So, for tree a we will get::
and for tree c we will get::
Unfortunately this is a rather fundamental difference in behavior, which
will be hard to solve. It won't affect some applications, but if you want
to port code you must unfortunately make sure that it doesn't.
* When trying to set a subelement using __setitem__ that is in fact not an
Element but some other object, etree raises a TypeError, and ElementTree
raises an AssertionError. This also applies to some other places of the
API. In general, etree tries to avoid AssertionErrors in favour of being
more specific about the reason for the exception.
* When parsing fails in ``iterparse()``, ElementTree raises an ExpatError
instead of a SyntaxError. lxml.etree follows the other parts of the parser
API and raises an (XML)SyntaxError.
* The ``iterparse()`` function in lxml is implemented based on the libxml2
parser. This means that modifications of the document root or the ancestors
of the current element during parsing can irritate the parser and even
segfault. While this is not a problem in the Python object structure used
by ElementTree, the C tree underlying lxml suffers from it. The golden rule
for ``iterparse()`` on lxml therefore is: do not touch anything that will
have to be touched again by the parser later on. See the lxml API
documentation on this.
* ElementTree ignores comments and processing instructions when parsing XML,
while etree will read them in and treat them as Comment or
ProcessingInstruction elements respectively.
* ElementTree has a bug when serializing an empty Comment (no text argument
given) to XML, etree serializes this successfully.
* ElementTree adds whitespace around comments on serialization, lxml does
not. This means that a comment text "text" that ElementTree serializes as
"" will become "" in lxml.
* ElementTree merges the target of a processing instruction into ``PI.text``,
while lxml.etree puts it into the ``.target`` property and leaves it out of
the ``.text`` property.
* Because etree is built on top of libxml2, which is namespace prefix aware,
etree preserves namespaces declarations and prefixes while ElementTree tends
to come up with its own prefixes (ns0, ns1, etc). When no namespace prefix
is given however, etree creates ElementTree style prefixes as well.
* etree has a 'prefix' attribute (read-only) on elements giving the Element's
prefix, if this is known, and None otherwise (in case of no namespace at
all, or default namespace).
etree further allows passing an 'nsmap' dictionary to the Element and
SubElement element factories to explicitly map namespace prefixes to
namespace URIs. These will be translated into namespace declarations on
that element. This means that in the probably rare case that you need to
construct an attribute called 'nsmap', you need to be aware that unlike in
ElementTree, you cannot pass it as a keyword argument to the Element and
SubElement factories directly.
* etree elements can be copied using copy.deepcopy() and copy.copy(), just
like ElementTree's. copy.copy() however does *not* create a shallow copy
where elements are shared between trees, as this makes no sense in the
context of libxml2 trees. Note that lxml can deep-copy trees considerably
faster than ElementTree.
* etree allows navigation to the parent of a node by the ``getparent()``
method and to the siblings by calling ``getnext()`` and ``getprevious()``.
This is not possible in ElementTree as the underlying tree model does not
have this information.