Hachoir is the French name for a mincer: a tool used by butchers to cut
meat. Hachoir is also a tool written for hackers to cut a file or any
binary stream. A file is split into a tree of fields, where the smallest
field can be a single bit. There are various field types: integer,
string, bits, padding, sub file, etc.

This document is a presentation of the Hachoir API. It tries to show the
most interesting parts of this tool, but is not exhaustive. Ok, let's
start!

hachoir.stream: Stream manipulation
===================================

To split data, we first need to get data :-) So this section presents
the "hachoir.stream" API. In most cases we work on files using the
FileInputStream() function. This function takes one argument: a Unicode
filename. But for practical reasons we will use the StringInputStream()
function in this documentation.

>>> data = "point\0\3\0\2\0"
>>> from hachoir_core.stream import StringInputStream, LITTLE_ENDIAN
>>> stream = StringInputStream(data)
>>> stream.source
''
>>> len(data), stream.size
(10, 80)
>>> data[1:6], stream.readBytes(8, 5)
('oint\x00', 'oint\x00')
>>> data[6:8], stream.readBits(6*8, 16, LITTLE_ENDIAN)
('\x03\x00', 3)
>>> data[8:10], stream.readBits(8*8, 16, LITTLE_ENDIAN)
('\x02\x00', 2)

The first big difference between a string and a Hachoir stream is that
sizes and addresses are expressed in bits, not bytes. The difference is
a factor of eight; that's why we write "6*8" to address the sixth byte,
for example.

You don't need to know anything else to use Hachoir, so let's play with
fields!

hachoir.field: Field manipulation
=================================

Basic parser
------------

We will parse the data used in the previous section.

>>> from hachoir_core.field import Parser, CString, UInt16
>>> class Point(Parser):
...     endian = LITTLE_ENDIAN
...     def createFields(self):
...         yield CString(self, "name", "Point name")
...         yield UInt16(self, "x", "X coordinate")
...         yield UInt16(self, "y", "Y coordinate")
...
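As a cross-check of the layout that the Point parser describes, here is a
plain-Python sketch using only the standard struct module. This is not
Hachoir code; note that struct counts offsets in bytes, while Hachoir
counts them in bits:

```python
import struct

data = b"point\x00\x03\x00\x02\x00"

# Split at the NUL terminator: everything before it is the C string.
nul = data.index(b"\x00")
name = data[:nul].decode("ascii")

# Two little-endian unsigned 16-bit integers follow the terminator.
x, y = struct.unpack("<HH", data[nul + 1:nul + 5])

print(name, x, y)  # point 3 2
```

The byte offset 6 used by struct here is exactly the bit address 6*8
used in the stream example above.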
>>> point = Point(stream)
>>> for field in point:
...     print "%s) %s=%s" % (field.address, field.name, field.display)
...
0) name="point"
48) x=3
64) y=2

`point` is the root of our field tree. This tree is really simple: it
has just one level and three fields: name, x and y.

Hachoir stores a lot of information in each field. In this example we
only show the address, name and display attributes, but a field has more
attributes:

>>> x = point["x"]
>>> "%s = %s" % (x.path, x.value)
'/x = 3'
>>> x.parent == point
True
>>> x.description
'X coordinate'
>>> x.index
1
>>> x.address, x.absolute_address
(48, 48)

The index is the position of the field in its parent's field list: '1'
means that it is the second field, since the index starts at zero.

Parser with sub-field sets
--------------------------

After learning the basic API, let's see a more complex parser: a parser
with sub-field sets.

>>> from hachoir_core.field import FieldSet, UInt8, Character, String
>>> class Entry(FieldSet):
...     def createFields(self):
...         yield Character(self, "letter")
...         yield UInt8(self, "code")
...
>>> class MyFormat(Parser):
...     endian = LITTLE_ENDIAN
...     def createFields(self):
...         yield String(self, "signature", 3, charset="ASCII")
...         yield UInt8(self, "count")
...         for index in xrange(self["count"].value):
...             yield Entry(self, "point[]")
...
>>> data = "MYF\3a\0b\2c\0"
>>> stream = StringInputStream(data)
>>> root = MyFormat(stream)

This example presents many interesting features of Hachoir. First of
all, you can see that you can have two or more levels of fields. Here we
have a tree with two levels:

>>> def displayTree(parent):
...     for field in parent:
...         print field.path
...         if field.is_field_set: displayTree(field)
...
>>> displayTree(root)
/signature
/count
/point[0]
/point[0]/letter
/point[0]/code
/point[1]
/point[1]/letter
/point[1]/code
/point[2]
/point[2]/letter
/point[2]/code

A field set is also a field, so it has the same attributes as any other
field (name, address, size, path, etc.)
but has some new attributes, like stream or root.

Lazy feature
------------

Hachoir is written in Python, so it should be slow and eat a lot of CPU
and memory... and it does. But in most cases, you don't need to explore
an entire field set and read all values; you just need to read the
values of some specific fields. Hachoir is really lazy: no field is
parsed before you ask for it, no value is read from the stream before
you read a value, etc. To observe this behaviour, you can watch
"current_length" (the number of fields read so far) and "current_size"
(the current size in bits of a field set):

>>> root = MyFormat(stream)   # Rebuild our parser
>>> print (root.current_length, root.current_size)
(0, 0)
>>> print root["signature"].display
"MYF"
>>> print (root.current_length, root.current_size, root["signature"].size)
(1, 24, 24)

Just after its creation, a parser is empty (0 fields). When we read the
first field, the parser's size becomes the size of that field. Some
operations require reading more fields:

>>> print root["point[0]/letter"].display
'a'
>>> print (root.current_length, root.current_size)
(3, 48)

Reading point[0] requires reading the field "count", so root now
contains three fields.

List of field types
===================

Number:

* Bit: one bit (True/False);
* Bits: unsigned number with a size in bits;
* Bytes: vector of known bytes (e.g. a file signature);
* UInt8, UInt16, UInt24, UInt32, UInt64: unsigned numbers (sizes: 8,
  16, ... bits);
* Int8, Int16, Int24, Int32, Int64: signed numbers (sizes: 8, 16, ...
  bits);
* Float32, Float64, Float80: IEEE 754 floating point numbers (32, 64,
  80 bits).

Text:

* Character: 8-bit ASCII character;
* String: fixed-length string;
* CString: string ending with a nul byte ("\\0");
* UnixLine: string ending with a new line character ("\\n");
* PascalString8, PascalString16 and PascalString32: string prefixed
  with its length as an unsigned 8 / 16 / 32 bit integer (uses the
  parent's endianness).

Timestamp:

* TimestampMSDOS32: 32-bit MS-DOS, since January 1st 1980;
* TimestampUnix32: 32-bit UNIX, seconds since January 1st 1970;
* TimestampMac32: 32-bit Mac, seconds since January 1st 1904;
* TimestampWin64: 64-bit Windows, 100-nanosecond intervals since
  January 1st 1601.

Padding and raw bytes:

* PaddingBits/PaddingBytes: padding with a size in bits/bytes;
* NullBits/NullBytes: null padding with a size in bits/bytes;
* RawBits/RawBytes: unknown content with a size in bits/bytes;
* SubFile: a file contained in the stream.

To create your own type, you can use:

* GenericInteger: integer;
* GenericString: string;
* FieldSet: set of other fields;
* Parser: the main class to parse a stream.

Field class
===========

Read-only attributes:

* name (str): name of the field, unique within the parent field set
* address (long): address in bits, relative to the parent's address
* absolute_address (long): address in bits, relative to the input stream
* parent (GenericFieldSet): parent field (a field set)
* is_field_set (bool): does the field contain other fields? <~~~ can be
  replaced
* index (int): index of the field in the parent field set (the first
  index is 0)

Read-only and lazy attributes:

* size (long), cached: size of the field in bits
* description (str|unicode), cached: informal description
* display (unicode): human-readable representation of the value, as a
  unicode string
* raw_display (unicode): raw representation of the value, as a unicode
  string
* path (str): slash-separated concatenation of all field names from the
  root field

Methods that can be replaced:

* createDescription(): create the value of the 'description' attribute
* createValue(): create the value of the 'value' attribute
* createDisplay(): create the value of the 'display' attribute
* _createInputStream(): create an InputStream containing the field
  content

Aliases (methods):

* __str__() <=> read the display attribute
* __unicode__() <=> read the display attribute
* __getitem__(key): alias for getField(key, False)

Other methods:

* static_size: helper to compute the field size. If the value is an
  integer, the type has a constant size. If it's a function, the size
  depends on the arguments.
* hasValue(): check whether the field has a value (default:
  self.value is not None)
* getField(key, const=True): get the field with the specified key; if
  const is True, the field set will not be changed
* __contains__(key)
* getSubIStream(): return a tagged InputStream containing the field
  content
* setSubIStream(): helper to replace _createInputStream() (the old one
  is passed to the new one to allow chaining)

Field set class
===============

Read-only attributes:

* endian: value is BIG_ENDIAN or LITTLE_ENDIAN, the way bits are
  written in the input stream <~~ can be replaced
* stream (InputStream): input stream
* root (FieldSet): root of all fields
* eof (bool): End Of File: are we at the end of the input stream?
* done (bool): is the parser done?
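The endian attribute determines how multi-byte numbers are decoded from
the stream. Here is a quick standard-library illustration of why it
matters (plain Python, not Hachoir), reading the same two bytes both
ways:

```python
import struct

raw = b"\x03\x00"  # the bytes stored in the "x" field of the Point example

# Little endian: least significant byte first.
little, = struct.unpack("<H", raw)
# Big endian: most significant byte first.
big, = struct.unpack(">H", raw)

print(little, big)  # 3 768
```

The Point parser set endian = LITTLE_ENDIAN, which is why its "x" field
decodes to 3 rather than 768 (0x0300).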
Read-only and lazy attributes:

* current_size (long): current size in bits
* current_length (long): current number of children

Methods:

* connectEvent(event, handler, local=True): connect a handler to an
  event
* raiseEvent(event, \*args): raise an event
* reset(): clear all caches, but keep the size if it is known
* setUniqueFieldName(): for a field whose name ends with "[]", replace
  "[]" with a unique identifier, e.g. "item[]" => "item[0]"
* seekBit(address, ...): create a field to seek to the specified bit
  address, or return None if we are already there
* seekByte(address, ...): create a field to seek to the specified byte
  address, or return None if we are already there
* replaceField(name, fields): replace a field with one or more fields
  <~~~ I don't like this method :-(
* getFieldByAddress(address, feed=True): get the field at the specified
  address
* writeFieldsIn(old, address, new): helper for replaceField() <~~~ can
  be a helper?

Lazy methods:

* array(): create a FakeArray to easily get a field by its index (see
  the FakeArray API for more details)
* __len__(): number of children in the field set
* readFirstFields(number): read the first 'number' fields; return the
  number of new fields
* readMoreFields(number): read 'number' more fields; return the number
  of new fields
* __iter__(): iterate over children
* createFields(): main function of the parser, creates the fields.
  Don't call this function directly.
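The lazy behaviour described earlier rests on createFields() being a
generator: children are materialized only when something asks for them.
Here is a toy sketch of that idea using a hypothetical LazyFieldSet
class (an illustration only, not the real Hachoir implementation):

```python
class LazyFieldSet(object):
    """Toy model of Hachoir's lazy field sets: children are pulled
    from the createFields() generator only on demand."""

    def __init__(self, create_fields):
        self._generator = create_fields()
        self._fields = []          # fields parsed so far

    @property
    def current_length(self):
        return len(self._fields)

    def readMoreFields(self, number):
        """Parse up to 'number' more fields; return how many were new."""
        added = 0
        for _ in range(number):
            try:
                self._fields.append(next(self._generator))
            except StopIteration:
                break
            added += 1
        return added


def createFields():
    # Stand-ins for real field objects.
    yield ("signature", "MYF")
    yield ("count", 3)


fs = LazyFieldSet(createFields)
print(fs.current_length)        # 0: nothing parsed yet
print(fs.readMoreFields(1))     # 1: one field pulled from the generator
print(fs.current_length)        # 1
```

This mirrors what the current_length doctest showed: the parser starts
empty, and each field is only created when a lookup or an explicit
readMoreFields() call forces it.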