Considering String Allocation and Copying in Tla 1.x
Is String Copying in libawk
a Problem Worth Fixing?
Hypothesis:
Excessive string copying is a significant (attention deserving) performance problem in
tla
, and most of the problem is caused by the behavior oflibawk
.
Background
libawk
in tla
provides two data structures used throughout most
of tla
: a relational table (a 2-d, dynamically sized, array of
strings) and an associative table (a mutable associative mapping
of strings to strings). Some basic relational operators are
provided for the tables.
These data structures are represented quite simply. The relational
tables are dynamically allocated arrays of pointers (each element a
table row), pointing to dynamically allocated arrays of pointers
(each element a column within one row), pointing to dynamically
allocated 0-terminated strings of unsigned char
values. The
associative tables are simple hash tables, with keys and values
represented by privately allocated, 0-terminated strings of unsigned
char
values.
A noteworthy property of the libawk
implementation is that
operations which create a new table (relational or associative) must
newly allocate copies of all of the strings in that table,
regardless of where the original data comes from.
For example: a join
operation accepts two tables as input and
produces as output a third table. All strings in the output table
have identical contents to some string in one or both of the input
tables, but join
as currently implemented allocates a new copy of
those strings anyway.
Another example: suppose that I want two copies of a single table,
each sorted using a different column as the sort key (as is common
in some tla
operations). Each string in either table is
duplicated in the other, the duplicate being a separately allocated
copy.
Envelope Calculations and Anecdotal Profiling/Tuning Results
To put some rough numbers to this, I examined an arch import of the linux kernel prepared by John Goerzen. I computed an inventory:
% tla inventory --source --both --ids > ,foo
which produces a list of all source files which are not arch control files, listing the relative path to each file along with that files inventory id, one such pair per line. This listing disregards all arch control files on the grounds that, in theory, arch might be modified to shrink the list of such files by a large amount.
The size of the resulting listing is a reasonable approximation of the amount of memory that must be dynamically allocated to create an in-core table containing the inventory. It is also approximately the amount of memory needed to allocate the keys and values of an associative table mapping file paths to ids or vice versa.
The number of lines in that listing is basically half the number of
calls to malloc
needed to create such a table and half the number
of calls to free
needed to free such a table.
Some performance critical operations in arch are likely to create
(approximately but also "at least") two (distinctively sorted)
tables of the (file-path, inventory-id)
relationship and two
associative tables mapping file paths to ids and ids to inventory
ids.
The listing size for this particular copy of the kernel has a bit over 30,00 lines and is just a bit over 1.5MB long.
Four tables (two associative, two relational) will require
n_allocations == 4 * (30000 * 2) == 240000 n_string_bytes == 4 * 1.5MB == 6MB
All four tables can be constructed from a single one, returned by
the inventory
primitivie. That construction, in addition to
allocating new storage for strings, will copy the contents of each
string three times: 4.5MB of string copying spread across 180000
separate strings.
These expenses are likely to be small relative to the costs of system calls made by arch and, especially on small trees, are likely to be insignificant. Alas, there is a limit to how much arch's use of system calls can be optimized and my sense is that we are nearing that limit.
In absolute terms (such as "wall clock time") these and related
string and table expenses are quite noticable. For example, several
arch developers have obtained noteworthy performance improvements
simply by replacing the unoptimized string copying operations with
calls to the highly optimized versions typically found in the native
libc
of each host system.
In my own experience, substitution of a silghtly slower malloc
for a native malloc also has a tangible performance impact and
quick-n-dirty profiling with gprof
suggests that the vast majority
of these calls originate in libawk
.
These calculations and experience with profiling and optimizing
suggest that attention to libawk
, especially with the aim of
eliminating string allocations and copies, is likely to be rewarded
with a somewhat faster, significantly more memory-efficient
tla
.
Secondary Reasons to Want to Change String Handling
The performance incentives for changing string handling are
not the only reason to reconsider string handling in libawk
.
I have three other reasons (at least):
The libawk
Abstractions Do Not Require String Copying
libawk
was originally intended to provide mutable tables
of immutable strings. In other words, clients of the library
should be free to add or delete rows or columns, keys and values --
but they should not ever directly modify a string which is an
element of a table.
As currently presented, the libawk
API does not enforce that
aspect of its abstraction. To the degree it is successfully
honored by the rest of the code in tla
(which is not entirely
clear), the abstraction is honored only by careful coding with
no particular assistence from the C type system.
Modifying the libawk
API to enforce the immutability of
strings would make client code of that library easier to
understand, maintain, verify, and modify.
The Representation Type of libawk
Strings Will Probably Change Anyway
It is conceivable that, as tla
acquires support for
extended character sets in filenames and other strings,
these will be simply be represented internally as
Unicode strings, using the UTF-8 encoding form.
It is also conceivable (and in my view more sensible) if strings might internally be represented as Unicode strings using a different encoding form (such as UTF-16 or UTF-32, depending on circumstance).
Regardless, either way, the shift from presumed-ASCII strings
to strings which sometimes include extended characters is
likely to be a difficult one to implement. (Indeed, one
justification for beginning tla 2.0
is to avoid having to make
this shift on the tla 1.x
code base by instead providing a
cleaner-from-the-start foundation in tla 2.0
).
If we want to potentiate making the shift to extended characters in
tla 1.x
, modifying the libawk
API so that internal string
representations are better hidden is a definate help.
Currently, much code in tla 1.x
assumes that, for example,
the string in the third column of the second row of a table
can be accessed as a value of type t_uchar *
using a
construct such as:
t_uchar * table_elt = table[1][2];
Eliminating such constructs by hiding them behind a procedural abstraction (i.e., not having clients use array subscripting but, instead, having them call a function to access elements) would be progress on both eliminating needless string copying and preparing the way for extended character sets.
Hash-consed and/or Constructive Strings May be Desirable
[This is, by far, the most speculative rationale for
changing libawk
.]
String comparison is an important operation in tla
, both
comparison for equality and comparison for lexical order.
Unicode, necessarily, makes comparison operations considerably more expensive than their ASCII-only analogs.
One sometimes applicable technique for reducing the expense of such operations in a string-intensive application is to (a) use only immutable strings; (b) never allocate two strings with identical contents (always share a single string) -- this gives rise to a fast equality test; (c) memoize the results of all string comparisons for order, doing so in a manner that allows a quick inference of the order between two previously uncompared strings provided that they have both been compared to some other string or to two other strings whose order is known (or less expensively inferrable).
Another sometimes applicable technique for reducing string copying and, in particular, speeding up operations such as substring extraction and string concatenation is to use what might be called "constructive strings" (sometimes called "ropes" after an implementation by Hans Boehm).
Using reference-counted immutable strings in libawk
opens the
door to trying out either of these techniques.
Program for Research
I am interested in finding an easy way to eliminate libawk
-related
string copying.
There are many ways a libawk
-style API might work without so
much string copying but I think that the most straightforward
of these would be to:
1. Reference-count
libawk
Strings Strings do not contain references to other objects and thus are not subject to being part of data structure containing a cycle of references. For that reason, reference counting is a natural fit for memory managing them.2. Make the
libawk
Types Opaque The internal representations inlibawk
should be hidden from clients.
My intention, then, isf to proceed first by making such changes to
libawk
. This will immediately break a good deal of code
throughout tla
. The breakage will be of a variety that a
C compiler will report it -- this is convenient because the
compiler will help to trace down all code which must be modified
to honor the corrected libawk
abstractions.
Current Questions
Have other Arch developers tried a similar experiment already?
Do other Arch developers agree with the conclusions I draw from my envelope calculations and the anecdotal profiling and tuning results?
What is the comfort-level among Arch developers regarding the
prospects of migrating such a change, which is likely to be
very large in scope, from a tla-1.3.1
branch to other branches
such as baz1.x
and tla-1.4
?
Copyright
Copyright (C) 2004 Tom Lord
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
See the file COPYING
for further information about
the copyright and warranty status of this work.