Tom Lord's Hackery

Considering String Allocation and Copying in Tla 1.x

Is String Copying in libawk a Problem Worth Fixing?

Hypothesis:

Excessive string copying is a significant (attention deserving) performance problem in tla, and most of the problem is caused by the behavior of libawk.

Background

libawk in tla provides two data structures used throughout most of tla: a relational table (a 2-d, dynamically sized, array of strings) and an associative table (a mutable associative mapping of strings to strings). Some basic relational operators are provided for the tables.

These data structures are represented quite simply. The relational tables are dynamically allocated arrays of pointers (each element a table row), pointing to dynamically allocated arrays of pointers (each element a column within one row), pointing to dynamically allocated 0-terminated strings of unsigned char values. The associative tables are simple hash tables, with keys and values represented by privately allocated, 0-terminated strings of unsigned char values.

A noteworthy property of the libawk implementation is that operations which create a new table (relational or associative) must newly allocate copies of all of the strings in that table, regardless of where the original data comes from.

For example: a join operation accepts two tables as input and produces as output a third table. All strings in the output table have identical contents to some string in one or both of the input tables, but join as currently implemented allocates a new copy of those strings anyway.

Another example: suppose that I want two copies of a single table, each sorted using a different column as the sort key (as is common in some tla operations). Each string in either table is duplicated in the other, the duplicate being a separately allocated copy.

Envelope Calculations and Anecdotal Profiling/Tuning Results

To put some rough numbers to this, I examined an arch import of the linux kernel prepared by John Goerzen. I computed an inventory:

      % tla inventory --source --both --ids > ,foo

  

which produces a list of all source files which are not arch control files, listing the relative path to each file along with that files inventory id, one such pair per line. This listing disregards all arch control files on the grounds that, in theory, arch might be modified to shrink the list of such files by a large amount.

The size of the resulting listing is a reasonable approximation of the amount of memory that must be dynamically allocated to create an in-core table containing the inventory. It is also approximately the amount of memory needed to allocate the keys and values of an associative table mapping file paths to ids or vice versa.

The number of lines in that listing is basically half the number of calls to malloc needed to create such a table and half the number of calls to free needed to free such a table.

Some performance critical operations in arch are likely to create (approximately but also "at least") two (distinctively sorted) tables of the (file-path, inventory-id) relationship and two associative tables mapping file paths to ids and ids to inventory ids.

The listing size for this particular copy of the kernel has a bit over 30,00 lines and is just a bit over 1.5MB long.

Four tables (two associative, two relational) will require


                n_allocations == 4 * (30000 * 2)
                              == 240000


                n_string_bytes == 4 * 1.5MB
                               == 6MB

  

All four tables can be constructed from a single one, returned by the inventory primitivie. That construction, in addition to allocating new storage for strings, will copy the contents of each string three times: 4.5MB of string copying spread across 180000 separate strings.

These expenses are likely to be small relative to the costs of system calls made by arch and, especially on small trees, are likely to be insignificant. Alas, there is a limit to how much arch's use of system calls can be optimized and my sense is that we are nearing that limit.

In absolute terms (such as "wall clock time") these and related string and table expenses are quite noticable. For example, several arch developers have obtained noteworthy performance improvements simply by replacing the unoptimized string copying operations with calls to the highly optimized versions typically found in the native libc of each host system.

In my own experience, substitution of a silghtly slower malloc for a native malloc also has a tangible performance impact and quick-n-dirty profiling with gprof suggests that the vast majority of these calls originate in libawk.

These calculations and experience with profiling and optimizing suggest that attention to libawk, especially with the aim of eliminating string allocations and copies, is likely to be rewarded with a somewhat faster, significantly more memory-efficient tla.

Secondary Reasons to Want to Change String Handling

The performance incentives for changing string handling are not the only reason to reconsider string handling in libawk.

I have three other reasons (at least):

The libawk Abstractions Do Not Require String Copying

libawk was originally intended to provide mutable tables of immutable strings. In other words, clients of the library should be free to add or delete rows or columns, keys and values -- but they should not ever directly modify a string which is an element of a table.

As currently presented, the libawk API does not enforce that aspect of its abstraction. To the degree it is successfully honored by the rest of the code in tla (which is not entirely clear), the abstraction is honored only by careful coding with no particular assistence from the C type system.

Modifying the libawk API to enforce the immutability of strings would make client code of that library easier to understand, maintain, verify, and modify.

The Representation Type of libawk Strings Will Probably Change Anyway

It is conceivable that, as tla acquires support for extended character sets in filenames and other strings, these will be simply be represented internally as Unicode strings, using the UTF-8 encoding form.

It is also conceivable (and in my view more sensible) if strings might internally be represented as Unicode strings using a different encoding form (such as UTF-16 or UTF-32, depending on circumstance).

Regardless, either way, the shift from presumed-ASCII strings to strings which sometimes include extended characters is likely to be a difficult one to implement. (Indeed, one justification for beginning tla 2.0 is to avoid having to make this shift on the tla 1.x code base by instead providing a cleaner-from-the-start foundation in tla 2.0).

If we want to potentiate making the shift to extended characters in tla 1.x, modifying the libawk API so that internal string representations are better hidden is a definate help.

Currently, much code in tla 1.x assumes that, for example, the string in the third column of the second row of a table can be accessed as a value of type t_uchar * using a construct such as:

        t_uchar * table_elt = table[1][2];

    

Eliminating such constructs by hiding them behind a procedural abstraction (i.e., not having clients use array subscripting but, instead, having them call a function to access elements) would be progress on both eliminating needless string copying and preparing the way for extended character sets.

Hash-consed and/or Constructive Strings May be Desirable

[This is, by far, the most speculative rationale for changing libawk.]

String comparison is an important operation in tla, both comparison for equality and comparison for lexical order.

Unicode, necessarily, makes comparison operations considerably more expensive than their ASCII-only analogs.

One sometimes applicable technique for reducing the expense of such operations in a string-intensive application is to (a) use only immutable strings; (b) never allocate two strings with identical contents (always share a single string) -- this gives rise to a fast equality test; (c) memoize the results of all string comparisons for order, doing so in a manner that allows a quick inference of the order between two previously uncompared strings provided that they have both been compared to some other string or to two other strings whose order is known (or less expensively inferrable).

Another sometimes applicable technique for reducing string copying and, in particular, speeding up operations such as substring extraction and string concatenation is to use what might be called "constructive strings" (sometimes called "ropes" after an implementation by Hans Boehm).

Using reference-counted immutable strings in libawk opens the door to trying out either of these techniques.

Program for Research

I am interested in finding an easy way to eliminate libawk-related string copying.

There are many ways a libawk-style API might work without so much string copying but I think that the most straightforward of these would be to:

1. Reference-count libawk Strings Strings do not contain references to other objects and thus are not subject to being part of data structure containing a cycle of references. For that reason, reference counting is a natural fit for memory managing them.

2. Make the libawk Types Opaque The internal representations in libawk should be hidden from clients.

My intention, then, isf to proceed first by making such changes to libawk. This will immediately break a good deal of code throughout tla. The breakage will be of a variety that a C compiler will report it -- this is convenient because the compiler will help to trace down all code which must be modified to honor the corrected libawk abstractions.

Current Questions

Have other Arch developers tried a similar experiment already?

Do other Arch developers agree with the conclusions I draw from my envelope calculations and the anecdotal profiling and tuning results?

What is the comfort-level among Arch developers regarding the prospects of migrating such a change, which is likely to be very large in scope, from a tla-1.3.1 branch to other branches such as baz1.x and tla-1.4?

Copyright

Copyright (C) 2004 Tom Lord

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

See the file COPYING for further information about the copyright and warranty status of this work.