Yes, this is a lame attempt at explaining what I've built, in the vain
hope that someone will read it and improve it.  I'm writing this on only
about 4 hours of sleep, so my coherence may not be particularly high.

There are a few steps to doing incremental training tests:

1. Get your corpora.  It's best if they're contemporaneous and from a
   single source, because that makes it much easier to sequence and
   group them.  The corpora need to be in the good old familiar
   Data/{Ham,Spam}/{reservoir,Set*} tree.  For my purposes, I wrote the
   es2hs.py tool to grab messages out of my real MH mail archive
   folders; other people may want some other method of getting the
   corpora into the tree.

2. Sort and group the corpora.  When testing, messages are processed in
   sorted order.  Each message must have a unique name consisting of a
   group number and an id number separated by a dash (e.g. 0123-004556).
   I wrote sort+group.py for this.  It sorts the messages into
   chronological order (by topmost Received header) and then groups
   them by 24-hour period.  The group number (0123) is the number of
   full 24-hour periods that elapsed between the time the oldest
   message found was received and the time this message was received.
   The id number (004556) is a unique 0-based ordinal across all
   messages seen, with 000000 given to the oldest message found.  Note
   that this script will run through *all* the files in the Data
   directory, not just those in Data/Ham and Data/Spam.

3. Distribute the corpora into multiple sets so you can do multiple
   similar runs to gauge the validity of the results (similar to a
   cross-validation, but not really).  When testing, all but one set
   will be used for a particular run.  I personally use 5 sets.
   Distribution is done with mksets.py.  It evenly distributes the
   corpora across the sets, keeping the groups evenly distributed, too.
   You can specify the number of sets, limit the number of groups used
   (to make short runs), and limit the number of messages distributed
   per group*set (to simulate less mail per group, and thus get more
   fine-grained results).

4. Run incremental.py to actually process the messages in a training
   and testing run.  How training is done is determined by the regime
   you specify (regimes are defined in the regimes.py file; see the
   perfect and corrected classes for examples).  For large corpora, you
   may want to do the various set runs separately (by specifying the -s
   option) instead of building nsets classifiers all in parallel, since
   memory usage can get high.  Make sure to save the output of
   incremental.py to a file.  By itself it's ugly, but postprocessing
   can make it useful.

5. Postprocess the incremental.py output.  I made mkgraph.py to do
   this; it outputs datasets for plotmtv.  plotmtv is a really neat
   data visualization tool.  Use it.  Love it.

Gods, I need more sleep.

See dotest.sh for a sample of automating steps 4 & 5.

Please, somebody rewrite this file.
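
For anyone who wants to check their message names by hand, the naming
scheme from step 2 can be sketched roughly as follows.  This is not the
code from sort+group.py, just a minimal illustration of the rule as
described above.  The function name assign_names and its input (a dict
mapping some message key to a receive time in epoch seconds) are
assumptions made for the example; it also assumes the topmost Received
header has already been parsed into a timestamp.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def assign_names(received_times):
    """Return a dict mapping each message key to a 'GGGG-NNNNNN' name.

    received_times is a hypothetical input: message key -> receive time
    in epoch seconds (parsing the Received header is not shown here).
    """
    # Sort chronologically; ties are broken by key so that the result
    # is deterministic.
    ordered = sorted(received_times.items(), key=lambda kv: (kv[1], kv[0]))
    if not ordered:
        return {}

    oldest = ordered[0][1]
    names = {}
    for ordinal, (key, when) in enumerate(ordered):
        # Group number: full 24-hour periods elapsed since the oldest
        # message.  Id number: 0-based ordinal over all messages.
        group = int((when - oldest) // SECONDS_PER_DAY)
        names[key] = "%04d-%06d" % (group, ordinal)
    return names
```

Under this sketch, a message received 86399 seconds after the oldest
one still lands in group 0000, while one received exactly 86400 seconds
later starts group 0001.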