#!/bin/sh -xe

# README.linux.words - file used to create linux.words
# Created: Wed Mar 10 09:12:49 1993 by faith@cs.unc.edu (Rik Faith)
# Revised: Sat Mar 13 17:02:08 1993 by faith@cs.unc.edu
#
# Care was taken to be sure that the linux.words list was free of
# copyright. This makes linux.words a suitable /usr/dict/words
# replacement for the Linux community.
#
# Since the majority of the words are from Tanenbaum's minix.dict file,
# the notice from Barry Brachman, included below, should accompany any
# redistribution of this list.

# Here is a detailed explanation of how I created the linux.words file.
#
# This README.linux.words file is actually a shell script that you can
# use to recreate the linux.words file from original sources.
#
# First, I started with minix.dict
# from cs.ubc.ca:/pub/local/src/sp-1.5/wordlists-1.0.tar.Z
#
# The following is from the NOTES file in wordlists-1.0.tar.Z:

# NOTES> These word lists were collected by Barry Brachman
# NOTES> at the University of British Columbia. They
# NOTES> may be freely distributed as long as this notice accompanies them.
# NOTES>
# NOTES> ==================================================================
# NOTES> Info for minix.dict:
# NOTES>
# NOTES> Article 1997 of comp.os.minix:
# NOTES> From: ast@botter.UUCP
# NOTES> Subject: A spelling checker for MINIX
# NOTES> Date: 6 Jan 88 22:28:22 GMT
# NOTES> Reply-To: ast@cs.vu.nl (Andy Tanenbaum)
# NOTES> Organization: VU Informatica, Amsterdam
# NOTES>
# NOTES> This dictionary is NOT based on the UNIX dictionary so it is free
# NOTES> of AT&T copyright. I built the dictionary from three sources.
# NOTES> First, I started by sorting and uniq'ing some public domain
# NOTES> dictionaries. Second, as some of you probably know, I have
# NOTES> written somewhere between 3 and 6 books (depending on precisely
# NOTES> what you count) and an additional 50 published papers on operating
# NOTES> systems, networks, compilers, languages, etc.
# NOTES> This data base,
# NOTES> which is online, is nonnegligible :-) Finally, I added a number of
# NOTES> words that I thought ought to be in the dictionary including all
# NOTES> the U.S. states, all the European and some other major countries,
# NOTES> principal U.S. and world cities, and a bunch of technical terms.
# NOTES> I don't want my spelling checker to barf on arpanet, diskless,
# NOTES> modem, login, internetwork, subdirectory, superuser, vlsi, or
# NOTES> winchester just because Webster wouldn't approve of them. All in
# NOTES> all, the dictionary is over 40,000 words. If you have any
# NOTES> suggestions for additions or deletions, please post them. But
# NOTES> please be sure you are not infringing on anyone's copyright in
# NOTES> doing so.
# NOTES>
# NOTES> Andy Tanenbaum (ast@cs.vu.nl)

# The main problem with minix.dict is that many proper names are not
# capitalized. So, I got english.tar.Z from ftp.uu.net:/doc/dictionaries,
# which is a mirror of nic.funet.fi:/pub/unix/security/dictionaries.
#
# Here is part of the README file for english.tar.Z:

# README>
# README> FILE: english.words
# README> VERSION: DEC-SRC-92-04-05
# README>
# README> EDITOR
# README>
# README>   Jorge Stolfi
# README>   DEC Systems Research Center
# README>
# README> AUTHORS OF ORIGIONAL WORDLISTS
# README>
# README>   Andy Tanenbaum
# README>   Barry Brachman
# README>   Geoff Kuenning
# README>   Henk Smit
# README>   Walt Buehring
#
# [stuff deleted]
#
# README> AUXILIARY LISTS
# README>
# README>   In the same directory as englis.words there are a few
# README>   complementary word lists, all derived from the same sources
# README>   [1--8] as the main list:
# README>
# README>   english.names
# README>
# README>     A list of common English proper names and their derivatives.
# README>     The list includes: person names ("John", "Abigail",
# README>     "Barrymore"); countries, nations, and cities ("Germany",
# README>     "Gypsies", "Moscow"); historical, biblical and mythological
# README>     figures ("Columbus", "Isaiah", "Ulysses"); important
# README>     trademarked products ("Xerox", "Teflon"); biological genera
# README>     ("Aerobacter"); and some of their derivatives ("Germans",
# README>     "Xeroxed", "Newtonian").
# README>
# README>   misc.names
# README>
# README>     A list of foreign-sounding names of persons and places
# README>     ("Antonio", "Albuquerque", "Balzac", "Stravinski"), extracted
# README>     from the lists [1--8]. (The distinction betweeen
# README>     "English-sounding" and "foreign-sounding" is of course rather
# README>     arbitrary).
# README>
# README>   org.names
# README>
# README>     A short lists names of corporations and other institutions
# README>     ("Pepsico", "Amtrak", "Medicare"), and a few derivatives.
# README>
# README>     The file also includes some initialisms --- acronyms and
# README>     abbreviations that are generally pronounced as words rather
# README>     than spelled out ("NASA", "UNESCO").
# README>
# README>   english.abbrs
# README>
# README>     A list of common abbreviations ("etc.", "Dr.", "Wed."),
# README>     acronyms ("A&M", "CPU", "IEEE"), and measurement symbols
# README>     ("ft", "cm", "ns", "kHz").
# README>
# README>   english.trash
# README>
# README>     A list of words from the original wordlists
# README>     that I decided were either wrong or unsuitable for inclusion
# README>     in the file english.words or any of the other auxiliary
# README>     lists.
# README>     It includes
# README>
# README>       typos ("accupy", "aquariia", "automatontons")
# README>       spelling errors ("abcissa", "alleviater", "analagous")
# README>       bogus derived forms ("homeown", "unfavorablies", "catched")
# README>       uncapitalized proper names ("afghanistan",
# README>         "algol", "decnet")
# README>       uncapitalized acronyms ("apl", "ccw", "ibm")
# README>       unpunctuated abbreviations ("amp", "approx", "etc")
# README>       British spellings ("advertize", "archaeology")
# README>       archaic words ("bedight")
# README>       rare variants ("babirousa")
# README>       unassimilated foreign words ("bambino", "oui", "caballero")
# README>       mis-hyphenated compounds ("babylike", "backarrows")
# README>       computer keywords and slang ("lconvert", "noecho", "prog")
# README>
# README>     (I apologize for excluding British spellings. I should have
# README>     split the list in three sublists--- common English, British,
# README>     American---as ispell does. But there are only so many hours
# README>     in a day...)
# README>
# README>   english.maybe
# README>
# README>     A list of about 5,000 lowercase words from the "mts.dict"
# README>     wordlist [6] that weren't included in english.words.
# README>
# README>     This list seems to include lots of "trash", like
# README>     uncapitalized proper names and weird words. It would
# README>     take me several days to sort this mess, so I decided to
# README>     leave it as a separate file. Use at your own risk...
#
# [stuff deleted]
#
# README> (NON-)COPYRIGHT STATUS
# README>
# README>   To the best of my knowledge, all the files I used to build these
# README>   wordlists were available for public distribution and use, at least
# README>   for non-commercial purposes. I have confirmed this assumption with
# README>   the authors of the lists, whenever they were known.
# README>
# README>   Therefore, it is safe to assume that the wordlists in this
# README>   package can also be freely copied, distributed, modified, and
# README>   used for personal, educational, and research purposes.
# README>   (Use of
# README>   these files in commercial products may require written
# README>   permission from DEC and/or the authors of the original lists.)
# README>
# README>   Whenever you distribute any of these wordlists, please distribute
# README>   also the accompanying README file. If you distribute a modified
# README>   copy of one of these wordlists, please include the original README
# README>   file with a note explaining your modifications. Your users will
# README>   surely appreciate that.
# README>
# README> (NO-)WARRANTY DISCLAIMER
# README>
# README>   These files, like the original wordlists on which they are
# README>   based, are still very incomplete, uneven, and inconsitent, and
# README>   probably contain many errors. They are offered "as is" without
# README>   any warranty of correctness or fitness for any particular
# README>   purpose. Neither I nor my employer can be held responsible for
# README>   any losses or damages that may result from their use.

# subtract english.trash
# (The list to remove is given twice, so each of its words occurs at least
# twice after sorting; "uniq -u" prints only lines that occur exactly once,
# which drops those words. Words duplicated within the first file are
# dropped as well.)
cat minix.dict english.trash english.trash | sort | uniq -u > dict.1

# subtract english.maybe
cat dict.1 english.maybe english.maybe | sort | uniq -u > dict.2

# build subtraction list of proper names and abbreviations
# (lowercased; "uniq -u" keeps only entries that occur once after folding)
cat english.names misc.names org.names computer.names english.abbrs > sub.1
tr 'A-Z' 'a-z' < sub.1 | sort | uniq -u > sub.2

# subtract proper names with incorrect capitalization
cat dict.2 sub.2 sub.2 | sort | uniq -u > dict.3

# build proper name list without possessives
# (grep -F is the current spelling of fgrep: fixed-string match)
cat english.names misc.names org.names computer.names | grep -F -v \'s > names.1

# add in proper names (use sort twice to get uppercase before lowercase)
cat dict.3 names.1 | sort | sort -df | uniq > linux.words

# clean up
rm dict.[123] sub.[12] names.1