Putting the Cookies on a Lower Shelf--
Getting Started with Conc
Making Concordances, Indices, and Word Lists
Ed_Beach@sil.org
- Introductory Level Tutorial
- Intermediate Level Tutorial
- Advanced Level Tutorial
One of the most useful and powerful functions of a computer is to generate concordances, indices of words and characters, and word lists. I am frequently surprised at how many computer users do not know how easy this is to do, and how many Bible translators let the making of a word list go until their translation is practically done, supposing that this is one of the *final* steps for checking a translation and that it has to be performed by an expert. If you are one of those people, then here's good news for you! The Conc concordance software for Macintosh by John Thomson of SIL is an easy-to-use tool that computer neophytes and experts alike can profit from greatly.
This tutorial is organized into three consecutive levels: Introductory Level, Intermediate Level, and Advanced Level.
Here are some typical uses of Conc, all of which can be done by a linguist who is a novice Mac user:
- Analyze the environments of various phones.
- Check concordance and spelling of terms in a New Testament translation.
- Make a single list of all characters used in sample texts with their numbers of occurrences for planning a primer.
- Make a concordance of various interlinear texts generated by IT.
- Make a specialized concordance of key terms, names, etc.
- Make a word list to import into a word processor's user dictionary.
- Publish a concordance of any literature on disk.
Conc is safe for anyone to use since it merely views your original document and cannot change it in any way. No need to worry that you or a coworker will accidentally alter a document!
Conc's learning curve is very short. I recently worked with a Mother Tongue Translator who was not particularly computer savvy, but within one hour, he was generating his own concordances.
Some of the features that Conc provides include:
- Alignment of key words/characters can be freely changed.
- Concordances and indices can be saved to disk, printed with basic page formatting and numbering, or exported for use by other software.
- Fast generation and navigation of concordances/indices--Conc works solely in a Mac's pure electronic memory (RAM) until you save to disk.
- Immediate access to concordances/indices of word processor documents without having to load the text into a special database first.
- Interlinear (IT) or normal text documents can be used for source documents.
- Multiple documents can be batched together for a single concordance or index.
- Progressively selective concordances are easy to make.
- Reference systems of various types are supported, e.g. SIL Standard Format, line numbers, no references, etc.
- Searching can be based on multiple criteria, including simple or complex formulas using standard or custom pattern-matching symbols, occurrences more or less than X number of times, listed words included or excluded, etc.
- Sorting can be based on user-defined alphabetic sequences.
- Statistics about words or characters.
- Tile text, concordance, and index windows automatically; click an item in one window to instantly select corresponding item(s) in the other windows.
- Typeface and size are user-selectable for both concordances and indices.
Conc comes with a user's manual in electronic form that contains far more information than can be contained in NOAM. You can print it out or use it on disk.
Conc has been designed so that it can be translated into other languages using Apple's free resource editor, ResEdit (available from JAARS). Computer-wise, this is an astoundingly easy process; the challenge however, is the decision-making process concerning the choice of terminology to be used in menus, dialogs, and messages. For this reason, translation of Conc into other languages should be done by those who are familiar with computer terminology in the target language, Conc, and Macintosh interface standards.
The last general release of Conc is version 1.76, dating back to 1993. Various beta versions have been available since then which have added some features, but 1.76 was the last widely tested and debugged version. Little development work has been done on Conc in recent years due to the need to focus on development of SIL's LinguaLinks software.
Conc can be obtained from any of the following sources:
- You can download Conc version 1.76 here. You will need a BinHex program to decode it. ("176" here and below refers to version 1.76.)
- A CD-ROM containing the program may be ordered by email; ask for Conc for Mac. The disk has other Mac programs. There will be a small charge for the CD-ROM.
- Conc can be downloaded by anonymous FTP from ftp.sil.org [208.145.80.1]. Use these commands:
  CD [.software.mac]
  GET conc176.sea_hqx
  Or, just click here. You will need a BinHex program to decode it.
Conc is RAM-based software. (RAM means "Random Access Memory" and refers to the little memory chips inside a computer. This is where Conc does almost all its work, except when it slows down to access information on a disk, e.g., when you open a text document or when you save/export a concordance or index.) This simply means that Conc works with all text, concordances, and indices entirely in your computer's pure electronic memory. An important implication of this is that the source text(s), concordance, and index must all fit into the RAM space allotted for Conc. Generally speaking, you need at least three times as much room for all this as the size of your source text. Until Mac OS 8 is available in 1996, you have to check and set this manually. Here's how:
Before starting Conc, select its icon and choose "Get Info" from the Finder's File menu. In the area of the lower right corner called "Memory Requirements," edit the number to be at least three times the size of your source text. Once the Info Dialog is closed, the setting is saved as the future default. If your Mac has only 4 Mb of RAM, then you may run into difficulty using Conc to process large amounts of text material.
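The three-times rule of thumb is simple arithmetic, and a minimal Python sketch may make it concrete. The function name and the choice of 1024 bytes per K are assumptions made for illustration; Conc itself offers no such calculator.

```python
import math

def suggested_memory_kb(source_bytes: int, factor: int = 3) -> int:
    """Return a rough memory allotment, in K, for a source text.

    factor=3 follows the rule of thumb that the text, concordance, and
    index together need at least three times the size of the source text.
    """
    if source_bytes < 0:
        raise ValueError("source size cannot be negative")
    return math.ceil(source_bytes * factor / 1024)

# A 400 K source text calls for roughly 1200 K in the Get Info box.
print(suggested_memory_kb(400 * 1024))  # prints 1200
```

Treat the result as a floor, not a guarantee: a concordance with wide context can need more room than this estimate suggests.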
For language work in general, linguists should have at least 8 Mb of RAM for pre-PowerPC Macs such as I have (I also use RAM Doubler), or at least 16 Mb for a PowerPC. (If you think that's expensive, note that this is what JAARS is basically recommending (see NOC 14.3.28) and that it is what LinguaLinks will require.)
Here is a sample of Conc's Concordance and Index windows. The top window shows the text of Acts 1. Below that is a concordance that has been generated and which shows the key words in bold, aligned down the center, and with chapter and verse references along the left side. Below that is an alphabetized index of words showing how many times each occurs and the chapter and verse references of where it occurs.
Note that in the text window I have clicked on one of the "Spirit" entries in the concordance. Conc instantly selected that item in the text as well as in the index.
It took just one minute to do this! In that time, I...
- opened a text document containing Acts 1,
- set options for the concordance and index, and
- generated both the concordance and the index.
Here's how you can do that...
1. Start Conc by double-clicking its icon.
2. Open a source document from Conc's File menu.
(Formatted Nisus documents are okay; others have to be "plain text.") Include additional documents by choosing Append.
3. Define a custom sort order, if needed.
4. Define reference markers (e.g., chapter and verse) and word separator characters.
5. Define parameters for specialized concordances.
6. Tell Conc to make the concordance.
7. Tell Conc to make an index.
Conc can be started up in various ways:
- Double-click Conc's icon or an alias of it.
- Double-click a Conc options document, concordance or index.
- Select Conc's icon, or the icons of a Conc options document, concordance, or index, and then choose "Open" from the Finder's file menu.
- Select Conc from the Apple menu if you have stored an alias of it in the Apple Menu Items folder (in the System Folder).
Do this from *Conc's* File menu! If you double-click the source text itself, you will merely open it in the application with which you created it. Using Conc's File "Open" command lets Conc look at a document created by another application.
Conc normally requires that the source text be a text-only document, but any *formatted* Nisus or Nisus Writer document can be used directly by Conc. Documents created by most other word processors contain hidden formatting information that will confuse Conc. To make a text-only document in most word processors, open the document in that application; then open the "Save As..." dialog from the File menu, choose the setting for a text-only version, and give the text a new name so as not to overwrite the original.
Saving a document as "text only" in the "Save As" dialog of Microsoft Word
Note about Microsoft Word: Always check the "Make Backup" check box (or whatever its equivalent is in your version of Word). Word's "Fast Save" option can lead to various problems, not the least of which is that other word processors have trouble reading documents saved with Word's Fast Save option.
The text-only version of your source text is now ready for use by Conc.
Select "Sorting..." on Conc's Options menu. When the dialog opens, note that the Font menu is active and can be used to set the font in this dialog for special orthographies such as seen in the example below.
Options menu: Sorting...
Note the following about this dialog:
- Characters are grouped into "primary sort groups" of corresponding upper and lower case characters.
- If you frequently use various sort orders, store them in the Scrapbook or in a text file for easy retrieval.
- Conc uses the "secondary sort sequence" characters to alphabetize when two words are otherwise identical. Refer to the general Conc user's manual for more information.
- If the hyphen is used as a word forming character to distinguish unitary characters from digraphs, as in the Mam word *t-xic,* include the hyphen in the Secondary Sort Sequence and ensure that it is not listed as a word separation character in the Text Properties dialog. (Don't confuse hyphen, en dash, and em dash!)
- Activating "Characters within primary sort groups are distinctive" means that Conc will distinguish uppercase and lowercase characters; this should be your normal default setting.
- If you use the straight apostrophe as a glottal character, put it at the end of the Primary Sort Sequence (in Mayan languages).
- Conc 1.76 added support for distinguishing multigraphs, but to date has seemed buggy to me. Refer to the Read Me document that accompanies it for instructions.
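To illustrate the idea of primary sort groups, here is a minimal Python sketch of sorting with a user-defined alphabetic sequence. It is not Conc's actual algorithm: the function name, the handling of unlisted characters, and the way case breaks ties are all assumptions made for the example, and multigraphs are not handled.

```python
def make_sort_key(primary_groups, distinctive=True):
    """Build a sort key from an ordered list of primary sort groups.

    Each group is a string of characters that sort together, e.g. "aA".
    Characters not listed in any group sort after all listed ones.
    """
    rank = {}
    for group_index, group in enumerate(primary_groups):
        for position, char in enumerate(group):
            rank[char] = (group_index, position)
    unlisted = len(primary_groups)

    def key(word):
        primary, within = [], []
        for char in word:
            group_index, position = rank.get(char, (unlisted + ord(char), 0))
            primary.append(group_index)
            within.append(position)
        # Position within a group (here: case) only breaks ties between
        # words that are otherwise identical.
        return (primary, within if distinctive else [])

    return key

# Hypothetical three-group alphabet with the straight apostrophe last.
key = make_sort_key(["aA", "bB", "'"])
print(sorted(["b'a", "ba", "Ba", "ab"], key=key))
```

Placing the apostrophe in the last group makes "b'a" sort after "ba", whereas plain ASCII ordering would put it first because the apostrophe has a lower code than any letter.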
Open the Text Properties dialog from Conc's Options menu. Here are the settings I use for SIL Standard Format (SF) Scripture when making a single concordance/index from multiple books:
- Only list the SIL Standard Format codes you want used in concordance and index references. For instance, "bk c v" will produce references such as "Matt 1.1". Limiting the list to "c v" however, will produce references such as "1.1". Book names are not needed in a single book concordance or index, and save both screen and page space.
- Do not include the SF backslash marker in the dialog. (Conc assumes it.)
- Leave a single space between each SF code listed.
- It is assumed that all SF codes in the source document are correct. If you have doubts, run Chapter Verse Check.
- Word separator characters are listed with no spaces between them.
- Conc assumes that white space characters--space, tab, and carriage return--are word separators.
- In the above example, the straight apostrophe is not listed as a word separator because we are using it in orthographies here in Guatemala to designate glottal stop / glottalization; thus it is an orthographic character listed in the Sorting Parameters dialog. The straight quote (inch mark) is not used in our manuscripts.
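The settings above can be pictured with a short Python sketch that walks through SF-marked text, tracks the "c" and "v" markers, and splits words on white space plus a list of separator characters. The function name, the default separator list, and the "1.1" reference format are assumptions for illustration only; Conc's own parsing may differ in detail.

```python
def iter_words(sf_text, ref_markers=("c", "v"), separators=".,;:!?\"()"):
    """Yield (reference, word) pairs from SIL Standard Format text.

    The backslash before each marker is assumed, as in Conc's dialog.
    The straight apostrophe is deliberately not a separator here.
    """
    current = {}
    to_space = str.maketrans({ch: " " for ch in separators})
    for line in sf_text.splitlines():
        body = line
        if line.startswith("\\"):
            marker, _, rest = line[1:].partition(" ")
            if marker in ref_markers:
                value, _, body = rest.partition(" ")
                current[marker] = value
                # A new chapter makes the previous verse number stale.
                for lower in ref_markers[ref_markers.index(marker) + 1:]:
                    current.pop(lower, None)
            else:
                body = rest  # other markers: keep the text, not the code
        reference = ".".join(current[m] for m in ref_markers if m in current)
        for word in body.translate(to_space).split():
            yield reference, word

sample = "\\c 1\n\\v 1 It's good, isn't it.\n\\v 2 Yes"
for reference, word in iter_words(sample):
    print(reference, word)
```

Because the apostrophe is left out of the separator list, "It's" and "isn't" come through as single words, which mirrors the glottal-stop usage described above.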
This step may lead you to select one or two more dialogs accessed from the Options menu:
The Include Words dialog is a veritable powerhouse which will be discussed later. First-timers should select "Include all words."
Important: Words specified here will be included only if they are not excluded by the Omit Words dialog.
This dialog is a handy way to make an initial spell check by enabling you to limit a concordance/index to rare words--likely suspects for spelling mistakes. If this is your first time to use Conc, leave at least the first two items unchecked.
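The rare-word idea can be sketched in a few lines of Python. This only illustrates the principle of keeping words that occur no more than a given number of times; the function name and the threshold parameter are hypothetical and do not correspond to named options in Conc.

```python
from collections import Counter

def rare_words(words, max_occurrences=1):
    """Return the words that occur at most max_occurrences times, sorted."""
    counts = Counter(words)
    return sorted(word for word, n in counts.items() if n <= max_occurrences)

# "teh" occurs once and is a likely typing mistake for "the".
print(rare_words(["the", "teh", "the", "cat", "cat", "dog"]))
```

A rare word is only a suspect, of course: "dog" above is spelled correctly but is just as rare as "teh".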
Layout menu: Display...
- "Show references within text from flat text files" allows you to control whether references are displayed in the text window.
- Lowering the wastage threshold to a very low or even zero percentage uses all the screen/page space efficiently, but gives you many partial words.
- The right-hand limit is not the edge of the window, but rather the edge of the paper selected in the Page Setup dialog after adjusting for the margins specified in the Page Layout dialog (on the File menu).
- "Show secondary field word after main in concordance" and "Limit context to current unit in interlinear documents" apply only to interlinear documents.
This is the fun part! Simply choose "Word concordance" on the Build menu.
Before building a large concordance, ensure that you have plenty of hard disk space! Concordances can easily be twice as long as the original file.
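For readers curious about what a word concordance actually contains, here is a minimal keyword-in-context sketch in Python. It assumes the text has already been reduced to (reference, word) pairs; the function names, the context size, and the column widths are illustrative assumptions, not a description of Conc's internals.

```python
def build_concordance(ref_words, context=3):
    """Return one (reference, left, key word, right) entry per word, sorted."""
    words = [word for _, word in ref_words]
    entries = []
    for i, (ref, word) in enumerate(ref_words):
        left = " ".join(words[max(0, i - context):i])
        right = " ".join(words[i + 1:i + 1 + context])
        entries.append((word.lower(), i, ref, left, word, right))
    entries.sort()  # alphabetical by key word, then by position in the text
    return [(ref, left, word, right) for _, _, ref, left, word, right in entries]

def format_entry(entry, left_width=24):
    """Lay out one entry with the key word aligned in a fixed column."""
    ref, left, word, right = entry
    clipped = left[-left_width:] if left else ""
    return f"{ref:<6}{clipped:>{left_width}} {word} {right}"

pairs = [("1.1", "the"), ("1.1", "cat"), ("1.2", "sat")]
for entry in build_concordance(pairs, context=1):
    print(format_entry(entry, left_width=5))
```

Every word of the text appears once as a key word and again in the context of its neighbors, which is why a concordance is so much larger than its source.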
Use the standard Save and Print items on the File menu for saving and printing your new concordance. Note that you can also export it as a text file to use in a word processor or other program.
Troubleshooting: If you wind up with a concordance that is different than you thought it would be, check all the option settings. Also make sure you are using a text-only file as your starting point.
Getting Statistics: If you would like to know some statistics about your concordance or index, choose Conc's "Statistics" command on the Build menu.
An index of words or characters tells how many of each occurs and where they occur. To create such an index in Conc, you first build a concordance (as done above) and then simply choose "Index" on the Build menu.
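The content of such an index can be sketched in Python as follows. The sketch starts from (reference, word) pairs rather than from a finished concordance, and the function name and return shape are assumptions chosen for the example.

```python
from collections import defaultdict

def build_index(ref_words):
    """Return (word, number of occurrences, references) tuples, alphabetized."""
    references = defaultdict(list)
    for ref, word in ref_words:
        references[word].append(ref)
    return [(word, len(refs), refs) for word, refs in sorted(references.items())]

pairs = [("1.1", "the"), ("1.1", "cat"), ("1.2", "the")]
for word, count, refs in build_index(pairs):
    print(word, count, ", ".join(refs))
```

The sort here is plain code-point order; a custom alphabetic sequence would need its own sort key.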
Use the standard Save and Print items on the File menu for saving and printing your new index. Note that you can also export it as a text file to use in a word processor or other program.
If what you really wanted was a plain vanilla word list, i.e., a list of unique words without any references or the number of occurrences, then refer to the Intermediate Level tutorial below. (Most people use indices. A true "word list" is actually a quasi power user tool.)
This is the end of the Introductory Level tutorial for Conc. You are invited to continue on to the Intermediate Level.
Now that you have a concordance and an index, you can find things very fast in your document.
- Click on a word in any of the three windows--text, concordance, or index--and Conc will instantly locate the corresponding entries in the other two windows. (Of course, if you click on a word in the text window, but had told Conc to omit it from the concordance, or had chosen a pattern that would simply not include it, Conc will have nothing to scroll to.)
- Type a letter on the keyboard, and Conc instantly locates the first word in the concordance or index that starts with that letter--or the next greater word if nothing starts with it--and all windows scroll to the selected word. (This assumes, of course, that you elected to have Conc sort the concordance!)
- Type several letters rapidly, and Conc instantly locates the first word that starts with the string you typed. A short pause (equal to the current double-click interval set on the Mac's Keyboards control panel) will allow you to start again, typing a new string.
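The type-ahead behavior described above (jump to the first word starting with the typed string, or to the next greater word) is what a binary search on a sorted list gives. The Python sketch below shows the idea with the standard bisect module; it assumes plain code-point ordering and a hypothetical function name, and says nothing about how Conc implements the feature.

```python
import bisect

def locate(sorted_words, typed):
    """Return the position of the first word that starts with typed,
    or of the next greater word if none does (None for an empty list)."""
    if not sorted_words:
        return None
    i = bisect.bisect_left(sorted_words, typed)
    return min(i, len(sorted_words) - 1)  # past the end: stay on the last word

words = ["and", "apostle", "baptism", "spirit"]
print(words[locate(words, "b")])   # first word starting with "b"
print(words[locate(words, "c")])   # nothing starts with "c": next greater word
```

This only works because the list is sorted, which is the same condition noted above for Conc.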
Want to quickly change the look of your concordance or index?
- Use the Font menu to alter typeface and size in any window or dialog.
- Adjust column width in the concordance and index windows simply by dragging the triangles at the top of the windows.
- Tile windows with the Window menu's Tile command.
- Choose the Set Wrap Length on the Layout menu to make text in the source document window fill out to the width of the window.
- Enlarge windows with the re-size or zoom boxes in the scroll bars.
- The total width available for context is determined by the paper chosen in the Page Setup dialog and the margins you have selected in the Page Layout dialog. To maximize the amount of context you can see by scrolling sideways, choose a sideways page layout and zero margins.
- Tell Conc how to use available space and whether or not to show references using the Display Options dialog on the Layout menu.
Need a word list? This is useful for importing into a user dictionary in other software such as Nisus. Strip an index of references and numbers of occurrences by doing the following. (This assumes you are working from a document with no spelling errors!)
- Export your index (click in the index window, then go to the Export Index command on the File menu).
- Open the exported file with a word processor.
- Use the word processor's replace tool to strip the file of the numbers of occurrences (and references if you elected to include them). Microsoft Word 5 and Nisus make this a quick and easy task.
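If you prefer a script to a word processor's replace tool, the stripping step might be sketched in Python like this. The layout of the exported file is an assumption: the sketch expects one entry per line, with the word followed by a tab and the number of occurrences, and no references. Check your own exported file before relying on it.

```python
import re

def strip_counts(exported_text):
    """Return the bare words from an exported index.

    Assumes each line looks like "word<TAB>count"; blank lines are skipped.
    """
    words = []
    for line in exported_text.splitlines():
        word = re.sub(r"\t\d+\s*$", "", line).strip()
        if word:
            words.append(word)
    return words

sample = "and\t12\napostle\t3\n\nbaptism\t1\n"
print("\n".join(strip_counts(sample)))
```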
You can save time by saving your options settings in a file where they can all be retrieved at once. Simply choose "Save All Options" from Conc's File menu.
Next time you start Conc, do so by double-clicking on your options document.
Restore a given set of options merely by using the File menu's Open command to open the needed options file, or simply double-click it in the Mac's Finder window.
The Revert command restores the options and concordance that were in effect the last time the concordance was saved. (It does nothing with an unsaved concordance.)
Note: Conc 1.76 beta may leave Save and Save As grayed out and unavailable when an options dialog is open. Generating or opening a concordance should solve the problem.
Long documents are normally saved in pieces. Conc enables you to easily build concordances and indices of such multiple file documents. Use the File "Open" command as usual to open the first text file for your concordance or index. Now use "Append" on the File menu to open additional texts to be included in the concordance/index.
The "Export..." command on the File menu creates a plain text file of your concordance or index, depending on which window is active. This can be useful for printing it with other software in more sophisticated ways than Conc is capable of doing. I frequently export concordances and indices for processing with Nisus. If you have already saved the document in Conc format this command appears as Export file name As.
Exporting can produce very large files, more than ten times the size of the original text. You can halt the export process by clicking the Abort button on the progress indicator. Consider whether you could better use "Export Selection" (becomes active when part of a concordance or index has been selected by dragging over it) to save just part of the concordance.
To get a word list from Conc, select "Index..." on the Options menu. In the dialog, choose "Index entries show at most 0 references" (you may have to fill in the number zero yourself), then go ahead and make your index. Finally, export the index and strip it of the numbers of occurrences (and their preceding tabs) using a word processor. Don't forget to set the Index Options dialog back to "Index entries show all references" as a default!
"Print" on the File menu prints the current concordance or index, depending on which window is active. To print just a part of a concordance or index, select the lines you want (by dragging or shift-clicking as usual) and then choose "Print Selection" on the File menu.
The Page Setup command is stock standard. Page Layout and Header/Footer dialogs are self explanatory.
A key to Conc's usefulness is its ability to generate special concordances based on user-defined words or, more strictly speaking, rules called "patterns." Finding word or character sequences that fulfill rules is called "pattern matching." Pattern matching is accomplished by:
- Specifying a word or list of words to match.
- Specifying character strings to match.
- Specifying formulaic patterns for kinds of words or strings to match.
Conc's Include Words dialog has boxes ("fields") for two patterns and a third box where whole words can be listed. Words matching whichever of these three is activated will be included in the concordance.
The simplest pattern is just a group of ordinary letters. E.g., Specifying the pattern "ing" will create a concordance of words containing "ing."
Formulaic patterns are even more powerful. This is an advanced feature of Conc, so if you don't feel ready for it, skip this section until your felt need for it is high enough to motivate you. Be forewarned that Conc is no smarter in pattern matching than you make it! The burden for creating proper formulas is completely on the analyst.
In pattern matching formulas, special meanings are assigned to certain characters. Conc comes with a default set of special characters already set up for this purpose. (In general, these are world-wide standard GREP (Global Regular Expression Print) symbols from the world of Unix.) For instance, "ing$" is the pattern that matches all words with "ing" endings because Conc knows that $ means "end of word."
Select the Pattern Matching dialog on the Options menu to see Conc's default pattern matching symbols.
You may be tempted to change the characters in this dialog. However, it's best to at least start out using these defaults since they are a standard in the world of computing. (Note the similarity to Nisus Writer's PowerSearch Pro find mode.)
Hint: Leave the dialog open while you write pattern matching formulas. (If you change the settings, you will need to click okay and then reopen it.)
Normally, a pattern is written to find single words that match it. You can, however, look for multiple word sequences by specifying a number of words to include in the comparison. Do this by filling in a number other than "1" in the Include Words dialog item that reads "Include groups of __ words."
- bapti
  - All occurrences of baptism, baptist, baptize, baptized, baptizing.
- ^[aA]
  - All words beginning with either a or A.
- ^[A-Z]
  - All words beginning with an uppercase letter (in ASCII range of A through Z).
- [^aeiou][^aeiou][^aeiou]
  - All strings of at least three consonants.
- [!?."][ ]*[a-z]
  - All words beginning with a lowercase letter after zero or more spaces following end-of-sentence punctuation. (You have to set the search to include groups of two words.)
- ^a_*p$
  - All words that start with a and end with p. The _* matches any word-forming characters that may or may not come between.
- ^\([aeiou]\)_*\1
  - Vowel-initial words, provided the same vowel occurs elsewhere in the word.
- b[aeiou]%b
  - All strings where a b is followed by one or more vowels and then another b.
Here is how Conc's default pattern matching works:
1. The . (period) matches any character.
- The pattern m.n matches "man" and "money", but not "mark".
- The pattern k.x matches both "kaxic" and "k-xic".
2. The _ (underline) matches any character considered to be part of a word.
- The pattern k_x matches "kaxic" but not "k-xic" if the hyphen is listed as a word separation character in the Text Properties dialog.
3. The # matches any character that is not part of a word.
- The pattern time#c matches "time capsule" and "time clock", but not "timecard" because time and c must be separated by a character that is not a word-forming character, such as a space.
4. The backslash \ followed by any character--except a digit or the parenthesis characters--matches the character that follows the backslash. This is useful if you want to look for characters that are normally special.
- The pattern ing\. matches words that have ing followed by a period.
5. Square brackets identify a set of characters.
- The pattern [aeiou] matches any one of a, e, i, o, or u. In such a bracketed string, the backslash \ has no special meaning, and the closing bracket ] may only appear as the first letter.
- A set of characters can be negated using the caret symbol ^ as the first character inside the brackets. Thus the pattern [^aeiou] matches any character except a, e, i, o, or u.
6. Shorthand, such as [a-s], may be used where a and s are in ascending ASCII order and a-s represents the inclusive range of ASCII characters. (Unfortunately, this means exactly what it says in the present version of Conc. Hence, if you have a special character in a font that alphabetically occurs between, say, a and s, but has an ASCII number higher than s, then it would not be included in the set. A future version of Conc or its successor will most likely employ a user-defined collating sequence.)
- The pattern [f-q][aeiou] finds words where any letter between f and q (in the ASCII order) is followed by a vowel.
7. An element of a pattern that is followed by * matches a sequence of 0 or more occurrences of that element.
- The pattern sn*o matches "snob" and "snow" as well as "sonic".
8. If an element of a pattern is followed by % then it matches one or more occurrences of that pattern element.
- The pattern sn%o matches "snob" and "snow", but not "sonic".
9. Are you ready to adjust your eyeballs? Backslashes paired with parentheses \( and \) are opening and closing brackets that cause Conc to remember whatever is between them for further comparison. This is used in combination with the next convention. (Note that Nisus has a beautiful way to handle this in its PowerSearch mode. I have found the best way to get accustomed to this feature of Conc is using the "found" feature in Nisus PowerSearch.)
10. A backslash \ followed by a digit n matches a copy of the string that the bracketed pattern beginning with the nth \( matched. This is useful when what is inside the brackets could match several things (it includes other special characters) and you later want to check for a repeat of the thing that matched.
- The pattern \(_\)\1 matches (believe it or not!) any word that has a double letter. To read it, first consider that the underline matches any letter. Hence, \(_\) (underline enclosed in digraph brackets) matches any letter and the digraph brackets make Conc remember which letter. Finally the \1 requires there to be a repeat of whatever was in the first pair of brackets; in this case, another occurrence of the same letter. Note: The 1 stands for "first" string remembered, not for "one" or "single" occurrence.
- The pattern \(__\)\1 matches any word where a sequence of two characters is repeated. (There are two underline characters in the pattern.)
- The pattern \(__\)_*\1 is the same except that there may be other characters between the pair that repeat. Note that the three underlines don't all have to match the same character.
11. Any pattern such as mentioned above that is preceded by the caret symbol ^ is restricted to matches at the beginning of words (or group of words, if that option is selected).
- The pattern ^a matches all words that start with the letter a.
- The pattern ^[^aeiou] matches all words beginning with a consonant.
12. Any pattern such as mentioned above that is followed by $ is restricted to matches at the end of words (or group of words).
- The pattern ing$ matches words that end in ing.
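If you already know regular expressions from another tool, it may help to see the conventions above mapped onto a familiar notation. The following Python sketch translates a pattern written with Conc's default symbols into the syntax of Python's re module. It is an approximation under stated assumptions: the function names are hypothetical, "word-forming character" is rendered as Python's \w (which ignores your Text Properties and Sorting settings), ^ and $ are treated as special only at the very start and end of a pattern, and each word is tested on its own.

```python
import re

def conc_to_regex(pattern):
    """Translate a pattern in Conc's default symbols into Python re syntax."""
    out = []
    i, n = 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "^" and i == 0:
            out.append(r"\A")            # beginning of word
        elif ch == "$" and i == n - 1:
            out.append(r"\Z")            # end of word
        elif ch == "_":
            out.append(r"\w")            # assumption: any word-forming character
        elif ch == "#":
            out.append(r"\W")            # assumption: any non-word character
        elif ch in ".*":
            out.append(ch)               # same meaning in both notations
        elif ch == "%":
            out.append("+")              # one or more occurrences
        elif ch == "\\" and i + 1 < n:
            nxt = pattern[i + 1]
            if nxt in "()":
                out.append(nxt)          # \( and \) remember what they match
            elif nxt.isdigit():
                out.append("\\" + nxt)   # \1 repeats the first remembered match
            else:
                out.append(re.escape(nxt))  # backslash makes the character literal
            i += 1
        elif ch == "[":
            j = i + 1
            if pattern[j:j + 1] == "^":
                j += 1
            if pattern[j:j + 1] == "]":
                j += 1                   # a leading ] is a literal member
            end = pattern.find("]", j)
            if end == -1:
                raise ValueError("unclosed [ in pattern")
            body = pattern[i + 1:end]
            negate = body.startswith("^")
            members = body[1:] if negate else body
            # Backslash is literal inside Conc brackets, so escape it for Python.
            escaped = "".join("\\" + c if c in "\\]^[" else c for c in members)
            out.append("[" + ("^" if negate else "") + escaped + "]")
            i = end
        else:
            out.append(re.escape(ch))    # ordinary character
        i += 1
    return "".join(out)

def conc_match(pattern, word):
    """Return True if the word (or word group) contains a match."""
    return re.search(conc_to_regex(pattern), word) is not None

print(conc_to_regex(r"\(_\)\1"))
print(conc_match("sn*o", "sonic"), conc_match("sn%o", "sonic"))
```

The last line shows the difference between * and %: sn*o accepts "sonic" because zero occurrences of n are allowed, while sn%o does not.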
The check box item "Characters within primary sort groups are distinctive for pattern matching" in the Pattern Matching dialog functions independently of the corresponding check box item in the Sorting dialog.
- With the "Characters...are distinctive" feature turned off in the Pattern Matching dialog, a will match either a or A; [aeiou] will match any vowel, upper or lower case; and ^\(_\)\1 will match any word starting with a double letter, whether or not the first character is upper case.
If certain characters are specified for the secondary sort sequence, they are ignored for pattern matching if "Characters...are distinctive" is off.
The check box item "Include word separation characters" controls whether characters that are not considered to be in words are included in the match.
- With this option on, if you select words beginning with a (using a pattern ^a), you would miss words that begin a quotation since such a word would start with a quote (as in "a quote"). This is why Conc normally excludes punctuation. But if you want to find all the words that begin quotations--using for example ^["] as your pattern--it would be annoying to have none of them found because Conc leaves out such punctuation characters. This check box item allows you to specify which should be done.
A very useful strategy for analyzing text is to make progressively more focused concordances. For example, suppose you want all the words that contain a and e and i, in any order. There is no single pattern match that will do this. However, if you first limit the concordance to words containing a, then build a concordance from that one of words containing e, and then a concordance based on that of words containing i, you can easily get the required result.
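The a-and-e-and-i example amounts to applying one simple filter after another, each working on the result of the previous one. A minimal Python sketch, with hypothetical function names and (reference, word) pairs standing in for concordance entries:

```python
def refine(entries, substring):
    """Keep only the (reference, word) entries whose word contains substring."""
    return [(ref, word) for ref, word in entries if substring in word]

def progressive(entries, substrings):
    """Apply refine once per substring, each pass working on the last result."""
    for substring in substrings:
        entries = refine(entries, substring)
    return entries

entries = [("1.1", "baptize"), ("1.2", "spirit"), ("1.3", "Galilee"),
           ("1.4", "certain"), ("1.5", "heaven")]
for ref, word in progressive(entries, ["a", "e", "i"]):
    print(ref, word)
```

Each pass has less to examine than the one before, which is part of why building from an existing concordance is quicker than starting again from the original text.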
Building one concordance from another is as easy as having a concordance open, and then selecting the Build Word Concordance command. Conc will ask if you want to use the current concordance or the original text as the starting point for your new concordance, and you will click on "Present concordance".
A particularly valuable way to use the "Present concordance" option is to begin your study of a large text by creating a concordance containing all the words you ever expect to be interested in (possibly all the words in the document). Then save this as a base concordance, using the Save command on the File menu. Then, for each set of words of interest, open this base concordance, change the set of words to include, and choose "Present concordance." Use the Revert command on the File menu when you want to start again. In most cases (especially for large files and if you have a hard disk) the combination of Revert followed by building a new concordance based on the current one will be much faster than building a new concordance based on the original text.
- Source text, concordance, and index, must fit in available RAM.
- 128 pages per printout.
- Lists of words to include or exclude are limited to 240 characters.
- No way to interrupt sorting a large file.
- Under certain low-memory conditions, Conc may suddenly quit.
Date created: 14-Dec-1995
Last modified: 14-Dec-1995
URL:
http://www.sil.org/computing/conc/tutorial.html
Questions/Comments:
WWW@sil.org
Copyright © 1995, Ed Beach and Summer Institute of Linguistics