Genealogy from the perspective of a member of The Church of Jesus Christ of Latter-day Saints (Mormon, LDS)

Monday, January 18, 2016

Record Hints vs Searching on Your Own

One of the fastest-changing technologies that directly impacts family history research is the implementation of automatic or semi-automatic integrated search programs. To write such a program that will function effectively, it is a prerequisite to have available an extensive database of indexed source records. Essentially, you can't afford to plumb a dry well. Unless you have a large enough base of records to search, developing search programs is pointless. Let me illustrate the concept of developing such a system, starting with a basic manual search.

Let's suppose I collect ten business cards from contacts at a convention. I can put those cards on my desk in a pile and if I need to contact one of the people whose card is in the stack, I can find the card without too much trouble. What if the number increases tenfold? Now I have 100 cards on my desk. About this time I begin to think of ways to "organize" the cards so that I can efficiently find the one I am looking for, so I start putting the cards in alphabetical order. This might work, even if the number of cards increases exponentially. If the cards were organized alphanumerically, then I could still use the stack to locate one card. This assumes, of course, that I know the name or company of the person I am looking for and that a card for the person or company exists in the pile of cards.

These same considerations would apply if we used an example of searching for an ancestor. If I have ten records to search, the process is trivial. If I have ten thousand records to search, the process might be overwhelming. Obviously genealogists are not the only people who want to search large databases. A paper telephone book, even a very large one, is an efficient way to organize a large number of people and make finding any one of them possible in a relatively short time. In the phone book example the people subscribed to the telephone company in a somewhat random order. But their listings in the phone book were arranged alphanumerically and that made them accessible to a search.

This same level of organization applies to genealogical records. If I have ten thousand records and I organize them in alphanumeric order, theoretically I can find the one name I am searching for. If I am doing this manually, searching 10,000 records will be about the same level of difficulty as searching through 100 records, assuming they are sorted alphanumerically. How is genealogical research different from looking for a name in a phone book or other alphanumeric list? In making searches for genealogical information, we are not always aware of the names, dates or places recorded in the records we are searching. In many cases we make assumptions about the information we are searching for that are inaccurate or completely wrong, in that our search terms are not the same as those in the historical records. For example, I may be searching for an ancestor named John Smith born in 1800, but the name of my ancestor was actually Albert Smith and he was born in 1790. This lack of correspondence between what we are searching for and what is recorded in the historical records renders even an alphanumeric list useless.
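A minimal sketch of this point, assuming a made-up sorted list of names (the `Surname…` entries and the `find_name` helper are hypothetical, purely for illustration). Halving a sorted list of 10,000 entries takes about 14 comparisons, versus about 7 for 100 entries, which is why the larger list is not much harder to search. The same sketch shows that the sorted list is no help when the search term is wrong.

```python
from bisect import bisect_left

def find_name(sorted_names, target):
    """Return the position of target in an alphabetically sorted list, or -1 if absent."""
    i = bisect_left(sorted_names, target)  # binary search: halves the list each step
    if i < len(sorted_names) and sorted_names[i] == target:
        return i
    return -1

# Hypothetical index: 10,000 invented names plus the one real ancestor, sorted.
names = sorted([f"Surname{n:05d}" for n in range(10_000)] + ["Smith, Albert"])

found = find_name(names, "Smith, Albert")   # the name as actually recorded
missed = find_name(names, "Smith, John")    # the name I wrongly assumed

print(found != -1)   # the recorded name is located
print(missed == -1)  # the assumed name is not in the list at all
```

The search itself is fast in both cases; what fails is the match between my assumption and the record.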

If the genealogical record is an index, then the process is similar. But if you think about it, a phone book is unidirectional. You can look up a telephone number from a name (assuming you know the name), but you cannot do the opposite: look up a name from a telephone number.

These examples point out a basic limitation of all searches: any accurate search depends on the researcher's knowledge. If I know the person's name and need to know a telephone number, I can find that easily. But in order for me to manually find the name of a person with a known telephone number, I need to have a list organized by telephone number. In either case, I need to know either the name or the telephone number to use a paper-based directory. The issue of needing some basic information also applies to indexes created on a computer. As long as any additional search terms or segregated classes of information are indexed in some definable order, a manual search will be able to find information eventually that is associated with the entries in that particular field. For example, if we add a date field to our telephone record and the database is sorted by date, we can use a known date to find a record. Likewise, if we have a name, date and place for an ancestor that corresponds to the same information in an historical record, we may eventually find the record. The main limitation is and always has been the number of records to be searched.
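As a rough illustration of indexing the same records by more than one field, here is a small sketch. The sample entries, telephone numbers, and the `build_index` helper are all invented for the example; it only assumes that each record has a name, a number, and a date.

```python
from collections import defaultdict

# Hypothetical directory entries: (name, telephone number, date).
records = [
    ("Smith, Albert", "555-0101", "1790-03-02"),
    ("Jones, Mary", "555-0102", "1801-07-15"),
    ("Smith, John", "555-0103", "1800-01-20"),
]

def build_index(rows, field):
    """Map each value found at the given field position to the rows that contain it."""
    index = defaultdict(list)
    for row in rows:
        index[row[field]].append(row)
    return index

by_name = build_index(records, 0)   # the ordinary phone book direction
by_phone = build_index(records, 1)  # the "reverse directory" a paper book lacks

number = by_name["Smith, John"][0][1]  # name to number
owner = by_phone["555-0102"][0][0]     # number to name
```

Each extra index makes one more field searchable, but in every case I still have to supply a value that actually appears in that field.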

To summarize, when I search for genealogical information, I am essentially trying to match what I think I know to information I hope is contained in some historical record. I am limited by the time and effort it takes to make such a search. If the records are organized in some way, alphanumerically or by date or by place or whatever, my search may be more efficient. For example, if I am searching church parish registers in which the information was entered chronologically and I know the name of the person I am searching for and an approximate date, I can efficiently go through a section of the records and determine if the person I am searching for is contained in the record.

In this case, matching what I already know (the ancestor's name, etc.) to a record may help me discover information I do not know. Using the parish register as an example, I may find the names of my ancestor's parents in a christening record.

In all these cases, a successful search depends on starting with information that we know before we begin the search; hence, the commonly used genealogical maxim to start searching from what you already know. See the following list of references to starting with what you know.

Dowell, David R. Crash Course in Genealogy. ABC-CLIO, 2011.

Helm, Matthew L., and April Leigh Helm. Genealogy Online For Dummies. John Wiley & Sons, 2008.

McGinnis, Carol. Michigan Genealogy: Sources & Resources. Genealogical Publishing Com, 2005.

“Start With What You Know.” Daughters of the American Revolution. Accessed January 18, 2016.

Here we depart from manual searches to the world of computers. A computer program can be written to take a random list of names, telephone numbers and any number of additional fields, and find or match the contents of any field. The order of the database corpus can be entirely random and, given the speed of current computers, a very large list of multiple fields can be searched in a very short time. However, for a computer program to search paper records, the information must be extracted into a searchable format. For some types of printed documents that can be done by even more sophisticated programs called optical character recognition programs. Today there are vast databases of books, newspapers and other printed documents that can be searched by computer programs for information that matches the researcher's search terms. For example, I can find the name of an ancestor in a collection of millions of pages of newspapers. In this case my success in finding my ancestor's name will depend on my knowing how his name might have been recorded in the historical record.
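A minimal sketch of such a program, assuming the records have already been extracted into named fields (the field names and sample entries here are invented). It simply checks every record in turn, so the stored order does not matter.

```python
def search(records, field, value):
    """Scan every record, in whatever order it is stored, for an exact match in one field."""
    return [r for r in records if r.get(field) == value]

# Hypothetical extracted records, deliberately in no particular order.
records = [
    {"name": "Jones, Mary", "phone": "555-0102", "place": "Ohio"},
    {"name": "Smith, Albert", "phone": "555-0101", "place": "Vermont"},
    {"name": "Brown, Eliza", "phone": "555-0104", "place": "Ohio"},
]

in_ohio = search(records, "place", "Ohio")
by_phone = search(records, "phone", "555-0101")
```

Any field can be searched without sorting first, but the match is still exact: a term that differs from what was recorded returns nothing.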

In all my examples so far, errors can occur that prevent me from finding the information, whether because the historical text is unreadable for any reason or because the information I am searching for is missing from the historical record.

Now let's move on to a more complex example involving handwritten documents. Many genealogically significant records contain names, dates and places as well as other useful information. Let's further suppose that we are dealing with a collection of paper records that have multiple significant "fields" or specific classes of information. In this case, since there are currently no adequate handwriting recognition programs for genealogically important documents, the extraction of the information must still be done by human effort. We call this process "indexing." However, in effect, we are not indexing anything, we are merely transferring information from a handwritten format into a text format that can be "recognized" by a computer program. The "indexed" records are not organized in any way by the indexer, but the individual items of information, i.e. names, dates etc., are entered into fields that can be searched by a computer program.

It should be obvious that the ability of a researcher to use the "indexed" information depends entirely on the accuracy of the extraction operation. Only when the information is in a computer-usable format, such as a text file of some sort, can a computer program try to match what is entered by the researcher with what has been produced by the indexers (read extractors).

The entry of the search terms can be done manually by the researcher, or another program can be written to use information already present in a family tree or other database. Here we are. As genealogists we are now either entering our own search terms into a program that will then search the extracted (indexed) database, or we are relying on a program written to automatically match information already present in the database (family tree). Once again, in both instances the effectiveness of the search depends on the accuracy of the information supplied either manually or by a program. As has been said for a long time in computer circles, "garbage in, garbage out."

In both cases, using a manually entered set of search terms or relying on a computer program that matches existing information with information extracted from historical records or documents, what we are hoping to find is additional information about our ancestors that we did not know previous to the search. Ultimately, we hope to identify additional individual ancestors. Here we come to the basic limitation of the automated search programs: they can only supply information about people already present in the database (family tree or whatever). It is only because the records may incidentally contain information about other ancestors (or relatives) who are not already present that such a search can lead to new people.

The automatic search programs can be made to be quite sophisticated. They can be programmed to search for alternate names, name variants, approximate dates and places that are near the search terms, whether those terms are entered by the researcher or already present in the database.
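Here is a rough sketch of what such looser matching could look like. It is not how any particular genealogy program works; the variant table, the five-year tolerance, the 0.8 similarity threshold, and the `is_hint` helper are all assumptions chosen for illustration. The only library call is `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher

# Hypothetical table of known name variants; a real program would need a far larger one.
VARIANTS = {"jon": "john", "johnny": "john", "bert": "albert"}

def normalize(given_name):
    """Lower-case a given name and replace a known variant with its standard form."""
    name = given_name.strip().lower()
    return VARIANTS.get(name, name)

def is_hint(person, record, year_tolerance=5, min_similarity=0.8):
    """Return True if the record is close enough to the person to deserve a review."""
    if abs(person["birth_year"] - record["birth_year"]) > year_tolerance:
        return False
    if person["surname"].lower() != record["surname"].lower():
        return False
    a, b = normalize(person["given"]), normalize(record["given"])
    return SequenceMatcher(None, a, b).ratio() >= min_similarity

person = {"given": "John", "surname": "Smith", "birth_year": 1800}
records = [
    {"given": "Jon", "surname": "Smith", "birth_year": 1798},     # variant spelling, nearby date
    {"given": "Johan", "surname": "Smith", "birth_year": 1801},   # similar spelling
    {"given": "Albert", "surname": "Smith", "birth_year": 1790},  # the actual ancestor from the earlier example
]
hints = [r for r in records if is_hint(person, r)]
```

With these assumed settings, the "Jon" and "Johan" records come back as candidates while the Albert Smith record does not. Looser matching widens the net around what I entered; it does not correct a wrong starting assumption.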

One of the consequences of searching with manually entered terms is that the search programs (commonly called search engines) will produce multiple responses to a search. Ideally, any search should produce all, and only, the records pertinent to a particular set of search terms. Automated search technologies attempt to do this by suggesting "record hints" rather than record matches. In every case the so-called "record hint" must be carefully evaluated by the researcher to ensure that the person suggested matches the one already in the database.

Here we come to the great flaw in both manual and automated, computer-assisted searches. Both rely entirely on the accuracy of the records already present in the researcher's mind or database. As I pointed out previously, if I am searching for "John Smith" I will find "John Smith" even if I am searching for the wrong name. In practice this occurs most frequently in the phenomenon we call "same name = same person." An equally common issue in research is failing to identify accurately or consistently the place where an event occurred. Some programs try to compensate for this problem by encouraging "standardized" place names, but the standardized place name may actually be inaccurate in that it fails to match the place as recorded in a historical record.

This whole subject is extremely complicated. Genealogical researchers today are in the midst of a huge transition, moving from repetitious and routine searches to depending on automated search programs. Those programs will not do our research for us. We cannot assume that their results are accurate, any more than we can rely on our own efforts unless we consistently produce accurate information. Whether manual or automated, research into the unknown always relies on the accuracy of what we think we already know.
