Genealogy from the perspective of a member of The Church of Jesus Christ of Latter-day Saints (Mormon, LDS)

Friday, September 18, 2015

Auto-Indexing added to FamilySearch.org

"Internet Archive book scanner 1" by Dvortygirl - Own work. Licensed under GFDL via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Internet_Archive_book_scanner_1.jpg#/media/File:Internet_Archive_book_scanner_1.jpg
Optical Character Recognition, or OCR, has been around for quite a while. It is routinely used by businesses, such as the U.S. Postal Service, to expedite delivery. I have been using desktop OCR programs for many years. At one point, I lost the files for my personal journal but fortunately had a printed copy, and I used an OCR program to reconstruct the missing portions. Lately, I have been using OCR's cousin, Voice Recognition or VR software, to save the time and effort of keyboarding everything I write.

More importantly for family historians, the huge collections of digitized and transcribed books online are dependent on this technology. For example, the tens of thousands of books in the FamilySearch.org Books collection have been subjected to OCR scanning, and the text is completely searchable.

There have been a couple of mentions of using OCR technology to expedite the entry of some documents into the vast FamilySearch.org Historical Record Collections. The latest was in an update by Steve Anderson in the FamilySearch.org Blog entitled, "What's New in FamilySearch -- September, 2015." Here is the short paragraph about the subject.
Search: Auto-indexed Record Feedback 
FamilySearch.org has begun publishing collections that contain searchable indexed information that was extracted from images by computer algorithms. This monumental advancement promises to dramatically increase the indexed information available for the many image-only collections currently published on FamilySearch.org.

While we are developing these automated indexing tools, your feedback on the accuracy of these records will greatly accelerate the improvement of the tools. On auto-indexed records only, you will see a new tab at the bottom labeled “Errors?” When you click Errors?, you will be able to provide direct feedback to the engineer on the type and specific nature of any errors you encounter.
First of all, this only works on images of typeset documents. OCR technology does not yet work consistently with handwriting or script. Even at the U.S. Postal Service, human intervention in the recognition process is still required in some cases, although I suspect that the volume of hand-addressed mail has decreased dramatically. OCR technology is also still entirely dependent on the quality of the original image.

I have always wondered why, when indexing is such a labor-intensive activity, the larger indexing organizations, such as FamilySearch.org, did not get more heavily involved in this technology. For some time now, Ancestry.com has had a system where users can correct the "indexed" record and add readings of the original that become part of the search terms. Anyone who has spent time researching in indexes realizes that the indexing system, whether human or electronic, makes mistakes in reading the characters in a record. Here is an example from Ancestry.com.

My grandfather's name is Leroy Parkinson Tanner. Here is the indexed entry from Ancestry.com where his name is recorded as "Leroy T Tanner."


In this case, the indexing system was accurate. The middle initial in the original image is definitely a "T," so the mistake was made when the record itself was created.


The problem is that this type of error may prevent the record from being found by an unsophisticated search engine. Advances in search engine technology have diminished the impact of this type of error, but where the original record is not available and there is only an index, these errors can prevent the researcher from finding the record altogether.
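To make the difference between an unsophisticated search and a more forgiving one concrete, here is a minimal sketch in Python. It is only an illustration, not how Ancestry.com or FamilySearch.org actually search their indexes. The function names and the small sample index are hypothetical, and the similarity score comes from Python's standard `difflib` module.

```python
from difflib import SequenceMatcher

# Hypothetical index entries; only "Leroy T Tanner" comes from the example above.
INDEX = ["Leroy T Tanner", "Henry M Tanner", "Mary Ann Parkinson"]


def exact_search(query, index):
    """Unsophisticated search: the name must match the index entry exactly."""
    return [entry for entry in index if entry.lower() == query.lower()]


def similarity(a, b):
    """Return a score between 0 and 1; higher means the two strings are more alike."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(query, index):
    """Forgiving search: return the index entry most similar to the query."""
    return max(index, key=lambda entry: similarity(query, entry))


query = "Leroy Parkinson Tanner"
print(exact_search(query, INDEX))  # the exact search finds nothing
print(best_match(query, INDEX))    # the forgiving search still surfaces a candidate
```

The exact search comes back empty because the index says "T" where the researcher typed "Parkinson," while the similarity-based search still ranks the mis-recorded entry first. A real search engine would use more elaborate techniques, but the principle is the same: tolerate small differences rather than require a perfect match.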

As FamilySearch.org mentions in the announcement, limited user input is now available to help correct this type of entry problem, whether it originated in the original input at the time the record was created or somewhere else in the process. It is also important to make the original records available with every index, unless, of course, the original record was itself an index.

This is definitely a move in the right direction by FamilySearch.org.

2 comments:

  1. In response to Steve Anderson's blog re. auto-indexing, I have asked several times to be given an example of such a record on FS Search - with no answer. Do you know of one?