Genealogy from the perspective of a member of The Church of Jesus Christ of Latter-day Saints (Mormon, LDS)

Tuesday, May 28, 2019

Big Data and the FamilySearch Family Tree: How Accurate Can It Be?


The term "Big Data" is used to describe extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions. If you think about this definition, especially the references to associations and interactions, you might conclude that the FamilySearch.org unified Family Tree falls within it. Here are some statistics that give the impression that working with the FamilySearch.org website, and in particular the Family Tree, has crossed into the realm of Big Data. These statistics come from a recent presentation by FamilySearch and are as of March 2019.
  • Number of searchable names in the Historical Records: 7.01 billion
  • Digital images published in FamilySearch's Historic Collections online: 1.37 billion
  • Digital images published only in the FamilySearch Catalog online: 1.014 billion
  • Number of searchable records: 4.66 billion
  • Registered FamilySearch users: 12.4 million
Usually, we think of Big Data in terms of information currently being generated and stored. However, in the case of FamilySearch, the information being accumulated is both current and historical. Much of the concern about Big Data centers on its accuracy. Likewise, a significant amount of concern has been expressed over the years about the accuracy of online family trees in general and of the FamilySearch Family Tree in particular.

Because the Family Tree is a wiki, I have always maintained that over time, the Family Tree will become more accurate. This works in specific areas, but how will it continue to work as the Family Tree continues to grow and the number of sources and Memories added continues to grow rapidly? Some of the challenges faced by the Family Tree are hardware issues that relate to the cost and complexity of maintaining a huge online database. But more fundamentally, from the standpoint of the users, the concern is that significant accuracy and consistency problems will develop in the data itself. For users, the issues of accuracy and duplication of entries are paramount.

From a software engineering and database management standpoint, accuracy and error management focus almost exclusively on the hardware and the programming. This is sometimes referred to as "fault tolerance." Fault tolerance is the ability of the infrastructure to continue providing service to underlying applications even after the failure of one or more component pieces in any layer. See "What is fault-tolerance in cloud computing?" The structure of the Family Tree and the rest of the FamilySearch.org website is robust and well developed. What distinguishes the Family Tree is that, unlike the usual Big Data operation, where the accuracy of the content is analyzed statistically, its tolerance for error should not be measured against some statistically acceptable margin of error.

Because the Family Tree is a wiki, no margin of error is in a sense acceptable. We have fundamental religious beliefs supported by scripture that admonish us to work towards not only personal perfection but also perfection in record keeping. Of course, we are forced to rely on imperfect and incomplete historical records, but the idea of working towards perfection should be ingrained into everything that pertains to the Family Tree. Notwithstanding this religious motivational standard, the Family Tree is still Big Data and prone to some of the same issues as other collections of Big Data. In the world of Big Data, these issues are identified under the heading of Data Quality.

If we look at the Family Tree as a formal data system, we should recognize that it has a very specific internal structure, something that is not present in most very large data sets. Everything in the Family Tree has an assigned virtual location, i.e., every individual who has been born or will be born has a specific virtual location within the Family Tree. The wiki format allows the Family Tree to grow exponentially and still maintain the same structure. Granted, some component parts of the Family Tree are not connected together, but that problem is not the fault of the structure; it is the fault of the accuracy and completeness of the data. Over time, and with extensive upgrading of software systems that detect duplication and provide improved and reliable record matching capabilities, the Family Tree can increase its overall accuracy. This process can be prodded along by enhancements in the software that detect common errors and suggest more accurate responses.
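
To make the idea of duplicate detection a little more concrete, here is a minimal sketch in Python of how a program might flag possible duplicate entries. This is only an illustration and not how FamilySearch actually matches records. The Person record, the possible_duplicates function, and the thresholds (a name similarity of 0.85 and a two-year gap in birth years) are all assumptions made up for this example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Person:
    """Hypothetical, heavily simplified tree entry (not the FamilySearch data model)."""
    person_id: str
    name: str
    birth_year: Optional[int]


def name_similarity(a: str, b: str) -> float:
    """Compare two names after lowercasing and collapsing repeated spaces."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def possible_duplicates(
    people: List[Person],
    min_similarity: float = 0.85,  # assumed threshold, chosen for illustration
    max_year_gap: int = 2,         # assumed tolerance for birth year differences
) -> List[Tuple[str, str]]:
    """Return pairs of IDs whose names and birth years are close enough to review."""
    pairs = []
    for p, q in combinations(people, 2):
        # Without a birth year on both entries there is too little to compare.
        if p.birth_year is None or q.birth_year is None:
            continue
        if abs(p.birth_year - q.birth_year) > max_year_gap:
            continue
        if name_similarity(p.name, q.name) >= min_similarity:
            pairs.append((p.person_id, q.person_id))
    return pairs
```

A sketch like this only suggests candidates. Comparing every pair of entries would not scale to billions of records, and deciding whether two entries really are the same person still depends on the sources attached to each entry.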

The accuracy of the entire structure of the Family Tree depends solely upon the accuracy of each entry for each individual. The process of "adding names" should never override the need for accuracy. There are currently some developments that are affecting genealogy and thereby affecting the entries in the Family Tree. One of these is the spread of individual DNA tests. Another is the increasing availability of digitized source documents. With respect to DNA tests, establishing the reliability of the initial entries in the Family Tree, such as the first four generations, could measurably help increase the overall accuracy of the Family Tree.

To the extent that the individual users of the Family Tree maintain high standards of accuracy and support their entries with information from reliable historical records, the overall accuracy of the Family Tree should continue to improve. The challenge for the individual is simply to make each entry as accurate as possible and then the overall accuracy of the Family Tree will continue to grow despite the limitations on accuracy imposed by the reality of Big Data. The efforts of the users can continue to be supported by programmed error detection systems such as the Consistency Checker developed by MyHeritage.com.
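
As a rough illustration of what programmed error detection can look like, here is a minimal Python sketch of a few consistency checks on a single entry. It is not the MyHeritage Consistency Checker or anything used by FamilySearch. The Entry record, the check_entry function, and the limits (a maximum age of 120 and a minimum age of 12 for a mother at a child's birth) are assumptions chosen only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    """Hypothetical, simplified entry for one individual."""
    name: str
    birth_year: Optional[int] = None
    death_year: Optional[int] = None
    mother_birth_year: Optional[int] = None


def check_entry(
    entry: Entry,
    max_age: int = 120,        # assumed upper limit on a plausible lifespan
    min_mother_age: int = 12,  # assumed lower limit on a mother's age at a birth
) -> List[str]:
    """Return a list of warnings; an empty list means no problem was detected."""
    warnings = []
    if entry.birth_year is not None and entry.death_year is not None:
        if entry.death_year < entry.birth_year:
            warnings.append("death before birth")
        elif entry.death_year - entry.birth_year > max_age:
            warnings.append("age at death over %d" % max_age)
    if entry.birth_year is not None and entry.mother_birth_year is not None:
        if entry.birth_year - entry.mother_birth_year < min_mother_age:
            warnings.append("mother too young at birth")
    return warnings
```

Checks like these cannot tell us which date is wrong. They only point the user to an entry that needs another look at the historical records.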

Judging the overall accuracy of the Family Tree by spot checking specific entries is bound to be misleading. Yes, the overall accuracy of the Family Tree is highly dependent on individual choices, but collectively the wiki format allows users to spontaneously change or correct entries. One simple step that would increase accuracy would be to continue implementing an automated consistency detection system, which I realize is already a concern of the programmers. In addition, a concerted education effort for the users would be appropriate and helpful.
