Retrieving digitized data from the U.S. Census

NSF Award:

XSEDE: eXtreme Science and Engineering Discovery Environment  (University of Illinois at Urbana-Champaign)

As a result of work funded by NSF and the National Archives and Records Administration (NARA), the newly released 1940 census records are now in digital format. This considerably reduces the work involved in conserving these valuable records. Users searching census records for ancestry details or for demographic research will find their searches easier and more comprehensive.

U.S. Census forms remain confidential for 72 years. After that time they are released to the public--a treasure trove for both genealogy buffs and researchers. Previously, the Census Bureau created microfilm images of the millions of paper forms. Companies then hired thousands of people to spend months transcribing the microfilm to create a searchable, online resource.

With the release of the 1940 census data in April 2012, the Census Bureau switched to all-digital data. To maximize the usability of the digital data, NARA awarded a grant to a team based at the National Center for Supercomputing Applications (NCSA) to develop a framework to provide "searchable access" to archives of digitized documents.

Currently, the millions of images, constituting terabytes of data, can't be easily searched for names, locations or trends, and manual transcription of the forms is far too expensive. One search option, optical character recognition, is limited in its accuracy. However, the NCSA framework will enable users to enter handwritten queries--either by writing with a mouse (or a touch screen on a mobile device) or by typing a word that is then rendered in a handwriting font--and search a database of images of handwritten text for potential matches.
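The idea of matching a query image against a database of handwritten cell images can be sketched in a few lines. This is only an illustration under stated assumptions, not the NCSA framework's method: the "ink profile" feature, the Euclidean distance, and the names `ink_profile`, `search`, and the `cell_*` identifiers are all hypothetical, chosen to keep the example small.

```python
import math

def ink_profile(bitmap):
    """Turn a binary bitmap (rows of 0/1) into a feature vector:
    the fraction of inked pixels in each column, then in each row.
    (A hypothetical stand-in for real handwriting features.)"""
    height, width = len(bitmap), len(bitmap[0])
    cols = [sum(row[x] for row in bitmap) / height for x in range(width)]
    rows = [sum(row) / width for row in bitmap]
    return cols + rows

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(query_bitmap, index, top_k=3):
    """Rank indexed cell images by feature distance to the query image."""
    q = ink_profile(query_bitmap)
    ranked = sorted(index.items(), key=lambda item: distance(q, item[1]))
    return [image_id for image_id, _ in ranked[:top_k]]

# Toy 3x3 "cell images" standing in for scanned handwriting.
cells = {
    "cell_a": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],  # L-shaped stroke
    "cell_b": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],  # T-shaped stroke
    "cell_c": [[1, 0, 0], [1, 0, 0], [1, 1, 0]],  # almost an L
}
index = {cell_id: ink_profile(bmp) for cell_id, bmp in cells.items()}

# The query is itself an image, drawn by the user or rendered from typed text.
query = [[1, 0, 0], [1, 0, 0], [1, 1, 1]]
results = search(query, index, top_k=2)
print(results)
```

Because the query is compared as an image rather than as recognized text, near matches such as the "almost an L" cell are still returned, ranked behind the exact match.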

Because handwritten text is variable, not all of the returned results will be perfect matches. Users will help improve the results through a passive form of crowdsourcing. After searching for "Smith," a user will likely click only on the results that actually read "Smith." The query text entered by the user can then be connected to the image results the user selects, allowing the image database to be annotated gradually. Over time, input from user queries will improve the system's ability to return accurate results.
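The passive crowdsourcing loop described above can be sketched as a simple vote counter. This is a minimal sketch under assumptions, not the project's implementation: the class name `ClickAnnotator`, the `min_votes` threshold, and the image identifiers are hypothetical.

```python
from collections import Counter, defaultdict

class ClickAnnotator:
    """Accumulate (query text, clicked image) pairs as tentative labels."""

    def __init__(self, min_votes=2):
        # min_votes is an assumed threshold: a label is trusted only after
        # this many clicks have agreed on the same query text.
        self.min_votes = min_votes
        self.votes = defaultdict(Counter)

    def record_click(self, query_text, image_id):
        """Called when a user clicks image_id after searching for query_text."""
        self.votes[image_id][query_text.strip().lower()] += 1

    def label(self, image_id):
        """Return the most-voted text for an image, or None if evidence is thin."""
        counts = self.votes.get(image_id)
        if not counts:
            return None
        text, n = counts.most_common(1)[0]
        return text if n >= self.min_votes else None

annotator = ClickAnnotator(min_votes=2)
annotator.record_click("Smith", "cell_17")
annotator.record_click("smith", "cell_17")
annotator.record_click("Smyth", "cell_17")
annotator.record_click("Smith", "cell_42")

print(annotator.label("cell_17"))
print(annotator.label("cell_42"))
```

Requiring several agreeing clicks before accepting a label is one way to keep a single mistaken click from annotating an image incorrectly; the right threshold would depend on real usage.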

Images

  • a handwritten census form
  • a template separates areas of the census form into individual images
Each cell of a template fitted over the census form is a separate image.
National Center for Supercomputing Applications at UIUC
Grouping similar features produces faster search results.
National Center for Supercomputing Applications at UIUC

