The NIST Structured Forms Database consists of 5,590 pages of binary, black-and-white images of synthesized documents.
The documents in this database are 12 different tax forms from the IRS 1040 Package X for the year 1988: Forms 1040, 2106, 2441, 4562, and 6251, together with Schedules A, B, C, D, E, F, and SE.
Eight of these forms have two pages (form faces), so the database represents 20 different form faces.
The document images in this database appear to be real forms prepared by individuals, but the images have been automatically derived and synthesized using a computer.
There are 900 simulated tax submissions represented in the database, averaging 6.2 form faces per submission.
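The counts quoted above are consistent with one another, and a short check makes the arithmetic explicit. This is only an illustrative sketch: the constant and function names are chosen here for the example and are not part of the database.

```python
# Figures quoted in the description above.
NUM_FORMS = 12          # distinct tax forms
TWO_PAGE_FORMS = 8      # forms that have a second page (form face)
TOTAL_IMAGES = 5590     # form-face images in the database
SUBMISSIONS = 900       # simulated tax submissions


def form_faces(num_forms, two_page_forms):
    """Each form contributes one face; a two-page form contributes one more."""
    return num_forms + two_page_forms


def mean_faces_per_submission(total_images, submissions):
    """Average number of form-face images per simulated submission."""
    return total_images / submissions


faces = form_faces(NUM_FORMS, TWO_PAGE_FORMS)
mean_faces = mean_faces_per_submission(TOTAL_IMAGES, SUBMISSIONS)

print(faces)                 # 20
print(round(mean_faces, 1))  # 6.2
```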
The database has the following features:
900 simulated tax submissions
5,590 images of completed structured form faces
5,590 text files containing entry field answers
20 tables of entry field types and contexts
Suitable for both document processing and automated data capture research, development, and evaluation, the data set can be used for:
forms identification
field isolation: locating the entry fields on the form
character segmentation: separating entry field values into characters
character recognition: identifying specific machine-printed characters
This database is a valuable tool for measurement of system performance and system comparison on complex forms.
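As one possible illustration of measuring system performance, the sketch below scores a system's recognized field values against reference answers by exact-match field accuracy. It rests on assumptions: the layout of the database's answer files is not described here, so both sets of values are represented as plain dictionaries, and the field names and values are hypothetical.

```python
def field_accuracy(truth, predicted):
    """Fraction of reference fields whose predicted value matches exactly.

    Both arguments map a field name to its string value. A field that is
    missing from `predicted` counts as an error.
    """
    if not truth:
        raise ValueError("truth must contain at least one field")
    correct = sum(
        1 for name, value in truth.items() if predicted.get(name) == value
    )
    return correct / len(truth)


# Hypothetical field names and values, for illustration only.
truth = {"name": "JOHN Q PUBLIC", "wages": "31250", "filing_status": "1"}
predicted = {"name": "JOHN Q PUBLIC", "wages": "31260", "filing_status": "1"}

score = field_accuracy(truth, predicted)
print(f"{score:.3f}")  # 0.667
```

Exact matching is a deliberately strict choice; a character-level measure such as edit distance would give partial credit for near misses like the single wrong digit above.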