One of the major costs in most digitization projects is metadata extraction. In the case of phonograph records, the possible sources of metadata are the record label, the album cover, and the liner notes. The challenge lies in locating the appropriate information and then extracting it from these different sources. An important component of the research is to minimize human intervention by automatically generating text and metadata from the captured images using document analysis and recognition techniques.
To extract the data automatically, intelligent document analysis techniques must be deployed, because the required data may appear anywhere in these sources. The specialized document analysis required for this project is implemented with open-source software called Gamera. To create a database of audio files, text files, and metadata that is searchable and accessible via the web, another open-source package called Greenstone, developed by a digital library project group at the University of Waikato in New Zealand, is installed. Similar to Greenstone, Fedora is an open-source digital repository management system that supports digital libraries, content management, digital asset management, and digital preservation. Its special features include content versioning and a migration utility for mass export and mass ingest of objects from either directories or other repositories.
To create consistent names in the metadata, an automated name authority control methodology developed at Johns Hopkins University is deployed. It is anticipated that this will also aid the optical character recognition step by providing a dictionary of names and their variants.
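The way a dictionary of names and variants can help correct OCR output may be illustrated with a minimal sketch. It is not the Johns Hopkins methodology; it substitutes simple fuzzy string matching from the Python standard library, and the authority entries, the function names `build_variant_index` and `resolve_name`, and the `cutoff` value are all hypothetical choices made for illustration.

```python
import difflib

# Hypothetical authority data: authorized heading -> known variant spellings.
AUTHORITY = {
    "Tchaikovsky, Peter Ilich": [
        "Tchaikovsky", "Tschaikowsky", "Chaikovskii", "P. I. Tchaikovsky",
    ],
    "Beethoven, Ludwig van": [
        "Beethoven", "L. van Beethoven", "Ludwig van Beethoven",
    ],
}


def build_variant_index(authority):
    """Map every lower-cased heading and variant to its authorized heading."""
    index = {}
    for heading, variants in authority.items():
        for name in [heading, *variants]:
            index[name.lower()] = heading
    return index


def resolve_name(ocr_text, index, cutoff=0.8):
    """Return the authorized heading for an OCR'd name, or None if no
    variant is similar enough (cutoff is an assumed similarity threshold)."""
    key = ocr_text.strip().lower()
    if key in index:  # exact match on a known variant
        return index[key]
    # Otherwise fall back to the closest known variant, if any clears the cutoff.
    matches = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    return index[matches[0]] if matches else None


index = build_variant_index(AUTHORITY)
print(resolve_name("Tscha1kowsky", index))  # OCR confused "i" with "1"
print(resolve_name("Mozart", index))        # not in the authority data
```

In this sketch a misread variant is mapped back to a single authorized heading, while a name absent from the dictionary is left unresolved for human review rather than being forced onto the nearest entry.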
For audio processing, an open-source program called Audacity is used. Three audio formats will be created: a high-quality format to be stored as the archival copy, a CD-quality format for on-line storage, and a high-quality MP3 format for web delivery. A few additional open-source audio post-processing applications will be developed in an effort to correct the equalization curve and to remove noise from the digitized sound files. These applications will be compared with some of the commercially available software for cleaning up old recordings.
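As an illustration of what correcting an equalization curve involves, the following sketch computes the playback (de-emphasis) gain of the standard RIAA curve, normalized to 0 dB at 1 kHz. The choice of the RIAA curve is an assumption made for this example: older recordings were often cut with other curves, and the project's own applications are not described here. The function names are hypothetical.

```python
import math

# Standard RIAA time constants in seconds (3180 us, 318 us, 75 us).
T1, T2, T3 = 3180e-6, 318e-6, 75e-6


def riaa_playback_magnitude(freq_hz):
    """Magnitude of the RIAA playback response
    H(s) = (1 + s*T2) / ((1 + s*T1) * (1 + s*T3)) at s = j*2*pi*f."""
    w = 2.0 * math.pi * freq_hz
    numerator = math.hypot(1.0, w * T2)
    denominator = math.hypot(1.0, w * T1) * math.hypot(1.0, w * T3)
    return numerator / denominator


def riaa_playback_gain_db(freq_hz, ref_hz=1000.0):
    """Playback gain in dB relative to the gain at ref_hz (1 kHz by convention)."""
    ratio = riaa_playback_magnitude(freq_hz) / riaa_playback_magnitude(ref_hz)
    return 20.0 * math.log10(ratio)


for f in (20, 100, 1000, 10000, 20000):
    print(f"{f:>6} Hz: {riaa_playback_gain_db(f):+.2f} dB")
```

A correction tool for a record cut with a different curve would, under these assumptions, apply the difference between that curve's playback gain and the gain already applied by the phono preamplifier at each frequency.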
A partial project equipment summary lists the equipment and discusses some of the installation issues.