Frequently Asked Questions about the Million Book Project


What are the research issues in the Million Book Project (MBP)?

Information storage and management: The MBP when completed will produce approximately 250 million pages or 500 billion characters of information. The storage requirements for the image files will be approximately 50 petabytes, an order of magnitude larger than any publicly available information base. Creating and managing such a vast information base poses many technological challenges and provides a fertile test bed for innovative research in many areas (described below). The MBP is a multi-agency, multi-national effort that will require the database to be globally distributed. For location-independent access, this globally distributed database should appear to be a virtual central database from any place around the world. Mirroring the database in several countries will ensure security and availability. The network speeds at the various nodes would be different. Research in distributed caching and active networks would be needed to ensure that the look and feel of the database is the same from any location.

Search engines: The search engines of today work on the principle of keyword matching and perform searches in one language at a time. With a large corpus of multilingual data provided by the MBP, along with multilingual summarization and translation tools, a well-directed research effort would be needed to ensure concept- and content-based retrieval of knowledge from across multilingual data.

Image processing: The accuracy of Optical Character Recognition (OCR), even in some of the most developed languages, is hindered by the bad quality of the images.  This is particularly true for older books and those that use ancient fonts for which the OCR is not tuned.  Even the very best OCR accuracy of the order of 98% may not be acceptable in some cases.  In order to obtain an improved accuracy close to 100%, advanced image processing research that will perform recognition beyond the character level will be needed.  With the availability of large test data from the MBP and the exponentially increasing computing power of the microprocessors, well-directed image processing research would lead to near perfect optical recognizers.
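To make the 98% figure concrete, the short calculation below estimates how many misrecognized characters that accuracy implies at the scale quoted earlier (250 million pages, 500 billion characters). It is only a rough sketch: it assumes that 98% is a per-character accuracy and that errors are spread evenly across pages, neither of which is stated in this FAQ.

```python
# Figures quoted in this FAQ.
TOTAL_PAGES = 250_000_000
TOTAL_CHARS = 500_000_000_000

def expected_errors(char_count: int, accuracy_percent: int) -> int:
    """Expected misrecognized characters, using integer arithmetic.

    Assumption: accuracy_percent is a per-character accuracy.
    """
    return char_count * (100 - accuracy_percent) // 100

# Average page length implied by the two totals above.
chars_per_page = TOTAL_CHARS // TOTAL_PAGES

errors_per_page = expected_errors(chars_per_page, 98)
errors_total = expected_errors(TOTAL_CHARS, 98)

print(f"characters per page: {chars_per_page}")
print(f"expected errors per page at 98%: {errors_per_page}")
print(f"expected errors in the whole collection at 98%: {errors_total}")
```

Under these assumptions an average page would still carry dozens of errors, which is why accuracy close to 100% matters for some uses.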

Optical Character Recognition (OCR) in non-Romanic languages: The MBP will have considerable content in many Indian and Chinese languages. Developing OCR for many of the Indian languages is far more complicated than for English. For example, some of the problems in the development of OCR for Indian languages are:

  • There are 1500 spoken Indian languages and 17 scripts.

  • Unlike English, where the number of characters to be recognized is less than 100, Indian scripts have several hundred characters to be recognized.

  • Non-uniformity in the spacing of the characters within a word because of the presence of Consonant Conjuncts (vowel + consonant) makes OCR more difficult. Also, the presence of Consonant Conjuncts results in improper line segmentation.  Programs will have to do further processing to segment the lines.

  • Consonants take modified shapes when attached to vowels. Vowel modifiers can appear to the right, on the top or at the bottom of the base consonant. Such consonant-vowel combinations are called modified characters. In addition, two, three or four characters can combine to generate new complex shapes called compound characters. These characters are very difficult for a machine to recognize.

  • In scripts like Bangla and Devnagari, all the characters in a word are connected by a unique line called shirorekha (also called head line). In these scripts, character segmentation is especially difficult.

  • In south Indian scripts, vowels occur only at the beginning of a word as against the vowels in Oriya, where they occur anywhere within a word. So, the language morphology for some groups of scripts is different from the others.

  • There is no universally acceptable standard encoding scheme for Indian scripts. This necessitates a scheme where the output labels from the OCR system can be mapped to the labels used by the typesetter through a mapping table.
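The mapping-table idea in the last point can be sketched as a simple lookup. Everything named here is hypothetical: the OCR labels, the typesetter codes and the function name are invented for illustration and do not come from the MBP or from any real OCR system.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping table: OCR output label -> typesetter label.
MAPPING_TABLE: Dict[str, str] = {
    "ocr_021": "ka",
    "ocr_022": "kha",
    "ocr_105": "matra_aa",
}

def map_ocr_labels(
    labels: List[str], table: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Translate OCR labels to typesetter labels.

    Returns the mapped labels in order, plus the labels that had no
    entry in the table so that they can be reviewed by hand.
    """
    mapped: List[str] = []
    unmapped: List[str] = []
    for label in labels:
        if label in table:
            mapped.append(table[label])
        else:
            unmapped.append(label)
    return mapped, unmapped

mapped, unmapped = map_ocr_labels(["ocr_021", "ocr_105", "ocr_999"], MAPPING_TABLE)
print(mapped)
print(unmapped)
```

Keeping the unmapped labels separate, rather than silently dropping them, makes gaps in the table visible.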

Because of the non-availability of quality-segmented data, the recognition rate of the Indian language OCR cannot be pushed beyond 90% using character level recognition. To obtain a higher recognition rate, word level information in the form of a dictionary has to be used. A word corpus is essential, but such a word corpus for most of the Indian scripts is not available today. The MBP will be able to provide this important missing piece. Once such data is available, it will be possible to use advances in image recognition to develop the OCR in Indian and other non-Romanic languages.
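As a rough illustration of how word level information can repair character level mistakes, the sketch below replaces an OCR output word with the closest entry in a word list. It is an assumption-laden toy, not the MBP's method: the word list is a tiny English stand-in for the Indian-language corpus discussed above, the similarity cutoff is arbitrary, and matching uses Python's standard difflib module.

```python
import difflib
from typing import List

# Tiny stand-in word list; a real system would use a large word corpus.
WORD_LIST: List[str] = ["digital", "library", "million", "project"]

def correct_word(ocr_word: str, word_list: List[str], cutoff: float = 0.8) -> str:
    """Return the closest dictionary word, or the input if none is close enough.

    The cutoff (0 to 1) is an illustrative choice, not a tuned value.
    """
    if ocr_word in word_list:
        return ocr_word
    matches = difflib.get_close_matches(ocr_word, word_list, n=1, cutoff=cutoff)
    return matches[0] if matches else ocr_word

print(correct_word("1ibrary", WORD_LIST))  # one misread character
print(correct_word("xyzzy", WORD_LIST))    # nothing similar, left unchanged
```

Words with no sufficiently close dictionary entry are left unchanged, so that names and rare words are not forced into the wrong form.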

Copyright laws and digital rights management: In the new digital economy, providing democratic access to information while suitably and reasonably rewarding the innovator is possible. The largest repertoire of free software available on the web in many cases has been the outcome of state supported research. This free availability of software has in fact contributed to more developments and hence an exponential growth of knowledge. Even in literary and scholarly publications, authors have experienced increased sales of their work whenever it is made freely available on the web. This is in tune with observations in the new economy that companies that make more and more of their software freely available on the web have their market capitalization enhanced. The MBP, with its proposed plans to make a large knowledge base freely available, will provide useful statistics for testing many economic and sociological models.

Language processing: The MBP will produce an extensive and rich test bed for use in further textual language processing research. It is hoped that at least 10,000 books among the million will be available in more than one language, providing a key testing area for problems in example-based machine translation. In the last stage of the project, books in multiple languages will be reviewed to ensure that this test bed feature is accomplished.

Many believe that knowledge is now doubling every two to three years. Machine summarization, intelligent indexing, and information mining are tools that will be needed for individuals to keep up in their discipline work, in their businesses, and in their personal interests. This large digitization project will support research in these areas. This will be of greater significance for the Indian languages, where new tools for summarization, grammar and spell checking, thesaurus and translation dictionaries need to be developed ab initio.

The data provided by the MBP with the right research inputs will facilitate the development of language- and location-independent intelligence amplifiers for furthering information creation.

Publishers might not give the MBP blanket permission to digitize and make available all of their out-of-print, in-copyright titles, but might entertain requests for permission to digitize specific titles.  Is that possible?

The MBP approach is to request permission for a range of years, for example, everything published prior to 1990.  A publisher could specify the cut-off year or, alternatively, specify the list of titles for which they grant non-exclusive permission to digitize in the MBP.

What value-added services will the MBP develop and what formula will be used to calculate publisher royalties?  When might participating publishers begin to see income from the project?

The MBP is not developing a for-profit system. All of the content will be available free-to-read on the Internet.  Participating publishers will get copies of the digitized books and metadata, and can themselves provide or enable others to provide value-added services to access the digital books.  Permission granted to the MBP is NON-exclusive.

Reading the case study of the National Academy Press's experience putting their books online free-to-read could facilitate understanding and appreciation of the benefits of this approach. The case study was published in the Journal of Electronic Publishing, Vol. 4, Issue 4, May 1999.

If a publisher requests removal of a title from the database, what fee would they have to pay for its removal?

The current cost is $200.

What university/scholarly presses are participating in the program?

The National Academy Press has given us permission to digitize all of their books published prior to 1995.  We are currently (July 2002) negotiating with MIT Press.

What kind of accuracy will the MBP achieve in scanning?

Carnegie Mellon has established a workflow (based on pilot "100 book" and "1000 book" projects) that includes steps to ensure capture of high resolution images and essential metadata, post-processing to correct skewing and crop dark borders surrounding the page images, and OCRing to create searchable ASCII text with 98% accuracy.
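The border-cropping step mentioned in this workflow can be pictured with a small sketch. This is not Carnegie Mellon's actual post-processing code, which this FAQ does not describe. It assumes a bitonal page held as a list of rows of 0/1 values (1 = black), and the 90% darkness threshold and function name are illustrative choices.

```python
from typing import List

Image = List[List[int]]  # rows of pixels, 1 = black, 0 = white

def crop_dark_borders(image: Image, dark_fraction: float = 0.9) -> Image:
    """Trim edge rows and columns that are almost entirely black."""
    def is_dark(pixels: List[int]) -> bool:
        return sum(pixels) >= dark_fraction * len(pixels)

    # Trim dark rows from the top and bottom edges.
    top, bottom = 0, len(image)
    while top < bottom and is_dark(image[top]):
        top += 1
    while bottom > top and is_dark(image[bottom - 1]):
        bottom -= 1
    rows = image[top:bottom]
    if not rows:
        return []

    # Trim dark columns from the left and right edges.
    left, right = 0, len(rows[0])
    while left < right and is_dark([row[left] for row in rows]):
        left += 1
    while right > left and is_dark([row[right - 1] for row in rows]):
        right -= 1
    return [row[left:right] for row in rows]

page = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
cropped = crop_dark_borders(page)
for row in cropped:
    print(row)
```

Only rows and columns at the edges are removed, so dark content inside the page, such as the single black pixel in the example, is preserved.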

Once you've scanned a title, how soon will you return TIFFs to the publisher?

We expect/hope to return TIFF images to the publisher in about three months from the time permission is granted, but the time will depend on how long it takes us to locate copies of the books (after permission to digitize them is granted) and how many books are included in the group.

Will the TIFFs meet the Print-On-Demand (POD) standards of Replica and Lightning Source?

The MBP follows the standards and best practices supported in "A Framework of Guidance for Building Good Digital Collections" developed by the Institute of Museum and Library Services in 2001 and endorsed by the Digital Library Federation in 2002.

More specifically, our guidelines for data production (excerpted from the MBP NSF proposal and based on pilot projects) are:

  • Bitonal images with a pixel depth of 1 bit-per-pixel scanned at a resolution of 600 dots per inch (DPI). Images will be stored as "Intel" TIFF (Tagged Image File Format) files with the header content specified. The compression algorithm used is ITU (formerly CCITT) Group 4.

  • TIFF version 5.0 is acceptable.  Subject to testing, version 6.0 (or later) may also be acceptable.

  • The initial-capture system includes dynamic thresholding or a similar feature to capture variability of darkness in the imprint and possibly darker (e.g., foxed) backgrounds from decay.  Images should be as readable as the original pages.

  • Typical expected data will be provided for most TIFF tags (normally, the data supplied by software default settings).  A specification for the TIFF header will be produced to include scanner technical information, filename, and other data, but to be in no way a burden on the production service.

  • Images will be written in sequential order, with corresponding 8.3 filenames, e.g., 00000001.tif as the first image in volume sequence and 00000341.tif as the 341st image in volume sequence.

  • Volumes provided to the MBP will be assigned unique identifiers that conform to 8.3 format.  The images will be in directories named with the corresponding identifier (e.g., the volume identified as akf3435.001 will have a directory with the same name, and 00000001.tif through 0000000N.tif files within that directory).

  • Images and directories (as specified above) will be written to gold CD-ROM according to agreed upon specifications and using ISO9660 format.

  • Skew will be within a specified range of degrees allowed.
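The file and directory naming described in the guidelines above can be sketched as follows. The function names are hypothetical, and the 8.3 check assumes a simple reading of the format (1 to 8 letters or digits, a dot, then 1 to 3 letters or digits); the agreed-upon MBP specifications may differ.

```python
import re
from pathlib import PurePosixPath

# Assumed 8.3 pattern: up to 8 characters, a dot, up to 3 characters.
EIGHT_DOT_THREE = re.compile(r"^[A-Za-z0-9]{1,8}\.[A-Za-z0-9]{1,3}$")

def is_8_3(name: str) -> bool:
    """Check a name against the assumed 8.3 pattern."""
    return EIGHT_DOT_THREE.match(name) is not None

def image_filename(sequence_number: int) -> str:
    """Zero-padded 8.3 filename for the Nth image in volume sequence."""
    if not 1 <= sequence_number <= 99_999_999:
        raise ValueError("sequence number must fit in eight digits")
    return f"{sequence_number:08d}.tif"

def image_path(volume_id: str, sequence_number: int) -> PurePosixPath:
    """Path of an image inside the directory named after its volume."""
    if not is_8_3(volume_id):
        raise ValueError(f"volume identifier is not in 8.3 format: {volume_id}")
    return PurePosixPath(volume_id) / image_filename(sequence_number)

print(image_filename(1))
print(image_path("akf3435.001", 341))
```

The example identifier akf3435.001 and the filenames 00000001.tif and 00000341.tif are the ones used in the guidelines.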

Will the consortium formed to address publisher compensation issues have anything to do with the administration of the MBP?  If not, how will the MBP be administered?  Who will make decisions for it?  Will participating publishers have a role in decision making?

No, neither the consortium nor participating publishers will have an administrative role in the MBP.  Administrative responsibility for the MBP belongs to officials at participating universities.

At Indian Institute of Science, the administrative officials are:

Prof. N. Balakrishnan, Associate Director, Indian Institute of Science

Who will determine the pricing of value-added components of the MBP?

The publishers or vendors who develop the value-added components will determine the pricing for the services they provide.