Monday, September 14, 2009

Google Books Metadata and Library Functions

In a recent post on the NGC4LIB list, we got a very welcome answer from Chip Nilges of OCLC about Google's use of WorldCat records:
To answer Karen's most recent post, Google can use any WC metadata field. And it's important to note as well that our agreement with Google is not exclusive. We're happy to work with others in the same way. The goal, as I said in my original post, is to support the efforts of our members to bring their collections online, make them discoverable, and drive traffic to library services.

Regards,

Chip

As we have seen from recent postings about the metadata presented in the Google Books Search (GBS) service, there are some problems. Although Google claims to have taken the metadata from its library partners, comparing a record in GBS with the record for the same item in the library partner's database shows how very different they are. It is clear that Google has not retained all of the fields that libraries have provided, and has made some very odd choices about what to keep. Perhaps what we need to do, to help Google improve the metadata, is to make clear which data elements we anticipate needing in order to integrate the Google product with library services.

When you ask people what metadata is needed for a service, they will often reply something like "everything" or "more is better." I'm going to take a different approach here because I think it is a good idea to connect metadata needs with actual functionality. This not only justifies the metadata, but the functionality helps explain the nature of the metadata that is required. For example, if we say that we want "date of publication" in our metadata, it may seem that we could use the date from the publication statement, which can have dates like "c1956" or "[1924]." If, instead, we indicate that we want to use dates in computational research, then it is clear (hopefully) that we need the fixed field date (from the 008 field in the MARC record).
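
To make the date example concrete, here is a minimal sketch in Python. It assumes the MARC 21 layout in which character positions 07-10 of the 008 field hold "Date 1"; the function names and the sample 008 value are hypothetical, made up for illustration, and are not taken from any Google or library system.

```python
import re

def year_from_statement(statement_date):
    """Guess a year from a transcribed date such as 'c1956' or '[1924]'.

    This is only a guess: the brackets and the 'c' carry meaning for a
    human reader that is lost once we keep just the digits.
    """
    match = re.search(r"[0-9]{4}", statement_date)
    return int(match.group()) if match else None

def year_from_008(field_008):
    """Read Date 1 from positions 07-10 of a MARC 008 fixed field.

    Returns None when the date is not four plain digits (for example
    '19uu' for an unknown year), so callers can skip it in calculations.
    """
    date1 = field_008[7:11]
    return int(date1) if re.fullmatch(r"[0-9]{4}", date1) else None

# Hypothetical 40-character 008 field, built piece by piece for clarity.
sample_008 = (
    "560814"    # 00-05: date the record was entered
    + "s"       # 06: type of date (single known date)
    + "1956"    # 07-10: Date 1
    + "    "    # 11-14: Date 2 (blank)
    + "nyu"     # 15-17: place of publication
    + " " * 17  # 18-34: book-specific codes, left blank here
    + "eng"     # 35-37: language
    + " "       # 38: modified record
    + "d"       # 39: cataloging source
)

print(year_from_statement("c1956"), year_from_008(sample_008))
```

The fixed field gives a value that can be sorted and counted directly, while the transcribed date has to be interpreted first.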

So here are the functions that come to my mind, and I welcome additions. (Do remember that at this point we are only talking about books, so many fields relating to other formats will not be included.) I'll add the related MARC fields as I get a chance.

Function: Scholarship
Need: A thorough description of the edition in question. This will include authors, titles, physical description, and series information.


Function: Metasearch
Need: To be able to combine searches in GBS with searches on the same data elements in library catalogs. Generally this means "headings" from the bibliographic record (authors, titles, subject headings).


Function: Collection development
Need: To use GBS to fill in gaps (or make comparisons) in a library's holdings, usually using classification numbers.


Function: Linking to other bibliographic collections or databases
Need: Identifiers and headings that may be found in other collections that would allow linking.

Function: Computation
Need: Data elements that can mark a text in time and space (date and place of publication), as well as those that can help segment the file, like language. This function also may need to rely on combining editions into groupings of Works, since this research may need to distinguish Works from Manifestations. Computation will most likely use metadata as a controlled vocabulary, and the full text of the work as the "meat" of the research.
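
As a rough illustration of the Computation function, the sketch below pulls date, place, and language out of a MARC 008 field and uses language to segment a set of records. It assumes the MARC 21 positions 07-10 (Date 1), 15-17 (place of publication), and 35-37 (language); the names FixedFields, parse_008, and make_008, and the sample records, are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple, Optional

class FixedFields(NamedTuple):
    year: Optional[int]  # Date 1, or None if not four plain digits
    place: str           # MARC country/place code, e.g. 'nyu'
    language: str        # MARC language code, e.g. 'eng'

def parse_008(field_008):
    """Extract date, place, and language from a MARC 008 fixed field."""
    padded = field_008.ljust(40)  # tolerate truncated fields
    date1 = padded[7:11]
    year = int(date1) if date1.isascii() and date1.isdigit() else None
    return FixedFields(year, padded[15:18].strip(), padded[35:38].strip())

def make_008(date1, place, language):
    """Build a hypothetical 008 field for the examples below."""
    return "000000" + "s" + date1 + "    " + place + " " * 17 + language + " d"

# Hypothetical records: (title, 008 field).
records = [
    ("Title A", make_008("1956", "nyu", "eng")),
    ("Title B", make_008("1924", "fr ", "fre")),
    ("Title C", make_008("19uu", "enk", "eng")),
]

# Segment the file by language, keeping only records with a usable year.
by_language = defaultdict(list)
for title, field in records:
    fields = parse_008(field)
    if fields.year is not None:
        by_language[fields.language].append((fields.year, title))

print(dict(by_language))
```

Records whose date is only partly known drop out of the year-based grouping here; a real analysis would need to decide how to treat them.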

2 comments:

Anonymous said...

I'd suggest that, since this perspective is similar to the current prevailing library worldview, in which metadata standards and their application are employed primarily in service of librarians and scholars, you or others develop a third-party solution for these communities using this approach instead of expecting that GBS will adopt a similar worldview. I'm not saying that this metadata doesn't assist other user communities (particularly when it is all that is available), but I am saying that the approach (and representation) that you are suggesting is limited in user identity as well as at odds with the highly successful (if flawed) approach that Google has taken with its other discovery tools. That approach, based on how users create and interact with information resources, holds a great deal of promise over time for end users outside the narrow community of librarians (particularly those in the cataloging/metadata community) and those scholars already conversant with traditional library metadata tools.

I expect libraries, or at least library vendors, will create metadata-based tools to mine GB. I wouldn't expect Google to create metadata tools to support librarians, however. Perhaps they will - I just hope it won't be to the detriment of other users, as has been the case in the library community in the midst of evolving technological capabilities.

Karen Coyle said...

Anon,

I would love it if a third-party solution were possible, but that requires 2 conditions:

1) that there be a way for a third party to know which books Google has in its service. Google would either need to release a catalog of its holdings, or allow systematic crawling of its site. My guess is that neither of these will be available. To date, Google has been very closed about the coverage of its database, and has only released the actual number of books scanned (at that time) in the settlement document. If we can't learn how many documents have been scanned, I sincerely doubt that we will be able to know which ones have been. Google appears to guard this information closely. I assume it is considered of competitive importance (e.g. vis-a-vis Amazon).

2) To 'know' what Google has means to have metadata that clearly identifies each book. So for any service outside of Google to act as a catalog, Google has to provide identifying metadata. I should probably treat identifying metadata as a separate requirement, although the data elements are included in the requirements I have posted here. At the moment, the metadata provided by Google (either online or through its API) is insufficient to identify the item. This makes sense if you see Google Books primarily as a tool for topical research that relies on keyword searching. It doesn't work for bibliographic research, or for the kinds of scholarly activities that may follow on from the discovery of a book through a search on Google.

I suppose I should add 3) -- which is that there are indications that OCLC is creating a MARC record for each book held by Google, and therefore will position itself as the third party for interaction with services based on library cataloging. This, unfortunately, means that no fourth party will be able to enter into the picture as OCLC will not provide records for competitive products (and will not allow the participating libraries to do so either).