Friday, July 27, 2007

The ALCTS Serials Section, the folks who give the award for the Worst Serial Title Change each year, has just announced its own title change: it will now be known as the Continuing Resources Section. This does not further our profession's ability to communicate with the world around us.
Meanwhile, I was looking at the Bibliographic Ontology, work being done by some individual academics who wish to create a standard for the expression of academic citations. Their vocabulary is also notable: they have documents (article, book, patent...) and they have collections (journal, magazine, periodical). I've been in discussions before where people were declaring journals and magazines as separate document types and I've never gotten a definition that I found satisfactory, although there's no question that if you put Journal of Immunological Methods beside Vogue, no one would have trouble seeing them as different publication types.
Unfortunately the ontology defines a journal as "A collection of journal Articles," and a magazine as "A collection of magazine Articles." I have to say that's not very ontological of them.
Some of the suggestions made at the recent meeting on the Future of Bibliographic Control encourage us to get more bibliographic creation integrated into the authoring and publication workflows. I have long felt the need for a standard bibliographic model for citations which would make linking between citations and their cited documents easier, and one that could be used by common document creation software. The Bibliographic Ontology unfortunately internalizes too much nerdy academic practice, but at least it uses words that most academics might understand. It's patently clear that we librarians cannot go out into the world talking about "continuing resources" and hope to meet with any comprehension. This is just one small illustration of the gap that we need to cross before we can talk to anyone outside of our own secret cabal.
Friday, July 20, 2007
Copies, duplicates, identification
In at least three projects I'm working on now I am seeing problems with the conflict between managing copies (which libraries do) and managing content (which users want). Even before we go chasing after the FRBR concept of the work, we are already dealing with what FRBR-izers would call "different items of the same manifestation." Given that the items we tend to hold were mass produced, and thus there are many copies of them, it seems odd that we have never found a way to identify the published set that those items belong to.
"Ah," you say, "what about the ISBN?" The ISBN is a good manifestation identifier for things published after 1968 (not to mention some teddy bears and fancy chocolates), but it doesn't help us for anything earlier than that.
You probably aren't saying, "What about the BICI?" The BICI was an admirable attempt to create a book identifier similar to the SICI (which covers serials, serial issues, and serial articles). The BICI never got beyond being a draft NISO standard, presumably because no one was interested in using it. The SICI is indeed a full NISO standard, but it seems to be falling out of use. Both were identifiers that could be derived either from the piece or from its metadata, which is in itself not a bad idea. What was less than a good idea is that a BICI could only be derived for books that have ISBNs; but if you've got an ISBN you haven't a whole lot of use for a BICI, although it would allow you to identify individual chapters or sections of the book. As a book identifier, it doesn't do much for us.
Now that we're moving into a time of digitization of books, I'm wondering if we can't at least find a way to identify the duplicate digital copies (of which there will be many as the various digitization projects go forward, madly grabbing books off of shelves and rushing them to scanners). Early books were identified using incipits, usually a few characters of beginning and ending text. Today's identifier would have to be more clever, but surely, with the ability to run a computation on the digitized book, there would be some way to derive an identifier that is accurate enough for the kind of operation where lives aren't usually at stake. We would then need to connect the derived identifier to the physical copies of the book, but I'm confident we can do that, even if it takes a bit of time.
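To make that idea concrete, here is a minimal sketch of one such computation, assuming we have the OCR text of a scan. It hashes overlapping word "shingles" and keeps only the smallest hash values (a MinHash-style signature), so two scans of the same edition will share most of their signature even with some OCR noise. Everything here – function names, parameters, thresholds – is my own invention, not a proposed standard.

```python
import hashlib
import re

def min_hash_signature(ocr_text: str, shingle_size: int = 8,
                       sample: int = 64) -> list[int]:
    """Hash overlapping word shingles from the OCR text and keep the
    `sample` numerically smallest values as a compact signature."""
    # Normalize: lowercase, keep letters and spaces only.
    words = re.sub(r"[^a-z ]+", " ", ocr_text.lower()).split()
    shingles = {" ".join(words[i:i + shingle_size])
                for i in range(max(1, len(words) - shingle_size + 1))}
    hashes = sorted(int(hashlib.sha1(s.encode()).hexdigest(), 16)
                    for s in shingles)
    return hashes[:sample]

def likely_same_edition(sig_a: list[int], sig_b: list[int],
                        threshold: float = 0.8) -> bool:
    """Two scans of the same manifestation should share most of their
    signature despite OCR errors; different books almost never will."""
    overlap = len(set(sig_a) & set(sig_b)) / max(len(sig_a), 1)
    return overlap >= threshold
```

Comparing signatures for overlap, rather than demanding an exact match, is what makes this tolerant enough for operations where lives aren't at stake.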
Both Google and the Internet Archive are assigning unique identifiers to digitized books, but we have to presume that these are internal copy level identifiers, not manifestation-specific. The Archive seems to use some combination of the title and the author. Thus "Venice" by Mortimer Menpes is venicemenpes00menpiala while "Venice" by Beryl De Zoete is venicedeselincou00dezoiala and "Venice" by Daniel Pidgeon is venicepidgeon00pidgiala. The zeroes in there lead me to believe that if they received another copy it would get identified as "01." Google produces an impenetrable identifier for the Mortimer Menpes book: id=4XsKAAAAIAAJ, which may or may not be derivable from the book itself. I suspect not. And we know that Google will have duplicates so we also know that each item will be identified, not each manifestation.
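Just for fun, here is a guess at the Archive's scheme, reverse-engineered from nothing more than the three examples above. It reproduces the Menpes and Pidgeon identifiers, though the De Zoete one suggests the author segment actually comes from somewhere else in their metadata; this is purely speculative.

```python
import re

def ia_style_id(title: str, author_surname: str, copy: int = 0,
                contributor: str = "iala") -> str:
    """Speculative reconstruction of the Archive's identifiers: a
    squashed title+surname prefix truncated to 16 characters, a
    two-digit copy number, the first four letters of the surname,
    and a contributor code."""
    squash = lambda s: re.sub(r"[^a-z]", "", s.lower())
    prefix = (squash(title) + squash(author_surname))[:16]
    return f"{prefix}{copy:02d}{squash(author_surname)[:4]}{contributor}"

print(ia_style_id("Venice", "Menpes"))   # venicemenpes00menpiala
print(ia_style_id("Venice", "Pidgeon"))  # venicepidgeon00pidgiala
```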
Meanwhile, there is a rumor circulating that discussion is taking place at Bowker, the ISBN agency, on the feasibility of assigning ISBNs to pre-1968 works, especially as they get digitized. I'm very interested in how (if?) we can attach such an identifier to the many copies of the books that already exist, and to their metadata. (This sounds like a job for WorldCat, doesn't it, since they have probably the biggest and most accurately de-duped database of manifestations.)
I know nothing more about it than that, but will pass along any info if I get it. And I'd love to hear from anyone who does know more.
"Ah," you say, "what about the ISBN?" The ISBN is a good manifestation identifier for things published after 1968 (not to mention some teddy bears and fancy chocolates), but it doesn't help us for anything earlier than that.
You probably aren't saying, "What about the BICI?" which was an admirable attempt to create a book identifier similar to the SICI (which covers serials, serials issues, and serials articles). The BICI never got beyond being a draft NISO standard, presumably because no one was interested in using it. The SICI is indeed a full NISO standard, but it seems to be falling out of use. Both of these were identifiers that could be derived either from the piece or from metadata, which is in itself not a bad idea. What was a less than good idea is that the BICI only could be derived for books that have ISBNs, but if you've got an ISBN you haven't a whole lot of use for a BICI, although it would allow you to identify individual chapters or sections of the book. But as a book identifier, it doesn't do much for us.
Now that we're moving into a time of digitization of books, I'm wondering if we can't at least find a way to identify the duplicate digital copies (of which there will be many as the various digitization projects go forward, madly grabbing books off of shelves and rushing them to scanners). Early books were identified using incipits, usually a few characters of beginning and ending text. Today's identifier would have to be more clever, but surely with the ability to run a computation on the digitized book there would be some way to derive an identifier that is accurate enough for the kind of operation where lives aren't usually at stake. There would be the need to connect the derived book identifier to the physical copies of the book, but I'm confident we can do that, even if over a bit of time.
Both Google and the Internet Archive are assigning unique identifiers to digitized books, but we have to presume that these are internal copy level identifiers, not manifestation-specific. The Archive seems to use some combination of the title and the author. Thus "Venice" by Mortimer Menpes is venicemenpes00menpiala while "Venice" by Berly De Zoete is venicedeselincou00dezoiala and "Venice" by Daniel Pidgeon is venicepidgeon00pidgiala. The zeroes in there lead me to believe that if they received another copy it would get identified as "01." Google produces an impenetrable identifier for the Mortimer Menpes book: id=4XsKAAAAIAAJ, which may or may not be derivable from the book itself. I suspect not. And we know that Google will have duplicates so we also know that each item will be identified, not each manifestation.
Meanwhile, there is a rumor circulating that the there is discussion taking place at Bowker, the ISBN agency, on the feasibility of assigning ISBNs to pre-1968 works, especially as they get digitized. I'm very interested in how (if?) we can attach such an identifier to the many copies of the books that already exist, and to their metadata. (This sounds like a job for WorldCat, doesn't it, since they have probably the biggest and most accurately de-duped database of manifestations.)
I know nothing more about it than that, but will pass along any info if I get it. And I'd love to hear from anyone who does know more.
Thursday, July 12, 2007
FoBC Meeting 3, Detailed Notes
Introduction
Speaker: Deanna Marcum, Associate Librarian for Library Services, Library of Congress
This is the 3rd public session. Comments can still be sent to the committee or via the web site until the end of July.
The question is turning out to cover more than bibliographic control. Instead the broader question is: what is librarianship about in the web world?
When MARC was introduced, libraries were concerned about the implications it would have for their own local cataloging, and weren't sure they wanted to adopt the standard. Conforming to it meant giving up local practice. But we have gotten many benefits.
In the web world, users have the opportunity to use their own language for searching, and they are being successful. So what contributions can users make, and what will make things more effective for our users?
The theme today is economics and organization. Many librarians believe that cataloging should not be an economic issue. In "this" world, it is not possible for us to ignore the economic implications of cataloging.
The Library of Congress provides cataloging as a service, and that helps other libraries economically. But Library of Congress has no budget line for that service.
Speaker: José-Marie Griffiths, Chair, Working Group, University of North Carolina at Chapel Hill
This is the third of three meetings, each with a different theme.
1. Who uses bibliographic data produced by libraries, and what are the needs of users?
The meeting showed that there is a wide variety of users and uses.
2. Standards and structures
One issue that came out at that meeting is whether the process serves the needs of the community.
3. Economics and Organization
One study that the speaker has conducted was to determine the actual costs of "free" services.
Speaker: Judith Nadler, Working Group Member, University of Chicago Library
Judy described the meetings as being about Who, What, and How. We are now at the How.
Setting the Stage
Speaker: Rick Lugg, Partner, R2 Consulting
He used to always say that there is no such thing as a bibliographic emergency. However, in the past few years he has found himself working as a bibliographic trauma specialist. As consultants, R2 gets called in to see things that aren't working. In the cataloging area, he has seen huge backlogs that are so well established they have sophisticated inventory systems. With hard-copy backlogs you can go into a storage room and see the huge amount of material there. In the digital world you can't see the backlogs. Broken links aren't visible. You don't know what isn't getting done. We don't have a measure of how far behind we are in the digital world.
He said that the cost of bibliographic control is disproportionate to benefits. [kc: It would be great to have a way to measure that, or at least to measure what parts of the bibliographic record produce the greatest benefits.]
The MARC record for a basic monograph is a commodity. It is estimated that creating a MARC record costs $150-$200. The book is cataloged once, and the cataloging is used many times. Libraries have contained costs by using different levels of staff for copy cataloging. But there are still a lot of duplicative costs in the system.
We have a cult of perfection with the following beliefs:
1 – bibliographic perfection is attainable
2 – cataloging is still about the arrangement of print books on the shelf
Bibliographic Perfection
One of the main barriers to cost savings is the desire to create the perfect record: people change bibliographic records, or at least check all of the details. They change call numbers and use custom Cuttering schemes. Many still write the call number in pencil on the verso of the title page. Some check the reported size with rulers. We focus on the record itself rather than on what the record is for.
We have a narrow view of quality – we see quality as being about the record, but not about timeliness. (Thus, the backlogs.)
What is good enough? The question should be: does this error impede access?
We need to take advantage of work done elsewhere in the supply chain.
Shelf arrangement still influences cataloging, but many items are in storage where shelf order doesn't matter. We still create unique call numbers, but duplicate call numbers don't prevent access. We need to think about browsing online, not just on the shelf.
We also need to consider the total cost of bibliographic control. There are the initial costs, but we also need to consider full lifecycle cost. Records are changed at various points, for example as we move items offsite, or move a book out of reference. Most of these changes are done manually. In serials, as we move from print to electronic and end or modify print subscriptions, records have to be updated. Much of this is inventory control, but still means record changes.
There are opportunity costs: What are we not doing that we should be doing? Answer: special collections cataloging, cataloging unique materials, and rare books, manuscripts and archives.
Another opportunity cost: we have no capacity for non-MARC metadata – no one has time to learn MODS, METS, DC. There is a cost in the delay in moving in new directions.
We are involved in mass digitization, but we haven't started working on discovery of full text.
Catalogers are not involved in systems development early on, which affects how systems are developed.
How can we collaborate with others (not just other libraries) to create a richer bibliographic record?
Q: I asked: To what extent is the complexity of MARC an issue? His answer was rather vague, so I think he hadn't really thought about this in detail. It would be interesting to know how much time is spent on things like fixed fields, or on figuring out subfielding. It would also be interesting to do more experimentation with interfaces. Later speakers brought up the idea of using systems better to help catalogers work faster.
Speaker: Lizanne Payne - Library Consortium
Executive Director, Washington Research Library Consortium
Lizanne Payne talked about how consortia can affect costs. Their main role is often providing joint licensing of digital materials, but they are also involved in ERMs and ILL workflow. They usually share a common OPAC to facilitate borrowing, and sometimes have a common ILS; the latter allows them to share the cost of IT staff by centralizing systems. If they don't share an ILS, then there is duplication between the local catalogs and the union catalog. You need three levels of bibliographic control (see the sketch below):
1 – master record
2 – individual library records (e.g., for special subject control)
3 – holdings, shelving, etc.
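[kc: As a minimal sketch of how those three levels might relate – all class and field names here are my own invention, not anyone's actual data model – each level points up to the one it specializes:]

```python
from dataclasses import dataclass, field

@dataclass
class MasterRecord:
    """Level 1: the consortium-wide description of a manifestation."""
    record_id: str
    title: str
    author: str

@dataclass
class LocalRecord:
    """Level 2: one library's view of the master record, with local
    additions such as special subject control."""
    master: MasterRecord
    library: str
    local_subjects: list[str] = field(default_factory=list)

@dataclass
class Holding:
    """Level 3: a specific copy, with shelving and location details."""
    local: LocalRecord
    call_number: str
    location: str  # e.g., "main stacks" or "offsite storage"
```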
Where libraries share a storage facility, searching for duplicates before sending to storage is very expensive.
[This talk brought up some interesting thoughts about duplication – of materials and of catalog records. Duplication keeps coming up for me in various projects I am working on, and it seems to have cost implications at a lot of levels, especially those areas where duplication in the user view is not desirable, but duplication that exists in the real world also serves users where access is concerned.]
I also learned from Payne's talk that MFHD is pronounced "muffhead."
Speaker: Mary Catherine Little - Public Library
Director, Technical Services Department, Queens Borough Public Library
Little gave some good arguments for matching your cataloging to your actual need. She manages a huge and active public library with 65 different languages represented in the collection. She doesn't have the ability to produce cataloging in all of those languages so she relies on vendor-supplied copy and doesn't augment it. Her bottom line is to know what the library owns and give users access to it. She asks herself: am I creating data I'm not likely to use? Am I creating enough data for the ILS to function today? Tomorrow?
And, would this item be replaced if lost? (Many of her books are popular reading that are used for a few years and then discarded when worn out.) She even has some un-cataloged collections that are accessed at the shelf only. But fewer users today are in the library. [Note: there were various mentions that digital materials require more and better metadata, but no one really connected this to the fact that our collections are increasingly digital.]
She called for more sharing of vendor data – which of course means a change on the part of vendors.
Speaker: Susan Fifer Canby - Special Library
Vice-President, Library and Information Services, National Geographic Society
The Special Library case was quite different from either public or academic libraries.
Some special libraries hold proprietary data that cannot be shared. They are focused on service to their organizations and often have considerable collections of archival and organizational records. They may have responsibility for all or part of the organization's web site. They may also use their collection for e-commerce, as is the case with the National Geographic Society's photo archives.
On the other hand, an organization can require that internal data providers attach certain metadata (like subject headings) to items they store.
The special library is not seen as a general good by its organization. It is a cost center, and therefore has to produce value. Bibliographic control is not a major activity for them.
Questions and comments
Q: There seems to be a distinction being made between bibliographic control and inventory control.
Lugg: That starts way back in the chain. For vendors it's about inventory and sales. In systems, the overhead of using MARC as a transaction vehicle is too much, so the transaction areas of systems tend to keep less data and match it up to MARC when needed. However, libraries often see transaction data as part of the MARC record (because they display together). There are different needs within the system, and the MARC record shouldn't change when items circulate.
Q: The committee has done some thinking about atomizing the MARC record, removing some complexity and creating different structures for the different functions.
Payne: MARC was designed for transmittal, not for daily use. And there's no standardization for how it is broken apart and used in our systems, which makes system upgrades difficult. There are lots of areas of our systems that we haven't standardized.
Lugg: This really shows up in the holdings area. Libraries make different choices as to how that is structured and stored and displayed. Some of this is showing up as libraries try to go to Worldcat local.
Lorcan Dempsey (OCLC): The problem is not MARC, but the fact that we want to do more sharing, so all of these local options are showing up more as problems. It isn't the technology but the social way that we decide what goes into records: data is often designed for a single application, but now we want to reuse it in different applications. Think of data as something that applications use rather than people.
Q: There are greater expectations for the sophistication of access. How much of that is part of shared bibliographic control and how much is local?
Little: Social tagging can represent the cultural aspects of language – the social spin on things.
The Stakeholders' Perspective
Speaker: Bob Nardini - The Vendor
Group Director, Client Integration and Head Bibliographer, Coutts Information Services
It is good that vendors are included in discussions of bibliographic control. Vendors produce a lot of bibliographic information. Coutts employs catalogers and is providing 280,000 bibliographic records this year. Other vendors are even larger. 63% of libraries obtain records from book vendors (based on a survey).
He spoke of the CIP program as one where vendors contribute data. Publishers produce metadata for their audience, for example publishers are very aware of the metadata needs of Amazon, since that translates to sales. He said that he would like to see more of a use of the metadata record in a marketing role. (I'm not sure what that means for libraries.)
Speaker: Mechael Charbonneau - PCC and Large Research Library
Director of Technical Services and Head, Cataloging Division Indiana University, Bloomington
Cataloging is seen as a high-cost activity; thus the Program for Cooperative Cataloging is a way to save labor. PCC is an international coalition coordinated by the Library of Congress, and a major stakeholder in the bibliographic data future. It relies on voluntary cooperation between libraries. Today, about 35-45% of shared records are being produced outside of the Library of Congress.
She mentioned a need to include non-MARC metadata (but didn't say which ones). She also talked about the need to internationalize authority files, and mentioned the Virtual International Authority File project at OCLC.
Speaker: Linda Beebe - Abstracting and Indexing Services
Senior Director of PsycINFO, American Psychological Association
A&I services create metadata for discovery. There is little emphasis on description in the library cataloging sense. There are particular needs in the different subject areas.
She suggested that we need to look at the "meeting points" of linked systems to see if there is a way we can simplify workflow. [She didn't give any detail, but I have thought that we need to define what our linking elements will be so that we can concentrate on those, and maybe skip non-linking data in some instances.]
One of the problems they are running into is the increase in supplemental audio visual files that need to be linked to the print resource.
She talked about the difference between customers and librarians. Librarians like controlled vocabulary, but users simply want to search on the terms they know. This means that systems need to handle lots of synonyms. We have to discard the notion that it takes special knowledge to find things in the literature. This isn't dumbing down, but making our systems work harder.
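[kc: A toy illustration of what "handling lots of synonyms" might mean in a system, assuming an invented miniature thesaurus – the user's own term is expanded to the controlled and variant forms before searching:]

```python
# Hypothetical miniature thesaurus mapping everyday terms to
# controlled-vocabulary and variant forms.
SYNONYMS = {
    "heart attack": ["myocardial infarction", "cardiac infarction"],
    "movies": ["motion pictures", "films", "cinema"],
}

def expand_query(term: str) -> list[str]:
    """Return the user's own term plus any known variants, so searchers
    need not know the controlled vocabulary."""
    term = term.lower().strip()
    return [term] + SYNONYMS.get(term, [])

print(expand_query("Heart Attack"))
# ['heart attack', 'myocardial infarction', 'cardiac infarction']
```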
Questions and Comments
Q: In the past, vendors have been reluctant to allow their records to be merged with other vendor records because they lost branding. Is this still an issue?
Beebe: This is becoming less of an issue.
Q: What about the different treatments of author names?
Beebe: Searching for author is the most complicated thing. There are author profiles that some are putting together to help this. Social tagging might also help here.
Todd Carpenter (NISO): There is an ISO group working on an international standard name identifier. This is being driven by the publishing community because of their interest in tracking royalties.
A: Crossref is working on author identifiers (and is also looking at institutional identifiers).
Q: Vendors don't use LCSH. Vendors put in more marketing tags and readership levels, plus formats (e.g. textbooks). Maybe this is something that Library of Congress should stop putting in records, but should take from the vendor records.
Speaker: Karen Calhoun - OCLC
Vice-President, OCLC WorldCat and Metadata Services
Response to the Background Paper
She was speaking from the view of OCLC as a stakeholder.
There are 7 economic challenges: productivity, redundancy, value, scale, budgets, demography, collaboration
1. productivity
Fred Kilgour created a dramatic enhancement in the productivity of cataloging.
2. redundancy
OCLC shared cataloging removed duplication of effort; the Internet and web make possible other efficiencies
3. value
We talk about quality, but we all mean different things depending on our point of view. To the bibliographic control expert it means: adherence to rules. To the library decision-maker, quality has to do with stewardship of library funds and budgets, producing value for communities.
4. scale
Users look at and beyond individual library collections when seeking answers to questions. We must not narrow our scope to what we have done in the past.
5. budgets
Budget restrictions not surprising – especially as libraries move into new areas but have the same budgets.
6. demography
The famed "retirement wave" for a generation of bibliographic experts begins in 2010. We will have to change hiring practices.
7. collaboration
These challenges won't be met by libraries working alone.
She then outlined some future potential for OCLC to respond to these challenges.
Metadata is like money – it is a medium of exchange; it points to the value of things.
OCLC might build grid services along the supply chain for creation and augmentation of metadata. The publication supply chain could be an interdependent flow of reusable metadata on the grid.
Where does metadata come from? From bibliographic control experts, publishers, authors, reviewers, readers, selectors. Where could metadata come from? WorldCat is a large unexplored resource, as evidenced by its terminology services and WorldCat Identities. OCLC could run a contract cataloging service. OCLC might help libraries by incrementally moving selected technical services functions to the network – e.g., extending the ILL fee management service into the acquisitions area, creating a kind of PayPal for libraries. This could make libraries less dependent on local systems.
Speaker: Beacher Wiggins - Library of Congress
Director, Acquisitions and Bibliographic Access Library of Congress
The Library of Congress has explored the use of bibliographic data from a number of sources. PCC is the largest and most successful of these operations. They have increased their cataloging output at the same time that their staffing in the cataloging area has been cut. In the current congressional climate, they must do more without any increased funding for staff.
He mentioned the precipitating event of the Library of Congress dropping authority control for series entries as an example of how they have to cut back. They are re-organizing their cataloging staff such that technicians will do all descriptive cataloging and librarians will do authorities and subject analysis. They are shifting costs, not reducing costs.
He also mentioned the problem of not being able to share vendor data. Apparently there was a rather nasty incident between Library of Congress and Casalini Libri over the reuse of Casalini's supplied bibliographic records. (Something that no one talked about was the systems issues: identifying which records you cannot share. That itself must have some overhead.)
Questions and Comments
Q: There are many small libraries that cannot afford to be in OCLC. How can they be included if OCLC expands its services?
Calhoun: We're looking into that.
Q: What is the cost of leadership, such as standards maintenance?
Q: What is being done to increase training/continuing education?
Wiggins: ALCTS, Library of Congress and ALA are organizing continuing education in this area.
PUBLIC TESTIMONY
Speaker: Diane McCutcheon for NLM
Ideas on how to improve cost-effectiveness
NLM does both cataloging and A&I indexing.
She agrees that cataloging is a public good, but that service has costs. The institution has a particular obligation to create cataloging in a cost-effective way. How? Fully utilize descriptive metadata that is available electronically, mainly from publishers and vendors. Basic descriptive data. Eliminate rote keying tasks. NLM uses metadata from journal publishers rather than re-keying, and has realized cost savings. Publishers supply data in a standard format because they want to be in Medline. We need to convince publishers that it is to their advantage to be cited in catalogs.
Getting metadata earlier in the chain: they can't use MARC – they need to use an XML format (ONIX) – but library systems can't handle non-MARC data, so they use crosswalks instead. There is a need for those crosswalks to be available to others.
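[kc: For readers who haven't seen ONIX, a crosswalk at its simplest is just a mapping from one set of element names to another. Here is a toy sketch – the element names follow my reading of ONIX 2.1, the sample data is invented, and a real crosswalk handles hundreds of fields and code lists:]

```python
import xml.etree.ElementTree as ET

# Invented sample in the shape of an ONIX 2.1 product record.
# ProductIDType 15 is the ONIX code for ISBN-13.
ONIX_SAMPLE = """
<Product>
  <ProductIdentifier>
    <ProductIDType>15</ProductIDType>
    <IDValue>9780000000002</IDValue>
  </ProductIdentifier>
  <Title>
    <TitleText>Venice</TitleText>
  </Title>
  <Contributor>
    <PersonName>Mortimer Menpes</PersonName>
  </Contributor>
</Product>
"""

def onix_to_record(xml_text: str) -> dict:
    """Pull the handful of fields a basic catalog record needs."""
    product = ET.fromstring(xml_text)
    return {
        "isbn13": product.findtext(".//IDValue"),
        "title": product.findtext(".//TitleText"),
        "author": product.findtext(".//PersonName"),
    }

print(onix_to_record(ONIX_SAMPLE))
# {'isbn13': '9780000000002', 'title': 'Venice', 'author': 'Mortimer Menpes'}
```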
Making more use of automated data: current cataloging is like hand-crafting furniture or clothing; we need to move into mass production. Some materials may not have electronic data, but we should take advantage of those that do. We need to make more use of machine assistance. Catalogers are often working in subject areas where they aren't expert, so machines can help with subject heading and classification assignment. NLM has been working with an automated system that suggests MeSH terms.
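[kc: NLM's actual system is of course far more sophisticated, but the core idea of machine-suggested subject terms can be sketched in a few lines. The miniature vocabulary here is invented for illustration:]

```python
# Naive MeSH-style term suggester: score each vocabulary entry by how
# many of its trigger words appear in the text. A made-up miniature
# vocabulary stands in for the real MeSH thesaurus.
MINI_MESH = {
    "Myocardial Infarction": {"heart", "attack", "infarction", "cardiac"},
    "Diabetes Mellitus": {"diabetes", "insulin", "glucose"},
    "Asthma": {"asthma", "bronchial", "wheezing"},
}

def suggest_terms(text: str, top_n: int = 2) -> list[str]:
    """Return the best-matching vocabulary terms for a piece of text."""
    words = set(text.lower().split())
    scored = [(len(words & triggers), term)
              for term, triggers in MINI_MESH.items()]
    return [term for score, term in sorted(scored, reverse=True)
            if score > 0][:top_n]

print(suggest_terms("Insulin therapy after cardiac infarction"))
# ['Myocardial Infarction', 'Diabetes Mellitus']
```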
New economic model: libraries create data, then share it via OCLC for the cost of sharing. Libraries and vendors have little incentive to do original cataloging.
Need faster standards development. Can't take 2-5 years.
Speaker: Chris Cole for the National Agricultural Library
NAL also does both library cataloging and A&I indexing. Indexing uses basic metadata supplied by the publisher. This saves a considerable amount of cost, with no loss of quality. Use of publisher data is both possible and necessary. Metadata should be created from data supplied by publishers, with libraries adding value.
NAL contributes to CIP on agriculture related titles. Many libraries use the CIP record because they aren't connected to the network.
The current process isn't economically feasible. We can also get data from the music and sound recording industries. If we can move from transcription to adding value, we can tap those resources. This is especially true for digital files, which cannot be discovered without metadata.
The focus of RDA is unfortunately on traditional materials and traditional procedures. This is not a recommendation to abandon standards but to transform them: do not focus on the record but on a clear set of data elements that can be used by libraries, vendors and others, and reassembled as needed for different uses.
Lorcan Dempsey: the majority of records in Worldcat are not from Library of Congress, but the majority of holdings are on Library of Congress-produced records.
Q: What are the cost and value of creating thesauri and classifications?
NAL: we have a thesaurus, and have found that others want to use it and offer to help (in the sciences)
NLM: authors should identify themselves; publishers aren't in our discussions, and they need to be here.
We put too little value on our work. It costs $130 to catalog a book, but we sell the record for 6 cents. What do we offer to people to make it worth their while to contribute?
Regina Reynolds (LC): one economic model is bartering. Could we barter our data in trade for expertise?
Dan Chudnov: [after some in the audience rejected the idea of "non-expert" social tagging] In social tagging, the user becomes an access point. That is hard to reconcile with privacy, but somehow we have to do that. LibraryThing has social tagging around Library of Congress data. Also, we need the involvement of technology folks in the discussion about bibliographic control.
Speaker from U Penn on subject analysis: the issue is how to make it more efficient – not the creation of the string, but determining the aboutness of the work. There is no way to contribute actual subject headings (in cooperative cataloging) in the same way as with the name authority files. On social tagging: "expert" tagging defeats the purpose. There's a value in letting users decide; people tag for various reasons and have points of view.
Wiggins: We are looking at the pre-coordination of Library of Congress subject headings. Will issue a report looking at simplification.
Speaker from Folger Shakespeare library: How do we know when we have accomplished our goals? What are our evaluation mechanisms?
Speaker from Library of Congress, education office: Most speakers seem to say that a librarian manages, a user finds. The problem is that we don't use our own products.
Library of Congress staff member (who was viewing the meeting on the webcast and came up to say something): We keep talking about incrementally changing how we process bibliographic records so we can create more of them. The Library of Congress and OCLC are metadata repositories. We should think more radically about what kind of metadata repository we want and need. Create a repository for all of the ONIX data that publishers are creating, with a way to use that data, and let libraries download the information they need. Do this rather than an item-by-item workflow.
Karen Calhoun: OCLC is exploring a way to use ONIX data, enrich it and send it back to publishers, and then create MARC records, and let users add enrichment.
WRAP-UP
Summary of the Day
Speaker: Robert Wolven, Working Group Member, Columbia University Library
Themes of the day:
- We have focused on today and the near-term future.
- We've thought mainly about trade monographs and efficiencies there.
- But opportunity costs are about less standard areas of the collection. Economics there are more local and individual, with less opportunity for collaboration. Are we looking at an economic shift to areas where we won't have large economies of scale?
- We look carefully at the individual records we are creating, then we go off and load hundreds of thousands of records in sets.
- There are different approaches to name authorities in cataloging and in A&I databases.
- We need to think about the lifecycle of the resource. We tend to think about the initial process, not later changes. Some lifecycles are short, like popular reading; others are longer term, like making decisions about off-site storage.
- The MARC record is a commodity; we need appropriate distributions of costs. How do we compensate vendors for cataloging, versus the "free riders" in the chain? Do we recoup the costs from those who benefit, or do some bear the costs for all?
- We don't want to pay for the metadata we get, but we talk about getting value for our metadata. This implies retaining control.
- We propose that value is tied to use, yet a lot of our effort goes into metadata that isn't used much. Do we focus on the areas where we have the most sharing, or on the long tail? With the long tail there is less ability to share costs.
- Digital backlogs – we don't have ways to understand what they are and how they are treated. We don't have measures of this.
Final Thoughts
Speaker: Deanna Marcum
What will they say a hundred years from now about the choices we had in 2007? The choices we make at the Library of Congress will make a difference. The Library of Congress has focused on cataloging those materials that will be most used by other institutions. Of 130 million items at the Library of Congress, only 30 million have records in the catalog. Many are set up as mediated collections. Many are unique or rare materials, not sharable like books and journals. Users now expect to get access to these materials.
The Library of Congress is going to identify performance measures that are quantitative (as much as possible). We have to report back to Congress on the benefits and on who has benefited. This is a much more detailed report than the Library of Congress has ever had to produce before.
What do we all have in common? We are the institutions in which society has placed its trust to figure out what should be saved, how it will be saved, and how we will make it available over time.
Speaker: Deanna Marcum Associate Librarian for Library Services Library of Congress
This is the 3rd public session. Comments can still be sent to the committee or via the web site until the end of July.
The question is turning out to cover more than bibliographic control. Instead the broader question is: what is librarianship about in the web world?
When MARC was introduced, libraries were concerned that using MARC would have implications for their own local cataloging, and weren't sure they wanted to use this standard for their own local cataloging. Conforming to the standard meant giving up local practice. But we have gotten many benefits.
In the web world, users have the opportunity to use their own language for searching, and they are being successful. So what contributions can users make, and what will make things more effective for our users?
The theme today is economics and organization. Many librarians believe that cataloging should not be an economic issue. In "this" world, it is not possible for us to ignore the economic implications of cataloging.
The Library of Congress provides cataloging as a service, and that helps other libraries economically. But Library of Congress has no budget line for that service.
Speaker: José-Marie Griffiths, Chair, Working Group, University of North Carolina at Chapel Hill
This is the third of three meetings, each with a different theme.
1. Who uses bibliographic data produced by libraries, and what are the needs of users?
The meeting showed that there is a wide variety of users and uses.
2. Standards and structures
One issue that came out at that meeting is whether the process serves the needs of the community.
3. Economics and Organization
One study that the speaker has conducted was to determine the actual costs of "free" services.
Speaker: Judith Nadler, Working Group Member, University of Chicago Library
Judy described the meetings as being about Who, What, and How. We are now at the How.
Setting the Stage
Rick Lugg, Partner, R2 Consulting
He used to always say that there is no such thing as a bibliographic emergency. However, in the past few years he has found himself working as bibliographic trauma specialist. As consultants, R2 gets called in to see things that aren't working. In the cataloging area, he has seen huge backlogs that are so well-established they have sophisticated inventory systems. With hard copy backlogs you can go into a storage room and see the huge amount of material there. In the digital world you can't see the backlogs. Broken links aren't visible. You don't know what isn't getting done. We don't have a measure of how far behind we are in the digital world.
He said that the cost of bibliographic control is disproportionate to benefits.[kc: It would be great to have a way to measure that, or at least to measure what parts of the bibliographic record produce the greatest benefits.]
The MARC record for a basic monograph is a commodity. It is estimated that the creation of the MARC record is $150-$200. The book is cataloged once, and the cataloging is used many times. Libraries have contained costs by using different levels of staff for copy cataloging. But there are still a lot of duplicative costs in the system.
We have a cult of perfection with the following beliefs:
1 – bibliographic perfection is attainable
2 – cataloging is still about the arrangement of print books on the shelf
Bibliographic Perfection
One of the main barriers to cost savings is the desire to create the perfect record: people change bibliographic records, or at least check all of the details. They change call numbers and use custom Cuttering schemes. Many still write the call number in pencil on the verso of the title page. Some check the reported size with rulers. We focus on the record itself rather than what record is for.
We have a narrow view of quality – we see quality as being about the record, but not about timeliness. (Thus, the backlogs.)
What is good enough? The question should be: does this error impede access?
We need to take advantage work on elsewhere in supply chain.
Shelf arrangement still influences cataloging, but many items are in storage where shelf order doesn't matter. We still create unique call numbers, but duplicate call numbers don't prevent access. We need to think about browsing online, not just on the shelf.
We also need to consider the total cost of bibliographic control. There are the initial costs, but we also need to consider full lifecycle cost. Records are changed at various points, for example as we move items offsite, or move a book out of reference. Most of these changes are done manually. In serials, as we move from print to electronic and end or modify print subscriptions, records have to be updated. Much of this is inventory control, but still means record changes.
There are opportunity costs: What are we not doing that we should be doing? Answer: special collections cataloging, cataloging unique materials, and rare books, manuscripts and archives.
Another opportunity cost: we have no capacity for non-MARC metadata – no one has time to learn MODS, METS, DC. Cost in delay in moving in new directions.
We are involved in mass digitization, but we haven't started working on discovery of full text.
Catalogers are not involved in systems development early on, which affects how systems are developed.
How can we collaborate with others (not just other libraries) to create a richer bibliographic record?
Q: I asked: To what extent is complexity of MARC an issue? His answer was rather vague, so I think he hadn't really thought about this in detail. It would be interesting to know how much time is spent on things like fixed fields, or figuring out subfielding. It would also be interesting to do more experimentation with interfaces. Later speakers brought up the idea of using systems better to help catalogers work faster.
Speaker: Lizanne Payne - Library Consortium
Executive Director Washington Research Library Consortium
Lizanne Payne talked out how consortia can affect costs. Their main role is often providing joint licensing of digital materials, but they are also involved in ERMs and ILL workflow. The usually share a common OPAC to facilitate borrowing, and sometimes have a common ILS. This latter allows them to share the cost of IT staff for systems by centralizing systems. If don't share an ILS, then you have duplication between local catalogs and union catalog. You need 3 levels of bibliographic control: 1 – master record 2 – individual library records (eg for special subject control) 3 – holdings, shelving, etc.
Where libraries share a storage facility, searching for duplicates before sending to storage is very expensive.
[This talk brought up some interesting thoughts about duplication – of materials and of catalog records. Duplication keeps coming up for me in various projects I am working on, and it seems to have cost implications at a lot of levels, especially those areas where duplication in the user view is not desirable, but duplication that exists in the real world also serves users where access is concerned.]
I also learned from Payne's talk that MFHD is pronounced "muffhead."
Speaker: Mary Catherine Little - Public Library
Director, Technical Services Department Queens Borough Public Library
Little gave some good arguments for matching your cataloging to your actual need. She manages a huge and active public library with 65 different languages represented in the collection. She doesn't have the ability to produce cataloging in all of those languages so she relies on vendor-supplied copy and doesn't augment it. Her bottom line is to know what the library owns and give users access to it. She asks herself: am I creating data I'm not likely to use? Am I creating enough data for the ILS to function today? Tomorrow?
And, would this item be replaced if lost? (Many of her books are popular reading that are used for a few years then discarded when the item is worn out.) She even has some un-cataloged collections that are accessed at the shelf only. But fewer users today are in the library. [Note: there were various mentions that digital materials require more and better metadata, but no one really connected this to that fact that our collections are increasingly digital.]
She called for more sharing of vendor data – which of course means a change on the part of vendors.
Speaker: Susan Fifer Canby - Special Library
Vice-President, Library and Information Services National Geographic Society
The Special Library case was quite different from either public or academic libraries.
Some special libraries hold proprietary data that cannot be shared. They are focused on service to their organizations and often have considerable collections of archival and organizational records. They may have responsibility for all or part of the organization's web site. They may also use their collection for e-commerce, as is the case with the National Geographic Society's photo archives.
On the other hand, an organization can require that internal data providers attach certain metadata (like subject headings) to items they store.
The special library is not seen as a general good by the organization. It is a cost center, therefore has to produce value. Bibliographic control is not a major activity for them.
Questions and comments
Q: There seems to be a distinction being made between bibliographic control v. inventory control
Lugg: That starts way back in the chain. For vendors it's about inventory and sales. In systems, the overhead of using MARC as a transaction vehicle is too much, so the transaction areas of systems tend to keep less data and match it up to MARC when needed. However, libraries often see transaction data as part of MARC record (because they display together.). There are different needs within the system, and the MARC record shouldn't change when items circulate.
Q: The committee has done some thinking about atomizing MARC record, removing some complexity and creating different structures for the different functions
Payne: MARC was designed for transmittal, not for daily use. And there's no standardization for how it is broken apart and used in our systems, which makes system upgrades difficult. There are lots of areas of our systems that we haven't standardized.
Lugg: This really shows up in the holdings area. Libraries make different choices as to how that is structured and stored and displayed. Some of this is showing up as libraries try to go to Worldcat local.
Lorcan Dempsey (OCLC): The problem is not MARC, but the fact that we want to do more sharing, so all of these local options are showing up more as problems. It isn't the technology but the social way that we decide what goes into records (often designed for a single application but now want to reuse it for a different application.) Think of data as something that applications use rather than people.
Q: There are greater expectations for the sophistication of access. How much of that is part of shared bibliographic control and how much is local?
Little: Social tagging can represent the cultural aspects of language – the social spin on things.
The Stakeholders' Perspective
Speaker: Bob Nardini - The Vendor
Group Director, Client Integration and Head Bibliographer, Coutts Information Services
It is good that vendors are included in discussions of bibliographic control. Vendors produce a lot of bibliographic information. Coutts employs catalogers and is providing 280,000 bibliographic records this year. Other vendors are even larger. 63% of libraries obtain records from book vendors (based on a survey).
He spoke of the CIP program as one where vendors contribute data. Publishers produce metadata for their audience, for example publishers are very aware of the metadata needs of Amazon, since that translates to sales. He said that he would like to see more of a use of the metadata record in a marketing role. (I'm not sure what that means for libraries.)
Speaker: Mechael Charbonneau - PCC and Large Research Library
Director of Technical Services and Head, Cataloging Division Indiana University, Bloomington
Cataloging is seen as high cost activity, thus the Program for Cooperative Cataloging is a way to save labor. PCC is an international coalition coordinated by the Library of Congress, and a major stakeholder in the bibliographic data future. It relies on voluntary cooperation between libraries. Today, about 35-45% of shared records are being produced outside of the Library of Congress.
She mentioned a need to include non-MARC metadata (but didn't say which ones). She also talked about the need to internationalize authority files, and mentioned the Virtual International Authority File project at OCLC.
Speaker: Linda Beebe - Abstracting and Indexing Services
Senior Director of PsycINFO, American Psychological Association
A&I services create metadata for discovery. There is little emphasis on description in the library cataloging sense. There are particular needs in the different subject areas.
She suggested that we need to look at the "meeting points" of linked systems to see if there is a way we can simplify workflow. [She didn't give any detail, but I have thought that we need to define what our linking elements will be so that we can concentrate on those, and maybe skip non-linking data in some instances.]
One of the problems they are running into is the increase in supplemental audio visual files that need to be linked to the print resource.
She talked about the difference between customers and librarians. Librarians like controlled vocabulary, but users simply want to search on the terms they know. This means that systems need to handle lots of synonyms. We have to discard the notion that it takes special knowledge to find things in the literature. This isn't dumbing down, but making our systems work harder.
Questions and Comments
Q: In the past, vendors have been reluctant to allow their records to be merged with other vendor records because they lost branding. Is this still an issue?
Beebe: This is becoming less of an issue.
Q: What about the different treatments of author names?
Beebe: Searching for author is the most complicated thing. There are author profiles that some are putting together to help this. Social tagging might also help here.
Todd Carpenter (NISO): There is an ISO group is working on an international standard name identifier. This is being driven by the publishing community because of their interest in tracking royalties.
A: Crossref is working at author identifiers (also looking at institutional identifiers)
Q: Vendors don't use LCSH. Vendors put in more marketing tags and readership levels, plus formats (e.g. textbooks). Maybe this is something that Library of Congress should stop putting in records, but should take from the vendor records.
Speaker: Karen Calhoun - OCLC
Vice-President, OCLC WorldCat and Metadata Services
Response to the Background Paper
She was speaking from the view of OCLC as a stakeholder.
There are 7 economic challenges: productivity, redundancy, value, scale, budgets, demography, collaboration
1 - productivity
Fred Kilgore created a dramatic enhancement in productivity of cataloging
2. redundancy
OCLC shared cataloging removed duplication of effort; the Internet and web make possible other efficiencies
3.value
We talk about quality, but we all mean different things depending on our point of view. To the bibliographic control expert it means: adherence to rules. To the library decision-maker, quality has to do with stewardship of library funds and budgets, producing value for communities.
4. scale
Users look at and beyond individual library collections when seeking answers to questions. We must not narrow our scope to what we have done in the past.
5. budgets
Budget restrictions not surprising – especially as libraries move into new areas but have the same budgets.
6. demography
The famed "retirement wave" for generation of bibliographic experts begins in 2010. We will have to change hiring practices.
7. collaboration
These challenges won't be met by libraries working alone.
She then outlined some future potential for OCLC to respond to these challenges.
Metadata is like money – it is a medium of exchange; it points to the value of things.
OCLC might build grid services along the supply chain for creation and augmentation of metadata. The publication supply chain could be an interdependent flow of reusable metadata on the grid.
Where does metadata come from? From bibliographic control experts; publishers, authors, reviewers, readers, selectors. Where could metadata come from? Worldcat is a large unexplored resource, as evidenced by its terminology services and Worldcat Identities. OCLC could run a contract cataloging service. OCLC might help libraries by incrementally moving selected technical services functions to the network. E.g. build on the ILL fee management service into the acquisitions area, creating a kind of Pay Pal for libraries. This could make libraries less dependent on local systems.
Speaker: Beacher Wiggins - Library of Congress
Director, Acquisitions and Bibliographic Access Library of Congress
The Library of Congress has explored the use of bibliographic data from number of sources. PCC is the largest and most successful of these operations. They have increased their cataloging output at the same time that their staffing in the cataloging area has been cut. In the current congressional climate, they must do more without any increased funding for staff.
He mentioned the precipitating event of the Library of Congress dropping authority control for series entries as an example of how they have to cut back. They are re-organizing their cataloging staff such that technicians will do all descriptive cataloging and librarians will do authorities and subject analysis. They are shifting costs, not reducing costs.
He also mentioned the problem of not being able to share vendor data. Apparently there was a rather nasty incident between Library of Congress and Casalini Libri over the reuse of Casalini's supplied bibliographic records. (Something that no one talked about was the systems issues: identifying which records you cannot share. That itself must have some overhead.)
Questions and Comments
Q: There are many small libraries that cannot afford to be in OCLC. How can they be included if OCLC expands its services?
Calhoun: We're looking into that.
Q: What is the cost of leadership, such as standards maintenance?
Q: What is being done to increase training/continuing education?
Wiggins: ALCTS, Library of Congress and ALA are organizing continuing education in this area.
PUBLIC TESTIMONY
Speaker: Diane McCutcheon for NLM
Ideas on how to improve cost-effectiveness
NLM does both cataloging and A&I indexing.
She agrees that cataloging is a public good, but that service has costs. Institution has a particular obligation to create cataloging in cost-effective way. How? Fully utilize descriptive metadata that is available electronically, mainly from publishers and vendors. Basic descriptive data. Eliminate rote keying tasks. NLM uses metadata from journal publishers rather than re-keying – have realized cost savings. Publishers supply date in a standard format because they want to be in Medline. Need to convince publishers that it is to their advantage to be cited in catalogs.
Getting metadata earlier in the chain. Can't use MARC – need to use an xml format (ONIX) – but library systems can't handle non-MARC data. Use crosswalks instead. There is a need for those crosswalks to be available to others.
Making more use of automated data. Current cataloging is like hand crafting furniture or clothing. Need to move into mass production. Some materials may not have electronic data, but we should take advantage for those that do. Need to make more use of more machine assistance. Catalogers are often working in subject areas where they aren't expert, so machines can help with subject heading and classification assignment. They've been working with an automated system that suggests MeSH terms.
New economic model : libraries create data, then share for the cost of sharing it via OCLC. Libraries and vendors have little incentive to do original cataloging.
Need faster standards development. Can't take 2-5 years.
Speaker: Chris Cole for the National Agricultural Library
NAL also does both library and A&I publisher. Indexing uses basic metadata supplied by the publisher. This saves a considerable amount of cost. No suffering of quality. Use of publisher data both possible and necessary. Metadata should be created from data supplied by publishers, with libraries adding value.
NAL contributes to CIP on agriculture related titles. Many libraries use the CIP record because they aren't connected to the network.
Current process isn't economically feasible. We can also get data from music and sound recording industry. If we can move from transcription to adding values, we can tap those resources. This is especially true for digital files, which cannot be discovered without metadata.
Focus of RDA is on traditional materials and traditional procedures, unfortunately. RDA is not recommending an abandonment of standards but a transformation. Do not focus on the record but on clear set of data elements that can be used by libraries, vendors and others that can be reassembled as needed for different uses.
Lorcan Dempsey: the majority of records in Worldcat are not from Library of Congress, but the majority of holdings are on Library of Congress-produced records.
Q: cost and value of creation of thesauri and classification
NAL: we have a thesaurus, and have found that others want to use it and offer to help (in the sciences)
NLM: authors should identify themselves; publishers aren't in our discussions about and they need to be here.
We put too little value on our work. It costs $130 to catalog book but we sell it for 6 cents. What do we offer to people to make it worth their while to contribute?
Regina Reynolds (LC): one economic model is bartering. Could we barter our data in trade for expertise?
Dan Chudnov: [after some in audience rejected the idea of "non-expert" social tagging] The user, in social tagging becomes an access point. Hard to reconcile with privacy, but somehow we have to do that. LibraryThing has social tagging around Library of Congress data. Also, we need an involvement of technology folks in the discussion about bibliographic control.
Speaker from U Penn on subject analysis: the issue is how to make it more efficient: not the creation of the string, but the aboutness of the work. There is no way to contribute actual subject headings (in cooperative cataloging) in the same way as with the name authority files. On social tagging: "expert" tagging defeats the purpose. There's value in letting users decide; people tag for various reasons and have points of view.
Wiggins: We are looking at the pre-coordination of Library of Congress subject headings. Will issue a report looking at simplification.
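For readers outside the cataloging world, a quick illustration of pre-coordination: a single LCSH string carries, in fixed order, what a post-coordinated system would store as independent facets. The heading below follows a real LCSH pattern; splitting the string is trivial, while deciding which pieces can stand alone is what makes simplification hard.

```python
# A pre-coordinated LCSH-style heading and its post-coordinated facets.
heading = "France--History--Revolution, 1789-1799"
facets = heading.split("--")
print(facets)  # ['France', 'History', 'Revolution, 1789-1799']
```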
Speaker from Folger Shakespeare library: How do we know when we have accomplished our goals? What are our evaluation mechanisms?
Speaker from Library of Congress, education office: Most speakers seem to say that a librarian manages, a user finds. The problem is that we don't use our own products.
Library of Congress staff member: was viewing the meeting on the webcast and came up to say something. We keep talking about incrementally changing how we process bibliographic records so we can create more of them. Library of Congress and OCLC are metadata repositories. We should think more radically about what kind of metadata repository we want and need: create a repository for all of the ONIX data that publishers are creating, with a way to use that data, and let libraries download the information they need. Do this rather than item-by-item work flow.
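A purely hypothetical sketch of what the library-facing side of such a repository might look like; the service URL, path scheme, and helper name below are all invented for illustration.

```python
# Hypothetical client for a shared ONIX repository: fetch the
# publisher-supplied record for one ISBN instead of re-keying it.
# The endpoint is invented; no such service is implied to exist.
from urllib.request import urlopen

REPO = "https://onix-repository.example.org"  # invented endpoint

def fetch_onix(isbn):
    """Download the ONIX record for a single ISBN as raw XML."""
    with urlopen(f"{REPO}/products/{isbn}.xml") as response:
        return response.read().decode("utf-8")
```

A library could then run the result through a crosswalk, like the one sketched earlier, to get MARC for its local system.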
Karen Calhoun: OCLC is exploring a way to use ONIX data, enrich it and send it back to publishers, and then create MARC records, and let users add enrichment.
WRAP-UP
Summary of the Day
Speaker: Robert Wolven, Working Group Member, Columbia University Library
Themes of the day:
- We have focused on today and the near-term future.
- We've thought mainly about trade monographs and efficiencies there.
- But the opportunity costs are about the less standard areas of collections. Economics there are more local and individual, with less opportunity for collaboration. Are we looking at an economic shift toward areas where we won't have large-scale economies?
- We look carefully at individual records we are creating, then we go off and load hundreds of thousands of records in sets.
- Different approaches to name authorities in cataloging and in A&I databases.
- We need to think about the lifecycle of a resource. We tend to think about the initial process, not later changes. Some lifecycles are short, like public reading; others are longer term, like making decisions about off-site storage.
- The MARC record is a commodity; we need appropriate distributions of costs. How to compensate vendors for cataloging, versus "free riders" in the chain. Do we recoup the costs from those who benefit, or do some bear the costs?
- We don't want to pay for the metadata we get, yet we talk about getting value for our own metadata. This implies retaining control.
- We propose that metadata's value is tied to its use, yet a lot of our effort goes into metadata that isn't used much. Do we focus on the areas where we have the most sharing, or on the long tail? With the long tail there is less ability to share costs.
- Digital backlogs: we don't have ways to understand what they are and how they are treated. We don't have measures of this.
Final Thoughts
Speaker: Deanna Marcum
What will they say a hundred years from now about the choices we had in 2007? The choices we make at the Library of Congress will make a difference. The Library of Congress has focused on cataloging those materials that will be most used by other institutions. Of the 130 million items at the Library of Congress, only 30 million have records in the catalog. Many are set up as mediated collections. Many are unique or rare materials, not sharable like books and journals. Users now expect to get access to these materials.
The Library of Congress is going to identify performance measures that are quantitative (as much as possible). We have to report back to Congress on the benefits and on who has benefited. This is a much more detailed report than the Library of Congress has ever had to produce before.
What do we all have in common? We are the institutions in which society has placed its trust to figure out what should be saved, how it will be saved, and how we will make it available over time.
Tuesday, July 10, 2007
Meeting 3, Briefly
This third meeting of the Library of Congress Working Group on the Future of Bibliographic Control focused on economic and organizational issues. Unsurprisingly, most of the conversation was about the high cost of cataloging. The suggested solutions fell into these areas:
- Be willing to accept imperfection. In particular, take in copy cataloging without scrutiny, or with very select scrutiny, only paying attention to areas that are important for retrieval.
- Put less energy into the easy cases (recently published books) so that more can be spent on the special cases (archival materials, digital materials).
- Use vendor-supplied cataloging that comes with purchase of materials.
- Find new partners. Many speakers suggested that libraries partner with publishers to get some metadata creation earlier in the supply chain.
- Do more cooperative cataloging; spread the work across more institutions.
Interestingly, no one was willing to give up authority control. In fact, there was a desire to expand it into other areas, such as article databases, although in the journal publishing area there was some thought that authors could self-identify as part of the publishing process. There was a negative reaction to the idea of "social tagging." And there wasn't much discussion of the possible use of full text either to generate cataloging or to function in place of cataloging.
In the end, none of the proposed solutions really appear to solve much of the problem. To begin with, any savings will merely free up staff to work on cataloging items that make up the real or virtual arrears: paper archives or the vast digital world that libraries hardly touch today. Many of the suggestions are details that would chip away at cataloging time but not really change how we do things. Nothing really radical or revolutionary was put forth.
I'll try to put out more detailed notes soon.
Monday, July 09, 2007
LoC Meeting -- Live Webcast
Today's meeting of the Library of Congress Working Group on the Future of Bibliographic Control is being webcast live. However, if you prefer time-shifting, the webcast will be available online for "any time" listening later this week. See the agenda.
Thursday, July 05, 2007
Bibliographic Futures, Meeting 3
I'll be attending the third meeting on bibliographic futures being held at Library of Congress on July 9. If I have connectivity I'll blog it in real time; otherwise, I'll give a summary later when I get connected.
Meanwhile, for purely selfish reasons, I consolidated my notes from Meeting 1 and have placed them on my web site. They aren't yet linked from the home page, but I'll get to that soon.