Missing Records Team Preliminary Report May 7, 2007

Sara Shatford Layne (team leader, UCLA), Vicki Grahame (UCI), Lee Leighton (UCB), Lisa Spagnolo (UCD), Amy Weiss (UCSB), John Riemer (Implementation Team Liaison, UCLA)

Part A. Categories of records missing from OCLC

This report lists categories of records that are present in our local catalogs but missing from OCLC, and the issues associated with these records.

I. On-order records.

a. Downloaded or exported from OCLC; contain an OCLC record number
b. Brief records supplied by a vendor that lack an OCLC record number (for example, from GOBI or Collection Manager)
c. Brief records created locally and based on minimal information; sometimes these records do not describe true bibliographic entities, but rather an assortment of related materials from a particular publisher or vendor. Example from UCLA (entire bibliographic record):
           Kokudo Chiriin.
           Topographic maps of Japan; various scales.
           Tokyo : Geographical Survey Institute
d. “Wants”—for out-of-print items that may never be acquired.
e. Materials that may never be received (materials ordered from foreign countries are likely to fall into this category).

Issues associated with these records: potential for ILL requests for items we do not yet own or may never own; potential for creating bibliographic ‘ghosts’, that is, records for something that never existed (b, c, and e); potential for vendors not to want to have their records contributed to WorldCat (b).

Recommendation from the Missing Records Team: Consider omitting categories b, c, d, and e from the records contributed to WorldCat. Develop strategies for omitting these records from database extracts. If it is decided that some or all of these categories should be contributed to WorldCat, there needs to be a relatively simple mechanism for removing the records themselves (not just our holdings, although that also is important) from WorldCat in order not to create permanent bibliographic ghosts.

II. In-process records. (Records for items that have been received but not yet cataloged)

Issues associated with these records: potential for ILL requests for items that are not yet ready to be loaned; potential that what was ordered was not what was received, but this fact will not become apparent until the item is cataloged; potential for significant changes in the record once cataloging has occurred.

III. Circ-on-the-fly records. (Records created by circ staff when they cannot find a bibliographic record in the catalog for the item that is being circulated.)

Issues associated with these records: these records are very brief, and generally include local codes to identify them as Circ-on-the-fly records. These records are not meant for discovery; they are created for inventory control. Example from UCLA (this is the entire bibliographic record):
The Theater COTF YRL1

Recommendation from the Missing Records Team: Omit Circ-on-the-fly records from the records contributed to WorldCat. Develop strategies for omitting these records from database extracts if they are not already suppressed.

IV. Temporary records created for items owned or created by faculty and placed on reserve for a class.

Issues associated with these records: the materials are in the library’s custody for a short period of time and are not owned by the library; the records are often extremely brief and non-unique. Example from UCLA (this is the entire bibliographic record):
Lecture Notes :

Recommendation from the Missing Records Team: Omit temporary records from the records contributed to WorldCat. (Comment from John Riemer: This raises an issue for the Implementation Team to consider: will class reserves and e-reserves require a separate silo?)

V. Suppressed records. Examples of suppressed records include withdrawn items, “pay” records, and “in review” serials.

Recommendation from the Missing Records Team: Omit suppressed records from the records contributed to WorldCat; if we don’t want them displaying in our local OPAC, we certainly don’t want them displaying in WorldCat. These records are for inventory control rather than for discovery.
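The omission recommendations in sections I and III through V could be applied as a filter when the database extract is built. The following is a minimal sketch in Python, not a description of any campus's actual extract process. It assumes each local record carries a `category` label; that field and every label name are hypothetical, since the report does not say how these records are coded in each ILS.

```python
# Hypothetical category labels; the report does not define a field for them.
OMIT_FROM_EXTRACT = {
    "on_order_vendor_brief",      # section I, category b
    "on_order_local_brief",       # section I, category c
    "on_order_want",              # section I, category d
    "on_order_may_never_arrive",  # section I, category e
    "circ_on_the_fly",            # section III
    "temporary_reserve",          # section IV
    "suppressed",                 # section V
}


def records_to_contribute(records):
    """Return the records whose category is not on the omit list."""
    return [r for r in records if r.get("category") not in OMIT_FROM_EXTRACT]


sample = [
    {"id": "1", "category": "cataloged"},
    {"id": "2", "category": "circ_on_the_fly"},
    {"id": "3", "category": "on_order_from_oclc"},  # category I.a is not on the omit list
    {"id": "4", "category": "suppressed"},
]
kept = records_to_contribute(sample)
print([r["id"] for r in kept])
```

Keeping the omit list in one place would make it simple to move a category in or out if the decision on contributing it changes.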

VI. SCP records for electronic serials.
These records almost always have an OCLC number in them.

Issues associated with these records: we may need to load them separately from other records so that OCLC can add the appropriate UC symbols to each record. May be best to do this centrally rather than to include SCP records in individual campus reclamation projects; note that many campuses will not own the print version that the OCLC record represents.

VII. SCP records for monographs.
These records may have an OCLC number in them; they may have a “cloned” OCLC number in them (the OCLC number for the print followed by “eo”); they may have no OCLC number in them.

Issues associated with these records: some record sets were purchased from vendors other than OCLC, and negotiation will be needed in order to add them to WorldCat; we may need to load them separately from other records so that OCLC can work out, if necessary, different matching algorithms for the records that lack OCLC numbers, and can also add the appropriate UC symbols to each record. It may be best to do this centrally rather than to include SCP records in individual campus reclamation projects.
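Before a separate load, the three cases described for SCP monograph records (a real OCLC number, a cloned number ending in “eo”, or no number) could be sorted mechanically. The sketch below is in Python and assumes the number has already been pulled out of the record as a plain string with any prefix removed; the function name and the "unrecognized" bucket are illustrative, not part of any existing tool.

```python
import re


def classify_oclc_number(value):
    """Classify an extracted OCLC number string.

    Returns 'real', 'cloned' (print number followed by 'eo'),
    'none' (empty or missing), or 'unrecognized'.
    """
    if value is None or not value.strip():
        return "none"
    text = value.strip().lower()
    if re.fullmatch(r"\d+eo", text):
        return "cloned"
    if re.fullmatch(r"\d+", text):
        return "real"
    return "unrecognized"


for candidate in ["12345678", "12345678eo", "", None]:
    print(repr(candidate), "->", classify_oclc_number(candidate))
```

Records in the "cloned" and "none" groups are the ones for which a different matching approach would need to be worked out with OCLC.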

VIII. Catalog records purchased from vendors other than OCLC. Examples: ProQuest records for dissertations; Marcive records; Casalini records. Some of these records may have OCLC numbers in them, some may not.

Issues associated with these records: negotiation with vendors on our behalf, with the help of OCLC, may be necessary. One possibility is that we set our holdings when there is a match but do not load as new any records that do not match an existing OCLC record.

IX. Brief (very brief) serial records consisting of a title and nothing else.
UCB has 85,000 of these; UCSB has 10,000 (approximately 50% of UCSB’s records have an ISSN as well as a title). UCD, UCI, and UCSD have fewer than 1,000 each at this time and are working on them.

Issues associated with these records: Machine matching is close to impossible; human matching requires consulting holdings records in addition to the title to identify a match. OCLC does take into consideration in its matching algorithm whether a record is “sparse”, in order to avoid false matches.

X. Records for local government documents.

Issues associated with these records: they may be quite brief, and pose problems for matching against existing records in WorldCat.

XI. Records for local collections or quasi-bibliographic entities.

Issues associated with these records: they may be quite brief, and pose problems for matching against existing records in WorldCat.

XII. Records for rare books.

Issues associated with these records: OCLC’s criteria for when to create a new record are different from those of the rare book community, which has led to the creation of local records for variants that do not meet OCLC’s criteria, but that are significant to users of rare books. Explore with OCLC the possibility of using the Institutional record approach from RLIN in order to preserve local records for some categories of materials.

XIII. Records for videorecordings input locally.

Issues associated with these records: may there be licensing issues with putting all of these records into WorldCat? Were standard cataloging practices followed in the creation of these records?

XIV. Records for unique datasets input locally.

Issues associated with these records: Were standard cataloging practices followed? These datasets may not be available for use to anyone not affiliated with the local UC. Is it appropriate to add these records to WorldCat?

XV. Film records from the Film and Television Archive (FATA) at UCLA.

Issues associated with these records are being worked on by OCLC and FATA staff; it seems likely that FATA will have an OCLC code separate from UCLA’s since their local ILS is separate from that of the UCLA Library.

XVI. Records with an OCLC number but that are actually missing from OCLC (that is, the OCLC number is incorrectly used in the local database).

This can occur because an OCLC record was used as a template for a set of different records or because a cataloger, rather than create a new record for a reproduction or slightly different edition, used an existing OCLC record and edited it locally.

Recommendation: try to identify these records and correct them before doing a reclamation project with OCLC.

XVII. Records for slides and individual images.

UCSD has approximately 240,000 records for slides.

Issues associated with these records: should they be added to WorldCat? Or is it more appropriate for these records to be in a different database such as ARTstor?

Part B. Questions and Related Issues

The Missing Records Team identified, in the course of its discussions, a set of questions for OCLC and a list of related issues that will need to be addressed, although perhaps not by the Missing Records Team.

I. Questions for OCLC.

John Riemer talked with OCLC concerning the questions we had for them, and the notes of that conversation are in Appendix B of this report.

II. Related Issues.

a. Which OCLC holdings symbol(s) should be used for items at the RLFs? Should it be the holdings symbol for the owning library? Or should it be a holdings symbol for the RLF? Or should it be both? There are implications for ILL, for links to local systems, and for record maintenance.

b. Shift of workflow to make corrections in WorldCat.
Will we be looking to WorldCat to reflect more dynamic activity? If so, it will be important to make updates in WorldCat, not just in our local catalog. For example,

What will be required if a PromptCat approval book is returned?
How will updating be done for a book that is on order and then received?
Would our approach to PromptCat change? Would we want our holdings set in WorldCat in a different time frame than currently occurs? A common preference is 21 days after the vendor notifies OCLC that the books are being shipped to the library.

c. Will we try to resolve the problem of varying treatments at different campuses making it appear that a campus doesn’t own something that it does own? For example,

If a monographic series is treated as an unanalyzed serial at one campus, as an analyzed serial at a second campus, and cataloged separately at a third, the first campus may appear not to own the monographs, while the third campus may appear not to own the serial.
Serial vs monograph treatment for conferences and annuals
Successive vs latest entry cataloging of serials.
Single vs separate records for microform and print versions

d. Reclamation projects. We need an inventory of these as part of our strategy for getting holdings into OCLC; maybe we need this before the pilot starts rather than during the pilot? Especially if it seems like a good idea to coordinate the ‘reclamation’ of SCP records.

e. The work of this team is focused on missing records. However, there is also a problem of missing data. For example, UCSD has in its records tables of contents that it is not permitted to contribute to Melvyl, so it seems that it may be difficult to contribute that data to WorldCat. As another example, UCSD has informative summary notes in local records for videorecordings; it would be desirable to have those summaries preserved if/when the local records are matched to existing WorldCat records.

f. Desirable scope of Melvyl replacement. Should WorldCat, and by extension WorldCat-as-Melvyl-Replacement, include records for e-resources that are licensed to just one campus and cannot be shared via interlibrary loan? For example, e-books?

Appendix A.

UC Data in the OCLC Database (chart prepared by CAMCIG)

Campus	Total Database Size	Records NOT in OCLC	Categories of records not in OCLC
Berkeley	6,170,836 (100%)	1,550,000 (25%)	CJK records; MX format records; Computer file format records; GLADIS record level other than F, R, or B; Low-level order records, circulation-created records, NRLF records, temp cat pool records, some SCP records
Davis	2,365,861 (100%)	1,114,698 (47%)	RLIN records: 608,067; REMARC (Carrollton Press): 262,254; Early Amer Imprints: 64,543; GPO/Marcive: 44,975; SCP: 58,878; Other: 75,981
Irvine	1,879,407 (100%)	430,871 (23%)	CIS records: 87,236; Batch loaded records from 1990: 89,567; SCP records: 84,453; Marcive records for e-resources (Documents without Shelves service): 62,513; Older Marcive records for print California documents
Los Angeles	5,057,218 (100%)	662,614 (13%)	Monograph records (and a few serial records) keyed directly into the database; SCP records for major monograph sets; ISSR (Institute for Social Science Research) records for datasets
Merced	234,477 (100%)	185,000 (79%)	Documents Without Shelves: 68,480; Vendors (ebrary, netLibrary, MyiLibrary, xRefer, etc.): 44,000; SCP: 72,520 (soon to more than double with ECCO)
Riverside	1,866,816 (100%)	768,032 (41%)	SCP records; Vendor records; Hand-keyed records
San Diego	2,517,728 (100%)	601,242 (24%)	Slide records: 242,911; ECCO: 130,000; EEBO: 96,000; ICPSR: 6,200*; Early Amer Imprints: 36,000*; LION: 14,000; Carrollton Press: 19,000; Other: 58,00
San Francisco	331,940 (100%)	64,091 (19.3%)	2,020 serial & serial-analytic titles; 62,071 monographs
Santa Barbara	2,678,421 (100%)	2,207,693 (82%)	RLIN records (1.4 million; 54%); Marcive (262,000; 10%); GPO (227,000; 8%); Early Amer Imprints (36,000; 1%); Congress Hearings (33,000; 1%); Other (177,000; 6%), includes SCP, local originals, etc.
Santa Cruz	1,322,267 (100%)	545,844 (41%)	SCP records (about 17,392 eserials and 72,764 emonographs); Marcive; Ebook vendors (ebrary, netlibrary, ABC-Clio, XRefer, etc.) (approx. 40,000); In-house creation from publisher md (Lexis-Nexis) (6,000 eserials); Other older records

* These records have contractual agreements that do not allow us to upload to OCLC.
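The percentages in the chart are the records not in OCLC divided by the total database size for each campus. As a check on the arithmetic, the short Python sketch below recomputes them from the figures in the chart; the variable and function names are illustrative only.

```python
# (total database size, records not in OCLC), copied from the chart above
table = {
    "Berkeley": (6_170_836, 1_550_000),
    "Davis": (2_365_861, 1_114_698),
    "Irvine": (1_879_407, 430_871),
    "Los Angeles": (5_057_218, 662_614),
    "Merced": (234_477, 185_000),
    "Riverside": (1_866_816, 768_032),
    "San Diego": (2_517_728, 601_242),
    "San Francisco": (331_940, 64_091),
    "Santa Barbara": (2_678_421, 2_207_693),
    "Santa Cruz": (1_322_267, 545_844),
}


def percent_missing(total, missing):
    """Share of a campus database that is not in OCLC, as a percentage."""
    return 100 * missing / total


for campus, (total, missing) in table.items():
    print(f"{campus}: {percent_missing(total, missing):.1f}%")
```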

Appendix B

Questions Pulled from 4/26/07 Missing Records Team conference call notes
And asked of Renee Register and Joanne Gullo
by John Riemer & Sara Layne (recorder)
May 4, 2007 12:00-12:45 PDT

(1) How reclamation projects work. It seems there are two possible approaches. Are both of the following possible?
A) Removing all of the holdings symbols for member XYZ from WorldCat. Resetting all of XYZ’s holdings based on a wholesale extract of the records from a local file and submitting them to OCLC. This strategy would be needed for libraries that have not removed holdings from WorldCat as materials have been withdrawn from collections.
In OCLC terms “reclamation” involves both removal of outdated holdings info and adding holdings to new records. Projects confined to the latter are “retrospective loads.”
Reclamation includes scan-delete; no longer wipes out everything… uses date stamp on existing holdings symbols; records are processed to add holdings with a newer date stamp; then a scan-delete of old holdings is done—so no gap file needed. Maybe a couple of weeks to get everything through the queue (for a file of 5 million records)—maybe a month; setting up takes up to 90 days.
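The date-stamp approach in these notes can be pictured as two passes. The Python sketch below is only an illustration of that idea under stated assumptions: holdings are modeled as a dictionary from a record id to the symbols set on it, each with a date stamp. The data structure, function name, and sample ids are hypothetical and do not describe OCLC's actual system.

```python
from datetime import date


def reclaim(holdings, extract_ids, symbol, today):
    """Re-stamp holdings for every record in the extract, then scan-delete.

    holdings: dict mapping record id -> dict mapping symbol -> date stamp.
    """
    # Pass 1: set (or re-stamp) the symbol on every record in the extract.
    for rec_id in extract_ids:
        holdings.setdefault(rec_id, {})[symbol] = today
    # Pass 2: scan-delete this symbol wherever it still has an older stamp.
    for symbols in holdings.values():
        stamp = symbols.get(symbol)
        if stamp is not None and stamp < today:
            del symbols[symbol]
    return holdings


holdings = {
    "rec1": {"XYZ": date(2001, 1, 1), "ABC": date(2002, 1, 1)},
    "rec2": {"XYZ": date(2003, 5, 5)},  # withdrawn locally, absent from the extract
}
reclaim(holdings, ["rec1", "rec3"], "XYZ", date(2007, 5, 7))
print(holdings)
```

Because old holdings are removed only after the new stamps are in place, the symbol never disappears from records the library still holds, which fits the remark that no gap file is needed.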

B) Leaving existing holdings symbols for XYZ as-is and adding holdings to WorldCat for records known or believed to be missing. This strategy would be attractive to those libraries who have merely acquired some record sets from outside sources while generally keeping WorldCat synchronized with their cataloging activity.
Retrospective load—just send new records known to lack WorldCat holdings.

If there is normally a limit of one free “reclamation project” per institution, does that apply only to type (A) reclamations?
Both are free for one time.

(2) For representative samples of records in various categories that UC wants to load prior to the pilot, how will that be accomplished: Sending of records to OCLC for batchloading? Loading by campuses using Connexion software? What choices do we have?
For regular batch loads, don’t normally use samples … but for this wanted the types of records that aren’t normally in WorldCat … wouldn’t have to go through a big official process … could start tackling the types of records … will confer with Doug on numbers for samples. Usually as big a sample as you want to send, for variation … 100 of 5,000 would be fine … but if 500,000, then a few thousand … better result with larger sample. If we are going to do live data loading, we could actually load and review, but to see how real thing will go, a larger sample better.

(3A) One category of record for which UC might opt to load all records (versus a mere sample) is the set of e-resources licensed for the entire UC System and cataloged by various means for all 10 campuses. Currently none of the campus holding symbols appear in WorldCat. Is it possible to submit a batchload request in which each match or new record loaded is tagged with a set of 10 holding symbols?
One symbol to represent the group; or could add all 10 holdings to every record. [Could the 920 be used?] Yes, it could be used. It doesn’t matter where the code is in the record, as long as the location is consistent. [John: ILL traffic, scoping of local catalogs, and the OCLC collection analysis tool are factors that will affect the choice between one new symbol and one per campus.]

(3B) Some of the e-resource records UC created in-house by batch cloning of the print version records. The OCLC# reflected in the UC record consists of the print version OCLC# with a suffix of ‘eo.’ For finding the appropriate match in WorldCat, is it possible for OCLC’s batch loader to reference the 776 field subfield $w (OCoLC) in the WorldCat record, for the print version, as a means of locating the “real” OCLC record for the e-version?
Recommends forced add, loading them all as new? Or extended matching? Would do manual searching to see if there is an actual match? All they have are really detailed internal docs—but they do have a basic overview—would we like that? Don’t think can match on 776 field.

(4) Is there any documentation OCLC can share with the UC team concerning the batchload process and the duplicate detection algorithm (DDR)? Among other things, this would help UC gauge the loader’s ability to find matching records when the UC records are very brief.
See above. Extended matching algorithm and DDR in development … have been tweaking algorithms … new extended matching and old DDR … hard one to nail down. Weighs and measures entire record and gives it a value and sometimes doesn’t use this for new processes. A “similarity value” of 0.85 or greater is usually considered a duplicate. A lot more flexibility in matching than there was. Will get us the best review that we can, probably need to play with the data. Process a file: give us the results of each pass to see how far they should go with [fuzzy] matching. They have been putting in a lot of rules for ‘sparse record’ matching. Would end up in the ‘unresolved’ category. Option to add as-is or upgrade.
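The 0.85 figure can be read as a simple cutoff on a score between 0 and 1. The Python sketch below shows only that cutoff. The scoring function is a stand-in built on the standard library's `difflib.SequenceMatcher` applied to two strings; it is not OCLC's algorithm, which according to these notes weighs the entire record and is documented only internally.

```python
from difflib import SequenceMatcher

DUPLICATE_THRESHOLD = 0.85  # the "similarity value" quoted in the notes


def similarity(a, b):
    """Stand-in score in [0, 1] comparing two strings; not OCLC's method."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_probable_duplicate(a, b, threshold=DUPLICATE_THRESHOLD):
    """Treat a pair as a probable duplicate when the score meets the cutoff."""
    return similarity(a, b) >= threshold


print(is_probable_duplicate("Topographic maps of Japan", "topographic maps of japan"))
print(is_probable_duplicate("Lecture Notes", "The Theater"))
```

A very brief record gives any scoring method little to compare, which is one way to see why sparse records need special rules.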

(5) Once any of the UC sample records have been loaded in support of the pilot, will they end up in WorldCat proper, or in some auxiliary store of records (e.g. as with the 30M article citation records)? The answer will probably affect the UC load strategy.

Can’t answer this right now. Doesn’t know if it has been decided. Have a feeling that they would be added to WorldCat unless there was something ‘real bizarre’.

(6A) Presumably libraries participating in batchloading or reclamation projects will need to get back the OCLC#s for which holdings symbols have newly been set and load the OCLC#s to the local file, since the WorldCat Local service will depend on them for linking into the ILS. Is the chief means of accomplishing this for large quantities of records the set up of a MARC Subscription through OCLC Western http://www.oclc.org/western/news/updates/announcement706.htm ?

Wouldn’t be MARC subscription—which is from online transactions only. Could have batch output of MARC records, retaining our local ILS number so that we could re-load. In 035 or 9xx or whatever. Would send back OCLC Master record with local fields retained if necessary. Could force something if it were based on something in the record. If we didn’t want or need full MARC records returned, could have a cross-reference report, with text-column report of our system record number and the OCLC number.
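If the cross-reference report option were chosen, loading it locally would amount to reading two columns. The Python sketch below assumes a tab-separated file with a header row; the column names and the sample record numbers are hypothetical, since the notes do not specify the report layout.

```python
import csv
import io


def write_crossref(pairs, out):
    """Write (local system number, OCLC number) pairs as tab-separated text."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["local_id", "oclc_number"])  # hypothetical column names
    writer.writerows(pairs)


def read_crossref(inp):
    """Read the report back into a dict keyed by local system number."""
    reader = csv.DictReader(inp, delimiter="\t")
    return {row["local_id"]: row["oclc_number"] for row in reader}


buf = io.StringIO()
write_crossref([("b1000001", "12345678"), ("b1000002", "87654321")], buf)
buf.seek(0)
mapping = read_crossref(buf)
print(mapping)
```

The resulting mapping is what a local load would need in order to write each OCLC number into the matching local record.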

(6B) During the pilot phase, libraries with an existing MARC subscription service may want some or all categories of those records to be excluded. Which, if any, of the choices would prevent the records from being sent back to us through our MARC subscription service?

MARC subscription is limited to online transactions, not batch processes.

(7) How is the responsibility handled for potential duplication resulting from batchloading of UC titles to WorldCat? (In terms of prevention or clean up?)

See answers above … do their best to use the best matching algorithms possible—pre-processing can help clean up records so that matching algorithms work better—almost impossible not to add duplicate records to WorldCat. A glaring duplicate will be merged when reported to Quality Control (QC). They also run macros behind the scenes.

(8) For records obtained from sources other than WorldCat or the vendors listed in Appendix D of the UC requirements grid, is there provision for OCLC’s seeking to obtain permission for UC to batchload them? An example would be Marcive’s government document records (Documents Without Shelves).

We will do everything we can to work with those vendors to load these records into WorldCat. We have more and more vendors that want to be represented in WorldCat. Marcive is a bit different, because they are also providers of MARC data … but they can’t prevent you from adding your holdings to an existing OCLC record. Do have a large number of vendors on board with contributing records; barriers breaking down. OCLC is offering incentives to the vendors. If there is a question about whether or how a batch should load—make sure the records have some identification, or send as a separate batch. OCLC may need us to say it is important to us, but they will work with the vendors.

(9) Former RLG members were offered the possibility of the “Institutional Record” option, something that will be offered to OCLC members for a cost. Is that option available for segments of an institution’s records, e.g. pre-1850 rare book records, or must it be applied across the board to all of an institution’s records?

I talked to Chris Grabenstatter and this is under consideration. We don’t have it right now, but Chris agrees with me that we need to offer this option. We think we can make that happen.

(10) Setting of WorldCat symbols via batchloading normally cannot result in any value add contributions to the master record. If any editing is done on a bibliographic record subsequent to exporting it to a local file, it will be nearly impossible for an OCLC batchload program to identify the improvements (in particular fields) that could be incorporated into the master record. This situation is particularly true when a library revisits a previously cataloged title to make revisions, e.g. to reflect the newer pieces of a monographic set.

We need to think about this … not part of the normal process … in the RLG load, have done this … replacing later? In batch loads, records have Elvl of M and anyone can edit it.

Question: It appears the only viable option for making corrections and improvements to the bibliographic records our users will see is to have all the catalogers in the UC system begin cataloging directly online in WorldCat and using OCLC Enhance and/or Program for Cooperative Cataloging authorizations. Do you see any alternative to this?

I understand that a couple of years ago Glenn Patton was working with a small group of large research libraries (Berkeley, Harvard, Yale) engaged in backlog cataloging projects. A means of both setting holdings symbols and overlaying the master record with the output of the backlog project was proposed at OCLC: the inclusion of a 989 field containing the flag “$a coopcat.” Is this likely to be a viable option to the Enhance/PCC strategy cited above?

Possibility of using one-to-one batch replacing of records that have been significantly enhanced. Only works if the institution seeking replacement of the record is the same one who originally contributed the record (the institution’s symbol is in 040 field).

On-order data—working with ‘upstream’ data—on our plate with WorldCat Selection—

Unresolved records:

Condition: if it doesn’t match, and doesn’t contain a validation error, then add it.

So … unresolved records could be sparse or have character-set conversion problems or … will ask database specialist and find out.
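The condition stated in these notes can be written as a small decision function. This Python sketch is a reading of the notes, not OCLC's loader logic: the "add" branch follows the stated condition, while sending everything else to "unresolved" is an assumption, and the notes elsewhere suggest sparse records may also land there. The function and outcome names are illustrative.

```python
def batchload_outcome(matched, has_validation_error):
    """Return a label for what happens to one incoming record."""
    if matched:
        return "set holdings on matched record"
    if not has_validation_error:
        # Stated condition: no match and no validation error, so add it.
        return "add as new record"
    # Assumption: whatever remains is left for review as unresolved.
    return "unresolved"


for matched, error in [(True, False), (False, False), (False, True)]:
    print(matched, error, "->", batchload_outcome(matched, error))
```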