A Candid Look at Collected Works: Challenges of Clustering Aggregates in GLIMIR and FRBR

Gail Thornburg

INFORMATION TECHNOLOGY AND LIBRARIES | SEPTEMBER 2014 53

ABSTRACT

Creating descriptions of collected works in ways consistent with clear and precise retrieval has long challenged information professionals. This paper describes problems of creating record clusters for collected works and distinguishing them from single works: design pitfalls, successes, failures, and future research.

OVERVIEW AND DEFINITIONS

The Functional Requirements for Bibliographic Records (FRBR) was developed by the International Federation of Library Associations (IFLA) as a conceptual model of the bibliographic universe. FRBR is intended to provide a more holistic approach to retrieval and access of information than any specific cataloging code. FRBR defines a work as a distinct intellectual or artistic creation. Put very simply, an expression of that work might be published as a book. In FRBR terms, this book is a manifestation of that work.1

A collected work can be defined as “a group of individual works, selected by a common element such as author, subject or theme, brought together for the purposes of distribution as a new work.”2 In FRBR, this type of work is termed an aggregate or “manifestation embodying multiple distinct expressions.”3 Zumer describes an aggregate as “a bibliographic entity formed by combining distinct bibliographic units together.”4 Here the terms are used interchangeably. In FRBR, the definition of aggregates applies only to group 1 entities, i.e., not to groups of persons or corporate bodies.
The IFLA Working Group on Aggregates has defined three distinct types of aggregates: (1) collections of expressions, (2) aggregates resulting from augmentation or supplementing of a work with additional material, and (3) aggregates of parallel expressions of one work in multiple languages.5 While noting the relationships between the categories, this paper will focus on the first type. Aggregates of the first type include selections, anthologies, series, books with independent sections by different authors, and so on. Aggregates may occur in any format, from a volume containing both of the J. D. Salinger works Catcher in the Rye and Franny and Zooey to a sound recording containing popular adagios from several composers to a video containing three John Wayne movies.

Gail Thornburg (thornbug@oclc.org) is Consulting Software Engineer and Researcher at OCLC, Dublin, Ohio.

A CANDID LOOK AT COLLECTED WORKS | THORNBURG 54

THE ENVIRONMENT

The OCLC WorldCat database is replete with bibliographic records describing aggregates. It has been estimated that the database may contain more than 20 percent aggregates.6 This proportion may increase as WorldCat coverage of recordings and videos grows.

In the Global Library Manifestation Identifier (GLIMIR) project, automatic clustering of the records into groups of instances of the same manifestation of a work was devised. GLIMIR finds and groups similar records for a given manifestation and assigns two types of identifiers for the clusters. The first type is a Manifestation ID, which identifies parallel records differing only in language of cataloging or metadata detail, some of which are probably true duplicates whose differences cannot be safely deduplicated by a machine process. The second type is a Content ID, which describes a broader clustering, for instance, physical and digital reproductions and reprints of the same title from differing publishers.
This process started with the searching and matching algorithms developed for WorldCat. The GLIMIR clustering software is a specialization of the matching software developed for the batch loading of records to WorldCat, deduplicating the database, and other search and comparison purposes.7 This form of GLIMIRization compares an incoming record to database search results to determine what should match for GLIMIR purposes. This is a looser match in some respects than what would be done for merging duplicates. The initial challenges of tailoring matching algorithms to suit the needs of GLIMIR have been described in Thornburg and Oskins8 and in Gatenby et al.9 The goals of GLIMIR are (1) to cluster together different descriptions of the same resource and to get a clearer picture of the number of actual manifestations in WorldCat so as to allow the selection of the most appropriate description, and (2) to cluster together different resources with the same content to improve discovery and delivery for end users. According to Richard Greene, “The ultimate goal of GLIMIR is to link resources in different sites with a single identifier, to cluster hits and thereby maximize the rank of library resources in the web sphere.”10 GLIMIR is related conceptually to the FRBR model. If the goal of FRBR is to improve the grouping of similar items for one work, then GLIMIR similarly groups items within a given work. Manifestation clusters specify the closest matches. Content clusters contain reproductions and may be considered to represent elements of the expression level of the FRBR model. The FRBR and GLIMIR algorithms this paper discusses have evolved significantly over the past three years. In addition, it should be recognized that the FRBR algorithms use a map/reduce keyed approach to cluster FRBR works and some GLIMIR content while the full GLIMIR algorithms use a more detailed and computationally expensive record comparison approach. 
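The keyed approach mentioned above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the record layout, the `normalize` and `work_key` helpers, and the choice of author plus title as the key are hypothetical, and the production FRBR keys are not described in this paper.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text):
    """Fold case, strip diacritics, and collapse punctuation (illustrative only)."""
    decomposed = unicodedata.normalize("NFKD", text)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", plain.lower()).strip()


def work_key(record):
    """Hypothetical clustering key: normalized author plus normalized title."""
    return normalize(record["author"]) + "/" + normalize(record["title"])


def group_by_key(records):
    """'Map' each record to a key, then 'reduce' by collecting ids per key."""
    groups = defaultdict(list)
    for rec in records:
        groups[work_key(rec)].append(rec["id"])
    return dict(groups)


records = [
    {"id": 1, "author": "Dvořák, Antonín", "title": "Rusalka"},
    {"id": 2, "author": "Dvorak, Antonin", "title": "Rusalka."},
    {"id": 3, "author": "Homer", "title": "Odyssey"},
]
print(group_by_key(records))
# {'dvorak antonin/rusalka': [1, 2], 'homer/odyssey': [3]}
```

A keyed pass of this kind touches each record once, which is why it scales well; the trade-off is that any two records sharing a key come together without the field-by-field comparison that the full GLIMIR matching performs.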
The FRBR batch process starts with WorldCat enhanced with additional authority links, including the production GLIMIR clusters. It makes several passes through WorldCat, each pass constructing keys that pull similar records together for comparison and evaluation. As described by Toves, “Successive passes progressively build up knowledge about the groups allowing us to refine and expand clusters, ending up with the work, content and manifestation clusters to feed into production.”11

Each approach to clustering has its limits of feasibility, but the FRBR and GLIMIR combined teams have endeavored to synchronize changes to the algorithms and to share insights. Some materials are easier to cluster using one approach, and some using the other.

Clustering meets Aggregates

In the initial implementation of GLIMIR, the issue of handling collected works was considered out of scope for the project. With experience, the team realized there can be no effective automatic GLIMIR clustering if collected works are not identified and handled in some way.

Why is this? Suppose a record exists for a text volume containing work A. This matches to a record containing work A, but actually also containing work B. This matches to a work containing B and also containing works C, D, and E. The effect is a snowballing of cluster members that serves no one.

How could this happen? In a bibliographic database such as WorldCat, items representing collected works can be catalogued in several ways. Matching criteria relaxed to just the right degree to cluster records for the same work are difficult to devise and apply. The GLIMIR and FRBR teams consulted several times to discuss clustering strategies for works, content, and manifestation clusters. Practical experience with GLIMIR led to rounds of enhancements and distinctions to improve the software’s decisions.
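The snowballing described above follows from the transitive nature of clustering, and a small sketch can make it concrete. This is an illustration under stated assumptions, not the GLIMIR matcher: the records, the `cluster` function, and the two match predicates are hypothetical, and a loose match is modeled simply as two records sharing any work.

```python
from collections import defaultdict

# Hypothetical records, each listing the works it contains (A through E).
records = {
    "r1": {"A"},
    "r2": {"A", "B"},
    "r3": {"B", "C", "D", "E"},
}


def find(parent, x):
    """Return the cluster representative of x, compressing the path as it goes."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def cluster(records, may_match):
    """Merge every pair that matches; membership then spreads transitively."""
    parent = {rid: rid for rid in records}
    ids = sorted(records)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if may_match(records[a], records[b]):
                parent[find(parent, a)] = find(parent, b)
    groups = defaultdict(set)
    for rid in ids:
        groups[find(parent, rid)].add(rid)
    return sorted(sorted(g) for g in groups.values())


def shares_a_work(a, b):
    """Loose match: any work in common."""
    return bool(a & b)


def same_contents(a, b):
    """Strict match: identical contents."""
    return a == b


print(cluster(records, shares_a_work))  # [['r1', 'r2', 'r3']]
print(cluster(records, same_contents))  # [['r1'], ['r2'], ['r3']]
```

Records r1 and r3 have nothing in common, yet the loose predicate places them in one cluster because r2 bridges them. The strict predicate avoids that but clusters nothing at all here, which is the other failure the teams had to steer away from.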
While GLIMIR clusters can be, and have been, undone and redone on more than one occasion, it took experience for the team to realize that the clues to a collected work must be recognized.

Bible and Beowulf

As with many initial production startups, the output of GLIMIR processing was monitored. Reports of changes in any clusters of more than fifty were reviewed by quality control catalogers for suspicious combinations. And occasionally a library using a GLIMIR- or FRBR-organized display would report a strange cluster.

This was the case with a huge malformed cluster of records for the Bible. Such a work set tends to be large and unmanageable by nature; there are a huge number of records for the Bible in WorldCat. However, it was noticed that the set had grown suddenly over the previous two months. User interface applications stalled when attempting to present a view organized by such a set. One day, a local institution reported that a record for Beowulf had turned up in this same work set. This started the team on an investigation.

After much searching and analysis of the members of this cluster, the index case was uncovered. In many cases bibliographic records are allowed to cluster based on a uniform title. What the team found connecting these disparate records was a totally unexpected use of the uniform title: a field 240 subfield a with the contents “B.” That’s right, “B.” Once the first case was located, it was not hard to figure out that there were numerous uniform “titles” consisting of other single letters of the alphabet. So in this odd usage, Bible and Beowulf could come together if insufficient data were present in two records to discriminate by other comparisons, as could, potentially, other titles that started with “B.”

Once this unanticipated use of the uniform title field was seen, the FRBR and GLIMIR algorithms were promptly modified to guard against it. The FRBR and GLIMIR clusters were then unclustered and redone.
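The paper does not say how the algorithms were modified. One plausible guard, sketched here purely as an assumption, is to refuse to treat a uniform title as clustering evidence when nothing but a single character remains after punctuation is stripped. The function name `usable_uniform_title` and the one-character threshold are hypothetical.

```python
import re


def usable_uniform_title(title_240a):
    """Return False when a 240 $a value is too short to justify clustering."""
    # Drop punctuation, underscores, and whitespace, so "B." becomes "B".
    core = re.sub(r"[\W_]+", "", title_240a)
    return len(core) > 1


print(usable_uniform_title("B."))      # False
print(usable_uniform_title("Bible."))  # True
```

A check like this only removes the uniform title from consideration; other comparisons between two records would still have to succeed or fail on their own.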
The single-letter uniform titles were a data issue, and unanticipated uses of fields in a record will crop up, if usually with less drama.

Further experience showed more. In the examination of another ill-formed cluster, a reviewer realized that one record had the uniform title stated as “Iliad” but the item title was Homer’s “Odyssey.” Of course these have the same author, and may easily have the same publisher. Even the same translator (e.g., Richmond Lattimore) is not improbable for a work like this. This was a case of bad data, but it merged two very large clusters into one.

Music and Identification of Collected Works

As music catalogers know, musical works are very frequently presented in items that are collections of works. The rules for creating bibliographic records for music, whether scores or recordings or other, are intricate. The challenges to software to distinguish minor differences in wording from critical differences seem to be endless.

Moreover, musical sound recordings are largely collected works due to the nature of publication. As noted by Papakhian, personal author headings are repeated more often in sound recording collections than in the general body of materials.12 There are several factors that may contribute to such an observation. There are likely to be numerous recordings by the same performer of different works and numerous records of the same work by different performers. Composers are also likely to be performers. The point is, for sound recordings an author statement and title may be less effective discriminators than for printed materials.

Vellucci13,14 and Riley15 have written extensively on the problems of music in FRBR models. The problem of distinguishing and relating whole/part relationships is particularly tricky. Musical compositions often consist of units or segments that can be performed separately. So they are generally susceptible to extraction.
These extractive relationships are seen in cases where parts are removed from the whole to exist separately, or perhaps parts for a violin or other instrument are extracted from the full score. Software must be informed with rules as to significant differences in description of varying parts and varying descriptions of instruments, and in this team’s experience that is particularly difficult.

Krummel has noted that the bibliographic control of sound recordings has a dimension beyond item and work, that is, performance.16 Different performances of the same Beethoven symphony need to be distinguished. Cast and performer list evaluation and dates checking are done by the software. However, the comparisons the software can make depend on the fullness or scarcity of data provided in the bibliographic record. There is great variation observed in the numbers of cast members stated in a record.

Translator and adapter information can prove useful in the same sense of roles discrimination for other types of materials. This is close scrutiny of a record. At the same time consider that an opera can include the creative contributions of an author (plot), a librettist, and a musical composer. Yet these all come together to provide one work, not a collected work.

Tillett has categorized seven types of bibliographic relationships among bibliographic entities, including the following:

1. Equivalence, as exact copies or reproduction of a work. Photocopies and microforms are examples.
2. Derivative relationships, or, a modification such as variations, editions, translations.
3. Descriptive, as in criticism, evaluation, review of a work.
4. Whole/part, such as the relation of a selection from an anthology.
5. Accompanying, as in a supplement or concordance or augmentation to a work.
6. Sequential, or chronological relationships.
7. Shared characteristic relationships, as in items not actually related that share a common author, director, performer, or other role.17

While it is highly desirable for a software system to notice category 1 to cluster different records for the same work, that same software could be confused by “clues,” such as in category 7. And the software needs to understand the significance of the other categories in deciding what to group and what to split.

To handle these relations in bibliographic records, Tillett discusses linking devices including, for instance, uniform titles. Yet uniform titles are used for the categories of equivalence relationships, whole/part relationships, and derivative relationships. This becomes more and more complex for a machine to figure out. Of course, uniform titles within bibliographic records are supposed to link to authority records via text string only. Consideration should ideally be given to linking via identifiers, as has been suggested elsewhere.18

Thematic Indexes

Review of GLIMIR clusters for scores and recordings showed a case where Haydn’s symphonies A and B were brought together. These were outside the traditional canon of the 104 Haydn symphonies and were referred to as “A” and “B” by the Haydn scholar H. C. Robbins Landon. This misclustering highlighted the need for additional checks in the software.

The original GLIMIR software was not aware of thematic indexes as a tool for discrimination. Thematic indexes are numbering systems for the works of a composer. The Köchel Mozart catalog, as in K. 626, is a familiar example. These designations are intended to be unique within the works of a given composer, but identical designators may coincidentally have been assigned to works of different composers.
While “B” series numbers may be applied to works of Chambonnières, Couperin, Dvořák, Pleyel, and others, the presence of more than one B number is suggestive of collected work status. For more on the various numbering systems, see the interesting discussion by the Music Library Association.19

However, the software cannot merely count likely identifiers in the usual place. This could lead to falsely flagging aggregates; one work by Dvořák could have B.193, which is incidentally equivalent to opus 105. Clearly, any detection of multiple identifiers of this sort must be restricted to identifiers of the same series.

String Quartet Number 5, or Maybe 6

Cases of renumbering can cause problems in identifying collected works. An early suppressed or lost work, later discovered and added to the canon of the composer’s work, can cause renumbering of the later works. Clustering software must be very attentive to discrete numbers in music, but can it be clever enough?

The works of Paul Hindemith (1895–1963) offer an example. His first string quartet was written in 1915, but long suppressed. His publisher was generally Schott. Long after Hindemith’s death, this first quartet was unearthed, and then was published by Schott. The publisher then renumbered all the quartets. So quartets previously 1 through 6 became 2 through 7. The rediscovered work was then called “No. 1,” though sometimes called “No. 0” to keep the older numbering intact. Further, the last two quartets did not even have opus numbers assigned and were both in the same key.20 This presents a challenge.

Anything Musical

Another problem case emerged when reviewers noticed that a cluster contained both of the unrelated songs “Old Black Joe” and “When You and I Were Young, Maggie.” On investigation, the cluster held a number of unrelated pieces. Here the use of alternate titles in a 246 field had led to overclustering, and the rules for use of 246 fields were tightened in FRBR and GLIMIR.
As in the other problem cases, cycles of testing were necessary to estimate sufficient yet not excessive restrictions. Rules that are too strict split good clusters and defeat the purpose of FRBR and GLIMIR.

At this point the GLIMIR/FRBR team recognized that rules changes were necessary but not sufficient. That is, a concerted effort to handle collected works was essential.

Strategies for Identifying Collected Works

The greatest problem, and most immediate need, was to stop the snowballing of clusters. Clusters containing some member records that are collected works can suddenly mushroom out of control. Rule 1 was that a record for a collected work must never be grouped with a record for a single work. If all in a group are collected works, that is closer to tolerable (more on that later).

With time and experimentation, a set of checks was devised to allow collected works to be flagged. These clues were categorized as types: (1) considered conclusive evidence, or (2) partial evidence. Type 2 needed another piece of evidence in the record.

Finding the best clues was a team effort. It was acknowledged that to prevent overclustering, overidentification of aggregates was preferable to failure to identify them. Several cycles of tests were conducted and reviewed, assessing whether the software guessed right.

Table 1 illustrates the types of checks done for a given bibliographic record. Here the “$” is used as an abbreviation for subfield, and “ind” equals indicator.

| Area | Field | Rule | Notes |
| Uniform Title | 240 $a and no $m, $n, $p, or $r | Title in $a on list of terms, without the other subfields listed, IS collected work. | This is a long list of terms such as “symphonies,” “plays,” “concertos,” and so on. |
| Title | 245 | Contains “selections,” IS collected. | |
| | 245 | 245 with multiple semicolons and doc type “rec” | |
| | 246 | If four or more v246 fields with ind2 = 2, 3, or 4, IS collected. If more than 1 246, consider partial evidence. | |
| Extent | 300 | If 300 $a has “pagination multiple” or “multiple pagings,” IS collected. | |
| Contents Notes | 505 $a and $t | 1. Check $a for first and last occurrences of “movement”. If not multiple movement occurrences and does have multiple “ / ” pattern. 2. If the above doesn’t find multiple patterns, also look for “ ; ” patterns. 3. If the above checks don’t produce more than 1 pattern, look for multiple “ – ” patterns. 4. Count 505s $t cases. 5. Count $r cases. | If all / any of the above produce more than one pattern instance, or more than one $t, or more than one $r, IS collected. |
| Various fields for Thematic Index clues | 505 $a | If any v505 $a, check for differing opuses. (This also checks for thematic index cases.) If found, IS collected. | For types Score and Recording. |
| Related work | 740 | If 1 or more 740 and 1 has indicator 2 = “2”, IS collected. If only multiple 740s, partial evidence. | |
| Author | 700/710/711/730 | Check for $t and $n. And check 730 ind2 value of “2.” If 730 with ind2 = 2 or multiple $t is found, IS collected. If only 1 $t, partial evidence. | |
| | 100/110/111, 700/710, 730 | If format recording, and both records are collected work, require cast list match to cluster anything but manifestation matches. That is, do not cluster at content level without verifying by cast. | |

Table 1. Checks on Bibliographic Records.

Frailties of Collected Works Identification in Well-Cataloged Records

The above table illustrates many areas in a bibliographic record that can be mined for evidence of aggregates. The problem is that cataloging practice offers no single mandatory rule for cataloging a collected work. Moreover, as WorldCat membership grows, the use of multiple schemes of cataloging rules for different eras and geographic areas adds to the complexity, even assuming that all the bibliographic records are cataloged “correctly.” Correct cataloging is not assumed by the team.
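A few of the Table 1 checks, together with Rule 1, can be sketched as follows. This is a minimal illustration under several assumptions rather than the production logic: the record layout (a dict of MARC tags, each holding fields with an `ind2` value and a list of subfield pairs) is hypothetical, the term list holds only the three terms the table names, and "another piece of evidence" is read here as at least two partial clues. The 505 pattern checks, the thematic index check, and the cast list requirement are omitted.

```python
# Only the three terms named in Table 1; the real list is described as long.
COLLECTIVE_TERMS = {"symphonies", "plays", "concertos"}


def values(record, tag, code=None):
    """Subfield values for every `tag` field, optionally limited to one code."""
    return [v for f in record.get(tag, [])
            for c, v in f["subfields"] if code is None or c == code]


def evidence(record):
    """Return (conclusive, partial) clue counts for a subset of Table 1."""
    conclusive = partial = 0

    # 240: $a on the term list, with no $m, $n, $p, or $r present.
    for f in record.get("240", []):
        codes = {c for c, _ in f["subfields"]}
        titles = [v.strip(" .").lower() for c, v in f["subfields"] if c == "a"]
        if any(t in COLLECTIVE_TERMS for t in titles) and not codes & set("mnpr"):
            conclusive += 1

    # 245: title contains "selections".
    if any("selections" in v.lower() for v in values(record, "245")):
        conclusive += 1

    # 246: four or more with ind2 of 2, 3, or 4; more than one is partial.
    f246 = record.get("246", [])
    if sum(f.get("ind2") in {"2", "3", "4"} for f in f246) >= 4:
        conclusive += 1
    elif len(f246) > 1:
        partial += 1

    # 300 $a: wording that signals multiple paginations.
    phrases = ("pagination multiple", "multiple pagings")
    if any(p in v.lower() for v in values(record, "300", "a") for p in phrases):
        conclusive += 1

    # 740: any with ind2 = 2 is conclusive; several without it are partial.
    f740 = record.get("740", [])
    if any(f.get("ind2") == "2" for f in f740):
        conclusive += 1
    elif len(f740) > 1:
        partial += 1

    return conclusive, partial


def is_collected(record):
    conclusive, partial = evidence(record)
    return conclusive > 0 or partial >= 2  # assumed reading of "partial evidence"


def may_cluster(a, b):
    """Rule 1: never group a collected work with a single work."""
    return is_collected(a) == is_collected(b)


anthology = {"240": [{"ind2": " ", "subfields": [("a", "Plays.")]}]}
novel = {"245": [{"ind2": "4", "subfields": [("a", "The catcher in the rye")]}]}
print(is_collected(anthology), is_collected(novel), may_cluster(anthology, novel))
# True False False
```

Treating a single conclusive clue as sufficient leans toward overidentifying aggregates, which matches the stated preference for overidentification over failure to identify.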
Software Confounded

With all the checks outlined in the table, the team still found cases of collected works that seemed to defy machine detection. One record had the two separate works, Tom Sawyer and Huckleberry Finn, in the same title field, with no other clues to the aggregate nature of the item.

The work Brustbild was another case. For this electronic resource set, Brustbild appeared to be the collection set title, but the specific title for each picture was given in the publisher field.

A cluster for the work Gedichte von Eduard Mörike (score) showed problems with the uniform title, which was for the larger work, while the cluster records each actually represented parts of the work.

The bad cluster for Si ku quan shu zhen ben bie ji, an electronic resource, contained records that each appeared to represent the entire collection of 400 volumes, but the link in each 856 field pointed only to one volume in the set.

Limitations of the Present Approach

The current processing rules for collected works adopt a strategy of containment. The problem may be handled in the near term by avoiding the mixing of collected works with noncollected works, but the clusters containing collected works need further analysis to produce optimal results.

For example, it is one thing to notice “arrangements” in scores as a clue to the presence of an aggregate. The requirement also exists that an arrangement should not cluster with the original score. The rules for clustering and distinguishing different sets of arrangements present another level of complexity. Checks to compare and equate the instruments involved in an arrangement are quite difficult; in this team’s experience, they fail more often than they succeed. Without initial explication of the rules for separating arrangements, reviewers quickly found clusters such as Haydn’s Schöpfung, which included records for the full score, vocal score, and an arrangement for two flutes.
An implementation that expects one manifestation to have the identifier of only one work is a conceptual problem for aggregates. A simple case: if the description of a recording of Bernstein’s Mass has an obscurely placed note indicating that the second side contains the work Candide, Mass is likely to be dominant in the clustering effect, with the second work effectively “hidden.” This manifestation would seem to need three work IDs: one for the combination, one for Mass, and one for Candide.

This does not easily translate to an implementation of the FRBR model but could perhaps be achieved via links. Several layers of links would seem necessary. A manifestation needs to link to its collected work. A collected work needs links to records for the individual works that it contains, and vice versa, individual works need to link to collective works. This can be important for translations, for example, into Russian, where collective works are common even where they do not exist in the original language.

Lessons Learned

First and foremost, plan to deal with collected works. For clustering efforts this must be addressed in some way for any large body of records.

Second, expect individual formats to demand attention. The initial implementation of the GLIMIR algorithms used test sets mainly composed of a specific work. After all, GLIMIR clusters should all be formed within one work. These sets were carefully selected to represent as many different types of work sets as possible, whether clear or difficult examples of work set members. Plenty of attention was given to the compatibility of differing formats, given the looser content clustering. These were good tests of the software’s ability to cluster effectively and correctly within a set that contained numerous types of materials. Random sets of records were also tested to cross-check for unexpected side effects.
In retrospect, the team would have expanded the test sets that focused on specific formats. Recordings, scrutinized as a group, can show different problems than scores or books. The distinctions to be made are probably not complete.

Another lesson learned in GLIMIR concerned the risks of clustering. The deliberate effort to relax the very conservative nature of the matching algorithms used in GLIMIR was critical to success in clustering anything. Singleton clusters don’t improve anyone’s view. In the efforts to decide what should and should not be clustered, it was initially hard to discern the larger scale risks of overclustering. Risks from sparse records were probably handled fairly well in this initial effort, but risks from complex records needed more work. Collected works are only one illustration of the risks of overclustering.

FUTURE RESEARCH

The current research suggests a number of areas for possible further exploration:

• The option for human intervention to rearrange clusters not easily clustered automatically would seem to be a valuable enhancement.
• There is next the general question, what sort of processing is needed, and feasible, to distinguish the members of clusters flagged as collected works?
• Part versus whole relationships can be difficult to distinguish from the information in bibliographic records. Further investigation of these descriptions is needed.
• Arrangements of works in music are so complex as to suggest an entire study by themselves. Work on this area is in progress, but it needs rules investigation.
• Other derivative relationships among works: Do these need consideration in a clustering effort? Can and should they be brought together while avoiding overclustering of aggregates?
• How much clustering of collected works may actually be helpful to persons or processes searching the database? How can clusters express relationships to other clusters?
CONCLUSION

Clustering bibliographic records in a database as large as WorldCat takes careful design and undaunted execution. The navigational balance between underclustering and overclustering is never easy to maintain, and course corrections will continue to challenge the navigators.

ACKNOWLEDGMENTS

This paper would have been a lesser thing without the patient readings by Rich Greene, Janifer Gatenby, and Jay Weitz, as well as their professional insights and help in clarifying cataloging points. Special thanks to Jay Weitz for explicating many complex cases in music cataloging and music history.

REFERENCES

1. Barbara Tillett, “What is FRBR? A Conceptual Model for the Bibliographic Universe,” last modified 2004, accessed November 22, 2013, http://www.loc.gov/cds/FRBR.html.
2. Janifer Gatenby, email message to the author, November 10, 2013.
3. International Federation of Library Associations (IFLA) Working Group on Aggregates, Final Report of the Working Group on Aggregates, September 12, 2011, http://www.ifla.org/files/assets/cataloguing/frbrrg/AggregatesFinalReport.pdf.
4. Maja Zumer and Edward T. O’Neill, “Modeling Aggregates in FRBR,” Cataloging and Classification Quarterly 50, no. 5–7 (2012): 456–72.
5. IFLA Working Group on Aggregates, Final Report.
6. Zumer and O’Neill, “Modeling Aggregates in FRBR.”
7. Gail Thornburg and W. Michael Oskins, “Misinformation and Bias in Metadata Processing: Matching in Large Databases,” Information Technology & Libraries 26, no. 2 (2007): 15–22.
8. Gail Thornburg and W. Michael Oskins, “Matching Music: Clustering versus Distinguishing Records in a Large Database,” OCLC Systems and Services 28, no. 1 (2012): 32–42.
9. Janifer Gatenby et al., “GLIMIR: Manifestation and Content Clustering within WorldCat,” Code{4}Lib Journal 17 (June 2012), http://journal.code4lib.org/articles/6812.
10. Richard O. Greene, “Cataloging Alchemy: Making Your Data Work Harder” (slideshow presented at the American Library Association Annual Meeting, Washington, DC, June 26–29, 2010), http://vidego.multicastmedia.com/player.php?p=ntst323q.
11. Jenny Toves, email message to the author, December 17, 2013.
12. Arsen R. Papakhian, “The Frequency of Personal Name Headings in the Indiana University Music Library Card Catalogs,” Library Resources & Technical Services 29 (1985): 273–85.
13. Sherry L. Vellucci, Bibliographic Relationships in Music Catalogs (Lanham, MD: Scarecrow, 1997).
14. Sherry L. Vellucci, “FRBR and Music,” in Understanding FRBR: What It Is and How It Will Affect Our Retrieval Tools, ed. Arlene G. Taylor (Westport, CT: Libraries Unlimited, 2007), 131–51.
15. Jenn Riley, “Application of the Functional Requirements for Bibliographic Records (FRBR) to Music,” http://www.dlib.indiana.edu/~jenlrile/presentations/ismir2008/riley.pdf.
16. Donald W. Krummel, “Musical Functions and Bibliographic Forms,” The Library, 5th ser. 31 (1976): 327–50.
17. Barbara Tillett, “Bibliographic Relationships: Toward a Conceptual Structure of Bibliographic Information used in Cataloging” (PhD diss., Graduate School of Library & Information Science, University of California, Los Angeles, 1987), 22–83.
18. Program for Cooperative Cataloging (PCC) Task Group on the Creation and Function of Name Authorities in a Non MARC Environment, “Report on the PCC Task Group on the Creation and Function of Name Authorities in a Non MARC Environment,” last modified 2013, http://www.loc.gov/aba/pcc/rda/RDA%20Task%20groups%20and%20charges/ReportPCCTGonNameAuthInA_NonMARC_Environ_FinalReport.pdf.
19. Music Library Association, Authorities Subcommittee of the Bibliographic Control Committee, “Thematic Indexes Used in the Library of Congress/NACO Authority File,” http://bcc.musiclibraryassoc.org/BCC-Historical/BCC2011/Thematic_Indexes.htm.
20. Jay Weitz, email message to the author, May 6, 2013.