Making (Meta)Data of Musical Works Accessible as a Thematic–Bibliographic Catalogue

Currently, information on Estonian musical works is fragmented across several platforms: general information on works is available on the website of the Estonian Music Information Centre, data on manuscripts in the Museums Public Portal, and bibliographic records of printed scores as well as audio and video recordings in the shared catalogue ESTER. Furthermore, data in these systems are structured according to quite different principles; for example, the MARC21 format has not always been followed consistently. The aim of this research group at the Estonian Academy of Music and Theatre (EAMT) is, taking the Estonian composer Heino Eller (1887–1970) as its point of departure, to map the possibilities for gathering data from existing systems, while drawing on more recent practices in the cataloguing of musical sources (RISM). The research project can be regarded as a requirements analysis and planning stage in the process of developing a digital catalogue dedicated to Heino Eller.

The research project primarily makes use of three types of data/sources: 1) Metadata on works from existing information systems (downloaded using scripts developed during the project in the Python programming language); 2) Heino Eller’s manuscripts stored at the Estonian Theatre and Music Museum (part of the Estonian History Museum), digitized as PDF and TIFF; 3) Printed and typescript sources, mainly from the 1970s and 1980s, which will be digitized during the project with optical character recognition (OCR).

Workflow steps

Keywords: Cataloguing, Content analysis, Contextualizing

A digital thematic–bibliographic catalogue is one way of making the legacy of Estonian composers accessible to researchers, performers, and the general public. But what kind of information should such a catalogue contain in order to be useful to different user groups? In addressing this question, it is advisable first to examine some well-known existing online musical catalogues, such as the Köchel-Verzeichnis of Mozart’s works. In this web catalogue, users can find information on each of Mozart’s works, ranging from the most basic data (classification of works, date of composition, instrumentation, etc.) to bibliographic information useful to researchers concerning manuscripts and printed music editions. The catalogue also contains recorded performances of Mozart’s music and scores published in the collected works series (Neue Mozart-Ausgabe). The emphasis on audiovisual media and interactivity is an increasingly prominent trend in web-based catalogues, one that should also be taken into account when designing similar information systems dedicated to Estonian composers.

Keywords: Data gathering, Web scraping, Data cleansing, Editing


The next step is to determine which portions of the data are already available in existing information systems. For example, a list of Heino Eller’s works and information on manuscripts can be found on the website of the Estonian Music Information Centre, descriptions of archival sources are accessible via the Museums Public Portal (MuIS), and records of Eller’s printed music editions and audio recordings are available in ESTER. Many audio and video recordings of cultural and historical significance are stored in the (digital) archive of the Estonian Public Broadcasting (ERR). Therefore, in seeking information on a specific work by Heino Eller, one currently needs to search across several platforms, each with quite different interfaces and levels of user experience.
 
To meet the needs of researchers, data must be entered into information systems according to internationally agreed-upon principles. For example, ESTER records can be viewed in the widely accepted MARC21 format and exported in bulk. These records are machine-readable, as the meaning of each data field is indicated by a tag. However, sometimes this download option is not available: in such cases, the researcher must manually save and clean the data retrieved from the website. One solution is to write Python scripts, which allow HTML data to be parsed and more easily organized into Excel table columns. When data from different information systems do not align, this stage of the research project is crucial in revealing inconsistencies and errors.
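The parsing step described above can be illustrated with a minimal sketch. It assumes that the page presents its records as an ordinary HTML table; the sample markup, the column names, and the helper names TableRowParser and html_table_to_csv are hypothetical and do not reproduce any of the project's actual scripts. The output is CSV text, which can be opened in Excel.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (layout whitespace) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse line breaks and repeated spaces left by the page layout.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


# Hypothetical sample markup with placeholder values, not real catalogue data.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>  Placeholder
      work A </td><td>1900</td></tr>
  <tr><td>Placeholder work B</td><td>1901</td></tr>
</table>
"""

csv_text = html_table_to_csv(SAMPLE_HTML)
print(csv_text)
```

Only the standard library is used here; real pages with nested tables or irregular markup would probably need a more tolerant parser and additional cleaning rules.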

Keywords: Disseminating, Publishing, Managing
 
Before developing a new information system, it is advisable to consider the possibilities of publishing data in existing ones. At this stage, the metadata of musical manuscripts are entered into RISM (Répertoire International des Sources Musicales), the largest database of musical sources, which currently contains more than 1.5 million records and ensures open access to the data (open data). RISM allows detailed descriptions of musical sources according to the MARC21 standard and can include machine-readable incipits of the opening measures of the manuscripts; Estonian information systems currently offer no viable alternative to this. At the same time, it is important to check the accuracy of the metadata in Estonian systems and supplement it if necessary.
 
Inserting manuscript metadata into RISM is also crucial from a data management perspective: this ensures that data collections are compiled according to internationally approved practices and remain accessible. It is also worth noting that in the RISM Online environment data collections become public shortly after being inserted and reviewed—that is, even before the subsequent stages of the research project.

Keywords: Character recognition, Optical character recognition, Optical music recognition

Gathering data for a thematic–bibliographic catalogue includes digitizing various sources and converting them into machine-readable digital formats. Although a considerable portion of holdings in Estonian museums and archives is already digitally accessible through cultural heritage preservation initiatives, many important musical sources still require digitization. To ensure that the digitized material (including image files in TIFF format or PDFs) is machine-readable, it must be further processed using optical character recognition (OCR) or optical music recognition (OMR) software.

While rapid advances in OCR have made it largely standard practice in digital archives, OMR is still awaiting more significant development, particularly in the recognition of handwritten notation. The goal of OMR is to convert musical notation from image files into a digital music notation format (MusicXML), making it editable in notation software (such as MuseScore). OMR would therefore not only streamline the creation of musical scores for works that currently exist only in manuscripts or outdated printed editions, but also open up rich new possibilities for data-driven music analysis.
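To suggest what data-driven analysis of OMR output might look like, the following minimal sketch reads the pitches from a small MusicXML fragment and counts the note names. The fragment is an invented example, not taken from any work by Eller, and the function name pitch_names is hypothetical. The sketch assumes uncompressed score-partwise MusicXML with whole-number alter values.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented one-measure example; real OMR output would be read from a file.
SAMPLE_MUSICXML = """<score-partwise version="3.1">
  <part-list>
    <score-part id="P1"><part-name>Example</part-name></score-part>
  </part-list>
  <part id="P1">
    <measure number="1">
      <note><pitch><step>C</step><octave>4</octave></pitch><duration>1</duration></note>
      <note><pitch><step>E</step><octave>4</octave></pitch><duration>1</duration></note>
      <note><rest/><duration>1</duration></note>
      <note><pitch><step>F</step><alter>1</alter><octave>4</octave></pitch><duration>1</duration></note>
      <note><pitch><step>C</step><octave>5</octave></pitch><duration>1</duration></note>
    </measure>
  </part>
</score-partwise>
"""


def pitch_names(musicxml):
    """Return pitches such as 'F#4' in document order; rests are skipped."""
    root = ET.fromstring(musicxml)
    names = []
    for pitch in root.iter("pitch"):
        step = pitch.findtext("step")
        octave = pitch.findtext("octave")
        # Simplification: only whole-semitone alterations are handled.
        alter = int(pitch.findtext("alter", default="0"))
        accidental = "#" * alter if alter > 0 else "b" * -alter
        names.append(f"{step}{accidental}{octave}")
    return names


names = pitch_names(SAMPLE_MUSICXML)
step_counts = Counter(name[0] for name in names)
print(names)
print(step_counts)
```

Once notation is available in this form, counts of pitches, intervals, or rhythmic values can be gathered across many scores, which is not feasible with image files alone.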

This stage of the project requires collaboration between the EAMT research group and cultural heritage institutions, primarily the National Library of Estonia and the Estonian Theatre and Music Museum (part of the Estonian History Museum), which play a leading role in preserving and digitizing musical manuscripts. Knowledge gained from the project will allow improvements to Estonian information systems (especially MuIS) and data collections from the perspective of music culture.

Keywords: Disseminating, Publishing, Archiving, Preserving
 
The work stages described above form the foundation for defining the requirements of an information system that consolidates Estonian musical heritage. The aim is to make musical information accessible “in one place, with one search,” borrowing the motto of E-Varamu. Somewhat similarly to E-Varamu, the catalogue to be developed would provide centralized access to data stored in other information systems (including manuscript metadata in RISM and PDF files in MuIS), while also offering a convenient browsing environment for sources and domain-specific functionalities such as character and music recognition (OCR and OMR), which are currently lacking in Estonian information systems.
 
An application programming interface (API) can be used to retrieve data from other information systems: it allows applications to communicate with each other, that is, to send queries and receive responses. For example, RISM provides an API through which information about all sources contained in the database can be requested.
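The query-and-response pattern can be sketched as follows. The base URL, the /sources/ path, the Accept header value, and the shape of the sample record are all assumptions made for illustration and should be checked against the current RISM Online API documentation; the function names are hypothetical. No network request is made unless fetch_json is called.

```python
import json
import urllib.request

# Assumption: RISM Online serves source records under this base URL and
# returns JSON-LD when it is requested through the Accept header.
BASE_URL = "https://rism.online"


def build_source_request(source_id):
    """Prepare (but do not send) a request for one source record."""
    url = f"{BASE_URL}/sources/{source_id}"
    return urllib.request.Request(url, headers={"Accept": "application/ld+json"})


def fetch_json(request, timeout=30):
    """Send the request and decode the JSON body of the response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def summarize(record):
    """Pick an identifier and a display label out of a decoded record.

    Assumption: the label is a mapping from language code to a list of strings.
    """
    labels = record.get("label", {})
    first = next(iter(labels.values()), [""])
    return {"id": record.get("id", ""), "label": first[0] if first else ""}


# Hypothetical response fragment, used instead of a live query.
SAMPLE_RECORD = {
    "id": "https://rism.online/sources/12345",
    "label": {"none": ["Placeholder title"]},
}

request = build_source_request("12345")
summary = summarize(SAMPLE_RECORD)
print(request.full_url)
print(summary)
```

In use, fetch_json(build_source_request(...)) would replace the sample record, and the decoded fields could then be mapped to the columns of the catalogue's own data model.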