Use cases for working photos of material culture researchers

This project investigates how working photographs collected by material culture researchers can be used to automatically identify objects and assess their condition, and how such a large collection can be published under FAIR principles in a user-friendly way. Researchers of material culture may take thousands of photos of objects for each research topic, but in most cases the photos are used only for typological identification, visual comparison of finds, and, in the case of better-formatted photos, illustrating publications. Once a project ends, most of the photos simply remain on the researchers' hard drives.

The project uses photos of archaeological glass finds taken in Estonia between 2012 and 2024. They were taken to catalogue the finds during the data collection phase of Monika Reppo’s master's and doctoral theses. The photos are of varying quality and were taken with several different devices. The aim of the project is to clean and publish the largely unpublished raw data (approx. 15,000 photos) of the objects preserved in Estonian memory institutions, and to determine whether artificial intelligence can be used to extract additional data from the photos, for example, to automatically identify object types or detect the deterioration of an object.

Workflow steps

Keywords: Annotating, Cataloguing, Comparing, Preprocessing

The purpose of this step is to collect and manage information about possible research directions and methods and to plan research outputs using information from past studies or ongoing projects. Since photographs of archaeological objects have been used in various computer vision research projects, the goal of this research phase is to look for examples and best practices that have already been implemented, and also to identify potential problems and obstacles. This could be described simply as collecting sources, but since the project period is limited, this step in the workflow helps to use the time as productively as possible and to arrive at the best possible research plan.

We will compile comparative tables of the models, programming languages, datasets, results, and outputs used in the collected publications. The same tabular method is used to summarise the types of research outputs and potential publication venues, recording, for example, the maximum length of articles, the referencing style, and the turnaround time, and for data, the maximum size of the dataset and the required license(s). This allows us to work purposefully towards our data publication goals from the very beginning of the research plan, and to compile and format the project data in such a way that publishing it takes less time.

Keywords: Data cleansing, Cropping, Editing, Naming convention, Preprocessing

The purpose of this step is to convert previously collected data into a form suitable for new research and to separate it from other data so that it can be analysed. To do this, we first need to determine the condition of the dataset being studied (in this case, the photos) and agree on a system that will be used consistently when organizing the data. In this study, this requires removing duplicate photos, moving from a multi-level folder system to a single-folder system, cropping individual photos with a scale (Adobe Photoshop), and renaming the photos. Both manual and automated approaches are tested to compare their accuracy, efficiency, and time consumption. During this phase, the data is also continuously backed up to an external hard drive and the cloud (Tallinn University Google Drive), which helps protect it from loss.
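The automated variant of the duplicate removal, folder flattening, and renaming described above could look roughly like the following minimal sketch. It is an illustration under stated assumptions, not the project's actual procedure: the function names, the list of image suffixes, and the `photo_00001` naming pattern are hypothetical, and only byte-identical copies are treated as duplicates (re-saved or resized versions of the same photo would not be caught).

```python
import hashlib
import shutil
from pathlib import Path

# Assumed set of image formats; adjust to the devices actually used.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def flatten_unique(source: Path, target: Path, prefix: str = "photo") -> dict[str, list[Path]]:
    """Copy one file per distinct content hash from a folder tree into one folder.

    Returns a log mapping each new file name to every original path that had
    the same content, so the renaming stays traceable. The target folder
    should lie outside the source tree.
    """
    target.mkdir(parents=True, exist_ok=True)
    by_hash: dict[str, list[Path]] = {}
    for path in sorted(source.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            by_hash.setdefault(file_digest(path), []).append(path)

    log: dict[str, list[Path]] = {}
    for index, originals in enumerate(by_hash.values(), start=1):
        first = originals[0]
        # Hypothetical naming convention: prefix plus a zero-padded counter.
        new_name = f"{prefix}_{index:05d}{first.suffix.lower()}"
        shutil.copy2(first, target / new_name)  # copy2 keeps file timestamps
        log[new_name] = originals
    return log
```

Copying rather than moving leaves the original folder structure untouched, which makes it easier to compare the automated result against a manual pass.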

The EXIF data, i.e. the metadata embedded in the file, must be checked or specified for each photo. At this stage, we plan to use OpenRefine, which is a free data cleaning tool. We will also use Microsoft Excel to manage the dataset, which allows us to filter and visualize the data and, if necessary, export it in CSV format for analysis in other programs. We will use Datasheets for Cultural Heritage (version 2) to describe the data for re-use and transparency. This step in the workflow allows for a better understanding and interpretation of the data being studied and forms the basis for data visualization, analysis of the (raw) data, and publication.
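As a starting point for the tabular dataset management described above, a simple file inventory can be written to CSV and then opened in OpenRefine or Excel. The sketch below is a hedged illustration: the function name and column names are hypothetical, and it records only file system attributes. It does not read EXIF fields, which would require a dedicated image metadata library.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical column names for the inventory table.
FIELDS = ["file_name", "extension", "size_bytes", "modified_utc"]


def write_inventory(photo_dir: Path, csv_path: Path) -> int:
    """Write one CSV row per file in photo_dir and return the number of rows.

    The timestamp is the file system modification time, not the EXIF capture
    time, so it should be treated as a hint only.
    """
    rows = []
    for path in sorted(photo_dir.iterdir()):
        if not path.is_file():
            continue
        stat = path.stat()
        modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
        rows.append({
            "file_name": path.name,
            "extension": path.suffix.lower(),
            "size_bytes": stat.st_size,
            "modified_utc": modified.isoformat(timespec="seconds"),
        })

    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Columns for checked or corrected EXIF values can then be added to the same table during cleaning.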

Keywords: Machine learning, Visual analysis

The main goal of this step is to use the cleaned raw data, i.e. artefact photos, to train a computer vision model. The ability of such a model to identify similar objects (object type) is tested on existing photos of varying quality, using a large dataset that has not been used before. We currently also plan to use photos of fragments of objects found during the same archaeological excavation (from the same site) to train the model to estimate the probability that the fragments belong to the same artefact.

Since several objects have been photographed repeatedly over a period of 12 years, another objective is to determine possible signs of deterioration. In the case of glass, these include iridescence (a rainbow-coloured layer on the surface of the glass), blooming (microfractures), and other physical damage (e.g., fragments breaking off the object, shattering). The result of this stage will reveal whether the photos taken by researchers can be used by memory institutions to assess changes in the condition of objects, which would ensure better preservation of cultural heritage. The model to be used is still being selected, and trials with several models are planned.
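Before any model is involved, the repeated photographs of the same object need to be paired up in date order. The following sketch shows one way this could be done, assuming a purely hypothetical naming convention of the form `<object_id>__<YYYY-MM-DD>.<ext>`; the project's real naming convention and the function name used here are not given in the text and are assumptions.

```python
from collections import defaultdict
from datetime import date


def comparison_pairs(file_names: list[str]) -> list[tuple[str, str, str, int]]:
    """Pair consecutive photos of the same object for condition comparison.

    Assumes names like 'obj001__2013-05-02.jpg' (hypothetical convention).
    Returns tuples of (object_id, earlier_file, later_file, days_between).
    Names that do not follow the convention raise ValueError.
    """
    by_object: dict[str, list[tuple[date, str]]] = defaultdict(list)
    for name in file_names:
        stem = name.rsplit(".", 1)[0]
        object_id, _, day = stem.partition("__")
        by_object[object_id].append((date.fromisoformat(day), name))

    pairs = []
    for object_id, shots in sorted(by_object.items()):
        shots.sort()  # chronological order within each object
        for (day_a, name_a), (day_b, name_b) in zip(shots, shots[1:]):
            if day_b > day_a:  # skip photos taken on the same day
                pairs.append((object_id, name_a, name_b, (day_b - day_a).days))
    return pairs
```

Each resulting pair could then be inspected manually or passed to whichever model is eventually selected.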

Keywords: Permanent identifier, Posting, Publishing

This step in the workflow is two-tiered. Since the goal of the project is to publish the collected raw data, i.e., a large collection of working photos, publication is planned for after the data has been cleaned, but before the images showing multiple finds are cropped into individual photos. The collection of original, unedited photographs is planned to be published through DataDOI (DataCite Estonia); metadata (.txt format) based on Datasheets for Cultural Heritage will be published alongside the data. This way, the raw data will receive a persistent identifier (DOI). We also plan to write a research article about the process of organizing and publishing the working photos. In addition, we will explore creating an IIIF manifest for each image, which would enable interoperability, better access, and long-term sustainability. If this is successful, IIIF versions of the images will also be available to researchers.
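To give an idea of what a per-image IIIF manifest involves, the sketch below builds a minimal manifest following the structure of the IIIF Presentation API 3.0 (a Manifest containing one Canvas, one AnnotationPage, and one painting Annotation). The function name, the URLs, the label, and the pixel dimensions are placeholders chosen for illustration, and a production manifest would normally carry further metadata such as rights and provenance statements.

```python
import json


def build_manifest(manifest_id: str, label: str, image_url: str,
                   width: int, height: int) -> dict:
    """Return a minimal IIIF Presentation 3.0 manifest for a single image."""
    canvas_id = f"{manifest_id}/canvas/1"
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": manifest_id,
        "type": "Manifest",
        "label": {"en": [label]},
        "items": [{
            "id": canvas_id,
            "type": "Canvas",
            "width": width,
            "height": height,
            "items": [{
                "id": f"{canvas_id}/page/1",
                "type": "AnnotationPage",
                "items": [{
                    "id": f"{canvas_id}/annotation/1",
                    "type": "Annotation",
                    "motivation": "painting",
                    "body": {
                        "id": image_url,
                        "type": "Image",
                        "format": "image/jpeg",
                        "width": width,
                        "height": height,
                    },
                    # The annotation paints the image onto its canvas.
                    "target": canvas_id,
                }],
            }],
        }],
    }


# Placeholder values only; example.org is not a real hosting location.
manifest = build_manifest(
    "https://example.org/iiif/photo_00001/manifest",
    "Working photo of a glass find",
    "https://example.org/images/photo_00001.jpg",
    width=4000,
    height=3000,
)
manifest_json = json.dumps(manifest, indent=2)
```

Generating the manifests from the cleaned file inventory would keep the IIIF layer in step with the published dataset.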

The experience gained during the analysis and use of the data, the model created, and the results obtained will also be published separately. The workflow and results will be shared on social media and in scientific reports on an ongoing basis. Publication involves monitoring the publication process and media plan, as there are several parallel steps in the workflow. The published data and articles are compiled in chronological order with classifiers and full references. In addition to supporting project reporting and management, this list is also helpful for designing new research projects.