Ukrainian folk songs

Literary Museum, Estonian Folklore Archives

This dataset contains 11 digitized collections of Ukrainian folk songs (8 collections of epic songs and 3 collections of folk songs from the Podillia region), inviting you to engage not only with texts, but with living cultural history: songs that shaped ideas of heroism, memory, and community in Ukraine.

This dataset contains 11 digitized collections of Ukrainian folk songs published in the 20th and 21st centuries, unified into a structured CSV format suitable for computational analysis and digital humanities research. Each file corresponds to a separate published collection, enabling both within-collection and cross-collection analysis. The corpus consists of two main components: eight collections of Ukrainian epic songs (dumas) and three collections of folk songs of various genres from the Podillia region.
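Because each collection is a separate CSV file, cross-collection work usually starts by reading all files into one table while remembering where each row came from. The following is a minimal sketch using only the Python standard library. The folder name in the usage note and the added `source_file` key are assumptions made for illustration; the actual file and column names should be checked against the dataset.

```python
import csv
from pathlib import Path


def load_collections(folder):
    """Read every CSV file in `folder` into one list of row dictionaries."""
    rows = []
    for path in sorted(Path(folder).glob("*.csv")):
        # utf-8-sig also accepts files that were saved with a byte-order mark.
        with path.open(newline="", encoding="utf-8-sig") as handle:
            for row in csv.DictReader(handle):
                # Hypothetical extra key: records which collection file
                # the row was read from (file name without extension).
                row["source_file"] = path.stem
                rows.append(row)
    return rows
```

Calling `load_collections("ukrainian_folk_songs")` (a hypothetical folder name) would return one dictionary per CSV row, which can then be grouped by `source_file` for within-collection analysis or used as a whole for cross-collection analysis.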
 
The texts are accompanied by metadata, including genre, year and place of recording, collector, performer, and, where available, the respondent’s age or year of birth. This contextual information makes the dataset particularly valuable for studying oral tradition within its social, geographic, and historical settings.
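Since some metadata is present only "where available", it can help to measure how completely each field is filled before relying on it. The sketch below computes the share of non-empty values per field. The column names (`genre`, `year`, `performer`) and the toy rows are assumptions for illustration, not values taken from the dataset.

```python
from collections import Counter


def coverage(rows, fields):
    """Return, for each field, the share of rows with a non-empty value."""
    filled = Counter()
    for row in rows:
        for field in fields:
            # Treat missing keys, None, and whitespace-only strings as empty.
            if (row.get(field) or "").strip():
                filled[field] += 1
    total = len(rows)
    return {f: (filled[f] / total if total else 0.0) for f in fields}


# Invented toy rows with hypothetical column names.
sample = [
    {"genre": "duma", "year": "1908", "performer": ""},
    {"genre": "duma", "year": "", "performer": "kobzar"},
]
print(coverage(sample, ["genre", "year", "performer"]))
```

Fields with low coverage may need to be excluded from an analysis or reported with a note on how many records they actually describe.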
 
The Podillia region materials are linguistically closer to modern standard Ukrainian, partly due to editorial normalization, which makes them well-suited for corpus-based and computational text analysis. In contrast, the epic dumas exhibit greater textual and structural complexity, preserving formulaic language, extended narratives, and performance-driven variation. Their written recordings reflect multiple historical spelling systems, including Russian-based orthography of the late 19th century, as well as several Ukrainian spelling norms (maksymovychivka, kulishivka, dragomanivka, and zhelehivka).
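To give a sense of what handling this spelling variation might involve, here is a rough rule-based sketch. The three substitutions (dropping a word-final hard sign, `ѣ` to `і`, `ы` to `и`) are illustrative assumptions based on commonly cited features of Russian-based spelling of Ukrainian. They are not a validated rule set, and each historical system listed above would need its own rules checked against the actual collections.

```python
import re
import unicodedata

# Illustrative substitutions only (an assumption, not a complete mapping).
CHAR_MAP = str.maketrans({"ѣ": "і", "Ѣ": "І", "ы": "и", "Ы": "И"})

# A hard sign directly before a word boundary, i.e. at the end of a word.
FINAL_HARD_SIGN = re.compile(r"[ъЪ]\b")


def normalize(text):
    """Apply a few illustrative spelling substitutions to `text`."""
    # Compose characters first so that letters with diacritics compare equal.
    text = unicodedata.normalize("NFC", text)
    text = FINAL_HARD_SIGN.sub("", text)
    return text.translate(CHAR_MAP)


print(normalize("козакъ"))
```

Keeping the original spelling in a separate column next to the normalized form would make it possible to compare tokenization and model output on both versions.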
 
Possible project ideas:
– Identifying and visualizing imbalances in genre, time, or metadata coverage.
– Detection of formulaic expressions and recurring motifs in epic songs, with possible comparison to Estonian runosongs.
– Automatic detection and normalization of orthographic variants.
– Measuring the impact of spelling variation on tokenization and NLP performance.
– Evaluating topic modelling methods on epic genres.
– Named entity and social role extraction (e.g., heroes, places, figures).
– Cross-corpus comparison with Estonian or Finnish folk song datasets.
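
As a possible starting point for the formulaic-expression idea above, the sketch below counts word n-grams that recur across different texts. Treating repeated n-grams as a proxy for formulas is a simplifying assumption, and the two toy strings are invented placeholders, not quotations from the corpus.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by an apostrophe (as in Ukrainian words).
TOKEN = re.compile(r"[^\W\d_]+(?:['’ʼ][^\W\d_]+)*")


def repeated_ngrams(texts, n=3, min_texts=2):
    """Return word n-grams that occur in at least `min_texts` different texts."""
    doc_freq = Counter()
    for text in texts:
        tokens = TOKEN.findall(text.lower())
        # A set, so each n-gram is counted at most once per text.
        grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        doc_freq.update(grams)
    return {" ".join(g): c for g, c in doc_freq.items() if c >= min_texts}


# Invented toy strings used only to show the call.
toy = ["на синьому морі буря", "гей на синьому морі хвиля"]
print(repeated_ngrams(toy))
```

Exact matching like this will miss formulas that differ only in spelling or inflection, so results on the dumas are likely to depend on how orthographic variants are handled first.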