Comparative Topic Analysis of Ukrainian and Estonian Folk Songs Using AI Translation and Computational Methods

The aim of the research was to identify thematic overlap, cultural similarities, and unique topics in Ukrainian and Estonian folk songs using computational methods, particularly LLM translation and LDA topic modelling.

Although Ukrainians and Estonians belong to different linguistic and cultural traditions (East Slavic and Finnic, respectively), they shared periods of historical contact that may be reflected in their folklore. From the early Middle Ages, both regions were connected through north–south trade networks, most notably the route “from the Varangians to the Greeks” linking the Baltic and Black Sea regions (Pritsak 1981). These channels facilitated not only economic exchange but also the transmission of narrative motifs, ritual structures, and mythological ideas. Ukrainian and Estonian folklore exhibit notable thematic parallels despite their linguistic distance, which makes them well suited for comparative analysis.

The research addresses three primary questions: (1) What underlying themes, motifs, and narrative structures can be identified within Ukrainian and Estonian folk songs using topic modelling? (2) How do the thematic structures derived from computational analysis align with traditional folkloristic classifications? (3) How does the use of translation impact the analysis of thematic overlap between these two languages?

Workflow steps

Keywords: Collecting, Organizing

The first step of the case study focuses on assembling the two corpora that form the basis of the entire analysis: 2,762 Ukrainian folk songs from the Podillia region, collected between 1918 and 2013 (Dei 1965; Dmytrenko & Yefremova 2014; Myshanych 1976); and a corpus of Estonian folk songs (ERAB) maintained by the Estonian Literary Museum (Sarv & Oras 2020), with a focus on the songs from Järvamaa, as the dialects of this region are among the closest to the Estonian written language. The Estonian dataset included digital texts of handwritten archival documents from the years 1833–1908.

The goal here was to ensure that both datasets were comparable in size, structure, and representativeness so that later computational methods could identify shared and culture-specific themes. For example, refrains and whole-line repetitions were removed from the Ukrainian dataset, because such repeated lines could artificially inflate word frequencies and distort thematic clustering. On the Estonian side, the initial retrieval from the FILTER database via SQL queries yielded 6,553 Järvamaa songs. To ensure comparable corpus sizes, the Järvamaa dataset was reduced to 2,852 songs, retaining primarily the typical runosong texts (the database also contains poetic items from the genre’s hybrid periphery and from other genres).
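The removal of whole-line repetitions can be illustrated with a minimal sketch. It assumes that each song is stored as plain text with one verse line per row; the function name and the case-insensitive matching rule are assumptions made for this example, not the project's actual script.

```python
def drop_repeated_lines(song_text: str) -> str:
    """Keep only the first occurrence of each line in a song.

    Hypothetical helper: lines are compared case-insensitively and with
    normalised spacing, so a refrain repeated verbatim is kept once.
    """
    seen = set()
    kept = []
    for line in song_text.splitlines():
        key = " ".join(line.lower().split())  # normalise case and spacing
        if not key:
            continue  # skip empty lines
        if key in seen:
            continue  # refrain or repeated line: drop it
        seen.add(key)
        kept.append(line.strip())
    return "\n".join(kept)


# Invented example text, not taken from the corpus.
song = (
    "Oh in the field a well\n"
    "Oh in the field a well\n"
    "A girl drew water\n"
    "Oh in the field a well"
)
cleaned = drop_repeated_lines(song)
print(cleaned)
```

Dropping every later occurrence is the strictest possible rule; a real cleaning step might instead remove only adjacent repetitions or lines marked as refrains in the source edition.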

Keywords: Translating, Enriching, Interpreting

The translation stage ensures that Ukrainian and Estonian folk songs can be compared within the same linguistic framework, which is essential for reliable topic modelling and clustering. To achieve this, the next step was to select an appropriate translation model and to develop an effective translation workflow. Several AI models were tested, and Claude 3.5 Sonnet was selected because it demonstrated the best understanding of regional dialects, folkloric vocabulary, and the poetic structures that characterize these songs.
A dedicated script was developed to translate the songs via the LLM, incorporating each song's title and genre. The prompt was iteratively refined and included a dialect description for cultural context. An important guiding principle was to preserve cultural nuances while ensuring literal accuracy, because computational analysis requires a stable and consistent vocabulary. The full translation process resulted in a combined dataset of 5,614 songs (Ukrainian + Estonian), in which each text has both its original and its translated form. The translation step is tightly connected to the next phases: consistent English texts provide the basis for preprocessing, vectorization, and ultimately the cross-linguistic thematic comparison that lies at the heart of the project.
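How title, genre, and dialect description might be combined into a single request can be sketched as follows. The function name, the field labels, and the prompt wording are all hypothetical; the project's actual prompt is not reproduced here, and the call to the translation model itself is deliberately left out.

```python
def build_translation_prompt(text, title, genre, dialect_note):
    """Assemble one translation request (illustrative wording only)."""
    return (
        "Translate the following folk song into English.\n"
        "Translate literally and consistently, and keep culturally "
        "specific terms recognisable.\n"
        f"Dialect and cultural context: {dialect_note}\n"
        f"Title: {title}\n"
        f"Genre: {genre}\n"
        "Song text:\n"
        f"{text}"
    )


# Placeholder values; in practice these would come from the song metadata.
prompt = build_translation_prompt(
    text="<original song text>",
    title="<song title>",
    genre="wedding song",
    dialect_note="<short description of the regional dialect>",
)
print(prompt)
```

Keeping the prompt construction in one function makes it easier to refine the wording iteratively while applying exactly the same instructions to every song.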

Keywords: Preprocessing, Natural Language Processing, Lemmatizing

After translation, the texts had to be prepared for computational analysis. This step involved standardizing the English texts: converting text to lowercase, removing punctuation, applying lemmatization, and removing stopwords. Lemmatization brings different grammatical forms of a word (e.g. "sing," "sings," "singing") back to a common base form, ensuring that the model interprets these variations as one concept rather than several unrelated ones. Removing common stopwords (such as "and," "the," "but") prevents function words from dominating the analysis. In addition, the corpus was filtered by part of speech to retain only the word classes most relevant for thematic analysis: nouns, verbs, adjectives, and adverbs. These content words carry the semantic information that topic modelling relies on to detect patterns. Preprocessing choices directly influence the quality and interpretability of the topic modelling and clustering results.
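The order of the preprocessing operations can be shown with a small self-contained sketch. The stopword list and the lemma and part-of-speech table below are toy stand-ins invented for this example; an actual pipeline would obtain lemmas and part-of-speech tags from an NLP library, and the source does not state which one was used.

```python
import string

# Tiny illustrative stopword list (assumption, not the project's list).
STOPWORDS = {"and", "the", "but", "a", "in", "of", "to"}

# Toy lexicon mapping word forms to (lemma, part of speech).
LEXICON = {
    "sing": ("sing", "VERB"),
    "sings": ("sing", "VERB"),
    "singing": ("sing", "VERB"),
    "girl": ("girl", "NOUN"),
    "girls": ("girl", "NOUN"),
    "green": ("green", "ADJ"),
    "meadow": ("meadow", "NOUN"),
    "sweetly": ("sweetly", "ADV"),
    "oh": ("oh", "INTJ"),
}

CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}


def preprocess(text):
    """Lowercase, strip punctuation, drop stopwords, lemmatize, filter by POS."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = []
    for word in text.split():
        if word in STOPWORDS:
            continue
        # Words missing from the toy lexicon get the tag "X" and are dropped.
        lemma, pos = LEXICON.get(word, (word, "X"))
        if pos in CONTENT_POS:
            tokens.append(lemma)
    return tokens


tokens = preprocess("Oh, the girls sing and the girl sings in the green meadow!")
print(tokens)
```

In this toy sentence the interjection and the function words disappear, and the different forms of "sing" and "girl" collapse into one lemma each.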

Keywords: Compiling, Parsing, Transcoding

Once the texts were cleaned, the Ukrainian and Estonian corpora were merged into a single combined dataset. The goal is to allow comparative analysis: the algorithms must analyse both corpora using the same feature space so that similarities and differences emerge from the same computational framework. To maintain clarity, each song is labelled with its cultural origin, ensuring that later visualizations and statistical analyses can distinguish between Ukrainian and Estonian materials.

Constructing this unified dataset also involves a careful organizational step: arranging the text, metadata, and corpus labels into a structured format. This integrated dataset becomes the backbone for all subsequent feature extraction, modelling, and interpretation. Without consistent structure and clear labelling, it would be impossible to compare themes across languages or evaluate how cultural traditions shape the symbolism and narrative patterns found in the songs.
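A structured format of this kind could look like the following sketch, in which every song becomes one record carrying its corpus label. The field names and the identifier scheme are assumptions made for illustration; the source does not specify the actual data layout.

```python
def merge_corpora(ukrainian, estonian):
    """Combine two lists of song records into one labelled dataset.

    Each input record is assumed to be a dict with "original" and
    "translation" keys and optional "title" and "genre" keys.
    """
    combined = []
    for label, songs in (("ukrainian", ukrainian), ("estonian", estonian)):
        for i, song in enumerate(songs):
            combined.append({
                "id": f"{label[:2]}-{i:04d}",   # hypothetical identifier scheme
                "corpus": label,                # cultural origin of the song
                "title": song.get("title", ""),
                "genre": song.get("genre", ""),
                "original": song["original"],
                "translation": song["translation"],
            })
    return combined


# Placeholder records standing in for the real corpora.
uk_songs = [{"original": "<uk text>", "translation": "<en text>", "genre": "ritual song"}]
et_songs = [{"original": "<et text>", "translation": "<en text>"}]
dataset = merge_corpora(uk_songs, et_songs)
print([record["corpus"] for record in dataset])
```

Because the label travels with every record, later steps can split any result by corpus without consulting the source files again.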

Keywords: Extracting, Data mining

Before computational models can identify patterns in the songs, the texts must be transformed into a format that allows them to be compared systematically. While humans can recognise similarities between texts just by reading and comparing them, algorithms require a numerical representation of the content. Feature extraction and vectorization therefore serve as a crucial bridge between qualitative material and quantitative analysis.


In this workflow, TF-IDF vectorization is used to convert each song into a numerical vector. This widely adopted method captures how characteristic each word is for a given text by considering both its frequency within the song (term frequency, TF) and its distribution across the entire corpus (inverse document frequency, IDF). With the minimum document frequency set to 5%, only terms that occur in at least 5% of the songs are retained, so the analysis focuses on terms that appear frequently enough to reflect cultural themes.
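The weighting and the document-frequency cut-off can be made concrete with a minimal pure-Python sketch. It uses one common IDF variant, ln(N / df) + 1, which is an assumption here; library implementations differ in smoothing and normalisation, and the source does not state which settings were used beyond the 5% threshold.

```python
import math
from collections import Counter


def tfidf(docs, min_df=0.05):
    """Return (vocabulary, matrix) for a list of tokenised documents.

    Terms occurring in a smaller share of documents than min_df are dropped.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(t for t, c in df.items() if c / n_docs >= min_df)
    idf = {t: math.log(n_docs / df[t]) + 1.0 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for empty documents
        rows.append([counts[t] / length * idf[t] for t in vocab])
    return vocab, rows


# Invented toy corpus; min_df is raised to 0.5 so the filter is visible
# with only four documents.
docs = [
    ["girl", "sing", "meadow"],
    ["girl", "water", "well"],
    ["cossack", "horse", "girl"],
    ["sing", "horse"],
]
vocab, matrix = tfidf(docs, min_df=0.5)
print(vocab)
```

In the toy corpus, "girl" occurs in three of four documents and therefore receives a lower weight than the rarer "sing" within the first document, while terms found in only one document are filtered out.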


The resulting TF-IDF matrix serves multiple purposes: it provides the input for the LDA topic models, supports hierarchical clustering with Ward’s method, and serves as the high-dimensional input for t-SNE, which projects the songs onto a two-dimensional map that helps visualize clusters, lexical similarity, and relationships between texts. While the TF-IDF + t-SNE workflow produces a spatial representation of the corpus based on shared vocabulary and lexical patterns, it remains focused on song-to-song proximity. Topic modelling, by contrast, works at a more abstract level by detecting the latent themes that shape the entire corpus.

Keywords: Topic Modeling, Data mapping, Cluster analysis

The next phase applies machine-learning methods to identify hidden thematic structures within the dataset. Topic modelling makes it possible to detect recurring themes across large bodies of text by identifying groups of words that tend to appear together. Latent Dirichlet Allocation (LDA) was used first to discover 35 topics, each defined by a coherent set of words that frequently co-occur. Each song usually relates to several topics; to compare the distribution of topics in Ukrainian and Estonian songs, only the dominant topic of each text was considered. A complementary analysis with BERTopic offered an alternative modelling perspective with more fine-grained topic divisions (95 topics), enriching the interpretive possibilities.
Alongside the topic models, we used hierarchical clustering and t-SNE to visualize lexical relationships among songs. Ward’s hierarchical clustering groups songs based on similarity in word use, while t-SNE creates a two-dimensional map that illustrates how the songs form clusters or overlap. All four methods (LDA, BERTopic, t-SNE, and hierarchical clustering) reveal connections and distinctions across the two oral traditions, each from a complementary perspective: the topic models provide structured thematic categories, while t-SNE and hierarchical clustering illustrate broader lexical and semantic relationships among the songs.
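The reduction of each song to its dominant topic can be sketched as follows, assuming that a fitted topic model has already produced one row of topic weights per song. The function names and the toy weights are invented for illustration and do not reproduce the project's results.

```python
from collections import Counter


def dominant_topic(weights):
    """Index of the topic with the highest weight for one song."""
    return max(range(len(weights)), key=weights.__getitem__)


def topic_counts_by_corpus(doc_topic, labels):
    """Count, per corpus label, how many songs have each dominant topic."""
    counts = {}
    for weights, label in zip(doc_topic, labels):
        counts.setdefault(label, Counter())[dominant_topic(weights)] += 1
    return counts


# Invented document-topic weights for four songs and three topics.
doc_topic = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.5, 0.4, 0.1],
    [0.2, 0.2, 0.6],
]
labels = ["ukrainian", "ukrainian", "estonian", "estonian"]
counts = topic_counts_by_corpus(doc_topic, labels)
print(counts)
```

Keeping only the dominant topic discards the information that a song relates to several topics, which is the price paid for a simple per-corpus comparison.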

Keywords: Analyzing, Interpreting, Contextualizing, Explanation, Reasoning

Once thematic structures have been identified, the goal is to interpret the results in light of folkloristic knowledge and cultural contexts. This step involves comparing how Ukrainian and Estonian songs engage with themes such as family life, courtship, agricultural work, and rituals. To distinguish universal from culture-specific themes, we examine the prevalence of topics across the two corpora. Topics that appear in both Ukrainian and Estonian songs (such as family and kinship, love and courtship, nature and landscapes, and work and daily life) can be interpreted as shared, reflecting common aspects of oral tradition, social structures, and cultural memory. Topics dominated by songs from only one corpus, by contrast, can be interpreted as culture-specific, revealing regionally distinctive themes or narrative emphases.
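One way to operationalise the distinction between shared and culture-specific topics is a simple share threshold, sketched below. The 80% cut-off, the function name, and the counts are assumptions chosen for illustration; the source does not state a numerical criterion, and the interpretation ultimately rests on folkloristic judgement.

```python
def classify_topic(uk_count, et_count, threshold=0.8):
    """Label a topic by the share of its songs coming from each corpus.

    threshold is a hypothetical cut-off: a topic counts as culture-specific
    when at least this share of its songs comes from a single corpus.
    """
    total = uk_count + et_count
    if total == 0:
        return "empty"
    if uk_count / total >= threshold:
        return "ukrainian-specific"
    if et_count / total >= threshold:
        return "estonian-specific"
    return "shared"


# Invented dominant-topic counts (Ukrainian songs, Estonian songs) per topic.
topic_counts = {"topic_a": (120, 95), "topic_b": (88, 2), "topic_c": (3, 70)}
labels = {name: classify_topic(uk, et) for name, (uk, et) in topic_counts.items()}
print(labels)
```

Since the two corpora are of similar size (2,762 and 2,852 songs), raw counts are compared directly here; with unbalanced corpora the counts would first need to be normalised by corpus size.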

Crucially, this interpretive step connects computational results with traditional folkloristic scholarship. Topic clusters are evaluated alongside established folkloristic classifications (e.g., Cossack songs, ritual songs, and wedding songs), ethnographic descriptions, and theories of oral tradition, which ensures that the computational findings are not only statistically grounded but also culturally meaningful. Topic modelling can also uncover sub-themes within these categories. For example, songs classified under “love and courtship” in traditional scholarship may include both happy and tragic narratives, whereas computational clustering separates them based on word usage patterns, providing a more nuanced view of thematic variation. The comparative analysis thus transforms numerical patterns into insights about how different communities express identity, relationships, emotions, and social values through song.

Keywords: Data Visualization, Design, Diagramming, Graphics programming

The final step involves presenting the results through visualization. Dendrograms illustrate how songs cluster hierarchically, while t-SNE plots provide a low-dimensional representation of the textual data, enabling the visualization of patterns and relationships between individual texts based on their TF-IDF representations. Although t-SNE does not directly identify topics, it reflects the similarity of texts in terms of their word distributions, so clusters in the plot often correspond to thematic groupings discovered in topic modelling. Additional visualizations help communicate topic proportions, thematic overlaps, and cross-cultural similarities. These visuals are essential for making complex computational findings understandable to readers.

The results have three main forms of scholarly presentation:

a. conference presentation

Visual plots and thematic summaries were integrated into slides for the DHNB 2025 conference, allowing audiences to follow the workflow and interpret the findings through concrete examples.

b. workflow documentation at HUMAL

A more detailed and structured explanation is prepared for HUMAL, which aims to document the process step-by-step, showing how each methodological choice influences the final results.

c. journal article

The process of the research and its results are planned to be published as an article in a peer-reviewed journal. For publication, visualizations support the argumentation of the article, illustrating cross-cultural comparisons of the thematic structures.

Works cited

Dei, Oleksii (ed.). 1965. Pisni Yavdokhy Zuikhy: zapysav Hnat Tantsiura [Songs of Yavdokha Zuikha: recorded by Hnat Tantsiura]. Kyiv: Naukova dumka. 810 pp.

Dmytrenko, Mykola & Liudmyla Yefremova (eds.). 2014. Narodni pisni Khmelnychchyny (z kolektsii zbyrachiv folkloru) [Folk songs of Khmelnytskyi region (from the collections of folklore collectors)]. Kyiv: Naukova dumka. 720 pp.

Myshanych, Stepan (ed.). 1976. Pisni Podillia: zapysy Nasti Prysiazhniuk v seli Pohrebyshche. 1920–1970 rr. [Songs of Podillia: recordings of Nastia Prysiazhniuk in the village of Pohrebyshche. 1920–1970]. Kyiv: Naukova dumka. 520 pp.

Pritsak, Omeljan. 1981. The Origin of Rus: Old Scandinavian Sources Other than the Sagas. Cambridge, Massachusetts: Harvard University Press.

Sarv, Mari & Janika Oras. 2020. From tradition to data: The case of Estonian runosong. Arv. Nordic Yearbook of Folklore 76, 105–117.