Researchers from the University of Latvia’s Faculty of Humanities have developed a new and unique digital resource, the “Database of Latvian Morphemes and Word‑Formation Models (LVMVMD)”, which systematically compiles data on the structure and formation of Latvian words. The database is built on the analysis of more than 75,000 lemmas extracted from the Balanced Corpus of Modern Latvian (LVK2018), a comprehensive digital collection of contemporary Latvian texts.
The new resource is useful beyond linguistics: it helps analyse the development of the language, build corpora and dictionaries, improve machine translation, and develop artificial‑intelligence tools tailored to the Latvian language. It makes it possible to study how Latvian words are formed, how they are interconnected, and how the language changes over time.
What are morphemes?
Every Latvian word we use in everyday life consists of smaller units, or morphemes – the root, prefix, suffix and ending – each of which carries its own meaning. Metaphorically, one can say that they resemble the skeleton and bones in the anatomy of living beings.
The most important morpheme is the root, because it carries the core lexical meaning of the word.
Other morphemes may be added to the root in various combinations, and a word may also have more than one root. For example, the word “saule” consists of two morphemes – the root saul- and the ending -e, whereas in the word “saulīte” the suffix -īt- is inserted between the root and the ending, so the word contains three morphemes: saul-īt-e.
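The segmentation described above can be sketched as a small data structure. This is only an illustration, not the actual LVMVMD schema: the `Morpheme` class, its field names and the helper functions are hypothetical, and only the two example words from the text are used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    """One morpheme of a word (hypothetical structure, not the LVMVMD format)."""
    form: str  # written form, e.g. "saul"
    kind: str  # "root", "prefix", "suffix" or "ending"


def segmented(morphemes):
    """Join the morphemes with hyphens to show the word's internal structure."""
    return "-".join(m.form for m in morphemes)


def surface(morphemes):
    """Join the morphemes without separators to get the written word."""
    return "".join(m.form for m in morphemes)


# The two examples from the text: root + ending, and root + suffix + ending.
saule = [Morpheme("saul", "root"), Morpheme("e", "ending")]
saulite = [
    Morpheme("saul", "root"),
    Morpheme("īt", "suffix"),
    Morpheme("e", "ending"),
]

for word in (saule, saulite):
    print(surface(word), segmented(word), len(word))
```

Keeping the morpheme type next to each form is one possible design; it lets the same list answer both "how is the word written?" and "which part is the root?".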
When learning a language and its structure, we also acquire its morphemes, their combinations and the meanings they carry. Without this skill we would not be able to communicate, because it allows us to form the expressions we need in each situation – to combine morphemes into words, arrange words into sentences, and create texts according to the purpose of communication.
By exploring morphemes, one can better understand not only the principles by which words are formed, but also how human thinking and language as a whole operate, and what associations, metaphors and metonymies underlie our linguistic and extralinguistic perception. For example, many Latvian speakers may be surprised that some berry and mushroom names with the suffix -en- are derived from animal and bird names, such as avene–avs, kazene–kaza, lācene–lācis, as well as cūcene–cūka, gailene–gailis.
The database reveals word relatedness
Each word in the newly created database is divided into morphemes and classified according to word‑formation models, making it possible to determine which word is primary, which is derived or compound, as well as the methods by which words are formed in Latvian.
The database also distinguishes homonyms and homographs, because they follow different word‑formation models in the language. Homonyms are words that are pronounced and written the same but have different meanings, for example, dumpis meaning “waterfowl” and dumpis meaning “uprising”. Homographs are words that are written the same but pronounced differently and have different meanings, for example, zāle meaning “grass”, “herbs” and zāle meaning “large hall”.
Another benefit is the marking of borrowed words with a special indication, because the division into morphemes and word‑formation models usually does not coincide for inherited Latvian words and borrowings. Moreover, borrowed words often cannot be divided further, because the components of the original language differ from the elements of Latvian. For example, from the perspective of Latvian, for borrowed words such as “kupols”, “ingvers”, “panelis”, we can identify only the ending -s or -is, but not the root, prefix or suffix.
In the database, words are arranged in nests according to a common root, making it easy to trace their origin, see the formation of derivatives and compounds, and notice recurring word‑formation models and their meanings. If a nest is based on a primary verb, about one hundred different derivatives and compounds often cluster around it.
For example, the historically related inherited roots ved-, ves-, ve-, vez-, vež-, vad-, vaz- and važ-, all of which are variants of the same original root, appear in Latvian words such as vest, vedējs, pavediens, vedekla, vezums, vadīt, vadība, vadītājs, vads, novads, vazāt, važa, barvedis, tiesvedība, apvedceļš, asinsvads, vadlīnija, etc.
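Grouping words into nests by a shared root can be sketched as follows. This is a minimal illustration and not the database's own method or format: the nest labels, the `build_nests` function and the hand-assigned root for each word are assumptions made for the example, while the root variants and the words themselves come from the text.

```python
from collections import defaultdict

# Root variants named in the article, all mapped to one nest.
# The nest label "ved-" is a hypothetical choice for this sketch.
VARIANTS = ("ved", "ves", "ve", "vez", "vež", "vad", "vaz", "važ")
NEST_OF = {variant: "ved-" for variant in VARIANTS}
NEST_OF["saul"] = "saul-"  # a second, unrelated nest for contrast

# (word, root) pairs; the roots are assigned by hand for illustration only.
WORDS = [
    ("vest", "ves"),
    ("vedējs", "ved"),
    ("vezums", "vez"),
    ("vadīt", "vad"),
    ("vazāt", "vaz"),
    ("važa", "važ"),
    ("saule", "saul"),
    ("saulīte", "saul"),
]


def build_nests(words, nest_of):
    """Group words under the nest that their root variant belongs to."""
    nests = defaultdict(list)
    for word, root in words:
        nests[nest_of[root]].append(word)
    return dict(nests)


nests = build_nests(WORDS, NEST_OF)
for label, members in nests.items():
    print(label, members)
```

Mapping every variant to a single nest label is what allows words with different surface roots, such as vest and vadīt, to end up in the same group.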
Useful not only for linguists
At present, anyone interested can access the new database in the GitHub repository, where it is also possible to learn about its creation principles, the Latvian language material included, and its classification. During 2026, a user manual for the database will be developed in Latvian and English. The resource has been created in accordance with international and modern standards for digital language resources.
The database will be useful not only for linguists, but also for computational linguists, translators, information‑technology specialists, developers of artificial‑intelligence tools, compilers of corpora, databases and dictionaries, and teachers and learners of Latvian.
The new resource provides an important foundation for further data‑based research on Latvian grammar, word formation and other aspects, as well as for the development of various language‑learning and usage materials and manuals, since there is currently a lack of digital language resources in this field.
Without comprehensive research into the word‑formation system, it is not possible to fully understand other subsystems of the language – grammar, vocabulary, pragmatics, semantics and their use.
The article was prepared within the project “Database of Latvian Morphemes and Word‑Formation Models (LVMVMD)” (No. lzp‑2022/1‑0013) of the Fundamental and Applied Research Programme of the Latvian Council of Science. More information is available at https://www.dlmdm.lu.lv/