On December 8, the MCU’s Institute of Foreign Languages and International Relations Department welcomed Professor Alexei Lavrentiev of the École normale supérieure de Lyon. Professor Lavrentiev presented the lecture “Humanities scholars and computer technologies: checkered history of relations.” The lecture focused on corpus linguistics and its history from ancient times to the present day. It consisted of four parts, each dedicated to a particular historical and technological stage in the development of the field.
In the introduction, Professor Lavrentiev told a joke from Charles Fillmore’s book that pokes fun at both “armchair” linguists (who rely exclusively on their own reasoning to conduct research and are skeptical of computational linguistics) and corpus linguists, and underlines that a linguist should combine the qualities and abilities of both. Moreover, Professor Lavrentiev emphasized that linguistics was corpus-based from the outset, as the first scholars were engaged in collecting scarce pieces of writing. The most prominent of these early (corpus) linguists was the ancient Indian philologist and grammarian Panini. He composed almost 4,000 “sutras,” which precisely described Sanskrit grammar. In the Middle Ages, scholars were predominantly interested in religious texts, chiefly the Bible. They compiled concordances, which indexed the words of the biblical texts along with the passages where they occur.
In Russia, corpus linguistics started with the dictionary compiled by Vladimir Dahl, who traveled around the country in the 19th century and collected examples of dialectal speech. Later, at the beginning of the 20th century, Nikolay Morozov developed a theory of stylometry. His theory suggests that texts by the same author show similar patterns in the use of function words. Comparing these patterns across texts might therefore help detect plagiarism.
In the mid-20th century, corpus linguistics underwent a paradigm shift when Roberto Busa developed the Index Thomisticus, the first computer-recorded corpus in history. Busa began his career as a theologian and Jesuit. His thesis discussed the use of the preposition “in” by Thomas Aquinas. This work later grew into a collection of Thomas Aquinas’s writings. In 1949, Busa met the IBM chairman and CEO Thomas J. Watson, who was inspired by Busa’s intention to record the analog corpus on a computer.
In the 1960s, computer technologies developed by leaps and bounds, and the corpus linguists of the time could not fail to notice this rapid progress. The Brown Corpus, composed of samples of American English, started a huge wave of computerization in linguistics. Since then, the number of corpora has grown rapidly. Other projects set out to build corpora of British English (the London-Lund and Lancaster-Oslo-Bergen corpora), French of different periods (Frantext, covering the 16th to the 20th centuries; DMF, covering 1330 to 1500; and BFM, covering the 9th to the 15th centuries), as well as other languages.
In the USSR, an ambitious project called the Computer Fund of the Russian Language (CFRL) was also launched at the end of the 1980s. Unfortunately, it was a hard time in the history of Russia: Perestroika and the collapse of the Soviet Union put an end to the project. Luckily, when the Russian National Corpus was developed, it made use of the CFRL, so the latter’s materials were not lost.
Professor Lavrentiev concluded the lecture with a brief discussion of modern trends in corpus linguistics and described some of the world’s leading institutions engaged in its research and development. Among others, Professor Lavrentiev highlighted the Textual Heritage, the Russian Association for Digital Humanities, the Alliance of Digital Humanities Organizations, and the Text Encoding Initiative.