Methods: frequency study (Voyant), classification, entity extraction, topic modeling, word embeddings. Outcome: initial insights related to the research questions; visualizations. Need: machine-readable text from the sourcebook.
Links to text by year:
Svetlana’s comment: Voyant does not have a built-in stopword list for Russian (stopwords are words excluded from the analysis, such as prepositions, pronouns, and conjunctions), so I had to compile my own. I entered the most frequently used Russian prepositions and a few conjunctions and pronouns before the list stopped accepting words. I managed to enter about 49 words, while my full “good” list would include about 100.
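When the tool's stopword field runs out of room, the filtering can be done before upload. Below is a minimal sketch (not Svetlana's actual list) of stripping Russian stopwords from a text in Python; the words shown are a small illustrative subset of common prepositions, conjunctions, and pronouns.

```python
# A small illustrative subset of Russian stopwords (prepositions,
# conjunctions, pronouns) -- a real list would hold ~100 entries.
STOPWORDS = {
    "и", "в", "во", "не", "на", "но", "что", "он", "она", "оно", "они",
    "с", "со", "как", "а", "то", "все", "к", "у", "же", "за", "по",
    "из", "о", "от", "это", "я", "ты", "мы", "вы", "его", "её", "их",
}

def remove_stopwords(text: str) -> list[str]:
    """Lowercase the text, split on whitespace, and drop stopword tokens."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]

print(remove_stopwords("Спрос на книги в Москве и Ленинграде"))
```

The cleaned tokens can then be rejoined and uploaded to Voyant as a corpus that no longer needs a stopword list at all.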
Svetlana’s comment: Bradley gave me a research question and a few keywords connected to market and demand. This view is the Document Type KWIC grid that contextualizes those keywords. I used not the keywords themselves but their stems, which allowed me to “catch” more phrases. The tool lists phrases starting with the keywords and shows the number of occurrences in the corpus (count) and the length of each repeated phrase.
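The reason stems catch more than exact keywords is that Russian nouns inflect: the stem "спрос" (demand) also matches "спроса", "спросом", and so on. A toy keyword-in-context sketch (not Voyant's implementation; the tokens are invented):

```python
def kwic(tokens, stem, window=2):
    """Return (left context, matched token, right context) for every token
    beginning with the given stem -- a toy keyword-in-context search."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.startswith(stem):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

# "спрос" matches both the nominative "спрос" and the genitive "спроса".
tokens = "рост спроса на книги определяет спрос рынка".split()
print(kwic(tokens, "спрос"))
```

Searching for the exact keyword "спрос" would have found only one of the two occurrences.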
Svetlana’s comment: This tool is helpful if you want to know where the keywords occur and whether they are clustered in a specific place or spread across the corpus.
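The same clustered-or-spread question can be asked of any text outside Voyant by reducing each keyword occurrence to a relative position between 0 (start of the text) and 1 (end). A minimal sketch with invented tokens:

```python
def relative_positions(tokens, stem):
    """Relative position (0..1) of every token that begins with the stem;
    clustered values mean the keyword concentrates in one part of the text."""
    n = len(tokens)
    return [round(i / n, 2) for i, t in enumerate(tokens) if t.startswith(stem)]

# Invented example: the keyword clusters near the start, with one late hit.
tokens = ["x"] * 10
tokens[1] = tokens[2] = tokens[8] = "спрос"
print(relative_positions(tokens, "спрос"))
```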
Svetlana’s comment: Experimenting with the keywords and with the sliders for the number of branches and the limit on concordance phrases will give one a better feel for the keywords and their contexts in the corpus.
Svetlana’s comment: If one disables connections and animations in the display bar, there will be less noise on the map, allowing one to see the multitude of places mentioned in Knizhnoe obozrenie. This can be compared with information from other publications, or searched for mentions of cities other than St. Petersburg and Moscow. Another possible line of research is the relative weight of mentions of cities in the various former republics of the Soviet Union. Note also the large Saint Petersburg dot and the accumulation of dots around Moscow.
Svetlana’s comment: I had to use the Bestsellers’ Lists for the topic modeling because I could not effectively eliminate all the non-OCRed noise from the corpus, and I could not enter all the stopwords I needed. As a result, the tool was taking too long to calculate and would not load. The topic modeling tool still works quite well on the bestsellers’ lists alone. For example, it unites the words for "women," "self-healing," "deal," "[without?] meat," and "[in the?] house," pointing at the difficulties of procuring food, at the gender markers that distinguish male and female tasks, and at the dieting books and cookbooks that construct the food shortages as something that could turn out to be beneficial for health. Another topic unites the words for "horoscope," “for two,” "antique," "mythology," "gods," and "El’tsin." One thing the topic modeling tool can do is point at locations in the corpus that tend to use similar or common themes, thus identifying the discourses of the corpus.
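The intuition behind the topics above is that words appearing together in the same documents cluster into themes. Real topic modeling tools use probabilistic models such as LDA; the toy sketch below shows only the underlying co-occurrence idea, on invented stand-ins for bestseller-list entries (not the real corpus).

```python
from collections import Counter
from itertools import combinations

# Invented mini-documents echoing the two topics described in the notes.
docs = [
    ["women", "self-healing", "meatless", "home"],
    ["women", "home", "meatless", "diet"],
    ["horoscope", "mythology", "gods", "antique"],
    ["horoscope", "gods", "mythology", "yeltsin"],
]

# Count how often each word pair appears in the same document.
pair_counts = Counter()
for doc in docs:
    for a, b in combinations(sorted(set(doc)), 2):
        pair_counts[(a, b)] += 1

# Pairs seen in more than one document hint at a shared topic: here the
# pairs split cleanly into a "food/household" cluster and an "esoterica"
# cluster, mirroring the two topics found in the bestsellers' lists.
strong = sorted(pair for pair, n in pair_counts.items() if n > 1)
print(strong)
```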