Text Analysis

Methods: word frequency (Voyant), classification, entity extraction, topic modeling, word embeddings. Outcome: initial insights related to the research questions, visualizations. Need: machine-readable text from the sourcebook.

Link to our Knizhnoe obozrenie corpus in Voyant

Link to the Knizhnoe obozrenie bestsellers lists corpus in Voyant

Link to full text of the KO corpus

Links to text by year:

1990, 1991, 1992, 1993

Word frequency over the whole corpus in Voyant

Svetlana’s comment: Voyant does not have a list of stopwords for Russian (words to be excluded from the analysis, such as prepositions, pronouns, and conjunctions), so I had to compile my own. I managed to enter the most frequently used Russian prepositions and a couple of conjunctions and pronouns before the list stopped accepting words. That came to about 49 words, whereas my “good” list would include about 100.
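For readers who want to try the same filtering outside Voyant, here is a minimal sketch in Python. The stopword set and the sample sentence are illustrative assumptions, not the list or text used in this project, and the function name `word_frequencies` is hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately incomplete stopword set (prepositions,
# conjunctions, pronouns). It is NOT the list compiled for the project.
RUSSIAN_STOPWORDS = {
    "в", "на", "с", "к", "по", "из", "за", "о", "от", "для", "у", "до",
    "и", "а", "но", "или", "что", "как",
    "он", "она", "они", "это", "я", "мы", "вы",
}

def word_frequencies(text, stopwords=RUSSIAN_STOPWORDS):
    """Count lowercase Cyrillic word forms, skipping stopwords."""
    tokens = re.findall(r"[а-яё]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

# Invented sample sentence, used only to show the call.
sample = "Книга и рынок: спрос на книги и предложение книг на рынке."
print(word_frequencies(sample).most_common(3))
```

A plain Python set has no practical size limit, so a fuller list of about 100 words could be used here.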

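The phrases for Bradley's key terms: рынок (market), бестселлер (bestseller), спрос (demand), предложение (supply), биржа (exchange), прейскурант (price list), конъюнктура (market conditions), популярность (popularity), статистика (statistics), издаваемость (publication rate)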

Svetlana’s comment: Bradley gave me a research question and a few keywords connected to market and demand. This view is the Document Type KWIC grid, which shows these keywords in context. Rather than the keywords themselves, I used their stems, which allowed me to “catch” more phrases. The tool lists phrases that start with the keywords and shows, for each repeated phrase, its number of occurrences in the corpus (count) and its length.
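The stem-based search can be approximated with a short Python sketch. This is an assumption about how such a count could be done, not a description of Voyant's internals; the function name `phrases_from_stem`, the fixed phrase length, and the sample text are hypothetical.

```python
import re
from collections import Counter

def phrases_from_stem(text, stem, length=3):
    """Count word sequences of `length` words whose first word starts with `stem`."""
    tokens = re.findall(r"[а-яё]+", text.lower())
    counts = Counter()
    for i in range(len(tokens) - length + 1):
        if tokens[i].startswith(stem):
            # Punctuation is discarded, so a phrase may cross a sentence boundary.
            counts[" ".join(tokens[i:i + length])] += 1
    return counts

# Invented sample text, used only to show the call.
sample = ("Книжный рынок растет. Рынок книг меняется, "
          "а рынки книг в регионах отстают. Рынок книг меняется снова.")
for phrase, count in phrases_from_stem(sample, "рын").most_common():
    print(count, phrase)
```

Searching for the stem "рын" rather than the full word "рынок" also catches forms such as "рынки" and "рынке", which is the reason for working with stems.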

Microsearch for the terms across the corpus

Svetlana’s comment: This tool is helpful if you want to know where the keywords occur and whether they are clustered in a specific place or spread across the corpus.
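A rough numeric counterpart to this view is to cut the corpus into equal slices and count keyword hits per slice. The sketch below assumes that approach; it is not how Microsearch itself is implemented, and the function name `stem_distribution` and the sample text are hypothetical.

```python
import re

def stem_distribution(text, stem, segments=10):
    """Count tokens starting with `stem` in each of `segments` equal slices."""
    tokens = re.findall(r"[а-яё]+", text.lower())
    counts = [0] * segments
    for position, token in enumerate(tokens):
        if token.startswith(stem):
            # Map the token position to a slice index in 0..segments-1.
            counts[position * segments // len(tokens)] += 1
    return counts

# Invented sample: the keyword appears only near the beginning.
sample = "спрос на книги растет " * 3 + "тиражи падают " * 6
print(stem_distribution(sample, "спрос", segments=4))  # [2, 1, 0, 0]
```

Counts concentrated in a few slices suggest clustering, while roughly even counts suggest the term is spread across the corpus.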

The word tree view for terms across the corpus

Svetlana’s comment: Experimenting with the keywords and the sliders for the number of branches and the limit for concordance phrases will allow one to get a better feel for the keywords and contexts in the corpus.

The DreamScape geotagging of the corpus

Svetlana’s comment: Disabling connections and animations in the display bar reduces the noise on the map and makes it easier to see the multitude of places mentioned in Knizhnoe obozrenie. These can be compared to information from other publications or investigated for mentions of cities other than St. Petersburg and Moscow. Another possible line of research is how often cities in the various former republics of the Soviet Union are mentioned. Also, note the large St. Petersburg dot and the accumulation of dots in the Moscow area.

Bestseller lists topic modeling

Svetlana’s comment: I had to use the bestseller lists for the topic modeling because I could not effectively eliminate all the OCR noise from the full corpus and could not enter all the stopwords I needed. As a result, the tool was taking too long to calculate and I could not load it. The topic modeling tool still works pretty well on the bestseller lists alone. For example, it groups the words for “women,” “self-healing,” “deal,” “[without?] meat,” and “[in the?] house,” pointing to the difficulties of procuring food, to gender markers that distinguish male and female tasks, and to dieting and cookbooks that construct the food shortages as something that could turn out to be beneficial for health. Another topic groups the words for “horoscope,” “for two,” “antique,” “mythology,” “gods,” and “El’tsin.” One thing the topic modeling tool can do is point to locations in the corpus that share similar themes, thus identifying the discourses of the corpus.