Skip to main content

SoBigData Articles

The role of Telegram's mediums during protests in Belarus 2020

Authors: Ivan Slobojhan; Rajesh Sharma (University of Tartu)

More and more anti-government protests have occurred in the world in recent years. A common denominator for such protests is that they all rely on social media.

While Twitter, Facebook, and Reddit were extensively studied in the research community, the studies of Telegram and its role in the protest activity are scarce. At the same time, a big part of computational social science is focused on English-speaking countries such as the USA, the United Kingdom, or countries in which English is common (for example, in India). However, computational social science does not pay equal attention to the research on protests in countries where the English or Latin alphabet is not common. This is surprising given that such countries are often exposed to various forms of protests. 

This paper fills these gaps by analyzing protests in Belarus in 2020 using Telegram’s data from May to November 2020. Telegram has a unique feature that is not present in many other messengers. 

Telegram has three different communication tools (mediums) where users can share their messages with the audience. They are: 1) channels (broadcasting tool for admins where they share posts with the subscribers), 2) groups (communication tool for users to share messages with the rest of users), 3) and local chats (similar to groups, but the users share some specific geographical location). 

We collect more than four million messages from 654 local chats, more than 30 thousand posts from five large channels, and six million messages from two big groups from the period of 1st May 2020 to the 29th of November 2020.

Using this rich dataset, we investigate the following Research Questions (RQs) to understand the role of each medium during the protests in Belarus 2020:

RQ 1: Does the activity of users differ in mediums? Firstly, we analyze the users’ activity as the number of messages they post per day in each medium. Then we identify the top five spikes in each medium and plot them against the external offline events using open-source information from Wikipedia and online news sources such as BBC, DW, and others. 

Findings: We discover notable differences in the activity of users and admins (for channels) in the number of daily messages in the mediums. We also find that the top spikes of users’ activities are different in the three mediums. The dates of the highest spikes in channels match with the important political announcements. At the same time, spikes in local chats match with major protests and marches. Finally, the spikes in groups are related to both.

RQ 2: What topics do users discuss in each medium? To investigate this question, we perform a three-stage analysis. Firstly, we analyze the most frequent words in each medium using WordClouds. Secondly, we extract topics using LDA and compare them among the mediums. We compare topics in two different scenarios. First, we analyze topics that we extract for the whole period of the data. Second, we also select the most important topics which overlap between any two mediums during their most active days (i.e. spikes). Finally, we analyze the context surrounding the names of politicians, protests, and specific locations utilizing Word2Vec embeddings. We train three models for each medium, and then for each noun. We find the top ten most important nouns using cosine distances among words’ embeddings. Then we infer the context surrounding using these ten words to find the main topics which users discussed. 

Finding: Using the data for the whole period, we find that users were concerned with three main subjects, namely Covid, elections, and protests. However, we also find that each medium featured some unique topics during their respective spikes. 

RQ 3: Do users communicate distinctly in different mediums? We investigate whether users communicate distinguishably in the different mediums. In other words, whether it is possible to predict for a given message from which medium this message is. We train a classifier that predicts the medium by a given text input to check this question. We use TF-IDF with a bag on n-grams to create features from text and then feed them to the logistic regression. Finally, we perform an error analysis and analyze the feature's importance. 

Finding: We find that the messages on channels can be predicted much better than messages on other mediums. At the same time, messages from groups and local chants are not so easy to classify correctly. This finding has two possible interpretations. First, those people who post via channels use consistent words and language patterns which makes them recognizable. This could mean that people use channels to forward similar messages and news. At the same time, groups and chats are more disorganized and less homogeneous. Therefore, there is no consistent language pattern that unites them.