SoBigData Articles

Ask “Who”, Not “What”: Bitcoin Volatility Forecasting with Twitter Data

The cryptocurrency market, after its surge in 2009, gained immense popularity not only among small-scale investors but also among large hedge funds. The ever-increasing popularity of cryptocurrencies attracted professional investors, who started constructing portfolios with them; however, the vast majority of the market share still belongs to individuals. The importance of individual investors in cryptocurrency markets motivates studying alternative sources of information, such as social media, to which market participants may respond.

With its widely available API, ease of use, and its primary purpose of information sharing, Twitter is both a relevant and convenient online resource for revealing relations between financial markets and social media. Among cryptocurrencies, Bitcoin is arguably the most appropriate candidate for analysis of the relationship between social media and cryptocurrency markets due to its immense popularity. 

Therefore, in our work we concentrated on the coupling between Bitcoin’s price variations (volatility, in technical terms) and various aspects of social media signals, including the sentiment of Bitcoin-related tweets and the popularity of a tweet’s author, among others. This coupling is understood as an improved prediction of Bitcoin's volatility when Twitter data is incorporated in the forecasting model. Incorporating such exogenous information into a model is a difficult task in and of itself. We opted to study deep learning-based architectures and compared them to classical financial models that do not allow for exogenous signals. We used a modular and testable architecture to which social media and other features can be progressively added. In this way, we gained several valuable insights: which deep learning-based alternatives perform best (and, importantly, better than traditional econometric models) when only price data is considered; how a model that incorporates Twitter signals is best built; and which aspects of the Twitter signal are most useful in predicting realized volatility.

To achieve our goals, we used the Twitter API in a streamwatcher fashion, so that tweets containing Bitcoin-related strings were saved instantly. We then used the text and accompanying metadata of these tweets together with intraday Bitcoin price data in the prediction task. The gathered data was first preprocessed by pruning and refactoring, so that all social media entries had the same format and contained only information that could potentially aid the realized volatility forecast. The data was collected using the publicly available Twitter API’s

All relevant tweets between 10.10.2020 and 3.3.2021 were stored as JSON line objects on an AWS MongoDB server. A tweet was considered relevant if its text contained one of the following strings: “BTC”, “$BTC”, or “Bitcoin”. Such a tweet was automatically appended to the database by the streamer. For volatility, we considered 15-minute closing prices on the Bitfinex exchange. The price data was acquired using the CryptoCurrency eXchange Trading Library (CCXT).
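The relevance rule described above can be illustrated with a minimal sketch. It assumes each tweet is a dictionary with a "text" field (a hypothetical field name) and that matching is case-sensitive, which the description does not specify; it is not the streamer used in the study.

```python
# Substrings named in the article. "$BTC" is already covered by "BTC",
# but it is listed here to mirror the text.
KEYWORDS = ("BTC", "$BTC", "Bitcoin")


def is_relevant(text: str) -> bool:
    """Return True if the tweet body contains any of the tracked strings.

    Case-sensitive matching is an assumption; the article does not say
    whether the streamer ignored case.
    """
    return any(keyword in text for keyword in KEYWORDS)


def filter_relevant(tweets):
    """Keep only tweets (dicts with a hypothetical "text" field) that match."""
    return [tweet for tweet in tweets if is_relevant(tweet.get("text", ""))]
```

In a streaming setup, a check like is_relevant would run on each incoming tweet before it is appended to the database.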

Overall, the dataset used in this study consisted of approximately 14,000 Bitcoin price snapshots and 30 million Bitcoin-related tweets.
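To make the link between price snapshots and the forecasting target concrete, the sketch below turns a list of 15-minute closing prices into log-returns and a rolling realized volatility, using the common definition of realized volatility as the square root of the sum of squared returns. The window length is left as a parameter because the value used in the study is not stated here; the function names are illustrative.

```python
import math


def log_returns(closes):
    """Log-returns r_t = ln(P_t / P_{t-1}) of consecutive closing prices."""
    return [math.log(curr / prev) for prev, curr in zip(closes, closes[1:])]


def realized_volatility(returns, window):
    """Rolling realized volatility: sqrt of the sum of squared returns.

    `window` is the number of 15-minute returns per estimate; the value
    used in the study is not stated here, so it is left as a parameter.
    """
    squared = [r * r for r in returns]
    return [
        math.sqrt(sum(squared[end - window:end]))
        for end in range(window, len(squared) + 1)
    ]
```

For example, with window=96 each estimate would cover one day of 15-minute returns, although that choice is only an assumption.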

We first considered price data alone and compared the econometric baselines, Heterogeneous Autoregressive Realized Volatility (HAR-RV) and Generalized Autoregressive Conditional Heteroskedasticity (GARCH), with the following deep learning architectures: recurrent (RNN), temporal convolutional (TCN), gated recurrent unit (GRU), and long short-term memory (LSTM) neural network models, using a 96-day training and a 48-day testing horizon. Based on the results of these experiments, we concluded that the TCN is a promising candidate model for the analysis and prediction of Bitcoin volatility.
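For readers unfamiliar with TCNs, their core building block is a causal, dilated one-dimensional convolution: each output depends only on current and past inputs, and dilation lets stacked layers cover a long history with few parameters. The sketch below shows that operation in plain Python with fixed, illustrative weights. A real TCN learns the weights and adds nonlinearities and residual connections, and the kernel sizes and dilations used in the study are not given here.

```python
def causal_dilated_conv(series, weights, dilation=1):
    """Causal 1-D convolution with a dilation factor.

    Output at time t depends only on series[t], series[t - dilation],
    series[t - 2 * dilation], ...; positions before the start are
    treated as zeros (left padding), so the output has the input's length.
    weights[0] multiplies the current value, weights[1] the value one
    dilation step back, and so on.
    """
    out = []
    for t in range(len(series)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # skip taps that fall before the series starts
                total += w * series[idx]
        out.append(total)
    return out


def stack_receptive_field(kernel_size, dilations):
    """Time steps (including the current one) seen by one conv per dilation."""
    return 1 + (kernel_size - 1) * sum(dilations)
```

With kernel size 2 and dilations 1, 2, and 4, three stacked layers already see 8 consecutive time steps, which is why dilations are usually doubled from layer to layer.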

Following this step, we built a two-level TCN model (D-TCN) that allows log-returns and information obtained from social media to be used concurrently as inputs. Finally, to assess the influence of the added information, we trained the D-TCN with combinations of different feature sets and compared the results of each combination with those of the TCN trained using only the log-returns.
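Before social media signals can sit alongside log-returns, they need to be aligned with the 15-minute price grid. One plausible way to do this, shown below, is to group tweets into 15-minute bins and summarise each bin by its tweet count and the mean follower count of the authors. The field names "timestamp" and "followers" and the use of a mean are assumptions made for illustration, not the preprocessing used in the study.

```python
from collections import defaultdict

BIN_SECONDS = 15 * 60  # matches the 15-minute price grid


def aggregate_tweets(tweets):
    """Group tweets into 15-minute bins and summarise each bin.

    Each tweet is a dict with hypothetical keys "timestamp" (Unix seconds)
    and "followers". Returns {bin_start: {"count": ..., "mean_followers": ...}}.
    """
    bins = defaultdict(list)
    for tweet in tweets:
        # Floor the timestamp to the start of its 15-minute interval.
        bin_start = tweet["timestamp"] // BIN_SECONDS * BIN_SECONDS
        bins[bin_start].append(tweet["followers"])
    return {
        start: {"count": len(f), "mean_followers": sum(f) / len(f)}
        for start, f in sorted(bins.items())
    }
```

Each bin then yields one feature vector per price snapshot, and different subsets of these features can be switched on or off to form the feature-set combinations being compared.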

Overall, our experiments produced an unexpected outcome: contrary to intuition, the semantic content of a tweet’s text is not the most informative for predicting realized volatility. Instead, information about the authors of the tweets (e.g., follower counts and friend counts) conveys much more predictive power. A possible qualitative explanation for this phenomenon is that the attention of a popular account towards a cryptocurrency influences realized volatility regardless of the sentiment of a tweet. We also note that feature sets that combined User information and Tweet information with information about the total tweet volume (Count) were more predictive than Count on its own, particularly in high realized volatility regimes.

Author: Vaiva Vasiliauskaite, ETHZ
Exploratory: Demography, Economy and Finance 2.0