September 18, 2017
A few months ago I wanted to know what people say when they talk about the Internet of Things (IoT) on Twitter. So over the course of 200 days, from October 2016 to May 2017, I recorded tweets containing the hashtag #IoT.¹
This resulted in a total of over 8 million tweets.
Today, I finally got to exploring this data.
The goal of this blogpost is two-fold: On the one hand, it presents some results of an exploratory analysis of this data. On the other hand, it aims at giving a simple introduction to analysing text data, from preprocessing and visualising, to topic modelling and sentiment analysis — all using Python.
To get the data, I implemented a simple listener for the Twitter streaming API, based on Tweepy, that stored each incoming tweet containing the hashtag #IoT in DynamoDB. You can check out the code on GitHub.
Raw data in a pandas DataFrame
Above you can see a screenshot of what the raw data I collected from Twitter looks like after having read it into a pandas DataFrame. It contains columns such as the tweet’s text, its language, the hashtags of the tweet, and the author’s username.
As with most data-related projects, I started out with a simple exploration and was initially interested in the most common hashtags that were used together with the hashtag #IoT. This gives a good overview of other topics people mention when tweeting about the Internet of Things.
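A minimal sketch of this co-occurrence count, written in plain Python rather than taken from the notebook. It assumes each tweet is a dict with a `hashtags` list (a hypothetical field name), and the sample tweets are made up for illustration:

```python
from collections import Counter


def top_cooccurring_hashtags(tweets, anchor="iot", n=10):
    """Count hashtags that appear in the same tweet as `anchor` (case-insensitive)."""
    counts = Counter()
    for tweet in tweets:
        # A set removes duplicates, so a tag repeated within one tweet counts once.
        tags = {tag.lower() for tag in tweet["hashtags"]}
        if anchor in tags:
            # Ignore the anchor hashtag itself, as in the chart.
            counts.update(tags - {anchor})
    return counts.most_common(n)


# Made-up sample data; the "hashtags" field name is an assumption.
sample_tweets = [
    {"hashtags": ["IoT", "AI", "BigData"]},
    {"hashtags": ["iot", "ai"]},
    {"hashtags": ["IoT", "Security", "AI"]},
    {"hashtags": ["Python"]},  # no #IoT, so this tweet is skipped
]

top = top_cooccurring_hashtags(sample_tweets, n=3)
print(top)
```

Lower-casing the tags first means #IoT, #iot and #IOT are treated as the same hashtag.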
Below is a bar chart showing the top 10 hashtags that appeared in IoT tweets (ignoring the hashtag #IoT itself) and their corresponding frequencies:
Top hashtags co-occurring with the hashtag #IoT
The figure shows hashtags that I would have expected to see, as they are clearly related to IoT (e.g. #ai, #bigdata, #tech), but also some that are not obvious at first glance, such as #comunidade, #ssp, and #florianopolis.
After conducting a quick search, I found that these hashtags are actually related to the Secretary for Public Safety of the Brazilian state of Santa Catarina. They launched a program called Bem-Te-Vi, which consists of security cameras installed at multiple locations throughout the state of Santa Catarina. Images taken from these cameras are being sent to Twitter every few minutes, all of them tagged with #IoT.
Since I was not interested in these particular tweets, I dropped them by removing all tweets containing the hashtag #ssp, the hashtag associated with the above-mentioned institution. By doing so, 2.3 million tweets were removed from the dataset, and I re-plotted the top hashtags:
Top hashtags co-occurring with the hashtag #IoT after dropping the above-mentioned tweets
Et voilà, I obtained an excellent overview of topics people mention when they talk about IoT on Twitter, such as AI, Big Data, Machine Learning, and Security.
In a similar way, other valuable insights can be unveiled. As a next step, I took a look at the most active users tweeting about IoT…
Most active users (by number of tweets about #IoT)
According to the plot, user alejandro vergara is the most active in tweeting about the Internet of Things. Interestingly, over 99% of his tweets in this dataset are retweets. I decided to drop retweets in all subsequent analyses. In doing so, the list of most active users looks quite different:
Most active users (number of tweets, retweets ignored)
Additionally, I had a look at the most common languages used in the tweets.
You can see that more than 8 out of 10 tweets about IoT are in English.
To conduct further analysis, I decided to investigate the content of these tweets. As most text analysis tools are language specific, I dropped all tweets that are non-English.
I ended up with a dataset of around 1.65 million tweets.
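The three filters described so far (dropping #ssp tweets, dropping retweets, keeping only English) can be sketched as a single predicate. This is an illustration under assumptions, not the code from the notebook: the field names `hashtags`, `is_retweet` and `lang` are hypothetical, and the sample tweets are made up.

```python
def keep_tweet(tweet, blocked_hashtag="ssp", language="en"):
    """Return True if the tweet passes all three filters."""
    tags = {tag.lower() for tag in tweet.get("hashtags", [])}
    if blocked_hashtag in tags:
        return False  # automated security-camera posts
    if tweet.get("is_retweet", False):
        return False  # retweets are ignored in the later analyses
    return tweet.get("lang") == language  # keep English tweets only


# Made-up sample data; all field names are assumptions.
sample = [
    {"text": "camera image", "hashtags": ["IoT", "SSP"], "is_retweet": False, "lang": "pt"},
    {"text": "RT great read", "hashtags": ["IoT"], "is_retweet": True, "lang": "en"},
    {"text": "IoT trifft KI", "hashtags": ["IoT"], "is_retweet": False, "lang": "de"},
    {"text": "IoT meets AI", "hashtags": ["IoT", "AI"], "is_retweet": False, "lang": "en"},
]

filtered = [tweet for tweet in sample if keep_tweet(tweet)]
print([tweet["text"] for tweet in filtered])
```

With the data in a pandas DataFrame, the same three conditions could be combined into one boolean mask.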
When working with text data, multiple preprocessing steps usually need to be performed in order to clean it. Such cleaning steps typically consist of operations like removing punctuation, dropping stopwords and single characters, splitting text into terms, lemmatising words, etc.
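A minimal sketch of such a cleaning function, using only the standard library. The tiny stopword set is a placeholder (a real analysis would use a full list, for example the one shipped with NLTK), stripping URLs is an extra step I am assuming is useful for tweets, and lemmatisation is left out because it needs an external resource:

```python
import re

# Tiny illustrative stopword set; a placeholder, not a complete list.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "for", "on", "with"}

URL_RE = re.compile(r"https?://\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def preprocess(text):
    """Lower-case, strip URLs and non-letters, tokenise, drop stopwords and single characters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # remove links (an assumed extra step)
    text = NON_LETTER_RE.sub(" ", text)  # remove punctuation, digits and symbols such as '#'
    tokens = text.split()                # split on whitespace into terms
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]


tokens = preprocess("The #IoT is changing a LOT of industries! https://example.com")
print(tokens)
```

Note that this simple pattern also drops digits, so a term like "5g" would be lost; whether that is acceptable depends on the analysis.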
After doing so for every tweet, I started to look at the most common words across all our IoT tweets. I used Python’s wordcloud library for a nice visualisation of term frequencies in the data:
Wordcloud of the most common terms in IoT tweets
The wordcloud illustrates the terms most frequently encountered in tweets about IoT.
If you are interested in more than just isolated term frequencies, a common approach in text mining is to identify co-occurring terms, so-called n-grams. NLTK is a powerful natural language processing library for Python. I used NLTK to identify common bigrams, i.e. pairs of consecutive words.
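As an illustration of what bigram counting does, here is a minimal sketch in plain Python rather than NLTK, so that it runs without extra dependencies. NLTK's `nltk.bigrams` yields the same consecutive pairs. The tokenised tweets are made up:

```python
from collections import Counter


def top_bigrams(tokenised_tweets, n=10):
    """Count pairs of consecutive tokens across all tweets."""
    counts = Counter()
    for tokens in tokenised_tweets:
        # zip pairs each token with its successor; pairs never span two tweets.
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)


# Made-up, already preprocessed tweets.
docs = [
    ["internet", "things", "security"],
    ["machine", "learning", "internet", "things"],
    ["big", "data", "machine", "learning"],
]

result = top_bigrams(docs, n=2)
print(result)
```

Counting per tweet, instead of over one concatenated token stream, avoids creating artificial bigrams from the last word of one tweet and the first word of the next.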
The result shows a list of top bigrams — displaying very familiar collocations!
Topic modelling is a text-mining tool for the discovery of central subjects in documents (i.e. tweets). An example of a topic model is Latent Dirichlet Allocation (LDA), which assumes that documents are produced from a mixture of topics.
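The "mixture of topics" assumption can be written down as a short generative process. The notation below is the standard textbook one and does not come from the notebook: $\alpha$ and $\beta$ are Dirichlet hyperparameters, $d$ indexes tweets, $k$ topics, and $n$ word positions.

```latex
\begin{align*}
\theta_d  &\sim \mathrm{Dirichlet}(\alpha)               && \text{topic mixture of tweet } d \\
\varphi_k &\sim \mathrm{Dirichlet}(\beta)                && \text{word distribution of topic } k \\
z_{d,n}   &\sim \mathrm{Categorical}(\theta_d)           && \text{topic of the } n\text{-th word in tweet } d \\
w_{d,n}   &\sim \mathrm{Categorical}(\varphi_{z_{d,n}})  && \text{the observed word}
\end{align*}
```

Fitting the model means inferring the $\theta_d$ and $\varphi_k$ that best explain the observed words; the "top terms" of a topic are the words with the highest probability under its $\varphi_k$.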
Using Python’s gensim implementation of LDA for topic modelling and setting the number of topics to n=5, the top terms of each topic obtained were:
Top terms of topics identified using LDA
The list of top words shows that different overall topics such as news (0), security (2), or data (4) could automatically be identified using topic modelling.
Sentiment analysis is used to determine whether a writer’s attitude towards a particular topic is positive or negative (or neutral). TextBlob is a Python library for processing text data that offers a model for sentiment analysis.
Applying the model to each tweet and looking at the distribution of the polarity values, we get a sense of the overall sentiment of the tweets:
Histogram of sentiment polarity values
The histogram shows that the polarity values are skewed towards the right, indicating an overall positive sentiment of tweets about IoT.
I was more interested in negatively-associated content, so I took a closer look at the most frequent terms with a very low polarity (i.e. negative) in tweets:
Most frequent terms in tweets with low polarity (negative)
Not really surprising, is it? Artificial Intelligence, which has been increasingly part of negative news coverage, as well as Security, a still unsolved, and largely discussed, problem in IoT, appear in the list of terms with negative sentiment.
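The term count for negative tweets can be sketched as follows. This is a hedged illustration, not the notebook's code: it assumes each tweet has already been tokenised and scored (TextBlob's `sentiment.polarity` ranges from -1 to 1), the polarity values below are made up, and the cutoff of -0.5 is an arbitrary assumption.

```python
from collections import Counter


def frequent_terms_below(scored_tweets, threshold=-0.5, n=10):
    """Count terms in tweets whose polarity is at or below `threshold`."""
    counts = Counter()
    for tokens, polarity in scored_tweets:
        if polarity <= threshold:
            counts.update(tokens)
    return counts.most_common(n)


# Made-up (tokens, polarity) pairs; the scores are illustrative, not model output.
scored = [
    (["security", "breach", "devices"], -0.8),
    (["ai", "threat", "security"], -0.6),
    (["great", "iot", "platform"], 0.8),
    (["ai", "news"], 0.0),
]

negative_terms = frequent_terms_below(scored, n=1)
print(negative_terms)
```

Terms found this way appear in negative tweets; they are not necessarily negative words themselves, which is worth keeping in mind when reading such a list.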
To give a better idea of the classification of tweets into positive and negative sentiments, below is a random sample of tweets with maximum and minimum polarity as returned by the model:
Examples of tweets with negative sentiment
Examples of tweets with positive sentiment
Valuable insights can be extracted from text data. The Python ecosystem, with its large number of data science libraries, makes it easy to process and analyse text data in many different ways, depending on the use case at hand. These libraries can be used to quickly build automated tools that can be of great value to companies. For instance, at WATTx, we built a small tool where a project lead can enter some tags related to their project, and the team will get a curated list of results/news from Twitter in the project’s Slack channel on a regular basis. This helps us stay informed about progress within the realm of each project we are running.
The goal of this blogpost was to present simple examples of some of these tools and text analysis concepts applied to tweets for gaining insights into #IoT content on Twitter.
All code used for the results presented here was written in Python and can be found in this Jupyter notebook.
pandas: Powerful library for data analysis
scikit-learn: The go-to library for Machine Learning in Python
NLTK: Tools for natural language processing
gensim: Topic modelling tools in Python
wordcloud: A lightweight library for producing nice-looking wordclouds
matplotlib: Python 2D plotting library
tweepy: A Python library for accessing the Twitter API
We are currently working on a research paper about security and people’s perception of that topic, which will be released by the beginning of October. If you are interested, sign up for our newsletter here.