Winter has come. Game of Thrones season 7 is here and there are only 12 more episodes left of our beloved series. Now is the time to sit back and enjoy what we have while we have it. Or we can dive into the data and do a bit of analysis. One and the same, right?
“Death is so terribly final, while life is full of possibilities” — Tyrion Lannister
So with little direction or end goal in sight, I started thinking about the best way to do some interesting exploratory data analysis on the recent premiere. After a bit of thought, the obvious choice seemed to be Twitter, where casual watchers and die-hard fans alike come together to spout their opinions and impressions in real time. I was bound to find some interesting insights. So let’s get into it.
If you haven’t seen the season premiere and plan to, I recommend you stop reading this, fire up your friend’s HBO GO account that you’ve been mooching off of for 6 months now, and watch it. Then and only then come back to this post and enjoy my data-driven look into the awesomeness that was S7E1.
So with little to no real experience scraping data, I set about extracting thousands of ‘Game of Thrones’ related tweets. I identified relevant tweets by scraping only those that contained #GoT. This turned out to be plenty: I extracted over 215,000 tweets over the course of the week and, more importantly, over 25,000 live-tweets from during the premiere. These will serve as the backbone of my analysis.
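For anyone curious what "only tweets containing #GoT" might look like in practice, here is a minimal sketch. It assumes the tweet text is already in hand as plain strings. The case-insensitive, whole-hashtag match is my assumption for illustration, not necessarily how the original scrape behaved, and the sample tweets are made up.

```python
import re

# Assumption: match "#GoT" as a whole hashtag, ignoring case, so that
# longer tags such as "#GoTS7" or "#Gotham" are not counted.
GOT_TAG = re.compile(r"#got\b", re.IGNORECASE)


def is_relevant(text):
    """Return True if the tweet text contains the #GoT hashtag."""
    return GOT_TAG.search(text) is not None


# Made-up sample tweets, for illustration only.
sample_tweets = [
    "Winter is here! #GoT",
    "Watching #Gotham tonight",
    "No hashtag, just dragons",
    "that ending though #got",
]

relevant = [t for t in sample_tweets if is_relevant(t)]
print(relevant)
```

The word boundary after the tag is what keeps #Gotham out. Dropping it would make the filter much noisier.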
Leading up to the Premiere
As one might imagine, the excitement surrounding the premiere built up over time. I was able to capture this in the plot below, which shows the number of relevant tweets over the course of the week from 7/10–7/18.
You can see some small blips on the 11th and 13th. I’m honestly not quite sure what these were due to; possibly general hype, or a headline or article being released. More obviously, we can see a clear peak during the hour that the episode aired. Let’s dive into the live-tweets from that hour alone.
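A tweets-over-time plot like this boils down to bucketing timestamps and counting. Here is a rough sketch using only the standard library. The timestamps are invented for illustration, and the function name `tweets_per_hour` is hypothetical rather than taken from my actual code.

```python
from collections import Counter
from datetime import datetime


def tweets_per_hour(timestamps):
    """Count tweets per hour by truncating each timestamp to the hour."""
    counts = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    # Sort chronologically so the result can be plotted directly.
    return sorted(counts.items())


# Invented timestamps, for illustration only.
sample = [
    datetime(2017, 7, 16, 20, 59, 30),
    datetime(2017, 7, 16, 21, 0, 5),
    datetime(2017, 7, 16, 21, 45, 0),
    datetime(2017, 7, 17, 9, 15, 0),
]

hourly = tweets_per_hour(sample)
peak_hour, peak_count = max(hourly, key=lambda pair: pair[1])
print(peak_hour, peak_count)
```

Swapping the truncation to keep the minute gives the per-minute view used for the episode itself.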
Activity Throughout the Episode
All in all, we can learn a lot from the ~25,000 tweets related to Game of Thrones that were put out during the episode from all over the world. As you can see in the plot below broken down by minute, the activity wasn’t exactly consistent.
So right off the bat, you probably notice 2–4 local maxima that stand out. I went back and looked at what exactly went down during these peaks. It went something like this:
0–4 Minutes in (~800 Mentions): Episode kicks off, Arya gives her big speech.
8–12 Minutes in (~1300 Mentions): Intro starts, we hear that sweet tune we’ve been waiting for.
34–36 Minutes in (~400 Mentions): Sam makes his entrance via a very unappealing montage.
40–44 Minutes in (~600 Mentions): Ed Sheeran inexplicably shows up in a cameo role.
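I picked the peaks above by eye, but the same idea can be sketched in code: a local maximum is a minute whose count is higher than both of its neighbours. The per-minute counts below are made up, and `local_maxima` is a hypothetical helper, not part of my original analysis.

```python
def local_maxima(counts):
    """Return the indices whose value is strictly greater than both neighbours."""
    return [
        i
        for i in range(1, len(counts) - 1)
        if counts[i] > counts[i - 1] and counts[i] > counts[i + 1]
    ]


# Made-up tweets-per-minute counts, for illustration only.
per_minute = [120, 200, 150, 90, 80, 300, 310, 250, 100, 95, 140, 110]

peaks = local_maxima(per_minute)
print(peaks)
```

On real, noisy data this flags a lot of tiny bumps, so smoothing the series or requiring a minimum height would be a sensible next step.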
Key Word Analysis
Moving past the general activity analysis, we can dive in a little further by looking at the content of all these tweets. I opted to use the nltk package in order to create a corpus of all the tweets throughout the episode.
A few measures had to be taken to make sure this corpus was meaningful. I removed all the typical stop-words right off the bat using nltk’s built-in functionality. Next, I removed words that were under three letters. I also removed any words that weren’t in the English dictionary. Lastly, I re-added any specific ‘thrones’ terminology like names of characters, since for some reason ‘daenerys’ isn’t in the English dictionary, but that’s for another discussion.
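The cleaning steps above can be sketched roughly as follows. To keep the example runnable without any downloads, it swaps nltk's resources for tiny hand-written sets. In a real run those stand-ins would presumably come from something like `nltk.corpus.stopwords.words("english")` and `nltk.corpus.words.words()`. All three sets and the `clean_tokens` name are assumptions for illustration.

```python
import re

# Stand-in for nltk's English stop-word list (assumption, deliberately tiny).
STOP_WORDS = {"the", "and", "was", "that", "this", "with", "for"}

# Stand-in for an English dictionary word list (assumption, deliberately tiny).
ENGLISH_WORDS = {"premiere", "red", "wedding", "winter", "amazing", "home"}

# 'Thrones' terms to keep even though they are not dictionary words.
THRONES_TERMS = {"daenerys", "arya", "jorah", "varys", "hodor"}


def clean_tokens(text):
    """Lowercase and tokenize a tweet, then apply the filters in order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue  # 1. drop stop-words
        if len(tok) < 3:
            continue  # 2. drop words under three letters
        if tok not in ENGLISH_WORDS and tok not in THRONES_TERMS:
            continue  # 3. drop non-dictionary words, 4. unless 'thrones' terms
        kept.append(tok)
    return kept


cleaned = clean_tokens("OMG the premiere was amazing and Daenerys is home")
print(cleaned)
```

Checking the whitelist in the same step as the dictionary filter has the same effect as re-adding the terms afterwards, and it avoids a second pass over the tweets.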
Now that we’ve cleaned up this giant corpus, which comes to just under 500,000 words, we can start understanding the data. In order to visualize the corpus, I created a data frame of the top 20 most frequent words to go along with a word cloud.
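Getting the most frequent words out of a cleaned token list is a simple counting exercise. The sketch below uses `collections.Counter` from the standard library as a stand-in for the data frame I built. The token list is made up and `top_words` is a hypothetical helper.

```python
from collections import Counter


def top_words(tokens, n=20):
    """Return the n most frequent tokens as (word, count) pairs."""
    return Counter(tokens).most_common(n)


# Made-up cleaned tokens, for illustration only.
tokens = ["premiere", "jorah", "premiere", "red", "varys", "premiere", "jorah"]

top_two = top_words(tokens, n=2)
print(top_two)
```

The resulting list of pairs can be handed straight to a data frame constructor or a word cloud library if you want the visual version.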
So as you can see, the clear-cut frontrunner was ‘premiere’ (makes sense). Next we had a couple of other interesting data points in the form of ‘red’, ‘jorah’ and ‘varys’. I’ll get more into the character analysis later, but the ‘red’ tweets could well be about the revenge for the Red Wedding that Arya orchestrated in the first moments of the episode. Furthermore, I would be remiss not to mention that ‘sheeran’ came in at #12 with over 4,000 mentions. Take that as you will.
When the closing sequence of an episode rolls, we often find ourselves asking our friends and ourselves: “which characters won (or lost) the night?” Through data analysis, we can answer this question more clearly and accurately.
These results were particularly interesting to me. Jorah came in at number one (much to his dismay) with over 6,000 mentions, while Varys placed second with nearly the same number (can’t quite remember why this would be the case… anyone?). After that we have the usual suspects: Arya, Jon, Cersei, and Sansa. To my surprise, Daenerys clocked in as the 7th most tweeted-about character, despite being the focal point of the last 15 minutes. My guess is that this is primarily due to the difficulty that comes with spelling her name, but I could be wrong. Last but not least (okay, probably least) is Hodor, who somehow managed to squeak into the top ten by being mentioned over 500 times. Hodor.
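One way to test the spelling theory would be to fold nicknames into a single character before counting. The sketch below is not what I did for the numbers above. The alias table is a hypothetical starting point, the sample tweets are made up, and each character is counted at most once per tweet.

```python
import re
from collections import Counter

# Hypothetical alias table mapping lowercase spellings to one canonical name.
ALIASES = {
    "daenerys": "Daenerys",
    "dany": "Daenerys",
    "khaleesi": "Daenerys",
    "jon": "Jon",
    "arya": "Arya",
    "jorah": "Jorah",
}


def character_mentions(tweets):
    """Count, per character, the number of tweets that mention them."""
    counts = Counter()
    for text in tweets:
        words = set(re.findall(r"[a-z]+", text.lower()))
        # A set ensures each character is counted at most once per tweet.
        mentioned = {ALIASES[w] for w in words if w in ALIASES}
        counts.update(mentioned)
    return counts


# Made-up sample tweets, for illustration only.
sample_tweets = [
    "Dany is finally home",
    "Khaleesi and Jon need to meet",
    "Poor Jorah...",
    "daenerys daenerys DAENERYS",
]

mentions = character_mentions(sample_tweets)
print(mentions.most_common())
```

Catching outright misspellings would need fuzzy matching on top of this, which is a bigger project.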
Wrap it up
So as this post comes to an end, I would like to reflect on a couple things. First off, by going through with this project, I realized the power that resides in seemingly trivial things like tweets when taken in large quantities. I plan to utilize this concept via Twitter and other mediums as I move forward with my work and idea generation practices.
Moving forward to next week, I’m thinking of putting out a similar post for each episode and then compiling the data into one set for a larger project upon the end of the short season. Please reach out to me with any ideas or questions that I can explore. Also, feel free to check out my code.
Thanks for reading! If you enjoyed this post and you’re feeling generous, perhaps follow me on Twitter. You can also subscribe in the form below to get future posts like this one straight to your inbox. 🔥