The first paper I wrote for my PhD just got published! Here’s the link to download it, and here’s the story behind the paper:
I started my PhD with the goal of critically examining the process and outcomes of social media science communication. Despite the flurry of activity in this domain and the huge amount of resources poured into digital public engagement, nobody (and I mean nobody) has paused to ask: are we making any real change? Is the public more engaged with science and more scientifically literate than, say, 10 years ago, when Facebook and Twitter weren’t the media giants they are today?
Given my engineering background, I decided to approach the problem with the methods I know best. Friends and family will know that I have been fascinated by Artificial Intelligence and Machine Learning for a while. Given the abundance of social media big data, and the immense potential of machine learning to glean meaningful patterns from it, that was what I decided to do.
First, I wanted to find out what the ingredients of an engaging science-related social media message are. What makes a message engaging? Is there a quantifiable difference between the factors that make a message engaging in science and in other fields? In other words, are the ingredients of engaging science-related messages unique to the field of science?
I set out to test my hypothesis that there are indeed such ingredients. The key to building an effective predictive model is the choice of features we use. Feature engineering in machine learning is the process of selecting discriminating and relevant attributes that characterise the dataset. For instance, to predict house prices, good features would be ‘location’, ‘property size’ and ‘number of bedrooms’. As you can imagine, there is a virtually endless number of features one can extract from social media messages, ranging from information about the author and content-specific features to information about the context, e.g. date and time, geographical features, etc.
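As a toy illustration (made-up numbers, nothing to do with the paper), this is what reducing a house to a feature vector for price prediction might look like:

```python
# Toy illustration of feature engineering (invented data, not from the paper):
# each house is reduced to a handful of discriminating attributes.
house = {
    "location": "suburb",       # categorical feature
    "property_size_sqm": 120,   # numerical feature
    "num_bedrooms": 3,          # numerical feature
}

# A model never sees the house itself, only a vector of features like this.
feature_vector = [
    1 if house["location"] == "suburb" else 0,  # one-hot encode the category
    house["property_size_sqm"],
    house["num_bedrooms"],
]
print(feature_vector)  # [1, 120, 3]
```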
My hunch was that the biggest factors influencing the engagement potential of a message are content-related, so I narrowed my feature sets down to content-related features only. I used four types of features: n-grams (individual words and sequences of words), psycholinguistic features (the psychological meaning behind words, derived using the widely-used LIWC software), grammatical features (e.g. presence of certain punctuation marks, words per sentence, etc.) and social media-specific features (e.g. presence of hashtags, URLs, photos, etc.).
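To give a flavour of what this looks like in practice, here is a simplified sketch (not the paper’s actual pipeline, and the example message is made up) of extracting some of these content features. The LIWC-based psycholinguistic features are left out because LIWC is commercial software:

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

message = "Stunning new photo of the #MilkyWay! Details: https://example.com"

# n-gram features: counts of individual words and two-word sequences
vectorizer = CountVectorizer(ngram_range=(1, 2))
ngram_features = vectorizer.fit_transform([message])
print(ngram_features.shape)  # (1, number of distinct uni- and bigrams)

# Grammatical features: punctuation presence, words per sentence
words = re.findall(r"\w+", message)
sentences = [s for s in re.split(r"[.!?]+", message) if s.strip()]
grammatical = {
    "has_exclamation": "!" in message,
    "words_per_sentence": len(words) / max(len(sentences), 1),
}

# Social media-specific features: hashtags, URLs
social = {
    "has_hashtag": bool(re.search(r"#\w+", message)),
    "has_url": bool(re.search(r"https?://", message)),
}

print(grammatical, social)
```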
I used the Python scikit-learn machine learning library to develop my classifier (i.e. the prediction model). If you have no background in programming, the learning curve for scikit-learn is pretty steep. But if you are familiar with Python, it is a very powerful machine learning library and comes with great documentation (very important!). After some experimentation, I ended up using three separate classifiers for my task: a multinomial Naive Bayes classifier, a linear model trained with Stochastic Gradient Descent, and a variant of Decision Trees called Extra Trees. I explain my rationale behind each model in the paper.
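For the curious, setting up and comparing these three models in scikit-learn looks roughly like the sketch below. This is a minimal sketch with placeholder data and default-ish hyperparameters, not the tuned configuration from the paper, and I use the ensemble ExtraTreesClassifier here as a stand-in for the Extra Trees variant:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the real feature matrix and labels:
# non-negative counts (as n-gram features would be) and binary engagement labels.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(500, 50))
y = rng.integers(0, 2, size=500)

classifiers = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Linear model (SGD)": SGDClassifier(loss="hinge", max_iter=1000),
    "Extra Trees": ExtraTreesClassifier(n_estimators=100),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```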
So what did I find? Turns out, loads! The most important and interesting findings all point to the same conclusion: yes, there are discernible patterns in engaging science-related (well, space science-related, to be precise) social media messages, and yes, they are unique to space science (I verified this with data from three other fields: politics, business and nonprofit). In other words, there are significant and quantifiable differences between engaging and non-engaging space science-related social media messages, and the unique features that make space science captivating are visual elements, anger, authenticity, visual descriptions and a tentative tone.
This should go without saying, but I do have to caution that correlation does not imply causation. It does not mean that if you write an angry tweet, it will definitely pick up thousands of retweets. Nevertheless, my findings provide an inkling of what makes the science audience tick, and they are a solid starting point for an interesting and important research project. In fact, the next issue I am investigating is the intricate relationship between engagement and trust. So stay tuned, folks.
In the spirit of open science, I have decided to publish the source code of my predictive model so that you can run the same experiments yourself and verify or disprove my findings. Unfortunately, Facebook’s and Twitter’s T&Cs prohibit me from making the raw data I collected with their APIs publicly available. However, I am publishing the script that I used to collect the data. All the code is available here: https://github.com/yiling-hwong/astro-ml.
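If you just want a feel for how such a collection script works, the general shape of fetching tweets through Twitter’s API with the tweepy library is something like this. This is not my actual script; the credentials and the account name are placeholders:

```python
import tweepy

# Placeholder credentials, obtained from Twitter's developer portal
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Fetch recent tweets from a (placeholder) science account
for tweet in api.user_timeline(screen_name="NASA", count=100,
                               tweet_mode="extended"):
    print(tweet.full_text)
```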
Download the paper here.