collecting tweets from user: realDonaldTrump (maximum rounds = 16)
collected 200 tweets from realDonaldTrump in round 1 || waiting for 15 seconds
collected 199 tweets from realDonaldTrump in round 2 || waiting for 15 seconds
collected 200 tweets from realDonaldTrump in round 3 || waiting for 15 seconds
collected 199 tweets from realDonaldTrump in round 4 || waiting for 15 seconds
collected 199 tweets from realDonaldTrump in round 5 || waiting for 15 seconds
collected 199 tweets from realDonaldTrump in round 6 || waiting for 15 seconds
collected 198 tweets from realDonaldTrump in round 7 || waiting for 15 seconds
collected 190 tweets from realDonaldTrump in round 8 || waiting for 15 seconds
collected 199 tweets from realDonaldTrump in round 9 || waiting for 15 seconds
collected 200 tweets from realDonaldTrump in round 10 || waiting for 15 seconds
collected 200 tweets from realDonaldTrump in round 11 || waiting for 15 seconds
collected 198 tweets from realDonaldTrump in round 12 || waiting for 15 seconds
collected 199 tweets from realDonaldTrump in round 13 || waiting for 15 seconds
collected 200 tweets from realDonaldTrump in round 14 || waiting for 15 seconds
collected 200 tweets from realDonaldTrump in round 15 || waiting for 15 seconds
collected 200 tweets from realDonaldTrump in round 16 || waiting for 15 seconds
realDonaldTrump completed
collecting tweets from user: JoeBiden (maximum rounds = 16)
collected 200 tweets from JoeBiden in round 1 || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 2 || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 3 || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 4 || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 5 || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 6 || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 7 || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 8 || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 9 || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 10 || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 11 || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 12 || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 13 || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 14 || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 15 || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 16 || waiting for 15 seconds
JoeBiden completed
collecting tweets from user: KamalaHarris (maximum rounds = 16)
collected 200 tweets from KamalaHarris in round 1 || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 2 || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 3 || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 4 || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 5 || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 6 || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 7 || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 8 || waiting for 15 seconds
collected 199 tweets from KamalaHarris in round 9 || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 10 || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 11 || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 12 || waiting for 15 seconds
collected 199 tweets from KamalaHarris in round 13 || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 14 || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 15 || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 16 || waiting for 15 seconds
KamalaHarris completed
collecting tweets from user: Mike_Pence (maximum rounds = 16)
collected 200 tweets from Mike_Pence in round 1 || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 2 || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 3 || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 4 || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 5 || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 6 || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 7 || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 8 || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 9 || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 10 || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 11 || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 12 || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 13 || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 14 || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 15 || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 16 || waiting for 15 seconds
Mike_Pence completed
Twitter Sentiment and Engagement: The Case of the Biden Campaign
- Introduction: 310
- Hypotheses & Sub-RQs: 523
- Gathering data: 524
- Data Exploration & Evaluation: 562
- Evaluation: 548
- Limitations and Next Steps: 544
- Ethical and Normative Considerations: 559
Introduction
As the Presidential Election for the United States draws nearer, the Joe Biden campaign has run into a problem with its Twitter campaign. With just a few weeks left before the election, the Biden communication department is sharply divided over how to use his Twitter account in the final stretch of the campaign. Several members of the communications staff believe that, in order to drum up enthusiasm among Biden supporters, his Twitter account should be used for negative campaigning. This would include character attacks and policy attacks against Donald Trump. Other members of Biden’s communication team believe just the opposite: negative campaigning will backfire for Biden. While negativity might help Biden rally some of his supporters, they argue, it will lead Trump to also go negative, which will benefit him more than Biden. Further, they argue that Democrats are different from Republicans and won’t react as favorably to negativity as Trump supporters do. As such, the campaign poses the following research question:
RQ: Do tweets from political candidates that contain negative sentiment receive more engagement than tweets from political candidates that are not negative?
Many communication challenges cannot be solved by the use of digital data. However, as the present RQ boils down to how different types of social media posts lead to different levels of online engagement with those posts, this problem is one that should be looked at through the lens of digital data. Further, this case is relevant from both a theoretical and a societal perspective. There has been ample research into both the negativity bias (Soroka & McAdams, 2015) and negative campaigning (Carraro & Castelli, 2010). This case will add to the research into whether a negativity bias also exists for political tweets, and it will flesh out the efficacy of negative campaigning on Twitter. For society, this research could also affect the campaign style of the Presidential race.
Hypotheses
People have a “negativity bias” when it comes to consuming news content, with individuals putting more weight and attention on negative information (Trussler & Soroka, 2014). Negative news, also known as “adverse media,” is news that focuses on unfavorable information and is often defined by its negative tone (Soroka, Fournier, & Nir, 2019). Studies have shown that people pay more attention to negative information than to positive information and are more likely to engage with it (Soroka & McAdams, 2015). As such, it is logical to think that negative tweets, or tweets with a negative sentiment, are more likely to attract the attention of Twitter users and lead to more engagement. Past research supports this. Oz, Zheng, and Chen (2017) found that negative tweets had higher engagement than non-negative tweets when it comes to responses to the White House’s Facebook and Twitter pages. Therefore, based on the argument of the members of Biden’s communication staff who favor negativity, the first two hypotheses are:
H1a: Negative sentiment in a tweet will be positively associated with the number of retweets of the tweet.
H1b: Negative sentiment in a tweet will be positively associated with the number of favorites of the tweet.
The opponents of the negative campaign strategy, however, have a valid point. Trump is a special case: as an avid Twitter user, he often resorts to coarse language, personal attacks, and outright incivility (Ott, 2017). Trump’s followers are not only more accustomed to the use of negative sentiment, they have actually shown a strong preference for tweets that include personal attacks (Lee & Xu, 2018). Therefore, while negativity might help Biden, it would help Trump even more. If the campaign becomes more negative on Twitter, that could backfire, leading Trump to be more negative and increasing his Twitter engagement. As such, the second set of hypotheses is:
H2a: The positive effect of negative sentiment on number of retweets will be greater for Trump tweets than for Biden tweets.
H2b: The positive effect of negative sentiment on number of favorites will be greater for Trump tweets than for Biden tweets.
Finally, the opponents of the negative campaign strategy also contend that Republicans are different from Democrats. The extensive work on ideological asymmetries by Jost (2017) backs this up. As people choose an ideology that aligns with their own psychological motivations, people of different ideologies are likely to have psychological differences. For example, research shows that Republicans have a greater need to manage uncertainty and fear, while Democrats are more willing to accept some level of uncertainty in the hopes of social progress (Jost et al., 2003). It is possible that Democrats and Republicans also respond differently to negativity. While personal attacks may work well with Republicans, that might not be the case for Democrats. Therefore, the final set of hypotheses is:
H3a: The positive effect of negative sentiment on number of retweets will be greater for Trump and Pence tweets than for Biden and Harris tweets.
H3b: The positive effect of negative sentiment on number of favorites will be greater for Trump and Pence tweets than for Biden and Harris tweets.
Data Collection
As the business challenge involves comparing tweets, the first step in gathering the data was to obtain the most recent tweets of Donald Trump, Joe Biden, Kamala Harris, and Mike Pence. To do this, the last 3,200 tweets from each Twitter account were gathered using Twitter’s API on 10 October 2020. This method was chosen for two reasons. First, as opposed to scraping, which can often miss relevant data, using Twitter’s API gives reasonable confidence that all of the planned tweets were gathered. Second, from a logistical standpoint, the present study is only concerned with recent tweets that were posted during the election cycle. Twitter’s API only allows the latest 3,200 tweets from a single user to be downloaded. This could be a problem if all of a user’s tweets were required, but since the focus is on the election, the last 3,200 tweets are sufficient.
In addition to obtaining the text of each tweet, the API downloaded some accompanying data, such as time of post, language of post, and whether media was included with the post. Also, relevant to this project, the API includes data on the overall engagement with each tweet, namely the number of retweets and the number of favorites.
In order to obtain the sentiment of the tweets, VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analysis will be run on each tweet, and the negative, positive, neutral, and compound polarity scores will be added to the dataset. VADER was chosen because it is a lexicon- and rule-based tool built specifically for sentiment in social media text (Hutto & Gilbert, 2014).
As for privacy, the tweets will be linked to the individual users, which does pose a problem for the privacy of the Twitter users. For example, they may not wish for their tweets to be included in a sentiment analysis. However, as the accounts are used in public campaigns for political office, it seems likely that the other campaigns are also investigating their Twitter data, which mitigates potential privacy concerns. Further, the privacy of the users engaging with the tweets, whether by retweeting or favoriting a post, is protected, as no data is collected on those users.
While the reasoning behind the use of Twitter’s API is sound, this does not mean the data is without potential biases. The first bias could be related to the timing of the tweets. Twitter users tweet at different rates, so the last 3,200 tweets from Trump could represent a much shorter timespan than the last 3,200 tweets from Biden, which could bias the data because of different temporal factors between users. Secondly, there is a clear bias against women and people of color in the dataset. As the dataset contains tweets of three white men and only one woman, the data is skewed towards representing white men. And finally, as only one election at one point in time is being investigated, the generalizability of the data to other elections could be questioned. That said, as the outcome variable is tweet engagement and not something like loan approval, there are no known unwarranted associations between the outcome and protected features such as race and gender.
Above are the needed packages for the project.
Get Tweets
Above is the code to retrieve the last 3,200 tweets by a user. This code was retrieved from the GetLatest3200TweetsFromUser file.
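The retrieval code itself is not reproduced in this write-up. As a minimal sketch of how such paging could work, assuming 200 tweets per request (so 16 rounds cover 3,200 tweets, as in the collection log) and a 15-second pause between requests: `fetch_page` is a hypothetical callable standing in for the actual API call, and the `max_id` logic is an assumption about how older tweets are requested, not the contents of the GetLatest3200TweetsFromUser file.

```python
import time
from typing import Callable, Dict, List, Optional


def collect_timeline(
    fetch_page: Callable[[Optional[int]], List[Dict]],
    max_rounds: int = 16,
    wait_seconds: float = 15,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict]:
    """Page backwards through a user's timeline, newest tweets first.

    fetch_page(max_id) is a hypothetical callable that returns up to 200
    tweet dicts with id <= max_id (or the newest tweets when max_id is None).
    """
    tweets: List[Dict] = []
    max_id: Optional[int] = None
    for round_number in range(1, max_rounds + 1):
        page = fetch_page(max_id)
        if not page:
            break  # nothing older is available
        tweets.extend(page)
        # Ask next for tweets strictly older than the oldest one seen so far.
        max_id = min(tweet["id"] for tweet in page) - 1
        print(f"collected {len(page)} tweets in round {round_number}")
        sleep(wait_seconds)  # pause between requests to stay within rate limits
    return tweets
```

Passing `sleep` as a parameter keeps the sketch easy to try out without actually waiting between rounds.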
Code to indicate which users’ tweets to collect.
The loop above retrieves all tweets. The code has been commented out so that the data remains the same if all the code is run again.
Trump data
created_at | id | id_str | full_text | truncated | display_text_range | entities | extended_entities | source | in_reply_to_status_id | ... | favorite_count | favorited | retweeted | possibly_sensitive | lang | retweeted_status | quoted_status_id | quoted_status_id_str | quoted_status_permalink | quoted_status | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sat Oct 10 03:09:32 +0000 2020 | 1314764977597755392 | 1314764977597755392 | I was honored to receive the first ever Presid... | False | [0, 191] | {'hashtags': [{'text': 'LESM', 'indices': [162... | {'media': [{'id': 1314700859079524352, 'id_str... | <a href="http://twitter.com/download/iphone" r... | nan | ... | 85771 | False | False | False | en | NaN | nan | NaN | NaN | NaN |
1 | Sat Oct 10 02:36:30 +0000 2020 | 1314756664143347712 | 1314756664143347712 | RT @marklevinshow: My interview with the presi... | False | [0, 129] | {'hashtags': [], 'symbols': [], 'user_mentions... | NaN | <a href="http://twitter.com/download/iphone" r... | nan | ... | 0 | False | False | False | en | {'created_at': 'Fri Oct 09 23:35:36 +0000 2020... | nan | NaN | NaN | NaN |
2 | Fri Oct 09 23:55:24 +0000 2020 | 1314716123250778114 | 1314716123250778114 | RT @realDonaldTrump: Will be in Sanford, Flori... | False | [0, 104] | {'hashtags': [], 'symbols': [], 'user_mentions... | NaN | <a href="http://twitter.com/download/iphone" r... | nan | ... | 0 | False | False | False | en | {'created_at': 'Fri Oct 09 21:04:39 +0000 2020... | nan | NaN | NaN | NaN |
3 | Fri Oct 09 23:35:09 +0000 2020 | 1314711027326562306 | 1314711027326562306 | Documents reveal that General Flynn was entrap... | False | [0, 72] | {'hashtags': [], 'symbols': [], 'user_mentions... | NaN | <a href="http://twitter.com/download/iphone" r... | nan | ... | 140093 | False | False | NaN | en | NaN | nan | NaN | NaN | NaN |
4 | Fri Oct 09 23:31:20 +0000 2020 | 1314710067699159041 | 1314710067699159041 | .@SteveScully, the Never Trumper next debate m... | False | [0, 196] | {'hashtags': [], 'symbols': [], 'user_mentions... | NaN | <a href="http://twitter.com/download/iphone" r... | nan | ... | 121620 | False | False | NaN | en | NaN | nan | NaN | NaN | NaN |
5 rows × 31 columns
created_at 0
id 0
id_str 0
full_text 0
truncated 0
display_text_range 0
entities 0
extended_entities 2514
source 0
in_reply_to_status_id 3074
in_reply_to_status_id_str 3074
in_reply_to_user_id 3071
in_reply_to_user_id_str 3071
in_reply_to_screen_name 3071
user 0
geo 3165
coordinates 3165
place 3165
contributors 3165
is_quote_status 0
retweet_count 0
favorite_count 0
favorited 0
retweeted 0
possibly_sensitive 1837
lang 0
retweeted_status 1598
quoted_status_id 2603
quoted_status_id_str 2603
quoted_status_permalink 2603
quoted_status 2830
Trump 0
Biden 0
Harris 0
Pence 0
Republican 0
dtype: int64
3165
The Trump dataset is imported. Variables are added to indicate that the tweets are from Trump, who is a Republican. Then missing values are checked for the text of the tweet as well as the newly created variables. Finally, I check the length of the dataset. The same is then done for Joe Biden, Kamala Harris, and Mike Pence.
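The labelling and checking steps described above can be sketched as follows. This is an illustration under assumptions, not the original cell: `label_candidate` is a hypothetical helper, and the two-row DataFrame is a toy stand-in for the imported Trump tweets.

```python
import pandas as pd


def label_candidate(df: pd.DataFrame, candidate: str, republican: bool) -> pd.DataFrame:
    """Return a copy of df with one 0/1 column per candidate plus a party flag."""
    labelled = df.copy()
    for name in ["Trump", "Biden", "Harris", "Pence"]:
        labelled[name] = int(name == candidate)
    labelled["Republican"] = int(republican)
    return labelled


# Toy stand-in for the imported Trump tweets (the real file is not shown here).
trump = pd.DataFrame({"full_text": ["first tweet", "second tweet"], "geo": [None, None]})
trump = label_candidate(trump, candidate="Trump", republican=True)

print(trump.isnull().sum())  # missing values per column
print(len(trump))            # number of tweets in the dataset
```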
Biden Data
created_at 0
id 0
id_str 0
full_text 0
truncated 0
display_text_range 0
entities 0
extended_entities 2113
source 0
in_reply_to_status_id 3085
in_reply_to_status_id_str 3085
in_reply_to_user_id 3085
in_reply_to_user_id_str 3085
in_reply_to_screen_name 3085
user 0
geo 3185
coordinates 3185
place 3185
contributors 3185
is_quote_status 0
retweet_count 0
favorite_count 0
favorited 0
retweeted 0
possibly_sensitive 1007
lang 0
quoted_status_id 2733
quoted_status_id_str 2733
quoted_status_permalink 2733
quoted_status 2745
retweeted_status 3031
Trump 0
Biden 0
Harris 0
Pence 0
Republican 0
dtype: int64
3185
Harris Data
created_at 0
id 0
id_str 0
full_text 0
truncated 0
display_text_range 0
entities 0
source 0
in_reply_to_status_id 3129
in_reply_to_status_id_str 3129
in_reply_to_user_id 3129
in_reply_to_user_id_str 3129
in_reply_to_screen_name 3129
user 0
geo 3183
coordinates 3183
place 3182
contributors 3183
is_quote_status 0
retweet_count 0
favorite_count 0
favorited 0
retweeted 0
lang 0
possibly_sensitive 1272
retweeted_status 2760
extended_entities 2370
quoted_status_id 2706
quoted_status_id_str 2706
quoted_status_permalink 2706
quoted_status 2729
Trump 0
Biden 0
Harris 0
Pence 0
Republican 0
dtype: int64
3183
Pence Data
created_at 0
id 0
id_str 0
full_text 0
truncated 0
display_text_range 0
entities 0
source 0
in_reply_to_status_id 3124
in_reply_to_status_id_str 3124
in_reply_to_user_id 3124
in_reply_to_user_id_str 3124
in_reply_to_screen_name 3124
user 0
geo 3185
coordinates 3185
place 3185
contributors 3185
retweeted_status 928
is_quote_status 0
retweet_count 0
favorite_count 0
favorited 0
retweeted 0
lang 0
possibly_sensitive 1814
extended_entities 2181
quoted_status_id 3066
quoted_status_id_str 3066
quoted_status_permalink 3066
quoted_status 3167
Trump 0
Biden 0
Harris 0
Pence 0
Republican 0
dtype: int64
Merge
1.0
Finally, the four datasets are merged. Then a quick check is run to make sure the length of the new dataset is correct.
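A minimal sketch of such a merge and length check, using toy stand-ins for the four datasets. The ratio check is an assumption: the 1.0 in the output is consistent with dividing the merged length by the sum of the individual lengths, but the original code is not shown.

```python
import pandas as pd

# Toy stand-ins for the four per-candidate datasets.
trump = pd.DataFrame({"full_text": ["a", "b"], "Trump": 1})
biden = pd.DataFrame({"full_text": ["c"], "Trump": 0})
harris = pd.DataFrame({"full_text": ["d", "e", "f"], "Trump": 0})
pence = pd.DataFrame({"full_text": ["g"], "Trump": 0})

frames = [trump, biden, harris, pence]
# Stack the datasets on top of each other; columns missing from a frame become NaN.
merged = pd.concat(frames, ignore_index=True, sort=False)

# A ratio of 1.0 means no rows were lost or duplicated by the merge.
ratio = len(merged) / sum(len(frame) for frame in frames)
print(ratio)
```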
Data Cleaning
First, a simple inspection of the data is performed.
full_text | retweet_count | favorite_count | |
---|---|---|---|
0 | I was honored to receive the first ever Presid... | 20884 | 85771 |
1 | RT @marklevinshow: My interview with the presi... | 17307 | 0 |
2 | RT @realDonaldTrump: Will be in Sanford, Flori... | 25471 | 0 |
3 | Documents reveal that General Flynn was entrap... | 41969 | 140093 |
4 | .@SteveScully, the Never Trumper next debate m... | 33220 | 121620 |
12718
created_at 0
id 0
id_str 0
full_text 0
truncated 0
display_text_range 0
entities 0
extended_entities 9178
source 0
in_reply_to_status_id 12412
in_reply_to_status_id_str 12412
in_reply_to_user_id 12409
in_reply_to_user_id_str 12409
in_reply_to_screen_name 12409
user 0
geo 12718
coordinates 12718
place 12717
contributors 12718
is_quote_status 0
retweet_count 0
favorite_count 0
favorited 0
retweeted 0
possibly_sensitive 5930
lang 0
retweeted_status 8317
quoted_status_id 11108
quoted_status_id_str 11108
quoted_status_permalink 11108
quoted_status 11471
Trump 0
Biden 0
Harris 0
Pence 0
Republican 0
dtype: int64
Drop Retweets
The first task was to drop unwanted observations. For this project, tweets that are retweets are not of interest. This was decided for two reasons. First, the research question and hypotheses are about the negativity of the candidates’ own tweets, that is, the tweets they write themselves, not the tweets written by other people. It therefore makes sense to exclude retweets. Second, from a more practical standpoint, retweets are not favorited; only the original tweet can be favorited. As a result, all retweets have a favorite count of zero, which is not an accurate representation of how much people liked or engaged with the retweet. It was therefore decided to drop all retweets from the dataset. To do so, a new variable was created to indicate whether a tweet was a retweet, and every tweet flagged as a retweet was dropped.
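One way to build such a flag is sketched below, assuming a tweet counts as a retweet whenever the API attached a `retweeted_status` object. This matches the merged data, where `retweeted_status` is missing for 8,317 of the 12,718 tweets, but it is an inference rather than the original code. The rows are toy examples.

```python
import numpy as np
import pandas as pd

# Toy rows: the second one carries a retweeted_status object, so it is a retweet.
tweets = pd.DataFrame({
    "full_text": ["an original tweet", "RT @someone: a retweeted tweet", "another original"],
    "retweeted_status": [np.nan, {"id": 1}, np.nan],
})

# Assumption: a non-missing retweeted_status marks the tweet as a retweet.
tweets["is_retweet"] = tweets["retweeted_status"].notna().astype(int)
print(tweets["is_retweet"].sum())  # number of retweets

# Keep only the tweets written by the account itself.
originals = tweets[tweets["is_retweet"] == 0]
print(len(originals))
```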
full_text | is_retweet | |
---|---|---|
0 | I was honored to receive the first ever Presid... | 0 |
1 | RT @marklevinshow: My interview with the presi... | 1 |
2 | RT @realDonaldTrump: Will be in Sanford, Flori... | 1 |
3 | Documents reveal that General Flynn was entrap... | 0 |
4 | .@SteveScully, the Never Trumper next debate m... | 0 |
4401
4,401 of the tweets were retweets.
8317
The new dataset has 8,317 tweets, none of which are retweets.
full_text | retweet_count | favorite_count | is_retweet | |
---|---|---|---|---|
0 | I was honored to receive the first ever Presid... | 20884 | 85771 | 0 |
3 | Documents reveal that General Flynn was entrap... | 41969 | 140093 | 0 |
4 | .@SteveScully, the Never Trumper next debate m... | 33220 | 121620 | 0 |
5 | Thank you @SenatorDole. So true! https://t.co/... | 15147 | 58881 | 0 |
6 | https://t.co/UGIAvC7VA3 | 19078 | 54239 | 0 |
The index of the dataset was then reset.
Check date of Tweets
Next, it was important to ensure that none of the tweets were from before the election cycle, so the created_at variable was converted into a datetime variable.
0 Sat Oct 10 03:09:32 +0000 2020
1 Fri Oct 09 23:35:09 +0000 2020
2 Fri Oct 09 23:31:20 +0000 2020
3 Fri Oct 09 23:01:54 +0000 2020
4 Fri Oct 09 22:30:20 +0000 2020
Name: created_at, dtype: object
0 2020-10-10 03:09:32+00:00
1 2020-10-09 23:35:09+00:00
2 2020-10-09 23:31:20+00:00
3 2020-10-09 23:01:54+00:00
4 2020-10-09 22:30:20+00:00
Name: created_at, dtype: datetime64[ns, UTC]
count 8317
unique 8191
top 2020-05-19 22:23:51+00:00
freq 4
first 2019-08-05 17:58:00+00:00
last 2020-10-10 03:09:32+00:00
Name: created_at, dtype: object
The oldest tweet is from August 5th, 2019. This is after all four had begun campaigning so no tweets need to be dropped.
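The conversion and range check can be sketched as follows. The two timestamps are taken from the output above; the explicit `format` string is an assumption about how the Twitter date strings would be parsed.

```python
import pandas as pd

created_at = pd.Series(
    ["Sat Oct 10 03:09:32 +0000 2020", "Fri Oct 09 23:35:09 +0000 2020"],
    name="created_at",
)

# Parse Twitter's date format into timezone-aware (UTC) timestamps.
parsed = pd.to_datetime(created_at, format="%a %b %d %H:%M:%S %z %Y", utc=True)

# The earliest and latest timestamps show the period the tweets cover.
print(parsed.min(), parsed.max())
```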
index | created_at | id | id_str | full_text | truncated | display_text_range | entities | extended_entities | source | ... | quoted_status_id | quoted_status_id_str | quoted_status_permalink | quoted_status | Trump | Biden | Harris | Pence | Republican | is_retweet | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7388 | 199 | 2019-08-05 17:58:00+00:00 | 1158437011692429314 | 1158437011692429314 | Gun violence is an epidemic. It impacts our co... | False | [0, 179] | {'hashtags': [], 'symbols': [], 'user_mentions... | NaN | <a href="https://sproutsocial.com" rel="nofoll... | ... | 1158211041999970304.000 | 1158211041999970317 | {'url': 'https://t.co/GqZAZurc8D', 'expanded':... | {'created_at': 'Mon Aug 05 03:00:05 +0000 2019... | 0 | 0 | 1 | 0 | 0 | 0 |
1 rows × 38 columns
index | created_at | id | id_str | full_text | truncated | display_text_range | entities | extended_entities | source | ... | quoted_status_id | quoted_status_id_str | quoted_status_permalink | quoted_status | Trump | Biden | Harris | Pence | Republican | is_retweet | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4628 | 199 | 2019-10-26 21:03:00+00:00 | 1188199370463821824 | 1188199370463821824 | If you work hard, you should be able to share ... | False | [0, 276] | {'hashtags': [], 'symbols': [], 'user_mentions... | NaN | <a href="https://about.twitter.com/products/tw... | ... | nan | NaN | NaN | NaN | 0 | 1 | 0 | 0 | 0 | 0 |
1 rows × 38 columns
index | created_at | id | id_str | full_text | truncated | display_text_range | entities | extended_entities | source | ... | quoted_status_id | quoted_status_id_str | quoted_status_permalink | quoted_status | Trump | Biden | Harris | Pence | Republican | is_retweet | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1597 | 185 | 2020-07-17 16:25:03+00:00 | 1284162207232733185 | 1284162207232733185 | THANK YOU to the 5 million members of the @NRA... | False | [0, 284] | {'hashtags': [], 'symbols': [], 'user_mentions... | NaN | <a href="http://twitter.com/download/iphone" r... | ... | 1283748224243728384.000 | 1283748224243728384 | {'url': 'https://t.co/8ZhChqxgBI', 'expanded':... | {'created_at': 'Thu Jul 16 13:00:02 +0000 2020... | 1 | 0 | 0 | 0 | 1 | 0 |
1 rows × 38 columns
Add sentiment scores of each tweet
To add the sentiment scores of the tweets, I created a for loop that added the scores to lists that were then added to the dataset.
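A minimal sketch of such a loop is given below. The function name `add_sentiment_scores` is hypothetical, and the analyzer is passed in as a parameter so the sketch does not depend on which VADER distribution was used in the project.

```python
import pandas as pd


def add_sentiment_scores(df: pd.DataFrame, analyzer, text_col: str = "full_text") -> pd.DataFrame:
    """Append VADER-style polarity scores to a copy of df.

    analyzer is any object with a polarity_scores(text) method returning a
    dict with 'pos', 'neg', 'neu' and 'compound' keys, for example an
    instance of vaderSentiment's SentimentIntensityAnalyzer.
    """
    positive, negative, neutral, compound = [], [], [], []
    for text in df[text_col]:
        scores = analyzer.polarity_scores(text)
        positive.append(scores["pos"])
        negative.append(scores["neg"])
        neutral.append(scores["neu"])
        compound.append(scores["compound"])

    scored = df.copy()
    scored["positive"] = positive
    scored["negative"] = negative
    scored["neutral"] = neutral
    scored["compound"] = compound
    return scored


# Usage, assuming the vaderSentiment package is installed:
# from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# scored = add_sentiment_scores(tweets, SentimentIntensityAnalyzer())
```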
full_text | positive | negative | neutral | compound | |
---|---|---|---|---|---|
0 | I was honored to receive the first ever Presid... | 0.270 | 0.000 | 0.730 | 0.836 |
1 | Documents reveal that General Flynn was entrap... | 0.000 | 0.000 | 1.000 | 0.000 |
2 | .@SteveScully, the Never Trumper next debate m... | 0.000 | 0.173 | 0.827 | -0.742 |
3 | Thank you @SenatorDole. So true! https://t.co/... | 0.616 | 0.000 | 0.384 | 0.751 |
4 | https://t.co/UGIAvC7VA3 | 0.000 | 0.000 | 1.000 | 0.000 |
Media in tweet
Next, I added the control variable for whether media was included in the tweet. As some tweets have photos or videos while others do not, it is important to control for differences that might affect the overall engagement. I did this by adding a variable for whether the ‘extended_entities’ variable mentioned media or not. I used a function provided in the ‘useful functions’ file.
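The function from the ‘useful functions’ file is not shown here. A hypothetical equivalent, assuming `extended_entities` holds either a dict with a `media` list or a missing value, might look like this:

```python
import numpy as np
import pandas as pd


def has_media(extended_entities) -> int:
    """Return 1 when extended_entities holds at least one media item, else 0."""
    if isinstance(extended_entities, dict) and extended_entities.get("media"):
        return 1
    return 0


# Toy rows: only the first tweet has media attached.
tweets = pd.DataFrame({"extended_entities": [{"media": [{"id": 1}]}, np.nan, np.nan]})
tweets["media"] = tweets["extended_entities"].apply(has_media)
print(tweets["media"].tolist())  # [1, 0, 0]
```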
0 {'media': [{'id': 1314700859079524352, 'id_str...
1 NaN
2 NaN
3 NaN
4 NaN
Name: extended_entities, dtype: object
media | extended_entities | |
---|---|---|
0 | 1 | {'media': [{'id': 1314700859079524352, 'id_str... |
1 | 0 | NaN |
2 | 0 | NaN |
3 | 0 | NaN |
4 | 0 | NaN |
Length of Tweet
A control variable for the length of the tweet was also created. Past research has shown that tweets of different lengths have different effects (Han, Gu, & Peng, 2019), so it is important to control for these differences.
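How the length was computed is not shown. The first values in the output (191, 72, 196) match the upper bounds of `display_text_range` in the Trump preview, so one plausible sketch, offered as an assumption, derives the length from that range:

```python
import pandas as pd

# Ranges copied from the first three non-retweet rows of the Trump preview.
tweets = pd.DataFrame({"display_text_range": [[0, 191], [0, 72], [0, 196]]})

# Assumption: length = number of displayed characters (end minus start of the range).
tweets["length"] = tweets["display_text_range"].apply(lambda r: r[1] - r[0])
print(tweets["length"].tolist())  # [191, 72, 196]
```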
0 191
1 72
2 196
3 56
4 23
Name: length, dtype: int64
index int64
created_at datetime64[ns, UTC]
id int64
id_str object
full_text object
truncated bool
display_text_range object
entities object
extended_entities object
source object
in_reply_to_status_id object
in_reply_to_status_id_str object
in_reply_to_user_id object
in_reply_to_user_id_str object
in_reply_to_screen_name object
user object
geo object
coordinates object
place object
contributors object
is_quote_status bool
retweet_count int64
favorite_count int64
favorited bool
retweeted bool
possibly_sensitive object
lang object
retweeted_status object
quoted_status_id float64
quoted_status_id_str object
quoted_status_permalink object
quoted_status object
Trump int64
Biden int64
Harris int64
Pence int64
Republican int64
is_retweet int64
positive float64
negative float64
neutral float64
compound float64
media int64
length int64
dtype: object
negative 0
length 0
media 0
retweet_count 0
favorite_count 0
Trump 0
Republican 0
dtype: int64
No missing values in any of the variables of interest.
Data Exploration and Evaluation
To begin the data exploration and evaluation process, descriptive tables were made. A summary of the descriptive statistic findings can be found at the end of this section.
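Descriptive tables of this kind can be produced with pandas’ `describe`. The sketch below uses toy numbers and assumes a `user` column holding the candidate’s name. The real values are the ones reported in the tables of this section.

```python
import pandas as pd

# Toy data; the real tables in this section are based on 8,317 tweets.
tweets = pd.DataFrame({
    "user": ["Biden", "Biden", "Trump", "Trump"],
    "retweet_count": [10, 30, 100, 300],
    "negative": [0.0, 0.2, 0.1, 0.3],
})

columns = ["retweet_count", "negative"]
overall = tweets[columns].describe().T                  # one row per variable
by_user = tweets.groupby("user")[columns].describe().T  # (variable, statistic) rows, one column per user
print(overall)
print(by_user)
```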
variable | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
retweet_count | 8317.000 | 9504.060 | 15685.700 | 0.000 | 1044.000 | 3803.000 | 11897.000 | 415300.000 |
favorite_count | 8317.000 | 45996.824 | 84388.376 | 0.000 | 4746.000 | 17474.000 | 54063.000 | 1897125.000 |
negative | 8317.000 | 0.075 | 0.100 | 0.000 | 0.000 | 0.037 | 0.127 | 0.831 |
length | 8317.000 | 184.580 | 83.943 | 7.000 | 118.000 | 199.000 | 260.000 | 320.000 |
media | 8317.000 | 0.340 | 0.474 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 |
variable | statistic | Biden | Harris | Pence | Trump |
---|---|---|---|---|---|
retweet_count | count | 3031.000 | 2760.000 | 928.000 | 1598.000 |
retweet_count | mean | 9488.394 | 4064.164 | 1925.446 | 23330.440 |
retweet_count | std | 16340.594 | 7964.101 | 2845.855 | 19608.835 |
retweet_count | min | 11.000 | 2.000 | 68.000 | 0.000 |
retweet_count | 25% | 1626.000 | 696.000 | 514.750 | 11194.250 |
retweet_count | 50% | 4719.000 | 1648.500 | 940.000 | 18227.500 |
retweet_count | 75% | 11310.000 | 4281.500 | 1971.500 | 29937.750 |
retweet_count | max | 327694.000 | 184872.000 | 26943.000 | 415300.000 |
favorite_count | count | 3031.000 | 2760.000 | 928.000 | 1598.000 |
favorite_count | mean | 50704.301 | 21228.082 | 9487.755 | 101049.254 |
favorite_count | std | 99206.297 | 44229.749 | 13808.130 | 100050.109 |
favorite_count | min | 34.000 | 12.000 | 259.000 | 0.000 |
favorite_count | 25% | 7168.500 | 2955.000 | 2698.000 | 44051.250 |
favorite_count | 50% | 20899.000 | 7642.000 | 4863.500 | 73987.000 |
favorite_count | 75% | 53120.500 | 21452.000 | 9785.250 | 125285.250 |
favorite_count | max | 1897125.000 | 1001691.000 | 167461.000 | 1885859.000 |
negative | count | 3031.000 | 2760.000 | 928.000 | 1598.000 |
negative | mean | 0.079 | 0.090 | 0.027 | 0.072 |
negative | std | 0.096 | 0.100 | 0.054 | 0.118 |
negative | min | 0.000 | 0.000 | 0.000 | 0.000 |
negative | 25% | 0.000 | 0.000 | 0.000 | 0.000 |
negative | 50% | 0.050 | 0.066 | 0.000 | 0.000 |
negative | 75% | 0.133 | 0.147 | 0.038 | 0.117 |
negative | max | 0.658 | 0.612 | 0.363 | 0.831 |
variable | statistic | Trump = 0 | Trump = 1 |
---|---|---|---|
negative | count | 6719.000 | 1598.000 |
negative | mean | 0.076 | 0.072 |
negative | std | 0.095 | 0.118 |
negative | min | 0.000 | 0.000 |
negative | 25% | 0.000 | 0.000 |
negative | 50% | 0.046 | 0.000 |
negative | 75% | 0.128 | 0.117 |
negative | max | 0.658 | 0.831 |
variable | statistic | Republican = 0 | Republican = 1 |
---|---|---|---|
retweet_count | count | 5791.000 | 2526.000 |
retweet_count | mean | 6903.197 | 15466.689 |
retweet_count | std | 13315.315 | 18780.076 |
retweet_count | min | 2.000 | 0.000 |
retweet_count | 25% | 940.500 | 1476.500 |
retweet_count | 50% | 2830.000 | 10606.500 |
retweet_count | 75% | 7677.000 | 22462.000 |
retweet_count | max | 327694.000 | 415300.000 |
favorite_count | count | 5791.000 | 2526.000 |
favorite_count | mean | 36655.887 | 67411.459 |
favorite_count | std | 79368.617 | 91379.912 |
favorite_count | min | 12.000 | 0.000 |
favorite_count | 25% | 4134.000 | 7624.500 |
favorite_count | 50% | 12720.000 | 42340.000 |
favorite_count | 75% | 36720.000 | 94136.750 |
favorite_count | max | 1897125.000 | 1885859.000 |
negative | count | 5791.000 | 2526.000 |
negative | mean | 0.084 | 0.055 |
negative | std | 0.098 | 0.102 |
negative | min | 0.000 | 0.000 |
negative | 25% | 0.000 | 0.000 |
negative | 50% | 0.057 | 0.000 |
negative | 75% | 0.140 | 0.081 |
negative | max | 0.658 | 0.831 |
Distribution plot of negative sentiment scores.
[Figure: density plot; x-axis: negative, y-axis: Density]
This plot isn’t the most informative, as many of the values are 0. To get a more informative graph, the distribution plot was zoomed in.
[Figure: the same density plot with axis limits set to (0.0, 10.0)]
Distribution plot of favorite count
[Figure: density plot; x-axis: favorite_count, y-axis: Density]
This plot isn’t the most informative, as the distribution is heavily right-skewed: the 75th percentile is 54,063 favorites, while the maximum is 1,897,125. To get a more informative graph, the distribution plot was zoomed in.
[Figure: the same density plot with axis limits set to (0.0, 200000.0)]
Distribution plot of the logarithmic transformation of favorite count
[Figure: density plot; x-axis: favorite_count (log-transformed), y-axis: Density]
Distribution plot of retweet count
[Figure: density plot with axis limits set to (0.0, 80000.0)]
Distribution plot of the logarithmic transformation of retweet count
[Figure: density plot; x-axis: retweet_count (log-transformed), y-axis: Density]
Distribution plot of the length of the tweet
<AxesSubplot:xlabel='length', ylabel='Density'>
<AxesSubplot:xlabel='media', ylabel='count'>
Above is a countplot for whether media was part of the tweet or not.
<AxesSubplot:xlabel='user', ylabel='favorite_count'>
Above is a barplot for average number of favorites per tweet by user.
<AxesSubplot:xlabel='user', ylabel='retweet_count'>
Above is a barplot for average number of retweets per tweet by user.
<AxesSubplot:xlabel='user', ylabel='negative'>
Above is a barplot for average negative sentiment per tweet by user.
<AxesSubplot:xlabel='Trump', ylabel='negative'>
Above is a barplot of the average negative sentiment per tweet between Trump tweets and non-Trump tweets.
<AxesSubplot:xlabel='Republican', ylabel='negative'>
Above is a barplot of the average negative sentiment per tweet between Republican tweets and non-Republican tweets.
<AxesSubplot:xlabel='negative', ylabel='retweet_count'>
Above is a regression plot of negative sentiment against retweet count.
<AxesSubplot:xlabel='negative', ylabel='retweet_count'>
Above is a regression plot of negative sentiment against the logarithmic transformation of retweet count.
<AxesSubplot:xlabel='negative', ylabel='favorite_count'>
Above is a regression plot of negative sentiment against favorite count.
<AxesSubplot:xlabel='negative', ylabel='favorite_count'>
Above is a regression plot of negative sentiment against the logarithmic transformation of favorite count.
Summary for stakeholders
The present research project uses two different dependent variables or outcomes for the concept of engagement. The first variable is the number of retweets each tweet has received. A retweet is when the tweet is reposted by another individual. For these four users, the average number of retweets was 9,504.06 (SD = 15,685.70). Trump had by far the highest average number of retweets (M = 23,330.44; SD = 19,608.84), followed by Biden (M = 9,488.39; SD = 16,340.59), Harris (M = 4,064.16; SD = 7,964.10), and Pence (M = 1,925.45; SD = 16,340.59), respectively. The second variable is the number of favorites each tweet has received. A favorite is when the tweet is liked or ‘favorited’ by another individual. For these four users, the average number of favorites was 45,996.82 (SD = 84,388.38). Trump again had the highest average number of favorites (M = 100,050.11; SD = 100,050.11), followed by Biden (M = 50,704.30; SD = 99,206.30), Harris (M = 21,228.08; SD = 44,229.75), and Pence (M = 9,487.76; SD = 13,808.13), respectively.
As for the sentiment of the tweets, the average tweet was not very negative, with an average negative polarity of 0.08 (SD = 0.10), with 0 being neutral and 1 being completely negative. Harris was the most negative (M = 0.09; SD = 0.10), followed closely by Biden (M = 0.08; SD = 0.10) and Trump (M = 0.07; SD = 0.12), with Pence being the least negative (M = 0.03; SD = 0.05).
Turning from individual users to differences by party among the Presidential and Vice Presidential candidates, Republicans on average had more retweets per tweet (M = 15,466.69; SD = 18,780.08) than Democrats (M = 6,903.20; SD = 13,315.32) and more favorites per tweet (M = 67,411.46; SD = 91,379.91) than Democrats (M = 36,655.89; SD = 79,368.62). This is driven mostly by Trump's popularity. In terms of negativity, Democrats had a higher average negative polarity score (M = 0.08, SD = 0.10) than Republicans (M = 0.06, SD = 0.10).
The average length of the tweets was 184.58 characters (SD = 83.94), and about a third (34%) of the tweets included some form of media such as a video or photograph.
The distributions for retweet count, favorite count, and negative sentiment are positively skewed due to the high number of values around zero and due to the large number of positive outliers, making the data unbalanced. As these variables are not normally distributed, this could violate the regression assumption of normality as it implies that residuals might also not be normally distributed. This can be checked with a plot of errors, and if they are not normally distributed, this could be addressed using a log transformation, as shown in the distribution plots. However, for the sake of model interpretability and machine learning predictions, this project will use the original data without transformations (except for the above regression plots). This is a possible drawback, however, which is discussed in the limitation section below.
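As a rough illustration of why a log transformation helps with positively skewed counts, the sketch below compares the skewness of a small set of counts before and after the transformation. The counts are hypothetical values chosen to resemble the shape described above, not the project's data, and `skewness` is a helper written for this example.

```python
import math
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical engagement counts: mostly small values plus one viral outlier.
counts = [0, 12, 40, 95, 300, 940, 2830, 7677, 22462, 415300]

# log1p(x) = log(1 + x) is defined at zero, which matters because some counts are 0.
log_counts = [math.log1p(c) for c in counts]

raw_skew = skewness(counts)
log_skew = skewness(log_counts)
print(f"skewness of raw counts:   {raw_skew:.2f}")
print(f"skewness of log1p counts: {log_skew:.2f}")
```

The price of the transformation is interpretability: coefficients would then describe changes in log counts rather than in raw retweets or favorites.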
Because of this skew, the regression plots with negativity as the IV and retweet count or favorite count as the DV are not very informative. However, when the log is taken of the DVs, there seems to be a slight positive relationship between negative sentiment and engagement, as indicated by the slope of the regression line.
Models
Model 1: retweet count without controls
OLS Regression Results
==============================================================================
Dep. Variable: retweet_count R-squared: 0.218
Model: OLS Adj. R-squared: 0.217
Method: Least Squares F-statistic: 578.0
Date: Sun, 18 Oct 2020 Prob (F-statistic): 0.00
Time: 18:28:33 Log-Likelihood: -91127.
No. Observations: 8317 AIC: 1.823e+05
Df Residuals: 8312 BIC: 1.823e+05
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 8879.9043 280.050 31.708 0.000 8330.936 9428.872
negative 7716.0918 1547.219 4.987 0.000 4683.157 1.07e+04
Trump 1.39e+04 429.173 32.384 0.000 1.31e+04 1.47e+04
Pence -7164.3142 526.749 -13.601 0.000 -8196.874 -6131.755
Harris -5507.5940 365.513 -15.068 0.000 -6224.091 -4791.097
==============================================================================
Omnibus: 12425.873 Durbin-Watson: 1.481
Prob(Omnibus): 0.000 Jarque-Bera (JB): 9983022.957
Skew: 8.916 Prob(JB): 0.00
Kurtosis: 171.789 Cond. No. 11.1
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Fit
The first model investigates the main effect of negative sentiment on retweet count. Binary variables are added for Trump, Pence, and Harris, so the reference category is Biden tweets. The R-squared is 0.22, indicating that 22% of the variance in retweet count is explained by the model.
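To make the role of the reference category concrete, the following sketch turns the Model 1 coefficients into fitted values. The numbers are copied from the summary above; the function and variable names are only illustrative. Biden has no dummy variable, so his fitted value is the intercept plus the negativity term, and every other user is shifted by that user's coefficient.

```python
# Coefficients copied from the Model 1 summary above. The Trump coefficient
# is only printed to three significant figures there (1.39e+04).
INTERCEPT = 8879.9043
SLOPE_NEGATIVE = 7716.0918
USER_OFFSET = {
    "Biden": 0.0,          # reference category: no dummy variable
    "Trump": 1.39e4,
    "Pence": -7164.3142,
    "Harris": -5507.5940,
}

def predict_retweets(user: str, negative: float) -> float:
    """Fitted retweet count for a tweet by `user` with a given negative score."""
    return INTERCEPT + SLOPE_NEGATIVE * negative + USER_OFFSET[user]

for user in USER_OFFSET:
    print(f"{user}: {predict_retweets(user, 0.0):,.0f} retweets at negative = 0")
```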
Model 2: retweet count with controls
OLS Regression Results
==============================================================================
Dep. Variable: retweet_count R-squared: 0.235
Model: OLS Adj. R-squared: 0.234
Method: Least Squares F-statistic: 425.3
Date: Sun, 18 Oct 2020 Prob (F-statistic): 0.00
Time: 18:28:33 Log-Likelihood: -91034.
No. Observations: 8317 AIC: 1.821e+05
Df Residuals: 8310 BIC: 1.821e+05
Df Model: 6
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1.409e+04 495.004 28.463 0.000 1.31e+04 1.51e+04
negative 7869.0982 1557.796 5.051 0.000 4815.429 1.09e+04
Trump 1.233e+04 445.700 27.675 0.000 1.15e+04 1.32e+04
Harris -5772.8657 362.105 -15.943 0.000 -6482.682 -5063.049
Pence -6395.8029 525.282 -12.176 0.000 -7425.486 -5366.119
length -20.1916 1.941 -10.403 0.000 -23.996 -16.387
media -3507.9851 331.520 -10.582 0.000 -4157.847 -2858.124
==============================================================================
Omnibus: 12443.464 Durbin-Watson: 1.486
Prob(Omnibus): 0.000 Jarque-Bera (JB): 10317471.100
Skew: 8.928 Prob(JB): 0.00
Kurtosis: 174.621 Cond. No. 2.11e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.11e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Fit
The second model mirrors the first but includes the control variables of length and media. The R-squared improved to 0.24, so model 2 is preferred over model 1.
The positive effect of negative sentiment on retweet count can be visualized above.
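Because R-squared never decreases when predictors are added, the adjusted R-squared reported in the summaries is the fairer basis for comparing Model 1 with Model 2. The sketch below recomputes it from the rounded R-squared values, the number of observations, and the number of predictors (Df Model). Since the inputs are rounded, the results can differ from the summaries in the third decimal place.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

N_OBS = 8317  # "No. Observations" in both summaries

model_1 = adjusted_r2(0.218, N_OBS, 4)  # negative + three user dummies
model_2 = adjusted_r2(0.235, N_OBS, 6)  # adds length and media

print(f"Model 1 adjusted R-squared: {model_1:.3f}")
print(f"Model 2 adjusted R-squared: {model_2:.3f}")
```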
Model 3: favorite count without controls
OLS Regression Results
==============================================================================
Dep. Variable: favorite_count R-squared: 0.132
Model: OLS Adj. R-squared: 0.132
Method: Least Squares F-statistic: 317.4
Date: Sun, 18 Oct 2020 Prob (F-statistic): 1.61e-254
Time: 18:28:34 Log-Likelihood: -1.0555e+05
No. Observations: 8317 AIC: 2.111e+05
Df Residuals: 8312 BIC: 2.111e+05
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 5.003e+04 1586.501 31.535 0.000 4.69e+04 5.31e+04
negative 8544.2518 8765.092 0.975 0.330 -8637.514 2.57e+04
Trump 5.041e+04 2431.291 20.733 0.000 4.56e+04 5.52e+04
Harris -2.957e+04 2070.655 -14.280 0.000 -3.36e+04 -2.55e+04
Pence -4.078e+04 2984.066 -13.664 0.000 -4.66e+04 -3.49e+04
==============================================================================
Omnibus: 11758.802 Durbin-Watson: 1.428
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5440616.514
Skew: 8.187 Prob(JB): 0.00
Kurtosis: 127.224 Cond. No. 11.1
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Fit
The third model investigates the main effect of negative sentiment on favorite count. Binary variables are added for Trump, Pence, and Harris, so again the reference category is Biden tweets. The R-squared is 0.13, indicating that 13% of the variance in favorite count is explained by the model.
Model 4: favorite count with controls
OLS Regression Results
==============================================================================
Dep. Variable: favorite_count R-squared: 0.170
Model: OLS Adj. R-squared: 0.170
Method: Least Squares F-statistic: 284.7
Date: Sun, 18 Oct 2020 Prob (F-statistic): 0.00
Time: 18:28:34 Log-Likelihood: -1.0536e+05
No. Observations: 8317 AIC: 2.107e+05
Df Residuals: 8310 BIC: 2.108e+05
Df Model: 6
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 9.223e+04 2773.014 33.259 0.000 8.68e+04 9.77e+04
negative 1.115e+04 8726.784 1.278 0.201 -5954.323 2.83e+04
Trump 3.761e+04 2496.816 15.063 0.000 3.27e+04 4.25e+04
Harris -3.164e+04 2028.517 -15.597 0.000 -3.56e+04 -2.77e+04
Pence -3.484e+04 2942.633 -11.839 0.000 -4.06e+04 -2.91e+04
length -166.7139 10.873 -15.333 0.000 -188.027 -145.401
media -2.691e+04 1857.176 -14.490 0.000 -3.06e+04 -2.33e+04
==============================================================================
Omnibus: 11746.789 Durbin-Watson: 1.423
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5685698.422
Skew: 8.148 Prob(JB): 0.00
Kurtosis: 130.049 Cond. No. 2.11e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.11e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Fit
The fourth model mirrors the third but includes the control variables of length and media. The R-squared improved to 0.17, so model 4 is preferred over model 3.
The positive effect of negative sentiment on favorite count can be visualized above.
Model 5: Interaction between negative sentiment and Trump tweets on retweet count
OLS Regression Results
==============================================================================
Dep. Variable: retweet_count R-squared: 0.237
Model: OLS Adj. R-squared: 0.236
Method: Least Squares F-statistic: 368.6
Date: Sun, 18 Oct 2020 Prob (F-statistic): 0.00
Time: 18:28:35 Log-Likelihood: -91023.
No. Observations: 8317 AIC: 1.821e+05
Df Residuals: 8309 BIC: 1.821e+05
Df Model: 7
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 1.451e+04 502.684 28.869 0.000 1.35e+04 1.55e+04
negative 3575.2435 1809.547 1.976 0.048 28.079 7122.408
Trump 1.113e+04 515.496 21.585 0.000 1.01e+04 1.21e+04
negative:Trump 1.593e+04 3428.586 4.647 0.000 9212.256 2.27e+04
Harris -5720.2414 361.835 -15.809 0.000 -6429.528 -5010.955
Pence -6644.8885 527.363 -12.600 0.000 -7678.652 -5611.125
length -20.8055 1.943 -10.708 0.000 -24.614 -16.997
media -3397.9422 331.955 -10.236 0.000 -4048.658 -2747.227
==============================================================================
Omnibus: 12482.896 Durbin-Watson: 1.488
Prob(Omnibus): 0.000 Jarque-Bera (JB): 10530335.745
Skew: 8.981 Prob(JB): 0.00
Kurtosis: 176.391 Cond. No. 4.84e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.84e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Fit
The fifth model investigates the interaction effect of negative sentiment and Trump tweets on retweet count. Binary variables are added for Trump, Pence, and Harris, so again the reference category is Biden tweets. The R-squared is 0.24, indicating that 24% of the variance in retweet count is explained by the model.
The greater positive effect of negative sentiment on retweet count for Trump can be visualized above.
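In a model with an interaction term, the coefficient on negative is the slope for non-Trump tweets only. The slope for Trump tweets is that coefficient plus the interaction coefficient. A minimal sketch using the Model 5 values from the summary above (the function name is illustrative):

```python
# Coefficients copied from the Model 5 summary above; the interaction
# coefficient is printed rounded there (1.593e+04).
B_NEGATIVE = 3575.2435
B_NEGATIVE_X_TRUMP = 1.593e4

def negative_slope(trump: int) -> float:
    """Change in fitted retweets per unit of negative sentiment (trump is 0 or 1)."""
    return B_NEGATIVE + B_NEGATIVE_X_TRUMP * trump

print(f"slope for non-Trump tweets: {negative_slope(0):,.2f}")
print(f"slope for Trump tweets:     {negative_slope(1):,.2f}")
```

Because the negative score ranges from 0 to 1, each slope is also the fitted difference between a completely negative tweet and a tweet with no negative sentiment, holding the other variables fixed.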
Model 6: Interaction between negative sentiment and Trump tweets on favorite count
OLS Regression Results
==============================================================================
Dep. Variable: favorite_count R-squared: 0.171
Model: OLS Adj. R-squared: 0.171
Method: Least Squares F-statistic: 245.6
Date: Sun, 18 Oct 2020 Prob (F-statistic): 0.00
Time: 18:28:36 Log-Likelihood: -1.0536e+05
No. Observations: 8317 AIC: 2.107e+05
Df Residuals: 8309 BIC: 2.108e+05
Df Model: 7
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 9.381e+04 2818.073 33.287 0.000 8.83e+04 9.93e+04
negative -4885.0701 1.01e+04 -0.482 0.630 -2.48e+04 1.5e+04
Trump 3.31e+04 2889.895 11.453 0.000 2.74e+04 3.88e+04
negative:Trump 5.951e+04 1.92e+04 3.096 0.002 2.18e+04 9.72e+04
Harris -3.144e+04 2028.463 -15.500 0.000 -3.54e+04 -2.75e+04
Pence -3.577e+04 2956.423 -12.098 0.000 -4.16e+04 -3e+04
length -169.0068 10.892 -15.516 0.000 -190.358 -147.655
media -2.65e+04 1860.958 -14.240 0.000 -3.01e+04 -2.29e+04
==============================================================================
Omnibus: 11768.342 Durbin-Watson: 1.424
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5756018.104
Skew: 8.174 Prob(JB): 0.00
Kurtosis: 130.838 Cond. No. 4.84e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.84e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Fit
The sixth model investigates the interaction effect of negative sentiment and Trump tweets on favorite count. Binary variables are added for Trump, Pence, and Harris, so again the reference category is Biden tweets. The R-squared is 0.17, indicating that 17% of the variance in favorite count is explained by the model.
The greater positive effect of negative sentiment on favorite count for Trump can be visualized above.
Model 7: Interaction between negative sentiment and Republican tweets on retweet count
OLS Regression Results
==============================================================================
Dep. Variable: retweet_count R-squared: 0.129
Model: OLS Adj. R-squared: 0.129
Method: Least Squares F-statistic: 246.6
Date: Sun, 18 Oct 2020 Prob (F-statistic): 2.20e-246
Time: 18:28:37 Log-Likelihood: -91572.
No. Observations: 8317 AIC: 1.832e+05
Df Residuals: 8311 BIC: 1.832e+05
Df Model: 5
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
Intercept 1.533e+04 488.693 31.365 0.000 1.44e+04 1.63e+04
negative 2267.6330 1974.335 1.149 0.251 -1602.557 6137.823
Republican 5560.0105 428.633 12.971 0.000 4719.783 6400.238
negative:Republican 3.444e+04 3487.129 9.878 0.000 2.76e+04 4.13e+04
length -35.9548 2.003 -17.955 0.000 -39.880 -32.029
media -4785.3198 348.785 -13.720 0.000 -5469.026 -4101.613
==============================================================================
Omnibus: 11548.701 Durbin-Watson: 1.330
Prob(Omnibus): 0.000 Jarque-Bera (JB): 6772282.307
Skew: 7.780 Prob(JB): 0.00
Kurtosis: 141.926 Cond. No. 4.68e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.68e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Fit
The seventh model investigates the interaction effect of negative sentiment and Republican tweets on retweet count. The R-squared is 0.13, indicating that 13% of the variance in retweet count is explained by the model.
The greater positive effect of negative sentiment on retweet count for Republicans can be visualized above.
Model 8: Interaction between negative sentiment and Republican tweets on favorite count
OLS Regression Results
==============================================================================
Dep. Variable: favorite_count R-squared: 0.104
Model: OLS Adj. R-squared: 0.103
Method: Least Squares F-statistic: 192.6
Date: Sun, 18 Oct 2020 Prob (F-statistic): 9.64e-195
Time: 18:28:38 Log-Likelihood: -1.0569e+05
No. Observations: 8317 AIC: 2.114e+05
Df Residuals: 8311 BIC: 2.114e+05
Df Model: 5
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
Intercept 9.257e+04 2667.140 34.707 0.000 8.73e+04 9.78e+04
negative -1.427e+04 1.08e+04 -1.324 0.185 -3.54e+04 6853.173
Republican 1.534e+04 2339.349 6.555 0.000 1.07e+04 1.99e+04
negative:Republican 1.403e+05 1.9e+04 7.374 0.000 1.03e+05 1.78e+05
length -226.8463 10.929 -20.756 0.000 -248.270 -205.422
media -3.134e+04 1903.565 -16.462 0.000 -3.51e+04 -2.76e+04
==============================================================================
Omnibus: 11283.257 Durbin-Watson: 1.329
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4524473.945
Skew: 7.587 Prob(JB): 0.00
Kurtosis: 116.251 Cond. No. 4.68e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.68e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Fit
The eighth model investigates the interaction effect of negative sentiment and Republican tweets on favorite count. The R-squared is 0.10, indicating that 10% of the variance in favorite count is explained by the model.
The greater positive effect of negative sentiment on favorite count for Republicans can be visualized above.
Machine Learning Models for predictive analytics
Two new variables are created for the interaction terms: one is negative by Trump and the other is negative by Republican.
index | neg_trump | neg_rep |
---|---|---|
0 | 0.000 | 0.000 |
1 | 0.000 | 0.000 |
2 | 0.173 | 0.173 |
3 | 0.000 | 0.000 |
4 | 0.000 | 0.000 |
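An interaction variable of this kind is the element-wise product of the negative score and the 0/1 indicator. The sketch below builds both columns on a few hypothetical rows using plain Python dictionaries. The rows are made up for illustration; in a pandas workflow the same result would typically come from multiplying the two columns.

```python
# Hypothetical rows; the real data has one row per tweet.
tweets = [
    {"negative": 0.000, "Trump": 1, "Republican": 1},
    {"negative": 0.173, "Trump": 1, "Republican": 1},
    {"negative": 0.214, "Trump": 0, "Republican": 1},  # e.g. a Pence tweet
    {"negative": 0.090, "Trump": 0, "Republican": 0},
]

for row in tweets:
    # The product equals the negative score when the indicator is 1, and 0 otherwise.
    row["neg_trump"] = row["negative"] * row["Trump"]
    row["neg_rep"] = row["negative"] * row["Republican"]

for row in tweets:
    print(row["neg_trump"], row["neg_rep"])
```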
Predictive Model for Model 5
Because the interaction term was significant in all the models, and because the R-squared increased slightly in the models that included the interaction term, the predictive models were built to mirror the interaction models.
LinearRegression()
Above is the code to create a machine learning model for predictive analytics for Model 5.
array([13471.48241824])
A Biden tweet that is not negative, that is 50 characters long, and does not have media is projected to have 13,471 retweets.
array([17046.72594331])
A Biden tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 17,047 retweets. That is an increase of 3,575 retweets.
array([24598.25359172])
A Trump tweet that is not negative, that is 50 characters long, and does not have media is projected to have 24,598 retweets.
array([44106.63682023])
A Trump tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 44,107 retweets. That is an increase of 19,508 retweets, a much larger increase than for Biden.
Predictive Model for Model 6
LinearRegression()
array([85354.96269929])
A Biden tweet that is not negative, that is 50 characters long, and does not have media is projected to have 85,355 favorites.
array([80469.89261586])
A Biden tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 80,470 favorites. That is a decrease of 4,885 favorites.
array([118453.34119362])
A Trump tweet that is not negative, that is 50 characters long, and does not have media is projected to have 118,453 favorites.
array([173078.08302607])
A Trump tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 173,078 favorites. That is an increase of 54,625 favorites.
Predictive Model for Model 7
LinearRegression()
array([13530.29502519])
A Biden or Harris tweet that is not negative, that is 50 characters long, and does not have media is projected to have 13,530 retweets.
array([15797.92798289])
A Biden or Harris tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 15,798 retweets.
array([19090.30549902])
A Trump or Pence tweet that is not negative, that is 50 characters long, and does not have media is projected to have 19,090 retweets.
array([55802.67925931])
A Trump or Pence tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 55,803 retweets.
Predictive Model for Model 8
LinearRegression()
array([81227.274972])
A Biden or Harris tweet that is not negative, that is 50 characters long, and does not have media is projected to have 81,227 favorites.
array([66958.12512068])
A Biden or Harris tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 66,958 favorites.
array([96562.36954165])
A Trump or Pence tweet that is not negative, that is 50 characters long, and does not have media is projected to have 96,562 favorites.
array([222631.76330142])
A Trump or Pence tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 222,632 favorites.
Lime for Model 5
[1.73e-01 1.00e+00 1.73e-01 0.00e+00 0.00e+00 1.96e+02 0.00e+00]
Intercept -5121.505631164369
Prediction_local [26287.6598739]
Right: 24935.6080733102
Here we can see how Model 5 came to its prediction for the second tweet in the data. A Trump tweet, without media, with a length of 196 characters, and a negative polarity score of 0.17 has a predicted retweet count of 24,935.61.
Lime for Model 6
[ 0. 1. 0. 0. 0. 56. 0.]
Intercept -26271.41612258483
Prediction_local [116061.35526025]
Right: 117439.30043697983
Here we can see how Model 6 came to its prediction for the third tweet in the data. A Trump tweet, without media, with a length of 56 characters, and a negative polarity score of 0 has a predicted favorite count of 117,439.30.
Lime for Model 7
[ 0. 1. 0. 76. 0.]
Intercept 8697.157851260308
Prediction_local [18527.49277237]
Right: 18155.479650879075
Here we can see how Model 7 came to its prediction for the tenth tweet in the data. A Republican tweet, without media, with a length of 76 characters, and a negative polarity score of 0 has a predicted retweet count of 18,155.48.
Lime for Model 8
[2.14e-01 1.00e+00 2.14e-01 2.65e+02 0.00e+00]
Intercept 26688.734561201883
Prediction_local [64141.42040882]
Right: 74769.27471108577
Here we can see how Model 8 came to its prediction for the thirtieth tweet in the data. A Republican tweet, without media, with a length of 265 characters, and a negative polarity score of 0.21 has a predicted favorite count of 74,769.27.
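In each LIME output above, "Prediction_local" is the prediction of LIME's local surrogate and "Right" is the prediction of the model being explained. One simple way to judge how faithful an explanation is would be to compare the two. The sketch below does this with the four pairs of values printed above; the function and variable names are illustrative.

```python
# (Prediction_local, Right) pairs copied from the LIME outputs above.
lime_runs = {
    "Model 5": (26287.6598739, 24935.6080733102),
    "Model 6": (116061.35526025, 117439.30043697983),
    "Model 7": (18527.49277237, 18155.479650879075),
    "Model 8": (64141.42040882, 74769.27471108577),
}

def relative_gap(local_pred: float, model_pred: float) -> float:
    """Signed gap of the local surrogate, as a fraction of the model's prediction."""
    return (local_pred - model_pred) / model_pred

gaps = {name: relative_gap(*pair) for name, pair in lime_runs.items()}
for name, gap in gaps.items():
    print(f"{name}: local surrogate is off by {gap:+.1%}")

# The explanation with the largest absolute gap deserves the most caution.
worst = max(gaps, key=lambda name: abs(gaps[name]))
print("largest gap:", worst)
```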
Evaluation
Model 2 (F(6, 8310) = 425.3, p < .001, R² = .24) is the preferred model for investigating the main effect of negativity on retweet count. The coefficient for negativity in this model is positive and statistically significant (β = 7,869.10, p < .001). This provides support for H1a. The predictions made by the predictive models are consistent with this.
Model 4 (F(6, 8310) = 284.7, p < .001, R² = .17) is the preferred model for investigating the main effect of negativity on favorite count. The coefficient for negativity in this model is positive but not statistically significant (β = 11,150, p = .201), so this model does not provide support for H1b. Taking into consideration the significance testing models and the predictive models, negative sentiment in a tweet is associated with more retweets, while its main effect on favorites cannot be distinguished from zero.
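The significance of a coefficient can be checked directly against the regression summaries: the t statistic is the coefficient divided by its standard error. The sketch below recomputes it for the negative coefficient in Models 2 and 4. The two-sided p-value uses a normal approximation to the t distribution, which is an assumption made here for simplicity and is reasonable with more than 8,000 residual degrees of freedom.

```python
import math

def t_statistic(coef: float, std_err: float) -> float:
    """t statistic of a regression coefficient."""
    return coef / std_err

def two_sided_p_normal(t: float) -> float:
    """Two-sided p-value, using the normal approximation to the t distribution."""
    return math.erfc(abs(t) / math.sqrt(2.0))

# Coefficient and standard error of `negative`, copied from the summaries above.
t_model_2 = t_statistic(7869.0982, 1557.796)
t_model_4 = t_statistic(1.115e4, 8726.784)

p_model_2 = two_sided_p_normal(t_model_2)
p_model_4 = two_sided_p_normal(t_model_4)

print(f"Model 2: t = {t_model_2:.3f}, p = {p_model_2:.3g}")
print(f"Model 4: t = {t_model_4:.3f}, p = {p_model_4:.3f}")
```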
For the interaction between negativity and Trump tweets on retweet count, Model 5 (F(7, 8309) = 368.6, p < .001, R² = .24) is used. The coefficient for the interaction between negativity and Trump tweets is positive and statistically significant (β = 15,930, p < .001). This provides support for H2a, and the predictive models back this up.
For the interaction between negativity and Trump tweets on favorite count, Model 6 (F(7, 8309) = 245.6, p < .001, R² = .17) is used. The coefficient for the interaction between negativity and Trump tweets is positive and statistically significant (β = 59,510, p = .002). This provides support for H2b. Taking into consideration the significance testing models and the predictive models, the association between negative sentiment and tweet engagement is moderated by whether the tweet was from Trump or not. In other words, negative tweets by Trump were associated with greater engagement than negative tweets by Biden.
Similar results were found for H3a and H3b. Both Model 7 (F(5, 8311) = 246.6, p < .001, R² = .13) and Model 8 (F(5, 8311) = 192.6, p < .001, R² = .10) have a positive coefficient for the interaction between negativity and Republican tweets, on retweet count (β = 34,440, p < .001) and on favorite count (β = 140,300, p < .001) respectively. These results provide solid support for H3a and H3b.
Negative sentiment is positively associated with tweet engagement, most clearly for retweet count. It is therefore tempting to argue that Biden should be more negative in his tweets. However, the picture becomes muddied when determining whether Biden should actually go negative. Though doing so is associated with higher overall engagement for him, the effect is even greater for Trump. If they get into a war of words that is increasingly negative, Trump will benefit a great deal more than Biden. This finding also holds when comparing Trump and Pence to Biden and Harris: negative tweet sentiment is associated with a greater increase in overall Twitter engagement for Republicans relative to Democrats. As such, this project's recommendation to the organization is not to go completely negative. The occasional negative tweet from Biden can help increase engagement, but by going completely negative, Biden runs the risk of helping Trump more than himself. A continuation of the status quo, where Biden is already somewhat negative, is therefore recommended.
Limitations and Next Steps
While this project attempted to be as comprehensive as possible, there are still several limitations to both the data and the analysis that must be considered. The data is limited in two major ways. First, as previously discussed, the data was restricted to the four politicians who are currently running for executive office in the United States. Because of this, the results and implications cannot be generalized to other elections or other politicians. It is possible that the results would be quite different if the tweets were collected for candidates in a Senate or House race or if a similar project were conducted in a different country. The conclusions should therefore be drawn only for the present US Presidential Election and should not be generalized to any other election, Presidential or not. As the present project was about the present US Presidential Election, however, this limitation is understandable. Future research projects could increase the generalizability of the results by including politicians from a wide range of elections, at both the local level and the national level.
Second, and more importantly, the data is also limited by Twitter's API, as it only allows the last 3,200 tweets from a user to be obtained. While all the gathered tweets were posted during the election cycle, they do not represent the same time span. The last 3,200 tweets for Biden date back to October 2019, while the last 3,200 tweets from Trump only date back to July 2020. This is problematic as it could introduce time as a confounding variable. For example, an event in early 2020 that led to many tweets with negative sentiment would be represented in Biden's data, but not in Trump's. To remedy this, future projects should obtain the tweets in a different way in order to include all tweets posted during the election cycle. This could be accomplished by creating or using a Twitter scraper that is capable of scraping all of a user's tweets. All tweets that were not posted during the election cycle can then be dropped from the data or excluded by the scraper.
From an analysis perspective, this project was also limited in two major ways. As previously mentioned, the OLS regression assumption of normality could be violated due to several variables not being normally distributed. While this does not bias the statistical estimates, it could undermine the significance tests. This could be addressed by checking the normality of the residuals with a test such as the Jarque-Bera test (the Durbin-Watson statistic, by contrast, checks for autocorrelation in the residuals rather than normality). If the normality assumption is found to be violated, the log of the variables could be taken, or a model that does not rely on the normality assumption, such as a Generalized Linear Model, could be used. Additionally, future studies should check the other regression assumptions as well.
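The regression summaries above already report the ingredients of one such normality check: the skew and kurtosis of the residuals and the Jarque-Bera statistic built from them. As a sketch, the statistic can be recomputed from the rounded Model 1 values. Under normally distributed residuals it would approximately follow a chi-squared distribution with 2 degrees of freedom, whose 5% critical value is about 5.99.

```python
def jarque_bera(n_obs: int, skew: float, kurtosis: float) -> float:
    """Jarque-Bera statistic from sample size, skewness, and (non-excess) kurtosis."""
    return n_obs / 6.0 * (skew ** 2 + (kurtosis - 3.0) ** 2 / 4.0)

# Values copied from the Model 1 summary above (rounded there to three decimals).
jb_model_1 = jarque_bera(n_obs=8317, skew=8.916, kurtosis=171.789)

CHI2_2DF_CRITICAL_5PCT = 5.99  # approximate 5% critical value, chi-squared with 2 df
print(f"Jarque-Bera for Model 1: {jb_model_1:,.0f}")
print("normality rejected at 5%:", jb_model_1 > CHI2_2DF_CRITICAL_5PCT)
```

The recomputed value differs slightly from the one in the summary because the skew and kurtosis are rounded, but it is orders of magnitude above the critical value either way.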
The final limitation is related to the use of the negative polarity scores as the variable for sentiment. There is an alternative variable of the compound sentiment score, which includes positive, neutral, and negative sentiment. As this project was focused on just the negative sentiment, it used the negative polarity scores. Since VADER also gives the compound polarity scores, future projects could investigate if the results hold when all three aspects of sentiment are included.
Ethical and Normative Considerations
All studies should consider the potential ethical and normative issues posed by their work, but this problem becomes even more acute for digital data and machine learning projects, including this one. In many regards, these ethical concerns are mitigated by the design and purpose of this project. For example, this project maintains complete respect for human autonomy, one of the guidelines for ethical AI according to the High-Level Expert Group on Artificial Intelligence (AI HLEG) established by the European Commission, as all the decisions about the tweets are still made by humans. The purpose of this project was to better inform the communications team about the best use of Twitter. It did not decide which tweets were posted or how they were written. The project also has high levels of transparency and explicability, as all the data is public (all tweets are public and the VADER sentiment package is open source), and the predictions made by the machine learning analysis can be explained through the use of LIME.
With that said, the present project is not without its ethical considerations. The first, and possibly most important, issue is the possibility of doing harm (European Commission, 2019). This project investigated whether being more negative in tweet sentiment led to more engagement for four of the most prominent politicians in the world. While the overall recommendation was not to increase the number of negative tweets, the positive effect of negative sentiment on tweet engagement was documented by this project. As such, it is possible that an organization’s takeaway is that negativity increases engagement, which would lead to increasing levels of negativity in political campaigns. Further, if a similar report were to be conducted by the Trump campaign, it would likely recommend increasing negativity in tweets, as negativity benefits Trump more than Biden. In either case, this project or one similar to it could lead to increasing negativity in politics, which could be quite harmful at both the individual level and the societal level. Therefore, the overall harm caused by the recommendations must be taken into account.
A second concern is that this project could lead to the manipulation of consumers or, in this case, citizens (Finn & Wadhwa, 2014). Projects making recommendations about the ideal tweet sentiment could lead politicians to make insincere comments or state flat-out falsehoods. For example, politicians might post very negative tweets about a topic to rally their base, even if they do not care about the issue. Further, it may lead them to negatively spin different events and policies just for the sake of increasing engagement. This would be a manipulation of the people. Instead of stating their true intentions and beliefs, politicians could be led by projects such as these to mislead or lie in order to be more popular. Therefore, any recommendations made by such projects should be clear that changes in sentiment should not be conflated with changes in issues or opinions.
Finally, from a normative perspective, this project could have an effect on the Presidential Election, which could be problematic. This raises the question of to what degree society is comfortable with AI helping to make decisions that could have profound effects on Presidential Elections. Further, as just mentioned, projects like these could lead to an increase in negativity in society, at least in political social media. Increasing negativity is potentially harmful to society and is something society must consider when projects such as these are employed.
References
Carraro, L., & Castelli, L. (2010). The Implicit and Explicit Effects of Negative Political Campaigns: Is the Source Really Blamed? Political Psychology, 31(4), 617-645. doi:10.1111/j.1467-9221.2010.00771.x
European Commission. (2019). Ethics guidelines for trustworthy AI. High-Level Expert Group on Artificial Intelligence. https://ec.europa.eu/futurium/en/ai-alliance-consultation
Finn, R. L., & Wadhwa, K. (2014). The ethics of “smart” advertising and regulatory initiatives in the consumer intelligence industry. Info, 16(3), 22-39. doi:10.1108/info-12-2013-0059
Gilbert, C. H. E., & Hutto, E. (2014, June). Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International Conference on Weblogs and Social Media (ICWSM-14).
Han, X., Gu, X., & Peng, S. (2019). Analysis of Tweet Form’s effect on users’ engagement on Twitter. Cogent Business & Management, 6(1). doi:10.1080/23311975.2018.1564168
Jost, J. T. (2017). Ideological Asymmetries and the Essence of Political Psychology. Political Psychology, 38(2), 167-208. doi:10.1111/pops.12407
Jost, J. T., Glaser, J., Kruglanski, A. W., & Sulloway, F. J. (2003). Political conservatism as motivated social cognition. Psychological Bulletin, 129(3), 339-375. doi:10.1037/0033-2909.129.3.339
Lee, J., & Xu, W. (2018). The more attacks, the more retweets: Trump’s and Clinton’s agenda setting on Twitter. Public Relations Review, 44(2), 201-213. doi:10.1016/j.pubrev.2017.10.002
Ott, B. L. (2016). The age of Twitter: Donald J. Trump and the politics of debasement. Critical Studies in Media Communication, 34(1), 59-68. doi:10.1080/15295036.2016.1266686
Oz, M., Zheng, P., & Chen, G. M. (2017). Twitter versus Facebook: Comparing incivility, impoliteness, and deliberative attributes. New Media & Society, 20(9), 3400-3419. doi:10.1177/1461444817749516
Soroka, S., & McAdams, S. (2015). News, Politics, and Negativity. Political Communication, 32(1), 1-22. doi:10.1080/10584609.2014.881942
Soroka, S., Fournier, P., & Nir, L. (2019). Cross-national evidence of a negativity bias in psychophysiological reactions to news. Proceedings of the National Academy of Sciences, 116(38), 18888-18892. doi:10.1073/pnas.1908369116
Trussler, M., & Soroka, S. (2014). Consumer Demand for Cynical and Negative News Frames. The International Journal of Press/Politics, 19(3), 360-379. doi:10.1177/1940161214524832