Is Political Polarization Persistent?

Tvisha Malik
May 2, 2021

Part I: Overview

The rise of social media has allowed more people to directly voice their political opinions to large audiences by removing traditional media outlets as the primary platform for political opinions. Historically, approval ratings were largely based on national polls conducted by media outlets and think tanks. However, the last two election cycles showed the increasing importance of Twitter for candidates to communicate with voters, receive feedback, and understand voter sentiment. This shift away from polls was partly due to the perceived inaccuracy of traditional polls in 2016 when they underestimated Donald Trump’s electability.

This image shows weekly data from Gallup’s national poll leading up to the 2016 election. Favorability rating fluctuated weekly and the polls were expensive and time consuming to conduct.

The increased usage of social media can be an asset for campaigns. Campaigns can utilize simple web-scraping techniques or Twitter’s API coupled with sentiment analysis to understand the prevailing sentiments about their campaign. This allows them to target certain groups of voters and work to address their concerns to win them over. Even after a candidate is elected, sentiment analysis of recent Tweets can serve as an informal approval rating that can help guide policy decisions and future campaign decisions.

Given the importance of these tools, I wanted to see if there was a relationship between the outcomes of past elections and the current sentiment in Tweets from that region. For example, I wanted to see if areas of the country that overwhelmingly voted for Biden in 2020 see him more favorably now than areas that voted for him by a smaller margin or voted for Trump. A difference in current sentiment between states that voted for a candidate and states that didn’t would point to persisting polarization instead of a shift towards unifying behind the elected candidate.

This analysis can be useful for a few reasons.

  1. It can serve as an informal regional approval rating of sorts for the candidate in power.
  2. It can show if there is a move towards unity or bi-partisanship after elections.
  3. It can show if sentiment changes over time and if the margin of victory has any impact on sentiment in the long-run.
  4. It can help a candidate planning a future campaign decide what issues to address at regional campaign events. Unlike polls, this kind of data analysis tends to be low-cost and faster than calling thousands of voters and aggregating the results.

For this project, I was most interested in the second and third objectives. I wanted to see if Twitter sentiment about the candidates varied based on how the state voted in the last election and if there was any convergence in the sentiment about a candidate from “red” and “blue” states.

Part II: My Methodology and Analysis of the 2016 Election

To narrow the focus of my analysis, I decided to focus on the 2016 and 2020 Presidential elections. I focused on high-profile Presidential elections because there is a lot of publicly available data about election results, as well as plenty of current Tweets about the candidates, to analyze. Additionally, since these were federal elections, I can compare different regions of the country (states) and see how the sentiment compares to each state's election result.

To conduct this analysis, I started with a publicly available data set that included Presidential election results by state and candidate. Since I started with the 2016 election, I narrowed it down to only rows pertaining to 2016 results for Hillary Clinton or Donald Trump. I then created a column that calculated the percentage of voters in each state who voted for each candidate. Finally, I created a histogram for each candidate plotting the percent of the vote won per state against the number of states, to understand what share of voters each candidate was winning per state.
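The filtering and percentage step can be sketched in a few lines. The original analysis was done in R; this is a minimal Python analogue, and the column names (`state`, `year`, `candidate`, `candidatevotes`, `totalvotes`) are assumptions modeled on common public election-returns datasets, not necessarily the exact schema the author used. Vote counts shown are approximate 2016 figures.

```python
# Minimal sketch of the vote-share step. Column names and the tiny sample
# of rows are illustrative; a real run would load the full dataset.
rows = [
    {"state": "Texas", "year": 2016, "candidate": "Trump",
     "candidatevotes": 4685047, "totalvotes": 8969226},
    {"state": "Texas", "year": 2016, "candidate": "Clinton",
     "candidatevotes": 3877868, "totalvotes": 8969226},
    {"state": "California", "year": 2016, "candidate": "Clinton",
     "candidatevotes": 8753788, "totalvotes": 14181595},
]

# Keep only 2016 rows for the two major candidates, then add a
# percent-of-vote column, as described in the text.
results_2016 = [r for r in rows
                if r["year"] == 2016 and r["candidate"] in ("Trump", "Clinton")]
for r in results_2016:
    r["pct"] = 100 * r["candidatevotes"] / r["totalvotes"]

for r in results_2016:
    print(f'{r["state"]:12} {r["candidate"]:8} {r["pct"]:.1f}%')
```

From here, the `pct` values per candidate are what the histograms bin.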

Note: Number of states also includes The District of Columbia (which is the outlier on both graphs) so there are 51 “states” per histogram.

As depicted by the graphs above, the candidates were winning states by very different margins. Clinton tended to carry urban areas in large states and win those states by a smaller overall margin (51–65%) than Trump. However, the states Clinton won tended to have large populations and a lot of electoral votes. Conversely, Trump tended to win smaller, rural states by a large margin (above 60%). These graphs alone show the polarized nature of many states during the 2016 election, which made it a good election to analyze: at the time of the election, many states were overwhelmingly in favor of one candidate. These percentages also gave me a way to group states into those that voted for the candidate versus those that did not in my next few steps.

To determine whether the candidate who won a state in 2016 affects the current sentiment in Tweets coming from that region, I split states into two groups: those where less than 50% of voters voted for the candidate and those where more than 50% did. I then compared the most common words used in Tweets originating from each group in the last ten days and created word clouds to display them. If the words were similar across groups, or had similar underlying sentiment, it would show a convergence in the issues and sentiment among voters from both groups of states, which could point towards a decrease in polarization.
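The grouping and word-counting behind the word clouds can be sketched as follows. The tweets, state grouping, and stopword list are all invented for illustration; the real analysis pulled ten days of Tweets per state from the Twitter API.

```python
from collections import Counter

# Toy inputs: (state, tweet text) pairs and a set of states where the
# candidate won more than 50% of the 2016 vote. All values are made up.
tweets = [
    ("Texas", "applaud the response on masks and science"),
    ("Texas", "science and masks again in the news"),
    ("California", "footage shows the rally lied about masks"),
]
won_states = {"Texas"}
stopwords = {"the", "and", "on", "in", "about", "again", "shows"}

def word_freqs(texts):
    """Count non-stopword tokens -- the frequencies a word cloud
    visualizes by font size."""
    counts = Counter()
    for text in texts:
        for tok in text.lower().split():
            if tok not in stopwords:
                counts[tok] += 1
    return counts

won_freqs = word_freqs(t for s, t in tweets if s in won_states)
lost_freqs = word_freqs(t for s, t in tweets if s not in won_states)
print(won_freqs.most_common(3))
```

Comparing `won_freqs` and `lost_freqs` side by side is the textual equivalent of comparing the two word clouds.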

The results are depicted below for states where the candidate won less than 50% of the vote.

Donald Trump’s (left) and Hillary Clinton’s (right)

The results for states where the candidates won greater than 50% of the vote are depicted below.

Donald Trump’s (left) and Hillary Clinton’s (right)

When analyzing the sentiments from the two groups of states for each candidate there are a lot of similarities.

  • For Clinton, negative words like “lied”, “WMDs”, and “Saddam” show up in both groups as well as more neutral references such as “George Bush” and “attorney.”
  • For Donald Trump’s sentiment analysis, you see “Steve Scalise” (a GOP Congressman), “February”, “WMDs”, “science”, “footage” (presumably police related), and “mask” in both groups’ word clouds. However, for states where he won more than 50% of votes, you also see the positive word “applaud.”
  • Overall, the words appearing in the word clouds are broadly uniform across both groups of states, which suggests that there isn’t a strong divergence in sentiment between voters in states that voted for or against a candidate. This provides some evidence towards post-election unity.

The Tweets suggest that there are issues of importance across both groups of states that candidates may want to address if they run again. Voters seem to be universally concerned about events surrounding Saddam Hussein, weapons of mass destruction, masks and science, lawsuits (presumably about the election), and flipping (presumably states). Thus, this form of analyzing Tweets seems to be more useful for understanding what issues are on the minds of voters than as an unofficial approval rating (since sentiment seems to be fairly uniform across the two groups).

One notable shortcoming of the analysis above is that the Tweets related to Donald Trump were presumably about the 2020 (most recent) Presidential election which makes it difficult to analyze how voters felt about Trump after 2016. Since Clinton has not run since 2016, and has not made very many public appearances, the current Tweets about her are more likely to be about her 2016 campaign. This makes it difficult to meaningfully compare the two candidates without any confounding factors. However, this analysis is still useful if it is conducted within a year of an election or for the purpose of analyzing partisan sentiment.

To try and corroborate my results from the word cloud analysis, I used the R Syuzhet Package to assign sentiment scores to the Tweets from each state for each candidate. I then graphed the sentiment scores against the percent of the voters in the state who voted for the candidate. A positive correlation would indicate that sentiment was more positive in Tweets from states that voted for the candidate than states that voted against them. The results are below.
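Syuzhet's scoring is lexicon-based: each word carries a valence and a text's score aggregates them. Below is a toy Python analogue of that idea, followed by a hand-rolled Pearson correlation like the one the scatterplot checks. The four-word lexicon and the per-state numbers are invented; this is not syuzhet's actual dictionary.

```python
# Toy lexicon-based scorer in the spirit of syuzhet's get_sentiment:
# sum the valences of known words, treat unknown words as neutral.
lexicon = {"applaud": 1.0, "great": 1.0, "lied": -1.0, "worst": -1.0}

def sentiment(text):
    return sum(lexicon.get(tok, 0.0) for tok in text.lower().split())

# Per-state percent of vote won vs. mean sentiment score (made-up data),
# then the Pearson correlation coefficient between the two.
pct   = [62.0, 48.5, 55.0, 41.0]
score = [0.10, -0.05, 0.02, 0.08]

n = len(pct)
mx, my = sum(pct) / n, sum(score) / n
cov = sum((x - mx) * (y - my) for x, y in zip(pct, score))
sx = sum((x - mx) ** 2 for x in pct) ** 0.5
sy = sum((y - my) ** 2 for y in score) ** 0.5
r = cov / (sx * sy)
print(round(r, 3))
```

A value of `r` near zero here corresponds to the flat scatterplot described below: vote share tells you little about current sentiment.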

The graph displays no correlation between the two variables, which corroborates my earlier analysis that there does not seem to be a relationship between who a state voted for in 2016 and the sentiment of current Tweets from the region. This could point to a decrease in polarization since the election, or to both parties’ supporters being equally vocal on Twitter.

Finally, to understand if the percent of voters from a region who voted for the candidate is a statistically significant predictor of the region’s current sentiment (as determined by sentiment score), I ran a multiple regression. I also included the candidates as predictors to see if who the candidate was had an effect on sentiment. The result of the regression is below.
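The regression's structure can be sketched as sentiment score regressed on vote share plus a candidate dummy (1 = Trump, 0 = Clinton). The data values below are invented, and this sketch only recovers the OLS coefficients via the normal equations; the significance tests reported in the text would require standard errors on top of this.

```python
# Invented per-state observations: vote share, candidate dummy, sentiment.
pct   = [62.0, 48.5, 55.0, 41.0, 58.0, 45.0]
trump = [1, 1, 1, 0, 0, 0]
score = [0.10, -0.05, 0.02, 0.08, -0.03, 0.04]

# Design matrix with an intercept column: y ~ b0 + b1*pct + b2*trump.
X = [[1.0, p, t] for p, t in zip(pct, trump)]
y = score
k = 3

# Normal equations: (X^T X) beta = X^T y.
XtX = [[sum(X[i][a] * X[i][b] for i in range(len(X))) for b in range(k)]
       for a in range(k)]
Xty = [sum(X[i][a] * y[i] for i in range(len(X))) for a in range(k)]

# Solve the 3x3 system by Gauss-Jordan elimination with partial pivoting.
M = [row + [rhs] for row, rhs in zip(XtX, Xty)]
for col in range(k):
    piv = max(range(col, k), key=lambda r: abs(M[r][col]))
    M[col], M[piv] = M[piv], M[col]
    for r in range(k):
        if r != col:
            f = M[r][col] / M[col][col]
            M[r] = [a - f * b for a, b in zip(M[r], M[col])]
beta = [M[i][k] / M[i][i] for i in range(k)]
print([round(b, 4) for b in beta])  # [intercept, vote-share coef, Trump dummy]
```

In the actual analysis (run in R on the real sentiment scores), it is the coefficient on vote share, labeled “X” below, whose significance is being tested.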

The predictor labeled “X”, which represents the percent of voters in a given state who voted for a candidate, is not statistically significant. This conclusion is in line with my analysis from the word clouds and scatterplot. However, the regression did find that the candidate was a significant predictor of sentiment, with a higher level of significance for Donald Trump than for Hillary Clinton. This anecdotally makes sense given that Trump largely used Twitter to communicate with his base and was one of the first candidates to fully mobilize Twitter to run his campaign. His base still has a strong presence on Twitter to this day and has been known to use the platform to rally.

Part III: 2020 analysis

To understand if the results from part two generalize to future elections, I wanted to run the same analysis on the 2020 election. Since the Twitter API I am using for sentiment analysis pulls current Tweets, and Donald Trump ran in both 2016 and 2020, I only had to create word clouds for Joe Biden; Trump’s would be the same under this methodology.

The distribution of the percent of voters per state who voted for each candidate is depicted below.

The distributions for the two candidates looked similar to the 2016 election, with the notable exception that Biden won more states with 50%–60% of the vote than Clinton (presumably because he won and Clinton didn’t).

Using the same process detailed in part II, I split states into two groups: greater than 50% of voters who voted for the candidate and less than 50% of voters who voted for the candidate. I produced the following word clouds for Biden (which can be compared to Trump’s from part II).

Biden’s word cloud from states where he won less than 50% of the vote (left) and more than 50% of the vote (right)

Similar to the results from part two, there is not a meaningful difference in the sentiment of the words that appear in the word clouds from either group of states.

  • Some of the words that appear in Biden’s word clouds are not seen in Clinton’s or Trump’s. For example, words like “Marc Elias” (an election lawyer) and “Ramsey” (presumably Ramsey County in Minnesota, where there has been reported police brutality) are more related to current issues.
  • There is no mention of “WMDs” or “Saddam” like in Clinton’s and Trump’s and no obviously negative words like “liar.” I think this is largely because Joe Biden currently holds office and the words are more focused on current events under his Presidency than controversial points during his campaign. Additionally, polls show that Joe Biden has consistently had an approval rating over 50 percent which may be a contributing factor to the positive/neutral sentiment in his word clouds.

The graph of sentiment score versus percent of vote won per state corroborates the findings from the word cloud that Biden is viewed neutrally/leaning positive across most states.

While this graph doesn’t show a positive relationship between the two variables, it does offer some insight into how differently Joe Biden and Donald Trump are currently perceived across the country, and how polarizing a candidate Trump is compared to Biden. Twitter users in most states see Trump as either favorable or unfavorable; he is not viewed neutrally by voters in many states. Unlike the 2016 election, where both candidates were seen as very polarizing (as depicted in the scatterplot from part two), Biden is seen as neutral across almost all states. This neutral sentiment can be an indicator of electability that is useful for political consultants or party organizers trying to assess a candidate’s chances of winning a general election.

Finally, I ran the same multiple regression from part two on the sentiment scores assigned to each state for Biden and Trump. The multiple regression included the percent of voters per state who voted for the candidate and the candidate as predictors of sentiment score.

The regression ultimately found that for the 2020 election, neither the percent of voters nor the candidate was a statistically significant predictor of the sentiment score. While the percent of voters was not significant for either election, the candidate was significant for 2016 but not 2020. This may be because both candidates used social media extensively in 2020 due to the pandemic, and voters from both parties increased their social media usage, so neither candidate was a statistically significant predictor of sentiment as in 2016 (where the Clinton campaign and its supporters seemed to utilize social media less).

Part IV: Conclusion

The conclusions from the word clouds, scatterplots, and sentiment analysis regression can be summed up into three main points:

  1. There does not appear to be a positive relationship between the percent of votes a candidate won in a state in the last election and the current sentiment in Tweets from the state. This means that on average a “blue” state is not more likely to Tweet about Clinton favorably than a “red” state.
  2. There are common issues of importance across both groups of states which points towards a decrease in polarization after elections. The word clouds showed that voters in “blue” states and “red” states seem to be interested in the same issues pertaining to a candidate which is useful for a candidate who is planning a campaign or deciding on a policy.
  3. The candidate is a stronger predictor of sentiment than the percent of the vote they won in a state. Every piece of analysis points to the conclusion that Clinton and Trump were viewed as more polarizing than Biden: their word clouds both had strongly positive and negative words, and the scatterplots showed that Twitter users in many states saw them as overwhelmingly positive or negative, unlike Biden’s sentiment scores, which were largely clustered around zero. Finally, the regression showed that Trump and Clinton were statistically significant predictors of sentiment, while Biden was not.

The overall conclusion of my analysis is that there is no relationship between how a state voted in the last election and the current sentiment of Tweets from that state. Interestingly, winning or losing a state by a landslide is no different than winning or losing by a razor-thin margin when it comes to predicting sentiment.

While conducting this analysis, I stumbled upon an unexpected statistically significant predictor of sentiment: the candidate. It seemed that for some candidates, their campaigns, policies, or personality created polarization in the sentiment of Tweets about them in a way that other candidates did not. Unfortunately, the nature of my data analysis did not allow me to explore exactly what actions were leading certain politicians to be statistically significant while others were not. Anecdotally, it makes sense because Clinton and Trump were both a part of more public scandals during their campaigns than Biden.

While analyzing Tweets was a very interesting exercise, a major shortcoming is that you can only analyze current Tweets (going back a couple of weeks from the current date). This means the type of analysis that can be done is inherently limited. To bypass this limitation in future projects, I will integrate other social media sources like Reddit, which allow me to scrape data from any given time period so I can analyze the change in sentiment from that period to today.
