An exploratory analysis of diachronic shifts in psychological valence and sentiment based on a small unlabelled corpus of 60 documents. The documents were published from the 1950s‒2000s and selected for relevance to the topic of trans* gender identities.
The corpus was provided by Dr Lawrence Burns and Alexis Hansen of Grand Valley State University and included academic papers as well as shorter journal correspondence and online posts.
The corpus was predominantly drawn from the JSTOR archive and needed to be converted to usable text.
See http://discuss.contentmine.org/t/extracting-structured-text-from-ocred-bitmaps/641 for an account of this conversion work.
Requirement: to examine changes in the usage and valence of domain-specific words over time from a restricted vocabulary list within a small corpus of relevant publications and documents.
To analyse valence in texts, a set of techniques known collectively as sentiment analysis is used.
Sentiment Analysis Terminology
Sentiment Analysis always involves a set or lexicon of polarities associated with words or n-grams (short sequences of words which commonly occur together and have meaning beyond their constituent parts e.g., 'gender identity'). The polarity lexicon may classify words into a small set of categories, typically positive and negative, and sometimes neutral, or it may use numerical values which can reflect a wider spectrum of sentiment polarity in the text. Sentiment Analysis is a classification task which analyses words within a piece of text and decides which category or value in the polarity lexicon they match most closely.
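As a minimal illustration of how a numerical polarity lexicon drives this classification, the toy snippet below sums lexicon values for each matching word in a text. The lexicon entries and scores here are invented for illustration only, not drawn from any published resource:

```python
# Toy numerical polarity lexicon: word -> polarity score
# (invented values for illustration; real lexicons have thousands of entries).
LEXICON = {"supportive": 2.0, "helpful": 1.5, "hostile": -2.0, "abuse": -2.5}

def score(text):
    """Naive word-matching scorer: sum lexicon polarities for every
    word in the text, treating unknown words as neutral (0.0)."""
    return sum(LEXICON.get(w, 0.0) for w in text.lower().split())

score("a supportive and helpful response")  # positive overall: 3.5
```

A categorical lexicon would instead map each word to a label such as positive/negative/neutral, trading the finer-grained spectrum for simpler classification.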
Ways of obtaining a polarity lexicon include:
- From a pre-existing source e.g., Bing Liu, VADER, Google ngram etc.
- By manual labelling of a vocabulary list with scores or categories (positive/negative etc.) -- good results but not scalable to large new lists
- Automatically inferred from a corpus and set of seed words
The last of these approaches is gaining in popularity as sentiment analysis is applied to a wider range of document types and subject domains.
Approach 1: Tracking diachronic changes in sentiment polarity of domain-specific terms relative to the corpus itself using SentProp
Word usage changes, and changes in how words indicate sentiment, can affect the interpretation of valence in historical texts. The SentProp project (Hamilton et al. (2016)) addresses this by creating sentiment lexicons for particular time periods.
This approach is appropriate for the current corpus as it can work with an unlabelled corpus and infer polarities from a list of seed-words. The default lists of positive and negative ‘historical’ seed words were used in this investigation.
Creating a series of Historical Polarity Lexicons using Polarity Induction with SentProp
SentProp provides a method for creating a polarity list automatically using its polarity induction technique.
Polarity induction takes as input a set of word embeddings, capturing the context in which words are used. SentProp uses the GloVe (Global Vectors) format and technique to encode these word embeddings. GloVe fits word vectors to the word co-occurrence statistics of the input corpus, producing a high-dimensional vector space in which words used in similar contexts lie close together.
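The key property of such embeddings is that contextually similar words end up close together in the vector space, usually measured by cosine similarity. A small sketch with invented four-dimensional vectors (real GloVe vectors, as used below, have 100 dimensions, and the values here are illustrative only):

```python
import math

# Toy "embeddings": word -> vector (invented values, not real GloVe output).
EMB = {
    "identity": [0.8, 0.1, 0.3, 0.0],
    "gender":   [0.7, 0.2, 0.4, 0.1],
    "archive":  [0.0, 0.9, 0.1, 0.7],
}

def cosine(u, v):
    """Cosine similarity between two vectors: dot product over norms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Words used in similar contexts score higher:
cosine(EMB["identity"], EMB["gender"])   # high similarity
cosine(EMB["identity"], EMB["archive"])  # low similarity
```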
SentProp's polarity induction method takes short lists of seed-words, manually categorised into positive and negative polarity. From these seeds the method follows a 'random walk' through the space defined by the word embeddings. Using this method it is able to find words related to the positive or negative starting points defined by the seed words.
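A highly simplified sketch of this idea is shown below: a random walk with restart is run from each seed set over a word-similarity graph, and a word's polarity is the difference between its positive-walk and negative-walk scores. The words, edges and weights are illustrative assumptions, and SentProp's actual implementation differs in detail:

```python
# Undirected word-similarity graph: word -> {neighbour: similarity weight}
# (toy data; in SentProp this graph is derived from the word embeddings).
GRAPH = {
    "good":    {"great": 0.9, "helpful": 0.7},
    "great":   {"good": 0.9, "helpful": 0.6},
    "helpful": {"good": 0.7, "great": 0.6, "useful": 0.8},
    "useful":  {"helpful": 0.8},
    "bad":     {"awful": 0.9, "harmful": 0.7},
    "awful":   {"bad": 0.9, "harmful": 0.6},
    "harmful": {"bad": 0.7, "awful": 0.6},
}

def propagate(seeds, graph, restart=0.15, iters=50):
    """Spread score out from the seed words: on each step a word takes a
    weighted average of its neighbours' scores, with a restart term that
    keeps probability mass anchored on the seed set."""
    score = {w: (1.0 if w in seeds else 0.0) for w in graph}
    for _ in range(iters):
        new = {}
        for w, nbrs in graph.items():
            spread = sum(score[n] * wt for n, wt in nbrs.items())
            norm = sum(nbrs.values())
            new[w] = restart * (1.0 if w in seeds else 0.0) \
                     + (1 - restart) * spread / norm
        score = new
    return score

pos = propagate({"good"}, GRAPH)  # walk from the positive seed
neg = propagate({"bad"}, GRAPH)   # walk from the negative seed
# Induced polarity: positive-walk score minus negative-walk score.
polarity = {w: pos[w] - neg[w] for w in GRAPH}
```

Words reachable mainly from the positive seeds (here "useful") end up with positive polarity, and vice versa for the negative cluster.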
Creating word embeddings using GloVe
Each document had been converted to plain text from OCR-ed PDF (JSTOR) or from Microsoft .xps files. Headers added by the archiving process and references/bibliography sections were removed to leave mainly running text which would reflect word usage.
The 10 documents for each decade were combined. The text of each decade corpus was prepared by stripping out newlines, punctuation and multiple spaces and converting all words to lower case. The prepared corpora were input to GloVe, which fitted vectors to the word co-occurrence data. Following the examples in SentProp, 100-dimensional vectors were generated.
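The text preparation step can be sketched as follows; this is a minimal Python version of the cleaning described above, not the exact script used:

```python
import re
import string

def prepare(text):
    """Normalise raw document text for embedding training:
    remove newlines and punctuation, collapse repeated whitespace,
    and lowercase everything."""
    text = text.replace("\n", " ")
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()

prepare("Gender,\nIdentity!!  terms")  # -> "gender identity terms"
```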
Using per-decade embeddings with SentProp polarity induction
Using the per-decade embeddings produced by GloVe and seed lists for historical polarities, polarity induction was applied.
Unfortunately, although the technique was observed to work on some decade corpora, the overall results were not directly usable for the objective of tracking specific words from the domain list.
In particular, the polarity induction process did not succeed for three of the six decades; it produced polarity lexicons only for the 1960s, 1990s and 2000s. The individual lists that were produced were credible as examples of very small polarity lexicons, indicating that the basic approach and implementation had been applied successfully to the corpora and could be explored on larger datasets.
Table 1. Characteristics of polarity lexicons which were successfully obtained from the corpus
The process had therefore not succeeded for all decades, as would be required for a full and continuous plot of shifts over time.
However, because these successfully generated lexicons covered a wide range of time periods, further inspection was done to see whether some tracking of sentiment shift might still be possible from the 1960s to the 1990s/2000s.
Comparing sentiment on words of interest between 1960s and 1990s and 2000s polarity lexicons
In using the three polarity lexicons generated, a secondary problem was encountered. The lists of polarities produced were relatively short (around 79‒140 words), and there was not enough consistency between them to track many of the individual words from the specified list. Many words of interest were missing from these short lexicons altogether (in particular, most trans*-specific terms such as ‘transvestite’ and ‘transgender’ did not appear at all). This lack of common vocabulary also reflects the wide variety of topic areas in the selected corpus. Lastly, the majority of words were allocated very similar scores, so that using these lexicons would assign most words neutral sentiment.
To put the results for the small 60-paper corpus in context, a typical example of polarity induction provided by SentProp is based on a corpus of 17 million words. The polarity lexicon induced from this large corpus contains 3442 entries.
By comparison, the typical word count for each decade corpus here is under 100,000, and a polarity lexicon of only around 100 words is induced.
Although SentProp addresses the issue of analysing sentiment against word usage in documents from the same time period, it was not able to construct a full enough lexicon from the small corpus here. Future work could explore using a larger corpus as input for polarity induction. It would also be interesting to explore how the choice and number of seed words affect the polarity induction results. SentProp is a collection of techniques and a work in progress, and further investigation of the toolkit would likely be worthwhile.
Approach 2: Using a generic valence-aware Sentiment Analysis technique to analyse polarity of selected terms in context windows extracted from the corpus texts for each time period
In this approach, an existing sentiment analysis polarity lexicon was used to analyse the corpus of 60 documents. To compare polarities over time, the corpus was again divided into sets of 10 papers per decade 1950s‒2000s.
In selecting an existing sentiment analysis and polarity lexicon, the following requirements were considered:
- There is no existing classified example data to form a training set for a machine-learning classifier so a pre-trained classifier must be selected
- The lexicon should not be specific to a style, genre or topic (for instance the commonly used data from movie reviews, fiction or twitter)
- If possible, the technique should reflect a range of sentiment polarity rather than simple positive and negative categories, so that shifts in valence can be inspected.
Considering the above requirements, VADER (Valence Aware Dictionary and sEntiment Reasoner) (Hutto and Gilbert (2014)) was chosen for sentiment analysis. VADER is designed to reflect a range of sentiment polarity rather than a small set of categories. As it is lexicon- and rule-based, it does not require the user to construct a training set of classified data. It is considered applicable to both social media and general written English. A limitation is that it has been designed to analyse contemporary English and does not include historical lexicons.
The corpus for a specific decade was read in as plain text.
To focus the analysis on individual terms, the Punkt sentence tokenizer (available as part of nltk) was run over the corpus. A list of only those sentences featuring the relevant term was then created. The VADER sentiment analysis was run over this list.
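This filtering step can be sketched as follows. A simple regex split stands in here for the Punkt tokenizer used in the investigation; each retained sentence would then be scored with VADER (via SentimentIntensityAnalyzer.polarity_scores):

```python
import re

def sentences_with_term(text, term):
    """Split text into sentences and keep only those containing the
    term of interest (case-insensitive). A regex split on sentence-final
    punctuation stands in for nltk's Punkt tokenizer."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if term.lower() in s.lower()]

sentences_with_term("Queer was a slur. The weather was fine.", "queer")
# keeps only the first sentence
```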
The polarity scores for every sentence were then plotted for each decade.
The VADER analysis captured some expected shifts in usage:
In the 1960s, ‘queer’ is used only with negative sentiment. It does not feature in the 1970s and 1980s texts. In the 1990s and 2000s it finds its contemporary range of usage: both a self-chosen identity term and an academic term, predominantly positive in usage, but also a continuing term of abuse, especially where an author is quoting hate speech.
For other terminology the pattern is less clear, although negative sentiment for words such as 'transvestite' often decreases in recent decades.
Inspecting the individual sentences against their scores reveals a wide range of examples.
It is noticeable that sentiment analysis of short extracts from an unlabelled document cannot distinguish between negative sentiment expressing the author’s own viewpoint and negative sentiment in text which the author is quoting or discussing. For example, in reporting first-person accounts of victimisation, the overall context is sympathetic and puts first-person narratives in the foreground, but the sentiment analysis scores reflect the negative language quoted. Possible approaches to this include human labelling of valence, and running automated sentiment analysis on a larger extract of the text (such as a paragraph, or a window of words before and after) in order to provide more contextual information.
Notes on the corpus
This small corpus was intended as a pilot for data extraction and analysis for this domain, with a larger corpus of domain-specific scanned documents also in existence.
Although it was possible to apply the two techniques – SentProp and VADER – to the data, and see some sentiment scores and shifts which provided a basic proof of concept, various aspects of the corpus limited the results which could be obtained and the generality of the data.
Specific observations about the corpus for sentiment analysis:
Corpus size and comparative document counts: To track changes across time or disciplinary boundaries, the corpus needed to be subdivided. For analysis across time this provided 10 documents per decade, but for discipline the counts varied more widely. If subdivided completely, several decade/discipline combinations were not represented at all, or only by a very small number of papers. These could not be analysed effectively for general patterns, as the particular study topic or phraseology of a single paper would have undue prominence without sufficient items to generalise across. This also reflects the more general difficulty of doing corpus analysis on primary sources such as case studies.
Variety of sub-topics within the corpus: The corpus had been chosen to represent a variety of sources and sub-topics, and also contained a wide variation in types of document, from short published correspondence and blog-style text to longer academic papers. This variation contributed to the variation in word count which limited the successful application of the SentProp analysis to specific decades only. It was also challenging to find terms which were common to the time periods and could be tracked across them. In part this is due to the rapidly changing language in this domain: for instance, ‘transgender’ does not appear in the 1950s corpus, while ‘pervert’ does not appear in the 2000s corpus. Even taking these sharp usage changes into account, a larger corpus would be more likely to provide a wider range of relevant vocabulary to track across time.
This makes tracking of continuous semantic transitions across time or discipline more difficult.
Coverage of specific disciplines across decades:
As well as a per-decade analysis across the corpus, we were asked to track shifts in valence across disciplines across time. We were provided with a classification which marked documents as Biology, Psychology, Sociology/Anthropology etc., Medicine and Law.
Table 2 shows the distribution of the 60 papers across disciplines and decades.
Table 2: Counts of documents by decade and discipline.
A number of the combinations did not even contain multiple papers to compare for a generalised analysis. Given the limited results achieved in applying sentiment analysis to the ten-paper corpora in the per-decade approach (described above), a similar analysis per discipline/decade was judged out of scope for the small corpus and was not pursued.
C.J. Hutto and Eric Gilbert (2014) 'VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text', Proceedings of ICWSM-14.
William L. Hamilton, Kevin Clark, Jure Leskovec and Dan Jurafsky (2016) 'Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora', Proceedings of EMNLP 2016 (arXiv:1606.02820).