Morphoditar package use case: Analysis of lyrics

Michael Škvrňák michael.skvrnak@rozhlas.cz

2017-10-13

Introduction

This vignette demonstrates the functions in the morphoditar R package, which accesses the MorphoDiTa (Morphological Dictionary and Tagger) API developed by the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague.

The demonstration uses lyrics from a non-random selection of Czech bands from the local do-it-yourself scene, with genres ranging from hardcore punk to indie pop [1]. In particular, it includes the bands Dukla, Esazlesa, Gattaca, ±0, Remek, and Role.

Please note that MorphoDiTa is available for non-commercial purposes only.

Motivation

Research on the psychology of language indicates that the choice of words is associated with the psychological state of the author. For instance, according to Tausczik and Pennebaker (2010), people who are experiencing physical or emotional pain tend to have their attention drawn to themselves and subsequently use more first-person singular pronouns.

Also, “Pronouns and verb tense are useful linguistic elements that can help identify focus, which, in turn, can show priorities, intentions, and processing. Some care should be taken in evaluating how pronouns and verbs are used. An exception to the pronoun-attention rule concerns first-person plural pronouns: ‘we,’ ‘us,’ and ‘our.’ Sometimes ‘we’ can signal a sense of group identity, such as when couples are asked to evaluate their marriages to an interviewer, the more the participants use ‘we,’ the better their marriage.”

Thus, it does not seem to be a stretch to assume that bands playing different genres will use pronouns with different frequencies, as their lyrics should focus on different priorities.

To quote Antisocial Skills, a hardcore punk band not included in the analysis, and in particular their song Different frequencies:

We just we have different priorities
We work on different frequencies

In general, we could assume that bands playing different genres differ in where they direct their attention, that is, in what they perceive as the thing that sucks. To put it bluntly, indie pop bands tend to sing songs like “you don’t love me and it sucks”, in contrast to hardcore punk bands, which tend to focus on the flaws of “the society/consumerism/capitalism which sucks”.

Based on this, we can formulate tentative hypotheses like:

Analysis

(If you don’t care about the programming stuff, scroll down to the colourful charts.)

Processing the data

Let’s start by loading the data. The datasets are already stored in the package, so we only need to call the data function.

devtools::load_all(".")
## Loading morphoditar
library(morphoditar)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

data("dukla")
data("esazlesa")
data("gattaca")
data("plusminusnula")
data("remek")
data("role")

str(dukla)
## 'data.frame':    5 obs. of  3 variables:
##  $ album : Factor w/ 1 level "Vinohrady": 1 1 1 1 1
##  $ lyrics: Factor w/ 5 levels "Byla horká noc\njenom vítr foukal\nA tvý bílý nohy \nchladně zářej do tmy\nAni nevím proč \ntočí se mi hlava\nChytla jsi mě za "| __truncated__,..: 5 3 4 1 2
##  $ song  : Factor w/ 5 levels "Holubi","Kudy dál",..: 2 4 5 3 1

morphoditaR functions

Before continuing, let’s clarify what functions are defined in the package.

When processing the data, the first function needed is tag_morphodita, which sends a request to the API and returns the tagged words. More specifically, it returns an S3 object (a list of class morphodita_api) with elements specifying the request (URL, model, tagset, and language) and a data.frame with the output from the API.

out <- tag_morphodita("To je život")
str(out)
## List of 5
##  $ url   : chr "http://lindat.mff.cuni.cz/services/morphodita/api/tag?data=To%20je%20%C5%BEivot&output=json&convert_tagset=pdt_to_conll2009"
##  $ model : chr "czech-morfflex-pdt-161115"
##  $ tagset: chr "pdt_to_conll2009"
##  $ lang  : chr "cz"
##  $ output:'data.frame':  3 obs. of  3 variables:
##   ..$ token: chr [1:3] "To" "je" "život"
##   ..$ lemma: chr [1:3] "ten" "být" "život"
##   ..$ tag  : chr [1:3] "POS=P|SubPOS=D|Gen=N|Num=S|Cas=1" "POS=V|SubPOS=B|Num=S|Per=3|Ten=P|Neg=A|Voi=A" "POS=N|SubPOS=N|Gen=I|Num=S|Cas=1|Neg=A"
##  - attr(*, "class")= chr "morphodita_api"
out
## MorphoDiTa API call:
## 
## Tagset: pdt_to_conll2009 
## Model: czech-morfflex-pdt-161115 
## Language: cz 
## 
## Output:
##   token lemma                                          tag
## 1    To   ten             POS=P|SubPOS=D|Gen=N|Num=S|Cas=1
## 2    je   být POS=V|SubPOS=B|Num=S|Per=3|Ten=P|Neg=A|Voi=A
## 3 život život       POS=N|SubPOS=N|Gen=I|Num=S|Cas=1|Neg=A
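
For readers unfamiliar with S3, the object above is an ordinary list with a class attribute, so its parts can be reached with `$`. The following is a minimal sketch that rebuilds the same structure by hand, without calling the API. The name `mock` is made up for illustration, the values are copied from the output above, and this is not necessarily how the package constructs the object internally.

```r
# Hand-built stand-in for the object returned by tag_morphodita();
# the values are copied from the printed output above (no API call is made).
mock <- structure(
    list(
        url    = NA_character_,  # the real object stores the request URL here
        model  = "czech-morfflex-pdt-161115",
        tagset = "pdt_to_conll2009",
        lang   = "cz",
        output = data.frame(
            token = c("To", "je", "život"),
            lemma = c("ten", "být", "život"),
            tag   = c("POS=P|SubPOS=D|Gen=N|Num=S|Cas=1",
                      "POS=V|SubPOS=B|Num=S|Per=3|Ten=P|Neg=A|Voi=A",
                      "POS=N|SubPOS=N|Gen=I|Num=S|Cas=1|Neg=A"),
            stringsAsFactors = FALSE
        )
    ),
    class = "morphodita_api"
)

# Request metadata and the tagged words are plain list elements
mock$model
mock$output$lemma
mock$output$tag
```

Each entry of the tag column packs all morphological categories of a token into one string.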

Clearly, the API does not return the tags in a form that is easy to work with. Therefore, another function, split_tags, splits the tags into separate columns.
For the Czech language models, the tags can be separated into the categories specified in this document; for the English language model, the tags are specified here.

out1 <- out %>% split_tags
out1$output
##   token lemma POS SUBPOS GENDER NUMBER CASE POSSGENDER POSSNUMBER PERSON
## 1    To   ten   P      D      N      S    1         NA       <NA>   <NA>
## 2    je   být   V      B   <NA>      S <NA>         NA       <NA>      3
## 3 život život   N      N      I      S    1         NA       <NA>   <NA>
##   TENSE GRADE NEGATION VOICE RESERVE1 RESERVE2 VAR
## 1  <NA>  <NA>     <NA>  <NA>       NA       NA   -
## 2     P  <NA>        A     A       NA       NA   -
## 3  <NA>  <NA>        A  <NA>       NA       NA   -
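
To make the idea of splitting concrete, here is a rough base R sketch of how `KEY=value|KEY=value` strings could be turned into columns. The helpers `parse_tag` and `split_tag_column` are hypothetical names introduced only for this illustration. They are not part of the package, and split_tags may well work differently (for instance, it returns a fixed set of upper-case columns, which this sketch does not).

```r
# Illustrative only: one way to split "KEY=value|KEY=value" tags into columns.
# parse_tag() and split_tag_column() are hypothetical helpers, not package functions.
parse_tag <- function(tag) {
    # "POS=P|SubPOS=D" -> list(c("POS", "P"), c("SubPOS", "D"))
    pairs <- strsplit(strsplit(tag, "|", fixed = TRUE)[[1]], "=", fixed = TRUE)
    values <- vapply(pairs, function(p) p[2], character(1))
    names(values) <- vapply(pairs, function(p) p[1], character(1))
    values
}

split_tag_column <- function(tags) {
    parsed <- lapply(tags, parse_tag)
    # Union of all category names, in order of first appearance
    keys <- unique(unlist(lapply(parsed, names)))
    # Indexing a named vector by a missing name yields NA, which fills the gaps
    rows <- lapply(parsed, function(p) unname(p[keys]))
    out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
    names(out) <- keys
    out
}

tags <- c("POS=P|SubPOS=D|Gen=N|Num=S|Cas=1",
          "POS=V|SubPOS=B|Num=S|Per=3|Ten=P|Neg=A|Voi=A",
          "POS=N|SubPOS=N|Gen=I|Num=S|Cas=1|Neg=A")
split_tag_column(tags)
```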

The category codes, however, are not very comprehensible, and looking them up in the documents mentioned above is not user-friendly. Thus, there is the recode_tags function, which gives the tags proper labels.

out2 <- out1 %>% recode_tags
out2$output
##   token lemma     POS                       SUBPOS              GENDER
## 1    To   ten Pronoun       Pronoun, demonstrative              Neuter
## 2    je   být    Verb Verb, present or future form                <NA>
## 3 život život    Noun                Noun, general Masculine inanimate
##     NUMBER       CASE POSSGENDER POSSNUMBER PERSON   TENSE GRADE
## 1 Singular Nominative       <NA>       <NA>   <NA>    <NA>  <NA>
## 2 Singular       <NA>       <NA>       <NA>      3 Present  <NA>
## 3 Singular Nominative       <NA>       <NA>   <NA>    <NA>  <NA>
##      NEGATION  VOICE RESERVE1 RESERVE2
## 1        <NA>   <NA>       NA       NA
## 2 Affirmative Active       NA       NA
## 3 Affirmative   <NA>       NA       NA
##                                                           VAR
## 1 Not applicable (basic variant, standard contemporary style)
## 2 Not applicable (basic variant, standard contemporary style)
## 3 Not applicable (basic variant, standard contemporary style)
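
Recoding of this kind is essentially a table lookup. As a small sketch of the principle, the snippet below maps single-letter codes to labels with a named character vector. The vector `pos_labels` is an assumption that covers only the three part-of-speech codes seen above; it is not the lookup table used by recode_tags.

```r
# Illustrative only: recoding single-letter codes with a named lookup vector.
# pos_labels covers just the three codes seen above; it is not the package's table.
pos_labels <- c(P = "Pronoun", V = "Verb", N = "Noun")

pos_codes <- c("P", "V", "N", NA)
# Indexing by name maps each code to its label; NA stays NA
recoded <- unname(pos_labels[pos_codes])
recoded
```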

We can continue the analysis by defining helper functions which tag the lyrics, split the returned tags into columns, and recode them using the functions mentioned above. We also need to define a function that creates a dataset containing the tagged lyrics together with data on the band, album, and song the lyrics come from.

# Define helper functions

## Tag lyrics
tag_data <- function(df){
    lapply(df$lyrics, function(x) x %>% tag_morphodita %>% split_tags %>% recode_tags)
}

## Add metadata to the tagged output
add_metadata <- function(output, source_lyrics, band){
    for (i in seq_along(output)){
        output[[i]]$output$song <- source_lyrics$song[i]
        output[[i]]$output$album <- source_lyrics$album[i]
        output[[i]]$output$band <- band
    }
    
    do.call(rbind, lapply(output, function(x) x[["output"]]))
}
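
To see what shape the second helper produces without calling the API, the following sketch applies the same idea to two made-up tagged results. The objects `fake_output` and `fake_source` and the band name are invented placeholders, and the `Map` formulation is just an alternative way of writing the loop, not the code used in this vignette.

```r
# Two fake tagged results standing in for the output of tag_data()
fake_output <- list(
    list(output = data.frame(token = c("To", "je"), stringsAsFactors = FALSE)),
    list(output = data.frame(token = "život", stringsAsFactors = FALSE))
)
fake_source <- data.frame(song = c("Song A", "Song B"),
                          album = c("Album X", "Album X"),
                          stringsAsFactors = FALSE)

# Same idea as add_metadata(): attach song, album and band to every token row,
# then stack the per-song data frames into one
rows <- Map(function(res, song, album) {
    cbind(res$output, song = song, album = album, band = "Some band",
          stringsAsFactors = FALSE)
}, fake_output, fake_source$song, fake_source$album)
stacked <- do.call(rbind, rows)
stacked
```

The result has one row per token, with the song, album, and band repeated on each row.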

Then, we process all of the datasets with lyrics and merge them into a single dataset.

dukla_lyrics <- add_metadata(tag_data(dukla), dukla, "Dukla")
esazlesa_lyrics <- add_metadata(tag_data(esazlesa), esazlesa, "Esazlesa")
gattaca_lyrics <- add_metadata(tag_data(gattaca), gattaca, "Gattaca")
plusminusnula_lyrics <- add_metadata(tag_data(plusminusnula), plusminusnula, "±0")
remek_lyrics <- add_metadata(tag_data(remek), remek, "Remek")
role_lyrics <- add_metadata(tag_data(role), role, "Role")

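# Bind the per-band data frames explicitly; apropos("_lyrics") would also
# match all_lyrics itself if this code were run a second time
all_lyrics <- rbind(dukla_lyrics, esazlesa_lyrics, gattaca_lyrics,
                    plusminusnula_lyrics, remek_lyrics, role_lyrics)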
all_lyrics$genre <- ifelse(all_lyrics$band %in% c("Esazlesa", "Gattaca", "Remek"), 
                           "hardcore", "indie something")

The resulting dataset looks like this:

str(all_lyrics)
## 'data.frame':    6285 obs. of  21 variables:
##  $ token     : chr  "Zdá" "se" "," "že" ...
##  $ lemma     : chr  "zdát" "se" "," "že" ...
##  $ POS       : Factor w/ 12 levels "Adjective","Numeral",..: 8 7 12 5 3 7 3 8 8 8 ...
##  $ SUBPOS    : Factor w/ 74 levels "Abbreviation used as an adverb",..: 21 12 15 3 48 23 48 62 21 52 ...
##  $ GENDER    : Factor w/ 10 levels "Feminine","Feminine or Neuter",..: NA NA NA NA NA 5 NA 5 NA NA ...
##  $ NUMBER    : Factor w/ 5 levels "Dual","Plural",..: 3 5 NA NA NA 3 NA 3 3 NA ...
##  $ CASE      : Factor w/ 8 levels "Nominative","Genitive",..: NA 4 NA NA NA 1 NA NA NA NA ...
##  $ POSSGENDER: Factor w/ 4 levels "Feminine possessor",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ POSSNUMBER: Factor w/ 2 levels "Plural","Singular": NA NA NA NA NA NA NA NA NA NA ...
##  $ PERSON    : Factor w/ 4 levels "1","2","3","Any": 3 NA NA NA NA NA NA 4 1 NA ...
##  $ TENSE     : Factor w/ 5 levels "Future","Past or Present",..: 3 NA NA NA NA NA NA 4 3 NA ...
##  $ GRADE     : Factor w/ 3 levels "Positive","Comparative",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ NEGATION  : Factor w/ 2 levels "Affirmative",..: 1 NA NA NA 1 NA NA 1 1 1 ...
##  $ VOICE     : Factor w/ 2 levels "Active","Passive": 1 NA NA NA NA NA NA 1 1 NA ...
##  $ RESERVE1  : logi  NA NA NA NA NA NA ...
##  $ RESERVE2  : logi  NA NA NA NA NA NA ...
##  $ VAR       : Factor w/ 10 levels "Not applicable (basic variant, standard contemporary style)",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ song      : Factor w/ 78 levels "Holubi","Kudy dál",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ album     : Factor w/ 17 levels "Vinohrady","Společnost psů",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ band      : chr  "Dukla" "Dukla" "Dukla" "Dukla" ...
##  $ genre     : chr  "indie something" "indie something" "indie something" "indie something" ...

Personal pronouns

Then we can focus on personal pronouns (já, ty, on/ona/ono, my, vy, oni/ony) contained in the lyrics.

personal_pronouns <- subset(all_lyrics, SUBPOS == "Personal pronoun, clitical (short) form" | 
                                        SUBPOS == "Personal pronoun")
personal_pronouns$type <- paste(personal_pronouns$NUMBER, personal_pronouns$PERSON)

pronouns_sum <- personal_pronouns %>% group_by(type, band, genre) %>% summarise(length = n())
ggplot(pronouns_sum, aes(x = band, y = length, fill = type)) + 
    geom_bar(stat = "identity") + facet_grid(. ~ genre, scales = "free") + 
    labs(title = "Personal pronouns", y = "frequency of pronouns")

Possessive pronouns

possessive_pronouns <- subset(all_lyrics, 
                              SUBPOS == "Possessive pronoun 'můj', 'tvůj', 'jeho/její'")
# The column is POSSNUMBER (number of the possessor), as shown in str(all_lyrics)
possessive_pronouns$type <- paste(possessive_pronouns$POSSNUMBER, possessive_pronouns$PERSON)

possessive_sum <- possessive_pronouns %>% group_by(type, band, genre) %>% summarise(length = n())
ggplot(possessive_sum, aes(x = band, y = length, fill = type)) + 
    geom_bar(stat = "identity") + facet_grid(. ~ genre, scales = "free") + 
    labs(title = "Possessive pronouns", y = "frequency of pronouns")

Interpretation

So the distinction between the genres is not as clear-cut as one might expect. The most common personal pronoun is the first-person singular for all bands except Esazlesa, who use the second-person singular more often. Their use of the second-person singular also presents a deviation from the hypothesis.

We can explain this deviation by the fact that the underlying mechanism which should drive the selection of personal pronouns does not always hold. Hardcore bands are not predestined to write about despair caused by modern societies; they can also sing about failures in interpersonal relationships, as they do in For Better or Worse/V dobrém i ve zlém, which is about a failed marriage.
In addition, some of their lyrics are not available on Bandcamp, such as the lyrics of Middle Children of History/Průměrný děti historie where they quote Palahniuk’s Fight Club/Tyler Durden:
“Advertising has us chasing cars and clothes, working jobs we hate so we can buy shit we don’t need. We’re the middle children of history, man. No purpose or place. We have no Great War. No Great Depression. Our Great War’s a spiritual war… our Great Depression is our lives.”
(However, the English version contains many more personal pronouns than the Czech translation. Exploring the conjugation of verbs might therefore be another way to investigate the differences between genres/authors.)

Apart from that, the use of plural pronouns merely supports the hypothesis.

If you miss statistical tests, you can use “perverse language of statistics and numbers” (Gattaca: Workfare) to compute them yourself. DIY, d’oh.
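
For those who do want a test, one possible starting point is a chi-squared test of independence between genre and pronoun type. The sketch below runs on invented toy rows (`toy_pronouns` is not the real lyrics data, and its counts mean nothing), so it only shows the mechanics; with the real data, the personal_pronouns data frame built above would take its place.

```r
# Invented toy rows, NOT the real lyrics data: one row per pronoun occurrence
toy_pronouns <- data.frame(
    genre = rep(c("hardcore", "indie something"), times = c(60, 60)),
    type  = c(rep(c("Singular 1", "Plural 1"), times = c(25, 35)),
              rep(c("Singular 1", "Plural 1"), times = c(45, 15))),
    stringsAsFactors = FALSE
)

# Cross-tabulate genre by pronoun type and test for independence
counts <- table(toy_pronouns$genre, toy_pronouns$type)
test <- chisq.test(counts)
test$p.value
```

Pronouns within the same song or band are hardly independent observations, so any p-value obtained this way should be read loosely.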


  1. More specifically, the selection was based on the author’s perception that the bands don’t suck and on the fact that their lyrics are available on Bandcamp, so they could be scraped.