We visualize simple, phonetically based representations of the sonnets of William Shakespeare (1564-1616), and those of two of his contemporaries, Sir Philip Sidney (1554-1586) and Edmund Spenser (1553-1599).
Initial features are based on either counts of phonemes within each sonnet ("Phoneme Count") or the sound similarity vectors from Parrish, A., 2017 ("Sound Similarity"), which make use of interleaved phonetic feature bigrams. To obtain the phoneme-count features, we use the CMU pronunciation dictionary to extract phonemes from words, count the occurrences of each phoneme within each sonnet, and divide each count by the total count of that phoneme across all sonnets in the dataset, as in Holdsworth, T. L., 2019. The sound similarity features for a sonnet are calculated by taking the mean, across all words in the sonnet, of the 50-dimensional sound similarity vector of each word, which we accessed via this dictionary from A. Parrish.
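The phoneme-count normalization described above can be sketched in a few lines. This is only an illustration under stated assumptions: `TOY_PRONUNCIATIONS` is a hypothetical four-word stand-in for the CMU pronunciation dictionary, the two "sonnets" are made-up word lists, and the function names are ours rather than the project's.

```python
from collections import Counter

# Hypothetical toy pronunciation dictionary in CMU-style ARPAbet notation
# (stress markers omitted). The real CMU dictionary is far larger.
TOY_PRONUNCIATIONS = {
    "love": ["L", "AH", "V"],
    "eye": ["AY"],
    "lie": ["L", "AY"],
    "view": ["V", "Y", "UW"],
}


def phoneme_counts(text, pronunciations):
    """Count phonemes in a text, skipping words missing from the dictionary."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(pronunciations.get(word, []))
    return counts


def normalized_features(sonnets, pronunciations):
    """Divide each sonnet's phoneme counts by the corpus-wide total per phoneme."""
    per_sonnet = [phoneme_counts(s, pronunciations) for s in sonnets]
    totals = Counter()
    for counts in per_sonnet:
        totals.update(counts)
    phonemes = sorted(totals)  # fixed column order for the feature vectors
    features = [[counts[p] / totals[p] for p in phonemes] for counts in per_sonnet]
    return phonemes, features


# Made-up two-"sonnet" corpus, for illustration only.
sonnets = ["love eye love", "lie view"]
phonemes, features = normalized_features(sonnets, TOY_PRONUNCIATIONS)
```

With this normalization, each phoneme's column sums to one across the corpus, so a feature value is the share of that phoneme's occurrences that fall in a given sonnet.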
We visualize these features in two dimensions by using linear discriminant analysis (LDA), principal components analysis (PCA), or a tree.
In linear discriminant analysis, we project linearly onto two dimensions in a way that, roughly speaking, maximizes the separation between the poems by different authors. We see that the simple phonetic features used here do allow for some separation between sonnets by different authors. Although our intent was to provide a simple visualization, rather than explore classification by author, we note that classification based on these models is possible. (Leaving one sonnet out and training LDA on the others gives 74% accuracy in classifying Shakespeare vs. not Shakespeare using the phoneme count features, and 73% using the sound similarity features.)
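As a rough sketch of the two uses of LDA mentioned above (a two-dimensional projection for three authors, and a leave-one-out estimate for the binary task), the following uses scikit-learn on synthetic data. The feature matrix here is random and only stands in for the sonnet features; the sizes, seed, and variable names are assumptions, and the resulting accuracy says nothing about the sonnets.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the sonnet features: 3 authors, 30 "sonnets" each,
# 10 features per sonnet, with a different mean vector for each author.
authors = np.repeat(["Shakespeare", "Sidney", "Spenser"], 30)
centers = rng.normal(size=(3, 10))
X = np.vstack([c + rng.normal(size=(30, 10)) for c in centers])

# With three classes, LDA yields at most two discriminant directions,
# which is exactly what a 2-D scatter plot needs.
projected = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, authors)

# Leave-one-out accuracy for Shakespeare vs. not Shakespeare:
# each fold trains on all sonnets but one and scores the held-out sonnet.
is_shakespeare = (authors == "Shakespeare").astype(int)
scores = cross_val_score(
    LinearDiscriminantAnalysis(), X, is_shakespeare, cv=LeaveOneOut()
)
loo_accuracy = scores.mean()
```

Each row of `projected` gives the plotting coordinates of one sonnet, and `loo_accuracy` is the fraction of held-out sonnets classified correctly.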
We can also use LDA to look at the phonemic properties of sonnets by specific poets. If we run LDA on the two-class case of Shakespeare vs. the other poets, thus projecting onto a one-dimensional space, it is interesting to inspect the direction of that one-dimensional subspace within phoneme space. We see that the phonemes whose presence and frequency most contribute to a classification of Shakespeare as the author under this model are 'L', 'OY', 'S', 'HH', and 'CH', while the phonemes that contribute most towards a classification of not Shakespeare are 'DH', 'AH', 'IH', 'Y' and 'AE'. The sonnet with the highest score in the Shakespeare direction under this discriminative model is Shakespeare's sonnet LXIX, "Those parts of thee that the world's eye doth view..."
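One way to inspect such a direction, sketched here with scikit-learn, is to sort the coefficients of a fitted two-class LDA model and to rank items by their decision score. The data below are synthetic: the signal in the 'L' and 'DH' columns is planted by construction and is not a result from the sonnets, and the five-phoneme feature set is an assumption made to keep the example small.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Hypothetical five-phoneme feature space; one column per phoneme.
phonemes = ["L", "S", "DH", "AH", "IH"]

# Synthetic data: 30 "Shakespeare" rows (label 1) and 30 other rows (label 0).
y = np.array([1] * 30 + [0] * 30)
X = rng.normal(size=(60, len(phonemes)))
X[y == 1, 0] += 3.0  # planted: more 'L' in class 1
X[y == 1, 2] -= 3.0  # planted: less 'DH' in class 1

lda = LinearDiscriminantAnalysis().fit(X, y)

# In the two-class case coef_ has a single row; positive weights push
# the decision towards classes_[1], negative weights towards classes_[0].
weights = lda.coef_[0]
ranked = [phonemes[k] for k in np.argsort(weights)[::-1]]

# Signed score along the discriminant direction; the largest value marks
# the row that the model finds most typical of class 1.
scores = lda.decision_function(X)
top_row = int(np.argmax(scores))
```

Here `ranked` lists phonemes from most indicative of class 1 to most indicative of class 0. Comparing raw weights in this way is only meaningful when the features are on comparable scales.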
Principal component analysis does not take into account the authors of the poems, but rather projects linearly onto the orthogonal directions along which the data varies the most. Because PCA does not use the author information, it is interesting to note from the visualization that in the projection onto the first two principal components we can nevertheless observe some clustering by author.
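For reference, a projection onto the first two principal components can be computed directly from a singular value decomposition of the centered feature matrix. The sketch below uses NumPy on random data standing in for the sonnet features; the matrix size and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in for the feature matrix: 90 "sonnets", 10 features each,
# with the columns given different scales so the variance is uneven.
X = rng.normal(size=(90, 10)) * np.linspace(3.0, 0.5, 10)

# PCA ignores labels: center the data, then take the right singular
# vectors, which are orthonormal and ordered by decreasing variance.
X_centered = X - X.mean(axis=0)
_, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt[:2]                      # first two principal directions
projected = X_centered @ components.T    # 2-D coordinates for plotting

# Share of total variance captured by each of the two components.
explained = singular_values[:2] ** 2 / np.sum(singular_values ** 2)
```

Author labels play no part in this computation; they would only be used afterwards, to color the points in the plot.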
Lastly, to construct the tree, we first construct a matrix of pairwise distances between the sonnets, using either the cosine distance between the normalized phoneme count features or the Euclidean distance between the mean similarity vectors. A neighbor-joining algorithm is used to construct the tree from these pairwise distances, as in Holdsworth, T. L., 2019.
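The two steps above, pairwise distances followed by neighbor joining, can be sketched as follows. This is a minimal textbook version of neighbor joining written with NumPy, not necessarily the implementation used in the project; the function names, the `node0`, `node1`, ... labels for internal nodes, and the five-taxon distance matrix in the usage lines are all illustrative assumptions.

```python
import numpy as np


def cosine_distance_matrix(X):
    """Pairwise cosine distances (1 - cosine similarity) between rows."""
    X = np.asarray(X, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T


def euclidean_distance_matrix(X):
    """Pairwise Euclidean distances between rows."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def neighbor_joining(D, labels):
    """Return the unrooted tree as a list of (node, neighbor, length) edges."""
    D = np.asarray(D, dtype=float).copy()
    nodes = list(labels)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        r = D.sum(axis=1)
        # Q-criterion: the pair with the smallest value is joined next.
        Q = (n - 2) * D - r[:, None] - r[None, :]
        np.fill_diagonal(Q, np.inf)
        i, j = np.unravel_index(np.argmin(Q), Q.shape)
        # Branch lengths from the joined pair to the new internal node.
        length_i = 0.5 * D[i, j] + (r[i] - r[j]) / (2 * (n - 2))
        length_j = D[i, j] - length_i
        parent = f"node{next_id}"
        next_id += 1
        edges.append((nodes[i], parent, length_i))
        edges.append((nodes[j], parent, length_j))
        # Distances from the new node to every remaining node.
        d_new = 0.5 * (D[i] + D[j] - D[i, j])
        keep = [k for k in range(n) if k not in (i, j)]
        D_next = np.zeros((len(keep) + 1, len(keep) + 1))
        D_next[:-1, :-1] = D[np.ix_(keep, keep)]
        D_next[-1, :-1] = d_new[keep]
        D_next[:-1, -1] = d_new[keep]
        D = D_next
        nodes = [nodes[k] for k in keep] + [parent]
    # Connect the last two remaining nodes with a single edge.
    edges.append((nodes[0], nodes[1], D[0, 1]))
    return edges


# Illustrative five-taxon distance matrix (tree-like by construction).
example = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
tree_edges = neighbor_joining(example, ["a", "b", "c", "d", "e"])
total_length = sum(length for _, _, length in tree_edges)
```

For the sonnets, the input to `neighbor_joining` would be the output of one of the two distance functions applied to the corresponding feature matrix, with one label per sonnet.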
Code for this project is available on GitHub.