I Warrant He Hath A Thousand Of These Letters, Writ With Blank Space For Different Names

In York I received a flyer that was uniquely individualized.  A couple of weeks earlier, for Thanksgiving dinner, my partner Emily and I ordered from Pizza Hut and bought two “Create Your Own” pizzas.  The advert we received later has a picture, presumably Pizza Hut’s portrayal of a typical “Create Your Own” pizza, and states that “Your Create Your Own Pizza was just the beginning . . .”.  We realized that the ad was tailored to our past purchases, but the same ad, with a different type of pizza on the front, had gone out to our friends’ houses as well.  To me it seemed like a printed form of Google’s personalized web ads, which are based on your browsing history.  In both cases, each recipient gets a seemingly unique response, but in essence the advertiser applies the same algorithm to everyone.  On my WordPress blogs, I also receive a lot of spam “comments,” in which a stock phrase or statement is pasted into the comment box of a blog post, only to be caught by a program called Akismet.  I usually take time to read them all before deleting them, and they seem to follow the same format as the ads from Pizza Hut and Google.  For instance, there is usually a blank filled in with information pertinent to the viewer, an outgoing link to other things (in Pizza Hut’s case, printed pictures of additional pizzas), and other acquired information used within a set framework.  Because the ad was printed like this, I began thinking about fill-in-the-blanks in literature, particularly in terms of genre and mode as they relate to textual features.

Like the modern examples above, which create individuality through similarity, if literature can also be reduced to such a scale, a situation arises in which “the reader is free to fill in the blanks but is at the same time constrained by the patterns supplied in the text” (Freund, 142). [1]  In this way, a literary text can be understood as malleably formulaic.  Indeed, “text conventions are ‘constitutive’ rather than ‘regulative,’ i.e., they constitute rather than regulate a form or genre” (Beach, 17). [2]  This means that groups of texts can exist within literature because of the similar constitution between them and simultaneously be regulated by the texts they encompass.  Indeed, “whether an author adheres strictly to a genre or deviates from it, his intention is expressed to some degree in his basic choice of genre” (Mills, 264). [3]  To rephrase, literary genre can be understood by examining the intrinsic tie between groups of texts and the individual texts themselves.  The significance of this in Docuscope’s case is that both single texts and clusters of texts can be studied to understand the similarities between them, something we already do.  However, I would like to abstract from this to question the value of structure in the texts, particularly the presence of certain word clusters in our results compared to single words.

There is a reliance on form in Docuscope’s results which I feel has not been tested.  I know that Docuscope measures strings of words and individual words, but I am interested in the extent to which these string values shape what we see in JMP.  In other words, are Docuscope’s results adequately balanced between the content and the form of a text?  In order to test this, I randomized each of Shakespeare’s plays through the same algorithm and ran them through JMP. [The full code in Java is here and an example of a randomized play is here]  The results of this experiment are below.
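For readers who want a sense of what such a randomization step can look like, here is a minimal sketch in Java.  It is not the code linked above: it assumes that “randomizing” means shuffling the order of whitespace-separated words, and the class name WordShuffler, the method shuffleWords, and the fixed seed are all hypothetical choices made for illustration.  Every word is kept, so single-word counts are preserved while multi-word strings are broken up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not the code used for the experiment.
final class WordShuffler {

    // Splits the text on whitespace, shuffles the tokens with a seeded
    // generator, and rejoins them with single spaces. Punctuation stays
    // attached to whichever word it was attached to in the input.
    static String shuffleWords(String text, long seed) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        List<String> words = new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
        // The same seed always yields the same ordering for the same input,
        // which keeps a run repeatable.
        Collections.shuffle(words, new Random(seed));
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        String line = "I warrant he hath a thousand of these letters";
        System.out.println(shuffleWords(line, 1L));
    }
}
```

Because the shuffle only reorders tokens, any difference that shows up between an original play and its shuffled counterpart should come from word order, that is, from the multi-word strings, rather than from vocabulary.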

When compared with the original text files, the randomized files contrast almost completely.  Principal Components One and Two are highly discriminating and pull the two sets of data apart very well.  However, the rest of the Principal Components blend the two data sets more than they separate them.

The dendrogram, a cluster analysis using Ward’s method, illustrates this strong division accompanied by a weaker similarity.
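As a rough illustration of what Ward’s method measures, the sketch below computes the cost of merging two clusters: the increase in within-cluster sum of squares, which equals the squared distance between the two centroids weighted by the cluster sizes.  At each step the method joins the pair of clusters with the smallest such cost.  The class name WardCost and the toy vectors are hypothetical; this is not JMP’s implementation, which may preprocess the measurements (for example, by standardizing them) before clustering.

```java
import java.util.List;

// Hypothetical illustration of the Ward merge criterion.
final class WardCost {

    // Mean of each dimension across the points in a cluster.
    static double[] centroid(List<double[]> cluster) {
        int dims = cluster.get(0).length;
        double[] mean = new double[dims];
        for (double[] point : cluster) {
            for (int i = 0; i < dims; i++) {
                mean[i] += point[i];
            }
        }
        for (int i = 0; i < dims; i++) {
            mean[i] /= cluster.size();
        }
        return mean;
    }

    // Increase in within-cluster sum of squares if clusters a and b merge:
    // (|a| * |b| / (|a| + |b|)) * squared distance between the centroids.
    static double mergeCost(List<double[]> a, List<double[]> b) {
        double[] ca = centroid(a);
        double[] cb = centroid(b);
        double squaredDistance = 0.0;
        for (int i = 0; i < ca.length; i++) {
            double diff = ca[i] - cb[i];
            squaredDistance += diff * diff;
        }
        double weight = (double) a.size() * b.size() / (a.size() + b.size());
        return weight * squaredDistance;
    }
}
```

In this picture each play would be one point, with one coordinate per measured dimension, and a play joins its randomized counterpart early in the dendrogram only if merging the two is cheaper than any other available merge.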

In this, only The Merry Wives of Windsor and A Midsummer Night’s Dream join up with their random counterparts.  This is interesting because it would suggest that certain strings, when they are markers of genre or authorship, play a significant role in making a particular corpus specific.  However, when analyzing multiple distinct groups, the hypothetical seventeen-dimensional space in which the measurements for the dendrograms take place is distorted or distended in such a way that statistical difference can be inflated.  For example, when the measurements are confined to each data set, a more accurate portrait emerges. [4]

After viewing the picture above, I feel that the question changes from how other data sets are different from the original corpus to how the original corpus is different from the other data.  Curious results, such as in the VARD corpus, where genre appears more readily and more concretely than in the original data set, prompt a discussion of what exactly we are assuming in placing emphasis on a particular corpus.  In addition, while the original corpus and the randomized texts share a nearly identical top and bottom section, the randomized data does appear to make a finer and more precise distinction within the middle groups.

The different visuals, particularly the dendrogram that includes both data sets compared with the ones that analyze them separately, seem to point to conclusions in opposite directions.  In one respect, the results suggest that form carries great significance over content in Docuscope’s results.  There is a high contrast between randomized and non-randomized texts, and making assumptions from the randomized data looks like it would skew our perception of the sets.  In contrast, the similarities, and the apparently greater sorting power of the randomized texts when isolated, emphasize content over structure.  In my opinion, the fill-in-the-blank aspect of genre, at least in terms of Shakespeare’s corpus, has still not been tested.  I feel it would need a second author in a similar mode to decide anything.  However, I think that while we can make many assumptions and decide upon one version of Shakespeare’s texts, it might be more beneficial to explore other alternatives.  We can then picture Shakespeare’s corpus as being able to exist on multiple planes, like the texts themselves.  This would alleviate problems such as those that arise when data sets are combined.  Instead, if we view each data set as its own representation of Shakespeare’s corpus, the works can create their own multidimensional field through which we can narrow our results.  By navigating that plane, we can then precisely expand upon our previous assumptions and conclusions.

*The file for All’s Well That Ends Well was modified so that spaces and line breaks show up on .txt files.  This only involved pasting it into Word and pasting it back into .txt.  I am not sure if this has affected my results since I got the play this way and have never changed it before.

[1]        Freund, Elizabeth. The Return of the Reader: Reader-Response Criticism. Great Britain: Methuen & Co., 1987.

[2]        Beach, Richard. A Teacher’s Introduction to Reader-Response Theories. USA: National Council of Teachers of English, 1993.

[3]        Mills, Gordon. Hamlet’s Castle: The Study of Literature as a Social Experience. USA: University of Texas Press, 1976.

[4]        A humorous illustration of different map projections, and feelings about them, is available at http://xkcd.com/977/
