As a condition of Faculty-Student Collaborative Research at Skidmore College, I was required to give a presentation regarding the work Professor Enderle and I had done over the course of a month. The reader of this post is meant to follow along with the slides downloaded from the link below while reading the text.  

A Distant Reading of Empire

The goal of this project has been to complete a “distant reading” of a corpus of about 2,500 books published between 1757 and 1795. We used a technique called topic modeling in MALLET. By using this technique, we were able to draw conclusions about the colonial relationship between Britain and India during this period. I will give a bit of background before we get into what I assume are unfamiliar terms to just about everyone in this room.

Literary scholars often employ a technique called “close reading.” The scholar carefully analyzes a text sentence-by-sentence and then draws conclusions about the text as a whole. As an example, let’s take a look at the first few sentences of Pride and Prejudice (1813) by Jane Austen:

“It is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighborhood, this truth is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters” (Austen 1).

A close reading of this passage will unveil the premise of the entire novel. Austen writes: “It is a truth universally acknowledged….” Based on just these first few words, the scholar can conclude that in the world of Pride and Prejudice (the world of the eighteenth-century upper-middle class) everyone agrees on something, and that something is most likely the main point Austen is trying to make. Austen then explains what this truth is: “that a single man in possession of a good fortune must be in want of a wife.” Thus, the scholar draws the conclusion that the book will discuss marriage and money. Austen continues: “this truth is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters.” The scholar would then note that the novel discusses a single man (Darcy) in possession of a good fortune (10,000 pounds a year) who ultimately marries the protagonist, Lizzie Bennet. She will also note the unusual use of the word “property” with respect to a man: since when is a man with money considered property? At this time, women’s rights to inherit and hold property were severely limited, and daughters were basically given away by their fathers to their husbands. Austen, by conflating rich, single men with property, plays on the traditional marital roles and sets the stage for the (surprisingly) matriarchal society she portrays. Thus, the traditional view is that the literary scholar can “close read” only the first lines of Pride and Prejudice and already draw conclusions about the topics the novel may address.

Although close reading is the most popular technique employed by literary scholars, one of its drawbacks is that critics draw conclusions about large literary moments from a small corpus. Thus, even though the critic may in fact be correct in his analysis, he has most likely only supported it with a few books. Our project is based on the assumption that the study of literature can be improved by using a technique called distant reading alongside close reading.

Franco Moretti introduces this idea in an essay called “Conjectures on World Literature” (2000). As indicated by the name, distant reading is the exact opposite of close reading. Rather than reading small bits of texts in a detailed manner, distant reading entails the analysis of huge swaths of texts “from a distance.” Our project is an example of distant reading—we have taken a large corpus of 2,500 texts and analyzed it from a macro-level. Rather than drawing conclusions about a large corpus from a few novels the way a more traditional scholar might, the “distant-reader” can bring broad-based knowledge of a period gained from a distant reading to her close reading of individual novels.

Matthew Jockers takes Moretti’s work one step further. In his book Macroanalysis, Jockers writes: “Though not ‘everything’ has been digitized, we have reached a tipping point, an event horizon where enough text and literature have been encoded to both allow and, indeed, force us to ask an entirely new set of questions about literature and the literary record” (Jockers 4). The type of analysis Jockers has pursued is part of a larger, burgeoning field called Digital Humanities, or DH for short. It is broadly defined as the intersection of the humanities and computing and includes everything from Archive-Building to Game Theory. Like Professor Enderle and me, Jockers used topic modeling in MALLET to analyze theme.

One aspect of the Digital Humanities is the maintenance of research blogs. We have kept our own blog, which allows us to share our research with the greater DH community. It is called “Reading from a Distance.” We have faithfully updated this blog with posts regarding our research as well as Digital Humanities as a whole.

Jockers describes topic models as “the mother of all collocation tools” (Jockers 123). MALLET uses LDA, or latent Dirichlet allocation. This algorithm basically assumes that each document in a group of documents is made up of a mixture of potential topics. It then places words that often co-occur into bins together, and each bin becomes a topic. This approach has exciting potential for the study of literature because it can analyze thousands of .txt files in just a few hours. Scholars can use this technology to perform a distant reading without actually having to read every book in the study, which would ultimately be impossible.
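To make the idea of sorting co-occurring words into bins a little more concrete, here is a minimal sketch of LDA in Python, run on a toy corpus. It is an illustration only and is not MALLET's own code. The function name toy_lda, the four tiny "documents," and the alpha and beta settings are all assumptions made up for this example.

```python
import random
from collections import Counter


def toy_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over a list of tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    # assignments[d][i] is the topic currently assigned to token i of document d
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [Counter(assign) for assign in assignments]  # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]      # word counts per topic
    topic_total = [0] * num_topics                           # token counts per topic
    for doc, assign in zip(docs, assignments):
        for word, k in zip(doc, assign):
            topic_word[k][word] += 1
            topic_total[k] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Take the token out of its current topic...
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                # ...then favor topics that are already common in this document
                # and that already contain this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Invented toy corpus: two "documents" about trade, two about marriage.
docs = [
    "company india trade nabob company trade".split(),
    "india nabob company bengal trade india".split(),
    "marriage fortune wife daughter fortune marriage".split(),
    "wife daughter marriage fortune wife family".split(),
]
doc_topic, topic_word = toy_lda(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    print(k, [word for word, count in counts.most_common(3) if count > 0])
```

With a corpus this small the bins can shift from one random seed to the next, which is a small-scale version of the reason a real model needs many documents and several trial runs.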

MALLET produces a topic model that looks like a list of different topics with each one possessing a different set of words. It is then the job of the scholar to examine each topic and assign a name that connects all of the words in the topic. For example, we have called Topic 1 in this model “Trade and Public Revenue.” This is the more interpretative part of topic modeling, and I think it is why we can still consider this type of research a subfield of the humanities. It is important to analyze each word in the topic for its individual meaning as well as in the context of the other words in the topic. We spent quite a bit of time naming the topics because of the subjectivity of the endeavor: we wanted this project to remain analytical and so we tried to name the topics as accurately as possible.

While topic modeling often seems like “magic” to those who do not understand how it works, it has flaws like any other technique. One of the big issues for humanists is that MALLET is run on the command line of the computer. This presents a pretty steep learning curve, particularly for older faculty who are unfamiliar with computers. Even as a digital native, I needed a fairly long time to feel comfortable on the command line. The learning curve is a major methodological problem because MALLET can produce results that seem to “make sense” even if it is used incorrectly. For example, MALLET’s output can change depending on the number of topics the scholar chooses. It is important to go through multiple iterations of the model, using a different number of topics each time, until you discover the clearest picture. Furthermore, large datasets, like the one we received from HathiTrust, require a lot of preprocessing before the metadata can even be understood. Professor Enderle had to write a script that would pull out instances of four digits occurring together so we could figure out the date of publication of the texts. We had to do something similar to make sure that we were only using English-language texts. If MALLET comes across a foreign language, it will place all of the words in that language into one topic. Although MALLET sometimes appears to understand the texts it analyzes, it does not have the interpretive ability to actually recognize the semantic meaning of the words.
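A date-finding script of the kind described above might look something like the following Python sketch. This is not Professor Enderle's actual script, which is not reproduced here. The function name candidate_years, the sample metadata record, and the choice to keep only years inside our 1757 to 1795 window are assumptions made for illustration.

```python
import re

# Exactly four digits in a row, not embedded in a longer run of digits.
YEAR_PATTERN = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def candidate_years(metadata_text, earliest=1757, latest=1795):
    """Return the four-digit numbers in the text that fall inside the study period."""
    found = [int(match) for match in YEAR_PATTERN.findall(metadata_text)]
    return [year for year in found if earliest <= year <= latest]


# Invented metadata record, for illustration only.
record = "London : printed for T. Cadell, 1783. 12345 pp. 1901 reprint"
print(candidate_years(record))  # [1783]
```

Filtering by the study period is a design choice: it throws away page counts and reprint dates that happen to be four digits long, at the cost of missing any text whose date was recorded incorrectly.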

We decided on a 150-topic model, which yielded interesting results. I will only talk about the topic regarding India, Topic 114, though there were multiple exciting topics. The first three words of this topic are “company,” “nabob,” and “india.” Each of these words is a clear indicator of the colonial relationship between Britain and India. It is interesting to note that each word refers to British interests in India rather than the actual inhabitants of the subcontinent.

“Company” refers to the East India Company, which was given a Royal Charter in 1600 by Queen Elizabeth I. In 1754, the British East India Company went to war with the French East India Company. Robert Clive, the head of the army, defeated the French ally during the Battle of Plassey in 1757. “Nabob” is derived from “nawab,” a term for a governor during the Mughal period. It was used in Britain to describe an Englishman who had “gone native.” This was a major concern during this period. Finally, the term “India” pulls this topic together. The word India is an interesting term because it is a British word. “India” was not used by anyone native to the subcontinent before the British. Thus, we can conclude that during the late eighteenth century, the British were concerned with India merely for its monetary value. In other words, the British-Indian relationship at this point was different from what it would later become during the 1800s, particularly at the start of the Raj in 1858.

Other interesting words in this topic include “Clive” and “Hastings.” “Clive” refers to Robert Clive, who was a major player in the solidification of the East India Company in India. As mentioned previously, he defeated the French ally in 1757 at the Battle of Plassey, a major turning point for the British. “Hastings” refers to Warren Hastings, who was the first governor-general of Bengal. His impeachment was famously led by Edmund Burke, who was staunchly anti-colonialist, and he was acquitted in 1795. Even with this type of limited analysis, it is easy to see the benefits of topic modeling in the study of literature.

We were also able to graph this topic over time. This type of visualization presents interesting possibilities for analysis. For example, our topic model shows a connection between the North American colonies and India. The spike in the graph between 1765 and 1770 corresponds with the June 1767 Townshend Duties imposed on the North American colonists on imports like glass, paint, lead, and, most importantly for our purposes, tea. The British introduced tea to the Indian subcontinent, and proceeded to mass-produce it there. It is thus unsurprising that we see a spike in conversation about India at the same time as the Townshend Duties. We see an even bigger spike between 1770 and 1775. We attribute this spike to the “Boston Tea Party” on 16 December 1773, when American colonists dressed as Native Americans and threw East India Company tea into Boston Harbor to protest the tea tax. By using the information introduced by the topic model alongside previous knowledge about the period we are studying, we can draw conclusions about the interconnectedness of British colonial relations as a whole.
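For readers wondering what sits behind a graph like this, one simple way to build it is to average a topic's share of each document within bins of years. The Python sketch below shows that calculation under stated assumptions: the function name topic_share_by_period, the five-year bins, and every number in the sample data are invented for illustration, and this is not necessarily how our own graph was computed.

```python
from collections import defaultdict


def topic_share_by_period(documents, topic_id, bin_size=5):
    """Average one topic's proportion across the documents in each bin of years."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for year, proportions in documents:
        start = year - year % bin_size  # 1767 and 1768 both fall in the 1765 bin
        totals[start] += proportions.get(topic_id, 0.0)
        counts[start] += 1
    return {start: totals[start] / counts[start] for start in sorted(totals)}


# Invented (year, {topic: proportion}) pairs; a missing topic counts as zero.
documents = [
    (1762, {114: 0.01}),
    (1767, {114: 0.04}),
    (1768, {114: 0.06}),
    (1773, {114: 0.10}),
    (1774, {}),
]
for start, share in topic_share_by_period(documents, topic_id=114).items():
    print(start, round(share, 3))
```

Averaging rather than summing keeps a bin from looking like a spike simply because more books from those years happened to survive in the corpus.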

In the nineteenth century, J.R. Seeley argued: “we seem, as it were, to have conquered half the world in a fit of absence of mind.” This graph calls his claim into question. I argue that the British were cognizant of their goals in India, and that the spike between 1780 and 1784 serves as evidence. I believe this spike in the graph can be attributed to the 1783 East India Bill and to Pitt’s Act of 1784. The East India Bill was drafted by Edmund Burke and suggested that the government rule in India while the East India Company handled trade. Pitt’s Act also attempted to curb Company control in India. The Company became less autonomous after this act was passed, which was a major stepping-stone toward British colonization of India. It resulted in a sort of bifurcated government in India, which would eventually evolve into the British Raj after the Indian Army Mutiny in 1857. Once the government became involved in India, there was a clear uptick in “cultural colonization,” rather than only a concern with trade. The spike in conversation surrounding these two bills suggests that the British were certainly aware of the consequences of government intervention in another country. Thus, topic modeling can be used to generate new evidence for or against arguments from a macro-level.

Finally, Professor Enderle and I worked with a digital humanities start-up called Empire Windrush to create more complex visualizations of the topics. We wanted to find a way to depict our research in a more thought-provoking manner than just a line graph. Furthermore, we wanted interactive graphs on our blog for our readers to use. Through this process I learned more about the Digital Humanities subfield as well as how to effectively present my research on a blogging platform. I also worked to learn Python, a programming language used for data manipulation. Now that this month is over, I plan to continue working to learn Python as well as updating the blog with information about this project. We are planning to continue this project over the course of the year as well, and I would like to bring it to the Re:Humanities conference in April.

"A Distant Reading of Empire" Abstract


In this project, we have used MALLET (MAchine Learning for LanguagE Toolkit) to read a corpus of over 3,000 text files from a dataset requested from HathiTrust. We have drawn conclusions about the themes circulating during the late eighteenth century (specifically regarding India). We completed a 150-topic model that we then, with the help of the programmers at Empire Windrush, visualized in an interactive network graph. The graph allowed us to understand the connections between multiple topics as well as examine the changes that take place in the connections between topics over time. We also engaged in the conversation surrounding the Digital Humanities online by creating and actively updating a blog called “Reading from a Distance.” The blog presents arguments about the Digital Humanities and addresses this project.

The Subjectivity of Naming Your Topics


by Mae Capozzi

One of the most challenging aspects of this project was deciding what to name our topics. Our topic model consisted of 150 topics, with 200 words in each. We had to go through each topic in an attempt to find a word or phrase that described it as accurately as possible, which was a serious task.

Oftentimes, scholars using MALLET will name a topic based on the first three words that appear in the topic. For example, we named Topic 1 “Trade and Public Revenue,” but we could have easily named it “trade, public, money,” which are the first three words in the topic. We chose to assign a name to the topic because we think it better represents the topic than only the first three words. I think it is a fairly controversial choice to decide to name your topics using words that don’t appear in the topic, but I also think it is the right choice.
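The two naming approaches can sit side by side in a script: fall back on the first three words unless a human has supplied something better. The Python sketch below assumes a topic-keys file in which each line holds a topic number, a weight, and the topic's words, separated by tabs. The function name default_label, the weight of 0.05, and every word after “money” in the sample line are assumptions made up for illustration.

```python
def default_label(topic_keys_line, num_words=3):
    """Build a fallback label from the first few words of one topic-keys line."""
    topic_id, _weight, words = topic_keys_line.rstrip("\n").split("\t")
    return int(topic_id), ", ".join(words.split()[:num_words])


# Names assigned by hand take priority over the automatic fallback.
hand_labels = {1: "Trade and Public Revenue"}

# Invented sample line: only the first three words come from our actual Topic 1.
line = "1\t0.05\ttrade public money revenue commerce"
topic_id, fallback = default_label(line)
label = hand_labels.get(topic_id, fallback)
print(fallback)  # trade, public, money
print(label)     # Trade and Public Revenue
```

Keeping the fallback next to the hand-assigned name also lets a reader see at a glance how far an interpretive label has moved from the words themselves.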

We went about this task by bouncing ideas off of each other and going through a couple of drafts of our list. While Scott was sending data off to the programmers at Empire Windrush, a start-up company interested in helping digital humanists create visualizations of their research, I went through each of the topics and attempted to name them the best I could. I then sent my list over to Scott, and he added and edited my list. We went through this process a couple of times before we eventually agreed on all of the topic names.

When we were unable to name a topic, we assigned the name “Noisy Data X.” One of the critiques of topic modeling is that some of the topics are entirely senseless. For example, “Noisy Data 2” looks like this:

Even keeping in mind that the “long s” (ſ) of eighteenth-century printing shows up as an “f” in the text MALLET reads unless you write a script to correct it (which we did not), this topic still does not make sense. Instead of trying to name this topic arbitrarily, we decided to make it clear that this topic was senseless and could be viewed as an outlier. While critics of topic modeling in the humanities may point to this as a clear flaw in MALLET, I argue that it is minor at most. Because we are evaluating so many texts from such a macro level, we have accepted that we will miss some detail. This is merely an offshoot of that same concept. Sure, we have ten topics that we characterized as “Noisy Data,” but we also have 140 clear topics.
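A script of the kind mentioned above could take many forms. As a minimal sketch in Python, and assuming that the long s has come through as the letter “f” in the text files, one approach is to try swapping each “f” for an “s” and keep the result only if it is a word we recognize. The function name fix_long_s and the tiny word list are assumptions made up for illustration; a real version would need a full dictionary, and this is not something we ran on our corpus.

```python
from itertools import product


def fix_long_s(token, known_words):
    """Return a known word reachable by turning some 'f's into 's's, else the token."""
    if token in known_words:
        return token
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    # Try every combination of keeping 'f' or swapping in 's' at each position.
    for choice in product("fs", repeat=len(positions)):
        letters = list(token)
        for pos, ch in zip(positions, choice):
            letters[pos] = ch
        candidate = "".join(letters)
        if candidate in known_words:
            return candidate
    return token  # nothing recognizable, so leave the token alone


# Invented word list, for illustration only.
known_words = {"possession", "fortune", "also", "first"}
tokens = ["poffeffion", "fortune", "alfo", "firft", "xyz"]
print([fix_long_s(t, known_words) for t in tokens])
# ['possession', 'fortune', 'also', 'first', 'xyz']
```

Checking against a word list matters because a blanket swap of every “f” for an “s” would turn “fortune” into “sortune” and make the noise worse.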

Digital Humanities 101


by Mae Capozzi

For those readers who are just dipping their toes into the vast digital humanities world, here is a list of books and blogs that got me started, and that will hopefully help you too.



1. Distant Reading by Franco Moretti

This book discusses Moretti’s thought process over the course of his career, and lays out his concept of distant reading, which is a basis for a lot of Digital Humanities work in Literary Studies.

2. Macroanalysis by Matthew Jockers

In Macroanalysis, Jockers discusses ways to put distant reading to work from a more technologically-minded perspective. He discusses different programs available to humanists interested in delving into the Digital Humanities.





Why Using a Machine to Study Literature Isn't as Heretical as it Sounds


by Mae Capozzi

Of the backlash against the Digital Humanities (of which there is plenty, I assure you), the most interesting to me is the fear that if we begin to use computers to read, we will irrevocably lose the humanistic aspect of reading. For many, reading is a sensory experience. Readers want to touch and smell the book; they want to feel something. Many scholars just don’t want to fully quantify what they view as a wonderfully qualitative experience. It is one thing to theorize based on a close reading of hundreds of texts over the course of a lifetime; it is another thing entirely to analyze thousands of texts in just a few hours using a computer program like MALLET.

Franco Moretti, in “The Slaughterhouse of Literature,” explains that “if we set today’s canon of nineteenth-century British novels at two hundred titles (which is a very high figure), they would still be only about 0.5 per cent of all published novels.” So here we have a task that is only solvable with the help of a computer: it would take thousands of lifetimes to read the other 99.5% of nineteenth-century British novels. How about all of the other novels written in the nineteenth century in the Western world, or the non-Western world, or in other centuries? I could go on and on. The sheer massiveness of this project makes it seem wrong not to take advantage of this technology (or to at least give it a whirl).


While I locate myself firmly within the contra-canon camp, there is always the pro-canon argument that the canon consists of the best books ever written and anything outside of the canon is not worth serious literary study. The pro-canon group can certainly ask: if only the canon is worthwhile, why take the time to read the other 99.5% of texts? While scholars may never agree on this point, I believe that even supporters of the traditional canon can get behind DH, because by allowing us to read outside of the canon, it can give us a deeper understanding of why canonical texts are not “sent to the slaughterhouse” with the other 99.5%.

Re:Humanities Conference at Haverford College


by Mae Capozzi

Play. Power. Production.


In April 2014, I brought a very preliminary version of this research to the Re:Humanities Conference at Haverford College. Because it is the only undergraduate digital humanities conference in the U.S., I was excited to participate. The experience exceeded my expectations. Every student came with a well-researched project, on topics ranging from video game design to archive building.

The two-day conference began with a keynote speech entitled “The Political Power of Play” by Adeline Koh. Koh is the Director of DH@Stockton, Assistant Professor of Literature at Richard Stockton College, and cofounder of #DHPoco.

The students involved in the conference were split into two separate groups; each group presented on different days. The first group included titles such as: Narrative and Gameplay: Adapting Physical Narratives to a Digital Medium by Hannah Weissmann from Haverford College, An Algorithm for Serendipity: An Inquiry of Online Dating for Emotional Beings in a Digital Age by Shireen Saxena from Bryn Mawr College, and Imagining the Straight(?) Gate: Messiah, Utopia, and Queer Internet Parody by Benjamin Bernard-Herman and Dylan Hillerbrand from Swarthmore College.

On the second day, we saw mostly thesis presentations from seniors. These projects included: Poetry as a Complex System by Bronwen Hudson from the University of Vermont, The Lightning of Possible Storms: Critical Theory, Interactive Fiction, and the Pedagogy of Narrative Ludology in Bioshock Infinite by Marissa Koors at Emerson College, and Trauma, Enslavement, Embodiment, and Freedom: A New Media Approach to Narratives of Enslavement by Elizabeth Alexander at Amherst College.

The conference concluded with a keynote speech by Mary Flanagan called “Humanist Design.” Flanagan is the author of Critical Play and is the Sherman Fairchild Distinguished Professor in Digital Humanities at Dartmouth College.

I found it particularly exciting that the conference was at an undergraduate level, because it seems as though DH is the next step for disciplines like literary studies, social history, etc. I would be surprised if 90% of the students involved in this conference did not continue on to graduate school to become the next wave of humanists. Thus, I think it is necessary for undergraduates to hone their skills and have the opportunity to share their research with a larger audience than just their home institutions.

Furthermore, one of the beautiful aspects of DH is its “shareability.” Although each scholar recognized that their research could just as easily have been shared online, it was exciting to attend a traditional conference. This type of multifaceted symposium allowed students to share new, digital projects while practicing real-time presentation and discussion skills.

Another interesting aspect of this conference was that we were encouraged to live tweet during presentations (#rehum14). It was a strange feeling to have my iPhone in hand during presentations at first, but I soon warmed up to the new sensation. In fact, it brought on a new sense of engagement with the presenter, as I could summarize or question an argument moments after it was mentioned. This aspect of the conference, once again, added to the “shareability” of DH as a discipline. I look forward to Re:Humanities ’15 in the spring.

Distant Reading and the Slaughterhouse of Literature


by Mae Capozzi

I first heard about distant reading in a 200-level English class on World Literature and was immediately hooked. Though I have always enjoyed close reading, I was excited to discover there were different ways to think about literature than book-by-book. I had often felt that familiar feeling of not having read enough, and when I spoke to other academics they expressed that same anxiety. Here, I thought, was the solution. Of course, as I investigated Moretti’s concept more deeply, I realized I could not have been further from the truth. Moretti’s goal is not to simplify. Rather, he seeks to expand beyond the canon, to understand not only why canonical texts work, but why other books sputter out without so much as a spark.