Presentation Notes for “A Distant Reading of Empire”


As a condition of Faculty-Student Collaborative Research at Skidmore College, I was required to give a presentation regarding the work Professor Enderle and I had done over the course of a month. The reader of this post is meant to follow along with the slides downloaded from the link below while reading the text.  

A Distant Reading of Empire

The goal of this project has been to complete a “distant reading” of a corpus of about 2,500 books published between 1757 and 1795. We used a technique called topic modeling in MALLET. By using this technique, we were able to draw conclusions about the colonial relationship between Britain and India during this period. I will give a bit of background before we get into what I assume are unfamiliar terms to just about everyone in this room.

First Slide

Literary scholars often employ a technique called “close reading.” The scholar carefully analyzes a text sentence by sentence and then draws conclusions about the text as a whole. As an example, let’s take a look at the first few sentences of Pride and Prejudice (1813) by Jane Austen:

“It is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighborhood, this truth is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters”(Austen 1).

A close reading of this passage will unveil the premise of the entire novel. Austen writes: “It is a truth universally acknowledged….” Based on just these first few words, the scholar can conclude that in the world of Pride and Prejudice (the world of the eighteenth-century upper-middle class), everyone agrees on something, and that something is most likely the main point Austen is trying to make. Austen then explains what this truth is: “that a single man in possession of a good fortune must be in want of a wife.” Thus, the scholar draws the conclusion that the book will discuss marriage and money. Austen continues: “this truth is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters.” The scholar would then note that the novel discusses a single man (Darcy) in possession of a good fortune (10,000 pounds a year) who ultimately marries the protagonist, Lizzie Bennet. She will also note the unusual use of the word “property” with respect to a man: since when is a man with money considered property? At this time, women often could not inherit property, and so were basically given away by their fathers to their husbands. Austen, by conflating rich, single men with property, plays on the traditional marital roles and sets the stage for the (surprisingly) matriarchal society she portrays. Thus, the traditional view is that the literary scholar can “close read” only the first lines of Pride and Prejudice and already draw conclusions about the topics the novel may address.

Next Slide

Although close reading is the most popular technique employed by literary scholars, one of its drawbacks is that critics draw conclusions about large literary moments from a small corpus. Thus, even though the critic may in fact be correct in his analysis, he has most likely only supported it with a few books. Our project is based on the assumption that the study of literature can be improved by using a technique called distant reading alongside close reading.

Next Slide

Franco Moretti introduces this idea in an essay called “Conjectures on World Literature” (2000). As indicated by the name, distant reading is the exact opposite of close reading. Rather than reading small bits of text in a detailed manner, distant reading entails the analysis of huge swaths of text “from a distance.” Our project is an example of distant reading: we have taken a large corpus of about 2,500 texts and analyzed it at a macro level. Rather than drawing conclusions about a large corpus from a few novels the way a more traditional scholar might, the “distant reader” can bring broad-based knowledge of a period gained from a distant reading to her close reading of individual novels.

Next Slide

Matthew Jockers takes Moretti’s work one step further. In his book Macroanalysis, Jockers writes: “Though not ‘everything’ has been digitized, we have reached a tipping point, an event horizon where enough text and literature have been encoded to both allow and, indeed, force us to ask an entirely new set of questions about literature and the literary record” (Jockers 4). The type of analysis Jockers has pursued is part of a larger, burgeoning field called Digital Humanities, or DH for short. It is broadly defined as the intersection of the humanities and computing and includes everything from Archive-Building to Game Theory. Like Professor Enderle and me, Jockers also used topic modeling in MALLET to analyze theme.

Next Slide

One aspect of the Digital Humanities is the maintenance of research blogs. We have kept our own blog, called “Reading from a Distance,” which allows us to share our research with the greater DH community. We have faithfully updated this blog with posts regarding our research as well as Digital Humanities as a whole.

Next Slide

Jockers describes topic models as “the mother of all collocation tools” (Jockers 123). MALLET uses LDA, or latent Dirichlet allocation. This algorithm basically assumes that each document in a group of documents is made up of a mixture of potential topics, where each topic is a set of words that tend to appear together. It then places words that co-occur often into bins together. This function has exciting potential for the study of literature because it can analyze thousands of .txt files in just a few hours. Scholars can use this technology to perform a distant reading without actually having to read every book in the study, which would ultimately be impossible.
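To make the idea of “bins” a little more concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, written in plain Python. It is not MALLET’s own implementation and not the code we ran: the function name toy_lda, the parameter values, and the four tiny “documents” are all invented for illustration.

```python
import random

def toy_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a toy LDA model by collapsed Gibbs sampling.

    docs is a list of token lists. Returns (mixtures, top_words), where
    mixtures[d] is the estimated topic mixture of document d and
    top_words[k] lists the three most frequent words in topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    vocab_size = len(vocab)

    # Count tables: topics per document, words per topic, words per topic overall.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start by assigning every word occurrence to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            topic = rng.randrange(num_topics)
            doc_assignments.append(topic)
            doc_topic[d][topic] += 1
            topic_word[topic][word] += 1
            topic_total[topic] += 1
        assignments.append(doc_assignments)

    # Repeatedly resample each word's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][word] -= 1
                topic_total[old] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][word] += 1
                topic_total[new] += 1

    mixtures = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    top_words = [
        sorted(vocab, key=lambda w: topic_word[k][w], reverse=True)[:3]
        for k in range(num_topics)
    ]
    return mixtures, top_words

# Invented sample "documents", far smaller than any real corpus.
docs = [
    "company nabob india trade company india".split(),
    "india company trade nabob".split(),
    "wife fortune marriage wife fortune".split(),
    "marriage fortune wife daughters".split(),
]
mixtures, top_words = toy_lda(docs, num_topics=2)
for k, words in enumerate(top_words):
    print("Topic", k, words)
for d, mixture in enumerate(mixtures):
    print("Document", d, [round(share, 2) for share in mixture])
```

Each document ends up with a mixture of topic shares that sums to one, and each topic is summarized by its most frequent words. On a corpus this small the result is only a toy; the point is the shape of the output, not its quality.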

Next Slide

MALLET produces a topic model that looks like a list of different topics, each with its own set of words. It is then the job of the scholar to examine each topic and assign a name that connects all of the words in the topic. For example, we have called Topic 1 in this model “Trade and Public Revenue.” This is the more interpretive part of topic modeling, and I think it is why we can still consider this type of research a subfield of the humanities. It is important to analyze each word in the topic for its individual meaning as well as in the context of the other words in the topic. We spent quite a bit of time naming the topics because of the subjectivity of the endeavor: we wanted this project to remain analytical, and so we tried to name the topics as accurately as possible.
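The naming step described above can be sketched in a few lines of Python. This assumes the topic list is a tab-separated file with a topic number, a weight, and a space-separated word list on each line, which is the layout I understand MALLET’s topic-keys output to use; the function name parse_topic_keys, the weight 0.05, and the label “British India” are made up for the example.

```python
def parse_topic_keys(text):
    """Map each topic number to its list of top words.

    Assumes one topic per line: topic number, weight, and words,
    separated by tabs (an assumption about the file layout).
    """
    topics = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        topic_id, _weight, words = line.split("\t", 2)
        topics[int(topic_id)] = words.split()
    return topics

# One sample line; the weight 0.05 is invented.
sample = "114\t0.05\tcompany nabob india clive hastings\n"
topics = parse_topic_keys(sample)

# Labels are assigned by hand after reading the words; this one is hypothetical.
labels = {114: "British India"}
for topic_id, words in topics.items():
    print(topic_id, labels.get(topic_id, "(unnamed)"), words[:3])
```

The program only collects the words. Choosing the label is still the scholar’s job.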

Next Slide

While topic modeling often seems like “magic” to those who do not understand how it works, it has flaws like any other technique. One of the big issues for humanists is that MALLET is run on the command line of the computer. This presents a pretty steep learning curve, particularly for older faculty who are unfamiliar with computers. Even as a digital native, it took me a fairly long time to feel comfortable on the command line. The learning curve is a major methodological problem because MALLET can produce results that seem to “make sense” even if it is used incorrectly. For example, MALLET’s output can change depending on the number of topics the scholar chooses. It is important to go through multiple iterations of the model using a different number of topics each time until you discover the clearest picture. Furthermore, large datasets, like the one we received from HathiTrust, require a lot of preprocessing to even understand the metadata. Professor Enderle had to write a script that would pull out instances of four digits occurring together so we could figure out the date of publication of the texts. We had to do something similar to make sure that we were only using English-language texts. If MALLET comes across a foreign language, it will place all of the words in that language into one topic. Although it sometimes appears to understand the texts it analyzes, it does not have the interpretive ability to actually recognize the semantic meaning of the words.
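The date-finding step mentioned above might look something like the following. This is only a sketch of the idea, not Professor Enderle’s actual script: the function name guess_publication_year, the restriction to the 1757–1795 range of our corpus, and the sample metadata strings are all assumptions made for illustration.

```python
import re

# Four digits that are not part of a longer run of digits.
YEAR_PATTERN = re.compile(r"(?<!\d)(\d{4})(?!\d)")

def guess_publication_year(metadata, earliest=1757, latest=1795):
    """Return the first four-digit number inside the date range, or None."""
    for match in YEAR_PATTERN.finditer(metadata):
        year = int(match.group(1))
        if earliest <= year <= latest:
            return year
    return None

# Invented sample metadata strings.
print(guess_publication_year("London : Printed for T. Cadell, 1776."))
print(guess_publication_year("v. 2, 12345 pages, no date"))
```

Restricting matches to the expected date range is one way to avoid mistaking a page count or a shelf number for a year, though it would also discard any text whose real date falls outside that range.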

Next Slide

We decided on a 150-topic model, which yielded interesting results. I will only talk about the topic regarding India, Topic 114, though there were multiple exciting topics. The first three words of this topic are “company,” “nabob,” and “india.” Each of these words is a clear indicator of the colonial relationship between Britain and India. It is interesting to note that each word refers to British interests in India rather than the actual inhabitants of the subcontinent.

Next Slide

“Company” refers to the East India Company, which was given a Royal Charter in 1600 by Queen Elizabeth I. In 1754, the British East India Company went to war with the French East India Company. Robert Clive, the head of the army, defeated the French ally during the Battle of Plassey in 1757. “Nabob” is derived from “nawab,” a term for a governor during the Mughal period. It was used in Britain to describe an Englishman who had “gone native.” This was a major concern during this period. Finally, the term “India” pulls this topic together. The word India is an interesting term because it is a British word. “India” was not used by anyone native to the subcontinent before the British. Thus, we can conclude that during the late eighteenth century, the British were concerned with India merely for its monetary value. In other words, the British-Indian relationship at this point was different from what it would later become during the 1800s, particularly at the start of the Raj in 1858.

Next Slide

Other interesting words in this topic include “Clive” and “Hastings.” “Clive” refers to Robert Clive, who was a major player in the solidification of the East India Company in India. As mentioned previously, he defeated the French ally in 1757 at the Battle of Plassey, a major turning point for the British. “Hastings” refers to Warren Hastings, who was the first governor-general of Bengal. He was famously impeached by Edmund Burke, who was staunchly anti-colonialist, and acquitted in 1795. Even with this type of limited analysis, it is easy to see the benefits of topic modeling in the study of literature.

Next Slide

We were also able to graph this topic over time. This type of visualization presents interesting possibilities for analysis. For example, our topic model shows a connection between the North American colonies and India. The spike in the graph between 1765 and 1770 corresponds with the June 1767 Townshend Duties imposed on the North American colonists on imports like glass, paint, lead, and, most importantly for our purposes, tea. The British introduced tea to the Indian subcontinent, and proceeded to mass-produce it there. It is thus unsurprising that we see a spike in conversation about India at the same time as the Townshend Duties. We see an even bigger spike between 1770 and 1775. We attribute this spike to the “Boston Tea Party” on 16 December 1773, when American colonists dressed as Native Americans threw East India Company tea into Boston Harbor to protest the tea tax. By using the information introduced by the topic model alongside previous knowledge about the period we are studying, we can draw conclusions about the interconnectedness of British colonial relations as a whole.
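A graph like this one rests on a simple calculation: for each stretch of years, average the share of the topic across the texts published in that stretch. Here is a minimal sketch of that step in Python. The function name mean_topic_share_by_period, the five-year bins, and the sample numbers are invented for illustration and are not our actual data or the method behind our graph.

```python
from collections import defaultdict

def mean_topic_share_by_period(records, bin_size=5):
    """Average one topic's share per period.

    records is an iterable of (year, share) pairs, where share is the
    proportion of a single text assigned to the topic.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for year, share in records:
        start = year - (year % bin_size)  # first year of the bin
        totals[start] += share
        counts[start] += 1
    return {start: totals[start] / counts[start] for start in sorted(totals)}

# Invented sample values: (publication year, share of the India topic).
sample = [(1766, 0.02), (1768, 0.06), (1772, 0.10), (1774, 0.14), (1781, 0.08)]
averages = mean_topic_share_by_period(sample)
for start, mean in averages.items():
    print(f"{start}-{start + 4}: {mean:.3f}")
```

The choice of bin size matters: wider bins smooth the line and can hide a short spike, while narrower bins make the line jumpier when only a few texts fall in each period.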

In the nineteenth century, J.R. Seeley argued: “we seem, as it were, to have conquered and peopled half the world in a fit of absence of mind.” This graph calls his claim into question. I argue that the British were cognizant of their goals in India, and that the spike between 1780 and 1784 serves as evidence. I believe this spike in the graph can be attributed to the 1783 East India Bill and to Pitt’s Act of 1784. The East India Bill was drafted by Edmund Burke and suggested that the government rule in India while the East India Company handled trade. Pitt’s Act also attempted to curb Company control in India. The Company became less autonomous after this act was passed, which was a major stepping-stone toward British colonization of India. It resulted in a sort of bifurcated government in India, which would eventually evolve to become the British Raj after the Indian Army Mutiny in 1857. Once the government became involved in India, there is a clear uptick in “cultural colonization,” rather than only a concern with trade. The spike in conversation surrounding these two bills suggests that the British were certainly aware of the consequences of government intervention in another country. Thus, topic modeling can be used to generate new evidence for or against arguments at a macro level.

Next Slide

Finally, Professor Enderle and I worked with a digital humanities start-up called Empire Windrush to create more complex visualizations of the topics. We wanted to find a way to depict our research in a more thought-provoking manner than just a line graph. Furthermore, we wanted interactive graphs that readers of our blog could explore. Through this process I learned more about the Digital Humanities subfield as well as how to effectively present my research on a blogging platform. I also worked to learn Python, a programming language used for data manipulation. Now that this month is over, I plan to continue working to learn Python as well as updating the blog with information about this project. We are planning to continue this project over the course of the year as well, and I would like to bring it to the Re:Humanities conference in April.

Digital Humanities 101


by Mae Capozzi

For those readers who are just dipping their toes into the vast digital humanities world, here is a list of books and blogs that got me started, and that will hopefully help you too.



1. Distant Reading by Franco Moretti

This book discusses Moretti’s thought process over the course of his career, and lays out his concept of distant reading, which is a basis for a lot of Digital Humanities work in Literary Studies.

2. Macroanalysis by Matthew Jockers

In Macroanalysis, Jockers discusses ways to put distant reading to work from a more technologically-minded perspective. He discusses different programs available to humanists interested in delving into the Digital Humanities.





Why Using a Machine to Study Literature Isn’t as Heretical as it Sounds


by Mae Capozzi

Of the backlash against the Digital Humanities (of which there is plenty, I assure you), the most interesting to me is the fear that if we begin to use computers to read, we will irrevocably lose the humanistic aspect of reading. For many, reading is a sensory experience. Readers want to touch and smell the book; they want to feel something. Many scholars just don’t want to fully quantify what they view as a wonderfully qualitative experience. It is one thing to theorize based on a close reading of hundreds of texts over the course of a lifetime; it is another thing entirely to analyze thousands of texts in just a few hours using a computer program like MALLET.

Franco Moretti, in “The Slaughterhouse of Literature,” explains that “if we set today’s canon of nineteenth-century British novels at two hundred titles (which is a very high figure), they would still be only about 0.5 per cent of all published novels.” So here we have a task that is only solvable with the help of a computer: it would take thousands of lifetimes to read the other 99.5% of nineteenth-century British novels. How about all of the other novels written in the nineteenth century in the Western world, or the non-Western world, or in other centuries? I could go on and on. The sheer scale of this project makes it seem wrong not to take advantage of this technology (or at least to give it a whirl).


While I locate myself firmly within the contra-canon camp, there is always the pro-canon argument that the canon consists of the best books ever written and anything outside of the canon is not worth serious literary study. The pro-canon group can certainly ask: if only the canon is worthwhile, why take the time to read the other 99.5% of texts? While scholars may never reconcile on this point, I believe that even supporters of the traditional canon can get behind DH, because by allowing us to read outside of the canon, it gives us a deeper understanding of why canonical texts are not “sent to the slaughterhouse” with the other 99.5%.

Distant Reading and the Slaughterhouse of Literature


by Mae Capozzi

I first heard about distant reading in a 200-level English class on World Literature and was immediately hooked. Though I have always enjoyed close reading, I was excited to discover there were different ways to think about literature than book-by-book. I had often felt that familiar feeling of not having read enough, and when I spoke to other academics they expressed that same anxiety. Here, I thought, was the solution. Of course, as I investigated Moretti’s concept more deeply, I realized I could not have been further from the truth. Moretti’s goal is not to simplify. Rather, he seeks to expand beyond the canon, to understand not only why canonical texts work, but why other books sputter out without so much as a spark.





“The United States is the country of close reading, so I don’t expect this idea to be particularly popular. But the trouble with close reading (in all of its incarnations, from the new criticism to deconstruction) is that it necessarily depends on an extremely small canon. This may have become an unconscious and invisible premise by now, but it is an iron one nonetheless: you invest so much in individual texts only if you think that very few of them really matter. Otherwise, it doesn’t make sense. And if you want to look beyond the canon…close reading will not do it. It’s not designed to do it, it’s designed to do the opposite. At bottom, it’s a theological exercise––very solemn treatment of very few texts taken very seriously––whereas what we really need is a little pact with the devil: we know how to read texts, now let’s learn how not to read them”(Moretti 48).

Hello internet. Welcome to our blog!

The goal of this blog is to create a record of our application of Moretti’s concept of Distant Reading, using topic modeling, to eighteenth- and nineteenth-century British texts. Reductively, distant reading is the opposite of close reading; the scholar examines many texts on a macro scale rather than one text on a micro scale. While some theorists have embraced Moretti’s idea, others are appalled by the suggestion that distant reading is somehow better than the more traditional method. We certainly do not believe distant reading should ever replace real human reading of texts. Rather, we assert that it can (and should) be used as a supplement to close reading.

Because of the newness of this field of inquiry, we are unsure of exactly how this project will look. Preliminarily, we are interested in looking for unexpected connections between genres and exploring different types of visualizations. Ultimately, we feel that it is important to share our research with other DH scholars, especially because the field is, as yet, so unexamined.

More on all of this later…


Feel free to read through our “About” page by pressing the “+” button at the bottom of the page.


Moretti, Franco. Distant Reading. London: Verso, 2013. Print.