Presentation Notes for “A Distant Reading of Empire”


As a condition of Faculty-Student Collaborative Research at Skidmore College, I was required to give a presentation regarding the work Professor Enderle and I had done over the course of a month. The reader of this post is meant to follow along with the slides downloaded from the link below while reading the text.  

A Distant Reading of Empire

The goal of this project has been to complete a “distant reading” of a corpus of about 2,500 books published between 1757 and 1795. We used a technique called topic modeling in MALLET. By using this technique, we were able to draw conclusions about the colonial relationship between Britain and India during this period. I will give a bit of background before we get into what I assume are unfamiliar terms to just about everyone in this room.

First Slide

Literary scholars often employ a technique called “close reading.” The scholar carefully analyzes a text sentence-by-sentence and then draws conclusions about the text as a whole. As an example, let’s take a look at the first few sentences of Pride and Prejudice (1813) by Jane Austen:

“It is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighborhood, this truth is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters”(Austen 1).

A close reading of this passage will unveil the premise of the entire novel. Austen writes: “It is a truth universally acknowledged….” Based on just these first few words, the scholar can conclude that in the world of Pride and Prejudice, the world of the eighteenth-century upper-middle class, everyone agrees on something, and that something is most likely the main point Austen is trying to make. Austen then explains what this truth is: “that a single man in possession of a good fortune must be in want of a wife.” Thus, the scholar draws the conclusion that the book will discuss marriage and money. Austen continues: “this truth is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters.” The scholar would then note that the novel discusses a single man (Darcy) in possession of a good fortune (10,000 pounds a year) who ultimately marries the protagonist, Lizzie Bennet. She will also note the unusual use of the word “property” with respect to a man: since when is a man with money considered property? At this time, women were not even allowed to inherit money, and so were basically given away by their fathers to their husbands. Austen, by conflating rich, single men with property, plays on the traditional marital roles and sets the stage for the (surprisingly) matriarchal society she portrays. Thus, the traditional view is that the literary scholar can “close read” only the first lines of Pride and Prejudice and already draw conclusions about the topics the novel may address.

Next Slide

Although close reading is the most popular technique employed by literary scholars, one of its drawbacks is that critics draw conclusions about large literary moments from a small corpus. Thus, even though the critic may in fact be correct in his analysis, he has most likely only supported it with a few books. Our project is based on the assumption that the study of literature can be improved by using a technique called distant reading alongside close reading.

Next Slide

Franco Moretti introduces this idea in an essay called “Conjectures on World Literature” (2000). As indicated by the name, distant reading is the exact opposite of close reading. Rather than reading small bits of texts in a detailed manner, distant reading entails the analysis of huge swaths of texts “from a distance.” Our project is an example of distant reading—we have taken a large corpus of 2,500 texts and analyzed it from a macro-level. Rather than drawing conclusions about a large corpus from a few novels the way a more traditional scholar might, the “distant-reader” can bring broad-based knowledge of a period gained from a distant reading to her close reading of individual novels.

Next Slide

Matthew Jockers takes Moretti’s work one step further. In his book Macroanalysis, Jockers writes: “Though not ‘everything’ has been digitized, we have reached a tipping point, an event horizon where enough text and literature have been encoded to both allow and, indeed, force us to ask an entirely new set of questions about literature and the literary record” (Jockers 4). The type of analysis Jockers has pursued is part of a larger, burgeoning field called Digital Humanities, or DH for short. It is broadly defined as the intersection of the humanities and computing and includes everything from Archive-Building to Game Theory. Like Professor Enderle and me, Jockers used topic modeling in MALLET to analyze theme.

Next Slide

One aspect of the Digital Humanities is the maintenance of research blogs. We have kept our own blog, called “Reading from a Distance,” so that we can share our research with the greater DH community. We have faithfully updated this blog with posts regarding our research as well as Digital Humanities as a whole.

Next Slide

Jockers describes topic models as “the mother of all collocation tools” (Jockers 123). MALLET uses LDA, or latent Dirichlet allocation. This algorithm assumes that each document in a group of documents is made up of a mixture of potential topics. It then places words that often co-occur into bins together. This function has exciting potential for the study of literature because it can analyze thousands of .txt files in just a few hours. Scholars can use this technology to perform a distant reading without actually having to read every book in the study, which would ultimately be impossible.
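To make the “mixture of topics” idea a little more concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling in plain Python. It is not MALLET’s code (MALLET is a Java toolkit with a far more optimized implementation), and it is only meant to show the shape of the algorithm. The function name `toy_lda`, the hyperparameter values `alpha` and `beta`, and the six tiny “documents” are all assumptions made up for illustration.

```python
import random
from collections import Counter


def toy_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns up to three top words per topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Count tables that the sampler keeps up to date.
    doc_topic = [[0] * num_topics for _ in docs]         # tokens in doc d assigned to topic k
    topic_word = [Counter() for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                       # tokens assigned to topic k overall

    # Start by giving every token a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Weight each topic by how much this document already uses it
                # and by how strongly the topic is associated with this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return [[w for w, c in tw.most_common(3) if c > 0] for tw in topic_word]


# Six made-up "documents" (hypothetical data, not drawn from our corpus).
docs = [
    "company trade india nabob company".split(),
    "india company nabob trade revenue".split(),
    "trade revenue company india".split(),
    "sermon church faith prayer sermon".split(),
    "church faith prayer sermon grace".split(),
    "faith grace church prayer".split(),
]

for topic_id, words in enumerate(toy_lda(docs, num_topics=2)):
    print(topic_id, words)
```

With a corpus this small the result depends heavily on the random seed and the number of iterations, which is a miniature version of the point that a model’s output changes with the choices the scholar makes.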

Next Slide

MALLET produces a topic model that looks like a list of different topics with each one possessing a different set of words. It is then the job of the scholar to examine each topic and assign a name that connects all of the words in the topic. For example, we have called Topic 1 in this model “Trade and Public Revenue.” This is the more interpretative part of topic modeling, and I think it is why we can still consider this type of research a subfield of the humanities. It is important to analyze each word in the topic for its individual meaning as well as in the context of the other words in the topic. We spent quite a bit of time naming the topics because of the subjectivity of the endeavor. We wanted this project to remain analytical, and so we tried to name the topics as accurately as possible.

Next Slide

While topic modeling often seems like “magic” to those who do not understand how it works, it has flaws like any other technique. One of the big issues for humanists is that MALLET is run from the command line of the computer. This presents a pretty steep learning curve, particularly for older faculty who are unfamiliar with computers. Even as a digital native, I needed a fairly long time to feel comfortable on the command line. This is a major methodological problem because MALLET can produce results that seem to “make sense” even if it is used incorrectly. For example, MALLET’s output can change depending on the number of topics the scholar chooses. It is important to go through multiple iterations of the model, using a different number of topics each time, until you discover the clearest picture. Furthermore, large datasets, like the one we received from HathiTrust, require a lot of preprocessing to even understand the metadata. Professor Enderle had to write a script that would pull out instances of four digits occurring together so we could figure out the date of publication of the texts. We had to do something similar to make sure that we were only using English-language texts. If MALLET comes across a foreign language, it will place all of the words in that language into one topic. Although it sometimes appears to understand the texts it analyzes, it does not have the interpretive ability to actually recognize the semantic meaning of the words.
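The date-finding step can be sketched in a few lines of Python. This is not Professor Enderle’s actual script, only a guess at the general approach: look for runs of exactly four digits and keep the first one that falls inside the period of the study. The function name `guess_publication_year`, the restriction to 1757–1795, and the sample metadata strings are assumptions for illustration.

```python
import re

# Exactly four digits, not embedded in a longer run of digits.
YEAR_PATTERN = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def guess_publication_year(metadata, earliest=1757, latest=1795):
    """Return the first four-digit run inside the study period, or None if there is none."""
    for match in YEAR_PATTERN.finditer(metadata):
        year = int(match.group(1))
        if earliest <= year <= latest:
            return year
    return None


# Hypothetical metadata strings, written for this example.
samples = [
    "London: printed for T. Cadell, 1776.",
    "Reprinted 1901 from the edition of 1783",
    "no date given",
]
for line in samples:
    print(line, "->", guess_publication_year(line))
```

Filtering by the study period is one simple way to avoid mistaking a reprint date or a page count for the year of publication, though a real dataset would likely need more checks than this.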

Next Slide

We decided on a 150-topic model, which yielded interesting results. I will only talk about the topic regarding India, Topic 114, though there were multiple exciting topics. The first three words of this topic are “company,” “nabob,” and “india.” Each of these words is a clear indicator of the colonial relationship between Britain and India. It is interesting to note that each word refers to British interests in India rather than the actual inhabitants of the subcontinent.

Next Slide

“Company” refers to the East India Company, which was given a Royal Charter in 1600 by Queen Elizabeth I. In 1754, the British East India Company went to war with the French East India Company. Robert Clive, the head of the army, defeated the French ally during the Battle of Plassey in 1757. “Nabob” is derived from “nawab,” which is a term to describe a governor during the Mughal period. It was used in Britain to describe an Englishman who had “gone native.” This was a major concern during this period. Finally, the term “India” pulls this topic together. The word India is an interesting term because it is a British word. “India” was not used by anyone native to the subcontinent before the British. Thus, we can conclude that during the late eighteenth century, the British were concerned with India merely for its monetary value. In other words, the British-Indian relationship at this point was different from what it would later become during the 1800s, particularly at the start of the Raj in 1858.

Next Slide

Other interesting words in this topic include “Clive” and “Hastings.” “Clive” refers to Robert Clive, who was a major player in the solidification of the East India Company in India. As mentioned previously, he defeated the French in 1757 at the Battle of Plassey, a major turning-point for the British. “Hastings” refers to Warren Hastings, who was the first governor-general of Bengal. He was famously impeached by Edmund Burke, who was staunchly anti-colonialist, and acquitted in 1795. Even with this type of limited analysis, it is easy to see the benefits of topic modeling in the study of literature.

Next Slide

We were also able to graph this topic over time. This type of visualization presents interesting possibilities for analysis. For example, our topic model shows a connection between the North American colonies and India. The spike in the graph between 1765 and 1770 corresponds with the June 1767 Townshend Duties imposed on the North American colonists on imports like glass, paint, lead, and, most importantly for our purposes, tea. The British introduced tea to the Indian subcontinent, and proceeded to mass-produce it there. It is thus unsurprising that we see a spike in conversation about India at the same time as the Townshend Duties. We see an even bigger spike between 1770 and 1775. We attribute this spike to the “Boston Tea Party” on 16 December 1773, when American colonists dressed as Native Americans threw East India Company tea into Boston Harbor to protest the tea tax. By using the information introduced by the topic model alongside previous knowledge about the period we are studying, we can draw conclusions about the interconnectedness of British colonial relations as a whole.
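For readers curious how a topic can be traced over time, here is one possible sketch in Python. It assumes that each document has a publication year and a number between 0 and 1 saying how much of that document belongs to the topic, and it averages those shares within five-year bins. The function name `topic_share_by_period`, the bin width, and every value in `records` are made up for illustration; they are not our data, and this is not necessarily how our graph was produced.

```python
from collections import defaultdict


def topic_share_by_period(records, bin_size=5):
    """Average a topic's share of each document within fixed-width year bins."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for year, share in records:
        start = year - year % bin_size  # e.g. 1767 falls in the bin starting at 1765
        totals[start] += share
        counts[start] += 1
    return {start: totals[start] / counts[start] for start in sorted(totals)}


# Hypothetical (year, share of the India topic) pairs, invented for this example.
records = [(1766, 0.02), (1768, 0.04), (1772, 0.10), (1774, 0.06), (1781, 0.08)]

for start, average in topic_share_by_period(records).items():
    print(f"{start}-{start + 4}: {average:.3f}")
```

Averaging within bins rather than plotting every document smooths out individual books, so that a spike reflects many texts published in the same few years.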

In the nineteenth century, J.R. Seeley argued: “we seem, as it were, to have conquered half the world in a fit of absence of mind.” This graph calls his claim into question. I argue that the British were cognizant of their goals in India, and that the spike between 1780 and 1784 serves as evidence. I believe this spike in the graph can be attributed to the 1783 East India Bill and to Pitt’s Act of 1784. The East India Bill was drafted by Edmund Burke and suggested that the government rule in India while the East India Company handled trade. Pitt’s Act also attempted to curb Company control in India. The Company became less autonomous after this act was passed, which was a major stepping-stone toward British colonization of India. It resulted in a sort of bifurcated government in India, which would eventually evolve to become the British Raj after the Indian Army Mutiny in 1857. Once the government became involved in India, there was a clear uptick in “cultural colonization,” rather than only a concern with trade. The spike in conversation surrounding these two bills suggests that the British were certainly aware of the consequences of government intervention in another country. Thus, topic modeling can be used to generate new evidence for or against arguments at the macro level.

Next Slide

Finally, Professor Enderle and I worked with a digital humanities start-up called Empire Windrush to create more complex visualizations of the topics. We wanted to find a way to depict our research in a more thought-provoking manner than just a line graph. Furthermore, we wanted interactive graphs that readers of our blog could explore. Through this process I learned more about the Digital Humanities subfield as well as how to effectively present my research on a blogging platform. I also worked to learn Python, a programming language used for data manipulation. Now that this month is over, I plan to continue learning Python and updating the blog with information about this project. We are planning to continue this project over the course of the year as well, and I would like to bring it to the Re:Humanities conference in April.


“A Distant Reading of Empire” Abstract


In this project, we have used MALLET (MAchine Learning for LanguagE Toolkit) to read a corpus of over 3,000 text files from a dataset requested from HathiTrust. We have drawn conclusions about the themes circulating during the late eighteenth century, specifically regarding India. We completed a 150-topic model that we then, with the help of the programmers at Empire Windrush, visualized in an interactive network graph. The graph allowed us to understand the connections between multiple topics as well as examine the changes that take place in the connections between topics over time. We also engaged in the conversation surrounding the Digital Humanities online by creating and actively updating a blog called “Reading from a Distance.” The blog presents arguments about the Digital Humanities and addresses this project.

The Subjectivity of Naming Your Topics


by Mae Capozzi

One of the most challenging aspects of this project was deciding what to name our topics. Our topic model consisted of 150 topics, with 200 words in each. We had to go through each topic in an attempt to find a word or phrase that described it as accurately as possible, which was a serious task.

Oftentimes, scholars using MALLET will name a topic based on the first three words that appear in the topic. For example, we named Topic 1 “Trade and Public Revenue,” but we could have easily named it “trade, public, money,” which are the first three words in the topic. We chose to assign a name to the topic because we think it better represents the topic than only the first three words. I think it is a fairly controversial choice to decide to name your topics using words that don’t appear in the topic, but I also think it is the right choice.
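The “first three words” convention is easy to automate, which is part of its appeal. The sketch below assumes each line of the topic list has the layout shown in the “Noisy Data 2” example further down this post: a topic number, a weight, and then the topic’s words. The function name `default_label` is hypothetical, and the sample line reuses the weight from that example purely as a placeholder; only the three words come from our Topic 1.

```python
def default_label(topic_keys_line, n=3):
    """Build a fallback label from the first n words of one line of the topic list."""
    # Layout assumed: topic number, weight, then the topic's words.
    topic_id, _weight, *words = topic_keys_line.split()
    return int(topic_id), ", ".join(words[:n])


# Placeholder line: the weight is not Topic 1's real value.
line = "1 0.33333 trade public money"
print(default_label(line))  # (1, 'trade, public, money')
```

A label like this is only a starting point. A name such as “Trade and Public Revenue” still takes a human reader who has looked at the rest of the words in the topic.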

We went about this task by bouncing ideas off each other and going through a couple of drafts of our list. While Scott was sending data off to the programmers at Empire Windrush, a start-up company interested in helping digital humanists create visualizations of their research, I went through each of the topics and attempted to name them as best I could. I then sent my list over to Scott, and he added to and edited it. We went through this process a couple of times before we eventually agreed on all of the topic names.

When we were unable to name a topic, we assigned the name “Noisy Data X.” One of the critiques of topic modeling is that some of the topics are entirely senseless. For example, “Noisy Data 2” looks like this:

2 0.33333 arid ftate firft neceffary paffed ftill laft effed poffeffion 
riot fince charader whp ray juft poffeffed artd poffible mafter whilft 
ftand forae foon fubjeds haye thp pur neceffity ftood greateft paffage 
tirae faft perfed raore pther ffiort fent pne engliffi uppn leaft wiffi 
veffels raay ftrength notwithftanding years expeded affembly raoft view 
ftrong fb serm rae thpfe ffiew bufinefs mpre felf sec thfe effeds honor 
paft whofe paffing pnly affift mpft veffel year affiftance ofa fad 
affured yery intp hiftory preffed ftanding wpuld expreffed eftabliffied 
charafter fight ffiip city danger ofhis letters ffiips thd ara direded 
things rauch fbr dp theni conduft fubjefts farae ift bad hirafelf wiu 
views anecdotes impoffible raade pepple adions office raany whieh ftay 
ypu poffefs twp fenfe effential expreffion paffages tha farther dodrine 
cpuld affert adion fupport fix oyer vil fup pot end raan haying perfeft 
mafters call knowledge fpme affedion ads yale favor puniffiment 
affemblies eafy ffiat dear negled inftead abput ftriking inthe paffes 
charaders leffer requeft thaf raen pwn effeftaffent dodor fign enjoy 
jewiffi weu neceffaries iffue long fof beginning exad weak appendix prize wa chriftian ftead finally frorn afferted orie exprefs addreffed diredly increafed capable poffibly juftice diftinguiffied theif ftrangers 

While keeping in mind that the “long s” (ſ) used in eighteenth-century printing shows up as an “f” in the text MALLET reads unless you write a script to correct it (which we did not), this topic still does not make sense. Instead of trying to name this topic arbitrarily, we decided to make it clear that this topic was senseless and could be viewed as an outlier. While critics of topic modeling in the humanities may point to this as a clear flaw in MALLET, I argue that it is minor at most. Because we are evaluating so many texts from such a macro level, we have accepted that we will miss some detail. This is merely an offshoot of that same concept. Sure, we have ten topics that we characterized as “Noisy Data,” but we also have 140 clear topics.
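A correction script of the kind mentioned above, which we did not write for this project, might look something like the following sketch. It tries swapping each “f” in a word for an “s” and keeps the first spelling that appears in a list of known words. The function name `restore_long_s` and the tiny `known_words` set are assumptions for illustration; a real script would need a full dictionary and would still miss other scanning errors, such as “hirafelf” for “himself.”

```python
from itertools import product


def restore_long_s(token, known_words):
    """Try swapping each 'f' for 's'; return the first spelling found in known_words."""
    if token in known_words:
        return token
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    # Try every combination of keeping 'f' or using 's' at each position.
    for choice in product("fs", repeat=len(positions)):
        letters = list(token)
        for pos, ch in zip(positions, choice):
            letters[pos] = ch
        candidate = "".join(letters)
        if candidate in known_words:
            return candidate
    return token  # leave the word alone if nothing matched


# A stand-in for a real dictionary (hypothetical word list).
known_words = {"state", "first", "necessary", "possession", "fight", "himself"}

for word in ["ftate", "firft", "poffeffion", "fight", "hirafelf"]:
    print(word, "->", restore_long_s(word, known_words))
```

Checking candidates against a word list matters because a blanket replacement of every “f” would damage words that really do contain the letter, like “fight.”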

Digital Humanities 101


by Mae Capozzi

For those readers who are just dipping their toes into the vast digital humanities world, here is a list of books and blogs that got me started, and that will hopefully help you too.



1. Distant Reading by Franco Moretti

This book discusses Moretti’s thought process over the course of his career, and lays out his concept of distant reading, which is a basis for a lot of Digital Humanities work in Literary Studies.

2. Macroanalysis by Matthew Jockers

In Macroanalysis, Jockers discusses ways to put distant reading to work from a more technologically-minded perspective. He discusses different programs available to humanists interested in delving into the Digital Humanities.








“The United States is the country of close reading, so I don’t expect this idea to be particularly popular. But the trouble with close reading (in all of its incarnations, from the new criticism to deconstruction) is that it necessarily depends on an extremely small canon. This may have become an unconscious and invisible premise by now, but it is an iron one nonetheless: you invest so much in individual texts only if you think that very few of them really matter. Otherwise, it doesn’t make sense. And if you want to look beyond the canon…close reading will not do it. It’s not designed to do it, it’s designed to do the opposite. At bottom, it’s a theological exercise––very solemn treatment of very few texts taken very seriously––whereas what we really need is a little pact with the devil: we know how to read texts, now let’s learn how not to read them”(Moretti 48).

Hello internet. Welcome to our blog!

The goal of this blog is to create a record of our application of Moretti’s concept of Distant Reading, using topic modeling, to eighteenth- and nineteenth-century British texts. Reductively, distant reading is the opposite of close reading; the scholar examines many texts on a macro scale rather than one text on a micro scale. While some theorists have embraced Moretti’s idea, others are appalled by the suggestion that distant reading is somehow better than the more traditional method. We certainly do not believe distant reading should ever replace real human reading of texts. Rather, we assert that it can (and should) be used as a supplement to close reading.

Because of the newness of this field of inquiry, we are unsure of exactly how this project will look. Preliminarily, we are interested in looking for unexpected connections between genres and exploring different types of visualizations.  Ultimately, we feel as though it is important to share our research with other DH scholars, especially because the field is, as of yet, so unexamined.

More on all of this later…




Moretti, Franco. Distant Reading. London: Verso, 2013. Print.