In this project, we used MALLET (MAchine Learning for LanguagE Toolkit) to read a corpus of over 3,000 text files from a dataset requested from HathiTrust. We have drawn conclusions about the themes circulating during the late eighteenth century, specifically regarding India. We completed a 150-topic model that we then, with the help of the programmers at Empire Windrush, visualized in an interactive network graph. The graph allowed us to understand the connections between multiple topics and to examine how those connections change over time. We also engaged in the online conversation surrounding the Digital Humanities by creating and actively updating a blog, www.readingfromadistance.wordpress.com. The blog presents arguments about the Digital Humanities and addresses this project.

The Subjectivity of Naming Your Topics


by Mae Capozzi

One of the most challenging aspects of this project was deciding what to name our topics. Our topic model consisted of 150 topics, with 200 words in each. We had to go through each topic in an attempt to find a word or phrase that described it as accurately as possible, which was a serious task.

Oftentimes, scholars using MALLET will name a topic based on the first three words that appear in the topic. For example, we named Topic 1 “Trade and Public Revenue,” but we could have easily named it “trade, public, money,” which are the first three words in the topic. We chose to assign a name to the topic because we think it represents the topic better than the first three words alone. Naming your topics with words that don’t appear in the topic is a fairly controversial choice, but I think it is the right one.
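The two naming conventions can be put side by side in a short Python sketch. It assumes each line of a topic-keys listing holds a topic number, a weight, and then the topic's words, all separated by whitespace. The function names are hypothetical, and the weight in the sample line is made up for illustration; only the three words and the name “Trade and Public Revenue” come from our model.

```python
def parse_topic_key_line(line):
    """Split one topic-keys line into (topic id, weight, list of words)."""
    parts = line.split()
    return int(parts[0]), float(parts[1]), parts[2:]


def default_label(words, n=3):
    """The common convention: join the first n words of the topic."""
    return ", ".join(words[:n])


def label_for(topic_id, words, manual_names):
    """Prefer a hand-assigned name; fall back to the first three words."""
    return manual_names.get(topic_id, default_label(words))


# Hand-assigned names, keyed by topic number (only Topic 1 is shown here).
manual_names = {1: "Trade and Public Revenue"}

# Hypothetical sample line: the weight 0.5 is invented for this example.
topic_id, weight, words = parse_topic_key_line("1 0.5 trade public money")

print(default_label(words))
print(label_for(topic_id, words, manual_names))
```

Keeping the hand-assigned names in their own dictionary means the first-three-words label is always available as a fallback, so a reader can compare the two.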

We went about this task by bouncing ideas off of each other and going through a couple of drafts of our list. While Scott was sending data off to the programmers at Empire Windrush, a start-up company interested in helping digital humanists create visualizations of their research, I went through each of the topics and attempted to name them as best I could. I then sent my list over to Scott, and he added to and edited it. We went through this process a couple of times before we eventually agreed on all of the topic names.

When we were unable to name a topic, we assigned the name “Noisy Data X.” One of the critiques of topic modeling is that some of the topics are entirely senseless. For example, “Noisy Data 2” looks like this:

2 0.33333 arid ftate firft neceffary paffed ftill laft effed poffeffion 
riot fince charader whp ray juft poffeffed artd poffible mafter whilft 
ftand forae foon fubjeds haye thp pur neceffity ftood greateft paffage 
tirae faft perfed raore pther ffiort fent pne engliffi uppn leaft wiffi 
veffels raay ftrength notwithftanding years expeded affembly raoft view 
ftrong fb serm rae thpfe ffiew bufinefs mpre felf sec thfe effeds honor 
paft whofe paffing pnly affift mpft veffel year affiftance ofa fad 
affured yery intp hiftory preffed ftanding wpuld expreffed eftabliffied 
charafter fight ffiip city danger ofhis letters ffiips thd ara direded 
things rauch fbr dp theni conduft fubjefts farae ift bad hirafelf wiu 
views anecdotes impoffible raade pepple adions office raany whieh ftay 
ypu poffefs twp fenfe effential expreffion paffages tha farther dodrine 
cpuld affert adion fupport fix oyer vil fup pot end raan haying perfeft 
mafters call knowledge fpme affedion ads yale favor puniffiment 
affemblies eafy ffiat dear negled inftead abput ftriking inthe paffes 
charaders leffer requeft thaf raen pwn effeftaffent dodor fign enjoy 
jewiffi weu neceffaries iffue long fof beginning exad weak appendix
prize wa chriftian ftead finally frorn afferted orie exprefs addreffed
diredly increafed capable poffibly juftice diftinguiffied theif ftrangers

Keep in mind that the long s (“ſ”) of eighteenth-century printing shows up as an “f” in the text MALLET reads unless you write a script to correct it (which we did not). Even so, this topic does not make sense. Instead of trying to name this topic arbitrarily, we decided to make it clear that this topic was senseless and could be viewed as an outlier. While critics of topic modeling in the humanities may point to this as a clear flaw in MALLET, I argue that it is minor at most. Because we are evaluating so many texts from such a macro level, we have accepted that we will miss some detail. This is merely an offshoot of that same concept. Sure, we have ten topics that we characterized as “Noisy Data,” but we also have 140 clear topics.
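For readers curious what such a script might look like, here is a minimal Python sketch. We did not run anything like it on our corpus. It assumes you have a set of known modern spellings to check against (the tiny `vocabulary` below is a stand-in), and the function name `restore_long_s` is hypothetical. The idea is to try swapping each “f” in an unrecognized word for an “s” and keep the first spelling the word list recognizes.

```python
from itertools import product


def restore_long_s(token, vocabulary, max_f=8):
    """Return a known spelling of token with some 'f's read as 's', if one exists."""
    # Leave words alone if they are already known or contain no 'f'.
    if token in vocabulary or "f" not in token:
        return token
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    # Skip words with many 'f's so the number of combinations stays small.
    if len(positions) > max_f:
        return token
    # Try every mix of 'f' and 's' at those positions; keep the first known word.
    for choice in product("fs", repeat=len(positions)):
        chars = list(token)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        if candidate in vocabulary:
            return candidate
    # Nothing matched (for example, other OCR errors are present): keep the original.
    return token


# Stand-in word list; a real run would need a much larger dictionary.
vocabulary = {"state", "first", "necessary", "possession", "fight"}

for word in ["ftate", "firft", "neceffary", "poffeffion", "fight", "charader"]:
    print(word, "->", restore_long_s(word, vocabulary))
```

A sketch like this only addresses the long s. Words damaged in other ways, such as “charader” for “character,” pass through unchanged, so it would reduce the noise in a topic like this one without removing it.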

Digital Humanities 101


by Mae Capozzi

Why Using a Machine to Study Literature Isn’t as Heretical as it Sounds


by Mae Capozzi

Of the backlash against the Digital Humanities (of which there is plenty, I assure you), the most interesting to me is the fear that if we begin to use computers to read, we will irrevocably lose the humanistic aspect of reading. For many, reading is a sensory experience. Readers want to touch and smell the book; they want to feel something. Many scholars just don’t want to fully quantify what they view as a wonderfully qualitative experience. It is one thing to theorize based on a close reading of hundreds of texts over the course of a lifetime; it is another thing entirely to analyze thousands of texts in just a few hours using a computer program like MALLET.

Franco Moretti, in “The Slaughterhouse of Literature,” explains that “if we set today’s canon of nineteenth-century British novels at two hundred titles (which is a very high figure), they would still be only about 0.5 per cent of all published novels.” So here we have a task that is only solvable with the help of a computer: it would take thousands of lifetimes to read the other 99.5% of nineteenth-century British novels. How about all of the other novels written in the nineteenth century in the Western world, or the non-Western world, or in other centuries? I could go on and on. The sheer massiveness of this project makes it seem wrong not to take advantage of this technology (or to at least give it a whirl).


While I locate myself firmly within the contra-canon camp, there is always the pro-canon argument that the canon consists of the best books ever written and anything outside of the canon is not worth serious literary study. The pro-canon group can certainly ask why, if only the canon is worthwhile, anyone should take the time to read the other 99.5% of texts. While scholars may never agree on this point, I believe that even supporters of the traditional canon can get behind DH, because by allowing us to read outside of the canon, it can give us a deeper understanding of why canonical texts are not “sent to the slaughterhouse” with the other 99.5%.

Re:Humanities Conference at Haverford College


by Mae Capozzi

Play. Power. Production.


In April 2014, I brought a very preliminary version of this research to the Re:Humanities Conference at Haverford College. Because it is the only undergraduate digital humanities conference in the U.S., I was excited to participate. The experience exceeded my expectations. Every student came with a thoroughly researched project, on topics ranging from video game design to archive building.

The two-day conference began with a keynote speech entitled “The Political Power of Play” by Adeline Koh. Koh is the Director of DH@Stockton, Assistant Professor of Literature at Richard Stockton College, and cofounder of #DHPoco.

The students involved in the conference were split into two groups, and each group presented on a different day. The first group’s presentations included “Narrative and Gameplay: Adapting Physical Narratives to a Digital Medium” by Hannah Weissmann from Haverford College, “An Algorithm for Serendipity: An Inquiry of Online Dating for Emotional Beings in a Digital Age” by Shireen Saxena from Bryn Mawr College, and “Imagining the Straight(?) Gate: Messiah, Utopia, and Queer Internet Parody” by Benjamin Bernard-Herman and Dylan Hillerbrand from Swarthmore College.

On the second day, we saw mostly thesis presentations from seniors. These projects included “Poetry as a Complex System” by Bronwen Hudson from the University of Vermont, “The Lightning of Possible Storms: Critical Theory, Interactive Fiction, and the Pedagogy of Narrative Ludology in Bioshock Infinite” by Marissa Koors from Emerson College, and “Trauma, Enslavement, Embodiment, and Freedom: A New Media Approach to Narratives of Enslavement” by Elizabeth Alexander from Amherst College.

The conference concluded with a keynote speech by Mary Flanagan called “Humanist Design.” Flanagan is the author of Critical Play and is the Sherman Fairchild Distinguished Professor in Digital Humanities at Dartmouth College.

I found it particularly exciting that the conference was at an undergraduate level, because it seems as though DH is the next step for disciplines like literary studies and social history. I would be surprised if 90% of the students involved in this conference did not continue on to graduate school to become the next wave of humanists. Thus, I think it is necessary for undergraduates to hone their skills and have the opportunity to share their research with a larger audience than just their home institutions.

Furthermore, one of the beautiful aspects of DH is its “shareability.” Although each scholar recognized that their research could just as easily have been shared online, it was exciting to attend a traditional conference. This type of multifaceted symposium allowed students to share new, digital projects while practicing real-time presentation and discussion skills.

Another interesting aspect of this conference was that we were encouraged to live tweet during presentations (#rehum14). It was a strange feeling to have my iPhone in hand during presentations at first, but I soon warmed up to the new sensation. In fact, it brought on a new sense of engagement with the presenter, as I could summarize or question an argument moments after it was mentioned. This aspect of the conference, once again, added to the “shareability” of DH as a discipline. I look forward to Re:Humanities ’15 in the spring.




“The United States is the country of close reading, so I don’t expect this idea to be particularly popular. But the trouble with close reading (in all of its incarnations, from the new criticism to deconstruction) is that it necessarily depends on an extremely small canon. This may have become an unconscious and invisible premise by now, but it is an iron one nonetheless: you invest so much in individual texts only if you think that very few of them really matter. Otherwise, it doesn’t make sense. And if you want to look beyond the canon…close reading will not do it. It’s not designed to do it, it’s designed to do the opposite. At bottom, it’s a theological exercise––very solemn treatment of very few texts taken very seriously––whereas what we really need is a little pact with the devil: we know how to read texts, now let’s learn how not to read them” (Moretti 48).

Hello internet. Welcome to our blog!

The goal of this blog is to create a record of our application of Moretti’s concept of distant reading, by way of topic modeling, to eighteenth- and nineteenth-century British texts. Reductively, distant reading is the opposite of close reading; the scholar examines many texts on a macro scale rather than one text on a micro scale. While some theorists have embraced Moretti’s idea, others are appalled by the suggestion that distant reading is somehow better than the more traditional method. We certainly do not believe distant reading should ever replace real human reading of texts. Rather, we assert that it can (and should) be used as a supplement to close reading.

Because of the newness of this field of inquiry, we are unsure of exactly how this project will look. Preliminarily, we are interested in looking for unexpected connections between genres and exploring different types of visualizations. Ultimately, we feel it is important to share our research with other DH scholars, especially because the field is, as yet, so unexamined.

More on all of this later…


Moretti, Franco. Distant Reading. London: Verso, 2013. Print.