In this project, we used MALLET (MAchine Learning for LanguagE Toolkit) to read a corpus of over 3,000 text files from a dataset requested from HathiTrust, and we have drawn conclusions about the themes circulating during the late eighteenth century, specifically regarding India. We completed a 150-topic model that we then, with the help of the programmers at Empire Windrush, visualized as an interactive network graph. The graph allowed us to understand the connections between topics and to examine how those connections change over time. We also engaged in the online conversation surrounding the Digital Humanities by creating and actively updating a blog, www.readingfromadistance.wordpress.com. The blog presents arguments about the Digital Humanities and addresses this project.
by Mae Capozzi
One of the most challenging aspects of this project was deciding what to name our topics. Our topic model consisted of 150 topics, with 200 words in each. We had to go through each topic in an attempt to find a word or phrase that described it as accurately as possible, which was a serious task.
Oftentimes, scholars using MALLET will name a topic after the first three words that appear in it. For example, we named Topic 1 “Trade and Public Revenue,” but we could just as easily have named it “trade, public, money,” which are the first three words in the topic. We chose to assign a name of our own because we think it represents the topic better than the first three words alone. I think naming your topics with words that don’t appear in the topic is a fairly controversial choice, but I also think it is the right one.
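To make the two naming conventions concrete, here is a minimal Python sketch. It assumes each topic arrives as one line of text with the topic number first, a weight second, and the topic's words after that, as in the “Noisy Data 2” example further down. The function names, the CUSTOM_NAMES table, and the sample line (including its weight and its truncation to three words) are illustrative assumptions, not part of MALLET or of our actual workflow.

```python
def parse_topic_line(line):
    """Split one topic line into (topic_id, weight, words).

    Assumes whitespace-separated fields: id, weight, then the topic words.
    """
    fields = line.split()
    topic_id = int(fields[0])
    weight = float(fields[1])
    words = fields[2:]
    return topic_id, weight, words


def default_label(words, n=3):
    """The common convention: label a topic with its first n words."""
    return ", ".join(words[:n])


# Hypothetical table of hand-assigned names; only Topic 1 is filled in here.
CUSTOM_NAMES = {1: "Trade and Public Revenue"}


def label_for(topic_id, words):
    """Prefer a hand-assigned name, falling back to the first three words."""
    return CUSTOM_NAMES.get(topic_id, default_label(words))


# Illustrative line only: the weight is a placeholder and the word list
# is cut to the three words quoted in the post.
sample_line = "1 0.5 trade public money"
topic_id, weight, words = parse_topic_line(sample_line)

print(default_label(words))         # the first-three-words label
print(label_for(topic_id, words))   # the hand-assigned label
```

Keeping the hand-assigned names in a separate table means the first-three-words label is always available as a fallback, so readers who distrust our names can still see what the model itself put first.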
We went about this task by bouncing ideas off each other and going through a couple of drafts of our list. While Scott was sending data off to the programmers at Empire Windrush, a start-up company interested in helping digital humanists create visualizations of their research, I went through each of the topics and named them as best I could. I then sent my list over to Scott, and he added to it and edited it. We went through this process a couple of times before we eventually agreed on all of the topic names.
When we were unable to name a topic, we assigned the name “Noisy Data X.” One of the critiques of topic modeling is that some of the topics are entirely senseless. For example, “Noisy Data 2” looks like this:
2 0.33333 arid ftate firft neceffary paffed ftill laft effed poffeffion riot fince charader whp ray juft poffeffed artd poffible mafter whilft ftand forae foon fubjeds haye thp pur neceffity ftood greateft paffage tirae faft perfed raore pther ffiort fent pne engliffi uppn leaft wiffi veffels raay ftrength notwithftanding years expeded affembly raoft view ftrong fb serm rae thpfe ffiew bufinefs mpre felf sec thfe effeds honor paft whofe paffing pnly affift mpft veffel year affiftance ofa fad affured yery intp hiftory preffed ftanding wpuld expreffed eftabliffied charafter fight ffiip city danger ofhis letters ffiips thd ara direded things rauch fbr dp theni conduft fubjefts farae ift bad hirafelf wiu views anecdotes impoffible raade pepple adions office raany whieh ftay ypu poffefs twp fenfe effential expreffion paffages tha farther dodrine cpuld affert adion fupport fix oyer vil fup pot end raan haying perfeft mafters call knowledge fpme affedion ads yale favor puniffiment affemblies eafy ffiat dear negled inftead abput ftriking inthe paffes charaders leffer requeft thaf raen pwn effeftaffent dodor fign enjoy jewiffi weu neceffaries iffue long fof beginning exad weak appendix prize wa chriftian ftead finally frorn afferted orie exprefs addreffed diredly increafed capable poffibly juftice diftinguiffied theif ftrangers
Even keeping in mind that the eighteenth-century “long s” (ſ) shows up as an “f” in the digitized text that MALLET reads unless you write a script to correct it (which we did not), this topic still does not make sense. Instead of trying to name this topic arbitrarily, we decided to make it clear that this topic was senseless and could be viewed as an outlier. While critics of topic modeling in the humanities may point to this as a clear flaw in MALLET, I argue that it is minor at most. Because we are evaluating so many texts from such a macro level, we have accepted that we will miss some detail. This is merely an offshoot of that same trade-off. Sure, we have ten topics that we characterized as “Noisy Data,” but we also have 140 clear topics.
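For readers curious about the kind of correction script mentioned above, which we did not write, here is one hypothetical approach sketched in Python. It tries replacing each “f” in a word with an “s” and keeps the first spelling it finds in a list of known words. The tiny VOCAB set and the function name restore_long_s are assumptions made for illustration; a real script would need a full dictionary and would run on the text files before they are handed to MALLET.

```python
from itertools import product

# Hypothetical stand-in for a real English word list.
VOCAB = {"state", "first", "necessary", "possession", "still", "fight", "office"}


def restore_long_s(token, vocab=VOCAB):
    """Return a corrected spelling if swapping some f's for s's yields a known word.

    Tokens that are already known words, or that contain no 'f', are
    returned unchanged, so genuine words such as 'fight' are left alone.
    """
    if token in vocab or "f" not in token:
        return token
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    # Try every combination of keeping 'f' or substituting 's' at each position.
    for choice in product("sf", repeat=len(positions)):
        chars = list(token)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        if candidate in vocab:
            return candidate
    # No known spelling found: leave the token as it was.
    return token


noisy = "ftate firft neceffary poffeffion fight whp".split()
print([restore_long_s(t) for t in noisy])
```

A dictionary check like this would not rescue every word in “Noisy Data 2.” Tokens such as “whp” or “raore” are garbled in other ways, so some noise would likely remain even after the correction.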