Week 6: Topic Modeling
An introduction to Topic Modeling
In her article in the Journal of Digital Humanities (Vol. 2, No. 1, Winter 2012), Megan R. Brett introduces the concept of topic modeling by explaining the tools it relies on and by drawing on other humanists’ posts on the subject.
Though the intended audience is historians, general readers can also understand and use topic modeling. First, one must understand what a topic is before moving on to topic modeling. According to Megan R. Brett (2012), “One definition offered on Twitter during a conference on topic modeling described a topic as ‘a recurring pattern of co-occurring words.’” Now, what exactly is topic modeling? Topic modeling is a form of text mining; in other words, it is a way of identifying patterns in a corpus. A corpus is a large collection of texts that can be run through a tool which groups the words in the corpus into topics. Other humanists define topic modeling differently. Miriam Posner, for example, describes it as “a method for finding and tracing clusters of words (called ‘topics’ in shorthand) in large bodies of texts” (quoted in Brett 2012). The tools used in topic modeling look through a corpus for clusters of words and group them together by similarity. When topic modeling is done well, the words in each topic make sense together.
The way topic modeling works is fairly simple. Imagine reading an article with a set of highlighters. As one moves through the text, one highlights the key words of each theme in the paper, using a different color for each theme. Afterwards, the highlighted words are extracted and grouped together by the color they were given during highlighting.
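The highlighter analogy can be sketched in a few lines of Python. This is only an illustration of the grouping step, not how a real topic modeling tool works: the colors here are assigned by hand in a made-up `highlights` dictionary, whereas an actual tool infers the groupings from the corpus. The sample sentence, the words and the colors are all invented for the example.

```python
from collections import defaultdict

# Hypothetical hand-made "highlighting": each key word is given a color.
# A real topic modeling tool would infer these groupings itself.
highlights = {
    "ship": "yellow", "harbor": "yellow", "cargo": "yellow",
    "doctor": "pink", "fever": "pink",
}

def group_by_color(words, highlights):
    """Collect the highlighted words under the color they were marked with."""
    groups = defaultdict(list)
    for word in words:
        color = highlights.get(word)  # None if the word was never highlighted
        if color is not None and word not in groups[color]:
            groups[color].append(word)
    return dict(groups)

text = "the ship left the harbor with cargo while the doctor treated a fever"
topics = group_by_color(text.split(), highlights)
print(topics)
```

Each color ends up holding the words of one theme, which is roughly what a "topic" looks like in the output of a topic modeling tool.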
In order to topic model, a humanist needs:
1. A large corpus: topic modeling is best suited to a large collection of texts. Depending on the tool used, the corpus may have to be prepared first. In other words, the text may have to be tokenized, which means turning human-readable sentences into a string of words by stripping punctuation and capitalization.
2. Familiarity with the corpus: before topic modeling, humanists should be familiar with the corpus. This means that they should know what the corpus contains or what it is about.
3. A tool to perform topic modeling: humanists who want to perform topic modeling should use a tool such as MALLET, which implements the LDA (latent Dirichlet allocation) algorithm. Such tools make modeling the corpus easier, either by tokenizing the text for the user or by creating visualizations.
4. A way to understand the results of topic modeling: the raw results of topic modeling may not be readable to humans, so it is important that humanists understand what the program is telling them, for example through visualizations. This matters because topic modeling tools are fallible, and when the algorithm fails it can return some bizarre results.
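The tokenizing step mentioned in the first requirement can be sketched as follows. This is a minimal illustration using only Python's standard library, not the procedure any particular tool such as MALLET follows; the function name `tokenize` is made up for the example, and it only strips ASCII punctuation (curly quotes, for instance, would be left in place).

```python
import string

def tokenize(text):
    """Lowercase the text, strip ASCII punctuation, and split it into words."""
    # Map every ASCII punctuation character to a space, so that words joined
    # by punctuation ("text-mining") are separated rather than fused.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()

tokens = tokenize("Topic modeling is a form of text mining!")
print(tokens)
# ['topic', 'modeling', 'is', 'a', 'form', 'of', 'text', 'mining']
```

Real tools often go further, for example by removing very common words such as "the" and "of", but the idea of reducing sentences to a plain string of words is the same.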
In conclusion, topic modeling is an excellent tool for discovery, though its results are not well suited to serve as evidence. It can be fun and useful, even though it may be complicated and messy. The best way to understand topic modeling is to try it, for practice makes perfect.