Week 4: What Are Data Mining and Data Curation?

The humanities work with many types of data, but what counts as data in the humanities? Luciano Floridi defines data, at its most basic level, as the absence of uniformity. Data can be represented in many formats and on many different supports. One example is digital data, which is discrete and usually represented in binary notation using two symbols, 0 and 1. Digital data can be organized in data structures that are linear, such as arrays and matrices; hierarchical, such as an XML file, a tree-like structure in which items have parent-child or sibling relations with each other; or multi-relational, such as a graph database, where each data item is a node interconnected in a network of nodes. Data can also be structured, semi-structured, or unstructured. However, there are two core types of data in the humanities: big data and smart data. Big data is relatively unstructured, messy, and implicit, relatively large in volume, and varied in form, while smart data is semi-structured or structured, clean, and explicit, relatively small in volume, and of limited heterogeneity.
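The three kinds of data structure described above can be illustrated with a minimal Python sketch. The book titles, years, and relations are hypothetical sample data chosen only for illustration, and a plain dictionary stands in for a graph database, which in practice would be a dedicated system.

```python
import xml.etree.ElementTree as ET

# Linear: an ordered sequence of values (hypothetical publication years).
years = [1847, 1851]

# Hierarchical: an XML tree in which <book> elements are siblings
# and <title>/<year> are children of each <book>.
xml_text = (
    "<library>"
    "<book><title>Jane Eyre</title><year>1847</year></book>"
    "<book><title>Moby-Dick</title><year>1851</year></book>"
    "</library>"
)
root = ET.fromstring(xml_text)
titles = [book.find("title").text for book in root.findall("book")]

# Multi-relational: each item is a node linked to other nodes.
# A dictionary of neighbor lists is a simple stand-in for a graph database.
graph = {
    "Herman Melville": ["Moby-Dick"],
    "Moby-Dick": ["Herman Melville", "New York"],
    "New York": ["Moby-Dick"],
}
neighbors_of_novel = graph["Moby-Dick"]

print(years)
print(titles)
print(neighbors_of_novel)
```

The list can only be read in order, the XML tree is navigated from parent to child, and the graph can be entered at any node and followed in any direction, which is what makes it suited to networks of people, works, and places.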

With an understanding of data in the humanities in place, we can turn to data mining. Data mining, also known as knowledge discovery in data (KDD), is the process of uncovering patterns and other valuable information in large data sets. With the evolution of data warehousing technology and the growth of big data, the adoption of data mining has accelerated over the years, and it has helped organizations transform raw data into useful knowledge. Data mining supports two broad kinds of analysis: it can describe the target dataset, or it can predict outcomes through the use of machine learning algorithms. In the humanities, data mining exemplifies the field's interdisciplinary efforts. It provides answers while prompting further questions from new discoveries.

“As an enterprise, ‘digital humanities’ (formerly ‘humanities computing’) dates back to the late 1940s (debatably, even earlier) and, since at least the 1980s, the curation of digital humanities research data has been an associated area of research, activity, and concern” (Trevor Munoz, 2013). Munoz suggests that data curation can accentuate and augment publishing in a way that serves the needs of the digital humanities community. He argues that a connection between publishing and data curation matters in the context of strategic decisions, because it draws directly on the unique skills of librarians and aligns with library missions and values in ways that other kinds of publishing endeavors may not. To follow this argument, one has to know what data curation refers to and how it is relevant in the humanities. According to Munoz (2013), data curation “is information work that integrates closely with the disciplinary practices and needs of researchers in order to ‘maintain digital information that is produced in the course of research in a manner that preserves its meaning and usefulness as a potential input for further research’ (Munoz and Renear 2011).” Data curation is relevant in the humanities because it helps humanists publish their work efficiently and effectively, and it makes data easier to discover and analyze.

All in all, data mining and data curation are both important because they help humanists improve their writing and strengthen their analytical and discovery skills. In addition, the curation and mining of data fine-tunes the articles humanists produce and helps create distinctive, important publications.