Thing 8: Text Mining
In this post we look at text mining and provide an introduction to some of the concepts and tools to get you started.
Text mining (also known as text data mining or text analytics) is a broad name for a number of processes and practices that examine large collections of written resources to generate new information. Typically this is done using specialised computer software, which is able to extract precise information based on much more than just keywords. The software can search for entities or concepts, relationships, phrases and sentences.
These software tools often use algorithms based on Natural Language Processing (NLP) to enable a computer to “read” and analyse text-based information. NLP interprets the meaning of the text and identifies, extracts, synthesises and analyses relevant facts and relationships to answer a query.
What others have tried
Text mining requires two things: access to a source of textual data and the computational skills to do something with it. Text data can be gathered and processed from a wide variety of sources, including documents, books, digital archives, library catalogues, websites, and social media streams like Facebook, Twitter and Instagram. Humanities scholars can work with anything from a single source of data through to a large corpus of materials around a particular topic or from a specific collection.
One such example would be to conduct a sentiment analysis of political speeches to determine the frequency of positive or negative words or terms. To go even further, the same speeches could elicit a public response in the form of millions of tweets with people expressing a myriad of views, opinions and degrees of support for particular political issues or politicians themselves. Programming tools such as Python and R allow you to collect, prepare and perform analysis of this type of unstructured data. Social Network Analysis (SNA) is another technique used by many Humanities scholars to reveal hidden and complex patterns and structures in textual sources. Applications of this range from visualising the connections in digitised collections of early manuscripts, through to the online activity of people using social media.
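As a rough illustration of the sentiment analysis example above, the sketch below counts how many words in a passage appear in a list of positive or negative terms. It is a minimal sketch, not a recommended method: the two word lists, the sample sentence and the function names are all made up for illustration, and a real project would normally rely on a published sentiment lexicon or an established library.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists (an assumption for this sketch).
# A real study would use a published sentiment lexicon instead.
POSITIVE = {"good", "great", "hope", "strong", "fair", "better"}
NEGATIVE = {"bad", "crisis", "fear", "weak", "unfair", "worse"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_counts(text):
    """Count how many tokens appear in each of the two word lists."""
    counts = Counter()
    for token in tokenize(text):
        if token in POSITIVE:
            counts["positive"] += 1
        elif token in NEGATIVE:
            counts["negative"] += 1
    return counts


# Made-up sample sentence standing in for a political speech.
speech = ("We face a crisis, but we have hope. "
          "A strong, fair economy means a better future.")
counts = sentiment_counts(speech)
print("positive:", counts["positive"], "negative:", counts["negative"])
```

Counting words this way ignores context (for example, "not good" still counts as positive), which is one reason dedicated NLP tools are usually preferred for anything beyond a first look.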
As always when using these analytical tools (especially those only available online) to analyse your data, you must carefully consider the potential privacy risks and what measures will be needed to mitigate those risks (e.g. making personal or sensitive data anonymous). For privacy, ethics and security issues, it is strongly recommended to contact experts from the University’s Research Ethics and Integrity prior to using any of these online tools.
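One small, practical step towards the mitigation mentioned above is to mask obvious identifiers before pasting text into an online tool. The sketch below replaces email addresses and social media handles with placeholders. The patterns, placeholder labels and sample post are assumptions made for illustration; masking direct identifiers like this does not by itself make data anonymous, so it is no substitute for advice on ethics and privacy.

```python
import re

# Simplified patterns (assumptions for this sketch); they will not catch
# every possible email address or handle format.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
HANDLE = re.compile(r"(?<!\w)@\w+")


def redact(text):
    """Replace email addresses, then @handles, with neutral placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return HANDLE.sub("[USER]", text)


# Made-up post with a fictional handle and address.
post = "Thanks @jane_doe! Email me at sam@example.org about the speech."
print(redact(post))
```

Emails are masked first so that the handle pattern never sees the "@" inside an address. Names, places and other indirect identifiers in the text are left untouched by this approach.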
Working with data-driven techniques, software packages, and visualisations can involve a steep learning curve. An easy way to test the water is to begin with some of the public tools available for text mining.
- Go to the Museum of Australian Democracy online service to analyse word frequency within Australian election speeches.
- Go to Wordle and create a word cloud using a recent speech or editorial opinion piece. Now use the same text in a similar tool – Textalyser. Compare the results produced by each algorithm. How does the word frequency compare? What are the top word phrases used?
- For those keen to develop programming skills, learn about other tools and techniques used in natural language processing using Python.
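For those who would like to see what sits behind a word cloud, the activities above can be approximated in a few lines of Python using only the standard library. This is a minimal sketch: the stop-word list, the sample text and the function names are made up for illustration, and it does not reproduce how Wordle or Textalyser actually work.

```python
import re
from collections import Counter

# Hypothetical, very short stop-word list (an assumption for this sketch).
# Real tools ship much longer lists.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "we", "for", "our"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Count every word that is not in the stop-word list."""
    return Counter(t for t in tokenize(text) if t not in STOP_WORDS)


def phrase_frequencies(text, n=2):
    """Count every run of n consecutive words.

    Stop words are kept, and phrases may span sentence boundaries,
    because punctuation is discarded during tokenizing.
    """
    tokens = tokenize(text)
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Made-up sample text standing in for a speech or opinion piece.
sample = ("Climate change is the challenge of our time. "
          "We will act on climate change, and we will act now.")

print(word_frequencies(sample).most_common(3))
print(phrase_frequencies(sample).most_common(3))
```

Changing the stop-word list or the value of `n` changes the results, which helps explain why two tools can report different "top" words and phrases for the same text.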
As we mentioned earlier, the depth of this topic means that we’re only able to cover a small fraction of available text mining tools and processes. There are certainly other options that might be worth considering depending on your specific requirements and research questions. Here are some popular text mining tools to help you go further:
|Closed source||Closed source||GPL / GNU General Public License|
|Web application||Web application||Desktop application|
|TXT, CSV, HTML, XML, PDF, RTF, URL||TXT, CSV, HTML, XML, PDF, RTF, URL||TXT, CSV, HTML, XML, URL|
|TXT, CSV, XML||TXT, CSV, HTML, XML||CSV, TAB|
If text mining is something you’re considering engaging with, there are a couple of places within the University that can help.
The Digital Studio provides a range of services and infrastructure to support University researchers, professionals and selected industry experts and students working on digital projects in the humanities, arts and social sciences (HASS).
Social and Cultural Informatics Platform (SCIP)
The SCIP team work as part of the Faculty of Arts Digital Studio and work closely with the Melbourne Graduate School of Education (MGSE) and their research support staff. SCIP partners with the Digital Studio and MGSE on research support, awareness raising, digital research practice, and training and workshops.
This post was written by Sarah Petchell (Collection Development Team Leader, Research and Collections) and Greg D’Arcy (Informatics Specialist, Social and Cultural Informatics Platform (SCIP)).