Thing 15: Text Mining
Are you a researcher working on text-based projects? Ever tried to make sense of all those social media posts, or analyse a long and complex literary text? Wrangling large volumes of text can be a challenge, so in this post Kim Doyle introduces text mining concepts and tools to make this task easier.
Text mining (also known as text data mining or text analytics) is a broad name for a number of processes and practices that gather and examine large collections of written resources to discover new information or answer a specific research question. Typically, this analysis begins with information retrieval, which involves the identification of relevant textual materials in a file, database, on the Web, or in some other digitised format. Automated analysis is performed by specialised computer software to structure text for analysis, derive patterns from the resulting data, and interpret the output. One of the most common and important methodologies for processing text is Natural Language Processing (NLP). It can be used to extract precise information, analyse meaning, classify text, find relevant entities and relationships in language, and more.
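The first step in most of the processes described above is structuring raw text into something countable. As a minimal sketch (using only Python's standard library, not any particular NLP package), here is the classic starting point: tokenising a document and tallying word frequencies. The example text and the simple regular-expression tokeniser are illustrative choices, not part of any specific tool.

```python
import re
from collections import Counter

def tokenise(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

document = (
    "Text mining gathers and examines large collections of text. "
    "Mining text at scale reveals patterns a close reading might miss."
)

tokens = tokenise(document)
frequencies = Counter(tokens)

# The five most common tokens and their counts
print(frequencies.most_common(5))
```

Real NLP pipelines go much further (handling punctuation, stemming, part-of-speech tagging and so on), but even this bare count is the raw material for many of the techniques discussed below.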
That Thing you do: integration into practice
Where to get data?
Text data can be gathered and processed from a wide variety of sources, including documents, books, digital archives, library catalogues, websites, and social media streams like Facebook and Twitter, or any combination of these. The format these sources are stored in will determine data acquisition techniques. In many cases you may already have your data, or be able to simply download text files from a repository. However, if you want to harvest data from the Web, you may need to connect to an Application Programming Interface (API) or perform Web scraping, especially if you are interested in social media data. This may require programming skills, depending on the scale and scope of the data collection.
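Whether data arrives via an API or Web scraping, one recurring task is stripping markup to recover the readable text. The following sketch uses Python's built-in `html.parser` module; the HTML snippet is hard-coded to keep the example self-contained, but in a real project it would come from `urllib.request`, the `requests` library, or an API client.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, skipping script/style blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

html = ("<html><body><h1>News</h1><p>Budget passes senate.</p>"
        "<script>var x = 1;</script></body></html>")
extractor = TextExtractor()
extractor.feed(html)
text = " ".join(extractor.parts)
print(text)  # News Budget passes senate.
```

For serious scraping you would usually reach for a dedicated library (e.g. Beautiful Soup), and always check a site's terms of service and robots.txt first.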
The size and structure of the data will determine the most appropriate textual analysis techniques. Small datasets may not be suited to computational analysis at all, and some statistical techniques, such as topic modelling, require large amounts of data to produce meaningful results. Beyond size, the methods of analysis will be shaped by your research interests. It is important to make sure your methods are appropriate not only for the size and structure of the data but also capable of answering your research questions.
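Topic modelling itself requires dedicated libraries (such as gensim or scikit-learn), but the idea underlying most statistical text techniques, turning documents into numbers, can be sketched in plain Python. The example below computes a simple TF-IDF weight, which scores a term highly when it is frequent in one document but rare across the corpus; the three toy documents are invented for illustration.

```python
import math
import re

corpus = [
    "the election campaign dominated the news",
    "the novel explores memory and language",
    "language change in the novel was analysed",
]

def tokenise(text):
    return re.findall(r"[a-z]+", text.lower())

docs = [tokenise(d) for d in corpus]

def tf_idf(term, doc, docs):
    """Term frequency in one document, discounted for terms common across documents."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    idf = math.log(len(docs) / df)
    return tf * idf

# "the" appears in every document, so its weight collapses to zero;
# "election" is distinctive to the first document, so it scores higher.
print(tf_idf("the", docs[0], docs))       # 0.0
print(tf_idf("election", docs[0], docs))
```

This also makes concrete why small datasets are risky: with only a handful of documents, these weights are noisy and any patterns they suggest may be artefacts of the sample.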
Some examples of computational techniques include:
- A linguistic analysis of male and female pronouns in a large corpus of Australian news articles to examine gender balance in news language
- Using lexical and syntactic changes to detect dementia through analysis of novelists' works
- A network analysis of Milton’s Paradise Lost
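To give a flavour of the first example above, counting gendered pronouns is something a few lines of Python can do. This is a toy version only: the pronoun sets, the sample sentence, and the naive tokenisation are illustrative, and a real linguistic study would use a large corpus and handle ambiguities (for instance, "her" serving as both possessive and object pronoun).

```python
import re
from collections import Counter

MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}

def pronoun_balance(text):
    """Return (male_count, female_count) of gendered pronouns in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    male = sum(counts[p] for p in MALE)
    female = sum(counts[p] for p in FEMALE)
    return male, female

article = ("She said the minister had defended his budget. "
           "He later told reporters she would respond in parliament.")
print(pronoun_balance(article))  # (2, 2)
```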
Tools and techniques
Working with data-driven techniques, software packages, and programming tools can involve a steep learning curve. Below are a few tools to get you started. The Web-based tool Voyant is a good introduction to text mining and can address many research questions. At the other end of the scale, programming tools such as Python and R are highly customisable, but can be intimidating at first glance; the table below lists introductory resources for R and Python aimed specifically at text mining. Orange Text Mining uses a graphical user interface to create workflows for analysing text, and you will find a range of introductory videos on its website.
| | Voyant Tools | Orange Text Mining | Text Mining with R | TextBlob: Python Library |
|---|---|---|---|---|
| Licence | Closed source | GPL (GNU General Public License) | Open source | Open source |
| Tool type | Web application | Desktop application | Programming language | Programming language |
| Import formats | TXT, CSV, HTML, XML, PDF, RTF, URL | TXT, CSV, HTML, XML, URL | Most formats | Most formats |
| Export formats | TXT, CSV, XML | CSV, TAB | Most formats | Most formats |
The data, tools and techniques you use should be documented in your Data Management Plan. This will help you conceive your data project, keep good documentation, and maintain your data for future research in accordance with the University’s data retention policy. As always, when using these analytical tools to analyse your data (especially those tools only available online), you must carefully consider the potential privacy risks, and what measures will be needed to mitigate those risks (e.g. making personal or sensitive data anonymous). For privacy, ethics and security issues, it is strongly recommended to contact experts from the University’s Office of Research Ethics and Integrity prior to using any of these online tools. In some cases ethics approval should be sought before data collection (Facebook data, for example) and this should be factored into project timelines.
The depth of this topic means only a small fraction of available text mining tools and processes are covered here. There are certainly other options that might be worth considering depending on your specific requirements and research questions. If you are keen to delve deeper and learn some coding, Natural Language Processing with Python is a good introduction to concepts in NLP and writing programs, regardless of previous programming experience.
Where to get help
If text mining is something you’re considering in your research, there are a couple of places within the University that can help:
- Digital Studio. The Digital Studio provides a range of services and infrastructure to support University researchers, professionals, and selected industry experts and students working on digital projects in the humanities, arts, and social sciences (HASS).
- Social and Cultural Informatics Platform (SCIP). The SCIP team is part of the Faculty of Arts Digital Studio and works closely with the Melbourne Graduate School of Education (MGSE) and its research support staff. SCIP partners with the Digital Studio and MGSE to support digital research practice, raise awareness, and deliver training and workshops.
- Research Computing Services (RCS). RCS offers specialised computing services to researchers, including infrastructure such as the Melbourne Research Cloud and free digital skills training. Upcoming training in Python, R and many other tools can be found on their Eventbrite page.
About the author
Kim Doyle is a Research Data Specialist at the Melbourne Data Analytics Platform (MDAP) and a PhD candidate in Media and Communications at the University of Melbourne. Previously, she taught natural language processing and data mining to researchers at the University of Melbourne's Research Computing Services for a number of years. Her research interests include political communication, social media, and computational social science.
Want more from 23 Research Things? Sign up to our mailing list to never miss a post.