Thing 18: Text mining tools

Cirrus word-cloud of this post using Voyant Tools

This week, we look at text mining and three great tools to get you started. Thing 18 was written by Andy Tseng (Data Infrastructure Architect Research Services).

 

Getting started

Text mining, also known as text data mining or text analytics, is the process of extracting specific, high-quality information from a (usually large) collection of texts using statistical and/or machine-learning algorithms.

Text mining tools enable us to extract core facts and trends from a large body of text, and to derive patterns and structures from those facts that help us make inferences and predictions.
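To make the idea concrete, here is a minimal Python sketch of the simplest kind of text mining: counting the most frequent terms in a text, which is roughly what a word cloud visualises. The function names (`tokenize`, `top_terms`) and the tiny stop-word list are assumptions made for illustration only; they are not taken from any of the tools discussed in this post.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (an assumption; real tools ship much longer lists).
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "is", "for", "from"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def top_terms(text, n=5):
    """Return the n most frequent terms (excluding stop words) with their counts."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(tokens).most_common(n)

sample = (
    "Text mining tools extract facts from text. "
    "Mining a large collection of text reveals trends in the text."
)
top = top_terms(sample, n=2)
print(top)  # [('text', 4), ('mining', 2)]
```

Real tools add much more on top of this (stemming, larger stop-word lists, statistical weighting), but the counting step is the common starting point.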

This is a big topic with a large number of tools available, so we’ll look at three that are easy to learn and should help you get started with basic text analysis.

 

Voyant Tools

Voyant Tools (formerly known as Voyeur) is a user-friendly, web-based reading and analysis environment for digital texts. Voyant Tools lets you work with your own text collections in a variety of formats (e.g., plain text, HTML, XML, PDF, RTF, and MS Word). It also allows you to work directly with existing text collections on the Internet just by typing in the website’s URL.

 

TAPoRware

TAPoRware is a similar suite of online tools that allows you to perform text analysis on HTML, XML and plain text files. It can also analyse websites via their URLs.

 

Orange Text Mining

Orange Text Mining is an add-on for the Orange data mining software package that extends Orange with tools for analysing texts. Orange is an open-source data analysis and visualisation tool for both novices and experts, with support for Python scripting. Several add-ons are available for specialised purposes such as bioinformatics or text mining.

 

Considerations

As always when using analytical tools (especially those only available online) to analyse your data, you must carefully consider the potential privacy risks and what measures (e.g. anonymisation of personal or sensitive data) will be needed to mitigate them. For privacy, ethics and security issues, it is strongly recommended that you contact experts from the University’s Office for Research Ethics and Integrity prior to using any of these online tools to analyse your research data.
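As a rough illustration of what a pre-upload anonymisation step might look like, the Python sketch below masks email addresses and phone-like numbers in a text before it is sent to an online tool. The patterns, the placeholder labels and the `redact` function are assumptions made for this example; pattern matching of this kind will miss names and many other identifiers, so it is a starting point and not a substitute for expert advice.

```python
import re

# Simple illustrative patterns (assumptions); they will not catch every format.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def redact(text):
    """Replace email addresses and phone-like numbers with placeholder labels."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

raw = "Contact Jane at jane.doe@example.org or +61 3 9000 0000 after the interview."
clean = redact(raw)
print(clean)
```

Note that the name "Jane" is left untouched in this example, which shows why automated masking alone is rarely enough for sensitive research data.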

 

Reflection and integration into practice

Voyant Tools

Voyant Tools is probably the most powerful web-based tool for generic text analysis. It particularly excels when you’re dealing with large bodies of text, and it also allows you to develop your own scripts to extend its functionality.

Its web interface is extremely easy to use. You can perform many basic text-analysis tasks without spending too much time reading the manual. Many of its built-in functions (e.g., visualising the frequencies and trends of the selected text within a particular document) are performed automatically as soon as the file is loaded. Voyant also allows you to insert a direct URL link to any Web page and start analysing it automatically.
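To give a feel for what a frequency "trend" within a single document means, the Python sketch below splits a text into equal-sized segments and counts how often a chosen term appears in each. This is only an illustration of the general idea, not Voyant's actual implementation; the `term_trend` function and its parameters are assumptions made for this example.

```python
import re

def term_trend(text, term, segments=4):
    """Count how often `term` occurs in each of `segments` consecutive slices of the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    target = term.lower()
    counts = []
    for i in range(segments):
        # Integer arithmetic keeps the slices contiguous and near-equal in size.
        start = i * len(tokens) // segments
        end = (i + 1) * len(tokens) // segments
        counts.append(tokens[start:end].count(target))
    return counts

doc = "data data tools text text text mining data"
print(term_trend(doc, "text", segments=2))  # [1, 2]
print(term_trend(doc, "data", segments=2))  # [2, 1]
```

Plotting these per-segment counts as a line chart gives a picture of where in the document a term is concentrated.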

There is also a wide range of tools that can be used with Voyant for additional features. Find out more here.

 

TAPoRware

Written in Ruby (an open-source programming language), TAPoRware consists of a set of text analysis tools that you can use online to analyse HTML, XML and plain text files. Again, you can also analyse web pages and documents simply by providing the relevant URL. Each TAPoRware tool can also be used as a web service via the TAPoR Portal.

The interface of each tool is clean-cut with a very minimalist feel to it, but they all perform admirably with whatever tasks you throw at them.

 

Orange Text Mining

Orange is a desktop application that must be installed locally. It offers the best performance of the three tools discussed in this post, but it is also perhaps the most complicated. Unlike the other two, which are closed source, Orange is open source.

Orange offers a range of visualisation outputs (e.g., bar charts, scatter plots, dendrograms, networks and heat maps) and also allows you to design your own data analysis steps via its visual programming environment. A Python scripting interface is also available for users who want to code their own algorithms or develop complex data analysis procedures.

Table of Comparisons

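|                | Voyant Tools                       | TAPoRware                | Orange Text Mining               |
|----------------|------------------------------------|--------------------------|----------------------------------|
| Cost           | Free                               | Free                     | Free                             |
| Licence        | Closed source                      | Closed source            | GPL / GNU General Public License |
| Usability      | Easy                               | Easy                     | Easy                             |
| Tool type      | Web application                    | Web application          | Desktop application              |
| Import formats | TXT, CSV, HTML, XML, PDF, RTF, URL | TXT, CSV, HTML, XML, URL | TXT, CSV                         |
| Export formats | TXT, CSV, XML                      | TXT, CSV, HTML, XML      | CSV, TAB                         |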

 

Deep Dive…

Needless to say, given the depth of this topic, this post is only able to cover a small fraction of available text mining tools. There are certainly other options that might also be worth considering depending on your specific requirements. For instance, Juxta is an open-source multi-platform desktop tool that provides a user-friendly interface and can perform many textual criticism tasks on TXT and XML files.

KNIME Analytics Platform is another powerful tool for analysing datasets. It’s open source (GPL licence) and offers rich features, such as data pre-processing and cleansing, data modelling, data analysis and data mining. KNIME also integrates well with Weka’s analysis modules and, with additional plugins, it is also possible to run custom R scripts within it.


 


One Response to “Thing 18: Text mining tools”

  1. Don says:

    Hello, I want to share a free text mining tool called Textsift.com based on artificial intelligence. Feel free to use as you like for research. Here is a demo of the tool applied to this page.

    http://textsift.com?url=http://blogs.unimelb.edu.au/23researchthings/2014/08/04/thing-18-text-mining-tools/

    Please don’t hesitate to contact me for further questions.
