WordStat with QDA Miner and SimStat
Provalis Research, Canada
Date of review: September 2008
What it does
Windows-based software for analysing textual data, such as answers to open-ended questions alongside other quantitative data, or for analysing qualitative transcripts. Uses a range of dictionary-based textual statistical analysis methods.
Ease of use
Compatibility with other software
Value for money
SimStat + QDA Miner + WordStat bundle – one-off purchase cost of $4,195 or $14,380 for 5 licences. 75% discount for academic users on most prices. Upgrades offered to licensed users at a discount.
Pros
- Analyse and cross-tab verbatim responses as you would any standard question
- Can use to code data automatically using a machine learning method
- Easy to create own subject-specific dictionaries, which can be reused on similar projects
- Full versions available to download for month’s free trial
Cons
- Standalone system – no collaborative or multi-user capabilities
- Steep learning curve – requires an expert user
- Dictionary and word-based analysis only: does not support natural language processing or learn by example methods
- Does not support Triple S data
It’s often the answers to open-ended questions that offer the richest insights in a survey – especially in online surveys, if the questions are specific and well targeted. Yet this valuable resource remains almost unexplored in most quantitative surveys due to the sheer effort involved in analysing it. The principal method used – manual coding to a codeframe – is virtually unchanged since the 1940s. The only nod to the information age is the number of coding departments that use Excel as a surrogate electronic coding sheet – the IT equivalent of equipping your horse and cart with satnav.
WordStat is a very versatile bit of software that offers you countless ways to analyse open-ended textual data with virtually the same ease as analysing normal ‘closed’ questions in a cross-tab package. It comes as an add-on module to SimStat – which is a feature-rich desktop statistical package for analysing survey data with decent cross-tab and charting capabilities. Interestingly, WordStat also functions equally well as an add-on to another program: QDA Miner, a code-and-retrieve analysis suite for qualitative researchers. All three are developed by Provalis Research, based in Montreal.
WordStat, combined with the other two programs, offers a bewildering choice of ways to dig deeply into verbatim data and make sense of it. In this review, I will focus on two of the more interesting ones to market researchers, but this versatile program has many more tricks up its sleeve than I can cover here.
First, without doing any conventional coding at all, you can effectively cross-tab or slice your data according to any demographics or dependent variables in your data. For this, you would start out in SimStat.
SimStat will let you carry out normal quantitative analysis of data as simple cross-tabs or by applying a wide range of statistics – factor and cluster analysis, regression, correlation and so on. To do textual analysis, you need to start with the verbatim data in the same data file as the other numeric and coded questions. This is pretty much how most web interviewing packages provide the data these days. SimStat works around the concept of dependent and independent variables: pick any demographic you want as the independent variable, pick the verbatim question as the dependent variable and pick Content Analysis from the bottom of the Statistic menu, and SimStat will fire up the separate WordStat module to let you cross-tab and dig into your open-ended text.
Once within WordStat there is a wide array of reports and ways to look at the words that respondents have given, in aggregate or case by case, against your dependent variable or by the total sample. There are charts, including dendrograms and very informative heatmaps that show the relationship between words, and you can adjust the proximity factor used when looking at words used in the vicinity of other words. Indeed, there is more than you are ever likely to use.
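To make the idea of cross-tabbing words against a demographic concrete, here is a minimal sketch in Python. It is not WordStat's own method or output: the records, the field names (`age_group`, `comment`) and the tiny stop list are all invented for illustration, and it simply counts how many respondents in each group used each word.

```python
from collections import Counter, defaultdict
import re

# Hypothetical survey records: one closed question plus one verbatim answer.
responses = [
    {"age_group": "18-34", "comment": "Delivery was slow but the price was good"},
    {"age_group": "18-34", "comment": "Good price, slow website"},
    {"age_group": "35+", "comment": "Friendly staff and good service"},
    {"age_group": "35+", "comment": "Service was slow"},
]

# Illustrative stop list only; a real exclusion list would be far longer.
STOP_WORDS = {"was", "but", "the", "and", "a"}


def tokenize(text):
    """Lower-case the text and return its words, minus stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def word_crosstab(records, group_field, text_field):
    """Count, for each group, how many respondents used each word."""
    table = defaultdict(Counter)
    for record in records:
        # set() so that a word counts once per respondent, not once per mention
        table[record[group_field]].update(set(tokenize(record[text_field])))
    return table


table = word_crosstab(responses, "age_group", "comment")
for group in sorted(table):
    print(group, table[group].most_common(3))
```

Counting respondents rather than raw mentions keeps the figures comparable with a conventional cross-tab, where each case falls into a cell at most once.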
However, if you would like to follow a more conventional coding model, you need to enter your data via QDA Miner, which is equally comfortable dealing with records that contain only unstructured text such as focus group transcripts, or quant data with a mix of closed and open fields. QDA Miner has the concept of codeframes at the heart of it, and you can create multiple codeframes and code directly into them. You can search for similar items and then code them all together, and you can extend the codeframe as you go too. Any coding you make can be exported back to the data, for analysis in SimStat or other tools. Nothing too unconventional there.
Step into WordStat, though, and you shift from one era to the next, for the capabilities of WordStat are now at your disposal to build automatic machine learning-based verbatim text classifiers. In other words, WordStat will take your coded examples, identify all the words and combinations of words that characterise those examples, and build the algorithms to perform automated text categorisation according to your examples and your codeframe. Again, there are reports and charts available to you, to understand the extent to which your classifiers are accurate. Accuracy depends on many factors, but suffice to say here, the automatic classifiers can be as good as human coders, and on large datasets, will be much more consistent.
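For readers who want a feel for what learning a classifier from coded examples involves, the sketch below trains a very small naive Bayes text classifier in plain Python. This is an assumption-laden illustration, not a description of the algorithms WordStat uses: the class name `NaiveBayesCoder`, the two codes and the example verbatims are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesCoder:
    """Toy multinomial naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # code -> word frequencies
        self.code_counts = Counter()             # code -> number of examples
        self.vocabulary = set()

    def train(self, coded_examples):
        """Learn word frequencies from (verbatim, code) pairs."""
        for text, code in coded_examples:
            words = tokenize(text)
            self.code_counts[code] += 1
            self.word_counts[code].update(words)
            self.vocabulary.update(words)

    def classify(self, text):
        """Return the code with the highest log-probability for this text."""
        total = sum(self.code_counts.values())
        best_code, best_score = None, -math.inf
        for code, n in self.code_counts.items():
            score = math.log(n / total)
            denom = sum(self.word_counts[code].values()) + len(self.vocabulary)
            for word in tokenize(text):
                if word in self.vocabulary:  # ignore words never seen in training
                    score += math.log((self.word_counts[code][word] + 1) / denom)
            if score > best_score:
                best_code, best_score = code, score
        return best_code


# Hypothetical manually coded verbatims.
coded = [
    ("I need training on spreadsheet and database software", "technical"),
    ("More courses on computer programming", "technical"),
    ("Help with managing conflict in my team", "interpersonal"),
    ("Coaching on leading a team and giving feedback", "interpersonal"),
]

coder = NaiveBayesCoder()
coder.train(coded)
print(coder.classify("database programming course"))
print(coder.classify("conflict with my team leader"))
```

With only four training examples this is a toy. In practice you would hold back some of the manually coded responses and compare them with the classifier's output to judge how accurate it is.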
Unlike some other classification models, WordStat is a dictionary-based system, and it works principally on words and the relationship of words to others, rather than on actual phrases. There is a separate module for creating reusable subject-specific dictionaries and the system comes with general dictionaries in about 15 languages. It also contains a range of tools to clean up texts and to overlook elementary spelling mistakes, with its fuzzy matching logic.
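The general principle of dictionary-based categorisation with tolerance for spelling mistakes can be sketched in a few lines of Python. The example below uses `difflib.get_close_matches` from the standard library as a stand-in for fuzzy matching; WordStat's own matching logic is not documented here and may work quite differently. The dictionary, its categories and the `cutoff` value are all assumptions made for illustration.

```python
import re
from difflib import get_close_matches

# Hypothetical subject-specific dictionary: category -> words that signal it.
DICTIONARY = {
    "price": ["price", "cost", "expensive", "cheap"],
    "delivery": ["delivery", "shipping", "courier", "late"],
    "service": ["service", "staff", "helpful", "rude"],
}

# Reverse lookup from each dictionary word to its category.
WORD_TO_CATEGORY = {
    word: category for category, words in DICTIONARY.items() for word in words
}


def categorise(text, cutoff=0.8):
    """Return the set of categories whose dictionary words appear in the text.

    A word counts as a hit if it is identical, or sufficiently similar,
    to a dictionary word; `cutoff` (0 to 1) sets how similar it must be.
    """
    categories = set()
    for word in re.findall(r"[a-z]+", text.lower()):
        match = get_close_matches(word, list(WORD_TO_CATEGORY), n=1, cutoff=cutoff)
        if match:
            categories.add(WORD_TO_CATEGORY[match[0]])
    return categories


print(categorise("The delivry was late and very expensiv"))
print(categorise("Staff were helpfull"))
```

Lowering `cutoff` forgives more misspellings but also raises the risk of false matches between unrelated words, so it is worth checking a sample of hits by eye before trusting the counts.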
There are extensive academic debates about whether this is the best method for coding, but as everything it does is transparent, and can be interrogated, changed and improved, it is likely to be as good as any other method – and it is certainly better than throwing away 10,000 verbatim responses because nobody has the time or energy to look at them.
This is, however, an expert’s system. Coders and coding supervisors would probably struggle with it in its present form. Coding is not the only function that WordStat handles, and because it has to be accessed via one of two other programs, that adds another layer of complexity.
The disjunction between these three different programs is a slightly awkward one, though it is something users report they get used to. Neither does the system offer anything in the way of collaborative tools – though of course, if you are able to code data automatically, it does mean the work done by ten people really can be done by one. Don’t expect to be able to use this simply by reading the manuals – you would need to have some education in text categorisation basics from Provalis Research first. However, with a little effort, this tool could save weeks of work, and even allow you to start including, rather than avoiding, open-ended questions in your questionnaire design.
Customer Viewpoint: US Merit Systems Protection Board, Washington DC
John Ford is a Research Psychologist at the US Merit Systems Protection Board in Washington DC, where he uses WordStat to process large surveys of Federal employees which often contain vast amounts of open-ended data.
“In the last two large surveys we have done, we have had around 40,000 responses. We have been able to move away from asking the open-end at the end which says ‘do you have any comments’ to doing more targeted open-ended questions that ask about specific things.
“We asked federal employees to identify their most crucial training need and describe it in a few sentences. We used a framework of 27 competencies to classify them. QDA Miner was very flexible in helping us to decide what the framework was, settle on it, and very quickly classify the competencies.
“With WordStat I was able to build a predictive model that would duplicate the manual coders’ performance at about 83% accuracy, and by tweaking a couple of other things we were able to gain another three to four per cent in predictive accuracy.
“We also observed that some of the technical competencies are much easier to classify than some of the soft skills – which is not a surprising result – but we were able to look at the differences and make some decisions around this.
“When you look at a fully automated method, there is always going to be variation according to what kind of question you are working with. Using QDA Miner and WordStat together helps you understand what those differences are.
“With WordStat you can start out with some raw text, and you can do some mining of it, you can create dictionaries, you can expand them with synonyms and build yourself a really good dictionary in very little time. If you work in example mode, you can mine the examples you need for your dictionary. The software will tell you the words and phrases that characterise those answers.
“To make the most of automated coding, you have to focus the questions more, and move people away from asking questions that are very general. In educating people about these, I have started to call them the ‘what did you do on your summer vacation’ question. You can never anticipate where everyone is going to go. I have noticed there is also a role that the length of the response plays. Ideally, the question should be answerable in a sentence or two. You cannot do much with automated classification if the answer goes on for a couple of pages.”
A version of this review first appeared in Research, the magazine of the Market Research Society, September 2008, Issue 507