The latest news from the meaning blog

 

SimStat, WordStat & QDA Miner reviewed

In Brief

WordStat with QDA Miner and SimStat

Provalis Research, Canada
Date of review: September 2008

What it does

Windows-based software for analysing textual data such as answers to openended questions alongside other quantitative data, or for analysing qualitative transcripts. Uses a range of dictionary-based textual statistical analysis methods.

Our ratings

Ease of use: 3 out of 5

Compatibility with other software: 3.5 out of 5

Value for money: 4.5 out of 5

Cost

SimStat + QDA Miner + WordStat bundle – one-off purchase cost of $4,195 or $14,380 for 5 licences. 75% discount for academic users on most prices. Upgrades offered to licensed users at a discount.

Pros

  • Analyse and cross-tab verbatim responses as you would any standard question
  • Can be used to code data automatically using a machine learning method
  • Easy to create own subject-specific dictionaries, which can be reused on similar projects
  • Full versions available to download for a month’s free trial

Cons

  • Standalone system – no collaborative or multi-user capabilities
  • Steep learning curve – requires an expert user
  • Dictionary- and word-based analysis only: does not support natural language processing or learn-by-example methods
  • Does not support Triple S data

In Depth

It’s often the answers to openended questions that offer the richest insights in a survey – especially in online surveys, if the questions are specific and well targeted. Yet this valuable resource remains almost unexplored in most quantitative surveys due to the sheer effort involved in analysing it. The principal method used – manual coding to a codeframe – is virtually unchanged since the 1940s. The only nod to the information age is the number of coding departments that use Excel as a surrogate electronic coding sheet – the IT equivalent of equipping your horse and cart with satnav.

WordStat is a very versatile bit of software that offers you countless ways to analyse openended textual data with virtually the same ease as analysing normal ‘closed’ questions in a cross-tab package. It comes as an add-on module to SimStat – which is a feature-rich desktop statistical package for analysing survey data with decent cross-tab and charting capabilities. Interestingly, WordStat also functions equally well as an add-on to another program: QDA Miner, a code-and-retrieve analysis suite for qualitative researchers. All three are developed by Provalis Research, based in Montreal.

WordStat, combined with the other two programs, offers a bewildering choice of ways to dig deeply into verbatim data and make sense of it. In this review, I will focus on two of the more interesting ones to market researchers, but this versatile program has many more tricks up its sleeve than I can cover here.

First, without doing any conventional coding at all, you can effectively cross-tab or slice your data according to any demographics or dependent variables in your data. For this, you would start out in SimStat.

SimStat will let you carry out normal quantitative analysis of data as simple cross-tabs or by applying a wide range of statistics – factor and cluster analysis, regression, correlation and so on. To do textual analysis, you need to start with the verbatim data in the same data file as the other numeric and coded questions. This is pretty much how most web interviewing packages provide the data these days. SimStat works around the concept of dependent and independent variables: pick any demographic you want as the independent variable, pick the verbatim question as the dependent variable and pick Content Analysis from the bottom of the Statistic menu, and SimStat will fire up the separate WordStat module to let you cross-tab and dig into your openended text.

Once within WordStat there is a wide array of reports and ways to look at the words that respondents have given, in aggregate or case by case, against your dependent variable or by the total sample. There are charts, including dendrograms and very informative heatmaps that show the relationship between words, and you can adjust the proximity factor used when looking at words used in the vicinity of other words. Indeed, there is more than you are ever likely to use.
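To make the idea of cross-tabbing words against a demographic concrete, here is a minimal sketch in Python. It is not how WordStat works internally, which the review does not describe; the function name `word_crosstab`, the field names and the sample answers are all invented for illustration.

```python
import re
from collections import Counter, defaultdict


def word_crosstab(records, group_key, text_key, stopwords=frozenset()):
    """Count how many respondents in each group used each word.

    Hypothetical helper for illustration only: each word is counted
    once per respondent, however often it appears in their answer.
    """
    table = defaultdict(Counter)
    for record in records:
        words = set(re.findall(r"[a-z']+", record[text_key].lower()))
        table[record[group_key]].update(words - stopwords)
    return table


# Invented sample data: one demographic and one verbatim per respondent.
responses = [
    {"age": "18-34", "comment": "Delivery was slow and the price too high"},
    {"age": "18-34", "comment": "Great price, slow website"},
    {"age": "35+", "comment": "Friendly staff, fair price"},
]
ignore = frozenset({"and", "the", "was", "too"})
table = word_crosstab(responses, "age", "comment", stopwords=ignore)

for group in sorted(table):
    print(group, table[group].most_common(3))
```

Each row of the resulting table is a demographic group and each column a word, which is the same shape as a cross-tab of a closed question.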

However, if you would like to follow a more conventional coding model, you need to enter your data via QDA Miner, which is equally comfortable dealing with records that contain only unstructured text such as focus group transcripts, or quant data with a mix of closed and open fields. QDA Miner has the concept of codeframes at the heart of it, and you can create multiple codeframes and code directly into them. You can search for similar items and then code them all together, and you can extend the codeframe as you go too. Any coding you make can be exported back to the data, for analysis in SimStat or other tools. Nothing too unconventional there.

Step into WordStat, though, and you shift from one era to the next, for the capabilities of WordStat are now at your disposal to build automatic machine learning-based verbatim text classifiers. In other words, WordStat will take your coded examples, identify all the words and combinations of words that characterise those examples, and build the algorithms to perform automated text categorisation according to your examples and your codeframe. Again, there are reports and charts available to you, to understand the extent to which your classifiers are accurate. Accuracy depends on many factors, but suffice to say here, the automatic classifiers can be as good as human coders, and on large datasets, will be much more consistent.
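For readers unfamiliar with learning a classifier from coded examples, the sketch below shows the general principle using a small multinomial naive Bayes model written in plain Python. This is a stand-in technique chosen for illustration: the review does not say which algorithms WordStat uses, and the codes, example verbatims and function names here are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(coded_examples):
    """Build word counts per code from (verbatim, code) pairs coded by hand."""
    word_counts = {}
    doc_counts = Counter()
    for text, code in coded_examples:
        doc_counts[code] += 1
        word_counts.setdefault(code, Counter()).update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the code with the highest naive Bayes score for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_code, best_score = None, -math.inf
    for code in doc_counts:
        total_words = sum(word_counts[code].values())
        score = math.log(doc_counts[code] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log(
                    (word_counts[code][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_code, best_score = code, score
    return best_code


# Invented hand-coded examples standing in for a coded sample of verbatims.
examples = [
    ("I need training in budgeting and accounting", "finance"),
    ("more help with financial reports and budgeting", "finance"),
    ("learning to manage and motivate my team", "leadership"),
    ("how to lead a team through change", "leadership"),
]
model = train(examples)
print(classify("a course on budgeting and accounting", model))
```

In practice you would hold back some of the hand-coded answers and compare the predicted codes against them, which is the kind of accuracy check the reports and charts mentioned above are there to support.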

Unlike some other classification models, WordStat is a dictionary-based system, and it works principally on words and the relationship of words to others, rather than on actual phrases. There is a separate module for creating reusable subject-specific dictionaries and the system comes with general dictionaries in about 15 languages. It also contains a range of tools to clean up texts and to overlook elementary spelling mistakes, with its fuzzy matching logic.
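A dictionary-based approach with tolerance for spelling slips can be sketched in a few lines of Python. The dictionary contents and the `categorise` function are invented for illustration, and the fuzzy matching uses the `difflib` module from Python's standard library, which is an assumption of this sketch rather than a description of WordStat's own matching logic.

```python
import re
from difflib import get_close_matches

# Invented subject-specific dictionary: category -> words that signal it.
DICTIONARY = {
    "price": {"price", "cost", "expensive", "cheap"},
    "service": {"service", "staff", "helpful", "rude"},
}

# Reverse lookup from each dictionary word to its category.
WORD_TO_CATEGORY = {
    word: category for category, words in DICTIONARY.items() for word in words
}


def categorise(text, cutoff=0.8):
    """Return the set of categories whose words appear in the text.

    A token counts as a hit if it is close enough to a dictionary word,
    so elementary misspellings still match. The cutoff is an assumption
    and would need tuning on real data.
    """
    categories = set()
    for token in re.findall(r"[a-z]+", text.lower()):
        match = get_close_matches(token, list(WORD_TO_CATEGORY), n=1, cutoff=cutoff)
        if match:
            categories.add(WORD_TO_CATEGORY[match[0]])
    return categories


print(categorise("Too expensve and the staf were rude"))
```

Because the dictionary is just data, it can be inspected, corrected and reused on the next project, which is the appeal of the approach.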

There are extensive academic debates about whether this is the best method for coding, but as everything it does is transparent, can be interrogated, changed and improved, it is as likely to be as good as any other method – and it is certainly better than throwing away 10,000 verbatim responses because nobody has the time or energy to look at them.

This is, however, an expert’s system. Coders and coding supervisors would probably struggle with it in its present form. Coding is not the only function that WordStat handles, and because it has to be accessed via one of two other programs, that adds another layer of complexity.

The disjunction between these three different programs is a slightly awkward one, though it is something users report they get used to. Neither does the system offer anything in the way of collaborative tools – though of course, if you are able to code data automatically, it does mean the work done by ten people really can be done by one. Don’t expect to be able to use this simply by reading the manuals – you would need to have some education in text categorisation basics from Provalis Research first. However, with a little effort, this tool could save weeks of work, and even allow you to start including, rather than avoiding, openended questions in your questionnaire design.

Customer Viewpoint: US Merit Systems Protection Board, Washington DC

John Ford is a Research Psychologist at the US Merit Systems Protection Board in Washington DC, where he uses WordStat to process large surveys of Federal employees which often contain vast amounts of openended data.

“In the last two large surveys we have done, we have had around 40,000 responses. We have been able to move away from asking the openend at the end which says ‘do you have any comments’ to doing more targeted openended questions that ask about specific things.

“We asked federal employees to identify their most crucial training need and describe it in a few sentences. We used a framework of 27 competencies to classify them. QDA Miner was very flexible in helping us to decide what the framework was, settle on it, and very quickly classify the competencies.

“With WordStat I was able to build a predictive model that would duplicate the manual coders’ performance at about 83% accuracy, and by tweaking a couple of other things we were able to gain another three to four per cent in predictive accuracy.

“We also observed that some of the technical competencies are much easier to classify than some of the soft skills – which is not a surprising result – but we were able to look at the differences and make some decisions around this.

“When you look at a fully automated method, there is always going to be variation according to what kind of question you are working with. Using QDA Miner and WordStat together helps you understand what those differences are.

“With WordStat you can start out with some raw text, and you can do some mining of it, you can create dictionaries, you can expand them with synonyms and build yourself a really good dictionary in very little time. If you work in example mode, you can mine the examples you need for your dictionary. The software will tell you the words and phrases that characterise those answers.

“To make the most of automated coding, you have to focus the questions more, and move people away from asking questions that are very general. In educating people about these, I have started to call them the ‘what did you do on your summer vacation’ questions. You can never anticipate where everyone is going to go. I have noticed there is also a role that the length of the response plays. Ideally, the question should be answerable in a sentence or two. You cannot do much with automated classification if the answer goes on for a couple of pages.”

A version of this review first appeared in Research, the magazine of the Market Research Society, September 2008, Issue 507

SPSS 16 reviewed

In Brief

What it does

Comprehensive desktop analysis software for crosstabs, charts and statistics, with integrated data editing, data processing, presentation and publishing capabilities.

Supplier

SPSS (An IBM Company)

Our ratings

Ease of use: 4 out of 5

Compatibility with other software: 4 out of 5

Value for money: 3 out of 5

Cost

Single user prices: Base SPSS system, £1072, standalone SPSS SmartViewer £132, add-on modules from £473. Annual maintenance and support from £214. Volume and educational discounts available.

Pros

  • Now cross-platform – PC, Mac or Linux
  • Clever data editing including anomaly detection
  • Greatly improved charting
  • Output directly to PDF

Cons

  • Wide range of options can be confusing to novice users
  • Output can look straggly and utilitarian

In Depth

This year the statistical software SPSS is forty years old. While SPSS now heavily promotes this program in the so-called business and predictive analytics arena, MR users continue to be well served by the latest issue, SPSS 16. Indeed, there are several very handy new features for questionnaire-based data and the stuff market researchers tend to do.

The big change is that the software has now been re-written in Java. Going to Java has given the developers the opportunity to make a few changes to the dialogue windows – though (before any experienced users break out into a cold sweat) not to where things are or how they work, but in terms of being able to resize items dynamically, stretch windows and see more displayed as a result. It means, for example, that long labels no longer get truncated in selection menus, which has long been an irritation. However, practised users will probably be surprised just how similar SPSS 16 is to recent native Windows versions, considering the interface has effectively been rebuilt from scratch.

SPSS has always been strong in allowing you to edit and clean your data on a case-by-case basis. While there seems to be a recent trend among some researchers not to bother, especially online, those who take these matters seriously should be rather pleased to see this version introduces a heuristic anomaly detector in the data validation menu. Set it going on all the variables you think matter, and it will pull out any cases where the answers stick out from the rest. It uses a clustering, or rather an un-clustering, algorithm, and looks for items that don’t cluster. More conventionally, there is also a complete rule-based validation routine, with several handy built-in rules to look for large numbers of missing variables or repeated answers (mainlining through grids, for example) and the option to set up your own cross-variable checks too.
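The two rule-based checks mentioned above, too many missing answers and the same answer repeated through a grid, are easy to picture as code. The following Python sketch is illustrative only and is not SPSS's validation routine; the function name, the threshold and the sample cases are all assumptions.

```python
def validate_case(case, grid_vars, max_missing=2):
    """Return a list of rule violations for one respondent.

    `case` maps variable names to answers, with None meaning missing.
    Both rules and the default threshold are illustrative assumptions.
    """
    problems = []

    # Rule 1: too many unanswered variables.
    missing = [name for name, value in case.items() if value is None]
    if len(missing) > max_missing:
        problems.append(f"{len(missing)} missing answers")

    # Rule 2: every item in the grid answered with the same value.
    grid = [case.get(name) for name in grid_vars]
    if len(grid) > 1 and None not in grid and len(set(grid)) == 1:
        problems.append("identical answer to every grid item")

    return problems


# Invented cases keyed by respondent ID; g1-g3 form a rating grid.
cases = {
    101: {"q1": 4, "q2": None, "g1": 3, "g2": 3, "g3": 3},
    102: {"q1": None, "q2": None, "g1": None, "g2": 2, "g3": 5},
    103: {"q1": 2, "q2": 1, "g1": 1, "g2": 4, "g3": 2},
}

flagged = {}
for case_id, case in cases.items():
    problems = validate_case(case, ["g1", "g2", "g3"])
    if problems:
        flagged[case_id] = problems

print(flagged)
```

A cross-variable check of your own would simply be a third rule in the same function, comparing two or more answers within the case.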

There are some handy new tools in the data prep area, such as easy recodes that take date and time values and chop them up into discrete time intervals such as months and quarters, or let you group according to day of week, mornings and afternoons and so on. There is ‘visual binning’ which lets you create categories from numeric variables by showing you a histogram of your new categories, and lets you even them out using sliders on screen. A new ‘optimal binning’ function lets you do the same to values, using another variable to determine the fine-tuning of the slices, such as to split income with respect to age.
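As an illustration of what such a date and time recode produces, here is a short Python sketch that derives month, quarter, day of week and part of day from a single timestamp. The function name `time_bins` and the noon cut-off are assumptions for the example, not a description of the SPSS dialogue.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def time_bins(timestamp):
    """Chop one interview timestamp into discrete reporting categories."""
    return {
        "month": f"{timestamp.year}-{timestamp.month:02d}",
        "quarter": f"{timestamp.year}-Q{(timestamp.month - 1) // 3 + 1}",
        "weekday": WEEKDAYS[timestamp.weekday()],  # Monday is 0
        # Assumed split: anything before noon counts as morning.
        "part_of_day": "morning" if timestamp.hour < 12 else "afternoon",
    }


interview = datetime(2008, 3, 14, 9, 30)
print(time_bins(interview))
```

Each derived value can then be used as a break variable in a cross-tab like any other category.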

Version 16 also makes it easier to edit and clean up the metadata – the text labels and names. There is a find and replace feature and a spell checker too, with dictionaries for both UK and US English and for other major languages. The move to Java has made possible other languages and writing systems too, as SPSS 16 now fully supports the Unicode standard.

On the output side, greatly improved charting came in with version 14, and the improvements continue. The visual method for defining charts is one of the most elegant I have seen. Where many tools, like Excel, simplify chart building with a wizard, here the workflow all takes place in the one chart-building window. It avoids the tunnel mentality of the wizard, where you emerge blinking on the other side with no idea of how you got there.

Two items are of particular interest to market researchers. Top marks to SPSS for the ‘panel’ chart option on all the charts, which lets you add a categorical variable such as demographics. It produces neat, side-by-side charts for each category, all the same size and sharing one legend. ‘Favourites’ make it easy to store the chart outline for any chart you have perfected in a gallery for you to use again, saving time and helping you achieve consistency in your reporting.

Behind the scenes, there is also a full chart scripting language, which can be used to automate repetitive chart production. Also of interest to MR users is the new built-in support for going straight to PDF from the output viewer. It offers a fantastic alternative to producing PowerPoint decks merely to communicate data. You can output everything or a selection. Best of all, the complete heading and folder structure of the output viewer is replicated in the PDF as bookmarks, to make navigation easy.

Much of the power and versatility of SPSS has always derived from the ability to write SPSS syntax directly. When you use the graphical interface, the syntax needed to drive the SPSS processor and create your outputs is created for you and can be saved and reused. Advanced users and programmers who use syntax directly will find many more commands and options at their disposal – so it is often possible to create highly customised outputs using syntax. The chart scripting options are just one recent syntax extension. Another intriguing one is a new ‘begin program’ command, which lets you run other external applications and scripts written in the open source language Python. So if the hundreds of statistical tests and models available within SPSS turn out still to be not enough, it is possible to spawn out to ‘R’ (see r-project.org), the open source statistical initiative, and apply any of the hundreds offered in R, using your SPSS data, and presenting the results in your SPSS output.

I was hoping that SPSS 16 would make the program and data structures less disdainful of multiple-response data. In science, and in business, this kind of data is rare, but in market research, multi-coded data abounds. Alas, even in version 16 it is still handled in the same arm’s-length way through multiple-response sets created from dichotomies. Rather confusingly, there are different multiple sets in the tables and in the special multiple-response frequencies and cross-tabs area. Once you have set them up, there is still that trap for the unwary that they do not get saved in the data, or saved at all without some effort.
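For anyone who has not met the idea, a multiple-response set built from dichotomies treats several yes/no variables as one multi-coded question. The Python sketch below shows the bookkeeping involved; the variable names, labels and data are invented, and it illustrates the concept rather than SPSS's own implementation.

```python
def multiple_response_counts(cases, set_vars, counted_value=1):
    """Tabulate a multi-coded question stored as one dichotomy per answer.

    `set_vars` maps each dichotomy variable to its answer label.
    Returns the count per label and the number of respondents who
    chose at least one answer, which is the usual percentage base.
    """
    counts = {label: 0 for label in set_vars.values()}
    respondents = 0
    for case in cases:
        chosen = [label for var, label in set_vars.items()
                  if case.get(var) == counted_value]
        if chosen:
            respondents += 1
        for label in chosen:
            counts[label] += 1
    return counts, respondents


# Invented question: "Which of these brands have you bought?"
brands = {"b1": "Brand A", "b2": "Brand B", "b3": "Brand C"}
cases = [
    {"b1": 1, "b2": 1, "b3": 0},
    {"b1": 0, "b2": 1, "b3": 0},
    {"b1": 0, "b2": 0, "b3": 0},
]
counts, base = multiple_response_counts(cases, brands)
print(counts, base)
```

Because answers can total more than the number of respondents, percentages are normally reported on the respondent base rather than on the sum of answers.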

My other grumble is that, despite the output improvements, the overall look of the reports that come out is still very utilitarian and is full of irrelevant set-up detail. Cross-tabs in particular are wilfully straggly and unfinished in appearance.

It surely cannot be an issue for the core SPSS users, otherwise you imagine it would have changed long ago, but it is another deterrent to market researchers, where effective communication of results has to be a core strength.

But for the sheer range of statistical tests and models available from one desktop application, SPSS deserves a place in every MR department, agency or consulting practice.

A version of this review first appeared in Research, the magazine of the Market Research Society, March 2008, Issue 501