Data integration: some scenarios

by Tim Macer

You have just been appointed Insight Manager at C&P, a consumer products company. You find you are sitting on hundreds of surveys commissioned by your predecessors, some going back a decade or more. Beyond that, there are sales and marketing databases, production planning systems and asset registers, all with useful data in them but not very easy to get at. The reports that come out of them are either too detailed or miss the crucial demographics you need. Requesting custom reports from the Information Management team takes time and seems to involve more effort and cost than it’s worth. Alongside this, your research budget has already been pegged at last year’s level and you have now been told that you, along with every other department, must find another 15% in savings.

These days, there is a renewed desire among research buyers to get more out of existing data – yet it is often easier said than done, especially because of the technical hurdles that need to be overcome. Though my example is a hypothetical one, I am going to follow it through to show how our insight manager (who I am going to call Jan) was able to save money and still deliver the insights she was being asked for.

1. Turning the back data into a research library

Jan found there was very little standardisation in the way research projects were delivered: it often depended on the agency used, and C&P had a diverse mix of them. The trackers were more organised, with regular monthly or quarterly reports sent by email. The ad hoc surveys were chaotic. There was a mostly complete catalogue of the surveys in an Excel spreadsheet, but this often contained little more than a title and some fieldwork dates.

Jan set up a project to categorise each survey, starting with the most recent two years. The exact deliverables were identified – PowerPoint deck, tables in Word or PDF and so on. Sometimes there were data files too, in a variety of formats. Two agencies favoured Quanvert, others used E-Tabs or Espri; another used its own software. Two agencies were providing access to the surveys through their own web portals.

The team created an information sheet for each project. This contained a summary of the research brief, notes on the key findings and a list of the deliverables available (PowerPoint, PDF tables, Quanvert database and so on). They also located the source questionnaire where possible, which was usually a Word document.
Jan booked some time with a member of C&P’s web team. He used one of the many open-source content management tools to create a simple research library on the company’s intranet, based around the information sheet. This was searchable and allowed Jan to add other taxonomies to make searching easier. All of the ‘assets’ – the questionnaire and any reports – were moved to the intranet server and linked to the web library, allowing users to view or download them.

2. Standardising the deliverables

When Jan counted up, there were nine different research analysis tools in use: some were desktop programs and others web based. There were also some hefty set-up and annual licence costs. Central IT policies were making it difficult to get approval for odd pieces of software, and there were cross-charges for testing and applying upgrades too. Jan decided to standardise on one software tool and set up a small working group. She invited the IT manager and Business Intelligence (BI) manager to contribute. BI had standardised on Business Objects, a relational database reporting tool, as its reporting architecture, and the BI manager could not understand why this could not be used for MR data. He also suggested building a series of data cubes and using an OLAP tool to allow users to run queries and reports. The IT manager was keen to develop a reporting system for Jan using Microsoft SharePoint. Jan contacted the DP managers at two of the research agencies she used and prepared her case for a specialist MR reporting tool.
She was able to explain that the principal difficulty in using enterprise reporting products such as Business Objects, Cognos or Hyperion is that each survey has a different dataset with a different structure, a large number of variables and relatively few records. A lot of specialist programming was required for each new survey, and once a report was defined, additional programming from BI would be needed to make changes.
By comparison, using a web-based cross-tab and reporting tool designed for survey data meant that set-up for any new survey was fast and required no programming. She learned that it is notoriously difficult to achieve a consistent base on sampled or filtered reports using database reporting tools, and that weighting, or handling levels in the data, was out of the question. Even getting the percentages right on different reports could be a trial. She persuaded her colleagues that there was a case for a specialist MR tool, and chose an off-the-shelf web tool, ‘interTAB’, whose supplier was also willing to host the surveys at a reasonable cost.
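To see why survey percentaging trips up generic reporting tools, consider what even a simple weighted cross-tab involves. Here is a minimal sketch in Python with pandas – all the field names and figures are invented – in which the counts are sums of respondent weights and the percentages are taken on the weighted base, something a specialist MR tool does as a matter of course:

```python
import pandas as pd

# Hypothetical respondent-level survey data: one row per respondent,
# with a weight correcting the sample back to population proportions.
df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South", "North"],
    "rating": ["Good", "Poor", "Good", "Good", "Poor", "Good"],
    "weight": [1.2, 0.8, 1.0, 0.9, 1.1, 1.0],
})

# Weighted cross-tab: sum the weights rather than counting rows.
counts = pd.crosstab(df["rating"], df["region"],
                     values=df["weight"], aggfunc="sum").fillna(0)

# Column percentages on the weighted base - the step that is easy to
# get wrong when a generic tool percentages on raw row counts instead.
col_pct = counts.div(counts.sum(axis=0), axis=1) * 100
print(col_pct.round(1))
```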
Some discussions with the research companies and DP suppliers to C&P established that they would be able to upload their data directly to the web server, using the Triple-S format. Two also offered to licence the converter program so they could produce native interTAB databases. Jan also identified key surveys from the last two years whose data would be converted and uploaded. Like many analysis tools, interTAB was able to read SPSS-format data, so Quanvert databases were converted to SPSS for importing.
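As an illustration of that import route, here is a minimal sketch of reading an SPSS .sav file in Python, assuming the pyreadstat library (my choice, not something named in this scenario); the filename is made up. The point is that a proper SPSS reader preserves the variable and value labels that make survey data usable:

```python
import pyreadstat

# Read an SPSS .sav file: returns the data plus a metadata object
# holding the variable labels and value labels - the survey structure
# that generic database tools tend to discard.
df, meta = pyreadstat.read_sav("brand_tracker_q3.sav")  # illustrative filename

print(meta.column_names_to_labels)   # e.g. {"q1": "Overall satisfaction", ...}
print(meta.variable_value_labels)    # answer-code -> answer-text mappings
print(df.head())
```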

By the time all the old software licences have expired and been left to lapse, the costs for interTAB will be slightly less than before, with members of the research team and several internal clients each having their own logon. Jan has also shrunk the previously sizeable bill for extra analysis from research providers.
The software was surprisingly easy to link to the research library. Each study in the tab software has its own unique URL. By capturing these addresses and embedding them in the library catalogue, users get a simple click-through link to the relevant study in the analysis tool.
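In practice the link-up can be as simple as storing the study URL on each catalogue record and rendering it as a hyperlink. A trivial sketch, with hypothetical record fields and URLs:

```python
# Hypothetical catalogue records, each carrying the study's URL in the
# analysis tool so the library page can render a click-through link.
catalogue = [
    {"title": "Pack redesign test, Mar 2008",
     "study_url": "https://intertab.example.com/studies/pack-redesign-0803"},
    {"title": "Brand tracker, Q2 2008",
     "study_url": "https://intertab.example.com/studies/tracker-2008q2"},
]

for record in catalogue:
    print(f'<a href="{record["study_url"]}">{record["title"]}</a>')
```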

3. Linking customer surveys to other data assets

C&P use a wide range of database-marketing-driven promotional activities. As a manufacturer supplying products to customers via wholesale and then retail channels, its links to customers are tenuous. However, it has amassed a large amount of information about customers who have clipped coupons and provided an address and some basic demographic data. It also runs several loyalty clubs, where there is rich demographic information. In addition, the marketing department has recently signed up to receive loyalty card data from a major grocery retailer, for which the BI team has created a data warehouse and built a range of OLAP data cubes to allow querying of the data. While this contains actual purchase information and good demographics, it does not provide the personal data needed to link it at a respondent level with any surveys.

The marketing manager was looking to Jan for advice on how to analyse and interpret this data, and Jan could see endless benefits from being able to link survey data to these other data sources. But where to start?

Jan contacted Tom, an ex-colleague and now a consultant with a strong background in statistics and database work. They made several changes to the data warehouse, starting by incorporating a number of segmentation models that had been developed recently from consumer research, using the demographics as the linking factor.
They also worked with IM to provide a simplified query interface that would allow researchers to extract sample datasets according to any criteria, so that these could be loaded into cross-tab and statistical analysis software. Tom recommended using CHAID as a discovery tool to uncover hidden relationships in the data warehouse. This was not part of the interTAB software, so Jan purchased a separate statistical product, which also offered a range of other multivariate analytical capabilities. Jan also discovered that the statistics software could import Triple-S data but not export it. She persuaded the manufacturer to create a Triple-S export, and this provided her with a reliable, error-free route to get the data from the data warehouse into interTAB. The interTAB software coped surprisingly well with datasets of 200,000-plus cases.
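CHAID builds a tree of chi-squared-driven splits to show which variables best divide the base into distinct segments. The sketch below illustrates the same discovery idea using a CART-style decision tree from scikit-learn as a stand-in, since CHAID itself is rarely found in open-source libraries; all field names and data are invented:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical extract from the data warehouse: coded demographics plus
# a yes/no outcome such as "bought the promoted product".
df = pd.DataFrame({
    "age_band":  [1, 2, 3, 2, 1, 3, 2, 3],
    "household": [1, 1, 2, 2, 3, 3, 1, 2],
    "region":    [1, 2, 1, 2, 1, 2, 1, 2],
    "bought":    [0, 1, 1, 0, 0, 1, 0, 1],
})

# A shallow tree plays the same discovery role as CHAID: it finds which
# variables best split the base into segments with different outcomes.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=2)
tree.fit(df[["age_band", "household", "region"]], df["bought"])

# Print the discovered splits as readable rules.
print(export_text(tree, feature_names=["age_band", "household", "region"]))
```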

As Tom worked with the data, he realised there was potential to improve on the segmentation models used by the marketing team by working with the data warehouse data and the survey data in parallel: testing models built from the surveys on the transactional data, and vice versa. This meant extracting data from the survey tool so it could be loaded into the stats package. This too required a small program change from interTAB’s developers, so that users could output a list of variables in the same Triple-S format for the stats package to read.
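Triple-S, for those unfamiliar with it, is an open XML standard for describing survey variables and data. Here is a heavily simplified sketch of what writing such metadata might look like in Python; the real standard defines much more, including record layouts, field positions and version declarations, so treat this as illustrative only:

```python
import xml.etree.ElementTree as ET

# A much-simplified, illustrative Triple-S style export for one variable.
variables = [
    {"ident": "1", "name": "Q1", "label": "Overall satisfaction",
     "type": "single", "values": {"1": "Satisfied", "2": "Dissatisfied"}},
]

sss = ET.Element("sss")
survey = ET.SubElement(sss, "survey")
record = ET.SubElement(survey, "record", ident="A")
for v in variables:
    var = ET.SubElement(record, "variable", ident=v["ident"], type=v["type"])
    ET.SubElement(var, "name").text = v["name"]
    ET.SubElement(var, "label").text = v["label"]
    values = ET.SubElement(var, "values")
    for code, text in v["values"].items():
        ET.SubElement(values, "value", code=code).text = text

# Write the metadata file; the accompanying data file would follow the
# record layout that a full implementation also declares.
ET.ElementTree(sss).write("export.sss.xml", encoding="utf-8",
                          xml_declaration=True)
```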

4. Using existing data assets for sampling

The marketing data, especially the coupon and loyalty club data, interested Jan as a source of sample for product-specific surveys, particularly where users of a particular product were sought. Jan was reluctant to build a panel as such, but began to realise that, because the same consumer might be represented across multiple marketing datasets, some rudimentary controls were needed to ensure that she was not always sampling the same respondents.
She explored with the IT team the possibility of creating a database to store all potential respondents and then keep it updated with actual participation. The costs and timescales were huge: there was no budget for this. However, Jan discovered there was more than one off-the-shelf system that could provide this capability. By looking at the cost of the samples being purchased for surveys, Jan was able to build a case for buying a system – essentially a large data warehouse into which data could be fed from multiple sources. The de-duping capabilities were the key to success, so that individuals could be matched and the extra data added to their existing record.
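At its simplest, de-duping means building a normalised match key from identifying fields and merging records that share it. A crude sketch with invented records follows – real systems add fuzzy matching such as phonetic codes and edit distance:

```python
import re

def match_key(name: str, postcode: str) -> str:
    """Build a crude normalised key for matching the same individual
    across coupon, loyalty-club and survey datasets."""
    name = re.sub(r"[^a-z]", "", name.lower())       # letters only
    postcode = re.sub(r"\s", "", postcode.upper())   # strip spaces
    return f"{name}|{postcode}"

# Merge records sharing a key, so extra data lands on the existing record.
people: dict[str, dict] = {}
for record in [
    {"name": "Ann Smith", "postcode": "AB1 2CD", "source": "coupons"},
    {"name": "ann smith", "postcode": "ab12cd",  "source": "loyalty"},
]:
    key = match_key(record["name"], record["postcode"])
    people.setdefault(key, {}).update(record)

print(len(people))  # 1 - both rows matched to a single individual
```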
The direct marketing teams had a number of requests, and they were concerned about ownership of the data. Special permission had to be sought where the data had been obtained from an external data provider. In some cases this was too costly to be viable, so the data had to be excluded – and the research team then had concerns over the viability of the data as a representative sample if there were arbitrary exclusions.
The marketing team also wanted to have all the survey data fed back into their systems – particularly purchasing intention data. Jan was not willing to compromise the research by asking respondents for such consent, so a different solution had to be found. This was provided through the segmentation work that Tom was doing: building predictive models that could apply a purchasing propensity score to the marketing data, based on the survey and linked through the demographics. It meant there was no direct link between a respondent and their record. The marketing team liked the fact that the propensity score would populate the entire database, not just the arbitrary sample actually interviewed.
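The mechanics of such scoring are straightforward: fit a model on the survey alone, with stated purchase intention as the outcome and the shared demographics as predictors, then score every marketing record through those same demographics. A minimal sketch with scikit-learn and invented, pre-coded data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical survey respondents: coded demographics and a stated
# purchase-intention flag. Columns: age_band, household_size, region.
survey_X = np.array([[1, 2, 1], [2, 1, 2], [3, 3, 1], [2, 2, 2], [1, 1, 1]])
survey_y = np.array([0, 1, 1, 0, 0])  # intends to purchase?

# Fit the propensity model on the survey alone...
model = LogisticRegression().fit(survey_X, survey_y)

# ...then score the whole marketing database through the shared
# demographics. No respondent-level link is ever made.
marketing_X = np.array([[2, 2, 1], [3, 1, 2], [1, 3, 1]])
propensity = model.predict_proba(marketing_X)[:, 1]
print(propensity.round(2))  # one score per marketing record
```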

5. Supersurveys and meta-analysis

Jan has yet to realise her most ambitious initiative. She has just started work with a firm of research data integrators to create a data warehouse of consolidated data from ad hoc product surveys. This will allow the research team to perform meta-analyses of all similar questions asked in any survey, produce norms for rating scales and benchmark the performance of particular questions. Jan also wants to identify other common discriminating or predictive factors that work across entire product categories or markets.

Her team have made a start by identifying all the common variables across the last two years of product surveys, and creating standard taxonomies of questions: branding, packaging, flavour, colour and so on. They have already identified many arbitrary discrepancies across the surveys in the demographic variables, rating scale batteries and scale answer lists used, and are working to standardise these in any new surveys.
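Harmonising the variants is mostly a mapping exercise: rename each survey’s local variable names onto the standard taxonomy (and remap any divergent scales in the same pass), then stack the studies into one table. A minimal sketch with pandas and invented variable names:

```python
import pandas as pd

# Hypothetical mapping from each survey's local variable names onto the
# standard taxonomy agreed by the team. Divergent answer scales would be
# remapped in the same step.
VARIABLE_MAP = {"Q4_taste": "flavour", "q12a": "flavour", "FLAV": "flavour"}

def harmonise(df: pd.DataFrame, survey_id: str) -> pd.DataFrame:
    """Rename variant columns to the standard taxonomy and tag the source
    survey, so studies can be stacked into one 'supersurvey' table."""
    out = df.rename(columns=VARIABLE_MAP)
    out["survey_id"] = survey_id
    return out

survey_a = pd.DataFrame({"Q4_taste": [5, 3, 4]})
survey_b = pd.DataFrame({"q12a": [2, 5, 1]})

supersurvey = pd.concat(
    [harmonise(survey_a, "A"), harmonise(survey_b, "B")],
    ignore_index=True,
)
print(supersurvey)
```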

Most of the variations are minor, and will not prevent the team from creating their ‘supersurvey’ database into which any new surveys will also be added. Jan fully expects this database to enable her team to answer many urgent research questions without the need to commission new research, and where new research is required, to be able to field shorter, more targeted surveys at considerably less cost.

Based on an article published in the Data Matters supplement to Research Magazine, February 2009.
