Cabela's Moves from CHAID to CART
by Kimberly Rengle
What do you do if your models are overestimating the level of response you'll get? That was exactly the problem that Cabela's, a cataloger of hunting, fishing and outdoor gear, faced at the end of 1996, when it determined that the models used for segmenting its mailing list were not stable and tended to overestimate expected performance.
Cabela's relies on its spring and fall master catalogs, as well as other promotions, for showcasing and selling its entire product line. Although the mailings were still profitable, the company decided to explore ways of combining data-mining methods to create more effective models. Cabela's made the move from a CHAID-based system to CART, which can handle continuous variables and more closely detail relationships in the data, says Lynn Karrick, Cabela's database marketing manager.
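To make the point about continuous variables concrete, here is a minimal sketch of the difference between pre-set buckets and a tree-style search for a split point. It is not Cabela's or Salford Systems' code: the data, the bucket edges and the function names are hypothetical, and the split criterion (minimizing squared error) is a common regression-tree choice assumed here for illustration.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_threshold(x, y):
    """Find the split x <= threshold that minimizes the total SSE of y.

    Returns (threshold, total_sse); threshold is None if no split helps.
    """
    pairs = sorted(zip(x, y))
    best = (None, sse(y))
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # cannot split between identical values
        left = [yy for _, yy in pairs[:i]]
        right = [yy for _, yy in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = ((lo + hi) / 2, total)
    return best


def bucket(value, edges=(30, 60, 90)):
    """Index of the pre-set bucket a value falls into (hypothetical edges)."""
    return sum(value > e for e in edges)


# Hypothetical data: amount of last purchase (dollars) and response (1 = ordered).
last_purchase = [12, 18, 25, 31, 47, 52, 68, 75, 90, 120]
responded = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

threshold, error = best_threshold(last_purchase, responded)
print("tree-style split at", threshold)
print("buckets for 47 and 52:", bucket(47), bucket(52))
```

In this made-up data the search places the split between $47 and $52, while the fixed 30-to-60 bucket puts a non-responder and a responder in the same group.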
Building Better Models
Cabela's manages its databases in-house and relies on historical customer data to determine mailing lists for specific catalogs and promotions. The database marketing team is tasked with extracting a 200,000-case subset from millions of records. This sample then serves as a foundation for building the predictive model and testing data that will determine recipients for the new mailing. Generally, Cabela's selects a customer database that targets a certain profitability level for the mailing, and the 1998 spring catalog was no exception. Unlike previous years, however, segmentation of this spring's database was further optimized using Salford Systems' CART regression tree software.
"I have always been able to construct good, profitable models for Cabela's, but our packaged system had data integrity problems and did not always give us stable performance within customer segments," Karrick says. "In addition, since our CHAID-based system could not handle continuous variables, such as a customer's amount of last purchase, our data was grouped into 'bucketed' segments that worked but weren't exactly robust."
Process for Success
To achieve a better model, Karrick combined his experience developing logistic-regression models with the fine-tuning data insights from CART models. The process began as it had before. Since Cabela's product composition is similar from catalog to catalog, the results from the previous year's mailing were used to determine a good cross-section of the buying population, and a 200,000-case sample was extracted.
The data set was then built and the cases were split in half. One half became the "learning" or "modeling" sample on which a logistic-regression model was created; the other half was set aside as a "test" or "validation" sample that would test how well the model would stand up to "new" data. First, about 250 variables were defined for the model (e.g., last date of purchase, total dollars spent, household income). In this pre-processing phase of the modeling, CART was used to examine any abnormalities in the variables.
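The half-and-half split described above can be sketched in a few lines. This is an illustration only: the case IDs, the `split_half` name and the fixed random seed are assumptions, and the article does not say how Cabela's assigned cases to each half.

```python
import random


def split_half(cases, seed=0):
    """Shuffle the cases and return (learning, validation) halves."""
    shuffled = list(cases)            # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


# Hypothetical case IDs standing in for the 200,000-case sample.
case_ids = list(range(200_000))
learning, validation = split_half(case_ids)
print(len(learning), len(validation))
```

Shuffling before cutting keeps any ordering in the extract (by date or customer number, for example) from leaking into one half only.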
CART, which modeled the 100,000-case learning sample in less than five minutes, built tree-structured models that helped Karrick transform the data to fit sensibly into the logistic-regression model. To build accurate customer profiles using CART, Karrick says, "I can just look at the ends of the tree branches and follow the characteristics up to the top. Even with complex tree models, the profiles I build can be easily understood by my non-technical colleagues."
Once the model was refined, Karrick explored the full set of variables using bootstrap sampling (aggregation), a technique that chooses random records from the modeling dataset and recasts the learning sample to highlight anomalies. Each of Karrick's bootstraps produced a listing of the variables and the number of times each variable occurred.
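A rough sketch of this kind of bootstrap tally follows. The article does not say how variables were chosen within each bootstrap, so the selection rule here (keep the variables most correlated with response) is a stand-in assumption, and the variable names and data are hypothetical.

```python
import random
from collections import Counter


def correlation(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def bootstrap_counts(rows, target, variables, top_k=2, rounds=50, seed=0):
    """Count how often each variable ranks in the top_k across bootstraps."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(rounds):
        # Draw as many records as the original sample, with replacement.
        sample = [rng.choice(rows) for _ in rows]
        ys = [r[target] for r in sample]
        ranked = sorted(
            variables,
            key=lambda v: abs(correlation([r[v] for r in sample], ys)),
            reverse=True,
        )
        counts.update(ranked[:top_k])
    return counts


# Hypothetical modeling sample: response depends only on total_spent.
data_rng = random.Random(1)
rows = []
for _ in range(200):
    spent = data_rng.uniform(0, 100)
    rows.append({
        "total_spent": spent,
        "days_since_purchase": data_rng.uniform(0, 365),
        "noise": data_rng.random(),
        "responded": 1 if spent > 50 else 0,
    })

variables = ["total_spent", "days_since_purchase", "noise"]
counts = bootstrap_counts(rows, "responded", variables)
for name, times in counts.most_common():
    print(name, times)
```

A variable that turns up in nearly every bootstrap is a stable candidate; one that appears only occasionally may be riding on a few unusual records.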
Karrick further explored these variables using CART analyses, especially when he saw inconsistencies or where he suspected important variables were not well-represented. CART's analyses of the variables were factored into the logistic-regression model, and each analysis method confirmed the results of the other, giving Karrick confidence in the stability of his model.
"CART helped me distill the 250 variables into the 10 most relevant ones," says Karrick. "In addition, I could learn a great deal more about the characteristics of each variable and get a better picture of our customer profiles"—such as the fact that all its best customers fall into a certain economic bracket, reside in certain regions or buy the same products.
Real Field Experience
Cabela's learned that even sophisticated hardware and software cannot replace the human factor. Karrick, who was born into a family of sportsmen that had been ordering from the Cabela's catalog since his childhood, used his knowledge of that market when he realized that one variable he was expecting—whether prospects were more likely to buy in the spring or the fall—was absent from his model.
"Much of the equipment we sell is very seasonal," says Karrick. "In the spring, our customers are ready for fly fishing, hiking, camping, boating and other warm-weather activities; in the fall comes deer-hunting season, snowshoeing, ice-fishing and other cold-weather activities. As a result, our data should have indicated a variable for customers who order in the first four months of the year." Karrick used CART to generate trees that shed more light on that variable. "My second analysis showed that seasonal buyers factor in largely, so I was able to refine my spring model to use a variable my initial analysis overlooked."
Making the Cut
The final step in building the Spring 1998 mailing-list model was to score the cases and to determine the cut-off point for contacts that would be sent the catalog and those that would not. In previous years' models, the final scores were segmented into 10 parts with an equal number of contacts in each, and a cut-off was determined. A shortcoming of that method was that low-performing segments showed little difference from one another: they were simply a large concentration of the buying population broken into more than one part. This raised the risk of excluding profitable contacts while including less-profitable ones. CART models were able to solve this challenge and define a good cut-off point.
Rather than making splits of an equal number in each segment, CART split the scores where there was a significant difference between performance within customer groups. This helped define points where the company could maintain its target profitability. The most important score was the one dividing those that would receive the mailing from those that would not.
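The contrast between equal-count segments and performance-based splits can be sketched as follows. This is an assumption-laden illustration, not CART itself: it uses recursive binary splits that reduce squared error in the outcome, with hypothetical scores, outcomes and parameter names (`min_size`, `min_gain`).

```python
def sse(values):
    """Sum of squared errors around the mean of a non-empty list."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def equal_count_segments(n_cases, n_segments=10):
    """Boundary positions that cut score-sorted cases into equal-sized parts."""
    return [round(i * n_cases / n_segments) for i in range(1, n_segments)]


def tree_cut_points(scored, min_size=2, min_gain=1e-9):
    """Score thresholds where outcomes differ, found by recursive splitting.

    scored is a list of (score, outcome) pairs.
    """
    ordered = sorted(scored)
    cuts = []

    def split(lo, hi):
        outcomes = [o for _, o in ordered[lo:hi]]
        base = sse(outcomes)
        best_gain, best_i = min_gain, None
        for i in range(lo + min_size, hi - min_size + 1):
            if ordered[i - 1][0] == ordered[i][0]:
                continue  # no threshold fits between identical scores
            gain = base - sse(outcomes[: i - lo]) - sse(outcomes[i - lo:])
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None:
            return  # nothing worth splitting in this range
        cuts.append((ordered[best_i - 1][0] + ordered[best_i][0]) / 2)
        split(lo, best_i)
        split(best_i, hi)

    split(0, len(ordered))
    return sorted(cuts)


# Hypothetical (score, profit per contact) pairs: six flat low scorers,
# then four clearly better contacts.
scored = [(10, 0), (20, 0), (30, 0), (40, 0), (50, 0), (60, 0),
          (70, 5), (80, 5), (90, 5), (100, 5)]

print("equal-count boundaries:", equal_count_segments(len(scored), 5))
print("performance-based cuts:", tree_cut_points(scored))
```

With this toy data the equal-count approach carves the six indistinguishable low scorers into several segments, while the performance-based approach returns a single cut between the two groups.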
Cabela's is still waiting for the final results on the success of its spring mailing model. However, Karrick plans to use CART for myriad analyses, including determining which buyers will purchase from certain product groups.
Kimberly Rengle is a San Diego-based writer specializing in data-mining and technology issues. For information on Salford Systems, contact Kerry Martin at (619) 543-8880, or visit www.salford-systems.com.