Data mining
Posted Date: 30 Apr 2008 Resource Type: Articles/Knowledge Sharing Category: Computer & Technology
Posted By: durga Member Level: Silver Rating: Points: 6
Data mining is the efficient discovery of non-obvious, valuable patterns in a large collection of data. Put another way, it is the automated extraction of hidden predictive information from databases.
A data warehouse is a repository for relevant business data. While traditional databases primarily store current operational data, data warehouses consolidate data from multiple operational and external sources in order to attain an accurate, consolidated view of customers and the business.
Data Mining uses technologies such as neural networks, decision trees or standard statistical techniques to search large volumes of data. In doing so, Data Mining builds models for patterns that accurately predict customer behavior. Scoring uses a model to predict future behavior. The score assigned to each individual in a database indicates that person’s likelihood of exhibiting a particular customer behavior. Campaign Management uses information in a data warehouse or marketing database to plan, manage and assess marketing campaigns designed to impact customer behavior. A customer segment is a group of prospects or customers who are selected from a database based on characteristics they possess or exhibit. Scoring on the fly or dynamic scoring is the ability to score an already-defined customer segment within a campaign-management tool. Rather than scoring an entire database, dynamic scoring works with only the required customer subsets, and only when needed.
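The definitions above (scoring, customer segment, dynamic scoring) can be made concrete with a small sketch. Everything named here is an assumption for illustration: the customer fields, the logistic-style model, and its weights (`WEIGHTS`, `BIAS`) are invented and do not come from any product mentioned in this article. The sketch selects a segment by its characteristics and computes a score only for the members of that segment.

```python
import math

# Hypothetical customer records; field names and values are illustrative only.
CUSTOMERS = [
    {"id": 1, "age": 29, "region": "northeast", "monthly_spend": 120.0, "support_calls": 4},
    {"id": 2, "age": 41, "region": "south", "monthly_spend": 60.0, "support_calls": 0},
    {"id": 3, "age": 33, "region": "northeast", "monthly_spend": 45.0, "support_calls": 7},
    {"id": 4, "age": 52, "region": "northeast", "monthly_spend": 200.0, "support_calls": 1},
]

# Assumed model: a logistic function over two inputs. The weights are invented;
# a real model would be produced by a data mining tool from historical data.
WEIGHTS = {"monthly_spend": -0.01, "support_calls": 0.5}
BIAS = -1.0


def score(record):
    """Return a value between 0 and 1: the modeled likelihood of attrition."""
    z = BIAS + sum(weight * record[name] for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def select_segment(records, predicate):
    """A customer segment: the records that share the chosen characteristics."""
    return [r for r in records if predicate(r)]


def score_segment(records, predicate):
    """Dynamic scoring: score only the segment, not the entire database."""
    return {r["id"]: score(r) for r in select_segment(records, predicate)}


# Segment: customers in the northeast who are under 40.
segment_scores = score_segment(
    CUSTOMERS, lambda r: r["region"] == "northeast" and r["age"] < 40
)
```

Only customers 1 and 3 are scored here. Customers outside the segment are never passed to the model, which is the saving that dynamic scoring is meant to deliver.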
Embedded Data Mining. An implementation of data mining where the data mining algorithms are embedded into existing data stores and information delivery processes rather than requiring data extraction and new data stores.
Database Management System (DBMS). A software system that controls and manages the data to eliminate data redundancy and to ensure data integrity, consistency and availability, among other features.
The Foundations of Data Mining
Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently, generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:
• Massive data collection
• Powerful multiprocessor computers
• Data mining algorithms
Commercial databases are growing at unprecedented rates. A recent META Group survey of data warehouse projects found that 19% of respondents are beyond the 50 gigabyte level, while 59% expect to be there by the second quarter of 1996. In some industries, such as retail, these numbers can be much larger. The accompanying need for improved computational engines can now be met in a cost-effective manner with parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed for at least 10 years, but have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical methods.
In the evolution from business data to business information, each new step has built upon the previous one.
For example, dynamic data access is critical for drill-through in data navigation applications, and the ability to store large databases is critical to data mining. From the user’s point of view, the four steps listed in Table 1 were revolutionary because they allowed new business questions to be answered accurately and quickly.
| Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics |
| --- | --- | --- | --- | --- |
| Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery |
| Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level |
| Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Pilot, Comshare, Arbor, Cognos, Microstrategy | Retrospective, dynamic data delivery at multiple levels |
| Data Mining (Emerging Today) | "What’s likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry) | Prospective, proactive information delivery |
Table 1. Steps in the Evolution of Data Mining.
The core components of data mining technology have been under development for decades, in research areas such as statistics, artificial intelligence, and machine learning. Today, the maturity of these techniques, coupled with high-performance relational database engines and broad data integration efforts, makes these technologies practical for current data warehouse environments.
Data Mining defined
Data Mining, by its simplest definition, automates the detection of relevant patterns in a database. For example, a pattern might indicate that married males with children are twice as likely to drive a particular sports car as married males with no children. If you are a marketing manager for an auto manufacturer, this somewhat surprising pattern might be quite valuable.
However, Data Mining is not magic. For many years, statisticians have manually "mined" databases looking for statistically significant patterns. Today, Data Mining uses well-established statistical and machine learning techniques to build models that predict customer behavior. The technology enhances the procedure by automating the mining process, integrating it with commercial data warehouses, and presenting it in a relevant way for business users.
The leading Data Mining products, such as those from companies like SAS and IBM, are now more than just modeling engines employing powerful algorithms. Instead, they address the broader business and technical issues, such as their integration into today’s complex information technology environments.
In the past, the hyperbole surrounding Data Mining suggested that it would eliminate the need for statistical analysts to build predictive models. However, the value that an analyst provides cannot be automated out of existence. Analysts will still be needed to assess model results and validate the reasonability of the model predictions. Since Data Mining software lacks the human experience and intuition to recognize the difference between a relevant and an irrelevant correlation, statistical analysts will remain in high demand.
The purpose of Data Mining
Data Mining helps marketing professionals improve their understanding of customer behavior. In turn, this better understanding allows them to target marketing campaigns more accurately and to align campaigns more closely with the needs, wants and attitudes of customers and prospects.
If the necessary information exists in a database, the Data Mining process can model virtually any customer activity. The key is to find patterns relevant to current business problems. Typical questions that Data Mining answers include:
• Which customers are most likely to drop their cell-phone service?
• What is the probability that a customer will purchase at least $100 worth of merchandise from a particular mail-order catalog?
• Which prospects are most likely to respond to a particular offer?
Answers to these questions can help retain customers and increase campaign response rates, which, in turn, increase buying, cross-selling and return on investment (ROI).
Scoring the model
Data Mining builds models by using inputs from a database to predict customer behavior. This behavior might be attrition at the end of a magazine subscription, cross-product purchasing, willingness to use an ATM card in place of a more expensive teller transaction, and so on. The prediction provided by a model is usually called a score. A score (typically a numerical value) is assigned to each record in the database and indicates the likelihood that the customer whose record has been scored will exhibit a particular behavior. For example, if a model predicts customer attrition, a high score indicates that a customer is likely to leave, while a low score indicates the opposite. After a set of customers has been scored, these numerical values are used to select the most appropriate prospects for a targeted marketing campaign.
The role of Campaign Management software
Database marketing software enables companies to deliver to customers and prospects timely, pertinent, and coordinated messages and value propositions (offers or gifts perceived as valuable). Today’s Campaign Management software goes considerably further. It manages and monitors customer communications across multiple touch-points, such as direct mail, telemarketing, customer service, point-of-sale, e-mail and the Web. Campaign Management automates and integrates the planning, execution, assessment and refinement of possibly tens to hundreds of highly segmented campaigns running monthly, weekly, daily or intermittently.
The software can also run campaigns that are triggered in response to customer behavior or milestones – such as the opening of a new account.
Increasing customer lifetime value
Consider, for example, customers of a bank who only use the institution for a checking account. An analysis reveals that after depositing large annual income bonuses, some customers wait for their funds to clear before moving the money quickly into their stock-brokerage or mutual fund accounts outside the bank. This represents a loss of business for the bank. To persuade these customers to keep their money in the bank, marketing managers can use Campaign Management software to immediately identify large deposits and trigger a response. The system might automatically schedule a direct mail or telemarketing promotion as soon as a customer’s balance exceeds a predetermined amount. Based on the size of the deposit, the triggered promotion can then provide an appropriate incentive that encourages customers to invest their money in the bank’s other products. Finally, by tracking responses and following rules for attributing customer behavior, the Campaign Management software can help measure the profitability and ROI of all ongoing campaigns.
Integrating Data Mining and Campaign Management
The closer Data Mining and Campaign Management work together, the better the business results. Today, Campaign Management software uses the scores generated by the Data Mining model to sharpen the focus of targeted customers or prospects, thereby increasing response rates and campaign effectiveness. Unfortunately, the use of a model within Campaign Management today is often a manual, time-intensive process. When someone in marketing wants to run a campaign that uses model scores, he or she usually calls someone in the modeling group to get a file containing the database scores.
With the file in hand, the marketer must then solicit the help of someone in the information technology group to merge the scores with the marketing database. This disjointed process is fraught with problems:
• The large numbers of campaigns that run on a daily or weekly basis can be difficult to schedule and can swamp the available resources.
• The process is error prone; it is easy to score the wrong database or the wrong fields in a database.
• Scoring is typically very inefficient. Entire databases are usually scored, not just the segments defined for the campaign. Not only is effort wasted, but the manual process may also be too slow to keep up with campaigns run weekly or daily.
The solution to these problems is the tight integration of Data Mining and Campaign Management technologies. Integration is crucial in two areas:
First, the Campaign Management software must share the definition of the defined campaign segment with the Data Mining application to avoid modeling the entire database. For example, a marketer may define a campaign segment of high-income males between the ages of 25 and 35 living in the northeast. Through the integration of the two applications, the Data Mining application can automatically restrict its analysis to database records containing just those characteristics.
Second, selected scores from the resulting predictive model must flow seamlessly into the campaign segment in order to form targets with the highest profit potential.
The integrated Data Mining and Campaign Management process
This section examines how to apply the integration of Data Mining and Campaign Management to benefit the organization. The first step creates a model using a Data Mining tool. The second step takes this model and puts it to use in the production environment of an automated database marketing campaign.
Step 1: Creating the model
An analyst or user with a background in modeling creates a predictive model using the Data Mining application.
This modeling is usually completely separate from campaign creation. The complexity of the model creation typically depends on many factors, including database size, the number of variables known about each customer, the kind of Data Mining algorithms used and the modeler’s experience. Interaction with the Campaign Management software begins when a model of sufficient quality has been found. At this point, the Data Mining user exports his or her model to a Campaign Management application, which can be as simple as dragging and dropping the data from one application to the other. This process of exporting a model tells the Campaign Management software that the model exists and is available for later use.
Step 2: Dynamically scoring the data
Dynamic scoring allows you to score an already-defined customer segment within your Campaign Management tool rather than in the Data Mining tool. Dynamic scoring both avoids mundane, repetitive manual chores and eliminates the need to score an entire database. Instead, dynamic scoring marks only relevant customer subsets and only when needed. Scoring only the relevant customer subset and eliminating the manual process shrinks cycle times. Scoring data only when needed assures "fresh," up-to-date results.
Once a model is in the Campaign Management system, a user (usually someone other than the person who created the model) can start to build marketing campaigns using the predictive models. Models are invoked by the Campaign Management System. When a marketing campaign invokes a specific predictive model to perform dynamic scoring, the output is usually stored as a temporary score table. When the score table is available in the data warehouse, the Data Mining engine notifies the Campaign Management system and the marketing campaign execution continues.
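The invocation flow just described (a campaign invokes a model, scores land in a temporary score table, the campaign is notified and continues) can be sketched with an in-memory SQLite database standing in for the data warehouse. This is a sketch under stated assumptions, not how any particular Campaign Management product works: the table and column names, the `churn_model` function and its coefficients, and the `on_ready` callback are all hypothetical.

```python
import sqlite3


def churn_model(balance, months_active):
    """Hypothetical stand-in for an exported predictive model (invented coefficients)."""
    raw = 0.8 - 0.0001 * balance - 0.01 * months_active
    return max(0.0, min(1.0, raw))


def run_dynamic_scoring(conn, segment_where, model, on_ready):
    """Score only the rows in the campaign segment and store them in a temp table.

    segment_where is assumed to be a trusted SQL condition produced by the
    campaign tool itself; it must never be built from untrusted input.
    """
    conn.execute("DROP TABLE IF EXISTS temp_scores")
    conn.execute(
        "CREATE TEMP TABLE temp_scores (customer_id INTEGER PRIMARY KEY, score REAL)"
    )
    rows = conn.execute(
        "SELECT id, balance, months_active FROM customers WHERE " + segment_where
    ).fetchall()
    conn.executemany(
        "INSERT INTO temp_scores VALUES (?, ?)",
        [(cid, model(balance, months)) for cid, balance, months in rows],
    )
    # Stands in for the mining engine notifying the campaign system.
    on_ready(len(rows))


# A tiny stand-in for the marketing database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers "
    "(id INTEGER PRIMARY KEY, region TEXT, balance REAL, months_active INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "northeast", 500.0, 3),
        (2, "west", 9000.0, 48),
        (3, "northeast", 2500.0, 24),
    ],
)

notifications = []
run_dynamic_scoring(conn, "region = 'northeast'", churn_model, notifications.append)

# The campaign continues: pick targets from the temporary score table.
targets = conn.execute(
    "SELECT customer_id FROM temp_scores WHERE score >= 0.5 ORDER BY score DESC"
).fetchall()
```

The customer outside the segment (id 2) is never scored, and the score table exists only for the life of the connection, so each campaign run works from freshly computed scores.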
The Scope of Data Mining
Data mining derives its name from the similarities between searching for valuable business information in a large database (for example, finding linked products in gigabytes of store scanner data) and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:
• Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly and quickly from the data. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.
• Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.
Data mining techniques can yield the benefits of automation on existing software and hardware platforms, and can be implemented on new systems as existing platforms are upgraded and new products developed. When data mining tools are implemented on high performance parallel processing systems, they can analyze massive databases in minutes.
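The pattern-discovery example above (products that are often purchased together) can be illustrated with a simple pair count over shopping baskets. This is a minimal sketch, not the method of any tool named in this article: the basket contents are invented, and `frequent_pairs` with its `min_support` threshold is a hypothetical helper. Production tools use more scalable algorithms, but the underlying idea of counting co-occurrences is the same.

```python
from collections import Counter
from itertools import combinations

# Invented store-scanner baskets; each set is one purchase.
BASKETS = [
    {"bread", "milk", "diapers"},
    {"beer", "diapers", "chips"},
    {"bread", "milk"},
    {"beer", "diapers", "milk"},
    {"bread", "beer", "diapers"},
]


def frequent_pairs(baskets, min_support):
    """Return item pairs whose share of baskets (support) is at least min_support."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(baskets)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }


pairs = frequent_pairs(BASKETS, min_support=0.6)
```

With this toy data, the only pair that appears in at least 60% of baskets is beer and diapers (three of five baskets).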
Faster processing means that users can automatically experiment with more models to understand complex data. High speed makes it practical for users to analyze huge quantities of data. Larger databases, in turn, yield improved predictions.
The most commonly used techniques in data mining are:
• Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
• Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).
• Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
• Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
• Rule induction: The extraction of useful if-then rules from data based on statistical significance.
• Data visualization: The visual interpretation of complex relationships in multidimensional data.
Many of these technologies have been in use for more than a decade in specialized analysis tools that work with relatively small volumes of data. These capabilities are now evolving to integrate directly with industry-standard data warehouse and OLAP platforms. The appendix to this white paper provides a glossary of data mining terms.
An Architecture for Data Mining
To best apply these advanced techniques, they must be fully integrated with a data warehouse as well as flexible interactive business analysis tools. Many data mining tools currently operate outside of the warehouse, requiring extra steps for extracting, importing, and analyzing the data.
Furthermore, when new insights require operational implementation, integration with the warehouse simplifies the application of results from data mining. The resulting analytic data warehouse can be applied to improve business processes throughout the organization, in areas such as promotional campaign management, fraud detection, new product rollout, and so on. Figure 1 illustrates an architecture for advanced analysis in a large data warehouse. Figure 1 - Integrated Data Mining Architecture The ideal starting point is a data warehouse containing a combination of internal data tracking all customer contact coupled with external market data about competitor activity. Background information on potential customers also provides an excellent basis for prospecting. This warehouse can be implemented in a variety of relational database systems: Sybase, Oracle, Redbrick, and so on, and should be optimized for flexible and fast data access. An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse. The multidimensional structures allow the user to analyze the data as they want to view their business – summarizing by product line, region, and other key perspectives of their business. The Data Mining Server must be integrated with the data warehouse and the OLAP server to embed ROI-focused business analysis directly into this infrastructure. An advanced, process-centric metadata template defines the data mining objectives for specific business issues like campaign management, prospecting, and promotion optimization. Integration with the data warehouse enables operational decisions to be directly implemented and tracked. As the warehouse grows with new decisions and results, the organization can continually mine the best practices and apply them to future decisions. This design represents a fundamental shift from conventional decision support systems. 
Rather than simply delivering data to the end user through query and reporting software, the Advanced Analysis Server applies users’ business models directly to the warehouse and returns a proactive analysis of the most relevant information. These results enhance the metadata in the OLAP Server by providing a dynamic metadata layer that represents a distilled view of the data. Reporting, visualization, and other analysis tools can then be applied to plan future actions and confirm the impact of those plans.
Two critical factors for success with data mining are: a large, well-integrated data warehouse and a well-defined understanding of the business process within which data mining is to be applied (such as customer prospecting, retention, campaign management, and so on). Some successful application areas include:
• A pharmaceutical company can analyze its recent sales force activity and their results to improve targeting of high-value physicians and determine which marketing activities will have the greatest impact in the next few months. The data needs to include competitor market activity as well as information about the local health care systems. The results can be distributed to the sales force via a wide-area network that enables the representatives to review the recommendations from the perspective of the key attributes in the decision process. The ongoing, dynamic analysis of the data warehouse allows best practices from throughout the organization to be applied in specific sales situations.
• A credit card company can leverage its vast warehouse of customer transaction data to identify customers most likely to be interested in a new credit product. Using a small test mailing, the attributes of customers with an affinity for the product can be identified. Recent projects have indicated more than a 20-fold decrease in costs for targeted mailing campaigns over conventional approaches.
• A diversified transportation company with a large direct sales force can apply data mining to identify the best prospects for its services. Using data mining to analyze its own customer experience, this company can build a unique segmentation identifying the attributes of high-value prospects. Applying this segmentation to a general business database such as those provided by Dun & Bradstreet can yield a prioritized list of prospects by region.
• A large consumer package goods company can apply data mining to improve its sales process to retailers. Data from consumer panels, shipments, and competitor activity can be applied to understand the reasons for brand and store switching. Through this analysis, the manufacturer can select promotional strategies that best reach their target customer segments.
Each of these examples has a clear common ground. They leverage the knowledge about customers implicit in a data warehouse to reduce costs and improve the value of customer relationships. These organizations can now focus their efforts on the most important (profitable) customers and prospects, and design targeted marketing strategies to best reach them.
The following key examples provide a flavor of the interest, activities, and opportunities we encountered:
• Pilot Software and DIG are exploring the potential of data mining with Pilot's state-of-the-art multidimensional database product line. A multidimensional database is one that has been designed for on-line analytical processing and is structured as a multidimensional hypercube. On-line analytical processing (OLAP) refers to array-oriented database applications that allow users to view, navigate through, manipulate, and analyze multidimensional databases. This combination of technologies is expected to give Pilot a unique market advantage in meeting customers' demands for more and better information.
• DBIS units around the world are constantly exploring ways to provide customers with more and better business information. In the Analytical Services group at DBIS North America, analysts work with data mining techniques such as CART and CHAID, and actively pursue research in the theory and applications of neural networks. These analysts are investigating ways to use data mining to automate more of the repetitive, people-intensive portions of analytic tasks. Increased automation would allow the group to bring products to market more quickly, cutting costs and increasing revenues. It would also free analysts to focus on aspects of their work that cannot be automated.
• In Europe, DBIS analysts have been investigating PC-based data mining systems. They see data mining as a technology that can expand the scope of existing tools, bringing advanced analysis capabilities directly to customers. DBIS Japan, in its new alliance with Tokyo Shoko Research (TSR), is also providing new advanced technology for information management and analysis. The Data Intelligence Group (DIG) has been invited to present opportunities for collaboration with TSR in data mining projects.
• Data mining technology is a next step in the evolution of A. C. Nielsen's current software products and services. For example, data mining is seen as an essential component of the new solutions being developed by Nielsen and Pilot Software, Inc. in Europe. Data mining techniques could also open up new revenue opportunities by adding to the analytic capabilities of such Nielsen products as Opportunity Explorer, Promotion Simulator, and Nielsen Spotlight®. Because data mining tools perform best on very large databases, they are a natural candidate for adding value to Nielsen's new large data engines. Analysts at Nielsen have been investigating data mining technologies offered by IBM, Lockheed, Triada, and Pattern Associates. They are eager to find ways to detect anomalous data that can lead to distortions in analyses.
• IMS analysts see data mining as a way to explore their high-dimensional databases to help clients better understand complex purchasing behavior. Analysts are investigating data mining products offered by third parties. DIG is participating in a project at IMS America to evaluate approaches in analyzing the volatility of physicians' prescribing patterns. IMS America analysts have also developed a neural network system for identifying defective data that can distort projections; the system is successful and in use. In Europe, IMS analysts use data visualization tools to provide customized analytic services for clients. Data visualization allows IMS to discover previously unknown facts and to confirm hypotheses about the pharmaceuticals market.
• At Moody's, the Public Finance Department is excited about the possibility of using data mining to deliver advanced tools to its own analysts' desktops. For example, data mining tools such as CART can assist analysts by identifying bonds that are likely to require a review. Data mining tools can also aid in socioeconomic forecasting and in selecting variables for comparing bond issuers. The Corporate Department has researched data mining techniques, and several of its analysts perform econometric modeling and risk analysis using neural networks, k-nearest-neighbor techniques, and genetic algorithms.
We see distinct possibilities for synergy among D&B units in the development and application of data mining technology. While each unit has unique business goals, markets, and customer problems, many different problems can be addressed using similar core data mining technologies. For example, a classification tool such as CART can be used equally well to identify municipal bonds whose underlying ratings criteria have changed significantly, or to identify physicians whose prescription-writing patterns have changed.
A tool that selects the best variables to use in creating a credit-scoring model for the telecommunications industry could equally well select the variables to use in comparing performance profiles of retail sales channels.
OLAP. On-line analytical processing. Refers to array-oriented database applications that allow users to view, navigate through, manipulate, and analyze multidimensional databases.
Data Mining Can Bring Pinpoint Accuracy to Sales
Two popular types of applications that leverage companies' investments in data warehousing are data mining and campaign management software. Data mining enables companies to identify trends within the data warehouse (such as "families with teenagers are likely to have two phone lines," in the case of a telephone company's data). Campaign management software enables them to leverage these trends via highly targeted and automated direct marketing campaigns (such as a telemarketing campaign intended to sell second phone lines to families with teenagers). Data mining and campaign management have been successfully deployed by hundreds of Fortune 1000 companies around the world, with impressive results. But recent advances in technology have enabled companies to couple these technologies more tightly, with the following benefits: increased speed with which they can plan and execute marketing campaigns; increased accuracy and response rates of campaigns; and higher overall marketing return on investment. Data mining automates the detection of patterns in a database and helps marketing professionals improve their understanding of customer behavior, and then predict behavior. For example, a pattern might indicate that married males with children are twice as likely to drive a particular sports car than married males with no children. A marketing manager for an auto manufacturer might find this somewhat surprising pattern quite valuable. The data mining process can model virtually any customer activity. The key is to find patterns relevant to current business problems. Typical patterns that data mining uncovers include which customers are most likely to drop a service, which are likely to purchase merchandise or services, and which are most likely to respond to a particular offer. The data mining process results in the creation of a model. A model embodies the discovered patterns and can be used to make predictions for records for which the true behavior is unknown. 
These predictions, usually called scores, are numerical values that are assigned to each record in the database and indicate the likelihood that the customer will exhibit a particular behavior. These numerical values are used to select the most appropriate prospects for a targeted marketing campaign.
Campaign management and data mining, when closely integrated, are potent tools. Campaign management software enables companies to deliver to customers and prospects timely, pertinent, and coordinated offers, and also manages and monitors customer communications across all channels. In addition, it automates and integrates the planning, execution, assessment and refinement of possibly tens to hundreds of highly segmented campaigns running monthly, weekly, daily or intermittently.
Unfortunately, for most companies today, the use of data mining models within campaign management is a manual, time-intensive process. When a marketer wants to run a campaign based on model scores, he or she has to call a modeler (usually a statistician) to have a model run against a database so that a score file can be created. The marketer then has to solicit the help of an IT staffer to merge the scores with the marketing database. This disjointed process is fraught with problems and errors and can take weeks. Often, by the time the models are integrated with the database, either the models are outdated or the campaign opportunity has passed.
The solution is the tight integration of data mining and campaign management technologies. Under this scenario, marketers can invoke statistical models from within the campaign management application, score customer segments on the fly, and quickly create campaigns targeted to customer segments offering the greatest potential. Here is how it works:
Step 1: Creating the Model
A modeler creates a predictive model using the data mining application.
He or she then exports the model to a campaign management application, possibly by simply dragging and dropping the data from one application to the other. This process of exporting a model tells the campaign management software that the model exists and is available for later use.
Step 2: Dynamically Scoring the Data
Once a model has been put into the campaign management system, marketers can reference the model's score just as they would reference any other piece of data. Records can be selected based on the score, in conjunction with other characteristics in the data. When the campaign is run, the records in the database are scored dynamically using the model. Dynamic scoring avoids manual integration of scores with the database, and eliminates the need to score an entire database. Instead, dynamic scoring marks only relevant customer subsets and only when needed. This shrinks marketing cycle times and assures fresh, up-to-date results. Once a model is in the campaign management system, the user can start to build marketing campaigns based upon it simply by choosing it from a menu of options.
Any company that is creating or has created a data warehouse should be considering the use of integrated data mining and campaign management applications, which unlock the data and put it to use. By discovering customer behavior patterns and then acting upon them quickly, companies can stave off competition and increase customer retention, cross-selling and up-selling, all of which ultimately contribute to higher overall revenues.
Who is Developing the Technology?
Researchers, primarily in the fields of computer science and statistics, have been responsible for the development of most of the data mining technology currently available. From a business standpoint, this has been a problem since (academic) researchers are good at developing and evaluating data mining technologies, but they tend to get caught up in minute details of the technology.
They are not interested (nor should they be) in the fact that the core technology is only a small part of delivering a business solution, and that compromises must be made in order to deliver a usable piece of software.
Another group of data mining researchers are what I call "downsized data miners." These are people, primarily with research backgrounds, who worked on data mining research until cutbacks and company downsizing forced them into product development. When downsized data miners develop software, the end product is usually a complex tool (as opposed to a problem-solving application) or an intermediate software product. Lately some downsized data miners have claimed that they will be deploying business solutions; however, most of their software is currently in some form of pre-release (Beta, Alpha, even pre-alpha!). These complex data mining tools compete with other high-end analysis tools (e.g., SAS or S-Plus) that require users to have sophisticated skills. Ultimately very few of these researchers will directly impact the development of database marketing as a business solution.
On the other side of the coin from the researchers are the developers who are trying to create database marketing software applications for business users. Unlike data mining tools, these applications do not require users to know how to set up statistical experiments or build data models. The developers of database marketing applications start with the business problems and try to determine whether some piece of data mining technology might be useful in solving the problem. The technology associated with a data mining software application, just one small part of the overall product, will be built using techniques developed by researchers. Although current software products could be more sophisticated, the future for these software companies is the future of data mining.
A Possible Scenario for the Future of Data Mining
What does the future have in store for data mining?
In the end, much of what is called data mining will likely end up as standard tools built into database or data warehouse software products. As a motivation for this statement, I would like to use the field of spell checking software as an example.
Just look back ten years to the infancy of computer word processing. Many companies made spell checking software. You would usually buy a spell checker as a separate piece of software for use with whatever word processor you might have. Sometimes the spell checker wouldn't understand a particular word processor's file format. Some spell checkers might even have required you to dump your document as an ASCII file before they would check the spelling (on the ASCII file). In that case, you would have had to manually make corrections in the original document. Eventually the spell checkers became more user friendly and understood every possible document format. Functionality also increased. The future of spell checking probably looked pretty rosy.
So, where are the spell checking companies today? Where is the spell checking software? If you look at your local computer store you won't find much there. Instead you will find that your new word processor comes with a built-in spell checker. As word processor software increased in sophistication and functionality, it was a natural progression to include spell checking in the standard system.
The future of data mining may very well parallel the history of spell checking. The functionality of database marketing products will increase to integrate with relational database products (no more dumping an RDBMS into a flat file!) and with key DSS application environments. These products will stress the business problem rather than the technology, and present the process to the user in a friendly manner. Database marketing will start losing some of the hype and begin to provide real value to users. This will make database marketing an important business in and of itself.
The larger RDBMS and data warehouse companies have already expressed an interest in integrating data mining into their database products. In the end, this new market and its business opportunities will drive mainstream database companies to database marketing. Ten years from now there may be only a few independent data mining companies left in existence. The real survivors will likely be the ones with the foresight to develop a strong relationship with the mainstream database industry.
Data Mining and Campaign Management in the Real World
Ideally, marketers who build campaigns should be able to apply any model logged in the Campaign Management system to a defined target segment. For example, a marketing manager at a cellular telephone company might be interested in high-value customers likely to switch to another carrier. This segment might be defined as customers who are nine months into a twelve-month contract, and whose average monthly balance is more than $150. The easiest approach to retain these customers is to offer all of them a new high-tech telephone. However, this is expensive and wasteful since many customers would remain loyal without any incentive. Instead, to reduce costs and improve results, the marketer could use a predictive model to select only those valuable customers who would likely defect to a competitor unless they receive the offer.
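The cellular example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Customer` fields, the 0.5 cut-off, and `toy_churn_model` are hypothetical stand-ins, not part of any real campaign management product. A real model would come from the data mining application.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Customer:
    # Hypothetical record layout, invented for this illustration.
    customer_id: str
    months_into_contract: int
    contract_length_months: int
    avg_monthly_balance: float


def in_segment(c: Customer) -> bool:
    """Segment from the text: nine months into a twelve-month
    contract, average monthly balance above $150."""
    return (
        c.contract_length_months == 12
        and c.months_into_contract == 9
        and c.avg_monthly_balance > 150.0
    )


def select_for_offer(
    customers: Iterable[Customer],
    churn_model: Callable[[Customer], float],
    threshold: float = 0.5,  # assumed cut-off, not from the article
) -> List[Customer]:
    """Score only the segment members, then keep the likely defectors."""
    segment = [c for c in customers if in_segment(c)]
    return [c for c in segment if churn_model(c) >= threshold]


def toy_churn_model(c: Customer) -> float:
    # Stand-in for a real predictive model: returns a score in [0, 1].
    return min(1.0, c.avg_monthly_balance / 400.0)


customers = [
    Customer("A", 9, 12, 200.0),
    Customer("B", 9, 12, 160.0),
    Customer("C", 3, 12, 300.0),  # outside the segment
    Customer("D", 9, 12, 320.0),
]
selected = select_for_offer(customers, toy_churn_model)
print([c.customer_id for c in selected])
```

Only customers inside the segment are ever scored, which is the idea behind dynamic scoring: the model runs on the relevant subset, not the whole database.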
Evaluating the Benefits of a Data Mining Model
The chart to the left, called a "gains chart," suggests some benefits available through Data Mining. The diagonal line illustrates the number of responses expected from a randomly selected target audience. Under this scenario, the number of responses grows linearly with the target size. The top curve represents the expected response if you allow the model scores to determine the target audience. The target is now likely to include more positive responders than a random selection of the same size. The shaded area between the curve and the line indicates the quality of the model. The steeper the curve, the better the model.
Other representations of the model often incorporate expected costs and expected revenues to provide the most important measure of model quality: profitability. A profitability graph like the one shown below can help determine how many prospects to include in a campaign. In this example, it is easy to see that contacting all customers will result in a net loss. However, selecting a threshold score of approximately 0.8 will maximize profitability.
For a closer look at how the use of model scores can improve profitability, consider an example campaign with the following assumptions:
• Database size: 2,000,000
• Maximum possible response: 40,000
• Cost to reach one customer: $1.00
• Profit margin from a positive response: $40.00
As the table below shows, a random sampling of the full customer/prospect database produces a loss regardless of the campaign target size. However, by targeting customers using a Data Mining model, the marketer can select a smaller target that includes a higher percentage of good prospects. This more focused approach generates a profit until the target becomes too large and includes too many poor prospects.
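The arithmetic behind this example can be checked with a short Python sketch. The four assumptions come from the list above. The targeted response counts are the ones reported in the article's table, not something the code derives, and the function names are invented for illustration.

```python
# Campaign assumptions from the article.
DATABASE_SIZE = 2_000_000
MAX_RESPONSES = 40_000
COST_PER_CONTACT = 1.00
MARGIN_PER_RESPONSE = 40.00

# Targeted response counts as reported in the article's table
# (campaign size -> responses when selecting by model score).
TARGETED_RESPONSES = {
    100_000: 4_000,
    400_000: 30_000,
    1_000_000: 35_000,
    2_000_000: 40_000,
}


def random_responses(target_size: int) -> int:
    """Expected responses from a random sample: the overall response
    rate (40,000 / 2,000,000 = 2%) times the number contacted."""
    return target_size * MAX_RESPONSES // DATABASE_SIZE


def net_profit(target_size: int, responses: int) -> float:
    """Revenue from responders minus the cost of contacting everyone."""
    return responses * MARGIN_PER_RESPONSE - target_size * COST_PER_CONTACT


for size, targeted in TARGETED_RESPONSES.items():
    rnd_net = net_profit(size, random_responses(size))
    tgt_net = net_profit(size, targeted)
    print(f"{size:>9,}  random net: {rnd_net:>12,.0f}  targeted net: {tgt_net:>12,.0f}")
```

Random selection loses money at every size because each contact costs $1.00 but returns on average only 2% × $40.00 = $0.80. A model helps only to the extent that it raises the response rate within the selected target above the 2.5% break-even rate.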
Campaign Size | Cost | Random Selection: Response | Revenue | Net | Targeted Selection: Response | Revenue | Net
100,000 | $100,000 | 2,000 | $80,000 | ($20,000) | 4,000 | $160,000 | $60,000
400,000 | $400,000 | 8,000 | $320,000 | ($80,000) | 30,000 | $1,200,000 | $800,000
1,000,000 | $1,000,000 | 20,000 | $800,000 | ($200,000) | 35,000 | $1,400,000 | $400,000
2,000,000 | $2,000,000 | 40,000 | $1,600,000 | ($400,000) | 40,000 | $1,600,000 | ($400,000)
Conclusion: The Benefits of Integrating Data Mining and Campaign Management
For marketers:
• Improved campaign results through the use of model scores that further refine customer and prospect segments. Records can be scored when campaigns are ready to run, allowing the use of the most recent data. "Fresh" data and the selection of "high" scores within defined market segments improve direct marketing results.
• Accelerated marketing cycle times that reduce costs and increase the likelihood of reaching customers and prospects before competitors. Scoring takes place only for records defined by the customer segment, eliminating the need to score an entire database. This is important to keep pace with continuously running marketing campaigns with tight cycle times. Accelerated marketing "velocity" also increases the number of opportunities to refine and improve campaigns. The end of each campaign cycle presents another chance to assess results and improve future campaigns.
• Increased accuracy through the elimination of manually induced errors. The Campaign Management software determines which records to score and when.
For statisticians:
• Less time spent on the mundane tasks of extracting and importing files, leaving more time for creative work: building and interpreting models. Statisticians have a greater impact on the corporate bottom line.
Scoring Your Customers
1. Introduction
Once a model has been created by a data mining application, the model can then be used to make predictions for new data.
The process of using the model is distinct from the process that creates the model. Typically, a model is used multiple times after it is created to score different databases. For example, consider a model that has been created to predict the probability that a customer will purchase something from a catalog if it is sent to them. The model would be built by using historical data from customers and prospects that were sent catalogs, as well as information about what they bought (if anything) from the catalogs. During the model-building process, the data mining application would use information about the existing customers to build and validate the model. In the end, the result is a model that would take details about the customers (or prospects) as inputs and generate a number between 0 and 1 as the output. This process is illustrated below:
After a model has been created based on historical data, it can then be applied to new data in order to make predictions about unseen behavior. This is what data mining (and more generally, predictive modeling) is all about. The process of using a model to make predictions about behavior that has yet to happen is called "scoring." The output of the model, the prediction, is called a score. Scores can take just about any form, from numbers to strings to entire data structures, but the most common scores are numbers (for example, the probability of responding to a particular promotional offer).
Scoring is the unglamorous workhorse of data mining. It doesn't have the sexiness of a neural network or a genetic algorithm, but without it, data mining is pretty useless. (There are some data mining applications that cannot score the models that they produce; this is akin to building a house and forgetting to put in any doors.) At the end of the day, when your data mining tools have given you a great predictive model, there's still a lot of work to be done.
Scoring models against a database can be a time-consuming, error-prone activity, so the key is to make it part of a smoothly flowing process.
2. The Process
Scoring usually fits somewhere inside of a much larger process. In the case of one application of data mining, database marketing, it usually goes something like this:
1. The process begins with a database containing information about customers or prospects. This database might be part of a much larger data warehouse or it might be a smaller marketing data mart.
2. A marketing user identifies a segment of customers of interest in the customer database. A segment might be defined as "existing customers older than 65, with a balance greater than $1000 and no overdue payments in the last three months." The records representing this customer segment might be siphoned off into a separate database table or the records might be identified by a piece of SQL that represents the desired customers.
3. The selected group of customers is then scored by using a predictive model. The model might have been created several months ago (at the request of the marketing department) in order to predict the customer's likelihood of switching to a premium level of service. The score, a number between 0 and 1, represents the probability that the customer will indeed switch if they receive a brochure describing the new service in the mail. The scores are to be placed in a database table, with each record containing the customer ID and that customer's numerical score.
4. After the scoring is complete, the customers then need to be sorted by their score value. The top 25% will be chosen to receive the premium service offer. A separate database table that contains the records for the top 25% of the scoring customers will be created.
5. After the customers with the top 25% of the scores are identified, the information necessary to send them the brochure (name and address) will need to be pulled out of the data warehouse and a tape created containing all of this information.
6. Finally, the tape will be shipped to a company (sometimes referred to as a "mail house") where the actual mailing will occur.
The marketing department typically determines when and where the marketing campaigns take place. In past years, this process might be scheduled to happen once every six months, with large numbers of customers being targeted every time the marketing campaign is executed. Current thinking is to move this process onto a more continuous schedule, whereby small groups of customers are targeted on a weekly or even daily basis.
When marketing campaigns are infrequent, manual selection and scoring of the data is not a significant impediment to the process. There is usually significant lead time to allow the various parties to do their work before the actual mailing takes place. When someone in marketing needs to have a segment of customers selected for the campaign, they simply call someone in IT. When the scores are needed, the statistician who created the model is asked to apply the model to the customers in the desired segment. Because the processing is performed manually, the possibility of an error being introduced into the system is considerable. The main risks, and the checks that guard against them, are as follows:
• The definition of the segmentation can be incorrect and select the wrong customers for scoring. This kind of error is usually due to an incorrect translation from the marketing user's vocabulary to the syntax of an SQL statement executed by someone in IT.
• Make sure that the correct customers are scored. The correct database table needs to be scored. There is sometimes confusion regarding which table, among hundreds, is supposed to be scored. When the names of the tables are cryptic, as they often are (for example, JF432_IPG), using the wrong data for scoring is a real possibility.
• Make sure that the correct model is used to do the scoring. Assuming that the targeted selection of customers is a success, the number of models available could be quite large. In addition, multiple models might be similar (for example, one model predicts responses to a particular catalog for women aged 50-55, whereas another model predicts responses for men aged 50-55).
• Make sure that the scores are put in the right place. Just as confusion sometimes exists about the data that is going to be scored, there can also be some confusion about the tables that contain the scores.
• Make sure that you understand how the scores are ordered. Are high values good or bad? If you want to select the best customers, you will need to know what score values represent those customers.
When the frequency of the marketing campaigns is increased so that they occur on a daily or weekly basis, there are two significant impacts on the campaign. First, the decreased time between mailings means that there is much less room for error when carrying out the individual steps in the process. If a mistake is found, there is less time to correct it than with less frequent campaigns. Second, the sheer number of scoring "events" will increase dramatically, due to both the increased frequency of the campaigns and an increase in the number of segments that need to be scored.
If the marketing campaigns that rely on the scores are run on a continuous (daily) basis, this means a lot of phone calls between marketing and IT, as well as between marketing and the modelers. The best approach to solving this problem is to use campaign management software that is integrated with the scoring engine (see section 5 for a discussion of how this integrated software might work).
If integrated software is not available, care will need to be taken so that difficulties are minimized.
3. Scoring Architectures and Configurations
The software systems that are used to carry out the scoring process are usually simpler than the applications used to build the models. This is because the statistical functions and optimization procedures that were used to create the model are no longer needed; all that is required is a piece of software that can evaluate mathematical functions on a set of data inputs.
Scoring involves invoking a software application (often called the "scoring engine"), which then takes a model and a dataset and produces a set of scores for the records in the dataset. There are three common approaches to scoring engines:
• A scoring engine software application that is separate from the model-building application.
• A scoring engine that is part of the model-building application.
• A scoring engine that is produced by compiling the model "code" (for example, C++ or Java) that is output by the data mining application. In this case, a model is itself the scoring application because it is an executable piece of software (once it is compiled).
The type of model generated will depend upon the data mining system that is used. Some data mining systems can produce multiple types of models, whereas others will generate only a single type. In the first two cases, the scoring engine is a software application that needs to be run by the user. It might have a graphical user interface or it might be a command line program, in which the user specifies the input parameters by typing them at a console when the program is run. There are usually three inputs to the scoring engine: the model that is to be run, the data that is to be scored, and the location where the output scores should be put.
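The three inputs just described can be illustrated with a minimal Python sketch of a stand-alone scoring engine. Everything specific here is an assumption made for the example: the model is a small logistic function stored as a dictionary of weights, the data is CSV text, and the field names are hypothetical. Real scoring engines use whatever model format their data mining application exports.

```python
import csv
import io
import math


def logistic_score(model: dict, record: dict) -> float:
    """Evaluate a (hypothetical) logistic model on one record,
    returning a score strictly between 0 and 1."""
    z = model["intercept"] + sum(
        weight * float(record[name])
        for name, weight in model["weights"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def run_scoring_engine(model, data_in, scores_out, id_field="customer_id"):
    """The three inputs from the text: the model to run, the data to
    score, and the location where the output scores should be put."""
    reader = csv.DictReader(data_in)
    writer = csv.writer(scores_out)
    writer.writerow([id_field, "score"])
    count = 0
    for record in reader:
        score = logistic_score(model, record)
        writer.writerow([record[id_field], f"{score:.4f}"])
        count += 1
    return count  # number of records scored


# Illustrative inputs; the weights and field names are invented.
model = {"intercept": -2.0, "weights": {"balance": 0.001, "age": 0.01}}
data = io.StringIO("customer_id,balance,age\nC1,1000,70\nC2,2500,66\n")
out = io.StringIO()

n = run_scoring_engine(model, data, out)
print(f"scored {n} records")
print(out.getvalue())
```

No model-building code appears in the sketch: the engine only evaluates a fixed mathematical function on each record, which is why scoring software can be much simpler than the application that produced the model. In-memory text streams stand in for the input table and the output location; open files could be passed instead.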
Conclusion
Comprehensive data warehouses that integrate operational data with customer, supplier, and market information have resulted in an explosion of information. Competition requires timely and sophisticated analysis on an integrated view of the data. However, there is a growing gap between more powerful storage and retrieval systems and the users’ ability to effectively analyze and act on the information they contain. Both relational and OLAP technologies have tremendous capabilities for navigating massive data warehouses, but brute force navigation of data is not enough. A new technological leap is needed to structure and prioritize information for specific end-user problems. Data mining tools can make this leap. Quantifiable business benefits have been proven through the integration of data mining with current information systems, and new products are on the horizon that will bring this integration to an even wider audience of users.