You are moving your shopping cart towards the billing counter. The barcode of each product in your cart has already been automatically transmitted one by one to one of the billing computer systems. This is your first visit to this department store. You are allotted a loyalty card that has a customer ID number and are asked to fill a short form that asks details like your age, gender, income and occupation. A few days later you receive an email from the store informing you about some good offers on headphones; you had purchased an iPod among other things during your last visit. This is data mining at work.

MCI Communications Corp., the long-distance carrier, sifts through almost one trillion bytes of customer phoning data to develop new discount calling plans for its different types of customers. Lands’ End, a retail chain, identifies which of its two million-odd customers can be sent exclusive mailers on specific clothing items that will go well with their existing wardrobe.

Fig. 1: Basic steps involved in the process of data mining
Fig. 2: A bitcoin mining hardware set-up that uses custom chips known as ASICs to focus their processing power on bitcoin algorithms (Source:

Data warehousing and data mining
Data warehousing is all about capturing data on a periodic basis from various sources and storing it on computers in an organised manner, that is, in virtual warehouses.

Let us take its application in marketing to customers as an illustration. Some common sources of data are details of purchases made at retail outlets, marketing research interviews with customers or simply day-to-day customer transactions (such as amount of money deposited or withdrawn from a bank). Usually, the starting point is a customer database that needs to be elaborate in terms of variables such as demographics like age, family size and income. Customers leave traces of their purchasing behaviour in store scanning data, catalogue purchase records and customer databases.

Customers’ actual purchases reflect their true preferences, and are often more reliable than answers given during market research surveys. Data mining refers to the analysis of this collected raw data to generate useful and meaningful information, such as finding which products are purchased most often and building customer profiles of low, medium and heavy buyers. This information can then be used for business actions like special promotions and loyalty programmes.
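As a rough illustration of the two analyses mentioned above, the sketch below counts the most frequently purchased products and groups customers into low, medium and heavy buyers by total spend. The purchase records, the function names and the spend thresholds (100 and 500) are all hypothetical, chosen only for this example; a real store would derive them from its own data.

```python
from collections import Counter, defaultdict

# Hypothetical purchase records: (customer_id, product, amount_spent)
PURCHASES = [
    ("C1", "iPod", 250.0),
    ("C1", "headphones", 40.0),
    ("C2", "headphones", 35.0),
    ("C3", "iPod", 250.0),
    ("C3", "laptop", 900.0),
    ("C2", "batteries", 5.0),
]


def most_purchased(purchases, top_n=2):
    """Return the top_n products by number of purchases."""
    counts = Counter(product for _, product, _ in purchases)
    return counts.most_common(top_n)


def buyer_profiles(purchases, medium_from=100.0, heavy_from=500.0):
    """Label each customer low, medium or heavy by total spend.

    The two thresholds are assumptions made for this sketch.
    """
    totals = defaultdict(float)
    for customer, _, amount in purchases:
        totals[customer] += amount

    profiles = {}
    for customer, total in totals.items():
        if total >= heavy_from:
            profiles[customer] = "heavy"
        elif total >= medium_from:
            profiles[customer] = "medium"
        else:
            profiles[customer] = "low"
    return profiles


if __name__ == "__main__":
    print(most_purchased(PURCHASES))
    print(buyer_profiles(PURCHASES))
```

Profiles like these are what a store could then use to decide who receives which promotion.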

A brief history

Fig. 3: A cabinet from Blue Gene/L, a massively parallel processor based supercomputer (Source:

Data mining was born in the minds of professors and doctoral researchers who used their expertise in statistics to develop the algorithms that form the foundation of data mining today. In the early days, these algorithms were custom-coded in programming languages such as FORTRAN.

Early applications of data mining were more academic in nature. As this work evolved, different vendors built new layers of tooling to make these data mining algorithms easier to apply.

However, even today, data mining remains more of a standalone activity conducted by dedicated staff, and its terminology is still quite academically oriented. As a result, data mining is often perceived as mysterious and somewhat aloof from day-to-day IT projects.

When a business problem is encountered, the IT team starts working on it by understanding and developing the data warehouse aspects. The data mining team works differently and contributes by suggesting analyses or recommendations that will lead to a certain type of action.

Infrastructure requirements
In order to be effective and efficient, a data mining system requires some mandatory inputs. It needs large investments in hardware, software, databases, programming, communication links and skilled personnel. The volume of data in a data mining application is usually very high. Handling large amounts of data requires software with advanced capabilities, and the system needs to be user-friendly and easily accessible to different user departments.

Data warehousing needs to be done carefully because critical actions, in marketing for example, may be taken based on the analysis of the data. Since data is collected at multiple locations and times, there needs to be a quality control process in place.
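One possible form of such a quality control step is sketched below: records arriving from different stores are checked for missing fields, implausible values and duplicates before they enter the warehouse. The field names, the validity rules and the function name are assumptions made for illustration, not a prescribed process.

```python
# Assumed schema for an incoming transaction record.
REQUIRED_FIELDS = ("store_id", "customer_id", "timestamp", "amount")


def check_records(records):
    """Split records into clean ones and (record, reason) rejects."""
    clean, rejected, seen = [], [], set()
    for rec in records:
        # Rule 1: every required field must be present and non-empty.
        missing = [f for f in REQUIRED_FIELDS if rec.get(f) in (None, "")]
        if missing:
            rejected.append((rec, "missing: " + ", ".join(missing)))
            continue
        # Rule 2: a purchase amount cannot be negative (assumed rule).
        if rec["amount"] < 0:
            rejected.append((rec, "negative amount"))
            continue
        # Rule 3: the same store, customer and timestamp may appear only once.
        key = (rec["store_id"], rec["customer_id"], rec["timestamp"])
        if key in seen:
            rejected.append((rec, "duplicate"))
            continue
        seen.add(key)
        clean.append(rec)
    return clean, rejected


if __name__ == "__main__":
    sample = [
        {"store_id": "S1", "customer_id": "C1", "timestamp": "2024-01-01T10:00", "amount": 20.0},
        {"store_id": "S1", "customer_id": "C1", "timestamp": "2024-01-01T10:00", "amount": 20.0},
        {"store_id": "S2", "customer_id": "", "timestamp": "2024-01-01T11:00", "amount": 5.0},
        {"store_id": "S2", "customer_id": "C2", "timestamp": "2024-01-01T12:00", "amount": -3.0},
    ]
    clean, rejected = check_records(sample)
    print(len(clean), "clean;", [reason for _, reason in rejected])
```

Keeping the reason alongside each rejected record makes it easier to trace a problem back to the location that produced it.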

Fig. 4: Crime and Criminal Tracking Network and Systems (CCTNS) has been established to integrate databases on crimes by connecting 14,000 police stations in the country (Source:
Fig. 5: Data mining can enable evaluation of the effectiveness of medical treatments (Source:

Moreover, the database must be continuously updated, as it is a dynamic situation with some old customers leaving, new ones joining and others changing their interests.

Data mining cannot be carried out in isolation. It needs to be integrated with the company’s overall marketing strategy.

Finally, data is nothing but numbers. It has to be put to work, moved to where it is needed and analysed innovatively in order to get good returns on investment.

Currently, data mining applications are offered for systems of all sizes, on mainframe, client/server and PC platforms. System prices range from as low as a few thousand dollars for the smallest applications to as high as a million dollars a terabyte for the largest. Enterprise-wide applications generally range from 10GB to over 10TB.

There are two key technical considerations. The first is the size of the database: the larger the volume of data being processed and maintained, the bigger the system required.

The second is query complexity: the more complex the queries and the larger the number of queries being processed, the more powerful the system required. Relational database storage and management technology is sufficient for most data mining applications smaller than 50GB. However, it needs to be scaled up significantly to take on bigger applications.

Some vendors have added extensive indexing capabilities to improve query performance, while others use new hardware architectures such as massively parallel processors (MPPs) to achieve order-of-magnitude improvements in query time.
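The effect of indexing on a query can be seen on a small scale with SQLite, used here through Python's standard sqlite3 module purely as a stand-in, since no specific database product is named above. The table, column and index names are hypothetical. The sketch asks SQLite for its query plan before and after an index is created on the column being searched.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's query plan for sql as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable plan detail.
    return " | ".join(row[-1] for row in rows)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE purchases (customer_id INTEGER, product TEXT, amount REAL)"
    )
    # 10,000 made-up rows spread over 1,000 customer IDs.
    conn.executemany(
        "INSERT INTO purchases VALUES (?, ?, ?)",
        [(i % 1000, "item", 1.0) for i in range(10000)],
    )
    sql = "SELECT COUNT(*) FROM purchases WHERE customer_id = ?"

    # Without an index, SQLite has to scan the whole table.
    before = query_plan(conn, sql, (42,))

    # With an index on customer_id, it can search the index instead.
    conn.execute("CREATE INDEX idx_purchases_customer ON purchases (customer_id)")
    after = query_plan(conn, sql, (42,))

    count = conn.execute(sql, (42,)).fetchone()[0]
    conn.close()
    return before, after, count


if __name__ == "__main__":
    before, after, count = demo()
    print("before index:", before)
    print("after index: ", after)
    print("matching rows:", count)
```

The exact wording of the plan differs between SQLite versions, but the first plan reports a full table scan and the second names the index. On a table this small the time saved is negligible; the gap widens as the data grows.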

Data mining can be applied to a number of diverse areas and has already proven to be an immensely useful tool.

