Data Mining Methods for Beginners
The term data mining has become so widely used that most of us will find at least one shared article on the topic in our social media feeds. That overuse has led to misunderstandings about what data mining actually is, and many explanations are so hard to follow that we might as well be reading gibberish.
Technically, data mining is the process of extracting specific information from a collection of data and presenting the usable results in the hope of solving a particular problem. In a nutshell, data mining is the act of examining large data sets to produce new information. The process spans several specialized areas, such as text mining, web mining, audio and video mining, visual data mining, and social network data mining.
Many data mining techniques are in use and under development. The major ones found in recent data mining projects include association, classification, clustering, prediction, sequential patterns, and decision trees. This guide provides a brief examination of each of these techniques.
Association is perhaps one of the more popular data mining techniques used today. In association, the user attempts to discover a pattern based on links between items in a single transaction. This is why the association technique is also commonly referred to as the relation technique. It is widely used in market basket analysis, with the aim of identifying sets of products that consumers frequently purchase together in one transaction.
Retail companies use the association technique to study the decision-making behind their customers’ purchases. For example, when looking at past sales data, a company might discover that customers who buy chips often also buy beer. The company can then place beer and chips in the same aisle or close to one another. This makes shopping more efficient for customers and can ultimately increase sales.
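To make the chips-and-beer idea concrete, here is a minimal sketch of how pair rules could be scored. The transactions are made up for illustration, and the function name pair_rules and the min_support threshold are assumptions rather than part of any particular tool. Support is the share of transactions containing both items; confidence is how often the second item appears when the first one does.

```python
from collections import Counter
from itertools import combinations

# Made-up transactions for illustration only; not real sales data.
transactions = [
    {"chips", "beer", "salsa"},
    {"chips", "beer"},
    {"bread", "milk"},
    {"chips", "salsa"},
    {"beer", "chips", "milk"},
]

def pair_rules(transactions, min_support=0.4):
    """Return (antecedent, consequent, support, confidence) for frequent pairs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair is too rare to be interesting
        # One rule in each direction: a -> b and b -> a.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    # Highest confidence first; ties broken alphabetically.
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))

rules = pair_rules(transactions)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data, every basket with beer also contains chips, so the rule beer -> chips gets a confidence of 1.0, while chips -> beer is lower because some chip buyers skip the beer. Real market basket tools use more scalable algorithms, but the two measures are the same.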
Classification is a traditional data mining technique based on machine learning. Essentially, it assigns each item in a dataset to one of a predefined set of groups. The classification technique draws on mathematical approaches such as decision trees, linear programming, statistics, and neural networks.
In this technique, the user develops software that learns how to organize items into groups. For instance, classification can be applied to a task such as “looking at the records of employees who have left the company, predict who will leave the company next.” In this case, we separate the employee records into two classes named “remain” and “gone.” Our data mining software then assigns current employees to one of these groups based on their probability of leaving the company.
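As a rough sketch of the employee example, the snippet below uses a nearest-centroid classifier, which is only one simple way to classify and was chosen here for brevity. The records, the two features (years at the company and a satisfaction score), and the function names are all hypothetical.

```python
import math

# Hypothetical past records: (years_at_company, satisfaction 0-10, label).
history = [
    (1.0, 3.0, "gone"),
    (2.0, 4.0, "gone"),
    (1.5, 2.0, "gone"),
    (6.0, 8.0, "remain"),
    (8.0, 7.0, "remain"),
    (5.0, 9.0, "remain"),
]

def train_centroids(records):
    """Average the feature values of each predefined class."""
    sums = {}
    for years, score, label in records:
        total = sums.setdefault(label, [0.0, 0.0, 0])
        total[0] += years
        total[1] += score
        total[2] += 1
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}

def classify(centroids, years, score):
    """Assign the label whose class average is closest to the new record."""
    return min(
        centroids,
        key=lambda label: math.dist(centroids[label], (years, score)),
    )

centroids = train_centroids(history)
print(classify(centroids, 1.2, 3.5))  # a short-tenure, low-satisfaction employee
print(classify(centroids, 7.0, 8.5))  # a long-tenure, high-satisfaction employee
```

The key point is that the two classes, “remain” and “gone,” exist before the software sees any new employee. The program only learns where each class sits in the data and then places new records into one of them.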
In data mining, clustering refers to the process of grouping a particular set of objects by looking at their characteristics and placing similar objects together. The clustering technique defines the classes itself and places each object in its respective class, whereas in the classification technique, objects are assigned to predefined categories.
To make things clearer, let’s look at the example of a library’s book management system. A library holds a large selection of books on many topics. The challenge is to organize the books so that visitors can pick up several books on a certain topic without having to walk around the whole library. With the clustering technique, one cluster (in this case, a shelf) contains all the books about a particular topic, and the cluster is given a meaningful, understandable name. If readers need a book on that topic, they just have to head to the aisle where those books are located instead of searching the entire building.
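The library idea can be sketched with k-means, one widely used clustering algorithm. The version below is deliberately tiny: the book features are invented numbers, the kmeans function is a simplified illustration, and it starts from the first k points instead of a random choice so the result is repeatable.

```python
import math

def kmeans(points, k, iterations=10):
    """Very small k-means: returns (centroids, cluster index for each point)."""
    centroids = list(points[:k])  # naive, deterministic initialization
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment

# Hypothetical books as (share of science terms, share of history terms).
books = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.2), (0.2, 0.8), (0.85, 0.05), (0.15, 0.95)]
centroids, shelves = kmeans(books, k=2)
print(shelves)  # shelf index for each book, in input order
```

Notice that no labels are supplied anywhere. The algorithm discovers the two shelves on its own, and a person would still need to look at each shelf afterward to give it a meaningful name such as “science” or “history.”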
As the name suggests, the prediction technique aims to discover the relationships among independent variables and between independent and dependent variables. For example, this technique can be used to predict future profits if sales are treated as the independent variable and profits as the dependent variable. Using past sales and profit data, the user can fit a regression curve for predicting profits.
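As a minimal sketch of the sales-and-profit example, the code below fits a straight line by ordinary least squares, the simplest kind of regression curve. The figures are invented, and the helper names fit_line and predict are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Estimate the dependent variable for a new value of x."""
    return slope * x + intercept

# Invented past data: sales (independent) and profits (dependent).
sales = [100.0, 150.0, 200.0, 250.0]
profits = [12.0, 20.0, 31.0, 37.0]

slope, intercept = fit_line(sales, profits)
forecast = predict(slope, intercept, 300.0)
print(f"profit = {slope:.3f} * sales + {intercept:.2f}")
print(f"forecast for sales of 300: {forecast:.1f}")
```

A straight line is an assumption in itself. If past data curves or levels off, a different model would fit better, and any forecast far outside the range of past sales should be treated with caution.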
Sequential pattern analysis aims to identify common patterns, regular events, or trends in transaction data over a certain period. In sales, using past transaction data, a business can find sets of items that its customers tend to purchase together during certain months or seasons. Businesses use this information to offer better deals or discounts based on historical purchasing frequency.
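One way to look for patterns over time is to count how often one item is bought before another across customers’ visits. The sketch below takes that reading, which focuses on purchase order and is only one interpretation of sequential patterns; the purchase histories, the function name, and the min_customers threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical purchase histories: each customer's visits in time order.
histories = {
    "c1": [{"tent"}, {"sleeping bag"}, {"stove"}],
    "c2": [{"tent", "boots"}, {"sleeping bag"}],
    "c3": [{"boots"}, {"tent"}, {"stove"}],
}

def frequent_sequences(histories, min_customers=2):
    """Count ordered pairs (a, b) where a is bought in an earlier visit than b."""
    counts = Counter()
    for visits in histories.values():
        seen_pairs = set()
        for i, earlier in enumerate(visits):
            for later in visits[i + 1:]:
                for a in earlier:
                    for b in later:
                        if a != b:
                            seen_pairs.add((a, b))
        counts.update(seen_pairs)  # each customer counts once per pair
    return {pair: n for pair, n in counts.items() if n >= min_customers}

patterns = frequent_sequences(histories)
for (first, then), n in sorted(patterns.items()):
    print(f"{first} -> {then}: seen for {n} customers")
```

Unlike the association technique, order matters here: “tent, then sleeping bag” is a different pattern from “sleeping bag, then tent.” A business could use a pattern like this to time a discount on the second item shortly after the first is purchased.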
A decision tree is one of the most widely used forms of data mining because the model is simple and easy to understand. With the decision tree technique, the root of the tree is a question or condition that can have multiple responses. Each response then leads to a further set of questions or conditions that narrow down the data so the user can make a better final decision. For example, we can follow the sequence of questions and answers below to decide whether to play basketball outdoors or indoors:
- Outlook → Is it sunny? → If so, then how humid is it? → If high humidity, then I’ll play indoors → If low humidity, then I’ll play outdoors
- Outlook → Is it raining? → If so, how windy is it? → If high winds, then I’ll play indoors → If low winds, then I’ll play outdoors
- Outlook → Is it overcast? → If so, then I’ll play outdoors
Beginning at the root node, if the outlook is overcast, then I will play basketball outdoors. If it’s raining, I’ll play basketball outdoors only if it’s not windy. And if the sun is out and shining, I will play basketball outdoors only if it’s not too humid.
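The walk-through above maps directly onto nested conditions. The sketch below hand-codes that tree as a function; the name where_to_play and the string values for each answer are assumptions chosen for illustration. In real data mining, the tree would be learned from past records instead of being written by hand.

```python
def where_to_play(outlook, humidity=None, wind=None):
    """Follow the basketball decision tree from the root (outlook) to a leaf."""
    if outlook == "overcast":
        return "outdoors"  # overcast needs no further question
    if outlook == "sunny":
        # Second question on the sunny branch: how humid is it?
        return "indoors" if humidity == "high" else "outdoors"
    if outlook == "rainy":
        # Second question on the rainy branch: how windy is it?
        return "indoors" if wind == "high" else "outdoors"
    raise ValueError(f"unknown outlook: {outlook!r}")

print(where_to_play("overcast"))
print(where_to_play("sunny", humidity="high"))
print(where_to_play("rainy", wind="low"))
```

Each if statement corresponds to one node in the tree, and each return is a leaf. That one-to-one mapping between the model and plain rules is a large part of why decision trees are considered easy to understand.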
These are the six basic techniques used in data mining. Though some of them may appear similar in practice, each has a different aim. We are free to combine two or more data mining techniques to form a process that meets a business’s needs.