Data Mining: Digging for Big Data Gold!
There is no doubt whatsoever that we are living in the information age. Connectivity and the Internet of Things have made it possible to access "tons" of data, and "Big Data" refers to the systems that manipulate these very large data sets. The most common difficulties involve capturing, storing, searching, sharing, analyzing and visualizing the data. The drive to manipulate vast quantities of data often stems from the need to combine many related sources of information for purposes such as analyzing business trends, preventing the spread of infectious diseases and combating organized crime.
In view of the enormous volume of information being generated, data measurement units have evolved from kilobytes (KB) to megabytes (MB), gigabytes (GB), terabytes (TB), petabytes (PB) and exabytes (EB). It was estimated that by the end of 2013 the amount of data stored on the Internet had reached 2.2 zettabytes (ZB), and that it would rise to 2.5 ZB in 2014. The key question is: when will we reach the yottabyte (YB)? It takes a million trillion megabytes to fill a yottabyte.
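For scale, that figure is easy to verify with a quick back-of-the-envelope calculation (a minimal sketch in Python, assuming decimal units, where 1 MB = 10^6 bytes and 1 YB = 10^24 bytes):

```python
# How many megabytes fit in a yottabyte? (decimal units assumed)
MB = 10**6   # bytes in a megabyte
YB = 10**24  # bytes in a yottabyte
print(YB // MB)  # 10**18: a million (10**6) times a trillion (10**12)
```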
With such a huge volume of information, the key is knowing how best to use it; in other words, how to transform information into knowledge for making decisions. The analytical methods used by business corporations have evolved from triggering alarms to predicting events:
1. Simple alarms (limited effectiveness of the commercial campaigns undertaken):
- Identification of events that indicate a trend in customer behavior (e.g. signs of dissatisfaction that directly increase the risk of abandonment, or the need for new products following a launch)
- Variables are continuously updated
2. Segmentation of customer profiles, by risk or purchase profile (high cost: to guarantee that commercial campaigns have the desired impact, the entire segment has to be targeted):
- Identification of customer segments based on their value and behavior
- Determination of risk profiles or those with a greater propensity toward certain products and services
3. Propensity prediction models (maximum effectiveness as specific customers are targeted with sufficient time to retain them):
- Construction of models to predict the propensity to buy or abandon based on the customer's current characteristics
- With continuously updated variables the models become more detailed, and possible deviations in customer propensity can be spotted sooner (see the sketch after this list)
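To make the idea concrete, here is a minimal propensity-scoring sketch in Python, assuming scikit-learn and purely synthetic data; the model choice, the features and the top-decile cut-off are all illustrative, not taken from any real campaign:

```python
# Propensity sketch: train on past behavior, score current customers and
# target only the highest-propensity ones instead of a whole segment.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic "customer" data: y = 1 means the event (e.g. abandonment) occurred
X, y = make_classification(n_samples=2000, n_features=10,
                           n_informative=5, random_state=0)
X_train, X_score, y_train, _ = train_test_split(X, y, test_size=0.25,
                                                random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score each current customer with a propensity (probability of the event)
propensity = model.predict_proba(X_score)[:, 1]

# Target only the top decile, with time to act before the event happens
top_decile = np.argsort(propensity)[-len(propensity) // 10:]
print(f"targeting {len(top_decile)} customers, "
      f"minimum score {propensity[top_decile].min():.2f}")
```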
Let us pause to look at this more closely. The score a model assigns correlates with the probability of the event occurring, and this has a direct impact on the results of the campaigns undertaken. Most of the effort is concentrated on identifying, creating and transforming the variables used for the analysis. Different analytical methods are used for different purposes:
1. Decision trees: These are the best option when the model has to follow, and be explained in terms of, a specific business logic. They classify individuals into groups with different behavioral patterns, based on a set of input variables. This is a supervised modeling method, used when business acumen is a key factor in the structure of the model. Advantages of decision trees (a code sketch follows the list):
1. First and foremost, they are intelligible and easy to explain.
2. Excellent predictive capacity with categorical variables, provided the input variables are clearly defined and continuous ones are cut at the right split points.
3. Highly flexible in terms of the different types of input variables and the manipulation of missing data. Moreover, they are not overly affected by outliers.
4. Very easy to implement, maintain and review.
5. Overfitting has to be controlled by evaluating the model on a held-out test sample to guarantee its precision.
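As a minimal sketch of these points, assuming Python with scikit-learn and synthetic data (the depth limit is just one illustrative way to keep the tree readable and control overfitting):

```python
# Decision-tree sketch: an intelligible classifier whose rules can be
# printed, with overfitting checked against a held-out test sample.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=6,
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# A shallow tree stays easy to explain and is less prone to overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# The fitted rules can be printed and explained to a business audience
print(export_text(tree, feature_names=[f"var_{i}" for i in range(6)]))
print(f"train accuracy: {tree.score(X_train, y_train):.2f}")
print(f"test accuracy:  {tree.score(X_test, y_test):.2f}")
```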
2. Neural networks: These are a good alternative but require more exploratory work than the other methods. They combine the attributes of an observation to assist decision-making, and the modeling process consists of training the network to combine those attributes with the appropriate structure and weights. Advantages of neural networks (a code sketch follows the list):
1. They follow a heuristic training process that enables them to adjust the weightings of the input attributes (e.g. backpropagation).
2. The input variables have to be normalized to a [0, 1] scale in order to speed up the convergence of the algorithm.
3. The more intermediate (hidden) layers there are, the closer the fit to the training data and the greater the risk of overfitting, so it is important to keep separate training and test samples.
4. One common objection to neural networks is that they are a "black box" that is very difficult to interpret. One solution is to fit decision trees that "open" the box and approximate its workings (surrogate models).
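As a minimal sketch of these points, assuming Python with scikit-learn and synthetic data (the single hidden layer and the shallow surrogate tree are illustrative choices):

```python
# Neural-network sketch: inputs scaled to [0, 1], a small multi-layer
# perceptron trained by backpropagation, and a decision tree fitted as a
# surrogate model to "open" the black box.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=6,
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Normalize inputs to [0, 1] to speed up convergence
scaler = MinMaxScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# One hidden layer; fit() adjusts the weights via backpropagation
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
net.fit(X_train_s, y_train)
print(f"network test accuracy: {net.score(X_test_s, y_test):.2f}")

# Surrogate model: a shallow tree trained to mimic the network's predictions
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X_train_s, net.predict(X_train_s))
agreement = surrogate.score(X_test_s, net.predict(X_test_s))
print(f"surrogate agreement with network: {agreement:.2f}")
```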
3. Logistic regressions: This is one of the most popular methods because the models are efficient and have a high predictive capacity. Logistic regression is a parametric modeling method: the relationship between the explanatory variables and the transformed target variable (the logit, i.e. the log-odds of the event) is linear. Advantages of logistic regressions (a code sketch follows the list):
1. There are no limitations on the number of independent (explanatory) variables, which may be continuous or categorical.
2. Once the dependent variable has been defined as the occurrence or not of an event, the logistic regression model expresses it in terms of probability.
3. Logistic regressions require less effort than neural networks: there is no need to explore different network structures and check each one for overfitting.
4. If multiple models are required, this is the best option.
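As a minimal sketch of these points, assuming Python with scikit-learn and synthetic data (the variable names are illustrative):

```python
# Logistic-regression sketch: the log-odds (logit) of the event is linear in
# the explanatory variables, and predictions come out as probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=5,
                           n_informative=3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X, y)

# The fitted model has the form: logit(p) = b0 + b1*x1 + ... + b5*x5
print("intercept b0:", model.intercept_)
print("coefficients b1..b5:", model.coef_)

# predict_proba expresses the dependent variable in terms of probability
p = model.predict_proba(X[:3])[:, 1]
print("event probabilities for three customers:", np.round(p, 3))
```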
4. Support vector machines (SVM): This option is also used for propensity models, but the resulting models are difficult to interpret, maintain and implement. These are statistical learning models based on spatial separation through hyperplanes chosen to maximize the margin between classes. They classify the information of a non-linear problem by mapping it into a high-dimensional space where it becomes a linear problem. Advantages of SVM (a code sketch follows the list):
1. Certain abandonment prediction studies reveal no statistically significant difference between logistic models and neural networks, while support vector machines offer the highest level of accuracy.
2. Although conceptually quite simple from the mathematical point of view, this model is a little more difficult to implement than decision trees; however, numerous libraries in different formats are available to facilitate the task.
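As a minimal sketch of the idea, assuming Python with scikit-learn: the deliberately non-linear make_circles data cannot be separated by a straight line, yet an RBF kernel turns it into a problem a maximum-margin hyperplane can solve (the kernel and parameters are illustrative):

```python
# SVM sketch: an RBF kernel implicitly maps the data into a high-dimensional
# space where a maximum-margin hyperplane separates the classes.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: a non-linear problem with no linear separator in 2-D
X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print(f"test accuracy on a non-linear problem: {svm.score(X_test, y_test):.2f}")
```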
In conclusion, and after more than 12 years as a marketing analyst, I have to say that the only key to creating a successful business strategy in the 21st century is a perfect combination of good information, analytical capacity, common sense, flexibility and speed in executing campaigns!