Recently, so much has been said about the analysis of information that it is easy to get lost in the subject. It is good that many people pay attention to this timely topic. The bad part is that everyone takes the term to mean whatever they need, often without a general picture of the problem. This fragmented approach leads to a misunderstanding of what is happening and what to do. Everything consists of loosely connected pieces with no common core. You have surely heard the phrase “patchwork automation.” Many people have run into it and can confirm that the main problem with this approach is that it is almost never possible to see the whole picture. The situation with analysis is similar.
To understand the place and purpose of each analysis mechanism, let us consider the whole picture. We will start from how a person makes decisions. Since we cannot explain how a thought is born, we will concentrate on how information technology can be used in this process. In the first option, the decision maker (DM) uses the computer only as a means of extracting data and draws conclusions independently. Reporting systems, multidimensional data analysis, charts, and other visualization methods are used for such tasks. In the second option, the program not only extracts data but also performs preprocessing such as cleaning and smoothing, and then applies mathematical analysis methods (clustering, classification, regression, etc.) to the processed data. In this case the decision maker receives not raw data but heavily processed data; that is, the person works with models prepared by the computer.
In the first case, practically everything related to decision making rests on the person, so the choice of an adequate model and of processing methods lies outside the analysis mechanisms; that is, the basis for a decision is either an instruction (for example, how to respond to deviations) or intuition. In some cases this is quite enough, but if the decision maker needs deeper knowledge, data extraction mechanisms alone will not help. More serious processing is needed. This is the second case. The preprocessing and analysis mechanisms applied there allow decision makers to operate at a higher level. The first option suits tactical and operational tasks; the second suits replicating knowledge and solving strategic problems.
Ideally, both approaches to analysis should be available. Together they cover almost all of an organization's needs in analyzing business information. By varying the techniques depending on the task, we can get the most out of the available information in any case.
Often, when describing a product that analyzes business information, terms such as risk management, forecasting, and market segmentation are used … But in reality, solving each of these tasks comes down to using one of the analysis methods described below. For example, forecasting is a regression task, market segmentation is clustering, and risk management is a combination of clustering and classification; other methods are possible. This set of technologies therefore makes it possible to solve most business problems. In effect, they are the atomic (basic) elements from which the solution to a particular task is assembled.
The primary data sources should be the databases of enterprise management systems, office documents, and the Internet, because all information that may be useful for decision making has to be used. This includes not only information internal to the organization but also external data (macroeconomic indicators, the competitive environment, demographic data, etc.).
Now we will describe each fragment of the scheme separately.
Although analysis technologies are not implemented in the data warehouse itself, it is the foundation on which an analytical system is built. Without a data warehouse, collecting and systematizing the information needed for analysis takes up most of the time, which largely negates the advantages of the analysis. After all, one of the key indicators of any analytical system is the ability to get results quickly.
The next element of the scheme is the semantic layer. Regardless of how the information is analyzed, it has to be understandable to the decision maker. In most cases the analyzed data is located in different databases, and the decision maker should not have to understand the nuances of working with a DBMS, so a mechanism is needed that translates subject-area terms into calls to the database access mechanisms. This task is performed by the semantic layer. It is desirable that it be the same for all analysis applications, which makes it easier to apply different approaches to a task.
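To make the idea concrete, here is a minimal sketch of a semantic layer, assuming a single SQLite database. The table names, the business terms in SEMANTIC_LAYER, and the helper functions are all hypothetical and serve only as an illustration; a real semantic layer would span several databases and a much richer vocabulary.

```python
import sqlite3

# Hypothetical mapping from business terms to physical tables and SQL expressions.
SEMANTIC_LAYER = {
    "sales amount": ("sales", "SUM(amount)"),
    "average purchase price": ("purchases", "AVG(price)"),
}


def query_term(conn, term):
    """Translate a business term into SQL and return the single resulting value."""
    table, expression = SEMANTIC_LAYER[term]
    # Table and expression come only from the fixed mapping above, never from user input.
    row = conn.execute(f"SELECT {expression} FROM {table}").fetchone()
    return row[0]


def build_demo_db():
    """Create an in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
    conn.execute("CREATE TABLE purchases (product TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)", [("tea", 100.0), ("coffee", 250.0)]
    )
    conn.executemany(
        "INSERT INTO purchases VALUES (?, ?)", [("tea", 4.0), ("coffee", 6.0)]
    )
    return conn
```

With this in place the decision maker asks for "sales amount" and never sees a table name; every analysis application that goes through query_term shares the same vocabulary.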
Reporting systems are designed to answer the question of what is happening. The first use case is regular reports, which are used to monitor the operational situation and analyze deviations. For example, the system prepares daily reports on the remaining stock of each product in the warehouse, and when the stock falls below the average weekly sale, the response is to prepare a purchase order; in most cases these are standardized business operations. Some elements of this approach are usually implemented in companies in one form or another (even if only on paper), but it should not be the only available approach to data analysis. The second use case is handling ad hoc requests. When a decision maker wants to test an idea (a hypothesis), they need evidence that confirms or refutes it. Since such ideas arise spontaneously and there is no exact notion in advance of what information will be needed, a tool is required that delivers this information quickly and in a convenient form. The extracted data is usually presented as tables or as graphs and diagrams, although other representations are possible.
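The warehouse example can be sketched in a few lines. This is only an illustration of the deviation rule described above; the function names and the data layout (stock per product, a list of weekly sales per product) are assumptions, not part of any particular reporting system.

```python
def average_weekly_sale(weekly_sales):
    """Mean number of units sold per week over the given history."""
    return sum(weekly_sales) / len(weekly_sales)


def needs_purchase_order(stock_on_hand, weekly_sales):
    """Return True when stock has fallen below the average weekly sale."""
    return stock_on_hand < average_weekly_sale(weekly_sales)


def daily_report(stock_by_product, sales_history):
    """List the products whose stock is below their average weekly sale."""
    return sorted(
        product
        for product, stock in stock_by_product.items()
        if needs_purchase_order(stock, sales_history[product])
    )
```

For instance, with 5 units of tea in stock against weekly sales of 10, 12, and 8, tea appears in the report and a purchase order should be prepared.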
Although various approaches can be used to build reporting systems, the most common today is the OLAP mechanism. The main idea is to present information as multidimensional cubes, where the axes are dimensions (for example, time, products, customers) and the cells hold measures (for example, the sales amount or the average purchase price). The user manipulates the dimensions and receives the information in the required slice.
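The cube idea can be illustrated with a small sketch that aggregates fact rows along chosen axes. The rows, the axis names in DIMENSIONS, and the aggregate function are made up for illustration; a real OLAP engine precomputes and indexes such aggregates rather than scanning rows on every request.

```python
from collections import defaultdict

# Made-up fact rows: (month, product, customer, sales amount).
FACTS = [
    ("2024-01", "tea", "Alice", 100.0),
    ("2024-01", "coffee", "Bob", 250.0),
    ("2024-02", "tea", "Bob", 80.0),
    ("2024-02", "coffee", "Alice", 120.0),
]
DIMENSIONS = ("month", "product", "customer")


def aggregate(facts, by, where=None):
    """Sum the sales amount grouped by the axes in `by`, keeping rows that match `where`."""
    where = where or {}
    totals = defaultdict(float)
    for *coords, amount in facts:
        row = dict(zip(DIMENSIONS, coords))
        if all(row[dim] == value for dim, value in where.items()):
            totals[tuple(row[dim] for dim in by)] += amount
    return dict(totals)
```

Calling aggregate(FACTS, by=("product",)) gives total sales per product, while aggregate(FACTS, by=("month",), where={"customer": "Bob"}) gives the monthly slice for a single customer.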
Because it is easy to understand, OLAP has become widespread as a data analysis mechanism, but its capabilities for deeper analysis, such as forecasting, are extremely limited. The main difficulty in forecasting is not extracting the data of interest as tables and diagrams but building an adequate model. After that, everything is quite simple: new information is fed to the input of the existing model, passed through it, and the result is the forecast. Building the model, however, is a far from trivial task.
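The split between building a model and applying it can be shown with the simplest possible case. The sketch below uses an ordinary least squares line purely as a stand-in for a model; fit_line and forecast are hypothetical names, and real tasks usually need something far less simple.

```python
def fit_line(xs, ys):
    """Build the model: ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the xs are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(model, x_new):
    """Apply the model: pass a new input through it and return the forecast."""
    slope, intercept = model
    return slope * x_new + intercept
```

Here forecast is a one-liner, while all the assumptions about the data (in this case, that the dependence is linear) live in fit_line. That is the hard part in practice.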
Of course, several ready-made simple models, such as linear regression, can be built into the system, and this is done quite often, but it does not solve the problem. Real-life tasks almost always go beyond such simple models. Consequently, such a model will either detect only obvious dependencies, which are already well known and therefore of little value, or produce forecasts too rough to be of interest. For example, if you analyze stock prices on the simple assumption that shares will cost as much tomorrow as they do today, you will guess right in 90% of cases. But how valuable is such knowledge? Only the remaining 10% are of interest to brokers. Primitive models in most cases give results at about the same level.
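The "tomorrow equals today" assumption is easy to write down and useful as a yardstick for any other model. The sketch below is an illustration only: the text does not say what counts as a correct guess, so the relative tolerance parameter is an assumption, and the function names are hypothetical.

```python
def persistence_forecast(prices):
    """Forecast each day's price as the previous day's price."""
    return prices[:-1]


def hit_rate(prices, tolerance):
    """Share of days on which the naive forecast is within a relative `tolerance` of the actual price."""
    forecasts = persistence_forecast(prices)
    actuals = prices[1:]
    hits = sum(
        1 for f, a in zip(forecasts, actuals) if abs(a - f) <= tolerance * abs(f)
    )
    return hits / len(actuals)
```

On a quiet series the hit rate is high, and the days that are missed are exactly the jumps. A more elaborate model earns its keep only if it does better than this baseline on those days.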
The correct approach to building models is step-by-step improvement. Starting with a first, relatively rough model, it is necessary to refine it as new data accumulates and as the model is applied in practice. Building forecasts and similar tasks is beyond the scope of reporting system mechanisms, so positive results in this direction should not be expected from OLAP. Deeper analysis calls for a completely different set of technologies, united under the name Knowledge Discovery in Databases.
Knowledge Discovery in Databases (KDD) is the process of transforming data into knowledge. KDD covers data preparation, selection of informative features, data cleaning, application of Data Mining methods, post-processing of data, and interpretation of the results. Data Mining is the process of detecting, in “raw” data, previously unknown, non-trivial, practically useful, and interpretable knowledge needed for making decisions in various spheres of human activity.
The appeal of this approach is that, regardless of the subject area, we use the same operations:
Extract the data. In our case, this requires a semantic layer.
Clean the data. Using “dirty” data can completely negate the analysis mechanisms applied later.
Transform the data. Different analysis methods require data prepared in a specific form. For example, some methods accept only numeric inputs.
Perform the analysis itself (Data Mining).
Interpret the results.
Interpret the results.
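The five operations above can be sketched as a chain of small functions. Everything here is an illustrative assumption: the customer records are made up, extract stands in for a read through the semantic layer, and the mining step is a toy one-dimensional 2-means clustering chosen only because segmentation was mentioned earlier as a clustering task.

```python
def extract():
    """Stand-in for reading through the semantic layer: made-up customer spend records."""
    return [("Alice", 120.0), ("Bob", None), ("Carol", 900.0), ("Dan", 100.0), ("Eve", 950.0)]


def clean(rows):
    """Drop records with missing values."""
    return [(name, spend) for name, spend in rows if spend is not None]


def transform(rows):
    """Scale spend to the range [0, 1]; assumes at least two distinct spend values."""
    spends = [spend for _, spend in rows]
    low, high = min(spends), max(spends)
    return [(name, (spend - low) / (high - low)) for name, spend in rows]


def mine(rows, iterations=10):
    """Toy one-dimensional 2-means clustering; returns {name: cluster index}."""
    values = [value for _, value in rows]
    centers = [min(values), max(values)]
    assignment = {}
    for _ in range(iterations):
        # Assign every record to the nearest of the two centers.
        assignment = {
            name: min((0, 1), key=lambda c: abs(value - centers[c]))
            for name, value in rows
        }
        # Move each center to the mean of its members.
        for c in (0, 1):
            members = [value for name, value in rows if assignment[name] == c]
            if members:
                centers[c] = sum(members) / len(members)
    return assignment


def interpret(assignment):
    """Turn cluster indices into labels a decision maker can read."""
    labels = {0: "low spend", 1: "high spend"}
    return {name: labels[cluster] for name, cluster in assignment.items()}


def run_pipeline():
    return interpret(mine(transform(clean(extract()))))
```

Only mine and interpret depend on the task at hand; the first three steps look the same whether the goal is segmentation, forecasting, or classification, which is exactly what makes the approach reusable across subject areas.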