What Are Data Mining and Text Mining?

Data mining, also known as knowledge discovery in databases, refers to a set of methods for extracting or discovering (mining) implicit, previously unknown, and potentially useful information or knowledge from large amounts of data. The term “data mining” is a misnomer; a more appropriate name would be “knowledge mining.” The field also goes by several other names, such as knowledge extraction, data analysis, pattern analysis, data archaeology, data dredging, information harvesting, and business intelligence. One aim of data mining is to devise methods that automatically seek regularities or patterns in a large-scale database. Once strong patterns are found, they can be used as generalizations to make accurate predictions on future data. Naturally, the mining process may find a large number of patterns, but only a small portion of them are interesting and useful. The others tend to be spurious, contingent on accidental coincidences in the particular dataset used. The important issue at this point is how to select the interesting patterns from the large pile of mined patterns. Moreover, data found in real situations tend to be imperfect, with garbled parts and/or missing values. Data mining methods need to be robust enough to cope with such imperfect data and still extract regularities that are interesting and useful. Several core techniques in data mining come from statistical analysis and machine learning, which take the data in and infer whatever structure underlies it.
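As a rough illustration of how interesting patterns might be separated from spurious ones, the sketch below counts co-occurring item pairs in a handful of made-up transactions and keeps only the rules whose support and lift exceed chosen thresholds. The function name mine_pair_rules, the thresholds, and the sample data are assumptions introduced for this example; support, confidence, and lift are one common family of interestingness measures, not the only option. Missing values (None) are simply dropped, which is one simple way of coping with imperfect data.

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(transactions, min_support=0.4, min_lift=1.2):
    """Return (antecedent, consequent, support, confidence, lift) tuples
    for item pairs that pass both thresholds (hypothetical helper)."""
    # Drop missing values (None) and duplicate items inside each transaction.
    cleaned = [frozenset(i for i in t if i is not None) for t in transactions]
    n = len(cleaned)

    item_counts = Counter(item for t in cleaned for item in t)
    pair_counts = Counter(
        pair for t in cleaned for pair in combinations(sorted(t), 2)
    )

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of transactions containing both items
        if support < min_support:
            continue  # too rare: likely an accidental coincidence
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]      # estimate of P(y | x)
            lift = confidence / (item_counts[y] / n)  # P(y | x) / P(y)
            if lift >= min_lift:
                rules.append((x, y, support, confidence, lift))
    # Strongest associations first.
    return sorted(rules, key=lambda r: r[4], reverse=True)


# Made-up sample data; None marks a missing value.
transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "butter", None],
    ["milk", "tea"],
    ["milk", "bread"],
]

rules = mine_pair_rules(transactions)
for x, y, support, confidence, lift in rules:
    print(f"{x} -> {y}: support={support:.2f}, "
          f"confidence={confidence:.2f}, lift={lift:.2f}")
```

In this toy dataset the pair (bread, milk) is frequent enough to pass the support threshold but is discarded because its lift is below 1, meaning the two items co-occur less often than chance would suggest. Only the bread/butter rules survive.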

In contrast with data mining, which deals with structured data, text mining handles unstructured or semi-structured textual data (such as journal articles, news articles, and online web content) rather than formalized database records. It usually involves structuring the input text by way of syntactic (parsing), semantic, discourse, and/or pragmatic analysis, deriving patterns within the structured data, and finally evaluating and interpreting the output. Text mining usually aims to obtain results with high relevance, novelty, and interestingness. One major difference between data mining and text mining is preprocessing. While preprocessing in data mining is comparatively simple, focusing on data cleansing, data integration, data transformation, and data reduction, preprocessing in text mining demands more: the identification and extraction of representative features for texts written in a natural language. These preprocessing operations, which are not a concern in data mining, are responsible for transforming unstructured data stored in text collections into a more explicitly structured intermediate format. Because of this characteristic, text mining needs to exploit techniques and methodologies from the areas of natural language processing and human language processing, including information retrieval, information extraction, text classification, text clustering, and corpus-based computational linguistics.
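To make the idea of a "structured intermediate format" concrete, here is a minimal sketch that turns raw sentences into a term-frequency matrix, one of the simplest such formats (a bag-of-words representation). The stopword list, the regular-expression tokenizer, and the names tokenize and to_term_matrix are assumptions made for illustration; real text mining systems typically use far richer linguistic preprocessing than this.

```python
import re
from collections import Counter

# Tiny illustrative stopword list, not a standard one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "is", "to"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [tok for tok in re.findall(r"[a-z]+", text.lower())
            if tok not in STOPWORDS]


def to_term_matrix(documents):
    """Map each document to a row of term counts over a shared vocabulary."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for terms that do not occur in a document.
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


documents = [
    "Text mining handles the unstructured text.",
    "Data mining handles structured data in a database.",
]

vocabulary, matrix = to_term_matrix(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

Once the texts are in this tabular form, each document is an ordinary record with numeric attributes, so conventional data mining algorithms such as classifiers or clustering methods can be applied to it.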

However, it is quite common to find data mining techniques used in text mining work, and vice versa. The architectures in both areas are also very similar. For example, both data mining and text mining rely on preprocessing steps, mining algorithms, and presentation and visualization techniques to enhance the interpretation of discovered patterns.