Text Mining

Definitions

Text mining is the application of algorithms and methods from the fields of machine learning and statistics to texts, with the goal of finding useful patterns. For this purpose the texts must be pre-processed accordingly. Many authors use information extraction methods, natural language processing, or some simple pre-processing steps to extract data from texts. Data mining algorithms can then be applied to the extracted data (see [NM02, Gai03]). (Hotho, Nürnberger, and Paaß 2005)

Text mining refers generally to the process of extracting interesting information and knowledge from unstructured text. (Hotho, Nürnberger, and Paaß 2005)

The term was first mentioned in Feldman et al. [FD95]. (Hotho, Nürnberger, and Paaß 2005)

Text mining [1] is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. (Gupta and Lehal 2009)

Text mining refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text. (Gupta and Lehal 2009)

Text mining can work with unstructured or semi-structured data sets such as emails, full-text documents, and HTML files. (Gupta and Lehal 2009)

Also Known As

Knowledge discovery from text (KDT) (Hotho, Nürnberger, and Paaß 2005)
Intelligent Text Analysis, Text Data Mining, or Knowledge-Discovery in Text (KDT) (Gupta and Lehal 2009)
Text Analytics (Ppts APomares)
Text Data Mining (Ppts APomares)

Related Areas

Related fields: information retrieval, machine learning, statistics, computational linguistics, and especially data mining. (Hotho, Nürnberger, and Paaß 2005)

Data Mining: tries to find interesting patterns from large databases. (Gupta and Lehal 2009)

Web Mining: explores interesting information and potential patterns from the contents of web pages, the information on accessing web page linkages, and the resources of e-commerce by using data mining techniques, which can help people extract knowledge, improve web site design, and develop e-commerce better. (Gupta and Lehal 2009)

Graph Mining

Computational Linguistics (Gupta and Lehal 2009)

Information Retrieval: the finding of documents which contain answers to questions, not the finding of the answers themselves [Hea99]. (Hotho, Nürnberger, and Paaß 2005)

Natural Language Processing (NLP): the general goal of NLP is to achieve a better understanding of natural language by use of computers [Kod99]. (Hotho, Nürnberger, and Paaß 2005)

Information Extraction (IE): the goal of information extraction methods is the extraction of specific information from text documents, which is stored in database-like patterns (see [Wil97]). (Hotho, Nürnberger, and Paaß 2005) IE addresses the problem of transforming a corpus of textual documents into a more structured database; the database constructed by an IE module can be provided to the KDD module for further mining of knowledge. (Gupta and Lehal 2009)

Preprocessing Steps

For mining large document collections it is necessary to pre-process the text documents and store the information in a data structure that is more appropriate for further processing than a plain text file. (Hotho, Nürnberger, and Paaß 2005)

Filtering: removes words from the dictionary and thus from the documents. (Hotho, Nürnberger, and Paaß 2005)

Lemmatization: tries to map verb forms to the infinitive and nouns to the singular form. (Hotho, Nürnberger, and Paaß 2005)

Stemming: tries to build the basic forms of words, i.e. strips the plural 's' from nouns, the 'ing' from verbs, or other affixes. (Hotho, Nürnberger, and Paaß 2005)

Linguistic preprocessing (Hotho, Nürnberger, and Paaß 2005):

Part-of-speech tagging (POS): determines the part-of-speech tag (e.g. noun, verb, adjective) for each term.

Text chunking: aims at grouping adjacent words in a sentence.

Word Sense Disambiguation (WSD): tries to resolve the ambiguity in the meaning of single words or phrases.

Parsing: produces a full parse tree of a sentence.

With keyword selection, only the selected keywords are used to describe the documents.
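
The filtering and stemming steps noted above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the stop-word list, the suffix rules, and the function names (`tokenize`, `filter_stop_words`, `naive_stem`, `preprocess`) are hypothetical and chosen for the example; the suffix stripping is deliberately naive and is not the Porter stemmer or any method prescribed by the cited sources.

```python
import re

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}

# Naive suffix rules, checked in this order (an assumption, not a real stemmer).
SUFFIXES = ("ing", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def filter_stop_words(tokens):
    """Filtering: drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Stemming: strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, filter stop words, then stem each remaining token."""
    return [naive_stem(t) for t in filter_stop_words(tokenize(text))]
```

For example, `preprocess("The cats are chasing boxes")` returns `['cat', 'chas', 'box']`. The truncated stem `chas` shows why stemming only approximates the basic word form, whereas lemmatization would aim for `chase`.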
Phases

Preprocessing: Index Term Selection (Hotho, Nürnberger, and Paaß 2005)

The most commonly used criterion is entropy. The entropy measures how well a word is suited to separating documents by keyword search, and can be seen as a measure of the importance of a word in the given domain context.

Dimensionality Reduction

Principal Components Analysis (PCA): also known as the Karhunen-Loève procedure, eigenvector analysis, or empirical orthogonal functions, depending on the context in which it is used. Recently it has been used primarily in statistical data analysis and image processing. For text and data mining, the focus is on covariance matrix analysis (COV). (Berry et al. 2008)

Latent Semantic Indexing (LSI): given a database with M documents and N distinguishing attributes for relevancy ranking, let A denote the corresponding M-by-N document-attribute matrix model.

Other Related Areas

Databases: necessary in order to analyze large quantities of data efficiently. (Hotho, Nürnberger, and Paaß 2005)

Machine Learning: an area of artificial intelligence concerned with the development of techniques which allow computers to "learn" by the analysis of data sets. (Hotho, Nürnberger, and Paaß 2005)

Statistics: has its grounds in mathematics and deals with the science and practice of analyzing empirical data. It is based on statistical theory, which is a branch of applied mathematics. (Hotho, Nürnberger, and Paaß 2005)

Knowledge Discovery in Databases (KDD): the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. (Hotho, Nürnberger, and Paaß 2005)

Sub Areas

Text Stream Mining

Text stream: created by very large-scale interactions of individuals, or by structured creation of particular kinds of content by dedicated organizations, e.g. news-wire services (Reuters, AP). (Aggarwal 2012) Text streams pose unprecedented challenges to data mining algorithms from an efficiency perspective. (Aggarwal 2012) They have become ubiquitous in recent years because of a wide variety of applications in social networks and news collection; in general, these involve the continuous creation of massive streams. (Aggarwal 2012)

Applications:

Social networks (Aggarwal 2012): users continuously communicate with one another through text messages. These are interesting because text messages reflect user interests; the same applies to chat and email networks.

News aggregator services, e.g. Google News (Aggarwal 2012): receive news articles continuously over time.

Web crawlers (Aggarwal 2012): collect a large volume of documents from networks in a small time frame.

Combining search results from major search engines such as Google, Yahoo! and Bing.

Opportunities: methods for online summarization need to be designed. (Aggarwal 2012)

Motivations

There are estimates that 85% of business information lives in the form of text. (Hotho, Nürnberger, and Paaß 2005)

As most information (over 80%) is stored as text, text mining is believed to have a high commercial potential value. (Gupta and Lehal 2009), (Tan 1999)

Humans have the ability to distinguish and apply linguistic patterns to text, and can easily overcome obstacles that computers cannot easily handle, such as slang, spelling variations, and contextual meaning. However, although our language capabilities allow us to comprehend unstructured data, we lack the computer's ability to process text in large volumes or at high speeds. (Gupta and Lehal 2009)

As the most natural form of storing information is text, text mining is believed to have a commercial potential higher than that of data mining. (Tan 1999)

Methodologies

Cross-Industry Standard Process for Data Mining (CRISP-DM): (1) business understanding, (2) data understanding, (3) data preparation, (4) modelling, (5) evaluation, (6) deployment. (Hotho, Nürnberger, and Paaß 2005)
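
Returning to the entropy criterion for index term selection mentioned in these notes: the notes do not give a formula, so the sketch below assumes one common formulation, W(t) = 1 + (1 / log2 |D|) * sum over documents d of P(d|t) * log2 P(d|t), where P(d|t) is the share of the occurrences of term t that fall in document d. Under this assumption a term spread evenly over all documents scores 0 and a term confined to a single document scores 1. The function name `term_entropy_weights` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def term_entropy_weights(documents):
    """Return an entropy-based weight in [0, 1] for every term.

    `documents` is a list of token lists. The formula used here is an
    assumed formulation, not one stated in the notes themselves.
    """
    n_docs = len(documents)
    if n_docs < 2:
        raise ValueError("need at least two documents")

    counts = [Counter(doc) for doc in documents]
    vocabulary = set().union(*documents)

    weights = {}
    for term in vocabulary:
        total = sum(c[term] for c in counts)
        # Sum of P(d|t) * log2 P(d|t) over documents that contain the term.
        spread = 0.0
        for c in counts:
            if c[term]:
                p = c[term] / total
                spread += p * math.log2(p)
        weights[term] = 1.0 + spread / math.log2(n_docs)
    return weights


# Toy corpus (illustrative): "the" occurs once in every document,
# every other term occurs in exactly one document.
corpus = [
    ["the", "entropy"],
    ["the", "stream"],
    ["the", "corpus"],
    ["the", "index"],
]
weights = term_entropy_weights(corpus)
```

In this toy corpus `weights["the"]` is 0 and `weights["entropy"]` is 1, which matches the intuition that a word appearing everywhere does not help separate documents by keyword search.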