Publication date: 2009-08  Publisher: Posts & Telecom Press (人民郵電出版社)  Authors: (Israel) Feldman, (USA) Sanger  Pages: 410  Word count: 506,000
Preface
The information age has made it easy to store large amounts of data. The proliferation of documents available on the Web, on corporate intranets, on news wires, and elsewhere is overwhelming. However, although the amount of data available to us is constantly increasing, our ability to absorb and process this information remains constant. Search engines only exacerbate the problem by making more and more documents available in a matter of a few keystrokes.

Text mining is a new and exciting research area that tries to solve the information overload problem by using techniques from data mining, machine learning, natural language processing (NLP), information retrieval (IR), and knowledge management. Text mining involves the preprocessing of document collections (text categorization, information extraction, term extraction), the storage of the intermediate representations, the techniques to analyze these intermediate representations (such as distribution analysis, clustering, trend analysis, and association rules), and visualization of the results.

This book presents a general theory of text mining along with the main techniques behind it. We offer a generalized architecture for text mining and outline the algorithms and data structures typically used by text mining systems.

The book is aimed at advanced undergraduate students, graduate students, academic researchers, and professional practitioners interested in complete coverage of the text mining field. We have included all the topics critical to people who plan to develop text mining systems or to use them. In particular, we have covered preprocessing techniques such as text categorization, text clustering, and information extraction, and analysis techniques such as association rules and link analysis. The book tries to blend theory and practice; we have attempted to provide many real-life scenarios that show how the different techniques are used in practice.
When writing the book, we tried to make it as self-contained as possible and have compiled a comprehensive bibliography for each topic so that readers can expand their knowledge accordingly.
Content Overview
This book is a renowned work in the field of text mining, written by world-recognized authorities. It covers core text mining operations, text mining preprocessing techniques, categorization, clustering, information extraction, probabilistic models for information extraction, preprocessing applications, visualization approaches, link analysis, text mining applications, and more, blending the theory and practice of text mining well. The book is well suited for researchers and practitioners in text mining and information retrieval, and also serves as a textbook for graduate courses on data mining and knowledge discovery in computer science and related programs.
About the Authors
Ronen Feldman is a pioneer in machine learning, data mining, and unstructured data management. He is a senior lecturer in the Department of Mathematics and Computer Science at Bar-Ilan University in Israel and director of its Data Mining Laboratory, as well as co-founder and chairman of ClearForest, a company that develops next-generation text mining applications for corporations and government agencies. He is also currently an associate professor at New York University's Stern School of Business.
Table of Contents
Ⅰ. Introduction to Text Mining
  Ⅰ.1 Defining Text Mining
  Ⅰ.2 General Architecture of Text Mining Systems
Ⅱ. Core Text Mining Operations
  Ⅱ.1 Core Text Mining Operations
  Ⅱ.2 Using Background Knowledge for Text Mining
  Ⅱ.3 Text Mining Query Languages
Ⅲ. Text Mining Preprocessing Techniques
  Ⅲ.1 Task-Oriented Approaches
  Ⅲ.2 Further Reading
Ⅳ. Categorization
  Ⅳ.1 Applications of Text Categorization
  Ⅳ.2 Definition of the Problem
  Ⅳ.3 Document Representation
  Ⅳ.4 Knowledge Engineering Approach to TC
  Ⅳ.5 Machine Learning Approach to TC
  Ⅳ.6 Using Unlabeled Data to Improve Classification
  Ⅳ.7 Evaluation of Text Classifiers
  Ⅳ.8 Citations and Notes
Ⅴ. Clustering
  Ⅴ.1 Clustering Tasks in Text Analysis
  Ⅴ.2 The General Clustering Problem
  Ⅴ.3 Clustering Algorithms
  Ⅴ.4 Clustering of Textual Data
  Ⅴ.5 Citations and Notes
Ⅵ. Information Extraction
  Ⅵ.1 Introduction to Information Extraction
  Ⅵ.2 Historical Evolution of IE: The Message Understanding Conferences and Tipster
  Ⅵ.3 IE Examples
  Ⅵ.4 Architecture of IE Systems
  Ⅵ.5 Anaphora Resolution
  Ⅵ.6 Inductive Algorithms for IE
  Ⅵ.7 Structural IE
  Ⅵ.8 Further Reading
Ⅶ. Probabilistic Models for Information Extraction
  Ⅶ.1 Hidden Markov Models
  Ⅶ.2 Stochastic Context-Free Grammars
  Ⅶ.3 Maximal Entropy Modeling
  Ⅶ.4 Maximal Entropy Markov Models
  Ⅶ.5 Conditional Random Fields
  Ⅶ.6 Further Reading
Ⅷ. Preprocessing Applications Using Probabilistic and Hybrid Approaches
  Ⅷ.1 Applications of HMM to Textual Analysis
  Ⅷ.2 Using MEMM for Information Extraction
  Ⅷ.3 Applications of CRFs to Textual Analysis
  Ⅷ.4 TEG: Using SCFG Rules for Hybrid Statistical–Knowledge-Based IE
  Ⅷ.5 Bootstrapping
  Ⅷ.6 Further Reading
Ⅸ. Presentation-Layer Considerations for Browsing and Query Refinement
  Ⅸ.1 Browsing
  Ⅸ.2 Accessing Constraints and Simple Specification Filters at the Presentation Layer
  Ⅸ.3 Accessing the Underlying Query Language
  Ⅸ.4 Citations and Notes
Ⅹ. Visualization Approaches
  Ⅹ.1 Introduction
  Ⅹ.2 Architectural Considerations
  Ⅹ.3 Common Visualization Approaches for Text Mining
  Ⅹ.4 Visualization Techniques in Link Analysis
  Ⅹ.5 Real-World Example: The Document Explorer System
Ⅺ. Link Analysis
  Ⅺ.1 Preliminaries
  Ⅺ.2 Automatic Layout of Networks
  Ⅺ.3 Paths and Cycles in Graphs
  Ⅺ.4 Centrality
  Ⅺ.5 Partitioning of Networks
  Ⅺ.6 Pattern Matching in Networks
  Ⅺ.7 Software Packages for Link Analysis
  Ⅺ.8 Citations and Notes
Ⅻ. Text Mining Applications
  Ⅻ.1 General Considerations
  Ⅻ.2 Corporate Finance: Mining Industry Literature for Business Intelligence
  Ⅻ.3 A "Horizontal" Text Mining Application: Patent Analysis Solution Leveraging a Commercial Text Analytics Platform
  Ⅻ.4 Life Sciences Research: Mining Biological Pathway Information with GeneWays
Appendix A: DIAL: A Dedicated Information Extraction Language for Text Mining
  A.1 What Is the DIAL Language?
  A.2 Information Extraction in the DIAL Environment
  A.3 Text Tokenization
  A.4 Concept and Rule Structure
  A.5 Pattern Matching
  A.6 Pattern Elements
  A.7 Rule Constraints
  A.8 Concept Guards
  A.9 Complete DIAL Examples
Bibliography
Index
Chapter Excerpt
Similarity Functions for Simple Concept Association Graphs

Similarity functions often form an essential part of working with simple concept association graphs, allowing a user to view relations between concepts according to differing weighting measures. Association rules involving sets (or concepts) A and B, described in detail in Chapter II, are often introduced into a graph format in an undirected way and specified by a support and a confidence threshold. A fixed confidence threshold is often not very reasonable because it is independent of the support of the RHS of the rule. Consequently, to be considered interesting, an association should have a significantly higher confidence than the share of the RHS in the whole context. Significance is measured by a statistical test (e.g., t-test or chi-square).

With this addition, the relation given by an association rule is undirected: an association between two sets A and B in the direction A → B also implies the association B → A. This equivalence can be explained by the fact that the construct of a statistically significant association is different from implication (which might be suggested by the notation A → B). It can easily be derived that if B is overproportionally represented in A, then A is also overproportionally represented in B.

As an example of differences between similarity functions, one can compare the undirected connection graphs given by statistically significant association rules with the graphs based on the cosine function. The latter relies on the cosine of two vectors and is efficiently applied to continuous, ordinal, and binary attributes. In the case of documents and concept sets, a binary vector is associated with a concept set, with the vector elements corresponding to documents. An element holds the value 1 if all the concepts of the set appear in the document.
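The interestingness criterion above can be sketched in code. This is not from the book; it is a minimal illustration that uses a one-proportion z-test (the normal approximation underlying the chi-square test on the 2×2 table) to check whether the rule's confidence significantly exceeds the baseline share of the RHS. The function name and the document counts are hypothetical.

```python
from math import sqrt

def rule_is_interesting(n, n_a, n_b, n_ab, z_crit=1.96):
    """Decide whether the association rule A -> B is interesting.

    conf(A -> B) = n_ab / n_a must significantly exceed the share of the
    RHS in the whole context, p0 = n_b / n, under a one-proportion z-test.

    n    -- total number of documents
    n_a  -- documents containing all concepts of A
    n_b  -- documents containing all concepts of B
    n_ab -- documents containing all concepts of both A and B
    """
    conf = n_ab / n_a
    p0 = n_b / n                       # baseline share of the RHS
    se = sqrt(p0 * (1 - p0) / n_a)     # standard error under the null
    z = (conf - p0) / se
    # interesting only if confidence is significantly *higher* than p0;
    # the criterion is symmetric: swapping A and B gives the same verdict
    return z > z_crit
```

For example, with 1,000 documents, 100 containing A, 200 containing B, and 50 containing both, the confidence (0.5) is far above the baseline share (0.2), so the rule is flagged as interesting; with only 22 joint documents (confidence 0.22) it is not.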
Table X.1 (Feldman, Kloesgen, and Zilberstein 1997b), which offers a quick summary of some common similarity functions, shows that the cosine similarity function in this binary case reduces to the fraction formed by the support of the union of the two concept sets and the geometric mean of the supports of the two sets. A connection between two sets of concepts is related to a threshold for the cosine similarity (e.g., 10%). This means that the two concept sets are connected if the support of the document subset that contains all the concepts of both sets is larger than 10 percent of the geometric mean of the support values of the two concept sets.
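The binary-case reduction can be made concrete. The sketch below is not from the book: it represents each document as the set of concepts it mentions (a stand-in for the binary vectors), computes support as a fraction of documents, and connects two concept sets when the cosine similarity exceeds a threshold. The toy corpus and all names are hypothetical.

```python
from math import sqrt

def support(docs, concept_set):
    """Fraction of documents containing every concept in the set."""
    hits = sum(1 for d in docs if concept_set <= d)
    return hits / len(docs)

def binary_cosine(docs, a, b):
    """Cosine of the binary document vectors of two concept sets:
    support of the union over the geometric mean of the two supports."""
    denom = sqrt(support(docs, a) * support(docs, b))
    return support(docs, a | b) / denom if denom else 0.0

# toy corpus: each document is the set of concepts it mentions
docs = [
    {"gold", "inflation"},
    {"gold", "inflation", "oil"},
    {"oil"},
    {"gold"},
]

sim = binary_cosine(docs, {"gold"}, {"inflation"})
connected = sim > 0.10   # connect the concept sets above a 10% threshold
```

Here supp({gold}) = 0.75, supp({inflation}) = 0.5, and supp({gold, inflation}) = 0.5, so the cosine is 0.5 / √(0.75 · 0.5) ≈ 0.816, well above the 10% threshold.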
Media Reviews
"…I bought this book. It is absolutely a reference book well worth owning." — L. Venkata Subramaniam, IBM India Research Laboratory

"An introduction to text mining written by the foremost experts in the field. The book is very well written and perfectly blends the theory and practice of text mining, suiting both researchers and practitioners… Highly recommended for anyone without a background in computational linguistics who wants to dive into the field of text mining." — Rada Mihalcea, University of North Texas

Text mining has become an exciting new research area. Written by world-renowned authorities, this book covers core text mining and link detection algorithms and techniques, as well as advanced preprocessing techniques, knowledge representation considerations, and visualization approaches. It also explores how these techniques are applied in practice, striking a good balance between the theory and practice of text mining.