
SELF-LEARNING INFORMATION EXTRACTION FOR INVOICES IN DOCUMENT MANAGEMENT SYSTEMS

Transcript

1 Faculty of Computer Science, Institute for System Architecture, Chair of Computer Networks. Großer Beleg: SELF-LEARNING INFORMATION EXTRACTION FOR INVOICES IN DOCUMENT MANAGEMENT SYSTEMS. Christian Elmer, Matriculation No.: Responsible university professor: Prof. Dr. rer. nat. habil. Dr. h. c. Schill. Supervised by: Dipl.-Medien-Inf. Klemens Muthmann. Submitted on August 31, 2010




5 DECLARATION OF AUTHORSHIP I, Christian Elmer, hereby declare that I have written the present paper on the topic "Self-learning Information Extraction for Invoices in Document Management Systems" using only the information sources listed in the bibliography. Dresden, August 31, 2010. Christian Elmer


7 TABLE OF CONTENTS
1 Introduction
   1.1 Objectives
   1.2 Structure
2 Basics
   2.1 Information Retrieval
   2.2 Information Extraction
   2.3 Rule-based Information Extraction
   2.4 Classification (Naïve Bayes; Hidden Markov Model)
   2.5 Summary
3 Analysis
   3.1 The Document Corpus
   3.2 Invoice Data
   3.3 Requirements
   3.4 Summary
4 Related Work
   4.1 Structured Content
   4.2 Semi- and Unstructured Content (Named Entity Recognition; Hidden Markov Model)
   4.3 Summary
5 Design
   Rule-based Approach
   Classification (Multi-Class versus Binary Classification; Feature Vector)

8 6 Implementation
   Process Control
   Reading in Documents
   Indexing
   Data Storage
   Rule-based Implementation
   Automatic Classification
7 Evaluation
   Standard Measures
   System Configuration
   Rule-based Approach (Evaluation of Execution Times)
   Automatic Classification (Execution; Evaluation of Execution Times)
8 Summary and Outlook
   Rule-based Design
   Classification
A List of Abbreviations
B Graphics
List of Figures
List of Tables

9 1 INTRODUCTION Today, documents such as invoices or delivery notes are increasingly stored digitally in order to ensure long-term archiving and to allow specific documents to be found via keyword search. If documents in paper form are stored in folders, a search is only possible by creation date, as these documents are usually sorted only chronologically. A different or additional sorting is hardly possible, especially for incoming documents. A document management system that manages large digital document collections can in principle offer these capabilities. For a search based on index data to be possible in such document collections, identification features such as an invoice number or the name of the biller must be extracted from the digitized documents and saved in a structured form. Since the indexing mechanisms of today's search engines do not consider the context of specific document types, these classic information retrieval methods cannot be used satisfactorily in companies. Large document bases in companies can therefore only be made searchable through manual indexing. Manual indexing, however, means a lot of work, because a user has to search each document digitally on screen, or search the paper original, for the required indices. Since the documents are of different types, or of the same type but with different structure, the position of an index must be localized and transcribed anew for each document; the structure of the last edited document does not help the user examine the next one. Partial automation of the indexing of specific document types can considerably reduce this high manual effort. 1.1 OBJECTIVES With the Intellix project (intelligent, adaptive indexing of heterogeneous document bases), the Chair of Computer Networks at the Technical University of Dresden and DocuWare AG, Germering, have set themselves the goal of developing a flexibly configurable solution for the

10 partially automated indexing of heterogeneous document bases. [BSS10] The aim of this thesis is to develop an automatic indexing method for documents and to evaluate it in a test implementation. The document class invoice serves as the template and test object for the analysis and design as well as for the evaluation. This concretization of the general problem of indexing documents is made possible by a further sub-task of the Intellix project, document type classification: depending on the document type, a different extraction plan, as shown in Figure 1.1, can be applied. Suppose a company has a document base made up of different types of documents, such as delivery notes, invoices and contracts. First, metadata and the raw text are extracted from these documents by a text recognition program. A type classification is then carried out with the help of the raw text, the metadata and the scanned original documents. The indexing can finally be carried out with an extraction plan specially adapted to that type. Figure 1.1: Intellix concept as a process model, source: [BSS10] 1.2 STRUCTURE After the basics of information retrieval, information extraction and classification are covered in Chapter 2, Chapter 3 analyzes the invoices at hand with regard to the problem and takes a closer look at the properties of the invoice data. Chapter 4 presents methods for information extraction from structured and unstructured content drawn from related scientific work, on which a solution to the problem of this thesis can be built. Based on these findings, Chapter 5 designs an application that enables the automatic indexing of invoices. The core of this application, finding and extracting the required indices, is first developed with a rule-based approach. Such an approach is usually tied to a specific domain; it is therefore difficult to adapt and cannot be optimized through interaction with the user. For this reason, a machine learning process is also integrated into the design, which can be adapted to new types of documents through interaction with the user.

11 The design is implemented using the Unstructured Information Management Architecture (UIMA) and the Weka machine learning framework. This implementation, described in Chapter 6, is required in order to evaluate the techniques used. In the evaluation, described in Chapter 7, invoices from a test data set are indexed and the results are compared with a gold standard.


13 2 BASICS Documents from the corpus of invoices considered in this work should be retrievable via search queries using information retrieval methods. To accomplish this, the invoices must be indexed by information extraction and stored in a structured form, for example in a database. Certain documents can then be found again using search queries. Indexing can be carried out either rule-based or by automatic classification. With rule-based algorithms, rules are created manually, for example using a text pattern, and compared with the document to be indexed. Automatic classification is implemented using learning algorithms: a program learns from experience and can thus adapt its behavior to the observed environment. Applied to the task at hand, a classifier first analyzes a certain number of documents and, on the basis of this analysis, can extract the required information from these and similar documents. Furthermore, this classifier can keep learning from additional training examples through interaction with its environment, i.e. the user of the application. The book by Nils J. Nilsson [Nil97] gives an introduction to the subject of machine learning. In this chapter, the terms information retrieval and information extraction are first defined and their differences clarified. Then two methods for extracting information from documents are presented: rule-based text extraction on the one hand and automatic classification on the other. Two concrete classification methods are discussed, the Naïve Bayes classifier and the Hidden Markov model. 2.1 INFORMATION RETRIEVAL Information retrieval (IR) refers to the retrieval of required information from unstructured or semi-structured documents within large, mostly digital, document collections. [MRS09, p. 1] Before information retrieval can be applied, which in this problem setting means the retrieval of invoices

14 by means of search queries, the invoices must be indexed using information extraction methods. IR is usually equated with a content-oriented search in documents. [Fuh04, p. 7] For this type of search one can differentiate between the following levels of abstraction: Syntax: a search is made for certain features (successive symbols, structure in a graphic, ...). Semantics: the meaning of an object is considered. Pragmatics: the use of an object for a specific purpose is considered. Most users require a pragmatic way of working from IR systems, since a user is often looking for information for a specific purpose. When indexing invoices, one mainly follows the syntactic approach, since documents are searched for using character strings, such as an invoice number, which rarely appear in a textual context. As a result, there is no need for semantic analyses in the information extraction process to find the invoices again. A pragmatic way of working is largely given, because clearly defined indices, such as the invoice date or the address data of the sender, are specified for the invoices, so that search queries mostly return relevant information. One of the given indices, the keywords (distinctive words in an invoice), differs from the other indices: when extracting keywords from a document, it can be advantageous to also consider the terms of the other documents in the corpus. A method suitable for extracting keywords is presented next. Vector space model: In a classic retrieval model, the Boolean retrieval model, it is examined whether a term contained in a query occurs in a document or not. This procedure finds documents in which the sought term does or does not exist, but there is no guarantee that the documents found are relevant. The vector space model is used to rank the result set of a query by relevance. Weighted terms are a prerequisite for this. The term frequency, the relative frequency of a term in a document, is one way of giving terms a weight. In the vector space model, a vector is spanned over all weighted terms of a document and compared with a query vector using a similarity function that calculates the distance between the two vectors. In this way, the documents found have a certain relevance with regard to the search query, depending on the weighting of the query terms in these documents. The term frequency is used in a weighting method called term frequency-inverse document frequency (TF-IDF): the term frequency indicates the relative frequency of a term in a document, and the inverse document frequency measures the general importance of the term for the entire document corpus. This method can also be used to determine the importance of a found document in relation to the entire corpus. [Fer03, p. 61 ff.] The TF-IDF method can be used to extract keywords from invoices and to find invoices again using keywords, as it has proven itself in these areas of application. [Fer03, p. 71]
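To make the weighting scheme concrete, here is a minimal TF-IDF sketch in Java. The raw relative term frequency, the smoothed logarithmic IDF and the tiny corpus are illustrative assumptions, not the exact variant from [Fer03] or the thesis prototype.

```java
import java.util.*;

public class TfIdf {
    // Relative frequency of a term within one document.
    static double tf(List<String> doc, String term) {
        long count = doc.stream().filter(t -> t.equals(term)).count();
        return (double) count / doc.size();
    }

    // General importance of the term for the whole corpus:
    // log(N / number of documents containing the term); +1 smooths the denominator.
    static double idf(List<List<String>> corpus, String term) {
        long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
        return Math.log((double) corpus.size() / (1 + docsWithTerm));
    }

    static double tfIdf(List<String> doc, List<List<String>> corpus, String term) {
        return tf(doc, term) * idf(corpus, term);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("invoice", "number", "4711", "total", "amount"),
                List.of("delivery", "note", "number", "4712"),
                List.of("invoice", "date", "total"));
        // Terms frequent in one invoice but rare in the corpus score highest.
        System.out.println(tfIdf(corpus.get(0), corpus, "amount")); // candidate keyword
        System.out.println(tfIdf(corpus.get(0), corpus, "number")); // common term, score 0
    }
}
```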

15 2.2 INFORMATION EXTRACTION In order to be able to find documents again, they have to be indexed by means of information extraction (IE), as mentioned in Section 2.1. An example of this is the cataloging of books and journals in a library: the media are indexed via title, authors, ISBN numbers, keywords and, in some cases, the content, and stored in a relational database. One can proceed in a similar way with the invoices processed in this thesis. Features that are as unique as possible for an invoice, such as the invoice number or the invoice date, are extracted and saved as attribute-value pairs. Indexing: A process that extracts certain data from unstructured and semi-structured documents and saves it in a structured form is called indexing. The typical process of indexing documents looks like this according to [Fuh04]: 1. Combine all documents in one document corpus. 2. Linguistic processing of the words (for example by stemming, i.e. reducing the different morphological variants of a word to their common word stem). 3. Extract certain data as indices from the documents by means of information extraction. Combining the documents into a document corpus means that the various types of documents are scanned and, if possible, saved with a unique identification number as the file name. The text and some metadata are then extracted from the scanned documents; if documents are already available in text format, such as emails, this last step is not necessary. In the second processing step, a spell check can be applied to the text documents, among other things. This has the advantage that text recognition errors can be recognized and eliminated automatically. In the context of this work, the first point is already partially given: the documents are already in digitized form, but the text still has to be extracted with a text recognition program. 2.3 RULE-BASED INFORMATION EXTRACTION Rule-based information extraction uses techniques such as pattern matching and the construction of semantic rules. Following [Gri97], Figure 2.1 shows an approach to automatically extract information from natural language text using rules. Lexical analysis breaks down the entire document into sentences and tokens (mostly words). Each token can then be compared with a dictionary in order to determine whether it is a word from the language of that dictionary. Next,

16 the Named Entity Recognition (NER) tries to identify entities such as names, places, dates or amounts of money. In the following step, pattern recognition, specific character strings are either recognized using manually defined regular expressions, or methods are used that recognize relationships between entities: for example, from a phrase of the form "<token> resigns as <token>", the first token can be identified as a person and the second as a position. The subsequent coreference analysis tries to assign different words appearing in different sentences or parts of sentences to the same object; for example, if the word "he" appears in a passage, an attempt is made to identify the person in the text who is meant by it. Finally, an attempt is made, by means of logical inferences, to recognize events whose partial information is distributed over several sentences. For simple, semi-structured use cases, rule-based text extraction is a worthwhile approach, because regular expressions can be formed relatively well due to the partially existing structure in the document. On the other hand, the rule-based approach is not very flexible: if the structure or even the language of the documents changes, the rules must be changed too. Regular expressions are therefore often tied to a specific domain. This work requires information extraction from one specific type of document, namely invoices. This reduction of the document corpus, and the restriction that invoices in companies are usually available only in the respective national language and in English, make the task a special domain. Rule-based information extraction can therefore be applied to it. 2.4 CLASSIFICATION Classification deals with the problem of automatically assigning objects to a predefined set of classes. It is a machine learning process and is usually implemented using learning algorithms. [MRS09, Fer03] According to [Fer03], a classification is defined as follows: Definition 1 (classification): Let O be a set of objects. A decomposition of $O = C_1 \cup C_2 \cup \dots \cup C_n$ into pairwise disjoint subsets or classes ($C_i \cap C_j = \emptyset$ for all $i, j \in \{1, \dots, n\}, i \neq j$) is called a classification of the objects from O. Strictly hierarchical classification systems are formed by decomposing the classes of a coarser classification further into disjoint subsets. This definition describes the division of objects into classes. When classifying documents, a distinction is made between text classification, the assignment of documents to predefined classes, and text clustering, the unsupervised grouping of documents. This thesis deals exclusively with text classification, since the assignment of terms from a document to given indices is a special form of text classification. In text classification, one typically works on a set of documents $D = \{d_1, d_2, \dots, d_n\}$ and a set of classes $C = \{c_1, c_2, \dots, c_n\}$. Furthermore,

17 [Figure 2.1: A rule-based IE architecture for unstructured text. Pipeline: document → local text analysis (lexical analysis, named entity recognition, pattern recognition) → discourse analysis (coreference analysis, inferences) → pattern generation → extracted patterns.]

18 there is a feature vector $X_d = (x_1\ x_2\ \dots\ x_n)^T$, which describes the various features of the respective document d. The manual assignment of features and classes to the documents of the training set yields a classification function $\gamma$ that maps the feature vector to the classes: $\gamma: X \to C$ (2.1). A document that is not included in the training set can then be assigned to a class using this function. The classification of documents takes place in two steps. The first step consists of training the classifier, in which it builds up a concept description from a training data set. This process is called supervised learning, because the classifier is told which class each example belongs to. In the second step, examples that have not yet been classified are assigned to a class using the concept description of the classifier. Multi-class versus binary classification: There are two ways to define the number of classes. One possibility is to define a separate classifier for each class. Such a classifier knows exactly two classes: one class is the type of object sought, and the second class covers all other objects. This method is called binary classification and is shown in Figure 2.2 (a). The other possibility is to define a single classifier that contains all possible classes. This method is called multi-class classification and is shown in Figure 2.2 (b). The binary classifier is frequently used and easy to understand, but with this method only the so-called one-versus-all approach is possible, in which the attributes of the sought object are compared with the attributes of all other objects. With multi-class classification, the all-versus-all approach is also possible, in which the attributes of pairwise disjoint objects are compared with one another. [HPRZ03] For the indexing described in Section 2.2, it is natural to use binary classification: for each index, the two classes are the membership and non-membership of a term in this index. This has the advantage that independent attributes can be defined for each index. [Figure 2.2: Possible division of the classes: (a) binary classification of objects $o_k$ from the object set O; (b) multi-class classification of objects $o_k$ from the object set O into classes $c_1, \dots, c_7$.]
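A minimal sketch of this one-binary-classifier-per-index setup follows. The index names and the hand-written decision rules are invented stand-ins; in the design, trained classifiers would take the place of the predicates.

```java
import java.util.*;
import java.util.function.Predicate;

public class BinaryIndexClassifiers {
    // One binary classifier per index: "belongs to this index" vs. "everything else".
    // Real classifiers would be trained (e.g. Naive Bayes); these rules are stand-ins.
    static Map<String, Predicate<String>> classifiers = Map.of(
            "invoiceNumber", t -> t.matches("[A-Z0-9][A-Z0-9/-]*\\d[A-Z0-9/-]*"),
            "invoiceDate",   t -> t.matches("\\d{1,2}\\.\\d{1,2}\\.\\d{2,4}"),
            "amount",        t -> t.matches("\\d{1,3}(\\.\\d{3})*,\\d{2}"));

    public static void main(String[] args) {
        List<String> tokens = List.of("Rechnung", "RE-2010/471", "vom", "31.08.2010",
                                      "Betrag", "1.190,00");
        // Each token is judged independently by every binary classifier.
        for (String token : tokens)
            classifiers.forEach((index, c) -> {
                if (c.test(token)) System.out.println(token + " -> " + index);
            });
    }
}
```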

19 The next two sections describe two concrete learning methods for classification: on the one hand Naïve Bayes, a simple classifier with a naive assumption that nevertheless generally delivers good results; on the other hand the Hidden Markov Model, a stochastic model which, in contrast to Naïve Bayes, also takes the structure and content of documents into account. Further learning methods are the support vector machine and the k-nearest-neighbor algorithm. Naïve Bayes: The Naïve Bayes classifier is one of the Bayesian learning methods, which are based on Bayes' theorem. Bayes' theorem for two random events A and B reads: $P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$ (2.2). The probabilities $P(A)$ and $P(B)$ are a-priori probabilities, whereas $P(A|B)$ and $P(B|A)$ are a-posteriori probabilities. In order to classify an example document $d_i$ with the feature vector $x_i$, the a-posteriori probability of a class $c_i$ is of interest. Inserting $c_i$ and $x_i$ as events in (2.2) yields: $P(c_i|x_i) = \frac{P(x_i|c_i)\,P(c_i)}{P(x_i)}$ (2.3). Since the best class for the feature vector $x_i$ of the example document $d_i$ is sought, i.e. the most likely or maximum a-posteriori (MAP) class $c_{map}$, one obtains the following simplification, in which the denominator $P(x_i)$ can be dropped because it is the same for all classes [MRS09]: $c_{map} = \arg\max_{c_i} P(c_i|x) = \arg\max_{c_i} P(x|c_i)\,P(c_i) = \arg\max_{c_i} P(x_1, x_2, \dots, x_n|c_i)\,P(c_i) = \arg\max_{c_i} P(c_i) \prod_j P(x_j|c_i)$ (2.4). The assumption made in the last step is that all features are conditionally independent of each other, which is somewhat naive. For natural language text this assumption is wrong, since in reality there are syntactic, semantic and thematic dependencies between words at different levels. Nevertheless, this classifier works quite well, since it only outputs the class with the maximum a-posteriori probability, not the probabilities themselves.
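As an illustration of the MAP decision rule (2.4), the following is a minimal Naïve Bayes sketch in Java. The two classes, the three binary features and all probability estimates are invented for this example; they are not the feature set of the design in Chapter 5.

```java
public class NaiveBayesSketch {
    // Two classes for one index (binary classification, cf. Section 2.4) and three
    // invented binary features, e.g. "contains digits", "contains a period",
    // "preceded by a date keyword". All probabilities below are made up.
    static final String[] CLASSES = {"invoiceDate", "other"};
    static final double[] PRIOR = {0.1, 0.9};        // P(c), estimated from training data
    static final double[][] P_FEATURE = {            // P(x_j = 1 | c)
            {0.95, 0.90, 0.20},                      // c = invoiceDate
            {0.30, 0.10, 0.05}};                     // c = other

    // c_map = arg max over c of P(c) * prod_j P(x_j | c), cf. equation (2.4).
    static String classify(boolean[] x) {
        int best = -1;
        double bestScore = -1;
        for (int c = 0; c < CLASSES.length; c++) {
            double score = PRIOR[c];
            for (int j = 0; j < x.length; j++)
                score *= x[j] ? P_FEATURE[c][j] : 1 - P_FEATURE[c][j];
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return CLASSES[best];
    }

    public static void main(String[] args) {
        // Token "31.08.2010": contains digits and a period, no keyword before it.
        System.out.println(classify(new boolean[]{true, true, false})); // invoiceDate
    }
}
```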

20 What is decisive for the quality of the Naïve Bayes classifier is namely not the correctness of the a-posteriori probabilities, but its low variance. [FGG97] In the development of anti-spam filters, different variants of the Naïve Bayes classifier are very popular; they are used in open source projects as well as in commercial products, although there are better classifiers in the area of text classification, such as boosting or support vector machines. The reasons for choosing this classifier lie in the simplicity of its algorithm, which makes it easy to implement, in its linear complexity, and in an accuracy that can compete with sophisticated classifiers when it comes to filtering spam. [MAP06] Since the invoices in this problem consist of little natural language text, and the indices to be extracted are usually not found in such text sections, the Naïve Bayes classifier is suitable for indexing this document class. Hidden Markov Model: The Hidden Markov Model (HMM) describes a two-stage stochastic process. The first stage is a finite automaton with states from the state set S and edges between states that are labeled with transition probabilities from the set A. In the second stage, an output is additionally generated for each state from the set B of emission probability distributions. [Fin03] Definition 2 [Rie09] gives a formal description of an HMM; Figure 2.3 illustrates this definition with a simple example. Definition 2 (Hidden Markov Model): A (discrete) Hidden Markov Model $\lambda = (S, A, B, \pi, V)$ is a stochastic model described by a Markov process (Markov chain) and an emission process. Here $S = \{s_1, \dots, s_n\}$ is the set of hidden states, $A = \{a_{ij}\}$ the state transition probabilities with $s_i \xrightarrow{a_{ij}} s_j$, $B = \{b_1, \dots, b_n\}$ the set of emission probability distributions, $\pi$ the initial probability distribution, and $V$ the feature space or output alphabet. The first-order Markov chain with the state set S and the discrete time steps $t = 0, 1, 2, \dots$ describes a stochastic process X through a sequence of random variables $X_1, X_2, \dots, X_n$ with $P(X_{t+1} = s_{j_{t+1}} \mid X_t = s_{j_t}, X_{t-1} = s_{j_{t-1}}, \dots, X_0 = s_{j_0}) = P(X_{t+1} = s_{j_{t+1}} \mid X_t = s_{j_t})$. The main area of application of HMMs is pattern recognition, which is why they are considered in this paper: indexing documents also involves recognizing semantic and syntactic relationships in the text of these documents. Especially in Named Entity Recognition (NER), HMMs have proven to be a useful extraction method (see Section 4.2.2).
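As a sketch of how such a model can be used for tagging, the following Java fragment runs the Viterbi algorithm on a toy HMM with two hidden states. The states, the observation alphabet and every probability are invented for illustration; they are not taken from [Rie09] or [Fin03].

```java
public class ViterbiSketch {
    // Hidden states S, transitions A, emissions B, initial distribution pi
    // (cf. Definition 2); all numbers are invented.
    static final String[] STATES = {"DATE", "OTHER"};
    static final double[] PI = {0.2, 0.8};
    static final double[][] A = {{0.3, 0.7}, {0.1, 0.9}};   // A[i][j] = P(s_j | s_i)
    // Observation alphabet V: 0 = token containing digits, 1 = plain word.
    static final double[][] B = {{0.9, 0.1}, {0.05, 0.95}}; // B[i][v] = P(v | s_i)

    // Most probable hidden state sequence for the observed token sequence.
    static String[] viterbi(int[] obs) {
        int n = STATES.length;
        double[][] d = new double[obs.length][n];
        int[][] back = new int[obs.length][n];
        for (int i = 0; i < n; i++) d[0][i] = PI[i] * B[i][obs[0]];
        for (int t = 1; t < obs.length; t++)
            for (int j = 0; j < n; j++)
                for (int i = 0; i < n; i++) {
                    double p = d[t - 1][i] * A[i][j] * B[j][obs[t]];
                    if (p > d[t][j]) { d[t][j] = p; back[t][j] = i; }
                }
        // Backtrack from the most probable final state (two states here).
        String[] path = new String[obs.length];
        int best = d[obs.length - 1][0] > d[obs.length - 1][1] ? 0 : 1;
        for (int t = obs.length - 1; t >= 0; t--) {
            path[t] = STATES[best];
            best = back[t][best];
        }
        return path;
    }

    public static void main(String[] args) {
        // "Rechnung vom 31.08.2010" -> word, word, digit-token
        System.out.println(String.join(" ", viterbi(new int[]{1, 1, 0})));
        // prints: OTHER OTHER DATE
    }
}
```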

21 [Figure 2.3: Example of an HMM with three hidden states $s_1, s_2, s_3$, transition probabilities $a_{11}, a_{12}, \dots, a_{33}$, initial probability $\pi_1$, and visible emissions governed by $b_1, b_2, b_3$.] 2.5 SUMMARY In this chapter, some methods for the information retrieval process were discussed. For the procedure for indexing documents presented here, two basic methods were introduced: on the one hand the manual creation of rules, which are then used for automatic text extraction, and on the other hand automatic classification. Classification is a machine learning process that creates a classification function from a training data set in order to classify further documents. However, classification can also be used to classify terms or phrases of a document instead of the documents themselves. For classifying semi-structured documents that contain little natural language text, the Naïve Bayes classifier presented in Section 2.4 is best suited. Since invoices are exactly such documents, this classification method is used in the design.


23 3 ANALYSIS The source documents are incoming invoices, which are mostly available as images in PDF format. The text of these invoices is extracted by optical character recognition (OCR) software and is then available as unstructured text. The following sections first describe the properties of the document corpus and then analyze the invoice data to be extracted. In the last section, a gold standard is determined based on the text recognition errors of the OCR software. This gold standard is required in order to assess the practical suitability of the design. 3.1 THE DOCUMENT CORPUS The document corpus, which was provided by DocuWare AG, contains 41 documents, all of them invoices. The invoices are available as scanned images in PDF or TIF format and as text files generated by a text recognition program from DocuWare AG. 38 of the invoices come from Germany, 2 from the USA and 1 from China. The situation is similar for the languages of the documents: 37 of the invoices are in German and 4 in English. There are 26 different invoice issuers among the 41 invoices; for 9 invoices the sender is DocuWare AG itself. 3.2 INVOICE DATA In order to find certain invoices in a large document collection of scanned invoices using information retrieval (IR) methods, suitable indices must be found for the user to search for. These indices are specified by the DocuWare Document Management System (DMS). Since

24 the documents are invoices from different billers, they were created according to different format templates and thus have different structural features. Nevertheless, there are general properties and patterns by which the invoice data can be identified. The following sections present these properties and patterns of the invoice data to be extracted in this thesis. Invoice date: The invoice date indicates the date on which the invoice was issued. Depending on the region, a date in a public document has different formats; for example, the typical date format in German-speaking countries is day.month.year, and in the United States of America it is month/day/year. An invoice can contain other dates, for example the date by which the invoice has to be paid or on which certain goods were delivered. The position at which a date stands in the invoice can indicate whether it is the invoice date or not: when looking through the training documents, it was found that the invoice date is usually the first occurrence of a date in an invoice, as it typically stands in the invoice header. Invoice number: There is no format template for the invoice number that would make it distinguishable from other alphanumeric combinations. The invoice number can even contain special characters such as a hyphen or a slash. Paragraph 14 (4) of the German Sales Tax Act only states that the invoice number "[...] is a consecutive number with one or more series of numbers that is assigned once by the invoice issuer to identify the invoice. [...]" So, at least in the German legal area, the only requirement is that the invoice number must contain a series of digits; any number of letters and special characters may also appear in it. When looking through the training documents, it is noticeable that the invoice number is always preceded by a word beginning with the character string "Rechnung" (invoice). This connection can help to extract the invoice number. Sender: The sender data consists of the full name and full address of the entrepreneur or company providing the service. The address is subdivided into the street including house number, or a PO box, and the place including the postcode, with the postcode always preceding the place. Within an invoice, the name and address of the biller are always in the letterhead, either directly above the recipient data or in a separate area at the top right. Thus, the data of this index can be found in the upper part of the invoice.
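The heuristics just described for the date and the invoice number, together with the amount heuristic detailed in the following section (the largest extracted amount is most likely the invoice total), can be sketched with regular expressions. All concrete strings and the expressions themselves are illustrative; they are not the rule set of the prototype.

```java
import java.util.regex.*;

public class InvoiceRules {
    // German date format day.month.year and US format month/day/year.
    static final Pattern DATE = Pattern.compile(
            "\\b(\\d{1,2}\\.\\d{1,2}\\.\\d{2,4}|\\d{1,2}/\\d{1,2}/\\d{2,4})\\b");
    // Heuristic from the training documents: a word starting with "Rechnung"
    // precedes the invoice number, which must contain at least one digit.
    static final Pattern NUMBER = Pattern.compile(
            "Rechnung\\S*\\s+(?:Nr\\.?\\s*)?([A-Za-z0-9][A-Za-z0-9/.-]*\\d[A-Za-z0-9/.-]*)");
    // Amount with a comma as decimal separator and optional dot thousands
    // separators (see Section 3.2, invoice amount); other variants omitted.
    static final Pattern AMOUNT = Pattern.compile("\\b\\d{1,3}(?:\\.\\d{3})*,(?:\\d{2}\\b|-)");

    public static void main(String[] args) {
        String text = "Rechnungs-Nr. RE-2010/471 vom 31.08.2010 "
                + "Pos. 1: 190,00 Pos. 2: 1.000,00 Gesamtbetrag: 1.190,00 EUR";

        Matcher m = DATE.matcher(text);
        if (m.find()) System.out.println("invoice date:   " + m.group(1)); // first date wins

        m = NUMBER.matcher(text);
        if (m.find()) System.out.println("invoice number: " + m.group(1));

        m = AMOUNT.matcher(text);
        String best = null;
        double max = -1;
        while (m.find()) { // the largest amount is most likely the invoice total
            double v = Double.parseDouble(
                    m.group().replace(".", "").replace(",", ".").replace("-", "0"));
            if (v > max) { max = v; best = m.group(); }
        }
        System.out.println("invoice amount: " + best); // 1.190,00
    }
}
```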

25 Invoice amount: There is a finite number of formatting options for amounts of money. Since almost all common currencies are based on the decimal system, amounts are either whole numbers or contain a decimal separator followed by exactly two digits or a hyphen. The decimal separator is either a comma or a period. For larger amounts, thousands separators can appear; such a separator can be a comma, a period, a space or an apostrophe. Several amounts of money can appear in one invoice, for example because individual items are listed. If the largest of all extracted amounts is selected, it is very likely to be the total invoice amount. When looking through the training documents, it is noticeable that the invoice amount is usually preceded by a word containing the character string "Betrag" (amount). Usually, a currency code follows an amount of money in documents; if this were also the case in invoices, the currency code could be used to identify the invoice amount. When looking through the invoices, however, one notices that the currency abbreviation usually stands in the header of a table and therefore cannot be directly assigned to the invoice amount. Sales tax ID: The sales tax identification number (UID) is a unique EU-wide identification of a person liable for sales tax. It is generated in each country of the European Union according to a specific pattern and assigned to a company, which makes it easy to identify a UID in an invoice. Keywords: Keywords prepare documents for a general search; they are words that are characteristic of individual invoices. In the present invoices, these keywords are mostly specific invoice items or descriptions of services provided by the invoice issuer. 3.3 REQUIREMENTS For fully automated indexing, the accuracy and completeness of the results would have to be 100%, since missing or incorrectly extracted indices require additional interaction with the user. However, since the text files of the invoices contain errors introduced by the OCR software, the requirement for completeness is reduced in this work: the completeness of the developed design cannot be better than the share of indices recognized without errors by the text recognition. Table 3.1 shows the result of a manual analysis of the text recognition errors. The values are divided into two tables, because the automatic classification does not extract the entire address but individual elements of it, namely the name of the biller, the street of his address and the city including the postcode.

26 Table 3.1: Error rate in text recognition

Rule-based:
Index     Error rate   Error-free
Address   0.27         0.73
Number    0.17         0.83
Amount    0.07         0.93
Date      0.07         0.85
UID       0.10         0.93
Total     0.14         0.86

Classification:
Index         Error rate   Error-free
Company name  0.10         0.90
Street        0.10         0.90
Postcode      0.17         0.83
Number        0.17         0.83
Amount        0.07         0.93
Date          0.07         0.85
UID           0.10         0.93
Total         0.11         0.89

The column "Error-free" indicates how much of the respective index was correctly recognized by the text recognition program. This value can be taken as the gold standard for completeness, since it indicates the maximum share of indices that automatic indexing can extract without additional OCR error correction. The gold standard for the accuracy of automatic indexing, however, remains 100%, because all error-free indices should be detected. 3.4 SUMMARY Although rule-based indexing provides reasonably good accuracy on the present corpus, accuracy and completeness on a larger collection of documents are likely to be relatively low. The regular expressions were developed on a relatively small set of invoices, and since this training set is quite homogeneous (many German-language invoices and few different billers), it cannot be assumed that similar results will be achieved on a larger, heterogeneous collection. The biggest problem lies in regional dependencies: some rules use certain words to determine an index, for example a word such as "Rechnung" (invoice) to identify the invoice number that follows, so a new regular expression would have to be designed and implemented for every new invoice language. The rule-based approach therefore covers only a certain domain. Automatic classification can help here.

27 4 RELATED WORK This chapter describes possible approaches to information extraction (IE) from differently structured content. Work on IE from documents with structured content [AGMU03] as well as with semi-structured and unstructured content [KJM01] is considered. For IE from semi-structured and unstructured content, automatic classification using the Naïve Bayes algorithm is applied. These techniques can be used in this thesis, since the invoices themselves are semi-structured but do not follow common format templates: invoices from different billers have different structures. In contrast, some generally applicable properties and patterns of the individual invoice data, as described in Section 3.2, can be recognized. Parts of the invoices can also consist of unstructured text, such as a payment request in the form of natural language text. In this context, papers dealing with Named Entity Recognition (NER) [Rie09, Fae07, BWPH06] are also presented, since some invoice data consist of proper names, such as the sender data. The Hidden Markov Model (HMM) is a suitable algorithm for implementing NER; a paper on this algorithm [BSW99] is presented as well. 4.1 STRUCTURED CONTENT Due to the increasing amount of information on the Internet in recent years, the need for targeted information procurement from this medium is also growing. Many websites are generated according to user needs; they therefore follow a very specific format template with changing content, i.e. they contain structured content. The work by Arasu and Garcia-Molina [AGMU03] deals with this topic, information extraction from structured data on websites. It discusses the problem of automatically extracting, with an unsupervised learning method, information that appears on dynamically generated web pages but originally comes from databases. The basic idea is that dynamically generated websites are structured according to a certain scheme. It is assumed that a value x,

28 which comes from a database, is inserted into the website by means of a pattern T (template). This encoding of x into the pattern T is described by a function of T and x and is illustrated in Figure 4.1. The idea is now to determine the pattern T of a dynamically created website; with the help of T one can then extract the value x. Figure 4.1: Creation of dynamic websites, source: [AGMU03] If company invoices were available in such a structured form, which would be the case if they were created according to a given, known pattern, this approach could be used. However, since the invoices available for this work are not generated according to a uniform pattern, other methods have to be found: either one can still find a certain structure in the existing invoices, which can lead to problems if invoices later appear that do not correspond to this pattern, or one uses other methods, such as IE from semi-structured and unstructured documents, to extract the required data.
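A toy version of this template idea can illustrate why a known pattern T makes extraction trivial. The "{x}" placeholder syntax and the helper below are assumptions for this sketch, not the algorithm of [AGMU03], which learns the template without supervision.

```java
import java.util.regex.*;

public class TemplateExtraction {
    // If documents are generated by inserting a value x into a fixed template T,
    // then knowing T lets us recover x by matching everything else literally.
    static String extract(String template, String document) {
        // Quote the template literally, then reopen the regex at the placeholder.
        String regex = Pattern.quote(template).replace("{x}", "\\E(.+?)\\Q");
        Matcher m = Pattern.compile(regex).matcher(document);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String template = "<tr><td>Price:</td><td>{x} EUR</td></tr>";
        String document = "<tr><td>Price:</td><td>1190.00 EUR</td></tr>";
        System.out.println(extract(template, document)); // 1190.00
    }
}
```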

29 In the second example, the Naïve Bayes and the probabilistic indexing algorithm [FB91] were tested, and it was concluded that the latter delivers better results. The task of identifying and processing change-of-address (CoA) emails is divided into two sub-tasks. First, emails are divided into CoA and non-CoA messages, so the messages themselves have to be classified. In the second step, the new address is extracted from the CoA messages. For both classifying the messages and extracting the address, two further techniques are added: on the one hand, word phrases consisting of two or three words are formed, such as "new address"; on the other hand, the words before and after a found address are also considered. This window of seven words before the address and three words after it is intended to help extract the new address (a sketch of this windowing idea follows at the end of this section). With an overall accuracy of 96%, the implementation of this example is very reliable. Named Entity Recognition: Named Entity Recognition (NER) describes a method for recognizing named entities in unstructured text. Named entities are relevant atomic pieces of information that are assigned to defined categories and concepts [Rie09]. In his work, Rieger [Rie09] develops a rule-based approach to NER in which knowledge from databases is used so that rules can be created automatically. This knowledge includes both the database schema and the associated instance data. Previous approaches to automatic rule creation were mostly limited to certain domains and therefore cannot easily be transferred to other domains; Rieger instead developed a complete automation of rule creation based on structured data. His algorithm achieved very good results in closed domains with few entities. Only larger domains with many entities are problematic, since different entities can be similar and their patterns then collide. In the paper by Fäßler [Fae07], a NER system called JULIE Named Entity Tagger (JNET) is presented. This NER system, originally conceived exclusively for use on biomedical texts, works with the machine learning library MALLET [McC] and a configurable feature set, which makes it possible to use JNET on other domains as well. JNET is integrated into the Unstructured Information Management Architecture (UIMA) and thus uses the flexibility that this framework offers for extracting information from unstructured data. In evaluations on various document bases, F-measures of 0.819 to 0.845 were determined. Using JNET would be one possibility to solve part of the task of this thesis, for example the extraction of the sender's address. Buyko, Wermter, Poprat and Hahn [BWPH06] also deal with NER on a biomedical domain. They modify the open source framework OpenNLP Tools, which provides Java classes for Natural Language Processing (NLP). OpenNLP Tools comprises five main NLP components: sentence recognition, a tokenizer, a POS tagger (part-of-speech tagger, the assignment of words in a text to parts of speech), chunking (the analysis and breakdown of sentences into blocks) and a parser. All these components fall back on the same machine learning approach, the maximum entropy method.
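To make the context-window feature from [KJM01] concrete, here is a minimal sketch. The window sizes (seven tokens before, three after) are taken from the description above; the token list and everything else are illustrative assumptions.

```java
import java.util.*;

public class ContextWindow {
    // Seven tokens before and three after a candidate, as in the CoA example above.
    static List<String> window(List<String> tokens, int pos) {
        int from = Math.max(0, pos - 7);
        int to = Math.min(tokens.size(), pos + 4);
        List<String> context = new ArrayList<>(tokens.subList(from, pos));
        context.addAll(tokens.subList(pos + 1, to));
        return context; // feature input for the classifier, e.g. Naive Bayes
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("please", "note", "our", "new", "address",
                                      "Main", "Street", "5", "12345", "Springfield");
        System.out.println(window(tokens, 7)); // context around the candidate "5"
    }
}
```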

30 4.2.2 Hidden Markov Model Bikel, Schwartz and Weischedel [BSW99] present an algorithm called IdentiFinder™, which extracts names, dates, times and numerical entities with the help of an HMM. This is achieved by first setting up word classes based on sample information (see Table 4.1), which are then used to classify running text word by word. IdentiFinder™ focuses on natural language text. Since invoices contain very little natural language text, this algorithm will not be able to deliver optimal results on the task presented here; however, an HMM adapted to the problems of this work could give satisfactory results.

Table 4.1: Word classes, examples and meaning according to [BSW99]

Word feature            Example text            Intuition
twoDigitNum             90                      two-digit year
fourDigitNum            1990                    four-digit year
containsDigitAndAlpha   A8956-67                product code
containsDigitAndDash    09-96                   date
containsDigitAndSlash   11/9/89                 date
containsDigitAndComma   23,000.00               monetary amount
containsDigitAndPeriod  1.00                    monetary amount, percentage
otherNum                456789                  other number
allCaps                 BBN                     organization
capPeriod               M.                      person name initial
firstWord               first word of sentence  no useful capitalization information
initCap                 Sally                   capitalized word
lowerCase               can                     uncapitalized word
other                   ,                       punctuation marks, all other words

4.3 SUMMARY In summary, among the methods for information extraction presented here, automatic classification using the Naïve Bayes classifier delivers satisfactory results. Better results could, however, be achieved with a Hidden Markov model adapted to the specific application, or with a probabilistic indexing algorithm in the case of semi-structured and unstructured texts. In the area of Named Entity Recognition, extraction of entities from unstructured texts with the help of structural data can deliver satisfactory results, but these deteriorate as soon as heterogeneous databases with many entities are involved. The problem presented here, however, must build on the last-mentioned prerequisites.