A humble introduction to Machine Learning, Information Extraction, and Bootstrapping Method

Nghia Bui, Saigon, Christmas 2016

Machine Learning

From the earliest days of computers, people have wondered whether computers can be programmed to learn. If we could make them learn, the effects would be remarkable. For example, ML is frequently used in cancer diagnosis and detection. Cruz and Wishart [CW06] showed that ML methods can substantially (15-25%) improve the accuracy of predicting cancer susceptibility, recurrence, and mortality. Indeed, it was recently reported [Ng16] that IBM’s Watson gave the proper diagnosis for a Japanese leukemia patient after doctors had been stumped for months. The supercomputer, which sifted through 20 million cancer research papers, found the proper diagnosis within 10 minutes and also suggested a new treatment that has since proved more effective.

With this impression of the applications of ML in mind, let us continue with the formal definitions and basic concepts of this subfield of computer science.

Definitions and Basic Concepts

One common definition of machine learning comes from the computer gaming pioneer Arthur Samuel (1901-1990), cited in [Sim13, p.89] as “machine learning is a field of study that gives computers the ability to learn without being explicitly programmed”. This definition, however, does not define what “learn” means and may not offer a clear understanding of “without being explicitly programmed”. A more concrete and formal definition is provided by Tom Mitchell [Mit97, p.2]:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Since there is no restriction on the triple <E, T, P>, the definition is so broad that it covers many instances of machine learning, including the most difficult ones and, surprisingly, also the most trivial ones. For example, consider the learning of a face recognition system:

  • Experience E: a set of facial pictures, each labeled with a person’s name.
  • Task T: given a set of unlabeled facial pictures, recognize the person’s name for each.
  • Performance measure P: the more correct recognitions, the better.

This is clearly a very difficult learning problem. Indeed, someone without any knowledge of ML would have no idea how to make computers recognize the face in any given picture from a set of labeled pictures. Intuitively, we can say that the learning of a face recognition system is very much “machine learning”. Consider, however, another example provided by Mitchell [Mit97, p.5], a database system that allows users to update data entries:

  • Experience E: database updates.
  • Task T: answer database queries.
  • Performance measure P: correctness of the answers.

This learning is quite straightforward and doesn’t seem to be “machine learning” at all, yet it still fits the definition. This demonstrates that machine learning should not be understood in the usual informal, conversational sense of the word “learning”. Rather, it is about the improvement of computers through experience.

It is the introduction of experience (alternatively called training data or a training dataset) that lets Mitchell’s definition explain well both what “learn” means (the use of training data to improve performance at tasks) and what is meant by “without being explicitly programmed” (training data are not explicit program instructions).

The definition also allows us to derive a useful definition of machine learning problems and machine learning methods:

A machine learning problem is a problem that needs to be solved by humans: it is about designing a machine learning method, which involves choosing a set of training data E, designing an algorithm to infer some knowledge K, and determining the way computers use K to improve their performance at a given class of tasks T, measured by a given performance measure P.

What is new in this definition is the introduction of the knowledge K. This is by no means redundant. In fact, most ML methods focus mainly on inferring K from E. The other methods may not really involve knowledge inference, but we can consider the knowledge inferred in those methods to be trivial, i.e., it merely memorizes the training data.

With the concept of knowledge K in place, the definition directly points out some essential steps involved in designing any ML method. Notice that the first two steps, discussed below, are not necessarily executed in a rigid order. Normally we move back and forth between them, because the training data E and the type of knowledge K are tightly coupled: the choice of E heavily affects the choice of K, and vice versa, as we will see.

Choosing a set of training data E. Obviously, we should prefer to choose E so that the inference of K is as easy as possible. But in reality there are not many choices; we are usually provided with one or several fixed types of training data. In many common cases we also have to choose a significant volume of training data so that K is comprehensive enough to perform well on unseen data. Unseen data (also called subsequent data) are the data that computers have to handle after learning, i.e., when performing the class of tasks T. Some types of training data involve the concept of noise; in this case, the amount of noisy data chosen should be minimized. Fortunately, there are excellent learning methods that still work quite well with noisy training data. One such method, for example, is Artificial Neural Networks [Mit97, p.81].

Identifying the type of knowledge. We have to identify exactly what type of knowledge (sometimes called a target function, rule, or pattern, when appropriate) needs to be inferred from the training data E. But unlike with E, with K we can freely choose among many options. Typically, K is chosen to be relatively close to the class of tasks T and the performance measure P. In other words, the performance P of doing T should be straightforwardly improved by using K. This also means that K is often not close to E. In fact, the inference from E to K is usually the most central and difficult issue in ML. The next two steps concern this issue.

Choosing the representation of K. In order to infer the knowledge K, we must first have its representation, a rudimentary shape. For example, if the training dataset E is a set of L vectors X[1], X[2] ... X[L], in which each vector has N dimensions and is labeled by a real value Y[i], and if K is a knowledge that tells a real value from any given N-dimensional vector, then K could be represented as a linear function: f(v[1], v[2] ... v[N]) = w[0] + w[1]*v[1] + w[2]*v[2] + ... + w[N]*v[N]

Designing an algorithm to infer K. The representation of K is just its primitive shape, and hence can leave unknown parameters. Now our job is to search for the values of those parameters that make the knowledge K explain the training data E well; in other words, that make K best fit E. Perfectly fitting is usually impossible, so the best we can do is approximate K. Returning to the example above, w[i] (i=0..N) are the unknown parameters, and the function f will best fit E when f(X[k]) ≈ Y[k] ∀ k=1..L. Each set of specific values for the parameters (w[0], w[1] ... w[N]) is called a hypothesis. The set of all possible hypotheses is called the hypothesis space. As we can see in this (not uncommon) example, the hypothesis space is extremely large since the number of hypotheses is infinite.
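The two steps above can be sketched in a few lines of Python. This is a minimal illustration, with a tiny invented dataset, of representing K as a linear function and searching the hypothesis space for the best-fitting weights via ordinary least squares:

```python
import numpy as np

# Hypothetical training data E: L = 4 vectors with N = 2 features,
# each labeled with a real value Y[k] (all values invented for illustration).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
Y = np.array([5.0, 4.0, 11.0, 10.0])

# Representation of K: f(v) = w[0] + w[1]*v[1] + ... + w[N]*v[N].
# Prepend a column of ones so w[0] is handled like the other weights.
A = np.hstack([np.ones((X.shape[0], 1)), X])

# Search for the hypothesis (w[0] ... w[N]) that best fits E,
# here by minimizing the sum of (f(X[k]) - Y[k])^2.
w, *_ = np.linalg.lstsq(A, Y, rcond=None)

def f(v):
    """The inferred knowledge K: predict a real value for any vector v."""
    return w[0] + np.dot(w[1:], v)

print(f([2.0, 2.0]))
```

With this particular data the fit is exact (the labels were generated by a linear function), so f reproduces every Y[k]; with real, noisy data the fit would only be approximate, as the text notes.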

Determining the use of K to improve performance P at T. Most ML methods do not concern themselves with this step, simply because K was already chosen to be close to T and P, so the use of K should be straightforward. Interestingly, however, there is an exception. In a special class of ML methods called instance-based learning, K and E are the same: K merely memorizes E, and nothing is actually inferred. When the subsequent data come, their relationship to the whole training set is examined in order to solve the tasks T. The two most popular instance-based learning methods are nearest neighbor and locally weighted regression. These methods assume that data can be represented as points in Euclidean space, so points located in the same small region will tend to have a close relationship. Instance-based learning and these two common methods are best explained in [Mit97, p.230].
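The nearest-neighbor idea can be sketched very compactly. In this toy example (all points and labels invented for illustration), K simply memorizes E, and a subsequent point is classified by examining its Euclidean distance to the whole training set:

```python
import math

# Instance-based learning: the "knowledge" K is just the training set E,
# points in Euclidean space together with their labels.
E = [((0.0, 0.0), "No"), ((0.0, 1.0), "No"),
     ((5.0, 5.0), "Yes"), ((6.0, 5.0), "Yes")]

def nearest_neighbor(x):
    """Solve task T for a subsequent point x: return the label of the
    memorized example closest to x (1-nearest-neighbor)."""
    return min(E, key=lambda example: math.dist(example[0], x))[1]

print(nearest_neighbor((5.5, 4.0)))
print(nearest_neighbor((0.5, 0.5)))
```

Note that nothing is inferred at "training time"; all the work happens when a subsequent point arrives, which is exactly what distinguishes instance-based learning from the linear-function example above.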

Another important concept in ML is inductive bias. Given a learning method that infers a knowledge K from E, it is typically very difficult, even impossible, to reason about whether the performance at tasks in T, measured by P, is improved or not. A clear example: even when the knowledge K best fits the training data E, there is no assurance that K will also fit the subsequent data, since those data can be arbitrary. Therefore, we need inductive bias.

The inductive bias of a machine learning method <E, K, T, P> is a set of additional assumptions sufficient to justify the improvement at the class of tasks T, measured by P, when using K.

Obviously, understanding the inductive bias involved in designing a ML method is important since it helps us judge the performance of our method. The ML method in the example above, where we assume that the knowledge K is a linear function (not a nonlinear one), is clearly biased. This type of bias is called restriction bias, since we restrict the hypothesis space from the full one (including both linear and nonlinear functions) to a smaller one (including only linear functions) before the search process starts. There is another type of bias, called preference bias, in which we search the whole hypothesis space but apply heuristics that tend to prefer certain hypotheses over others. Decision tree learning [Mit97, p.52], a famous learning method, is an example of this type of inductive bias. More on the types and examples of inductive bias can be found in [Ham14].

Overfitting is another important concept mentioned in most ML literature. Generally, we say that a hypothesis overfits the training data if there exists another hypothesis that fits the training data less well but actually performs better on the subsequent data. A typical symptom of overfitting is when a knowledge begins to memorize (fit 100%) the training data rather than being inferred/generalized from those data. Clearly, an overfitting knowledge has very poor performance because it does not work, or works inefficiently, on subsequent data. Therefore, understanding overfitting and how to prevent it is a critical mission in many ML methods. A thorough presentation of overfitting and of solutions for avoiding it is given in [Mit97, p.67].

Common Types of Problems

Categorizing ML problems into common types is useful, because:

  • It supports human communication: given a common name for the type of problem under discussion, it is easier for people to convey their messages.
  • It suggests possible methods: for each common type, various methods have been developed specifically for it; therefore, knowing the type of a problem gives us a valuable clue for solving it.

Due to the variety of <E, K, T, P> and the relationships between them, there are many ways to categorize ML problems. However, most of the literature in the field mentions the categorization based on the training data E and on the class of tasks T. This is reasonable because a ML problem usually begins with the consideration of E and T.

While the categorization based on training data divides ML problems into a few limited types such as supervised learning, unsupervised learning, semi-supervised learning, and weakly-supervised learning, the categorization based on tasks is more abundant; the most common types of tasks are classification, regression, clustering, and association.

Interestingly, the two categorizations are not totally independent. For example, if a problem is supervised learning, then it is usually also classification or regression, but not clustering. We will understand more about this connection in the following sections where these common types of problems are elaborated.

Supervised Learning

This is the most popular type of problem. In supervised learning, the training dataset E is a set of examples, each of which is a pair of an input and an output; the task T is implied by the examples: predict the output of any given input, whether seen or unseen.

Generally, there is no restriction on the structures and values of inputs and outputs. Often, however, the output values are either discrete or continuous, and all inputs share a uniform structure: a vector of features (or attributes), where the value of each feature is also either discrete or continuous. This means the training data in a common supervised learning problem can be represented in table form, as illustrated in the table below.

A training dataset in supervised learning:

  Sky    | AirTemp | Wind   | EnjoySport
  -------|---------|--------|-----------
  Sunny  | Warm    | Strong | Yes
  Rainy  | Cold    | Strong | No
  Sunny  | Warm    | Weak   | Yes

This training dataset provides examples of how the conditions of Sky, AirTemp, and Wind affect the decision of EnjoySport. The header row, excluding the last cell, gives the names of the three features: Sky, AirTemp, and Wind; the last cell gives EnjoySport, the name of the output. Each remaining row represents an example: the first three cells contain the values of the three features of the input, respectively, and the last cell holds the output value. All of these values are discrete: Sunny, Rainy, Warm, Yes, No … are discrete values. Task T in this case, of course, is to classify any given input, which is a triplet such as <Sunny, Cold, Strong>, into an output value of either Yes or No.

When the output values are discrete like this (Yes/No) [1], they are alternatively called class labels, and the problem is called classification. When the values are continuous, e.g., real values ranging from 0.0 to 1.0 that represent the probability of the decision to enjoy sport, we say that the problem is regression.

This type of ML problem is called “supervised” because there is a “supervisor” who supervises the “learner” (the computer program under construction) by providing clear examples in which the input of every example is labeled with an output. The next type of problem we will discuss, where there is no such supervisor, is unsupervised learning.

Unsupervised Learning

The examples contained in the training data of unsupervised learning are not labeled at all, and thus there are no concepts of input and output. For this reason, the task T can no longer be classification or regression. Rather, it is about finding regularities in the training data; in other words, inferring a function to describe hidden structure from unlabeled data.

One particularly common task is clustering. The objective is to group “similar” examples in the training data together. This requires measuring the distances between examples; hence the clustering problem is usually solved by applying the methods of instance-based learning [Mit97, p.230].
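The grouping-by-distance idea can be sketched with a minimal k-means loop. This is an illustrative toy (the points and the simple first-k initialization are invented for this sketch), not a production clustering algorithm:

```python
# Unlabeled 2-D points forming two obvious groups (invented for illustration).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]

def kmeans(points, k, iters=20):
    # Initialize centroids at the first k points (a deliberately simple choice).
    centroids = list(points[:k])
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: (p[0] - centroids[j][0]) ** 2
                                + (p[1] - centroids[j][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [(sum(p[0] for p in c) / len(c),
                      sum(p[1] for p in c) / len(c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

for cluster in kmeans(points, k=2):
    print(cluster)
```

On this well-separated data the loop converges to the two tight groups; note that no labels were used anywhere, which is what makes this an unsupervised task.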

Association is also a popular task in unsupervised learning. Its goal is to find the association rules hidden in the training data by discovering patterns across the examples. One such rule, to name but one, is: when the Sky is Rainy, it is very likely that the AirTemp is Cold; in this case we say that there is an association rule between the two features Sky and AirTemp.

Nevertheless, unsupervised learning is an advanced topic and is not the focus of this article. More on the topic can be found in [HS99] and [BvLR04, p.72].

Semi-supervised Learning

Labeling data in practice requires a lot of human effort, since most (if not all) supervised learning methods only work well with a significant amount of labeled data. This raises the need for semi-supervised learning, which typically uses a small amount of labeled data and either (1) requires a human in the loop to give labels for additional examples proposed by the learner, or (2) requires a large amount of unlabeled data.

The requirement of human involvement in the loop is the characteristic of active learning. This type of learning has the advantage that we humans don’t have to figure out and prepare a plentiful amount of labeled training data. Instead, it is the job of the learner to propose the extra data that really need to be labeled. Note that the amount of extra data should be relatively small; otherwise it defeats the advantage of active learning. Indeed, identifying those extra data is the most important concern in research on active learning. One such study is presented in [Ton01].

The latter approach in semi-supervised learning, which makes use of a large amount of unlabeled data, is more attractive for several reasons. Firstly, in practice, acquiring unlabeled data is relatively inexpensive. Secondly, unlike active learning, which essentially still uses only labeled training data and thus is best suited for the tasks of supervised learning (e.g., classification and regression), a learning method whose training dataset is partly labeled and partly unlabeled can solve the tasks of both supervised and unsupervised learning.

Regarding the tasks involved in unsupervised learning (e.g., clustering and association), it is clear that unlabeled data, when used in conjunction with a small amount of suitable labeled data, can produce considerable improvement in learning accuracy. For example, in clustering, even a small volume of labeled data can offer valuable hints for grouping similar examples, without the need to measure similarities/distances. [CSZ06] presents deep research on this subject.

What about the use of both labeled and unlabeled data in the tasks of supervised learning? This type of problem is so particular that it is categorized into a separate type, so-called weakly-supervised learning.

Weakly-supervised Learning

Given just a small amount of labeled data and a large amount of unlabeled data, how can we solve the tasks of classification/regression? The state-of-the-art approach to this situation is the bootstrapping method.

In a nutshell, a bootstrapping method gradually grows the labeled data by labeling the unlabeled examples, picking the best candidates, and appending them to the labeled set. This process is iterated until the accuracy of the learner reaches a predefined threshold. We have to accept that the accuracy of the bootstrapping method (and of weakly-supervised learning in general) cannot match that of supervised learning, where labeled data are plentiful. Therefore, the bootstrapping method typically requires a human at the end to remove inaccurate (wrongly labeled) results.

There are two typical instances of the bootstrapping method, which differ in the step of labeling the unlabeled examples: the expanding method and the self-learning method.

The expanding method is inspired by instance-based learning. Each example in the unlabeled set is given the label of the “closest” example in the labeled set. Just as in instance-based learning, no knowledge is inferred in the expanding method; hence the task here is merely labeling the unlabeled data [2].

The self-learning method, on the other hand, tries to infer a knowledge [3] from the labeled set, just as a typical supervised learning method does, and then applies that knowledge to the unlabeled set in order to assign labels to the unlabeled examples. In this manner, the inferred knowledge is continuously refined after every iteration. Though the newly inferred knowledge may assign different labels to previously labeled examples, re-labeling those examples is optional.
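The self-learning loop can be sketched on a toy 1-D dataset (all points, labels, and the centroid-based “knowledge” are invented for this sketch). Each iteration infers K from the labeled set, picks the most confidently labeled unlabeled example as the best candidate, and grows the labeled set:

```python
# Weakly-supervised setting: a small labeled set and a larger unlabeled set.
labeled = [(1.0, "No"), (9.0, "Yes")]
unlabeled = [2.0, 8.5, 3.0, 7.0, 5.2]

def infer_knowledge(labeled):
    """Infer K from the labeled data: the mean point of each class."""
    centroids = {}
    for _, label in labeled:
        xs = [x for x, lab in labeled if lab == label]
        centroids[label] = sum(xs) / len(xs)
    return centroids

while unlabeled:
    K = infer_knowledge(labeled)            # re-infer K every iteration
    # Confidence of an example = closeness to its nearest class centroid.
    best = max(unlabeled, key=lambda x: -min(abs(x - c) for c in K.values()))
    label = min(K, key=lambda lab: abs(best - K[lab]))
    labeled.append((best, label))           # append the best candidate
    unlabeled.remove(best)                  # and iterate

print(sorted(labeled))
```

Because the centroids shift as the labeled set grows, borderline points (such as 5.2 here) are only labeled late, after the refined knowledge has absorbed the easier examples; this is the gradual growth the text describes.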

The bootstrapping method and its two instances still leave a lot of room for research, especially in the context of specific domains. For example, what is the appropriate measure of similarity in the expanding method? How is the knowledge inferred in the self-learning method? How are the best candidates selected so that we do not add noise to the labeled set? How can we know whether the accuracy of the learner has reached a predefined threshold?

We will continue to review the bootstrapping method, but in the specific context of Information Extraction, which is the topic of the next section.

Information Extraction

How can we make computers extract useful messages or facts conveyed in free text? If we could answer this question, the impact would be incredible. Think about Google Search: this system must be able to acquire what the public text on the Internet is about in order to index that text and give correct responses to search queries. As another example, consider the automatic recognition of terrorism events from secretly obtained documents, where a government organization is interested in identifying the main actors of an event and the location and time at which the event will happen.

The two examples above not only give us a first glance of what Information Extraction (IE) is (the study of finding facts in free text) but also demonstrate the motivations and challenges of this exciting research subject. Google Search is nowadays clearly an essential part of the Internet; one cannot imagine what life would be like without the world’s best search engine. But as an effective search engine [4], Google Search has been facing the rapid proliferation of text available in a myriad of repositories on the Internet. Likewise, detecting terrorism events is a critical mission, given that many governments these days are trying to combat the escalation of terrorism. However, even in a limited domain like this, IE is still a non-trivial task due to the complexity and ambiguity of natural languages. There are many ways to express the same fact, which can be distributed across multiple sentences and documents, as discussed in [HYG02].

This has resulted in a growing need for effective and efficient techniques for analyzing free text, and has led to the emergence of IE technologies.


We did not manage to find any formal definition of IE. Nonetheless, most of the literature in this area, e.g., [PSPY13, p.23], elegantly states that “information extraction is about deriving structured factual information from unstructured text”. This definition forces us to understand what “unstructured text” and “structured factual information” are.

Unstructured text typically refers to free text in the form of natural language (e.g., English). Of course, such text still has its own structure, constrained by the rules and grammar of natural languages. But that structure is neither strict nor uniform enough to be directly used by standard data processing such as database population, data mining, or even the Machine Learning methods described in the previous sections. The high degree of freedom in natural languages also sometimes makes it difficult for humans to grasp the facts hidden in unstructured text. In contrast, structured factual information consists of facts in a clear, strict, and predefined form, e.g., database records, and thus can easily be used by both humans and computers for searching and discovering knowledge in the free text from which the information was derived.

Notice that IE is not merely about converting text. Although it does convert text from unstructured form into structured form, that is not the main point. The main point lies in the term “extraction”, which also means the process of identifying relevant things and ignoring irrelevant ones.

Basic Tasks

Depending on the types of structured factual information that need to be extracted, we have different IE tasks. Typically, we are interested in three types of facts: entities, relationships, and events. This leads to four basic tasks in IE, described as follows:

  • Named Entity Recognition (NER) is about identifying and classifying instances of predefined types of named entities, such as organizations (e.g., “World Trade Organization”), persons (e.g., “Alan Kay”), place names (e.g., “New York”), temporal expressions (e.g., “19 September 2016”), numerical and currency expressions (e.g., “100 million dollars”), and so on. This task can also include extracting additional information for the detected entities following a template. For example, in the case of persons, it may include extracting the title, position, nationality, gender, and other attributes of the person.
  • Co-reference Resolution (CO) addresses the problem of identification of multiple (co-referring) mentions of the same entity. Entity mentions can be (1) named (e.g., “World Trade Organization” and “WTO” may refer to the same real-world entity); or (2) pronominal (e.g., in the text “Ricky was a young boy, he had a heart of stone”, the pronoun “he” refers to “Ricky”); or (3) nominal (e.g., in the text “Apple is the world’s largest information technology company. The company employs 115,000 permanent full-time employees.”, the definite noun phrase “The company” refers to “Apple”).
  • Relation Extraction (RE) requires detecting and classifying predefined relationships between entities identified in text. For example, employeeOf(Bill Gates, Microsoft) is a relation between a person and an organization, extracted from the sentence “Bill Gates works for Microsoft”; or liveIn(Trump, New York) is a relation between a person and a location, extracted from the text “The permanent residence of Mr. Trump is in New York”.
  • Event Extraction (EE) is the task of identifying events and also the details about them, ideally identifying who did what to whom, when, where, through what methods, and why.

NER is considered the most basic and common task of IE. Some literature even refers to IE merely as NER, e.g., [Fre00]. EE, on the other hand, is the most complicated task, since it often involves all three of the other tasks and thus deserves to be called the hardest of the four basic IE tasks.

Evaluation Metrics

In many cases, the performance of an IE system is not difficult to evaluate. Given an input text, as long as the content of the text is well understood by humans, the expected output of the system can be defined very precisely.

For example, consider testing a NER system where the task is to find location names in several paragraphs. Since the input text here is relatively small, by manual scanning we can determine that the expected output is the list of exactly five cities {“New York”, “London”, “Paris”, “Berlin”, “Tokyo”}. If the output list produced by our system includes irrelevant names like “Moscow” or “Hanoi”, then clearly its correctness is poor. On the other hand, if that list lacks some expected names such as “New York” or “London”, we say that the completeness of our system is low.

Hence, correctness and completeness are the two common evaluation metrics in IE. Actually, these metrics were adopted from the Information Retrieval research community, where correctness and completeness are formally called precision and recall, respectively. Let us denote:

  • key as the number of names in the expected output (5 in the example).
  • correct as the number of relevant detected names.
  • incorrect as the number of irrelevant detected names.

We have:

precision = correct / (correct + incorrect)

recall = correct / key

As we can see, recall is measurable only when key is available. This means that when the input text under testing is so big that we cannot know the expected output exactly, precision is the only metric that can be used to evaluate our system.

There are cases where combining the two metrics above into a single one is desired. One popular combination is the f-measure:

F = ( (β^2+1) * precision * recall ) / ( β^2 * precision + recall )

Here β is a non-negative value that reflects the relative importance of recall with respect to precision. If β = 1.0, the two are equally important; a lower value of β gives more importance to precision. A suitable value of β is chosen depending on the specific situation.

While all the metrics discussed so far measure goodness, there are other metrics concerning badness. One example is the metric called slot error rate (SER [MKSW99]), defined as:

SER = (incorrect + missing) / key

where missing is the number of relevant names lacking from the output list. Obviously, the lower the value of SER, the better the performance of our system.
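The formulas above are simple enough to compute directly. This sketch uses hypothetical counts for the city-name example (the expected output has key = 5 names; the counts of 4 correct and 2 incorrect detections are invented for illustration):

```python
# Hypothetical evaluation counts for the NER example above.
key, correct, incorrect = 5, 4, 2
missing = key - correct

precision = correct / (correct + incorrect)   # correctness
recall = correct / key                        # completeness

def f_measure(precision, recall, beta=1.0):
    """Combine the two metrics; beta < 1 gives precision more weight."""
    return ((beta ** 2 + 1) * precision * recall) / (beta ** 2 * precision + recall)

ser = (incorrect + missing) / key             # slot error rate

print(precision, recall, f_measure(precision, recall), ser)
```

With these counts, precision is 4/6 ≈ 0.667, recall is 4/5 = 0.8, and SER is (2 + 1)/5 = 0.6, illustrating how precision and SER penalize the irrelevant detections while recall penalizes the missing one.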

Of course, in more complex tasks of NER (e.g., extracting not only city names but also additional information about population, area, …) as well as in the other three basic tasks, all the metrics above have to be implemented in a more sophisticated way, but the principles of correctness and completeness still apply.

From Knowledge Engineering to Trainable Systems

In order to understand how an IE system is basically built, let us again consider the example of extracting locations from free text. If we were in the position of the builder, what should we do?

Dictionaries of Words and Extraction Patterns

Intuitively, we must first have a list of popular locations, i.e., a dictionary of words in that specific domain, e.g., {“Hanoi”, “Saigon”, “Bangkok”, “Jakarta” …}, so that we can simply scan the input text for occurrences of those words. This, however, raises two difficult problems, which, interestingly, relate to completeness and correctness:

  • The set of relevant words might be open-ended. Regarding proper nouns such as city names, one may list all the cities in the world; but what about common nouns and phrases such as “office”, “downtown”, “northwest region” …? Clearly, it is impossible to enumerate by hand all the words that might describe locations, and therefore the completeness of our system is affected.
  • Words in the dictionary might have various meanings. In natural languages, this situation is very common. For example, the word “Washington” denotes not only a state of America but also a person. Hence, if the input text contains “Washington was born into the provincial gentry of Virginia” and this is the only occurrence of “Washington” in the text, then that word will be wrongly detected as a location just because it is included in the dictionary. This wrong detection, of course, reduces the correctness of our system.

To cope with the two problems above, another dictionary, the dictionary of extraction patterns, is needed. Basically, an extraction pattern expresses a specific context: every segment of text matched by this context is considered relevant, and hence the information extracted from the match is also considered relevant. To allow this matching and extraction, every pattern contains some slots, or placeholders. For example, the pattern “live in X” has one slot, “X”, and expresses the context of a location; thus, a segment of text such as “live in downtown” is matched, and the location “downtown” is extracted.
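One simple way to realize such a pattern is as a regular expression whose capture group plays the role of the slot X. This is a toy illustration (the regular expression and the sample sentence are invented for this sketch; real extraction patterns usually operate on syntactic analyses rather than raw strings):

```python
import re

# The pattern "live in X": the capture group (\w+) is the slot X.
pattern = re.compile(r"live[sd]? in (\w+)")

text = "Many people live in downtown, but Mr. Trump lives in Manhattan."
locations = pattern.findall(text)
print(locations)
```

Both “downtown” and “Manhattan” are extracted, even though neither needs to appear in the dictionary of words; this is exactly how patterns address the open-endedness problem.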

It is easy to see how extraction patterns address the two problems above. For the first problem, noun phrases outside the dictionary of words (e.g., “downtown”) can still be discovered by extraction patterns. For the second problem, extraction patterns strictly specify the meanings of the extracted words.

But the use of the dictionary of extraction patterns does not make the dictionary of words redundant. In fact, both dictionaries are necessary for measuring the reliability of the extracted information: information extracted by both dictionaries is preferred.

Good dictionaries also contain negative elements in order to reduce the risk of wrong detection. For example, a dictionary of location words might include {“dog”, “cat” …} as negative words; likewise, a dictionary of location patterns can contain “X was born” as a negative pattern so that the word “Washington” in the sense of a person will not be extracted.

Knowledge Engineering: the Manual Construction of Dictionaries

In the early days of IE (the late 1980s), all efforts at developing IE systems focused on the manual construction of dictionaries. This construction is called knowledge engineering (KE) and is performed by both domain experts and system engineers. The knowledge here refers to dictionaries of words and extraction patterns. The engineering is usually done iteratively, starting with a small body of knowledge which is tested on the available text and extended until a desired trade-off between correctness and completeness is reached.

Although the systems built with this approach had a major drawback, a lack of portability (knowledge engineered for a specific domain and language cannot easily be adapted to others), they did demonstrate that the approach may be sufficient for solving real-world IE tasks that are narrow in scope.

The subsequent efforts of the community in the 1990s aimed to improve the portability of KE-based systems, but concerned system design and architecture rather than algorithms and techniques. The resulting architecture basically applied the principle of separation of concerns [Par72] to create more modular IE systems. The idea is simple: general-purpose knowledge and domain- and language-specific knowledge are carefully separated into their own components, respectively called the generic core engine and the knowledge bases. In this manner, adapting to new domains and languages can be done just by plugging in new knowledge bases. The FASTUS system described in [HAB+97] is an excellent example of such a flexible IE system.

By the turn of the century, although the modular design of IE systems had been widely accepted, the process of handcrafting the knowledge bases still remained a very time-consuming and difficult task. This stimulated research on trainable IE systems that apply Machine Learning (ML) techniques to mitigate some of that burden.

Trainable IE Systems: the Automatic Construction of Knowledge Bases

The construction of knowledge bases becomes an ML problem if we replace the term “constructing” with “inferring”. In ML terms, we would say a trainable IE system is a system that is able to infer knowledge from training data. That knowledge, of course, is used by the system to improve its performance on its IE tasks, as measured by its evaluation metrics.

Training data in the context of IE are special: they are usually represented not in table form but in annotation form. Thus, the IE community refers to training data as annotated training data. Those data are just text (so they can alternatively be called annotated text) augmented with annotations at suitable positions following a certain format. For example, in the simple annotated text “Washington[person] was born into the provincial gentry of Virginia[location]”, the annotations “[person]” and “[location]” denote that the two preceding words are a person (“Washington”) and a location (“Virginia”), respectively.
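The bracketed format above can be parsed mechanically; this small sketch assumes the example's own single-word, single-tag convention (real annotation schemes are richer):

```python
import re

# Matches "word[label]" pairs, e.g. "Washington[person]".
ANNOTATION = re.compile(r"(\w+)\[(\w+)\]")

def parse_annotations(text):
    """Return the (word, label) pairs found in an annotated text."""
    return ANNOTATION.findall(text)

sample = ("Washington[person] was born into the provincial "
          "gentry of Virginia[location]")
print(parse_annotations(sample))
# [('Washington', 'person'), ('Virginia', 'location')]
```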

By convention, the term “annotated text” generally refers to data in either the annotation form or the table form. Notice, however, that the two forms are different. Indeed, certain applications of ML methods in IE require casting annotated data into the table form, because those methods were designed to work with that form only. This casting is called feature engineering, a non-trivial and labor-intensive task.

Unsurprisingly, the first efforts at applying ML techniques to IE focused on supervised learning, which already offered a broad array of tools and algorithms. The motivation of this trend, of course, is to shift the human effort in the construction of knowledge bases away from knowledge engineering and toward annotating training data, which serves as input to many available supervised learning methods. This encourages further modularity in IE system development, since annotating training data, in principle, requires only domain experts, who can be viewed as end-users of the systems. In other words, without the need for system engineers, an end-user is still able to create and customize modules for an IE system by providing annotated training data.

It is true that supervised learning methods were successfully applied in many notable IE systems. To name but a few: Nymble [BMSW97] is a NER system adopting the well-known statistical Hidden Markov Model (HMM), and Conditional Random Fields, which address some issues of HMMs, have been applied as a state-of-the-art supervised approach in IE [Set04]. However, these systems still suffered from the inherent problem of supervised learning: the bottleneck of human effort in labeling/annotating training data. In the area of IE, this bottleneck is even more severe due to the complicated and ambiguous characteristics of natural languages.

This led to research on applying semi-supervised learning methods to IE, which remains an interesting research area to this day. On the one hand, active learning, which requires a human in the loop, found its application in the IE system described in [JGMR03]. On the other hand, given that unannotated text is quite abundant nowadays, the bootstrapping method, which does not require a human in the loop but does require a large amount of unannotated text, has been getting more attention lately.

Bootstrapping Method

In the context of IE, the bootstrapping method refers to an algorithm that starts from a small set of annotated text and gradually grows that set by annotating a large set of unannotated text, choosing some of the best annotations, and adding them to the annotated set. The growth stops when a certain condition is met.
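The loop just described can be sketched generically as follows; `annotate` and `score` are hypothetical stand-ins for whatever system-specific components produce and rank candidate annotations, and the stopping condition here is simply an iteration cap or an exhausted pool:

```python
def bootstrap(annotated, pool, annotate, score, k=5, max_iters=100):
    """Generic bootstrapping loop: annotate the unannotated pool using
    the current annotated set, keep the k most confident annotations,
    and grow the annotated set with them."""
    annotated = list(annotated)
    pool = list(pool)
    for _ in range(max_iters):   # stopping condition: iteration cap...
        if not pool:             # ...or nothing left to annotate
            break
        # Annotate every pool item using the current annotated set.
        candidates = [(ex, annotate(ex, annotated)) for ex in pool]
        # Keep only the k best-scoring annotations this round.
        candidates.sort(key=lambda c: score(c[1], annotated), reverse=True)
        best, rest = candidates[:k], candidates[k:]
        annotated.extend(ann for _, ann in best)
        pool = [ex for ex, _ in rest]
    return annotated
```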

This section covers several notable pieces of research on the bootstrapping method that are most related to our work.

Yarowsky 1995

Probably the earliest and most famous bootstrapping approach in IE is the Yarowsky algorithm [Yar95], which solves a special task: word sense disambiguation. Given an ambiguous word (e.g., “plant”) and its two meanings (e.g., “A: living thing” and “B: manufacturing institution”), a few seed annotated examples for each meaning are chosen. The aim is to classify a large pool of unannotated examples into one of those meanings (A or B). This is illustrated in the table below.

An illustration of training data in the Yarowsky algorithm

For each unannotated example, the algorithm scans the list of annotated examples to find the most similar one and assigns its meaning to the unannotated example. This requires a good measure of similarity between examples. In Yarowsky's observation, two examples are similar when they share at least one context word. Say we are considering the unannotated example “animal and plant tissues”; then the most similar annotated example would be “by rapid growth of aquatic plant and animal life” due to the shared context word “animal”. Hence the meaning of “plant” in “animal and plant tissues” would be “A: living thing”.
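A toy sketch of this shared-context-word heuristic, assuming naive whitespace tokenization; Yarowsky's actual measure is quantitative (collocation statistics), and the second seed sentence below is an invented illustration:

```python
# Seed annotated examples: sentence -> sense of "plant".
seed = {
    "by rapid growth of aquatic plant and animal life": "A",
    "plant closure announced by the company": "B",
}

def classify(example, target="plant"):
    """Assign the sense of the seed example sharing the most context words."""
    context = set(example.split()) - {target}
    best_sense, best_overlap = None, 0
    for annotated, sense in seed.items():
        overlap = len(context & (set(annotated.split()) - {target}))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(classify("animal and plant tissues"))  # 'A'
```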

A quantitative representation of the similarity measure, and more details about the algorithm, can be found in [Yar95]. Notably, excellent accuracy (exceeding 96%) was reported.

Riloff 1999

The mutual bootstrapping described in [RJ99] set another landmark in the research on bootstrapping methods for IE. This approach is attractive due to its capability to learn a dictionary of words, i.e., a semantic lexicon, and, as a side effect, also learn a dictionary of extraction patterns.

Given a small set of seed words belonging to a certain category and an unannotated text, i.e., a training corpus, the objective is to learn other words (taken from the corpus, of course) which also belong to that category, and additionally to learn a set of related extraction patterns.

To do that, mutual bootstrapping selects the best extraction pattern available in a large pool (generated from the corpus beforehand) and then applies that pattern back to the corpus in order to extract new words. These words are all added to the seed-word set, which in turn affects the selection of the next best pattern, and so on, over many iterations. This creates a mutual influence between words and extraction patterns; hence the name “mutual”.
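The loop can be sketched as follows, assuming (as in Riloff's setting) that the set of words each pattern extracts from the corpus is precomputed; the scoring function is a simplified RlogF-style stand-in, not Riloff's exact formula:

```python
import math

def mutual_bootstrap(seed_words, pattern_pool, extractions, iterations=10):
    """Sketch of mutual bootstrapping. `extractions` maps each pattern
    in the pool to the set of words it extracts from the corpus."""
    lexicon = set(seed_words)
    learned_patterns = []
    pool = list(pattern_pool)
    for _ in range(iterations):
        if not pool:
            break
        def score(p):
            # Favor patterns that extract many known words, with some
            # precision weighting (a simplified RlogF-style score).
            f = len(extractions[p] & lexicon)
            n = len(extractions[p])
            return (f / n) * math.log2(f + 1) if n else 0.0
        best = max(pool, key=score)
        if score(best) == 0.0:          # no pattern extracts a known word
            break
        learned_patterns.append(best)
        pool.remove(best)
        lexicon |= extractions[best]    # mutual step: new words feed back
    return lexicon, learned_patterns
```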

In this manner, the similarity between the words added to the seed set is justified by the fact that they are all extracted by the same pattern. And because this pattern is the best one, the similarity between (1) those new words and (2) the current (old) words in the seed set can be warranted.

As we can see, the method indeed produces two outputs: a dictionary of words (the seed-word set) and a dictionary of patterns, collected from the best patterns selected in the iterations.

In order to improve the correctness of learned words, the authors presented the so-called multi-level bootstrapping method, which adds an outer bootstrapping loop wrapping the mutual bootstrapping. At the end of each inner (mutual) bootstrapping, all the new words learned in that run are examined, only some of the best ones are retained in the seed-word set, and the mutual bootstrapping starts again. The hope is to reduce the risk of learning irrelevant words, at the cost of more computation due to the outer bootstrapping.

Riloff 2002

Three years later, Riloff, together with a colleague [RT02], introduced some significant improvements to multi-level bootstrapping. This resulted in a new algorithm named Basilisk (Bootstrapping Approach to SemantIc Lexicon Induction using Semantic Knowledge).

The first innovation of Basilisk, interestingly, is the removal of the outer bootstrapping. Learned words are examined in a more fine-grained fashion: in every iteration inside the mutual bootstrapping, immediately after the best pattern is picked, all the new words extracted by this pattern are examined, and only some of the best words are added to the seed set. This elimination of the outer bootstrapping not only reduces the cost of computation but also offers more accuracy, since bad words can be avoided sooner. Note that, in any bootstrapping method, once a bad element is added to the seed set, other bad ones will quickly follow.

The second innovation is based on the observation that, at each iteration, the new words need not be extracted by only one best pattern. Words extracted by the same pattern are likely to be similar, but that does not mean words extracted by different patterns are always dissimilar. Indeed, as long as those different patterns are all good enough, they are still expected to extract a set of similar words. Therefore, picking more than one best pattern in every iteration is reasonable.
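One Basilisk-style iteration combining the two innovations might be sketched like this; `word_patterns` and the scoring function are simplified stand-ins (the score mimics the spirit, not the exact form, of Basilisk's AvgLog measure):

```python
import math

def basilisk_step(lexicon, word_patterns, top_patterns, m=5):
    """Sketch of one Basilisk-style iteration: words extracted by any of
    several top patterns are scored, and only the m best join the lexicon.
    `word_patterns` maps each candidate word to the set of patterns
    that extract it."""
    candidates = {w for w, ps in word_patterns.items()
                  if ps & top_patterns} - lexicon
    # More distinct extracting patterns -> stronger evidence for the word.
    best = sorted(candidates,
                  key=lambda w: math.log2(1 + len(word_patterns[w])),
                  reverse=True)[:m]
    return lexicon | set(best)
```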

The last innovation, though it does not address any problem of multi-level/mutual bootstrapping, is the simultaneous learning of multiple categories. Learning more than one category at a time makes sense because when a word is learned for one category, the learning of the other categories can ignore it. Obviously, this requires the assumption that different categories do not share common words.

Carasso 2007

Among all the published work in the area of bootstrapping methods, to our knowledge, only the paper of Carasso [Car07] developed an approach for the domain of operational logs instead of natural language.

Carasso's approach is still heavily inspired by Riloff's mutual bootstrapping. But because operational logs are special in that most linguistic rules and heuristics no longer work, there are some differences in Carasso's work.

Firstly, extraction patterns cannot be fully generated before the bootstrapping, as in Riloff's method. Rather, they are generated gradually, depending on the newly learned words at each iteration. Secondly, regular expressions are used to represent extraction patterns. Indeed, Carasso's main contribution in the paper is his method of generating candidate regular expressions from a given word and a given log event.
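A sketch of what such pattern generation could look like: keep a window of escaped literal context around the word and generalize the slot. The generalization choices below are illustrative assumptions; Carasso's actual generation rules are described in [Car07].

```python
import re

def candidate_patterns(word, log_event, context=10):
    """Generate candidate regexes for `word` from a single log event."""
    i = log_event.find(word)
    if i < 0:
        return []
    # Literal context immediately before and after the word.
    prefix = re.escape(log_event[max(0, i - context):i])
    suffix = re.escape(log_event[i + len(word):i + len(word) + context])
    # Try increasingly general character classes for the slot.
    return [prefix + r"(\w+)" + suffix,
            prefix + r"(\S+)" + suffix,
            prefix + r"(.+?)" + suffix]

pats = candidate_patterns("admin", "Login failed for user=admin from 10.0.0.1")
```

Each candidate can then be applied back to the corpus, and the ones extracting mostly known words are retained, exactly as in the mutual bootstrapping scheme.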

However, Carasso’s main aim was only to learn good extraction patterns, so the selection of the best words requires a human in the loop to remove the irrelevant ones extracted by the best pattern at each iteration of the bootstrapping. Because of that, he called his method “Semi-Automatic Discovery of Extraction Patterns for Log Analysis”.

Sonal Gupta 2015

The most recent research on the bootstrapping method is a Stanford University PhD dissertation titled “Distantly Supervised Information Extraction Using Bootstrapped Patterns” [Gup15]. The method described in this dissertation is, again, inspired by the work of Riloff.

The author showed that extraction patterns, when represented as lexico-syntactic surface word patterns and dependency patterns, can offer effective learning. She also presented some advanced statistical methods for measuring and selecting the best words as well as the best extraction patterns. These methods were reported to outperform the existing ones.


[1] Also, when the output values are Boolean, as in this case, the learning is specifically called concept learning (learning the concept of EnjoySport). In general, the number of output values is not necessarily two; for example, there could be four values: No, Football, Badminton, and Swimming.

[2] In the terms of semi-supervised learning, this is called transductive learning.

[3] Inferring knowledge from both labeled and unlabeled data is called inductive learning.

[4] Search engines mainly make use of Information Retrieval (IR), which is a different subject but often confused with IE. The task of IR is to select from a collection of documents a subset that is relevant to a particular query. However, IE is intensively used as a subcomponent of IR systems in order to improve document indexing.


[CW06] Joseph Cruz and David Wishart. Applications of machine learning in cancer prediction and prognosis. Cancer Informatics, 2006.

[Ng16] Alfred Ng. IBM’s Watson gives proper diagnosis for Japanese leukemia patient after doctors were stumped for months. http://www.nydailynews.com/news/world/ibm-watson-proper-diagnosis-doctors-stumped-article-1.2741857, 2016.

[Sim13] Phil Simon. Too Big to Ignore: The Business Case for Big Data. Wiley, 1st edition, 2013.

[Mit97] Tom Mitchell. Machine Learning. McGraw-Hill, 1st edition, 1997.

[Ham14] Laura Hamilton. The Inductive Biases of Various Machine Learning Algorithms. http://www.lauradhamilton.com/inductive-biases-various-machine-learning-algorithms, 2014.

[HS99] Geoffrey Hinton and Terrence J. Sejnowski. Unsupervised Learning: Foundations of Neural Computation. MIT Press, 1st edition, 1999.

[BvLR04] O. Bousquet, U. von Luxburg, and G. Raetsch. Advanced Lectures on Machine Learning. Springer-Verlag, 1st edition, 2004.

[Ton01] Simon Tong. Active Learning: Theory and Applications. PhD thesis, Stanford University, Department of Computer Science, 2001.

[CSZ06] Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-Supervised Learning. MIT Press, 1st edition, 2006.

[HYG02] Silja Huttunen, Roman Yangarber, and Ralph Grishman. Complexity of event structure in ie scenarios. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), 2002.

[PSPY13] Thierry Poibeau, Horacio Saggion, Jakub Piskorski, and Roman Yangarber. Multi-source, Multilingual Information Extraction and Summarization. Springer-Verlag, 1st edition, 2013.

[Fre00] Dayne Freitag. Machine learning for information extraction in informal domains. Technical report, Kluwer Academic, Justsystem Pittsburgh Research Center, 2000.

[MKSW99] John Makhoul, Francis Kubala, Richard Schwartz, and Ralph Weischedel. Performance measures for information extraction. Technical report, BBN Technologies, GTE Corp., 1999.

[Par72] David Parnas. On the criteria to be used in decomposing systems into modules. Technical report, Carnegie-Mellon University, 1972.

[HAB+97] Jerry Hobbs, Douglas Appelt, John Bear, David Israel, Megumi Kameyama, Mark Stickel, and Mabry Tyson. FASTUS: A cascaded finite-state transducer for extracting information from natural-language text. Technical report, Artificial Intelligence Center, SRI International, 1997.

[BMSW97] Daniel Bikel, Scott Miller, Richard Schwartz, and Ralph Weischedel. Nymble: a high-performance learning name-finder. In Proceedings of the 5th Applied Natural Language Processing Conference, 1997.

[Set04] Burr Settles. Biomedical named entity recognition using conditional random fields and rich feature sets. Technical report, University of Wisconsin Madison, Department of Computer Sciences, Department of Biostatistics and Medical Informatics, 2004.

[JGMR03] Rosie Jones, Rayid Ghani, Tom Mitchell, and Ellen Riloff. Active learning for information extraction with multiple view feature sets. In ECML-03 Workshop on Adaptive Text Extraction and Mining, 2003.

[Yar95] David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, 1995.

[RJ99] Ellen Riloff and Rosie Jones. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. In Sixteenth National Conference on Artificial Intelligence (AAAI-99), 1999.

[RT02] Ellen Riloff and Michael Thelen. A Bootstrapping Method for Learning Semantic Lexicons using Extraction Pattern Contexts. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2002.

[Car07] David Carasso. Semi-Automatic Discovery of Extraction Patterns for Log Analysis. Technical report, Splunk Inc., 2007.

[Gup15] Sonal Gupta. Distantly Supervised Information Extraction Using Bootstrapped Patterns. PhD thesis, Stanford University, Department of Computer Science, 2015.

