Print version ISSN 0327-0793
Lat. Am. Appl. Res. v.35 n.2, Bahía Blanca, Apr./June 2005
A new estimator based on maximum entropy
J. P. Piantanida and C. F. Estienne
Abstract: In this paper we propose a new formulation of the classical Good-Turing estimator for n-gram language models. The new approach is based on defining a dynamic model for language production. Instead of assuming a fixed probability distribution of occurrence of an n-gram over the whole text, we propose a maximum entropy approximation of a time-varying distribution. This approximation leads to a new distribution, which in turn is used to calculate the expectations of the Good-Turing estimator. This defines a new estimator that we call the maximum entropy Good-Turing estimator. In contrast to the classical Good-Turing estimator, the new formulation requires neither approximation of expectations nor windowing or other smoothing techniques. It also contains the well known discounting estimators as special cases. Performance is evaluated both in terms of perplexity and of word error rate in an N-best rescoring task, and comparisons with other classical estimators are performed. In all cases our approach performs significantly better than the classical estimators.
Keywords: Language Models. Maximum Entropy. Good-Turing Estimation.
I. INTRODUCTION

It is a well known fact that state-of-the-art speech recognition systems use n-gram models in their language models. In order to estimate such models, it is necessary to use probability estimators which assign a probability to each n-gram. Because of the sparse nature of language, two problems often arise. On the one hand, the number of samples of a particular event is often inadequate to obtain robust estimates for that event. On the other hand, even when the amount of available training data is huge, many events do not occur at all; this does not mean they have zero probability of occurrence, only that they did not occur in the training set. As a consequence, the maximum likelihood estimator of the probability, given by the quotient r/N, where r is the frequency of occurrence of an event (n-gram) and N is the total number of events, will not in general be a good estimator of the probability: it assigns null probability to events with non-zero occurrence probability, and it can be shown (Lindsey and Denne, 2000) that it tends to overestimate events which have low frequency of occurrence in a text. In order to deal with the problem of data sparseness, many probability estimators have been proposed in the literature. Two of the most popular are the Good-Turing estimator (Good, 1953; Nadas, 1985) and the discounting estimators (Katz, 1987; Ney et al., 1995).
In this work we take a different approach. We assume a dynamic language model for speech production, in the sense that the frequency of occurrence of an event is not fixed over the text but is a random variable. Even though this view requires a careful mathematical treatment, it is possible, using maximum entropy models, to obtain an approximation which yields an estimator that depends only on r. Starting from the classical Good-Turing estimator, we reformulate it in order to meet our model requirements. As a result, a new estimator, called the maximum entropy Good-Turing estimator, is obtained. This new estimator needs neither the approximations nor the empirical adjustments of the classical Good-Turing estimator (Good, 1953; Gale, 2000).
In the next section we briefly describe classical Good-Turing estimation and maximum entropy models. In Section III we formally state our maximum entropy Good-Turing model and discuss some issues of relevance. Experimental results are shown in Section IV. Finally, some concluding remarks are given in Section V.
II. CLASSICAL GOOD-TURING ESTIMATOR AND MAXIMUM ENTROPY MODELS
A. Good-Turing estimator
The classical Good-Turing estimator (Good, 1953) can be stated as a formal model (Good, 1953; Nadas, 1985) in which the probability of an event σ (an n-gram) whose frequency of occurrence is r is given by P(σ) = qr, with

qr = r*/N,   r* = (r + 1) E(c_{r+1}) / E(c_r)    (1)
where r is the frequency of occurrence of an event, N is the total number of events, and cr is the number of events whose frequency of occurrence is r. A fundamental hypothesis of the model is the symmetry requirement, which states that any two events having the same frequency in the text must also have the same probability estimate (Nadas, 1985). Equations (2) and (3) are difficult to determine and are not used in practical implementations of the Good-Turing estimator; instead, they are approximated with training data. As a consequence, many values of cr are zero, and there is an unacceptable dispersion between the values of cr and cr+1. These problems make it necessary to use windowing techniques, or non-continuous qr, in order to smooth such dispersions (Gale, 2000). With this approximation not only is mathematical formality lost, but empirical adjustments also become necessary for each kind of text.
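The mechanics of the estimator and the dispersion problem can be illustrated with a minimal sketch (an illustration only, not the authors' implementation; the toy corpus and tokenization are invented):

```python
from collections import Counter

def good_turing(tokens):
    """Classical Good-Turing: q_r = (r + 1) * c_{r+1} / (N * c_r), with the
    expectations E(c_r) approximated by the observed counts-of-counts."""
    counts = Counter(tokens)          # frequency r of each event
    N = sum(counts.values())          # total number of events
    cr = Counter(counts.values())     # c_r: number of events occurring r times
    return {r: (r + 1) * cr.get(r + 1, 0) / (N * cr[r]) for r in cr}

# Toy "corpus": a occurs 3 times, b and c twice, d/e/f once each (N = 10).
q = good_turing("a a a b b c c d e f".split())
# c_4 = 0, so q_3 collapses to zero -- the gap in the counts-of-counts
# that forces smoothing or windowing in practical implementations.
```

Note how a single missing count-of-counts value (c_4 = 0 here) drives an estimate to zero, which is exactly the dispersion described above.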
B. Maximum entropy models
Maximum entropy models have been used in language modeling to estimate n-grams (see for example Rosenfeld, 1996). Basically, they can be stated as follows:
- Reformulate the different information sources as constraints to be satisfied by the target estimate.
- Among all probability distributions that satisfy these constraints, choose the one that has the highest entropy.
Mathematically, the m constraints are expressed as expectations of feature functions:

Σx P(x) gk(x) = E[gk(x)],   k = 1, ..., m    (4)

where the gk(x) are the model's constraint functions, the constraints usually being expressed as expectations of these functions. The distribution that maximizes entropy subject to such constraints is given by (Cover and Thomas, 1991)

P(x) = (1/Z) exp( Σk λk gk(x) )    (5)

where Z = Σx exp( Σk λk gk(x) ) is the partition function.
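Over a finite domain, the exponential-family form of (5) can be sketched directly (an illustrative sketch; the single feature g(x) = x and the λ value are arbitrary choices, not the paper's model):

```python
import math

def maxent_dist(domain, g_funcs, lambdas):
    """Maximum entropy solution: P(x) = exp(sum_k lambda_k * g_k(x)) / Z,
    where Z is the partition function that normalizes the distribution."""
    w = {x: math.exp(sum(l * g(x) for l, g in zip(lambdas, g_funcs)))
         for x in domain}
    Z = sum(w.values())               # partition function
    return {x: wx / Z for x, wx in w.items()}

# One constraint g(x) = x with lambda = -0.5 over the domain {1, ..., 5}:
P = maxent_dist(range(1, 6), [lambda x: x], [-0.5])
```

With a negative λ the distribution decays in x, and Z guarantees it sums to one by construction.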
III. MAXIMUM ENTROPY GOOD-TURING ESTIMATOR
A. A dynamic model for language production
We can think of the speech production process as follows. Consider a hypothetical speaker who starts to speak to another person about some specific topic. At this moment his vocabulary is reduced to the number of distinct words he has said up to a particular moment t1, say V1. The number of repetitions is expected to be low at first, so a reasonable assumption for the probability of emission of a word is 1/V1. If we use entropy as a measure of the information of the message at time t1, it will be approximately log V1 (Cover and Thomas, 1991). After some time emitting words, say at instant t2, the speaker's vocabulary will have increased to V2 and language entropy will also grow. However, at this point some vocabulary repetitions are expected to have occurred, decreasing the growth rate of entropy. As a consequence, the entropy at t2 will be lower than log V2. Our assumption is that in the long term the language entropy of this dynamic process grows at a decreasing rate up to a maximum stationary value. This value corresponds to the case in which the speaker has used nearly all his vocabulary concerning a specific topic with a specific listener, and the number of repetitions is enough to prevent further entropy growth.
This means that we view language production as a dynamic process in which the probability of an event is not fixed but is a function of time: it could be zero at one moment (when no examples of the event have been emitted up to that moment) and non-zero at another. A complete formulation of the dynamics of this model is beyond the scope of the present work; however, if we assume that in the long term the system approaches a maximum entropy state which no longer changes, a simplified model can be developed and a robust estimator of the probability of an event can be found.
B. Model constraints
It should be clear from the discussion above that r, the frequency of occurrence of an event, is not constant but changes as the speaker introduces more and more vocabulary. We can think of it as a random variable with an associated probability Pt(r) which, of course, is unknown. The index t means that the distribution changes with time. If we adopt the symmetry requirement used in the Good-Turing estimator, we will not be able to distinguish between different events that occur the same number of times. Hence a distribution which represents the model dynamics will be a function not only of r but also of the number of events whose frequency of occurrence is r. If we call that number cr, we will have an associated distribution Pt(r, cr). But we are not interested in the instantaneous dynamics of the model; instead, we are concerned with the distribution whose entropy reaches a stable maximum. Such a distribution corresponds to the best static approximation we can produce of our dynamic process. We will call this distribution P(r, cr).
In order to find P(r, cr) we incorporate four statistics that carry the information about the process that the model needs. The first is

where σ is an event and N(σ) is the number of times that event occurs. This is a sufficient statistic for the Poisson distribution (Cover and Thomas, 1991). The choice is based on previous work (Church and Gale, 1996) which shows that the frequency of occurrence of an event in a text follows a Poisson distribution. Another work (Witten and Bell, 1991) shows that cr (the number of events with frequency r) also follows a Poisson distribution, but a different one for each r, so the second statistic that we incorporate is
where Nr is the maximum number of occurrences of an event and δ(i, j) = 0 for i ≠ j. We also define two statistics which take into account dynamic properties
Now we can formulate a maximum entropy probability distribution P(r,cr) that meets our four constraints.
C. Calculation of the distribution
The four statistics (6), (7), (8) and (9) are introduced into the model through equation (4), resulting in the following set of equations

where the expectations of (6)-(9) are evaluated from training data, Nr is the maximum number of occurrences over all events, and Nc is the maximum number of events occurring with the same frequency r. Maximizing the entropy of P(r, cr) subject to the above constraints, we obtain the corresponding form of equation (5) for our model
The expectations are obtained from training data. We have used resampling techniques which give rise to jackknife estimators (Walsh, 2000); however, other techniques could have been used. Once the expectations are available, the parameters λ1, λ2, λ3 and λ4 are obtained using the IIS algorithm (Della Pietra et al., 1997). Finally, applying formula (14), we obtain our maximum entropy distribution. The next step is to introduce this distribution into the Good-Turing estimator.
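The jackknife estimation mentioned above can be sketched as follows (a generic leave-one-out sketch using the usual bias-correction formula, not the authors' exact procedure):

```python
def jackknife(samples, stat):
    """Bias-corrected jackknife estimate of a statistic:
    theta_jack = n * theta_hat - (n - 1) * mean(leave-one-out estimates)."""
    n = len(samples)
    theta_hat = stat(samples)                                 # full-sample value
    loo = [stat(samples[:i] + samples[i + 1:]) for i in range(n)]
    return n * theta_hat - (n - 1) * sum(loo) / n

mean = lambda xs: sum(xs) / len(xs)
est = jackknife([1.0, 2.0, 3.0, 4.0], mean)
# For an unbiased statistic such as the mean, the correction changes nothing.
```

The same resampling scheme applies to any expectation estimated from training data; only `stat` changes.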
D. Maximum entropy Good-Turing estimator
Once (14) is determined, it is not difficult to calculate expectations of the Good-Turing estimator (1). It is straightforward to show that
Finally replacing (15) in (1) we obtain our new maximum entropy Good-Turing estimator
It is important to compare our estimator qr = r*/N with the maximum likelihood estimator r/N. To this end, define the quotient r*/r
This quotient allows us to understand the influence of the model parameters. Parameter λ1 is a measure of the rate of growth of P(r, cr) as r increases. Parameter λ2 is related to the value of the estimator at very low values of r (including r = 1). Parameter λ3 determines the maximum likelihood limit that our estimator reaches. Finally, parameter λ4 corresponds to a multiplicative factor (independent of r). This parameter affects the probability mass assigned to unobserved events. If we model the probability of unobserved events as
then an increase in the parameter λ4 will decrease qr and, as a consequence, P(φ0), the probability of unobserved events, will grow.
Another advantage of our estimator is that it verifies two desirable requirements for an estimator (Ney et al., 1995): qr ≤ r/N, and qr-1 ≤ qr for all r. The second requirement is easily seen from (16). To verify the first, we have found that our estimator satisfies the following condition, which is equivalent to qr ≤ r/N
Finally, if we take the linear term of a series expansion of expression (16), with a convenient choice of the parameters λ1, λ2, λ3 and λ4, the Ney discounting estimators (Ney et al., 1995) result as a special case of the maximum entropy Good-Turing estimator
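For reference, the absolute discounting estimator recovered as a special case has a simple closed form, sketched below (a standard sketch of Ney et al.'s absolute discounting, not derived from (16); the discount b = 0.5 is an arbitrary choice):

```python
def absolute_discounting(counts, b=0.5):
    """Absolute discounting (Ney et al., 1995): q_r = (r - b) / N for seen
    events, 0 < b < 1; the subtracted mass goes to unseen events."""
    N = sum(counts.values())
    q = {w: (r - b) / N for w, r in counts.items()}
    unseen_mass = b * len(counts) / N     # total mass left for unseen events
    return q, unseen_mass

q, m0 = absolute_discounting({"a": 3, "b": 1})
# Both requirements hold: q_r <= r/N and q_{r-1} <= q_r.
```

It is easy to check that the discounted probabilities plus the reserved unseen mass sum to one.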
IV. EXPERIMENTAL RESULTS
A. Data description
Experiments were performed on three corpora: an English database, Switchboard phase one, and two Spanish databases, Latino 40 (available from the LDC) and a Latin-American Spanish database collected by SRI International (Bratt et al., 1998). We also used text extracted from newspapers. We performed perplexity measurements using all the databases, and N-best rescoring using the Switchboard corpus. We used bigram models with the Latino 40 corpus and trigram models with the Switchboard and Latin-American Spanish databases. The text was split into three classes:
- Text A: Text taken from Latino 40 transcriptions; we used 32k words for training and 8k words for testing.
- Text B: Text taken from Latin-American Spanish database transcriptions and newspaper texts; combining both classes of text, we used 752k words for training and 33k words for testing.
- Text C: 3M words taken from Switchboard phase one transcriptions used for training, and 59k words taken from the HUB5 2001 evaluation set transcriptions used for testing.
Perplexity measurements were performed for the classical Good-Turing estimator (CGT) (Good, 1953), the Katz estimator (KATZ) (Katz, 1987), the absolute discounting (ADE) and linear discounting (LDE) estimators (Ney et al., 1995), and the maximum entropy Good-Turing estimator (MEGT). Results are shown in Table 1.
Table 1: Perplexities of selected estimators with different vocabulary
Finally, we performed N-best rescoring over 5895 sentences corresponding to the HUB5 2001 test set. We rescored the 2000-best hypotheses produced by the SRI DECIPHER(TM) speaker-independent continuous speech recognition system at SRI International. Results are shown in Table 2.
Table 2: WER after rescoring using Katz and MEGT estimators.
Table 1 shows that the maximum entropy method achieves a perplexity improvement superior to that of the rest of the estimators. It is interesting to observe that the improvement holds over all three text corpora. This is an important difference with respect to the other estimators; for example, the Katz estimator has lower perplexity for texts B and C than for text A.
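The perplexity figures compared above follow the standard definition, sketched here (a generic sketch; the per-word probabilities are invented, not taken from the paper's models):

```python
import math

def perplexity(word_probs):
    """Test-set perplexity: PP = 2 ** (-(1/M) * sum_i log2 p(w_i)),
    where p(w_i) is the model probability of the i-th test word."""
    M = len(word_probs)
    return 2 ** (-sum(math.log2(p) for p in word_probs) / M)

pp = perplexity([0.25] * 8)   # uniform probability 1/4 for every word
```

A uniform model over four equally likely words has perplexity 4, matching the intuition that perplexity measures the effective branching factor.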
Table 2 shows the results of N-best rescoring over the Switchboard corpus in terms of WER. Only the Katz estimator gave a small improvement; the other estimators were not included because they did not decrease the baseline WER. Our maximum entropy Good-Turing estimator achieves a significant improvement of 3.4% with respect to the baseline. We could expect a greater improvement if the maximum entropy estimator were used in an n-gram model in a full ASR task.
V. CONCLUSIONS

Using a maximum entropy method and assuming a dynamic model for language production, we have found a Good-Turing-like estimator which requires neither the smoothing nor the empirical adjustments that are necessary in the classical Good-Turing estimator. The parameters defining our model are determined using the well known IIS algorithm. We have also shown that our new estimator verifies the two requirements desired of language model estimators: qr ≤ r/N and qr-1 ≤ qr for all r. Finally, we have shown that our estimator contains the Ney discounting estimator as a particular case.
Experimental results show that the maximum entropy method performs better than all the other estimators for the three classes of text corpora considered. We also tested our estimator in a 2000-hypothesis N-best rescoring over the Switchboard corpus, obtaining a WER decrease of 3.4% with respect to the baseline.
ACKNOWLEDGMENTS

We want to thank Star-Lab at SRI International, and especially Dr. Horacio Franco, for permitting the use of their Latin-American Spanish database and N-best data. We also thank Luciana Ferrer from SRI for her comments and suggestions.
REFERENCES

1. Bratt, H., L. Neumeyer, E. Shriberg and H. Franco, "Collection and Detailed Transcription of a Speech Database for Development of Language Learning Technologies", Proc. ICSLP, Sydney, Australia, paper 926 (1998).
2. Church, K. W. and W. A. Gale, "Poisson mixtures", AT&T Bell Labs Research (1996).
3. Cover, T. and J. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY (1991).
4. Della Pietra, S., V. Della Pietra and J. Lafferty, "Inducing Features of Random Fields", IEEE Trans. on Pattern Analysis and Machine Intelligence, 19, 380-393 (1997).
5. Gale, W., "Good-Turing Smoothing Without Tears", Report, AT&T Bell Laboratories (2000).
6. Good, I. J., "The population frequencies of species and the estimation of population parameters", Biometrika, 40, 237-264 (1953).
7. Katz, S. M., "Estimation of probabilities from sparse data for the language model component of a speech recognizer", IEEE Trans. on Acoustics, Speech and Signal Proc., 35, 400-401 (1987).
8. Lindsey, J. K. and J. S. Denne, "Missing data: a fundamental frequentist problem", Report, Biostatistics, Limburgs University, Diepenbeek, Belgium (2000).
9. Nadas, A., "On Turing's formula for word probabilities", IEEE Trans. on Acoustics, Speech and Signal Proc., 33, 1414-1416 (1985).
10. Ney, H., U. Essen and R. Kneser, "On the Estimation of Small Probabilities by Leaving-One-Out", IEEE Trans. on Pattern Analysis and Machine Intelligence, 17, 1202-1212 (1995).
11. Rosenfeld, R., "A Maximum Entropy Approach to Adaptive Statistical Language Modeling", Computer Speech and Language, 10, 187-228 (1996).
12. Walsh, B., "Resampling Methods: Randomization Tests, Jackknife and Bootstrap Estimators", Lecture Notes (2000).
13. Witten, I. H. and T. C. Bell, "The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression", IEEE Trans. on Information Theory, 37, 1085-1094 (1991).