Latin American applied research

Print version ISSN 0327-0793

Lat. Am. appl. res. vol.35 no.2 Bahía Blanca Apr./June 2005

 

A new estimator based on maximum entropy

J. P. Piantanida and C. F. Estienne

School of Engineering, University of Buenos Aires, Argentina
Paseo Colón 850 Dept. de Electrónica (1063) Cap. Fed.
jpianta,cestien@fi.uba.ar

Abstract - In this paper, we propose a new formulation of the classical Good-Turing estimator for n-gram language models. The new approach is based on defining a dynamic model for language production. Instead of assuming a fixed probability distribution of occurrence of an n-gram over the whole text, we propose a maximum entropy approximation of a time-varying distribution. This approximation leads to a new distribution, which in turn is used to calculate the expectations in the Good-Turing estimator. This defines a new estimator that we call the maximum entropy Good-Turing estimator. In contrast to the classical Good-Turing estimator, the new formulation requires neither approximations of the expectations nor windowing or other smoothing techniques. It also contains the well-known discounting estimators as special cases. Performance is evaluated both in terms of perplexity and word error rate in an N-best rescoring task, and a comparison with other classical estimators is performed. In all cases our approach performs significantly better than the classical estimators.

Keywords - Language Models. Maximum Entropy. Good-Turing Estimation.

I. INTRODUCTION

It is a well known fact that state-of-the-art speech recognition systems use n-gram models in their language models. In order to estimate such models, it is necessary to use probability estimators that assign a probability to each n-gram. Because of the sparse nature of language, two problems often arise. On the one hand, the number of samples of a particular event is often inadequate to obtain a robust estimate for that event. On the other hand, even when the amount of available training data is huge, many events do not occur at all; this does not mean they have zero probability of occurrence, it just means they did not occur in the training set. As a consequence, the maximum likelihood estimator of the probability, given by the quotient r/N, where r is the frequency of occurrence of an event (n-gram) and N is the total number of events, will not in general be a good estimator of the probability. On one hand it assigns null probability to unseen events whose true probability may be non-zero, and on the other hand it can be shown (Lindsey and Denne, 2000) that it tends to overestimate events which have a low frequency of occurrence in a text. In order to deal with the problem of data sparseness, many probability estimators have been proposed in the literature. Two of the most popular are the Good-Turing estimator (Good, 1953; Nadas, 1985) and the discounting estimators (Katz, 1987; Ney et al., 1995).

In this work we take a different approach. We assume a dynamic language model for speech production, in the sense that the frequency of occurrence of an event is not fixed over the text but is a random variable. Even though this view requires a careful mathematical treatment, it is possible, using maximum entropy models, to obtain an approximation that leads to an estimator depending only on r. Starting from the classical Good-Turing estimator, we reformulate it in order to meet our model requirements. As a result, a new estimator, called the maximum entropy Good-Turing estimator, is obtained. This new estimator does not need the approximations or empirical adjustments required by the classical Good-Turing estimator (Good, 1953; Gale, 2000).

In the next section we briefly describe classical Good-Turing estimation and maximum entropy models. In Section III we formally state our Good-Turing maximum entropy model and discuss some issues of relevance. Experimental results are shown in Section IV. Finally, some concluding remarks are given in Section V.

II. CLASSICAL GOOD-TURING ESTIMATOR AND MAXIMUM ENTROPY MODELS

A. Good-Turing estimator

The classical Good-Turing estimator (Good, 1953) can be stated as a formal model (Good, 1953; Nadas, 1985) in which the probability of an event σ (an n-gram) whose frequency of occurrence is r is given by P(σ) = q_r, with

q_r = ((r + 1)/N) · E[c_{r+1}] / E[c_r],   (1)

where

E[c_r] = Σ_σ C(N, r) P(σ)^r (1 − P(σ))^(N−r),   (2)
E[c_{r+1}] = Σ_σ C(N, r+1) P(σ)^(r+1) (1 − P(σ))^(N−r−1),   (3)

where r is the frequency of repetition of an event, N is the total number of events, and c_r is the number of events whose frequency of occurrence is r. A fundamental hypothesis of the model is the symmetry requirement, which states that any two events having the same frequency in the text must also have the same probability estimate (Nadas, 1985). The expectations in Eqs. (2) and (3) are difficult to determine and are not used in practical implementations of the Good-Turing estimator; instead they are approximated with training data. As a consequence, many values of c_r are zero, and there is an unacceptable dispersion between the values of c_r and c_{r+1}. These problems make it necessary to use windowing techniques, or a non-continuous q_r, in order to smooth such dispersions (Gale, 2000). Although smoothing is necessary in practical implementations, not only is mathematical formality lost with this approximation, but empirical adjustments are also required for each kind of text.
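To make this practical issue concrete, the following Python sketch (an illustration added here, not part of the original paper; the function name and toy counts are ours) computes the raw Good-Turing estimate q_r = ((r + 1)/N) · c_{r+1}/c_r directly from observed counts-of-counts. With real data many c_{r+1} are zero, which is precisely why the smoothing discussed above becomes necessary.

```python
from collections import Counter

def raw_good_turing(ngram_counts):
    """Raw Good-Turing estimate q_r = (r + 1) * c_{r+1} / (c_r * N).

    ngram_counts: dict mapping each observed n-gram to its frequency r.
    Returns a dict mapping each observed frequency r to the estimate q_r.
    Illustrative only: with sparse data many c_{r+1} are zero, so the raw
    estimate is undefined or unstable and smoothing is required in practice.
    """
    N = sum(ngram_counts.values())            # total number of events
    c = Counter(ngram_counts.values())        # c_r: number of events seen r times
    q = {}
    for r in sorted(c):
        if c.get(r + 1, 0) > 0:               # skip r when c_{r+1} = 0 (raw formula breaks)
            q[r] = (r + 1) * c[r + 1] / (c[r] * N)
    return q

# Toy usage with a handful of bigram counts
counts = {"the cat": 3, "a dog": 1, "the dog": 1, "a cat": 2, "my cat": 1}
print(raw_good_turing(counts))                # {1: 0.0833..., 2: 0.375}
```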

B. Maximum entropy models

Maximum entropy models have been used in language modeling to estimate n-grams (see for example Rosenfeld, 1996). Basically, they can be stated as follows:

  • Reformulate the different information sources as constraints to be satisfied by the target estimate.
  • Among all probability distributions that satisfy these constraints, choose the one that has the highest entropy.

Mathematically, the m constraints are expressed through expectations of feature functions as follows:

Σ_x P(x) g_k(x) = d_k,   k = 1, …, m,   (4)

where the g_k(x) are the feature functions whose expectations express the model constraints and the d_k are their target values. The distribution that maximizes entropy subject to these constraints is given by (Cover and Thomas, 1991)

P(x) = (1/Z) exp(Σ_k λ_k g_k(x)),   (5)

where Z = Σ_x exp(Σ_k λ_k g_k(x)) is the partition function.
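For a finite support, (5) can be evaluated directly once the multipliers are known. The sketch below (an illustration, not from the paper; the toy features and multiplier values are our own assumptions) enumerates the support, computes exp(Σ_k λ_k g_k(x)) and normalizes by the partition function. With the toy features x and log(x!) and a suitable choice of multipliers the result is a truncated Poisson, the distribution that reappears in the next section.

```python
import numpy as np
from math import lgamma

def maxent_distribution(support, feature_fns, lambdas):
    """Evaluate P(x) = exp(sum_k lambda_k * g_k(x)) / Z over a finite support.

    support:     list of points x
    feature_fns: list of feature functions g_k
    lambdas:     one multiplier lambda_k per feature
    """
    # Feature matrix G[i, k] = g_k(x_i)
    G = np.array([[g(x) for g in feature_fns] for x in support], dtype=float)
    scores = G @ np.asarray(lambdas, dtype=float)
    scores -= scores.max()          # shift for numerical stability
    w = np.exp(scores)
    return w / w.sum()              # dividing by Z (the partition function)

# Toy usage: support {0,...,9}, features x and log(x!), i.e. a truncated Poisson
support = list(range(10))
features = [lambda x: float(x), lambda x: lgamma(x + 1)]
P = maxent_distribution(support, features, [0.5, -1.0])
print(P.round(4), P.sum())          # a properly normalized distribution
```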

III. MAXIMUM ENTROPY GOOD-TURING ESTIMATOR

A. A dynamic model for language production

We can think of the speech production process as follows. Consider a hypothetical speaker who starts to speak to another person about some specific topic. At a particular moment t_1, his vocabulary is reduced to the number of distinct words he has said up to that moment, say V_{t1}; the number of repetitions is expected to be low at first. Therefore, a reasonable assumption for the probability of emission of a word is 1/V_{t1}. If we use entropy as a measure of the information of the message at time t_1, it will be approximately H_{t1} ≈ log V_{t1} (Cover and Thomas, 1991). After some time of emitting words, say at instant t_2, the speaker's vocabulary will have increased to V_{t2}, and the language entropy will also have grown. However, at this point some vocabulary repetitions are expected to have occurred, decreasing the growth rate of the entropy; as a consequence, H_{t2} will be lower than log V_{t2}. Our assumption is that, in the long term, the language entropy of this dynamic process grows at a decreasing rate up to a maximum stationary value. This value would correspond to the case where the speaker has used nearly all his vocabulary concerning a specific topic with a specific person, and the number of repetitions is enough to prevent further entropy growth.

This means that we are viewing language production as a dynamic process in which the probability of an event is not fixed but is a function of time, so that it could be zero at one moment (when no examples of the event have been emitted up to that moment) and non-zero at another. A complete formulation of the dynamics of this model is beyond the scope of the present work; however, if we assume that in the long term the system approaches a maximum entropy state which no longer changes, a simplified model can be developed and a robust estimator of the probability of an event can be found.
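The qualitative behaviour assumed here (entropy close to log V at first, then growing at a decreasing rate toward a plateau) can be illustrated with a toy simulation; the Zipf-like vocabulary below is our own illustrative assumption, not a model used in the paper.

```python
import numpy as np

def empirical_entropy(word_counts):
    """Entropy (in nats) of the empirical word distribution seen so far."""
    p = word_counts / word_counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
V = 1000                                      # hypothetical topic vocabulary size
zipf = 1.0 / np.arange(1, V + 1)              # Zipf-like word probabilities
zipf /= zipf.sum()

for t in (100, 1_000, 10_000, 100_000):       # "time" = number of emitted words
    draws = rng.choice(V, size=t, p=zipf)
    counts = np.bincount(draws, minlength=V).astype(float)
    print(f"t={t:6d}  entropy={empirical_entropy(counts):.3f}  log V={np.log(V):.3f}")
```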

B. Model constraints

It should be clear from the discussion above that r, the frequency of occurrence of an event, is not constant but changes as the speaker introduces more and more vocabulary. We can think of it as a random variable with an associated probability Pt(r) which, of course, is unknown. The index t indicates that the distribution changes with time. If we adopt the symmetry requirement used in the Good-Turing estimator, we will not be able to distinguish between different events that occur the same number of times. Hence a distribution that represents the model dynamics will be a function not only of r, but also of the number of events whose frequency of occurrence is r. If we call such number c_r, we will have an associated distribution Pt(r, c_r). But we are not interested in the instantaneous dynamics of the model; instead we are concerned with the distribution whose entropy reaches a stable maximum. Such a distribution corresponds to the best static approximation we can produce for our dynamic process. We will call this distribution P(r, c_r).

In order to find P(r, c_r) we incorporate four statistics that capture the information about the process that the model needs. The first is

, (6)

where σ is an event and N(σ) is the number of times such an event occurs. This statistic is a sufficient statistic for the Poisson distribution (Cover and Thomas, 1991). The choice of this statistic is based on previous work (Church and Gale, 1996) showing that the frequency of occurrence of an event in a text follows a Poisson distribution. Another work (Witten and Bell, 1991) shows that c_r (the number of events with frequency r) also follows a Poisson distribution, different for each r, so the second statistic that we incorporate is

, (7)

where N_r is the maximum number of occurrences of an event and δ(i, j) = 0 for i ≠ j. We also define two statistics which take into account the dynamic properties of the process:

, (8)
, (9)

Now we can formulate a maximum entropy probability distribution P(r,cr) that meets our four constraints.

C. Calculation of the distribution

The four statistics (6), (7), (8) and (9), which we denote g_1, …, g_4, are put together in the model through equation (4), resulting in the following set of equations:

Σ_{r=1..N_r} Σ_{c_r=1..N_c} P(r, c_r) g_1(r, c_r) = ⟨g_1⟩,   (10)
Σ_{r=1..N_r} Σ_{c_r=1..N_c} P(r, c_r) g_2(r, c_r) = ⟨g_2⟩,   (11)
Σ_{r=1..N_r} Σ_{c_r=1..N_c} P(r, c_r) g_3(r, c_r) = ⟨g_3⟩,   (12)
Σ_{r=1..N_r} Σ_{c_r=1..N_c} P(r, c_r) g_4(r, c_r) = ⟨g_4⟩,   (13)

where ⟨g_1⟩, ⟨g_2⟩, ⟨g_3⟩ and ⟨g_4⟩ are evaluated from training data, N_r is the maximum number of occurrences over all events, and N_c is the maximum number of events that occur with the same frequency. Maximizing the entropy of P(r, c_r) subject to the above constraints, we obtain the corresponding form of equation (5) for our model:

P(r, c_r) = (1/Z) exp(λ_1 g_1(r, c_r) + λ_2 g_2(r, c_r) + λ_3 g_3(r, c_r) + λ_4 g_4(r, c_r)),   (14)

where Z is the partition function that normalizes the distribution.

The expectations ⟨g_1⟩, ⟨g_2⟩, ⟨g_3⟩ and ⟨g_4⟩ are obtained from training data; we have used resampling techniques which give rise to jackknife estimators (Walsh, 2000), although other techniques could have been used. Once the expectations are obtained, the parameters λ_1, λ_2, λ_3 and λ_4 are computed with the IIS algorithm (Della Pietra et al., 1997). Finally, applying formula (14), we obtain our maximum entropy distribution. The next step is to introduce this distribution into the Good-Turing estimator.
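The fitting step can be pictured with the sketch below. It is illustrative only: the paper uses IIS, whereas here a plain gradient ascent on the same moment-matching conditions is used for brevity, and the jackknife helper only illustrates the resampling idea; all names are ours.

```python
import numpy as np

def fit_maxent_lambdas(G, targets, lr=0.05, iters=20000, tol=1e-8):
    """Fit multipliers so that the model expectations E_P[g_k] match `targets`.

    G:       (n_points, n_features) matrix, G[i, k] = g_k(x_i) on a finite support
    targets: expectations <g_k> estimated from training data
    Plain gradient ascent on the dual: the gradient in lambda_k is
    targets_k - E_P[g_k] (IIS, as used in the paper, solves the same problem).
    """
    lambdas = np.zeros(G.shape[1])
    for _ in range(iters):
        scores = G @ lambdas
        scores -= scores.max()                 # numerical stability
        P = np.exp(scores)
        P /= P.sum()                           # current maximum entropy distribution
        grad = targets - P @ G                 # moment-matching gradient
        if np.abs(grad).max() < tol:
            break
        lambdas += lr * grad
    return lambdas

def jackknife_mean(samples, g):
    """Leave-one-out (jackknife) estimate of E[g]; for a sample mean the bias
    correction is exact, but the same recipe applies to non-linear statistics."""
    vals = np.array([g(s) for s in samples], dtype=float)
    n = len(vals)
    loo = (vals.sum() - vals) / (n - 1)        # mean with each sample left out
    return float(n * vals.mean() - (n - 1) * loo.mean())
```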

D. Maximum entropy Good-Turing estimator

Once (14) is determined, it is not difficult to calculate the expectations appearing in the Good-Turing estimator (1). It is straightforward to show that

, (15)

Finally, replacing (15) in (1), we obtain our new maximum entropy Good-Turing estimator

(16)

E. Discussion

It is important to compare our estimator with the maximum likelihood estimator r/N. Writing our estimator as q_r = r*/N, we define, to this end, the quotient r*/r

This quotient allows us to understand the influence of the model parameters. Parameter λ_1 measures how fast P(r, c_r) grows as r increases. Parameter λ_2 is related to the value of the estimator at very low values of r (including r = 1). Parameter λ_3 determines the maximum likelihood limit that our estimator approaches. Finally, parameter λ_4 corresponds to a multiplicative factor (independent of r), which affects the probability mass assigned to unobserved events. If we model the probability of unobserved events as

P(0) = 1 − Σ_{r≥1} c_r q_r,

then an increase of the parameter λ_4 decreases q_r and, as a consequence, P(0), the probability of unobserved events, grows.

Another advantage of our estimator is that it satisfies two requirements that are desirable for an estimator (Ney et al., 1995): q_r ≤ r/N, and q_{r-1} ≤ q_r for all r. The second requirement is easily seen from (16). For the first requirement, we have found that our estimator satisfies the following condition, which is equivalent to q_r ≤ r/N:

Finally, if we make a series expansion of expression (16), take the linear term, and make a convenient choice of the parameters λ_1, λ_2, λ_3 and λ_4, the Ney discounting estimators (Ney et al., 1995) result as a special case of the maximum entropy Good-Turing estimator.
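For reference, the discounting estimators mentioned above can be written down directly. The sketch below reflects our reading of the standard definitions in Ney et al. (1995), not code from the paper, and it also checks the two requirements q_r ≤ r/N and q_{r-1} ≤ q_r discussed above; the helper names and toy counts are illustrative.

```python
def absolute_discounting(ngram_counts, b=0.5):
    """Absolute discounting: q_r = (r - b)/N for seen events, 0 < b < 1.
    Returns (q, p0): per-event probabilities and the mass p0 freed for unseen events."""
    N = sum(ngram_counts.values())
    q = {w: (r - b) / N for w, r in ngram_counts.items()}
    p0 = b * len(ngram_counts) / N            # mass released to unseen events
    return q, p0

def linear_discounting(ngram_counts, alpha=0.1):
    """Linear discounting: q_r = (1 - alpha) * r/N; the mass alpha goes to unseen events."""
    N = sum(ngram_counts.values())
    return {w: (1 - alpha) * r / N for w, r in ngram_counts.items()}, alpha

def satisfies_requirements(ngram_counts, q):
    """Check q_r <= r/N and that q_r is non-decreasing in r (Ney et al., 1995)."""
    N = sum(ngram_counts.values())
    by_r = sorted((r, q[w]) for w, r in ngram_counts.items())
    upper_ok = all(qr <= r / N + 1e-12 for r, qr in by_r)
    monotone_ok = all(q1 <= q2 + 1e-12 for (_, q1), (_, q2) in zip(by_r, by_r[1:]))
    return upper_ok and monotone_ok

counts = {"the cat": 3, "a dog": 1, "the dog": 1, "a cat": 2, "my cat": 1}
q_abs, p0 = absolute_discounting(counts)
print(satisfies_requirements(counts, q_abs), round(p0, 3))   # True 0.312
```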

IV. EXPERIMENTAL RESULTS

A. Data description

Experiments were performed on three corpora: an English database, Switchboard phase one, and two Spanish databases, Latino40 (available from the LDC) and the Latin-American Spanish database collected by SRI International (Bratt et al., 1998). We also used text extracted from newspapers. We performed perplexity measurements using all the databases, and N-best rescoring using the Switchboard corpus. We used bigram models with the Latino40 corpus and trigram models with the Switchboard and Latin-American Spanish databases. The text was split into three classes:

  • Text A: text taken from Latino40 transcriptions; we used 32k words for training and 8k words for testing.
  • Text B: text taken from Latin-American Spanish database transcriptions and newspaper texts; combining both classes of text, we used 752k words for training and 33k words for testing.
  • Text C: 3M words taken from Switchboard phase one transcriptions used for training, and 59k words taken from the HUB5 2001 evaluation set transcriptions used for testing.

B. Results

Perplexity measurements were performed for the classical Good-Turing estimator (CGT) (Good, 1953), the Katz estimator (KATZ) (Katz, 1987), the absolute discounting (ADE) and linear discounting (LDE) estimators (Ney et al., 1995), and the maximum entropy Good-Turing estimator (MEGT). Results are shown in Table 1.

Table 1: Perplexities of selected estimators with different vocabulary
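For completeness, perplexity here is the usual quantity PP = exp(−(1/M) Σ_i log q(w_i | h_i)) over the M test events. The minimal sketch below shows how such a measurement is obtained from any estimator's probabilities; it is illustrative and not the evaluation code used for Table 1.

```python
import math

def perplexity(test_ngrams, prob):
    """Perplexity of a test set under a language model estimator.

    test_ngrams: iterable of (history, word) pairs extracted from the test text
    prob:        function prob(history, word) -> model probability q > 0
                 (unseen events must receive the non-zero backed-off mass)
    """
    log_sum, m = 0.0, 0
    for history, word in test_ngrams:
        log_sum += math.log(prob(history, word))
        m += 1
    return math.exp(-log_sum / m)
```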

Finally, we performed N-best rescoring over 5895 sentences corresponding to the HUB5 2001 test set. We rescored the 2000-best hypotheses produced by the SRI DECIPHER(TM) speaker-independent continuous speech recognition system at SRI International. Results are shown in Table 2.

Table 2: WER after rescoring using Katz and MEGT estimators.

C. Discussion

Table 1 shows that the maximum entropy method yields a larger improvement in terms of perplexity than the rest of the estimators. It is interesting to observe that the improvement holds over all three text corpora. This is an important difference with respect to the other estimators; for example, the Katz estimator has lower perplexity for texts B and C than for text A.

Table 2 shows results of N-best rescoring over the Switchboard corpus in terms of WER. Only the Katz estimator gave a small improvement; the other estimators were not included because they did not decrease the baseline WER. Our maximum entropy Good-Turing estimator achieves a significant improvement of 3.4% with respect to the baseline. We could expect a greater improvement if the maximum entropy estimator were used in the n-gram model of a full ASR task.

V. CONCLUSIONS

Using a maximum entropy method and assuming a dynamic model for language production, we have derived a Good-Turing-like estimator which requires neither the smoothing nor the empirical adjustments that are necessary in the classical Good-Turing estimator. The parameters defining our model are determined using the well-known IIS algorithm. We have also shown that our new estimator satisfies both requirements desired of language model estimators, q_r ≤ r/N and q_{r-1} ≤ q_r for all r. Finally, we have shown that our estimator contains the Ney discounting estimator as a particular case.

Experimental results show that the maximum entropy method performs better than all the other estimators for the three classes of text corpora considered. We also tested our estimator in a 2000-hypothesis N-best rescoring over the Switchboard corpus, obtaining a 3.4% reduction in WER with respect to the baseline.

VI. ACKNOWLEDGMENTS
We want to thank the Star-Lab at SRI International, and especially Dr. Horacio Franco, for permitting the use of their Latin-American Spanish database and N-best data. We also thank Luciana Ferrer from SRI for her comments and suggestions.

REFERENCES
1. Bratt, H., L. Neumeyer, E. Shriberg and H. Franco, "Collection and Detailed Transcription of a Speech Database for Development of Language Learning Technologies", Proc. ICSLP, Sydney, Australia, Paper number 926 (1998).
2. Church, W. K. and W. A. Gale, "Poisson mixtures", AT&T Bell Labs Research (1996).
3. Cover, T. and J. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY (1991).
4. Della Pietra, S., V. Della Pietra and J. Lafferty, "Inducing Features of Random Fields", IEEE Trans. on Pattern Analysis and Machine Intelligence, 19, 380-393 (1997).
5. Gale, W., "Good-Turing Smoothing Without Tears", Report, AT&T Bell Laboratories (2000).
6. Good, I. J., "The population frequencies of species and the estimation of population parameters", Biometrika, 40, 237-264 (1953).
7. Katz, S. M., "Estimation of probabilities from sparse data for the language model component of a speech recognizer", IEEE Trans. on Acoustics, Speech and Signal Proc., 35, 400-401 (1987).
8. Lindsey, J. K. and J. S. Denne, "Missing data: a fundamental frequentist problem", Report, Biostatistics, Limburgs University, Diepenbeek, Belgium (2000).
9. Nadas, A., "On Turing's formula for word probabilities", IEEE Trans. on Acoustics, Speech and Signal Proc., 33, 1414-1416 (1985).
10. Ney, H., U. Essen and R. Kneser, "On the Estimation of Small Probabilities by Leaving-One-Out", IEEE Trans. on Pattern Analysis and Machine Intelligence, 17, 1202-1212 (1995).
11. Rosenfeld, R., "A Maximum Entropy Approach to Adaptive Statistical Language Modeling", Computer Speech and Language, 10, 187-228 (1996).
12. Walsh, B., "Resampling methods: randomization tests, Jackknife and Bootstrap estimators", Lecture Notes (2000).
13. Witten, I. H. and T. C. Bell, "The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression", IEEE Trans. on Information Theory, 37, 1085-1094 (1991).
