Thoughts on the Science of Computing

Latex, Maximum Likelihood Estimation, and other stuff
November 28, 2007

This post is just a silly attempt at the following,  trying to look smart by putting fancy equations on my blog and playing around with latex, and talking about MLE.


Down below is an html version of a portion of a tex file that I generated for a homework. 

I am new to Latex and finding it incredibly useful for formatting documents that include equations or other mathematical symbols.

It also great for general typesetting, for instance, if you want to create a resume that really stands out there are great templates available. 

From my vantage, latex documents work similar to html documents with tags that control various features.  The great thing is that once you get the hang of them it is very easy to type in equations like below.   Here is the latex code for equation 1.

\begin{equation}
\vec{\theta} = (\lambda_1,\lambda_2,...,\lambda_k, Z)
\end{equation}

And similar ideas are used for the rest.  Here is the complete file.

Some blogs and web frameworks actually have renderers that will translate an latex equation directly to a png file dynamically.

Unfortunately, I haven't found anything in asp.net that does this.  So, I actually ran a program called htlatex to generate the png files below of the equations.

Now, for a quick, crude explanation of what's below. 

Maximum Entropy is a conecpt that is very popular in statistics these days.  The ideas is let's say you care about certain statistics about a dataset or problem, for instance you might be interested in the mean of the data and the variance.  And then lets say you want to fit a distribution to the data while preserving those statistics. 

Well, there may be many different distributions that share the same mean and variance, so one idea is to just choose the distribution with the maximum entropy that fits the data and preserves the statistics.  If you actually do the derivation you end up with an expression for the probability like in equation 2 that is a function of a set of phis, which are just your statistics.  So if we just cared about the mean and variance we would have two phis and two lambdas in the equations below.

So, now that you have equation 2 it depends on the parameters lambda and needs to be normalized.  So one still needs to find the lambdas.  One eventually derives equations like 19 which you can use the data to calculate the lambdas. 

This all can be show using the maximum entropy point of view, but we were asked to show that the equations for lambda could also be derived using Maximum Likelihood Estimation or MLE for short.  Here, we are maximizing the probability of the data given our choice of lambdas.  So one tries to find the lambdas for which the probability is a maximum. 

Below is the meat of the derivation.  Nothing earthshattering below.


We want to use Maximum Likelihood Estimation to find the parameters

⃗θ = (λ1,λ2,...,λk,Z)
(1)

We want to maximize the probability of the data given the parameters

             ∑k
p(x ∣⃗θ) = 1∕Ze-  j=1 λjφj(xi)
(2)

        ∏n       ∑k
p(D∣⃗θ) =   1∕Ze-   j=1λjφj(xi)
        i=1
(3)

Here we rewrite the equation as a log-likelihood

  ⃗     ∏n     - ∑kj=1λjφj(xi)
L(θ) = ln  1∕Ze
        i=1
(4)

  ∑n       - ∑k  λ φ (x )
=    ln1∕Ze    j=1 j j i
   i=1
(5)

  ∑n    - ∑kj=1λjφj(xi)
=    lne            - lnZ
  i=1
(6)

    ∑n ∑k
= -    (  λjφj(xi)+ lnZ)
    i=1 j=1
(7)

We want to maximize these parameters subject to the constraint that the integral of p(x) is normalized to 1 over all space.

∫ ∞       ∑k
    1∕Ze-   j=1λjφj(x)dx = 1
 -∞
(8)

We then use a Lagrange multiplier to constrain the log-likelihood when maximizing.

              ∫ ∞     - ∑k  λ φ(x)
f(⃗θ) = L(⃗θ)+ α    1∕Ze    j=1 j j  dx- α
               -∞
(9)

              ∑n          ∫ ∞           - ∑k  λφ (x)
∂f(⃗θ)∕∂λm = -    φm(xi)- α     φm(x)1∕Ze    j=1 jj  dx
              i=1           - ∞
(10)

                    ∫ ∞        ∑k
∂f (⃗θ)∕∂Z = - N∕Z - α    1∕Z2e-   j=1λjφj(x)dx
                     - ∞
(11)

Assuming Z is just a parameter or normalization constant that doesn’t depend on x we can move it outside the integral.

                      ∫         ∑
∂f(⃗θ)∕∂Z = - N∕Z - α∕Z   ∞ 1∕Ze-   kj=1λjφj(x)dx
                       -∞
(12)

And using equation 8 to replace the right hand term and setting the derivative to 0 we get.

∂f(⃗θ)∕∂Z = - N ∕Z - α ∕Z = 0
(13)

N ∕Z = - α∕Z
(14)

α = - N
(15)

Replacing alpha in equation 10 with -N and setting the derivative to 0 we get

              ∑n           ∫ ∞            ∑k
∂f(⃗θ)∕∂λm = -    φm(xi)+ N     φm(x)1∕Ze-   j=1λjφj(x)dx = 0
              i=1            -∞
(16)

             ∫
n∑              ∞          - ∑kj=1λjφj(x)
   φm(xi) = N -∞ φm(x)1∕Ze            dx
i=1
(17)

   ∑n          ∫ ∞          - ∑k  λ φ (x)
1∕N    φm(xi) =     φm(x)1∕Ze    j=1 j j  dx
    i=1          -∞
(18)

and finally we get substituting equation 2 into equation 18

    ∑n         ∫ ∞
1∕N    φm(xi) =    φm(x)p(x∣⃗θ)dx
    i=1          -∞
(19)

AdaBoost
January 31, 2008
Two of the more recent algorithms in the field of machine learning include Support Vector Machines and AdaBoost. I think both are on the order of ten years old or such.

The idea behind AdaBoost is quite simple. One just tries to build a strong classifier from a set of potentially weak classifiers. The idea is to choose the weak classifiers in a way that when combined they perform much better.

One of the classic applications of this technique is Face Detection. Lets say you have several thousand different filters you can apply to an image. You would then just apply each filter to a set of training images some of which have the feature you are looking for and some of which don't. Each filter would output some real number for each of the images. You would then choose a threshold for each filter such that it classifies over 50% correctly. Then just choose the filter with the highest accuracy.

One just repeats this as long as one has more filters to choose or a high enough accuracy is achieved. However at each stage when we choose a new filter we re-weight all the data samples that were correctly classified to have a lesser weight and the incorrectly classified data points to have a stronger weight. So as you progress you find filters that are specific to difficult cases.

There are some mathematically provable facts about error bounds, and papers on how to choose the weights at each step, but this is the basic idea. Probably the hardest aspect of this problem is how do you choose your set of filters or weak classifiers that you want to use.

Here is the classic paper on Face Detection using this technique.

YABE Blogging Engine

Copyright © 2008 Jeff Bergman