Xueqing Sun

June 12^{th}, 2011

*You can distribute, remix, tweak, and build upon this work, even commercially, as long as you credit Xueqing Sun for the original creation.*

I was reading a book, ‘Image Processing, Analysis, and Machine Vision, Milan Sonka, Vaclav Hlavac, Roger Boyle’ for studying mean-shift object tracking algorithm. But the book was really hard to understand, especially for a beginner of machine vision. Yesterday, with the help of Guangwei, a friend of mine (he is a scientist at National Astronomical Observatories, Chinese Academy of Science), I finally understood the algorithm. And I would like to explain what I understood for sake of helping other guys who are still struggling on this algorithm.

**Heads up Questions**

1. What is the probability density function of an image?

2. How does the density function contribute to target model?

3. What’s kernel function? How does it work?

Here is the summary of the algorithm

To locate a target on an image, we must have a target model, an image which contains the target. The goal of mean-shift algorithm is to tell you the position of your target on the image. Suppose we have a target model image with 5*5 pixels. The pixel values listed below:

x | y | pixel |

0 | 0 | 1 |

0 | 1 | 1 |

0 | 2 | 6 |

0 | 3 | 5 |

0 | 4 | 7 |

1 | 0 | 0 |

1 | 1 | 4 |

1 | 2 | 7 |

1 | 3 | 0 |

1 | 4 | 4 |

2 | 0 | 3 |

2 | 1 | 5 |

2 | 2 | 7 |

2 | 3 | 1 |

2 | 4 | 5 |

3 | 0 | 7 |

3 | 1 | 0 |

3 | 2 | 9 |

3 | 3 | 9 |

3 | 4 | 2 |

4 | 0 | 0 |

4 | 1 | 1 |

4 | 2 | 4 |

4 | 3 | 3 |

4 | 4 | 3 |

Target model image

The row x and y are the location of a pixel on the image. And the third row ‘pixel’ is the value. For convenient purpose, we limit pixel value from 0 to 9. Now we can create a frequency table like below. Normally images are colored, but we simplified the problem from colored images to gray scale images. A colored image can be considered as a composition of three gray scale images. So the rationale explained here can be easily applied to colored images.

bin | q |

1 | 8 |

2 | 1 |

3 | 3 |

4 | 3 |

5 | 3 |

6 | 1 |

7 | 4 |

8 | 0 |

9 | 2 |

Frequency

The frequency means how many pixels located in a specific bin. For example, the first row means there are 8 pixels which values are less or equal to ‘1’. Furthermore we can have histogram chart for the table.

Histogram

On the book, and many other materials, the density function of target model, defined as below:

1 |

The density function is closely related with the frequency table. Here the black ** **is a vector which has m components. In this case, the m is 9 because we have 9 bins. Each component is made from corresponding bin. Let look into the component definition.

2 |

The definition looks somewhat complex. Let’s split it into parts so we can understand it easier. First, let’s see the delta part

3 |

n is the pixel number. In this case we have 25 pixels. So the n is 25. is the coordinate of i^{th} pixel in the image. For example, the first pixel coordinate is (0,0), and the second is (0,1). u means we are currently defining u^{th} component corresponding to u^{th} bin. Function tells which bin the pixel go. For example, the first pixel (0,0) is 1, so it goes to bin 1, i.e. .

Actually, the formula 3 is a vector, let’s rewrite it as following:

4 |

The delta function gives you 1 if the pixel goes to bin u, otherwise it gives 0. So this vector works just like a mask. On the mask there are ‘1’s mark the pixels go to bin u. For example, if our image is 4*4, we might get a **D** vector like this:

1 | 0 | 0 | 1 |

0 | 1 | 1 | 0 |

0 | 0 | 0 | 1 |

1 | 0 | 0 | 0 |

The image is a matrix. **D** is a vector. Just put each row of the matrix to one row in sequence then we’ll get **D**.

**IMPORTANT**

What’s the relationship betweenThe answer is for each u, . Actually we can form feature space only based onDand the frequency table?D, because D already represents the image in statistics way. So mean-shift algorithm is so called an algorithm that based on probability density function. But marginal pixels are unstable, so the kernel function which is a weight function does the work. It give marginal pixels less weight to make the objective function in optimization process more stable.

OK, let’s see another part of formula 2. It’s

5 |

Again it’s a vector. Function k is the kernel function which actually is a weight function giving each pixel position a weight. So can be defined as below:

6 |

It looks beautiful. I really hate which makes formula unreadable.

OK, now we get the density function of target model image. The candidate density function is pretty similar.

7 | |

8 |

Formula 8 is easier to understand because we’ve already understood formula 2. The k function in formula 8 is just a transformed version of original k.

I think I’ve solved the hardest part of the mean-shift algorithm. And the rest of it would be much easier to understand because it is just kind of an optimism algorithm to find the **y**.