Head tracking with WebRTC

15. June 2012

A lot of exciting new standards are coming to browsers these days, among them the WebRTC standard, which adds support for streaming video and audio from native devices such as a webcamera. One of the exciting things this enables is so-called head tracking. We decided to do a little demonstration of this for the Opera 12 release, which is the first desktop browser to support video streaming via the getUserMedia API.

If you haven’t tried our fancy game out already, do so here:

The demo in the topmost video can be found here, though note that this needs WebGL support as well. Both demos work best if your camera is mounted above your screen (like the internal webcameras on most laptops) and your face is evenly lit. And of course you need a browser that supports getUserMedia and a computer with a webcamera.

The javascript library I made for the task, headtrackr.js, is now freely available here. It’s not well documented yet, but I’ll try to fix that in the coming weeks. In this post I’ll give you a very rough overview of how it’s put together.

My implementation of head tracking consists of four main parts:

a face detector
a tracking mechanism
a smoother
the headposition calculation

diagram

For the face detection, we use an existing javascript library called ccv. This library uses a Viola-Jones type algorithm (with some modifications), which is a very fast and reasonably precise face detection algorithm. We could have used this to detect the face in every video frame, but that would probably not have run in real time. It also would not have been able to detect the face in all positions, for instance if the head was tilted or turned slightly away from the camera.

Instead we use a more lightweight object tracking algorithm called camshift, which we initialize with the position of the face we detected. Camshift tracks any object in an image (or video) based only on its color histogram and the color histogram of the surrounding elements; see this article for details. Our javascript implementation was ported from an actionscript library called FaceIt, with some modifications. You can test the camshift algorithm alone here.
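To make the idea a bit more concrete, here is a minimal sketch of the mean-shift step that sits at the core of camshift. It is not the code from headtrackr.js or FaceIt. It assumes we already have a weight map (one number per pixel saying how well that pixel's color matches the face histogram), and the names `meanShift`, `weights` and `win` are made up for this example. Full camshift also resizes and rotates the search window on each frame, which is left out here.

```javascript
// Move a search window toward the center of mass of the weights inside it.
// weights: array of length width * height, row-major, values >= 0.
// win: { x, y, w, h } in pixels (top-left corner plus size).
function meanShift(weights, width, height, win, maxIter = 10) {
  let x = win.x;
  let y = win.y;
  const w = win.w;
  const h = win.h;

  for (let iter = 0; iter < maxIter; iter++) {
    // Clamp the window to the image bounds.
    const x0 = Math.max(0, Math.floor(x));
    const y0 = Math.max(0, Math.floor(y));
    const x1 = Math.min(width, Math.floor(x + w));
    const y1 = Math.min(height, Math.floor(y + h));

    // Zeroth and first moments of the weights inside the window.
    let m00 = 0;
    let m10 = 0;
    let m01 = 0;
    for (let py = y0; py < y1; py++) {
      for (let px = x0; px < x1; px++) {
        const wgt = weights[py * width + px];
        m00 += wgt;
        m10 += wgt * (px + 0.5); // use pixel centers
        m01 += wgt * (py + 0.5);
      }
    }
    if (m00 === 0) break; // nothing matching inside the window

    // Re-center the window on the centroid.
    const nx = m10 / m00 - w / 2;
    const ny = m01 / m00 - h / 2;
    const moved = Math.abs(nx - x) + Math.abs(ny - y);
    x = nx;
    y = ny;
    if (moved < 1) break; // converged
  }
  return { x, y, w, h };
}
```

Starting the window at the rectangle returned by the face detector and running this on every frame gives a rough position that follows the face as it moves.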

Though the camshift algorithm is pretty fast, it’s also a bit imprecise and will jump around a bit, which can cause annoying jitter in the face tracking. Therefore we apply a smoother to each position we receive. In our case we use double exponential smoothing, as it’s pretty easy to calculate.
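As a rough illustration, double exponential smoothing keeps two running values per coordinate: a smoothed level and a smoothed trend. The sketch below uses Holt's two-parameter form, which may differ from the variant in headtrackr.js; `createSmoother`, `alpha` and `beta` are names and parameters chosen for this example only.

```javascript
// Returns a function that smooths a stream of numbers.
// alpha: weight of the newest measurement (0..1).
// beta: weight of the newest trend estimate (0..1).
function createSmoother(alpha, beta) {
  let level = null; // smoothed value
  let trend = 0;    // smoothed change per step

  return function smooth(value) {
    if (level === null) {
      // First sample: nothing to smooth against yet.
      level = value;
      return level;
    }
    const prevLevel = level;
    // Blend the new measurement with the prediction (previous level + trend).
    level = alpha * value + (1 - alpha) * (prevLevel + trend);
    // Update the trend from how much the level actually moved.
    trend = beta * (level - prevLevel) + (1 - beta) * trend;
    return level;
  };
}

// One smoother per coordinate, for example:
const smoothX = createSmoother(0.5, 0.5);
const smoothedPositions = [10, 20, 20].map(smoothX);
```

Lower values of `alpha` give a steadier but laggier position, so there is a trade-off between jitter and responsiveness.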

We now know the approximate position and size of the face in the image. In order to calculate the position of the head, we need to know one more thing. Webcameras have widely differing angles of “field of view”, which will affect the size and position of the face in the video. For an example, see the image below (courtesy of D Flam). To get around this, we estimate the “field of view” of the current camera by assuming that the user, at first initialization, is sitting around 60 cm away from the camera (which is a comfortable distance from the screen, at least for laptop displays), and then seeing how large a portion of the image the face fills. This estimated “field of view” is then used for the rest of the head tracking session.
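One way this estimate could be computed is sketched below. The 60 cm distance comes from the text above; the 15 cm face width is purely an illustrative assumption, and `estimateFov` and its parameters are hypothetical names, not the library's API.

```javascript
// Estimate the horizontal field of view (in radians) from how much of the
// image width the face fills at a known (assumed) distance.
function estimateFov(
  faceWidthPx,
  imageWidthPx,
  assumedDistanceCm = 60,  // user assumed to sit about 60 cm from the camera
  assumedFaceWidthCm = 15  // illustrative guess for an average face width
) {
  // Real-world width covered by the whole image at the face's distance.
  const visibleWidthCm = assumedFaceWidthCm * (imageWidthPx / faceWidthPx);
  // Half the visible width and the distance form a right triangle.
  return 2 * Math.atan(visibleWidthCm / (2 * assumedDistanceCm));
}
```

With these assumptions, a face that fills one eighth of the image width corresponds to a field of view of about 90 degrees.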

Using this “field of view”-estimate, and some assumptions about the average size of a person’s face, we can calculate the distance of the head from the camera by way of some trigonometry. I won’t go into the details, but here’s a figure. Hope you remember your maths!
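For those who prefer code to figures, the distance calculation might look roughly like the sketch below. It assumes the field of view is given in radians and, again, an illustrative average face width of 15 cm; `headDistance` is a made-up name for this example.

```javascript
// Distance from camera to head (in cm), given how wide the face appears.
function headDistance(
  faceWidthPx,
  imageWidthPx,
  fovRad,
  assumedFaceWidthCm = 15 // illustrative guess, not a value from the library
) {
  // Width of the scene (in cm) that the whole image covers at the head's distance.
  const visibleWidthCm = assumedFaceWidthCm * (imageWidthPx / faceWidthPx);
  // tan(fov / 2) = (visibleWidth / 2) / distance
  return visibleWidthCm / (2 * Math.tan(fovRad / 2));
}
```

So if the face appears twice as wide in the image, the estimated distance is halved.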

trigonometry diagram

Calculating the x- and y-position relative to the camera is a similar exercise. At this point we have the position of the head in relation to the camera. In the facekat demo above, we just used these positions as the input to a mouseEvent-type controller.
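A possible sketch of that calculation follows. It assumes square pixels, a camera looking straight ahead, and image coordinates with the origin in the top-left corner; `headPosition` and its parameters are hypothetical names. Whether x needs to be mirrored depends on whether the video is displayed flipped, which is not handled here.

```javascript
// Convert the face center (in pixels) and the head distance (in cm) into
// x/y offsets (in cm) from the camera's optical axis.
function headPosition(faceCenterPx, imageSizePx, distanceCm, fovRad) {
  // Size of one pixel in cm at the head's distance.
  const cmPerPx = (2 * distanceCm * Math.tan(fovRad / 2)) / imageSizePx.width;
  return {
    x: (faceCenterPx.x - imageSizePx.width / 2) * cmPerPx,
    // Image y grows downward, so flip it to make "up" positive.
    y: (imageSizePx.height / 2 - faceCenterPx.y) * cmPerPx,
    z: distanceCm,
  };
}
```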

If we want to go further and create the head-coupled perspective seen in the first video, we’ll have to use the head positions to directly control the camera in a 3D model. To get the completely correct perspective we also have to use an off-axis view (aka asymmetric frustum). This is because we want to counteract the distortion that arises when the user is looking at the screen from an angle, perhaps best explained by the figure below.

off-axis view diagram

In our case we used the excellent 3D library three.js. In three.js it’s pretty straightforward to create the off-axis view if we abuse the interface called camera.setViewOffset.
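If you are not using three.js, the asymmetric frustum can also be computed by hand. The sketch below is not what the demo does (the demo goes through camera.setViewOffset); it assumes the screen is a rectangle centered at the origin in the plane z = 0, with the viewer's eye at a positive z, and all names are made up for this example. The result is the kind of left/right/bottom/top set that a glFrustum-style projection takes.

```javascript
// Frustum bounds at the near plane for an eye that is not centered
// in front of the screen. All lengths use the same unit (e.g. cm).
function offAxisFrustum(eye, screenWidth, screenHeight, near) {
  // Scale from the screen plane (at distance eye.z) to the near plane.
  const scale = near / eye.z;
  return {
    left: (-screenWidth / 2 - eye.x) * scale,
    right: (screenWidth / 2 - eye.x) * scale,
    bottom: (-screenHeight / 2 - eye.y) * scale,
    top: (screenHeight / 2 - eye.y) * scale,
  };
}
```

When the eye is centered, left and right are mirror images of each other and we are back to an ordinary symmetric perspective; as the head moves sideways the frustum skews in the opposite direction.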

Overall, the finished result works decently, at least if you have a good camera and even lighting. Note that the effect looks much more convincing on video, as we then have no visual cue for the depth of the other objects in the scene, while in real life our eyes are not so easily fooled.

One of the problems I stumbled upon while working with this demo was that the quality of webcameras varies widely. Regular webcameras often have a lot of chromatic aberration at the edges of the field of view due to cheap lenses, which dramatically affects the tracking effectiveness outside of the immediate center of the video. In my experience the built-in cameras on Apple Macbooks had very little such distortion. You get what you pay for, I guess.

Most webcameras also adjust brightness and white balance automatically, which in our case is not very helpful, as it messes up the camshift tracking. Often the first thing that happens when video starts streaming is that the camera starts to adjust the white balance, which means that we have to check that the colors are stable before doing any sort of face detection. If the camera adjusts the brightness a lot after we’ve started tracking the face, there’s not much we can do except reinitialize the face detection.
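A simple way to check for stable colors, sketched here under my own assumptions rather than taken from headtrackr.js, is to compare the average color of consecutive frames and wait until it stops drifting. The pixel data is assumed to be in the RGBA layout returned by a canvas `getImageData(...).data` call; `meanColor`, `colorsStable` and the tolerance value are made up for this example.

```javascript
// Average red, green and blue over an RGBA pixel array.
function meanColor(rgba) {
  let r = 0;
  let g = 0;
  let b = 0;
  const pixelCount = rgba.length / 4;
  for (let i = 0; i < rgba.length; i += 4) {
    r += rgba[i];
    g += rgba[i + 1];
    b += rgba[i + 2];
    // rgba[i + 3] is alpha, which we ignore.
  }
  return [r / pixelCount, g / pixelCount, b / pixelCount];
}

// True if no channel average moved more than `tolerance` between frames.
function colorsStable(prevMean, currMean, tolerance = 2) {
  return prevMean.every((value, i) => Math.abs(value - currMean[i]) <= tolerance);
}
```

Face detection would then only start once `colorsStable` has returned true for a few frames in a row.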

To give credit where credit is due, the inspiration for this demo was this video that was buzzing around the web a couple of years ago. In it, Johnny Chung Lee had hacked a Wii remote to capture the motions of the user. Later on, some French researchers decided to try out the same thing without the Wii remote. Instead of motion sensors they used the front-facing camera of the iPad to detect and track the rough position of the head, with pretty convincing results. The result is available as the iPad app i3D and can be seen here:

Although head-coupled perspective might not be ready for any type of generic interaction via the web camera yet, it works fine with simple games like facekat. I’m sure there are many improvements that can make it more precise and failproof, though. The library and demos were patched together pretty fast, and there are several improvements that I didn’t get time to test out, such as:

tweaking the settings of the camshift algorithm
using other tracking algorithms, such as Bayesian mean shift, which also uses information about the background immediately surrounding the face
maybe using edge detection to further demarcate the edges of the face, though this might be a bit heavy on processing
using requestAnimationFrame instead of setInterval
using hue and saturation for the camshift algorithm (which the original camshift paper suggests) instead of RGB
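As a starting point for the last item, here is a small sketch of the standard RGB to hue/saturation conversion. It is not part of headtrackr.js, and `rgbToHueSat` is a name made up for this example. A hue-based histogram would bin pixels by `h` instead of by raw RGB values, which should make tracking less sensitive to brightness changes.

```javascript
// Convert 8-bit RGB to hue (degrees, 0..360) and saturation (0..1),
// using the usual HSV definitions.
function rgbToHueSat(r, g, b) {
  const max = Math.max(r, g, b);
  const min = Math.min(r, g, b);
  const delta = max - min;

  let h = 0; // hue is undefined for grays; 0 is used as a placeholder
  if (delta !== 0) {
    if (max === r) {
      h = ((g - b) / delta) % 6;
    } else if (max === g) {
      h = (b - r) / delta + 2;
    } else {
      h = (r - g) / delta + 4;
    }
    h *= 60;
    if (h < 0) h += 360;
  }

  const s = max === 0 ? 0 : delta / max;
  return { h, s };
}
```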

If you feel like implementing any of these, feel free to grab a fork! Meanwhile, I’m pretty sure we’ll see many more exciting things turn up once WebRTC becomes supported across more browsers. Check out this, for instance…

Update: a slightly edited version of this post, which also includes some more details about the trigonometry calculations, was published at dev.opera.com