In the preprocessing step, characteristic biometric features are extracted from samples recorded by a set of input sensors, such as a camera and a microphone.
Depending on the configuration, the recorded sample can currently consist of:
Up to two optical biometric features can be extracted from the video sequence:
To extract these features, the exact position of the face must be known. Since BioID should be usable in arbitrary environments with off-the-shelf video equipment, face finding is one of the most important steps in feature extraction.
BioID uses a two-stage, model-based algorithm to detect the location of a human face in an arbitrary image. In the first stage, a binary face model is matched against a binarized version of the current scene. The comparison uses the modified Hausdorff distance, which determines the optimal location, scaling, and rotation of the model.
In the second stage, the estimated face position is refined by matching a more detailed eye region model, again using the Hausdorff distance for comparison. The exact eye positions are then determined from the output of an artificial neural network (ANN) specialized in detecting eye centers.
The eye positions are the basis for all further processing: using anthropometric knowledge, a normalized portion of the face and of the mouth region can be extracted.
Figure: Face localization with a binary face model
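As a rough illustration of the model matching described above, the following Python sketch scores a set of model edge points against a set of scene edge points. It uses one common definition of the directed modified Hausdorff distance (the average distance from each model point to its nearest scene point) and searches exhaustively over integer translations. The function names, the point-list representation, and the toy data are assumptions made for this example. The search over scaling and rotation is omitted for brevity, and the sketch is not meant to reproduce BioID's actual implementation.

```python
import math


def directed_mhd(model_points, scene_points):
    """Average distance from each model point to its nearest scene point."""
    total = 0.0
    for point in model_points:
        total += min(math.dist(point, other) for other in scene_points)
    return total / len(model_points)


def best_translation(model_points, scene_points, max_shift):
    """Try every integer shift (dx, dy) with 0 <= dx, dy <= max_shift.

    Returns (score, dx, dy) for the lowest distance found.
    Scaling and rotation are deliberately left out of this sketch.
    """
    best = None
    for dx in range(max_shift + 1):
        for dy in range(max_shift + 1):
            shifted = [(x + dx, y + dy) for x, y in model_points]
            score = directed_mhd(shifted, scene_points)
            if best is None or score < best[0]:
                best = (score, dx, dy)
    return best


# Hypothetical toy data: a three-point "model" hidden in a scene with clutter.
model = [(0, 0), (2, 0), (1, 2)]
scene = [(3, 1), (5, 1), (4, 3), (9, 9)]
score, dx, dy = best_translation(model, scene, max_shift=5)
print(score, dx, dy)
```

Because the distance is averaged over the model points only, the clutter point in the scene does not affect the score, which is one reason this directed form suits finding a small model inside a larger scene.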
Of the first three frames of the video sequence, the one with the best face detection match is used for face recognition.
The face is transformed to a uniform size. This procedure ensures that the appropriate biometric features of the face are analyzed, and not, for example, the size of the head, the hair style, a tie, or a piece of jewelry.
Further preprocessing steps reduce the impact of lighting conditions and color variance. Afterwards, feature extraction methods are applied to the normalized image, resulting in a face feature vector, which is then used by the classifier.
Figure: Samples of extracted faces
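The transformation of the face to a uniform size, driven by the detected eye positions, can be pictured as a similarity transform (rotation, uniform scaling, and translation) that moves both eye centers to fixed target coordinates. The Python sketch below is one way to compute such a mapping. The function name and the target coordinates are hypothetical and chosen only for illustration; the source does not state which geometry BioID uses.

```python
def eye_alignment(left_eye, right_eye,
                  target_left=(30.0, 40.0), target_right=(70.0, 40.0)):
    """Return a function mapping image points into a normalized face frame.

    Points are treated as complex numbers, so the similarity transform
    is simply z -> a * z + b. The target coordinates are assumptions.
    """
    s1, s2 = complex(*left_eye), complex(*right_eye)
    t1, t2 = complex(*target_left), complex(*target_right)
    a = (t2 - t1) / (s2 - s1)  # rotation and uniform scale
    b = t1 - a * s1            # translation

    def transform(point):
        z = a * complex(*point) + b
        return (z.real, z.imag)

    return transform


# Hypothetical eye centers found by the detector, in pixel coordinates.
to_canonical = eye_alignment((100, 120), (180, 120))
print(to_canonical((140, 120)))  # midpoint between the eyes
```

With both eyes pinned to the same coordinates in every normalized image, the mouth region can be cut out at a fixed offset relative to them.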
Lip movement recognition is based on the idea that the motion of the mouth during speech is unique to every person. The following figure shows a video sequence of a speaker's lip region while a pass phrase is spoken:
Figure: Motion of the lip region
The BioID mimic features represent the appearance of the lip area during the recording period. An area of interest (AOI) around the lips is extracted from the first 16 frames of the recorded video sequence. The position of the AOI is determined using anthropometric knowledge, based on the estimated eye positions. Tracking the whole face region with a motion tracker compensates for unwanted motion caused by changes in head position.
After the extracted images have been preprocessed to remove the influence of illumination, each lip image is subjected to an orthogonal transform. BioID uses the resulting compact representations of the lip area as the classification feature for the mimic channel.
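The text does not say which orthogonal transform is applied to the lip images. Purely as an illustration of how an orthogonal transform can yield a compact representation, the Python sketch below assumes a two-dimensional Discrete Cosine Transform (DCT) and keeps only the low-frequency block of coefficients. The choice of the DCT, the function names, and the number of retained coefficients are all assumptions for this example.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (rows are basis vectors)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * k * (i + 0.5) / n)
                     for i in range(n)])
    return rows


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def compact_features(image, keep):
    """2-D DCT of a square image; return the keep x keep low-frequency block."""
    basis = dct_matrix(len(image))
    coeffs = matmul(matmul(basis, image), transpose(basis))
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]


# Hypothetical 4 x 4 "lip image" with constant brightness.
flat_image = [[2.0] * 4 for _ in range(4)]
features = compact_features(flat_image, keep=2)
print(features)
```

For smooth images most of the energy ends up in a few low-order coefficients, which is what makes truncating the transform a reasonable way to shorten the feature vector.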
The computation of the features used in the voice trait can be separated into two stages: preprocessing and the actual feature extraction. The preprocessing consists of a speech enhancement filter that operates in the frequency domain. The time signal is cut into pieces of equal length (called "frames"), which are transformed with a short-time frequency analysis (Fast Fourier Transform, FFT). In the frequency domain, stationary noise and background components of the signal are estimated from a comparatively small part of the signal. This noise estimate is spectrally subtracted, and the resulting "cleaned" signal is transformed back to the time domain. The figure below shows the waveform of an audio signal with heavy background noise and the waveform of the same signal after speech enhancement.
Figure: Noisy audio signal in time domain before and after speech enhancement
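The speech enhancement steps described above (framing, FFT, noise estimation, spectral subtraction, inverse transform) can be sketched in a few lines of Python. This is a minimal sketch under several assumptions that are not taken from the source: frames do not overlap, the leading frames are treated as noise only, magnitudes are subtracted and clamped at zero, and the noisy phase is reused. The function names and default parameters are hypothetical.

```python
import cmath
import math


def fft(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(spectrum):
    """Inverse FFT via the conjugation identity."""
    n = len(spectrum)
    conj = fft([v.conjugate() for v in spectrum])
    return [v.conjugate() / n for v in conj]


def spectral_subtraction(signal, frame_len=64, noise_frames=4):
    """Subtract a stationary noise magnitude estimate from every frame."""
    starts = range(0, len(signal) - frame_len + 1, frame_len)
    spectra = [fft(signal[s:s + frame_len]) for s in starts]
    if not spectra:
        return []
    # Noise estimate: mean magnitude per bin over the leading frames,
    # which are ASSUMED to contain background noise only.
    lead = spectra[:noise_frames]
    noise_mag = [sum(abs(spec[k]) for spec in lead) / len(lead)
                 for k in range(frame_len)]
    cleaned = []
    for spec in spectra:
        reduced = []
        for k, value in enumerate(spec):
            magnitude = max(abs(value) - noise_mag[k], 0.0)  # clamp at zero
            reduced.append(cmath.rect(magnitude, cmath.phase(value)))
        cleaned.extend(v.real for v in ifft(reduced))  # back to time domain
    return cleaned


# Toy input: a steady hum everywhere, plus a tone in the second half.
hum = [0.5 * math.sin(2 * math.pi * 8 * n / 64) for n in range(512)]
tone = [math.sin(2 * math.pi * 3 * n / 64) if n >= 256 else 0.0
        for n in range(512)]
noisy = [h + t for h, t in zip(hum, tone)]
cleaned = spectral_subtraction(noisy)
```

In this toy case the hum occupies a single frequency bin, so the subtraction removes it almost completely while leaving the tone intact. Real background noise is broadband and fluctuates, so practical systems typically use overlapping windowed frames and smoother noise estimates.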
In the feature extraction stage, the enhanced audio signal is sliced into frames again, this time using a window approximately twice as large as the window used for speech enhancement, which increases the spectral resolution of the FFT-transformed frames. The dimension of the spectral vectors is reduced by mel-spaced band-pass filtering, which mimics the perception of the human ear. The result of this filter operation is transformed again by applying a Discrete Cosine Transform (DCT), yielding the actual feature vectors in cepstral space. The elements of such a feature vector are called Mel Frequency Cepstral Coefficients (MFCC). The figure below visualizes the features that result from one second of audio data. The x-axis is the time axis, and the y-axis corresponds to the index of the MFC coefficient. Each pixel column therefore represents one MFCC vector, which is the result of one frame of the signal.
Figure: Visualization of the MFCC feature vector sequence of a pass phrase utterance
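The chain described above (FFT, mel-spaced band-pass filtering, DCT) can be sketched for a single frame as follows. This is a generic textbook-style MFCC computation, not BioID's parameterization: the Hamming window, the logarithm of the filter energies, the mel-scale formula, the number of filters, and the number of coefficients are all assumptions made for this example, and every function name is hypothetical.

```python
import cmath
import math


def fft(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)  # a common mel formula


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(num_filters, frame_len, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (num_filters + 1))
             for i in range(num_filters + 2)]
    bank = []
    for m in range(1, num_filters + 1):
        low, mid, high = edges[m - 1], edges[m], edges[m + 1]
        weights = []
        for k in range(frame_len // 2 + 1):
            freq = k * sample_rate / frame_len
            if low < freq <= mid:
                weights.append((freq - low) / (mid - low))
            elif mid < freq < high:
                weights.append((high - freq) / (high - mid))
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def mfcc_frame(frame, sample_rate, num_filters=20, num_coeffs=12):
    """MFCC vector of one frame (frame length must be a power of two)."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]  # Hamming window (assumption)
    power = [abs(v) ** 2 for v in fft(windowed)[:n // 2 + 1]]
    log_energy = [math.log(sum(w * p for w, p in zip(weights, power)) + 1e-10)
                  for weights in mel_filterbank(num_filters, n, sample_rate)]
    # Unnormalized DCT-II of the log filter energies gives the cepstrum.
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / num_filters)
                for m, e in enumerate(log_energy))
            for c in range(num_coeffs)]


# Toy input: one 256-sample frame of a 1 kHz tone sampled at 8 kHz.
frame = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(256)]
coeffs = mfcc_frame(frame, sample_rate=8000)
print(len(coeffs))
```

Applying `mfcc_frame` to successive frames of an utterance produces the sequence of column vectors that the visualization above displays.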
As a second normalization step, complementing the speech enhancement at the beginning, Cepstral Mean Subtraction (CMS) is applied to further reduce channel effects and increase robustness. This technique operates directly in feature space and estimates the mean feature vector over a longer time period than speech enhancement does. It can therefore reduce longer-term stationary background noise more efficiently than speech enhancement.
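CMS itself is a small operation: the mean of each cepstral coefficient over the utterance is subtracted from every feature vector. It works because a time-invariant channel multiplies the spectrum, which becomes an additive constant after the logarithm and the DCT. The Python sketch below assumes the mean is taken over the whole utterance; the function name and the list-of-lists representation are hypothetical.

```python
def cepstral_mean_subtraction(features):
    """Subtract the per-coefficient mean from a sequence of feature vectors."""
    count = len(features)
    dim = len(features[0])
    mean = [sum(vec[d] for vec in features) / count for d in range(dim)]
    return [[vec[d] - mean[d] for d in range(dim)] for vec in features]


# Toy data: two 2-dimensional feature vectors, then the same vectors
# with a constant offset standing in for a different recording channel.
clean = [[1.0, 2.0], [3.0, 6.0]]
shifted = [[v + 0.5 for v in vec] for vec in clean]
print(cepstral_mean_subtraction(clean))
print(cepstral_mean_subtraction(shifted))
```

Both calls yield the same normalized vectors, which shows that a constant cepstral offset no longer reaches the classifier.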
The MFCCs contain information about the speaker of the utterance as well as about the phonemes that have been uttered. As the voice trait of BioID is based on text-dependent speaker recognition, both kinds of information are valuable for the classification decision, which is based entirely on the information present in the feature vectors. Nevertheless, the speaker should have more influence on the classification decision than the text of the spoken utterance, so that an attack by an impostor who knows a valid user's pass phrase is unlikely to succeed. Therefore, the parameterization of the MFCC computation has been adapted to the specific task in BioID, increasing the feature dimension compared to the MFCC features commonly used in speech recognition applications.