Object recognition algorithm for mobile devices


Image Processing & Communication, vol. 18, no. 1, pp. 31-36, DOI: 10.2478/v10248-012-0088-x



    R. Kozik, A. Marchewka
    Institute of Telecommunications, University of Technology & Life Sciences in Bydgoszcz, ul. Kaliskiego 7, 85-789 Bydgoszcz, Poland


    Abstract. In this paper an object recognition algorithm for mobile devices is presented. The algorithm is based on a hierarchical approach to visual information coding proposed by Riesenhuber and Poggio [1] and later extended by Serre et al. [2]. The proposed method adapts an efficient algorithm to extract information about local gradients, which allows it to approximate the behaviour of the simple-cells layer of the Riesenhuber and Poggio model. Moreover, it is also demonstrated that the proposed algorithm can be successfully deployed on a low-cost Android smartphone.

    1 Introduction

    This paper is a continuation of the authors' recent research on object detection algorithms dedicated to mobile devices. Tentative results have been presented in [4]. In contrast to the previous work, modifications to the overall method architecture have been proposed. Moreover, the experiments described in this paper have been conducted on an extended database of images.

    The proposed approach follows the idea of the HMAX model proposed by Riesenhuber and Poggio [1]. It adapts the hierarchical feed-forward processing approach; however, modifications are introduced in order to reduce the computational complexity of the original algorithm.

    Moreover, in contrast to the Mutch and Lowe [10] model, an additional layer that mimics the "retina coding" mechanism is added. The results showed that this step increases the robustness of the proposed method. However, it is considered an optional step of visual information processing, because it introduces an additional computational burden that can increase the response time (the frame rate drops significantly) on low-cost mobile devices.

    In contrast to the original HMAX model, a different method for calculating the S1 layer response is proposed. The responses of 4 Gabor filters are approximated with an algorithm that mimics the behaviour of HOG descriptors. The HOG [3] descriptors are commonly used for object recognition problems. The image is decomposed into small cells, and in each cell a histogram of oriented gradients is computed. Typically for HOG features the result is normalised and the descriptor is returned for each cell. However, in contrast to the original HOG descriptor proposed by Dalal and Triggs, in this method (in order to simplify the computations) the blocks of cells do not overlap and the input image is not normalised.

    Moreover, in previous work [7] the S2 and C2 layers were replaced with a machine-learned classifier. Such an approach significantly simplifies the learning process and decreases the amount of computation. However, it impacts the invariance to shape changes of a recognised object, and the feature vectors presented to the classifier are of relatively greater dimensionality.

    2 The HMAX model overview

    The HMAX Visual Cortex model proposed by Riesenhuber and Poggio [1] exploits a hierarchical structure for image processing and coding. It is arranged in several layers that process information in a bottom-up manner. The lowest layer is fed with a grayscale image. The higher layers of the model are called either "S" or "C". These names correspond to the simple (S) and complex (C) cells discovered by Hubel and Wiesel [9]. Both types of cells are located in the striate cortex (called V1), which is the part of the visual cortex that lies in the most posterior area of the occipital lobe.

    The simple cells located in "S" layers apply local filters whose responses form a vector of texture features. As noted by Hubel and Wiesel, individual cells in the cortex respond to the presence of edges. They also discovered that these cells are sensitive to edge orientation (some cells fire only when a given orientation of an edge is observed). The complex cells located in "C" layers calculate, within a limited range of the previous layer, the strongest responses of a given type (orientation). In this way more complex combinations of simple features are obtained.

    The hierarchical HMAX model proposed by Mutch and Lowe [10] is a modification of the model presented by Serre et al. in [2]. It introduces two layers of simple cells (S1 and S2) and two layers of complex cells (see Fig. 1). It uses a set of filters designed to emulate V1 simple cells. The complex layers are computed using a hard max filter [11]; that is, the "C" cell responses are the maximum values of the associated "S" cells. As shown in Fig. 1, the images are processed by the subsequent simple- and complex-cell layers and reduced to feature vectors, which are further used in the classification process. The set of features (F) is shared across all images and object categories. Features are computed hierarchically in subsequent layers, each built from the previous one by alternating the template matching and the max pooling operations.

    Fig. 1: The structure of the hierarchical model proposed by Mutch and Lowe [10]. (Symbols used: I - image, S - simple cells, C - complex cells, GF - Gabor filters, F - prototype feature vectors, * - convolution operation)

    The S1 layer in the Mutch and Lowe [10] model adopts 2D Gabor filters computed for four orientations (horizontal, vertical, and two diagonals) at each possible position and scale. The Gabor filters are 11 × 11 in size, and are described by:

    G(x, y) = exp(−(X² + γ²Y²)/(2σ²)) · cos(2πX/λ)    (1)

    where X = x·cos θ − y·sin θ and Y = x·sin θ + y·cos θ; x, y ∈ ⟨−5, 5⟩, and θ ∈ ⟨0, π⟩. The aspect ratio (γ), effective width (σ), and wavelength (λ) are set to 0.3, 4.5 and 5.6, respectively. The response of a patch of pixels X to a particular filter G is computed using formula (2):

    R(X, G) = | Σᵢ XᵢGᵢ | / √(Σᵢ Xᵢ²)    (2)
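    The filter bank of formula (1) and the normalised absolute response of formula (2) can be sketched in Python as follows. NumPy is used for brevity; the zero-mean, unit-norm normalisation of each filter is an assumption carried over from common HMAX implementations rather than stated in this paper:

    ```python
    import numpy as np

    def gabor_filter(theta, size=11, gamma=0.3, sigma=4.5, lam=5.6):
        """Build one 11x11 Gabor filter for orientation theta (Eq. 1)."""
        half = size // 2                      # x, y range over [-5, 5]
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        X = x * np.cos(theta) - y * np.sin(theta)
        Y = x * np.sin(theta) + y * np.cos(theta)
        g = np.exp(-(X**2 + (gamma * Y)**2) / (2 * sigma**2)) * np.cos(2 * np.pi * X / lam)
        # Assumed normalisation: zero mean, unit norm
        g -= g.mean()
        g /= np.linalg.norm(g)
        return g

    def response(patch, g):
        """Normalised absolute response of an image patch to filter g (Eq. 2)."""
        num = abs(np.sum(patch * g))
        den = np.sqrt(np.sum(patch**2))
        return num / den if den > 0 else 0.0

    # Four orientations: horizontal, vertical, and the two diagonals
    filters = [gabor_filter(t) for t in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)]
    ```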

    The complex cells located in the C1 layer pool associated units in the S1 layer. For each orientation, the S1 responses are convolved with a max filter that is 10 × 10 in size in the x, y dimensions (position) and 2 units deep in scale.
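    A minimal sketch of this max-pooling step for a single orientation and scale might look as follows; the stride value is an assumption, as the paper does not state how densely the 10 × 10 filter is applied:

    ```python
    import numpy as np

    def c1_max_pool(s1, pool=10, stride=5):
        """Sketch of C1 pooling: take the max S1 response in each pool x pool
        neighbourhood. The pooling over 2 adjacent scales is omitted here
        for brevity; the stride value is an assumption."""
        h, w = s1.shape
        out = []
        for y in range(0, h - pool + 1, stride):
            row = [s1[y:y + pool, x:x + pool].max()
                   for x in range(0, w - pool + 1, stride)]
            out.append(row)
        return np.array(out)
    ```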

    As shown in Fig. 1, the intermediate S2 layer is formed by convolving the C1 layer response with a set of intermediate-level features (depicted as F in Fig. 1). The set of intermediate-level features is established during the learning phase. For a given set of learning images the C1 responses are computed, and the most meaningful features are selected using SVM weighting. Mutch and Lowe [10] suggested sub-sampling the C1 responses before feature selection; therefore, the authors select, at random positions and scales of the C1 layer, patches of size 4 × 4, 8 × 8, 12 × 12, and 16 × 16. The selected and weighted features compose so-called prototypes, which are used in the Mutch and Lowe model as filters whose responses create the S2 layer. The C2 layer composes a global feature vector in which each element corresponds to the maximum response to a given prototype patch. In order to identify a visual object on the basis of the feature vector, a classifier is learnt (e.g. an SVM).

    3 The proposed method

    In order to approximate the S1 layer behaviour of the HMAX model, an algorithm that mimics the HOG descriptors is used. However, to make this step as simple as possible (due to the low computational power of mobile devices), only the gradient orientation binning scheme is adopted. In fact, the intention here is not to compute a HOG descriptor, but to approximate the responses of the four Gabor filters used in the original HMAX model.

    In the S1 layer there are N × M × 4 simple cells arranged in a grid of N × M blocks, with 4 cells in each block. Each cell is assigned a receptive field (the pixels inside the block). Each cell's activation depends on a stimulus; in this case there are four possible stimuli, namely vertical, horizontal, left-diagonal, and right-diagonal edges. As a result, the S1 simple-cell layer output has a dimensionality of 4 (x, y, scale, and cell type). In order to compute the responses of all four cells inside a given block (receptive field), Algorithm 1 is applied.

    Algorithm 1 Calculating the S1 response
    Require: Grayscale image I of size W × H
    Ensure: S1 layer of size N × M × 4
      Assign G_min the low threshold for the gradient magnitude
      for each pixel (x, y) in image I do
        Compute the horizontal G_x and vertical G_y gradients using the Prewitt operator
        Compute the gradient magnitude |G_{x,y}| at point (x, y)
        n ← x · N / W;  m ← y · M / H
        if |G_{x,y}| < G_min then
          go to next pixel
        else
          active ← get_cell_type(G_x, G_y)
          S1[n, m, active] ← S1[n, m, active] + |G_{x,y}|
        end if
      end for
    The algorithm computes the responses of all cells using only one iteration over the whole input image I. For each pixel at position (x, y) the vertical and horizontal gradients are computed (G_x and G_y). Given the pixel position (x, y) and the gradient vector [G_x, G_y], the algorithm indicates the block position (n, m) and the type of cell (active) whose response has to be incremented by |G_{x,y}|.

    In order to classify a given gradient vector [G_x, G_y] as horizontal, diagonal or vertical, the get_cell_type(·, ·) function uses a simple voting scheme. If the point (|G_x|, |G_y|) is located between the lines y = 0.3x and y = 3.3x, it is classified as a diagonal; if G_y is positive the vector is classified as a right diagonal (otherwise it is a left diagonal). If the point (|G_x|, |G_y|) is located above the line y = 3.3x, the gradient vector is classified as vertical, and as horizontal when it lies below the line y = 0.3x.

    The processing in the S1 layer is applied at three scales (the original image, and two images scaled by factors of 0.7 and 1.5). This is necessary to obtain the response of the C1 layer. The C1 complex-layer response is computed using a max-pooling filter applied to the S1 layer. This layer allows the algorithm to achieve scale invariance.

    The output of the C1 layer cells is used to select features that are meaningful for the given object-detection task. First, randomly selected images of an object are reduced to feature vectors by the consecutive layers of the proposed model. Afterwards, PCA is applied to calculate the eigenvectors. Each eigenvector is used as a single feature in the feature vector. The response of the S2 layer is obtained by computing the dot product of the eigenvectors with the C1 layer response.
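    This PCA-based feature extraction and the S2 dot-product step can be sketched as follows; the number of kept eigenvectors k is an assumed parameter, as the paper does not state it:

    ```python
    import numpy as np

    def fit_eigenvectors(c1_vectors, k=8):
        """PCA over the flattened C1 responses of the training images.
        k (number of kept eigenvectors) is an assumed parameter."""
        X = np.asarray(c1_vectors, dtype=float)
        mean = X.mean(axis=0)
        # Rows of vt are the principal directions of the centred data
        _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
        return mean, vt[:k]

    def s2_response(c1_vector, mean, eigenvectors):
        """S2 feature vector: dot products of a C1 response with the eigenvectors."""
        return eigenvectors @ (np.asarray(c1_vector, dtype=float) - mean)
    ```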

    Finally, a classifier is learnt to recognise an object using the S2 response layer. For this paper different machine-learning algorithms are investigated.

    4 Experiments and Results

    In contrast to the previous work [4], the database of images has been extended and new experiments have been conducted, aimed at evaluating the effectiveness of different classifiers. In this work tree-based and rule-based classification approaches have been compared, namely N (5, 10, 15) Random Trees, J48 and PART (previously used by the authors in [5, 6, 4]). The comparison is presented in Fig. 2.

    For the experiments the dataset was randomly split into training and testing samples (in proportion 70% to 30%, respectively). For the training samples the C1 layer responses were calculated and eigenvectors were extracted. The eigenvectors were used to calculate the S2 layer responses, which were further used to learn the classifier. During the testing phase the eigenvectors were again used to extract the S2 layer responses for the testing samples. The whole procedure was repeated 10 times (the dataset was randomly reshuffled before each split) and the results were averaged.
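    This evaluation procedure can be sketched as follows, with train_fn and eval_fn standing in as hypothetical placeholders for the paper's PCA + classifier pipeline:

    ```python
    import random

    def repeated_split_eval(samples, labels, train_fn, eval_fn,
                            runs=10, train_frac=0.7, seed=0):
        """Repeat a random 70/30 split `runs` times and average the score.
        train_fn and eval_fn are placeholders for the actual learning and
        scoring steps."""
        rng = random.Random(seed)
        idx = list(range(len(samples)))
        scores = []
        for _ in range(runs):
            rng.shuffle(idx)                      # reshuffle before each split
            cut = int(train_frac * len(idx))
            train, test = idx[:cut], idx[cut:]
            model = train_fn([samples[i] for i in train], [labels[i] for i in train])
            scores.append(eval_fn(model, [samples[i] for i in test],
                                  [labels[i] for i in test]))
        return sum(scores) / len(scores)
    ```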


    The best results were obtained for 10 Random Trees: it was possible to achieve a 95.6% detection rate with 2.3% false positives. It can be noticed that adding additional trees does not improve the effectiveness.

    An early prototype of the proposed algorithm was deployed on an Android device. The code was written in pure Java and tested on a Samsung Galaxy Ace. This device is equipped with an 800 MHz CPU, 278 MB of RAM and the Android 2.3 operating system. The current version implements a brute-force Nearest Neighbour classifier, operates at only one scale (on a PC three scales are scanned: the original image, and two images scaled by factors of 0.7 and 1.5), and achieves about 10 FPS when fewer than 10 training samples are provided. Some examples of object detection are shown in Fig. 3. During the testing it was noticed that the algorithm is able to recognise an object on a cluttered background even if only a few learning samples are provided.

    Fig. 3: Examples of object detection (a mug and doors) with the proposed algorithm deployed on an Android device

    5 Conclusions

    In this paper an algorithm for object recognition dedicated to mobile devices was presented. The algorithm is based on a hierarchical approach to visual information coding proposed by Riesenhuber and Poggio [1] and later











    Fig. 2: Effectiveness of the different classifiers (5, 10 and 15 Random Trees, J48 and PART: detection rate vs. false positives, with exponential trend lines)

    extended by Serre et al. [2]. The experiments show that the introduced algorithm allows for efficient feature extraction and visual information coding. Moreover, it was shown that it is also possible to deploy the proposed approach on a low-cost mobile device. The results are promising and show that the algorithm described in this paper can be further investigated as one of the assistive solutions dedicated to visually impaired people.


    References

    [1] M. Riesenhuber, T. Poggio, Hierarchical models of object recognition in cortex, Nat Neurosci, Vol. 2, pp. 1019-1025, 1999

    [2] T. Serre, G. Kreiman, M. Kouh, C. Cadieu, U. Knoblich, T. Poggio, A quantitative theory of immediate visual recognition, In: Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, Vol. 165, pp. 33-56

    [3] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, Computer Vision and Pattern Recognition, CVPR 2005, IEEE Computer Society Conference on, Vol. 1, pp. 886-893, June 2005

    [4] R. Kozik, A Simplified Visual Cortex Model for Efficient Image Codding and Object Recognition, Image Processing and Communications Challenges 5, Advances in Intelligent Systems and Computing, pp. 271-278, 2013

    [5] R. Kozik, Rapid Threat Detection for Stereovision Mobility Aid System, In: T. Czachorski et al. (Eds.): Man-Machine Interactions 2, AISC 103, pp. 115-123, 2011

    [6] R. Kozik, Stereovision system for visually impaired, In: Burduk, Robert (ed.) et al., Computer Recognition Systems 4, Advances in Intelligent and Soft Computing 95, pp. 459-468, 2011

    [7] R. Kozik, A Proposal of Biologically Inspired Hierarchical Approach to Object Recognition, Journal of Medical Informatics & Technologies, Vol. 22/2013, ISSN 1642-6037, pp. 171-176, 2013

    [8] R. Kozik, A Simplified Visual Cortex Model for Efficient Image Codding and Object Recognition, Image Processing and Communications Challenges 5, Advances in Intelligent Systems and Computing, S. Choras, Ryszard (ed.), pp. 271-278, 2013

    [9] D.H. Hubel, T.N. Wiesel, Receptive fields, binocular interaction and functional architecture in the cat's visual cortex, J Physiol, Vol. 160, pp. 106-154, 1962

    [10] J. Mutch, D.G. Lowe, Object class recognition and localization using sparse features with limited receptive fields, International Journal of Computer Vision (IJCV), Vol. 80, No. 1, pp. 45-57, October

    [11] MAX pooling, http://ufldl.stanford.edu/wiki/index.