Hierarchical camera auto-calibration for traffic surveillance systems

S. Álvarez, D.F. Llorca, M.A. Sotelo
Computer Engineering Department, Polytechnic School, University of Alcalá, Madrid 28871, Spain

Keywords: Auto-calibration, Pan-tilt-zoom cameras, Vanishing points, Intelligent transportation systems, Urban traffic infrastructures

Abstract

In this paper, a hierarchical monocular camera auto-calibration method is presented for applications in the framework of intelligent transportation systems (ITS). It is based on vanishing point extraction from common static elements present in the scene and from moving objects such as pedestrians and vehicles. This process is very useful to recover metrics from images or to apply information of 3D models to estimate the 2D pose of targets, making subsequent object detection and tracking more robust to noise and occlusions. Moreover, the algorithm is independent of the position of the camera, and it is able to work with variable pan-tilt-zoom (PTZ) cameras in fully self-adaptive mode. The objective is to obtain the camera parameters without any restriction in terms of constraints or the need of prior knowledge, in order to deal with most traffic scenarios and possible configurations. The results achieved to date in real traffic conditions are presented and discussed.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Camera calibration is a fundamental stage in computer vision. The process is the determination of the relationship between a reference plane and the camera coordinate system (extrinsic parameters), and between the camera and the image coordinate system (intrinsic parameters). These parameters are very useful to recover metrics from images or to apply prior information of 3D models to estimate the 2D pose of targets, giving an idea of the size of the objects and making their detection and tracking more robust to noise and occlusions.

In a previous paper, Álvarez et al. (2012), the authors presented a target detection system for transport infrastructures based on manual camera calibration through vanishing points. After that, the approach was improved, as described in Álvarez et al. (2013), with a preliminary automatic calibration method based on camera zooming and zebra crossings. The current paper extends these works with a hierarchical camera auto-calibration system which deals with most traffic scenarios and configurations with no restrictions. The work builds on the paper presented in Álvarez et al. (2011).

The standard method to calibrate a camera is based on a set of correspondences between 3D points and their projections on the image plane, as presented by Hartley and Zisserman (2000) and Tsai (1986). However, this method requires either prior information of the scene or calibrated templates, limiting the feasibility of surveillance algorithms in most possible scenarios. In addition, calibrated templates are not always available, they are not applicable to already-recorded videos, and if the camera is placed very high their small projection can lead to inaccurate results. Finally, in the case of PTZ cameras, using a template each time the camera angles or zoom change is not feasible. One novel method which solves the problem of the template is the orthogonal calibration proposed by Kim (2009). The system extracts world coordinates from aerial pictures (on-line satellite images) or GPS devices to establish the correspondences with the captured image.
However, this system is dependent on prior information from an external source and it does not work indoors. Therefore, auto-calibration seems to be the most suitable way to recover camera parameters for surveillance applications.

One of the distinguishing features of the perspective projection is that the image of an object that stretches off to infinity can have finite extent. For example, parallel world lines are imaged as converging lines, whose intersection point in the image is called the vanishing point. Caprile and Torre (1990) developed a method for camera calibration using simple properties of vanishing points. In their work, the intrinsic parameters of the camera were recovered from a single image of a cube. In a second step, the extrinsic parameters of a pair of cameras were estimated from a stereo image pair of a suitable planar pattern. The technique was improved by Cipolla et al. (1999), who computed both intrinsic and extrinsic parameters from three vanishing points and two reference points from two views of an architectural scene. However, these formulations were incomplete because, as demonstrated by Hartley, Zisserman and Liebowitz in different publications and summarized in Hartley and Zisserman (2000), it is possible to obtain all the parameters needed to calibrate a camera from three orthogonal vanishing points.

Since the mentioned works, a lot of research has been done to calibrate cameras in architectural environments (Rother, 2002; Tardif, 2009; among others). All these methods are based on scenarios where the large number of orthogonal lines provides an easy way to obtain the three orthogonal vanishing points. Nevertheless, in the absence of such strong structures, as is usual in traffic scenes, vanishing point-based calibration is not applicable.

In this context, a different possibility is to make use of object motion. A complete camera calibration work using this idea was introduced by Lv et al. (2006). The method uses a tracking algorithm to obtain multiple observations of a person moving around the scene. Three orthogonal vanishing points are then computed by extracting head and feet positions in the leg-crossing phases. The approach requires accurate localization of these positions, which is a challenge in traffic surveillance videos. Furthermore, the localization step uses FFT-based synchronization of a person's walk cycle, which requires constant-velocity motion along a straight line. Finally, it does not handle noise models in the data and assumes constant human height and planar human motion, so the approach is rather limited. Based on this knowledge, Junejo (2009) proposed a quite similar calibration approach for pedestrians walking on uneven terrains.
There are no restrictions as in Lv's work, but the intrinsic parameters are estimated by obtaining the infinite homography from all the extracted points in multiple cameras.

To manage these inconveniences, the solution lies in computing the three vanishing points by studying three orthogonal components with parallel lines in the moving objects or their motion patterns. Zhang et al. (2013) presented a self-calibration method using the orientation of pedestrians and vehicles. The method extracts a vertical vanishing point from the main axis direction of the pedestrian trunk, perpendicular to the ground plane. Additionally, two horizontal vanishing points are extracted by analysing the histogram of oriented gradients of moving cars. The idea is interesting and it was initially implemented for this work. However, the straight-edged vehicles used by Zhang differ from modern ones, which usually have more irregular and rounded shapes. Finally, the pedestrian detection step is not described and results are not depicted in the paper. Hodlmoser et al. (2010) present a different approach. They use zebra crossings with known metrics to obtain the ground plane information, and pedestrians to obtain the vertical lines. The drawbacks are the maximum distance at which the camera can be placed from the ground and the need to know real distances in the scene.

In this paper, a self-calibration procedure based on vanishing points is presented. It is done through a hierarchical process which covers most traffic scenarios and possible configurations. The objective is to obtain both intrinsic and extrinsic camera parameters without restrictions in terms of constraints (the restrictions mentioned in previous paragraphs, vehicles driving in only one road direction (Hue et al., 2008), neglected camera roll (Schoepflin and Dailey, 2003), etc.) or the need of prior information, except for the camera height.

After the present introduction, the remainder of the document is organized as follows. Section 2 describes the developed camera auto-calibration method, based on vanishing points, and the hierarchical system proposed. In Section 3, an application of this technique in the context of traffic surveillance is depicted with the developed segmentation and tracking algorithms. The results obtained are presented and discussed in Section 4, and finally Section 5 contains the conclusions and future work.

2. Camera auto-calibration

The camera model used and the equations to obtain the calibration parameters from orthogonal vanishing points are described in the previous paper, Álvarez et al. (2013). In summary, the conclusion is that it is possible to calibrate a camera if the principal point and two orthogonal vanishing points are known, or by computing the principal point as the orthocentre of the triangle formed by three orthogonal vanishing points as vertices. The current work is focused on the way to extract these points from common elements of traffic scenarios. Fig. 1 summarizes the proposed camera calibration process.
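For reference, these two relations can be written down in a few lines. The fragment below is a minimal numerical sketch, not the authors' implementation: it assumes the simplified pinhole model with square pixels and zero skew used in the referenced camera model, and the function and variable names are illustrative.

```python
import numpy as np

def principal_point_from_vps(v1, v2, v3):
    """Principal point as the orthocentre of the triangle whose vertices are
    three orthogonal vanishing points (all in pixel coordinates)."""
    v1, v2, v3 = (np.asarray(v, dtype=float) for v in (v1, v2, v3))
    # The altitude through v1 is perpendicular to the side (v3 - v2), and the
    # altitude through v2 is perpendicular to (v1 - v3): two linear constraints.
    A = np.vstack([v3 - v2, v1 - v3])
    b = np.array([np.dot(v3 - v2, v1), np.dot(v1 - v3, v2)])
    return np.linalg.solve(A, b)

def focal_from_orthogonal_vps(v1, v2, pp):
    """Focal length in pixels from two orthogonal vanishing points and a known
    principal point: f^2 = -(v1 - pp) . (v2 - pp)."""
    v1, v2, pp = (np.asarray(v, dtype=float) for v in (v1, v2, pp))
    f2 = -np.dot(v1 - pp, v2 - pp)
    if f2 <= 0:
        raise ValueError("vanishing points are not consistent with orthogonality")
    return float(np.sqrt(f2))
```

Either route fits the hierarchy described next: the first function applies when three orthogonal vanishing points are available, the second when the principal point is known (from zooming or assumed at the image centre) together with two orthogonal vanishing points.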
2.1. Hierarchical auto-calibration

This section presents the proposed method to extract the vanishing points from the image through a hierarchical process. Depending on which elements appear in the scene and the chance of using camera zoom, 5 levels have been established to determine the hierarchy of each developed method and the priority of the solution adopted. Before presenting the hierarchical tree of Fig. 2, and to make its comprehension easier, the different options developed to obtain the vanishing points and the optical center are described in the following paragraphs:

- Zoom: when zooming, if several features of the image are matched between frames, the lines joining their positions converge in a common point which corresponds to the optical center.
- Crosswalk (cross): the alternate white and gray stripes painted on the road surface provide a perfect environment to obtain two perpendicular sets of parallel lines. It means that two vanishing points of the ground plane can be extracted.
- Pedestrians (ped): humans are roughly vertical while they stand or walk. This characteristic makes them very useful to extract lines perpendicular to the ground.
- Vehicle motion (vmot/vperp): if one vanishing point of the ground plane is needed, it can be obtained from vehicles moving along the main motion direction (vmot). In case of a perpendicular intersection (in 3D coordinates), vehicles along the two main directions will provide perpendicular sets of parallel lines corresponding to the two ground plane vanishing points (vperp).
- Structured scene (struct): in scenes with a considerable number of architectural elements, the orthogonal vanishing point extraction can be done by brute-force gradient analysis.
- Optical center assumption (OC): when it is not possible to obtain one of the three vanishing points, the optical center can be assumed to be the center of the image, although a small error is committed.

The different possible cases are also summarized in Fig. 2 and Table 1.

2.2. Principal point through camera zoom

When zooming, if several features of the image are matched between frames, the lines which join the previous and new feature positions converge in a common point which corresponds to the optical center. This effect is demonstrated in Álvarez et al. (2013) and represented in Fig. 3: an image was taken before and after zooming, and the matched features converge to the same point, the optical center.

Fig. 1. Camera auto-calibration process.

2.3. Zebra crossing vanishing point extraction

The alternate stripes painted on the road surface provide a perfect environment to obtain two perpendicular sets of parallel lines. It means that two vanishing points of the ground plane can be computed. The crosswalk detection method is also explained in Álvarez et al. (2013) and illustrated in Fig. 4. Firstly, the background model image is binarized, and the lines are extracted by gradient analysis and grouped by angle. After that, a RANSAC-based filter is applied to get the final candidates. The red line is the one which best fits the candidate. Bipolarity and transition analyses are then done in order to obtain a confidence factor. Finally, the vanishing points are computed.
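To make the zoom-based estimate of Section 2.2 concrete, the sketch below tracks features across the zoom and intersects the resulting displacement lines in a least-squares sense. It is a minimal illustration under our own assumptions (OpenCV Shi-Tomasi features and pyramidal Lucas-Kanade matching, a 1-pixel displacement gate); it is not the authors' implementation.

```python
import cv2
import numpy as np

def principal_point_from_zoom(img_before, img_after, max_feats=400):
    """Estimate the principal point as the least-squares intersection of the
    lines joining each feature position before and after zooming."""
    gray0 = cv2.cvtColor(img_before, cv2.COLOR_BGR2GRAY)
    gray1 = cv2.cvtColor(img_after, cv2.COLOR_BGR2GRAY)
    p0 = cv2.goodFeaturesToTrack(gray0, max_feats, qualityLevel=0.01, minDistance=7)
    p1, status, _ = cv2.calcOpticalFlowPyrLK(gray0, gray1, p0, None)
    p0 = p0[status.ravel() == 1].reshape(-1, 2)
    p1 = p1[status.ravel() == 1].reshape(-1, 2)

    # Each match defines a line through its old and new position; accumulate
    # the normal equations n_i . x = n_i . p0_i, with n_i normal to the motion.
    d = p1 - p0
    n = np.stack([-d[:, 1], d[:, 0]], axis=1)
    norms = np.linalg.norm(n, axis=1)
    keep = norms > 1.0                      # discard features that barely moved
    n = n[keep] / norms[keep, None]
    b = np.sum(n * p0[keep], axis=1)
    centre, *_ = np.linalg.lstsq(n, b, rcond=None)
    return centre                           # (x, y) estimate of the optical centre
```

With a static scene and a pure zoom, the returned point approximates the principal point; in practice it would be averaged over several zoom steps.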
2.4. Pedestrian vanishing point extraction

Humans are roughly vertical while they stand or walk. This property makes them very useful to get lines perpendicular to the ground, in order to compute the vertical vanishing point. One option is to extract the vertical component of each pedestrian to form the necessary set of parallel lines, as done by Hodlmoser et al. (2010). However, the cameras in common traffic scenarios are usually located considerably higher than in the situations proposed by those authors, and small pedestrian projections can lead to erroneous line extractions. Traffic scenes provide a lot of structured elements with vertical components (walls, lampposts, traffic lights, etc.) that can be used to increase the performance and quality of the system. The developed algorithm is based on this idea, and it is divided into the following steps: pedestrian detection with vertical component extraction, scene analysis and vanishing point computation.

The aim of this method is to detect pedestrians with no false positives, to avoid lines that are not perpendicular to the ground. Therefore, it is not crucial to detect all the pedestrians in the image, but it is important to be sure that the detected objects are humans. In order to obtain useful candidates for vertical line extraction, two kinds of parameters are obtained for every moving object: the motion direction and the main axis direction. The difference between these directions is quite significant for moving pedestrians, while it is very small for vehicles. This classification is evidently not very accurate, but in practice it is good enough to get valid pedestrians useful to extract vertical lines.

To compute the main axis direction θ of a blob, three different approaches have been used: moment analysis, principal component analysis and RANSAC estimation. The direction estimated by moment analysis is defined as

\theta_{moment} = \tan^{-1}\left(\frac{2\mu_{11}}{\mu_{20}-\mu_{02}}\right)   (1)

where \mu_{pq} is the central moment of order (p, q). Principal Component Analysis (PCA) is equivalent to major axis regression, so the largest axis can be considered as the vertical component. Finally, the RANSAC algorithm takes the centroid of each candidate row to estimate the line that corresponds to the main axis of the pedestrian. When these three methods obtain similar results and the blob aspect ratio is valid, the candidate is considered a pedestrian. At the same time, a gradient line extraction of the image is done in order to extract all the possible structured elements. The angle of the vertical component of each pedestrian is compared to the extracted lines and, in case of matching, the lines are saved to compute the vanishing point afterwards. Due to the perspective of the camera, a line perpendicular to the ground has different angles in the image depending on its position. Moreover, because of the negative pitch, the vertical vanishing point has to be positive. Therefore, the image is divided into five quadrants.

Fig. 2. Hierarchical calibration tree used. Note: Perp. Intersec. means perpendicular intersection.

Table 1. Cases of the hierarchical tree. Columns: Case, Zoom, Cross, Ped, Vmot, Perp, Struct, OC, Manual; marks per case: 1: x x; 2: x x x; 3: x x; 4: x x; 5: x x; 6: x x; 7: x x; 8: x x; 9: x x x; 10: x x; 11: x; 12: x.

Fig. 3. Principal point computation through camera zoom. (a) Image before zooming and extracted features. (b) Image after zooming and extracted features. (c) Feature matching. The common point corresponds to the optical center.

Fig. 5 depicts an example of the developed method. Fig. 5(a) represents the lines extracted from the scene, with different colors depending on the quadrant they belong to. Fig. 5(b) shows the detected pedestrian inside a green box with the estimated vertical component in red, and the matched vertical lines in magenta. Finally, Fig. 5(c) depicts the estimation of the vertical vanishing point with all the accumulated vertical lines. Red lines are the outliers and green lines the inliers of a RANSAC-based method to obtain the intersection point.
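As an aside, the blob main-axis estimate of Eq. (1) and the pedestrian/vehicle discrimination by axis-versus-motion angle can be sketched in a few lines. The fragment below is illustrative only: it uses OpenCV central moments with the conventional 1/2 factor of the second-moment orientation formula and a PCA cross-check on a binary foreground mask; the 35-degree threshold is our assumption, not the authors' value.

```python
import cv2
import numpy as np

def blob_axis_angles(mask):
    """Main-axis direction of a binary blob via central moments (cf. Eq. (1))
    and via PCA of the foreground pixel coordinates, both in degrees."""
    m = cv2.moments(mask, binaryImage=True)
    theta_mom = 0.5 * np.arctan2(2.0 * m["mu11"], m["mu20"] - m["mu02"])

    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    pts -= pts.mean(axis=0)
    # Largest eigenvector of the covariance matrix gives the major axis.
    _, vecs = np.linalg.eigh(np.cov(pts.T))
    major = vecs[:, -1]
    theta_pca = np.arctan2(major[1], major[0])
    return np.degrees(theta_mom), np.degrees(theta_pca)

def looks_like_pedestrian(theta_axis_deg, theta_motion_deg, min_diff_deg=35.0):
    """Accept a moving blob as pedestrian when its main axis clearly differs
    from its motion direction (people are roughly vertical while moving
    roughly horizontally in the image)."""
    diff = abs((theta_axis_deg - theta_motion_deg + 90.0) % 180.0 - 90.0)
    return diff > min_diff_deg
```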
2.5. Vehicle motion vanishing point extraction

One of the properties of traffic scenarios is that many vehicles drive in the same or opposite direction of the 3D world. Therefore, the main axes of these vehicles are parallel to each other, and also parallel to the ground plane. This supplies important information to extract horizontal vanishing points.

As explained in the hierarchical calibration tree (Fig. 2), there are cases that need only one ground plane vanishing point while others need two. In case of computing the optical center (either by zooming analysis or assuming it as the image center) and detecting pedestrians, only one vanishing point of the ground plane is needed, in any direction. On the other hand, in case of needing two ground plane vanishing points and if a perpendicular intersection (in 3D coordinates) is present in the scene, vehicles moving along the two main directions will provide perpendicular sets of parallel lines corresponding to the two ground plane vanishing points. In both cases the process followed is similar, done either for one direction or for two, respectively.

Firstly, the main motion directions are extracted. For this purpose, a feature optical flow analysis of the foreground blobs is done and their motion directions are accumulated in a histogram. Once it has been constructed over a given number of frames, an EM algorithm is used to fit the histogram with Gaussians in order to obtain the principal components of the movement. Fig. 6 shows an example of a perpendicular intersection, where the features of the foreground objects are tracked by optical flow and the motion direction histogram with the Gaussian components in red is computed. The vertical axis corresponds to the frequency of the angle, and the horizontal axis corresponds to the angle value between 0° and 180°. These values are not perpendicular in image coordinates due to the perspective projection.

Fig. 4. Crosswalk detection example. (a) Binarized background model. (b) Line extraction. (c) Grouped candidates with testing lines in red. (d) Parallel lines to compute the vanishing points. (For interpretation of the references to colour in this figure caption, the reader is referred to the web version of this article.)

Fig. 5. Vertical vanishing point extraction example. (a) Extracted scene lines divided by 5 quadrants. (b) Detected pedestrians with red vertical component and vertical matches in magenta. (c) RANSAC vanishing point estimation with red outliers and green inliers. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

After getting the main directions of the scene, the motion of each foreground blob is analysed. In case of detecting motion in the computed directions, the gradients of the blob are extracted in order to look for parallel lines with the mentioned angles. Once a representative number of parallel segments has been obtained, a RANSAC-based method is used to obtain the intersection point. Fig. 7 shows an example of the extraction of two ground plane vanishing points using the method explained in this section.
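The last step shared by Sections 2.4 and 2.5, turning an accumulated set of roughly parallel segments into a vanishing point by RANSAC, can be sketched as follows. This is a generic illustration under our own assumptions (segment endpoints in pixels, a 3-pixel consensus tolerance), not the authors' exact estimator.

```python
import numpy as np

def _line_through(p, q):
    """Homogeneous line through two 2D points."""
    return np.cross(np.append(p, 1.0), np.append(q, 1.0))

def _point_line_dist(pt_h, line):
    """Distance from a homogeneous point to a homogeneous line."""
    l = line / np.linalg.norm(line[:2])
    p = pt_h / pt_h[2]
    return abs(np.dot(l, p))

def ransac_vanishing_point(segments, iters=500, tol=3.0, seed=None):
    """Vanishing point from segments [(x1, y1, x2, y2), ...]: intersect two
    random segment lines, count segments whose supporting line passes within
    `tol` pixels of that point, and keep the hypothesis with most inliers."""
    rng = np.random.default_rng(seed)
    lines = [_line_through(np.array(s[:2], float), np.array(s[2:], float))
             for s in segments]
    best_vp, best_inliers = None, []
    for _ in range(iters):
        i, j = rng.choice(len(lines), size=2, replace=False)
        vp = np.cross(lines[i], lines[j])
        if abs(vp[2]) < 1e-9:          # near-parallel pair, intersection at infinity
            continue
        inliers = [k for k, l in enumerate(lines) if _point_line_dist(vp, l) < tol]
        if len(inliers) > len(best_inliers):
            best_vp, best_inliers = vp / vp[2], inliers
    return (best_vp[:2] if best_vp is not None else None), best_inliers
```

The returned inlier set plays the role of the green lines in Fig. 5(c); the remaining segments are discarded as outliers.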
2.6. Structured scenarios vanishing point extraction

In the case of scenes with a considerable number of architectural elements, a last option for autonomous calibration is available (although less common and less effective). It consists of extracting the vanishing points by brute-force gradient analysis, assuming that the three sets of parallel lines with the largest number of lines are orthogonal. To group the lines, the J-Linkage algorithm (Toldo and Fusiello, 2008) is used. This method is based on the work of Tardif (2009), although he does not look for orthogonal vanishing points. Fig. 8 shows the orthogonal lines extracted in a structured scenario to compute the three orthogonal vanishing points.

3. Traffic target detection and tracking

After calibrating the camera, an approximate size of pedestrians and vehicles in the image can be obtained using a standard size for them in world coordinates. This step gives the system a notion of how big the searched elements are. In this section, a multilevel framework to detect and track pedestrians and vehicles is presented. Fig. 9 illustrates the flowchart of the proposed framework, which consists of three levels: (1) the image segmentation level, to create and handle a background model and to obtain the foreground objects without image noise, camera vibrations or illumination effects; (2) the features level, which extracts and follows features of the foreground objects; and (3) the objects level, which is in charge of managing occlusions and creating and tracking object clusters. The first and second levels are similar to the ones described by the authors in Álvarez et al. (2012), and the objects level is improved in the current work as explained next.

3.1. Objects level

Usually, feature grouping works by associating features directly into objects using proximity and motion history. However, the distance between two features that belong to the same object can be much larger than that between two features that belong to two nearby objects, which can confuse the system. To deal with this problem efficiently, a multilevel grouping algorithm is presented. First, an occlusion reasoning step is done in order to split foreground blobs belonging to different objects. After that, the individual features are associated to a blob and grouped into clusters depending on their motion and 3D sizes. Finally, the objects are tracked.

Fig. 6. Example of main motion directions extracted in a perpendicular intersection. (a) Perpendicular intersection. (b) Foreground optical flow analysis. (c) Histogram of directions and fitted Gaussians in red. (For interpretation of the references to colour in this figure caption, the reader is referred to the web version of this article.)

Fig. 7. Example of ground plane vanishing point extraction in a perpendicular intersection.

Fig. 8. Extracted lines in a structured scenario to obtain three orthogonal vanishing points automatically.

3.1.1. Partial occlusion reasoning

The first step when considering this problem is to observe the shapes of the objects involved in an occlusion. A common characteristic is that the shapes generated by an occlusion are not uniform: non-occluded objects are generally convex, whereas the shape of partially occluded objects becomes concave. An example of non-occluded and occluded objects is given by comparing their convex hulls in Fig. 10. It can be seen that non-occluded objects can reach a good fit with their convex hull, which does not hold for occluded objects. Accordingly, if there is an approximate idea of the sizes of the searched objects (through the camera calibration), an occlusion can be figured out by studying the blob shapes and their convex hulls.
In particular, one simple shape descriptor has been widely used in this task: the shape compactness. It is an intrinsic characteristic of the object shape defined by

C = \frac{P^2}{A}   (2)

where A is the shape area and P is the shape perimeter or boundary length. This way of measuring shape compactness is taken from the isoperimetric inequality (Montero and Bribiesca, 2009). The next step to evaluate if a blob is the result of an occlusion is to compare the shape compactness of the object (C_o) with that of its convex hull (C_h). Obviously, C_o is always greater than C_h, because the area of an object is smaller than that of its convex hull, whereas its boundary length is greater. Therefore, for non-occluded objects C_h is close to C_o, and for occluded ones C_h is smaller than C_o. The ratio between both descriptors is used to discriminate the two situations. It is called the compactness ratio and it is defined by

CR = \frac{C_h}{C_o}   (3)

Another parameter used to detect occlusions is the convexity, determined by the ratio between areas:

R_A = \frac{A_o}{A_h}   (4)

where A_o and A_h represent the area of the object and the area of the object's convex hull, respectively. Since the denominator is always greater than the numerator, R_A is always less than one. For a non-occluded object the shape is convex and R_A is close to 1, whereas for occluded objects R_A is far less than 1.

The third estimator to consider a blob as an occlusion is its size. After calibrating the camera, the approximate sizes of the pedestrians and vehicles located on the ground plane are known. In case of occlusions these sizes will be considerably increased. If the three parameters described above indicate an occlusion, the occlusion reasoning method is run as described in the flowchart of Fig. 11.

Fig. 9. Flowchart of the proposed framework to detect and track pedestrians and vehicles.

Fig. 10. Object occlusions and convex hull. The convex hull is represented in white and the foreground blob in gray color.

Fig. 11. Flowchart of the occlusion reasoning method.

A useful way to understand the shape of an object contour is to compute its convex hull and convexity defects. Fig. 12 illustrates these concepts using the image of a vehicle occlusion. The gray area corresponds to the foreground blob and the coloured areas represent the different defects of the convex hull. Finally, the red marks correspond to the farthest points from the convex hull within each defect, also called defect points. The defect point farthest from the convex hull is selected as the first cutting point.

The next objective is to find an optimum second cutting point to create a cutting line which separates the blob into two different objects. To extract the second point, the occluded object is sequentially cut by segments that join the first cutting point with the rest of the defect points. For every line, the area and compactness ratios of each new blob are computed. The chosen cutting line is the one that yields the maximum ratio given by Eq. (5):

\mathrm{Ratio} = \sum_{i=1}^{2} \frac{R_{A_i} + CR_i}{2}   (5)

Fig. 13 depicts some examples of occlusion reasoning using the method explained before. This procedure does not require prior knowledge other than the approximate measures provided by the camera calibration. By using this method, most partial occlusions can be effectively handled.
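For illustration, the three occlusion indicators above (compactness ratio, convexity and calibrated size) can be computed directly from an OpenCV contour. The following is a minimal sketch under our own assumptions; the thresholds are placeholders, not the values used by the authors.

```python
import cv2
import numpy as np

def occlusion_descriptors(contour):
    """Shape descriptors used for occlusion detection: compactness C = P^2 / A
    of the blob and of its convex hull, their ratio CR = Ch/Co (Eq. (3)) and
    the area ratio RA = Ao/Ah (Eq. (4))."""
    hull = cv2.convexHull(contour)
    a_o = cv2.contourArea(contour)
    a_h = cv2.contourArea(hull)
    p_o = cv2.arcLength(contour, closed=True)
    p_h = cv2.arcLength(hull, closed=True)
    c_o = p_o ** 2 / max(a_o, 1e-6)
    c_h = p_h ** 2 / max(a_h, 1e-6)
    return c_h / c_o, a_o / max(a_h, 1e-6)

def looks_occluded(contour, expected_area, cr_thr=0.75, ra_thr=0.85, size_factor=1.5):
    """Flag a foreground blob as a likely occlusion when it is clearly
    non-convex (low CR and RA) and larger than the size expected from the
    camera calibration; the three thresholds are illustrative."""
    cr, ra = occlusion_descriptors(contour)
    too_big = cv2.contourArea(contour) > size_factor * expected_area
    return cr < cr_thr and ra < ra_thr and too_big
```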
3.1.2. Feature clustering and model fitting

To group all the features of the same object, a two-stage 3D clustering algorithm is used. First, the individual features are assigned to a blob (after the occlusion reasoning) and grouped into clusters depending on their motion. Then these clusters are grouped into objects depending on their 3D sizes and motion. Therefore, if a blob corresponds to a single object, all its features will have a similar motion and will be grouped together. Otherwise, they will be clustered into multiple objects associated to different motion characteristics. As this is an unsupervised stage, it is necessary to identify the number of clusters and the correspondence of the samples automatically. Mean Shift (Comaniciu and Meer, 2002) is used as a non-parametric method which does not require prior knowledge of the number of clusters and does not constrain their shape.

The main idea behind mean shift is to treat the points in the d-dimensional feature space as an empirical probability density function, where dense regions in the feature space correspond to the local maxima or modes of the underlying distribution. For each data point in the feature space, a gradient ascent procedure is performed on the local estimated density until convergence. The stationary points of this procedure represent the modes of the distribution, and the data points associated with the same stationary point are considered members of the same cluster. The quality of the output is controlled by a kernel bandwidth, which is not critical here because objects moving with different angles or velocities generate features with clearly different components. Fig. 14 depicts an example of the feature clustering step.

As mentioned before, an approximate size of vehicles is known thanks to the information provided by the camera calibration. Therefore, a vehicle which has been split into several blobs due to errors in the foreground or a misclassified occlusion can be merged. If the clusters fit into the 3D size of a standard target in the corresponding 2D coordinates and have similar motion, they are merged into a final object. Fig. 15 shows an example of blob merging after a blob was split due to an occlusion with a tree.

3.1.3. Cluster tracking

After a cluster has been detected consecutively several times, a tracking stage combined with a multi-frame validation process takes place. This final step is used to reinforce the coherence of the detected objects over time, obtaining a more stable position, avoiding occlusions in case the previous methods fail, and minimizing the effect of both false-positive and false-negative detections. The multi-frame validation and tracking algorithm relies on Kalman filter theory in 2D space, with a state vector based on the ellipse parameters: centroid, axes and angle. For the data association, Hungarian assignment is used. Fig. 16 depicts a performance example of the detection system after the steps described.

Fig. 12. Blob, convex hull and convexity defects in an occlusion example.

Fig. 13. Examples of occlusion management by the proposed algorithm.
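A compact sketch of the two association mechanisms just described, mean-shift grouping of features by position and motion and Hungarian matching between tracks and detections, is given below. It is a simplified illustration based on scikit-learn and SciPy (our choice, not necessarily the authors' implementation); the (x, y, vx, vy) feature space and the Euclidean cost are assumptions.

```python
import numpy as np
from sklearn.cluster import MeanShift
from scipy.optimize import linear_sum_assignment

def cluster_features(points, velocities, bandwidth=None):
    """Group tracked features by position and motion with mean shift, as in
    the first clustering stage; `points` and `velocities` are (N, 2) arrays.
    If `bandwidth` is None, scikit-learn estimates it from the data."""
    X = np.hstack([points, velocities])          # (x, y, vx, vy) per feature
    return MeanShift(bandwidth=bandwidth).fit_predict(X)

def associate_tracks(track_centroids, detection_centroids):
    """Hungarian assignment between existing tracks and new detections using
    Euclidean distance, as in the data-association step of Section 3.1.3."""
    cost = np.linalg.norm(track_centroids[:, None, :] -
                          detection_centroids[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))                 # (track_idx, detection_idx) pairs
```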
4. Experimental results

4.1. Camera auto-calibration

For the auto-calibration method, 30 sequences from different scenarios and conditions have been used, testing the 12 cases of the hierarchical tree. As a result, a comparative table (Table 2) has been constructed with the average errors of the main intrinsic and extrinsic parameters extracted (focal distance, pitch and roll), compared to cases 5 and 12, which are considered the ground truth. Yaw is not used because its variation does not modify the ground plane and has no impact on the 3D projection.

As can be seen, case 1 is the best solution due to the strong parallel component of its orthogonal elements and the chance of zooming. Close to it, cases 6 and 7 have similar results. This was expected because they correspond to the first case, but without zoom. On the other hand, the worst options are cases 3, 4, 10 and 11, based on perpendicular intersections (not always available or strictly perpendicular) and structured scenes (not always with strictly orthogonal components).

Table 2. Auto-calibration errors comparative table.
Case    Focal (%)   Pitch (°)   Roll (°)
12/5    0.00        0.00        0.00
1       2.29        1.68        0.30
2       4.69        2.83        0.34
3       5.14        2.55        0.65
4       6.68        3.05        0.67
6       3.52        1.46        0.51
7       3.88        2.05        0.51
8       4.05        2.25        0.69
9       4.40        2.57        0.26
10      7.47        3.11        0.64
11      7.18        3.16        0.65

The obtained results are really satisfactory: the low error obtained proves the strength of the system, and the multiple options of the hierarchical tree provide high versatility to cover most of the possible traffic scenarios. Furthermore, the system is able to adapt the calibration parameters in case of PTZ camera displacements without manual supervision. Even if there is no chance to auto-calibrate the camera (due to the absence of orthogonal components), the manual input of lines remains a valid option which allows the user to control the system in a short time.

4.2. Target detection system

Firstly, the performance of the proposed object occlusion reasoning framework has been quantitatively evaluated. The results are summarized in Table 3, separated by occlusion class and by the level of the algorithm where the occlusions were detected and managed. The Detected columns stand for the number of occlusions detected by each level, and Handled is the number of occlusions correctly managed. The Total column contains all the detected and non-detected occlusions and is used to evaluate the rate values as Handled/Total. The occlusion level always takes part in the process, and only if it cannot detect or handle the occlusion does the algorithm pass to the next level.

Fig. 14. Examples of feature clustering represented by coloured features.

Fig. 15. Examples of cluster merging.

Fig. 16. Example of the target detection system.

The global occlusion management ratio (91%) is very reasonable. It is important to emphasize that this analysis is single-frame; therefore, an error due to an occlusion in a particular frame is not important in the whole path of an object. Moreover, after the tracking stage this value increases to 95%, because several occlusions are managed by the multi-frame validation. The advantage of the approach is the use of a multi-level framework that allows an occlusion to be solved from different and complementary points of view.

To analyse the performance of the global target detection system, the algorithm has been tested on over 2 h of traffic videos with more than 2000 objects comprising vehicles and pedestrians. The sequences include different camera views, illumination effects, shadows, etc., in order to evaluate the method in a wide range of situations. Some examples of the testing scenarios used are shown in Fig. 17 and described in Table 4. The global results of the application are depicted in Table 5 in terms of object detection rate, recall and precision.
The Detection Rate (DR) is the percentage of correctly detected objects, the Recall (R) measures the system's ability to identify positive samples, and the Precision (P) is the fraction of retrieved instances that are relevant:

DR = \frac{TP}{N}, \quad R = \frac{TP}{TP + FN}, \quad P = \frac{TP}{TP + FP}

where TP stands for the number of true positives (objects correctly detected in at least 80% of their path), FP stands for the number of false positives (unexpected detections or object splits) and FN is the number of false negatives (missing detections).

Table 3. Quantitative evaluation of the occlusion reasoning framework (Detected/Handled per level).
Class       Occlusion level   Clustering level   Together   Total   Rate
Ped & Ped   226/213           11/11              237/224    251     0.89
Car & Car   124/115           19/18              143/133    142     0.93
Car & Ped   53/51             29/26              82/77      85      0.90
Result      403/379           59/55              462/434    478     0.91

Fig. 17. Samples of testing scenarios.

Table 4. Description of testing videos.
Video    Number of frames   Resolution   Conditions      Source
video1   16402              640 × 480    Cloudy          Own sequence
video2   5244               640 × 480    Dusk (dark)     Own sequence
video3   3332               640 × 480    Dusk (bright)   Own sequence
video4   18296              640 × 480    Sunny           Lunds Univ.
video5   15921              640 × 480    Cloudy          Own sequence
video6   3585               640 × 480    Sunny           Own sequence
video7   630                768 × 576    Fog/rain        Karlsruhe Univ.
video8   4290               352 × 288    Cloudy          Candela

Table 5. Results of the target detection system. N: number of samples. TP: true positives. FP: false positives. FN: false negatives. DR: detection rate. R: recall. P: precision.
Scenario          N      TP     FP    FN    DR     R      P
Sunny (shadows)   901    832    39    32    0.923  0.963  0.955
Cloudy            885    841    23    43    0.950  0.951  0.973
Dusk              312    291    17    15    0.933  0.951  0.945
Rain/snow         171    152    17    13    0.889  0.921  0.899
Total             2269   2116   96    103   0.933  0.954  0.957

From a total amount of 2269 objects, the system has obtained a detection rate of 93.3%, which is considered a good result and valid for the proposed application. To analyse the importance of the calibration stage, this value is compared with the one obtained without calibrating the camera, as represented in Fig. 18. Firstly, the system works correctly with the parameters obtained by an initial camera auto-calibration. Between frames 44 and 54, the camera changes its angles and zooms out. Then, a comparison between the system with and without auto-recalibration is done (through OC computation and vanishing points extracted from a crosswalk). The blue line represents the detection rate of the auto-recalibrated system (near 93%), and the red line represents the DR when the calibration parameters remain constant (near 50–60%), with graphical examples. This demonstrates the need to work in fully self-adaptive mode.

Fig. 18. Comparative example of the system with and without recalibration after a camera change.

5. Conclusions and future work

In this paper, a novel hierarchical camera self-calibration procedure based on vanishing points has been presented. Depending on which elements appear in the scene and the chance of using camera zoom, 5 levels have been established to determine the hierarchy of each developed method and the priority of the solution adopted. It is an important step for many possible applications, because it provides very useful information to compute an approximate size of the searched objects. In this context, a monocular system has been developed to detect and track vehicles and pedestrians for applications in the framework of Intelligent Transport Systems.
Through the auto-calibration step, the algorithm requires no object model or prior knowledge (only an approximate size of the searched objects in world coordinates), and it can work indoors and outdoors, in different conditions and scenarios. Moreover, it is completely autonomous ("plug & play"), independent of the position of the camera and able to manage PTZ changes in fully self-adaptive mode.

From the results and conclusions of the present work, several future lines for each treated topic are devised. With respect to the camera auto-calibration, an interesting improvement is related to the recalibration process in case of PTZ displacements. The idea is to develop a segment tracking approach, in order to use the same set of orthogonal lines to find the new position of the previously used vanishing points. Besides that, due to the high diversity of camera views, operating conditions and observation objectives in traffic surveillance, there is an important lack of a common framework and most authors use their own proprietary sequences. This condition has generated a large and diverse body of work, where it is difficult to perform direct comparisons between the proposed algorithms. It would be very important to generate a public traffic database, with a wide range of scenarios and conditions, to be able to make these comparisons.

Acknowledgment

This work has been supported by the Spanish Ministry of Science and Innovation by means of Research Grant ONDA-FP TRA2011-27712-C02-02.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.eswa.2013.08.050.

References

Álvarez, S., Sotelo, M. A., Llorca, D. F., & Quintero, R. (2011). Monocular vision-based target detection on dynamic transport infraestructures. In Lecture notes in computer science (pp. 576–583).

Álvarez, S., Llorca, D. F., Sotelo, M. A., & Lorente, A. G. (2012). Monocular target detection on transport infrastructures with dynamic and variable environments. In IEEE intelligent transportation systems conference.

Álvarez, S., Llorca, D. F., & Sotelo, M. A. (2013). Camera auto-calibration using zooming and zebra-crossing for traffic monitoring applications. In IEEE intelligent transportation systems conference.

Caprile, B., & Torre, V. (1990). Using vanishing points for camera calibration. International Journal of Computer Vision, 4, 127–140.

Cipolla, R., Drummond, T., & Robertson, D. (1999). Camera calibration from vanishing points in images of architectural scenes.

Comaniciu, D., & Meer, P. (2002). Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 603–619.

Hartley, R., & Zisserman, A. (2000). Multiple view geometry in computer vision. Cambridge University Press.

Hodlmoser, M., Micusik, B., & Kampel, M. (2010). Practical camera auto-calibration based on object appearance and motion for traffic scene visual surveillance. In IEEE conference on computer vision and pattern recognition (pp. 1–8).

Hue, T., Lu, S., & Zhang, J. (2008). Self-calibration of traffic surveillance camera using motion tracking. In IEEE conference on intelligent transportation systems.

Junejo, I. N. (2009). Using pedestrians walking on uneven terrains for camera calibration. Machine Vision and Applications, 22, 137–144.

Kim, Z. (2009). Camera calibration from orthogonally projected coordinates with noisy-RANSAC. In IEEE workshop on applications of computer vision.

Lv, F., Zhao, T., & Nevatia, R. (2006). Camera calibration from video of a walking human. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9), 1513–1518.
Montero, R., & Bribiesca, E. (2009). State of the art of compactness and circularity measures. In International Mathematical Forum.

Rother, C. (2002). A new approach to vanishing point detection in architectural environments. Image and Vision Computing, 20, 647–655.

Schoepflin, T., & Dailey, D. (2003). Dynamic camera calibration of roadside traffic management cameras for vehicle speed estimation. IEEE Transactions on Intelligent Transportation Systems, 4(2), 90–98.

Tardif, J. P. (2009). Non-iterative approach for fast and accurate vanishing point detection. In IEEE conference on computer vision.

Toldo, R., & Fusiello, A. (2008). Robust multiple structures estimation with J-Linkage. In European conference on computer vision (pp. 537–547).

Tsai, R. (1986). An efficient and accurate camera calibration technique for 3D machine vision. In IEEE conference on computer vision and pattern recognition.

Zhang, Z., Tan, T., Huang, K., & Wang, Y. (2013). Practical camera calibration from moving objects for traffic scene surveillance. IEEE Transactions on Circuits and Systems for Video Technology, 23, 518–533.