Towards Real-Time Monitoring of the Hajj

An automated approach to explore the fundamental properties of high-density pedestrian traffic is outlined. The framework operates on video or time-lapse images captured from surveillance cameras. For pedestrian velocity extraction, the framework incorporates cross-correlation based Particle Image Velocimetry (PIV) techniques. For pedestrian density estimation, the framework relies on the Machine Learning technique of Boosted Regression Trees. The information collected from images in pixel coordinates is transformed to world coordinates with a pin-hole camera based projective transformation technique. The framework has been tested with high-density crowd images acquired during the Muslim religious event, the Hajj. Accuracy and performance of the framework are reported.


Introduction
Every year millions of Muslims congregate to perform the Hajj. Managing pedestrian safety and comfort for crowds of this size presents formidable challenges. As recently as 2015, hundreds of people lost their lives in an accident during the Hajj [1]. Since surveillance cameras are widely used, studying video or time-lapse photos from these cameras may provide valuable insights into the dynamics of high-density pedestrian traffic. The aim of this work is to enable automatic processing of these images and the extraction of quantifiable information. More specifically, the current work provides ways of obtaining velocity and density information from surveillance camera images. The framework relies on the Particle Image Velocimetry (PIV) [2] technique for pedestrian velocity extraction and a trained Machine Learning model for obtaining density. After velocity and density are obtained in image (pixel) coordinates, they are transformed to physical units (meters) in world coordinates through projective geometry (perspective correction).
The rest of the paper is organized as follows: Section 2 briefly describes the current state of research on image processing of crowd images. Section 3 discusses the theoretical aspects of the framework, followed by Section 4, where experiments and results on the accuracy of the framework are reported. Section 5 presents some potential applications of the data, and conclusions are drawn in Section 6.

Background
Image processing and computer vision techniques have been used to analyze various aspects of pedestrian dynamics, including walking behavior, crowd monitoring, head counting and trajectory extraction. Maurin et al. [3] constructed a crowd monitoring system based on optical flow, segmentation and Kalman filtering. Nedevschi et al. [4] also constructed a detection and collision avoidance scheme from video data based on Kalman filtering. The results presented in both of these studies are promising. However, a major limitation of these frameworks is that they are not designed to handle high-density crowds.
To obtain pedestrian density from a given image, the first step is to get a headcount of the people in that image. Machine learning models have been shown to be effective for this task. Crowd counting has often been formulated as a regression problem in machine learning: once properly trained, a regression model outputs a predicted headcount. Ma et al. [5] developed a counting model based on Gaussian process regression. Idrees et al. [6] constructed a Support Vector Regressor trained with more than one feature (image gradients, Fourier peaks and interest point based samplings).
For obtaining pedestrian speed, Optical Flow [7] is arguably the most popular technique. Optical flow resolves pedestrian velocities per pixel. As a result, the processing time becomes quite high for surveillance camera images, which often have resolutions close to 5760x3840 pixels. The PIV technique, which is widely used in the fluid dynamics community, can extract pedestrian velocities at a much faster rate. PIV was developed in the mid-1980s [8] and has recently found use in other fields. For example, Vanlanduit et al. [9] used PIV for metal fatigue experiments and Sveen et al. [10] applied it to the study of water waves. An application of PIV to audio speaker performance can be found in Rossi et al. [11]. From the review of relevant literature, it is apparent that by combining machine learning techniques with Optical Flow or PIV speed detection, one can create an automated framework that monitors crowd properties and provides quantifiable information to support crowd management decisions, such as responding to impending danger.

Materials and Methods
This section briefly describes the theoretical aspects of the crowd monitoring framework. Crowd velocity extraction from PIV is discussed in section 3.1, crowd counting through boosted regression trees is discussed in section 3.2 and image-to-world coordinate transformation is discussed in section 3.3.

PIV
The PIV technique takes a sequence of time-lapse photos separated by a small time gap (typically fractions of a second). The photos are first divided into smaller blocks known as interrogation spots. Then, for each interrogation spot, a cross-correlation is performed between consecutive photos. The cross-correlation computes a component-wise inner product of the two spots at every candidate shift. It is generally computed in the frequency domain, where the correlation reduces to an element-wise product of Fourier transforms. Operating in the frequency domain enables much faster processing than regular (direct) correlation. The location of the peak of the resulting correlation surface gives the most likely displacement of the spot between the two photos. In Figure 1, a sample input image and correlation surface can be seen.
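The correlation step for a single interrogation spot can be sketched as follows. This is a minimal illustration, assuming grayscale spots of equal size stored as NumPy arrays; the function name piv_displacement is hypothetical and not part of the authors' framework. It uses circular correlation and returns integer-pixel shifts only, whereas production PIV codes typically add windowing, sub-pixel peak fitting and outlier rejection.

```python
import numpy as np

def piv_displacement(spot_a, spot_b):
    """Estimate the integer-pixel shift (dy, dx) that maps spot_a onto spot_b.

    Hypothetical helper: both spots are 2D arrays of identical shape taken
    at the same image location in two consecutive photos.
    """
    # Remove the mean so that uniform brightness does not dominate the peak.
    a = spot_a - spot_a.mean()
    b = spot_b - spot_b.mean()
    # Correlation theorem: corr = IFFT(conj(FFT(a)) * FFT(b)).
    corr = np.fft.ifft2(np.conj(np.fft.fft2(a)) * np.fft.fft2(b)).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Peak indices past the midpoint correspond to negative (wrapped) shifts.
    shift = [p if p <= s // 2 else p - s for p, s in zip(peak, corr.shape)]
    return tuple(int(v) for v in shift)
```

Dividing the returned shift by the time gap between the two photos gives a velocity in pixels per second, which still has to be converted to world units.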

Head counting
Head counting, or pedestrian counting, is the first step in density estimation. The machine learning model formulated for head counting uses Histogram of Oriented Gradients (HOG) [12] as image features and manual annotations as ground truth. The model solves a regression problem following the gradient boosting algorithm [13]. The input to the model consists of image segments containing pedestrians and the output is an approximate count of the number of people. To arrive at the final counts, the model operates in two stages. In the first stage, image features are extracted via HOG; in the second stage, the regression model is trained with the HOG features and ground truth counts.

Feature Extraction via HOG
The HOG feature creates a histogram of image edges based on their orientation. If the input image is an RGB image, it is first converted to a grayscale image (0-255 gray level values) in order to reduce the influence of illumination effects. The input image segment is then divided into 8x8 pixel blocks and the image gradient is calculated block by block. Afterwards, these gradients are collected into 9 orientation bins. The final outcome is a histogram of these gradients. More information on its application to crowd counting can be found in [14].
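The binning step described above can be sketched as follows. This is a simplified illustration, assuming a grayscale input and unsigned orientations in the range 0 to 180 degrees; the function name hog_cell_histograms is hypothetical. It omits the block normalization of the full HOG descriptor [12], so it should not be read as the exact feature used in the framework.

```python
import numpy as np

def hog_cell_histograms(gray, cell=8, bins=9):
    """Magnitude-weighted orientation histograms per cell (simplified HOG).

    Hypothetical helper: `gray` is a 2D grayscale image. Returns an array of
    shape (rows, cols, bins), one histogram per cell x cell pixel block.
    """
    gray = np.asarray(gray, dtype=float)
    gy, gx = np.gradient(gray)              # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    # Unsigned edge orientation in degrees, folded into [0, 180).
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    # Clamp guards against an angle that rounds to exactly 180.
    bin_idx = np.minimum((angle / (180.0 / bins)).astype(int), bins - 1)
    rows, cols = gray.shape[0] // cell, gray.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            sl = (slice(r * cell, (r + 1) * cell),
                  slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(bin_idx[sl].ravel(),
                                     weights=magnitude[sl].ravel(),
                                     minlength=bins)
    return hist
```

Flattening the returned array with hist.ravel() gives one feature vector per image segment, which is the form a regression model expects.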

Regression Model Construction
The regression model is constructed with HOG features of the input images and their corresponding ground truth counts. The model input has the form ([x11, x12, x13, ..., x1m], y1), ..., ([xn1, xn2, xn3, ..., xnm], yn), where n is the number of training image segments, yi is the ground truth count of segment i, and m is the dimension of the HOG histogram (68600 in this case). The goal of the regression model is to formulate an approximate function F (xnm -> y) that minimizes the dissimilarity between the ground truth and the model prediction in a stage-wise fashion by building regression trees. After the model is trained, given an input image, it first computes the HOG features of that image and then uses them to approximate the head count of pedestrians in the image. More details of the approach can be found in [14] and [15].
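The stage-wise idea can be illustrated with a small sketch. It assumes a squared-error loss and uses depth-1 regression trees (stumps) for brevity; the function names, the number of stages and the learning rate are illustrative assumptions, not the settings of the framework, which follows [13].

```python
import numpy as np

def fit_stump(X, residual):
    """Depth-1 regression tree: best single split under squared error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            lv, rv = residual[left].mean(), residual[~left].mean()
            err = ((residual - np.where(left, lv, rv)) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, t, lv, rv)
    if best is None:  # every feature is constant: predict the mean
        m = residual.mean()
        return (0, np.inf, m, m)
    return best[1:]

def predict_stump(stump, X):
    j, t, lv, rv = stump
    return np.where(X[:, j] <= t, lv, rv)

def fit_boosted(X, y, n_stages=50, learning_rate=0.1):
    """Each stage fits a stump to the current residuals y - F(x)."""
    base = float(np.mean(y))
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_stages):
        # For squared loss the residual is the negative gradient.
        stump = fit_stump(X, y - pred)
        pred = pred + learning_rate * predict_stump(stump, X)
        stumps.append(stump)
    return base, learning_rate, stumps

def predict_boosted(model, X):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, X) for s in stumps)
```

Here each row of X would hold the HOG features of one image segment and y the annotated head counts. With feature vectors as long as the one reported above, the exhaustive split search in this sketch would be slow, so an optimized library implementation is the practical choice.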

Image to World Transformation
The pedestrian speed obtained through PIV is expressed in pixel coordinates. Likewise, for the density calculation, the counts obtained through Machine Learning need to be divided by the area covered by the image. As the images are not taken from an orthogonally posed camera, converting pixels to meters or centimeters becomes a challenge. The coordinate transformation technique used here operates on a few landmark points for which both pixel and world coordinates are known. In Figure 2 (a), these landmark points can be seen along with their pixel and world coordinates. Since the pedestrians move along a 2D plane, an equation of this plane is formulated in image and world coordinates. The intersection of the ray through any pixel point with this plane is then determined, and this intersection point is converted to a 3D coordinate with the camera intrinsic parameters. The pixel points can be pixel locations of pedestrians or end points of PIV displacements. More details of the approach can be found in [15].
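Because the pedestrians move on a single plane, the pixel-to-world mapping can also be written as a planar homography estimated from the landmark points. The sketch below uses the direct linear transform (DLT) as an illustration; it is a stand-in for, not a reproduction of, the plane-intersection formulation of [15]. The function names are hypothetical, and at least four landmark points, no three of them collinear, are assumed.

```python
import numpy as np

def fit_homography(pixel_pts, world_pts):
    """Estimate the 3x3 homography H mapping pixel (u, v) to world (x, y).

    DLT: every landmark contributes two linear equations in the nine
    entries of H; the solution is the null vector of the stacked system.
    """
    rows = []
    for (u, v), (x, y) in zip(pixel_pts, world_pts):
        rows.append([-u, -v, -1, 0, 0, 0, x * u, x * v, x])
        rows.append([0, 0, 0, -u, -v, -1, y * u, y * v, y])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return vt[-1].reshape(3, 3)

def pixel_to_world(H, pixel_pts):
    """Apply H to an array-like of (u, v) points; returns (x, y) in world units."""
    pts = np.asarray(pixel_pts, dtype=float)
    homog = np.column_stack([pts, np.ones(len(pts))]) @ H.T
    # Divide by the third homogeneous coordinate (perspective division).
    return homog[:, :2] / homog[:, 2:3]
```

A PIV displacement is converted by transforming both of its end points and subtracting the results, since the scale of a pixel in meters changes across the image.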

Experiments and Results
In this section, results are presented for the numerical experiments undertaken to investigate the performance of the framework in real-world applications. In section 4.1, the dataset used for the study is presented. In sections 4.2 to 4.4, accuracy results are presented for velocity extraction, head counting and image-to-world coordinate transformation.

Dataset
As the paper title implies, the focus of the study is the Muslim event, the Hajj. The gathering takes place in the Sahn area (Figure 3), where people move circularly around the Kaaba (the black building); this ritual is called the tawaf. The study focuses on images from two different cameras. A sample frame from the second camera can be seen in Figure 3(a). These images are taken from Closed Circuit Television (CCTV) cameras located at the facility. As part of the training and testing of the machine learning model, 600 image segments were manually annotated. Some annotations can be seen in Figure 3.

Velocity Extraction
The velocity vectors obtained from PIV processing of crowd images can be seen in Figure 4. Figure 4 (a) depicts the vectors of the entire image, while Figures 4 (b) and (c) show vectors from two selected portions of Figure 4 (a). As can be seen from the magnified sections [Figure 4 (b) and (c)], the vectors are not streamlined. This is a result of high density. The predominant flow in this location is circular. However, due to the high density, density waves appear that result in chaotic patterns in the vectors. Several methods were used to check the accuracy of the PIV velocity extraction. One of them is to manually track randomly chosen people at random locations of an image. In Table 1, the results are listed for 6 random locations with PIV and manual tracking. The maximum error was found to be about 16%.

Head counting Results
Head counts from images are needed to obtain density. In Table 2, the Machine Learning model results are compared with the ground truth counts for 6 different images from both datasets. As can be seen from Table 2, the counts are close to each other. As mentioned earlier in section 3.2, for head counting the input image is divided into 100 smaller sub-images, i.e. image cells [Figure 2(b)]. The Machine Learning model provides a predicted count for each of these smaller image cells. In Figure 5, the blue dots represent the predicted headcount from the Machine Learning model, while the error bars indicate the difference between the ground truth counts and the predicted counts. It can be seen that for cells 0 to 50 the headcounts vary quite a bit. This is because of the camera angle (camera perspective). The effect diminishes in cells 50 to 99. However, Table 2 indicates that the cumulative counts are close to the ground truth counts, so the perspective does not severely affect the overall counts.

Coordinate transformation results
The results of this transformation process for 4 different coordinate points can be seen in Table 3. It can be seen that the projected points are not very far from the world coordinates.
Compute time plays a key role in real-time processing applications. The framework outlined in this study has two time-critical components (PIV processing and head counting). Processing two frames of the second dataset (Figure 3) took 32.79 seconds, which includes PIV processing and head counting combined. The experiments were performed on a 4-core laptop with a 1.8 GHz processor and 8 GB of main memory. In terms of compute power, the framework is therefore not very demanding. However, as more cameras are incorporated in the processing, the demand for compute power will go up. GPU processing may offer a simple solution to this.

Application
When the fundamental properties (density and velocity) of a crowd are known, they can be incorporated into a fundamental diagram. Furthermore, they can be used to estimate the pedestrian distribution at a future time. These applications are briefly explained in this section.

Fundamental Diagram
The fundamental diagram (speed vs density plot) for images from dataset 2 is shown in Figure 6. Fundamental diagrams are rare for high-density flows, considering the risks associated with conducting experiments in such conditions. The proposed framework can serve as an alternative approach for obtaining empirical results for high-density crowds. The fundamental diagram of Figure 6 also graphically outlines the difference between the density values obtained from the Machine Learning model and those from the ground truth counts.
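The points of such a diagram can be assembled per image cell as sketched below. The sketch assumes that the corner points of each cell are already available in world coordinates (meters) and that one head count and one mean speed per cell are given; the function names and the input layout are illustrative assumptions rather than the authors' data format.

```python
def polygon_area(corners):
    """Shoelace formula for a simple polygon given as [(x, y), ...] in meters."""
    area = 0.0
    n = len(corners)
    for i in range(n):
        x0, y0 = corners[i]
        x1, y1 = corners[(i + 1) % n]
        area += x0 * y1 - x1 * y0
    return abs(area) / 2.0

def fundamental_points(counts, cell_polygons, speeds):
    """Return one (density, speed) pair per cell.

    Density is the head count divided by the world area of the cell, in
    persons per square meter; speed is passed through unchanged.
    """
    return [(count / polygon_area(poly), speed)
            for count, poly, speed in zip(counts, cell_polygons, speeds)]
```

Plotting the second element of each pair against the first gives a speed vs density scatter of the kind described above.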

Predicting Future State of Crowd
Microscopic models such as PEDFLOW [17] take pedestrian density and speed as input and can approximate the pedestrian distribution at future times. The current framework can enable such models to approximate the distribution of pedestrians more accurately by providing more accurate input. In Figure 7, three images of the Kaaba premises are shown that together give full 360-degree coverage of the facility. The PEDFLOW approximation of pedestrian distribution and pedestrian density 4 seconds into the future can be seen in Figure 8. Although the approach is currently in its infancy, such an application holds great promise for high-density crowd monitoring and accident prevention.

Conclusions and Outlook
The study addresses two critical challenges of high-density pedestrian traffic: real-time monitoring and estimation of the future state (accident/congestion). For real-time monitoring, the proposed PIV technique can be considered a faster alternative to optical flow, while for future state estimation, the PEDFLOW model combined with inputs from PIV (speed) and the machine learning model (density) provides a useful tool for the safety personnel managing and monitoring the crowd. The accuracy of the framework is reported for the crowd of Makkah. The main challenge in making the model more general lies in the scarcity of ground truth datasets. At present, there are only a handful of datasets for machine learning based crowd counting and their generalization ability is limited. One way around the scarcity of datasets is to model individual crowd events separately and study them individually. This is the route that the authors have taken in this study.