Machine Learning Computer Vision Applications for Spatial AI Object Recognition in Orange County, California

We provide an integrated and systematic automation approach to spatial object recognition and positional detection using AI machine learning and computer vision algorithms for Orange County, California. We describe a comprehensive methodology for multi-sensor, high-resolution field data acquisition, along with post-field processing and pre-analysis processing tasks. We developed a series of algorithmic formulations and workflows that integrate convolutional deep neural network learning with detected object positioning estimation in 360° equirectangular photosphere imagery. We provide application examples that process more than 800 thousand cardinal-direction images extracted from photospheres across two areas in Orange County, and present detection results for stop sign and fire hydrant object recognition. We discuss the efficiency and effectiveness of our approach, along with broader inferences related to the performance and implications of this approach for future technological innovations, including automation of spatial data and public asset inventories, and near real-time AI field data systems.


Introduction
The twenty-first century has introduced information-related functions into our everyday lives and social practices. For many of the youngest members of our societies throughout the world, life without a constant and readily accessible digital information flow is almost unimaginable. Of course, this is more pronounced in societies with a higher degree of technological advancement. Nevertheless, information flows are increasingly a core part of our contemporary realities. They have moved beyond the intellectual symbolism and realm of our social construction of reality to become a well-embedded norm and way of functioning in our societies. Information flows are at the core of our physical realities, our infrastructure, and the modus operandi by which our social systems function and are organized. Education, technology, science, governance, and institutions across all facets of our social lives depend on digital information systems and flows.
The growing and relatively widespread use of artificial intelligence approaches to machine learning has begun to spread applied and practical technological innovation across academic, government, industry, and commercial sectors alike. Technological advances in the last decade have transformed and restructured the dynamics and nature of technological innovation. The rapid growth of cloud technologies, physical infrastructure, and software and algorithmic methods has reached a point where AI is becoming the industry's gold standard for smart technologies. Smart homes and devices, smart vehicles, smart wearable devices, self-driving technologies and software, intelligent vehicle collision avoidance systems, industrial robotics, and smart detection technologies are just some examples of technological innovation to which a large part of our societies is steadily growing accustomed. Among these innovations, one area stands at the intersection of AI, machine learning, and pattern recognition: deep learning methodologies and their applications in machine learning, such as computer and cognitive vision. Such methods have attracted particular scientific and practical attention [11] in the past few years. This paper introduces, describes, and discusses the application of machine learning technologies in acquiring, processing, and analyzing multi-sensor field imagery data for spatial AI object recognition tasks. Specifically, we present (a) a systematic methodology for field acquisition of multi-sensor imagery data (including integrated LiDAR point cloud data, high-accuracy GNSS positional data, and 360° photosphere imagery data); (b) a workflow framework for post-field and pre-analysis data processing tasks; (c) a machine learning computer vision data analysis framework for spatial object recognition and positional detection; and (d) an automated workflow for GIS analysis and visualization of spatial data.
We provide examples of data analysis through these integrated workflows and discuss our findings along with directions for next steps of analysis.

Background
Deep learning or, more descriptively, deep neural network (DNN) methodologies use backpropagation with automatic differentiation to train a neural network classifier (using training and validation datasets in turn). A growing number of analytical and visualization methods for deep neural network learning have been developed in the last 5-10 years [1,20,14,16]. An example of a recent taxonomical classification of these methods is provided by Yu and Shi [38]. Deep learning algorithms have also been designed and implemented for locational and spatially explicit situations. For example, Yan et al. [36] applied a DNN to predict vehicle speeds by screening historical velocity, acceleration, steering signal input, location, and temporal awareness data from electric vehicles' onboard GPS sensors. Convolutional neural network approaches as part of a deep learning framework have been reported in applications of 3D audio-visual pattern recognition [30], scene pattern classification tasks in imagery [33], satellite imagery analysis tasks using transfer learning [35], and color classification imagery tasks [37], to name a few.
A few methodologies have been proposed and implemented for traffic sign recognition as part of an unsupervised or semi-supervised classification framework [18,3]. For example, Aziz et al. [4] recently used an algorithmic methodology called extreme learning machine (ELM) to evaluate traffic sign classification performance on the German (GTSRB) and Belgian (BTSC) sign datasets. The method uses grayscale, low-resolution images with normalized band histogram intensity values to perform feature fusion for the ELM classifier. Their classification accuracy ranged from 95% to more than 99% (in the case of multi-dataset training sets). Yu et al. [39] used hierarchical deep models for traffic sign recognition in LiDAR data analysis tasks, with classification accuracy ranging from 85% to 99% for different algorithmic implementations and datasets. Zhu et al. [41] performed traffic sign recognition tasks in panoramic images using convolutional neural networks for the German (GTSDB) benchmark database. Panorama images were also used in Fakour-Sevom et al. [8] and Coors et al. [6]. Traffic sign detection and classification have been reported in a number of other studies with varying results, including AdaBoost wavelet detection [5], rectangle detection [15], and LiDAR and image combinations [12,40,10].

Materials and Methods
Most of the studies and related literature described in the previous section use precompiled benchmark databases and training datasets for developing and testing algorithmic implementations of deep learning and convolutional neural network models. Furthermore, most studies address only detection, recognition, and classification, rather than incorporating comprehensive spatial and locational integration of data. In many cases, benchmark training datasets may include data that are not suitable for every situation, such as traffic signs from different countries with different shapes and designs, or data with relatively low resolution that may not be suitable for high-resolution, high-accuracy needs and settings. For example, high-accuracy GNSS systems require and mandate the use of high-resolution imagery for centimeter positional accuracy. We developed a comprehensive methodology that incorporates high-resolution field data acquisition and collection; multi-task pre-processing and analysis; algorithmic detection, recognition, and classification tasks; and spatial and GIS locational processing and geodatabase construction of object detections. We describe the methodological components used in this study in the next subsections.

LiDAR, GNSS and 360°Image Photosphere Data Acquisition
The field data collection was conducted by David Evans and Associates [7] using the mobile unit configuration described in the next subsections (see also Figure 1). The field unit configuration included three integrated sensor systems: first, a mobile LiDAR mapping system for capturing 3D point cloud data, as shown in Figure 1(a); second, an embedded GNSS-inertial positioning system for accurate locational data acquisition, shown in Figure 1(b); and third, an automated 360° photosphere image capture system, shown in Figure 1(c).
Figure 1: An overview of the field collection instruments used for the data collection in the study, which include a mobile LiDAR sensor, a Trimble GNSS positioning system, and a photosphere image camera (credit: Matthew Kumpula, David Evans and Associates Inc).
The mobile LiDAR data were acquired by a sensor mounted on a surveying vehicle, and the point cloud imagery was captured as the vehicle traveled along the streets. The mobile unit used was the RIEGL VMQ-1HA high-speed single scanner mapping system [22]. The tiled point clouds cover the whole 3D scene along the streets, including trees, grass, buildings, cars, pedestrians, and so on. Some points have negative z coordinates, which correspond to catch basins on both sides of the roads. The GNSS positioning system used was a Trimble Applanix AP60 embedded on-board card along with an inertial measurement unit [32] for capturing high-accuracy GPS positioning data, combining the vehicle's real-time position and the photosphere imaging data.
The camera used for static equirectangular 360° imagery acquisition was a FLIR Ladybug 5, 30MP (5MP×6 sensors) camera imaging system [9]. It produces full bit depth, 12-bit RAW images, which were converted to JPEG after field collection using the camera's API software in C#. The camera's field of view is 90% of the full sphere, and the spherical distance is calibrated at a minimum focus distance of 2 meters (6.562 feet). The imaging system itself reports data from several onboard environmental sensors, namely temperature, barometer, humidity, acceleration, and compass readings for each capture.
Data collection was conducted by David Evans and Associates Inc [7]. The data collection dates for the data reported in this paper ranged from December 2018 to April 2019, in 12 data sample groups and 8 field sample days. They cover two regions in Orange County: Anaheim Hills and North Tustin.

Field Post-Processing Data Applications
The mobile LiDAR point-cloud data underwent field post-processing using the MLS Ri Software suite from RIEGL (RiWORLD for georeferencing mobile data, RiPROCESS for data processing, RiANALYZE for wave-form analysis and RiPRECISION for mobile data adjustment) [25,24,21,23].
The Applanix system allowed post-processing georeferencing of the mobile mapping sensors using the POSPac MMS computational processing software system [31].
The photosphere image capture process was based on travel distance rather than time. The field data collection configuration allowed the capture of one image per approximately 10 feet of vehicle travel. In this way, the resulting dataset was spatially balanced and stratified, which accounts for and remedies temporal variations due to traffic conditions or vehicle stops. The stored original raw photosphere image data were converted to JPEG format with an overall width of 8000 pixels and a height of 4000 pixels. A comma-separated dataset accompanied each subset of images (one for each field data collection run), reporting for each image capture the GNSS data recorded by the Applanix Trimble system.

Calibrating and Adjusting for Relative Sensor Positioning
Because of the slight deviation in sensor placement within the mobile field data collection unit configuration, a geometric correction of the reported data was necessary. Specifically, as can be seen in Figure 2, the True North azimuthal direction relative to the photosphere camera's position and the one obtained from the GPS sensor may differ from each other. This difference is zero when the vehicle travels in the direction of azimuthal True North, in which case the direction of movement is aligned with the central axis of the vehicle's length (the dotted line connecting the GPS and camera sensors in Figure 2). On the other hand, the deviation increases with the radius of the turn of the vehicle. The further left or right from True North the direction of movement is, the more the relative positions differ. The distance between the two sensors is 1 meter (3.28084 feet). In order to correct for the direction of movement, we can use the inverse tangent function of the angle, expressed in degrees. Symbolically, if $\theta$ is the original difference between the True North axis and the axis of the direction of movement, then the corrected direction $\theta_0$ can be calculated from the following logical expressions:

$$\theta_0 = \begin{cases} \theta, & \text{if } \theta = 0 \\ (\theta + 360^\circ) \bmod 360^\circ, & \text{otherwise} \end{cases} \qquad (1)$$

In other words, the original angle $\theta$, in degrees, can be calculated directly from the Easting and Northing coordinates of the GPS sensor (in State Plane California, Zone 6 Datum Coordinate System [29]) using the inverse tangent function and then converting from radial degrees to directional degrees. If the angle is zero, then $\theta_0 = \theta$; otherwise, the corrected angle is the sum $\theta + 360^\circ$ taken modulo $360^\circ$. A second calibration that needs to be applied during the processing of the photosphere imagery concerns the relative direction of each horizontal center axis of the image.
As mentioned before, the flattened photosphere images span 8000 pixels in width, with the direction of movement corresponding to the first and last horizontal pixel coordinates (see Figure 3). Thus, since the photospheres reflect a 360° view of the location, the derived pixel-to-degree correspondence value is:

$$\delta = \frac{360^\circ}{8000\ \text{pixels}} = 0.045^\circ\ \text{per pixel} \qquad (2)$$

Thus, for any pixel of the image with horizontal location $x$, given the direction of movement $\theta_0$, the direction of the line of sight of the pixel can be calculated as:

$$\theta_x = (\theta_0 + \delta \cdot x) \bmod 360^\circ \qquad (3)$$
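The two calibrations above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's implementation: the driving direction is assumed to be derived from two consecutive GPS fixes in a projected (Easting, Northing) system, pixel column 0 is assumed to point along the direction of movement, and the function names are hypothetical. The 1-meter offset between the GPS and camera sensors is not modeled here.

```python
import math

IMAGE_WIDTH_PX = 8000                      # width of the flattened photosphere
DEG_PER_PIXEL = 360.0 / IMAGE_WIDTH_PX     # 0.045 degrees per pixel column


def driving_direction(e1, n1, e2, n2):
    """Azimuth (degrees clockwise from north) of travel from fix 1 to fix 2.

    Assumption: heading is taken from the coordinate differences of two
    consecutive fixes. atan2(dE, dN) returns a signed angle in (-180, 180],
    which is then normalized to [0, 360).
    """
    theta = math.degrees(math.atan2(e2 - e1, n2 - n1))
    return (theta + 360.0) % 360.0


def pixel_direction(theta0, x):
    """Line-of-sight azimuth for horizontal pixel column x, given heading theta0."""
    return (theta0 + x * DEG_PER_PIXEL) % 360.0


if __name__ == "__main__":
    heading = driving_direction(0.0, 0.0, -10.0, 0.0)   # travelling due west
    print(heading, pixel_direction(heading, 4000))      # column 4000 looks backwards
```

Using `atan2` on the coordinate differences avoids the quadrant ambiguity of a plain inverse tangent, and the modulo step plays the role of the zero/non-zero case distinction described in the text.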

Database and Image Analysis Pre-Processing Methodology
The original 360° equirectangular panorama photosphere imagery is transferred to an Azure blob storage database (cold storage) for processing using Azure Python API libraries [19], and occupies approximately 265 GB of cold data storage (see also Table 1). As described further below, images stored as cold blobs allow for simultaneous, associative storage of custom metadata fields for each of the stored blobs, thus enabling the linking of sensor data with images. These blob images are processed in three consecutive stages in order to be ready for analysis (see also Figure 4). Stage 1 involves dynamically and programmatically cropping from the original photosphere image (4000×8000) a functional analysis area around the image height center (1000×8000). There are several reasons for extracting a subset of the original images. First, it minimizes the vertical fish-eye lens distortion of the original camera sensor. Second, it minimizes noise and error from detection-irrelevant sections of the image, since the top and bottom parts of each photosphere contain sky and blind spot areas, respectively. Third, it improves the detection accuracy of the deep learning and detection algorithm used for the analysis. In preliminary model runs and experimentation, we found that the same model (a convolutional ANN) performs better with the reduced-size functional area than with the original photosphere image. Finally, it systematically and consistently reduces model training and prediction processing times, which is critical in achieving near real-time processing of the ML model. Stages 2 and 3 involve programmatically separating the extracted functional area into eight cardinal sub-images reflecting cardinal directions relative to the direction of movement. Each of these cardinal images has dimensions of 1000×1000 pixels. The reason for separating these cardinal images is twofold: one procedural and one analytical.
The procedural reason was derived empirically from successive trials and model runs on the photosphere images (and their reduced-form functional area equivalents). Our experiments showed that detection is vastly improved when computer vision object recognition is performed on the smaller images rather than on the wider, higher-resolution images. The wider photospheres cover a bigger focal area, with multiple potential candidate objects to be detected; thus, both the simultaneous detection probabilities and the error rates increase with the size of the image to be processed. Furthermore, it appears that both standard and custom models in Microsoft Azure (Azure Cognitive Services Computer Vision/Custom Vision models) and in Google Services (Google Cloud Vision/Custom Vision models) are trained on, and perform better with, images containing few or single focal objects to be detected, rather than multiple object detections in the same image.
The analytical reason for separating these images has to do with the spatial configuration and the nature of the objects sought for detection. For example, traffic signs will always appear in the top right cardinal image (cardinal ID 1), i.e., always immediately to the right of the driving direction (for right-driving roads). Thus, the cardinal separation serves a very useful purpose: it improves model accuracy, minimizes irrelevant and noisy data, and improves recognition and detection. Table 2 demonstrates the use of eight nominal cardinal directions if we begin (driving direction) from True North in the NAD83 coordinate system.
While the coarse classification showcased in Table 2 serves a useful purpose for the nominal True North case, it does not differentiate enough directions to account for all possible starting driving directions present in the data. In order to provide a more refined and suitable cardinal classification, we generated a cardinal lookup dictionary (in the Python class script of the ML application) to account for all possible configurations.
Since the driving direction is fixed with six-decimal floating-point accuracy, and we need to generate eight cardinal directions from left to right across our image, we know the exact direction of the center of each of these cardinal images, starting with $\theta_1 = \theta_0 + 22.5^\circ$ for the first image (with $\theta_0$ representing the driving direction), and adding another 45° (modulo 360°) for each subsequent cardinal image, i.e., for each $i = 1, 2, \ldots, 8$:

$$\theta_i = \left(\theta_0 + 22.5^\circ + 45^\circ (i - 1)\right) \bmod 360^\circ \qquad (4)$$

Using the center direction of each of these eight cardinal images, we can classify each image into one of 32 directional classes, each one representing an 11.25° range (since there are two directions from the image's center, i.e., 2 × 11.25° = 22.5°). The cardinal lookup dictionary directions used for the analysis are shown in Table 3. These lookup values are also used for the naming conventions of the cardinal photosphere images, and for visualizing results in spatial (GIS) and non-spatial applications in post-processing.
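The cropping and cardinal separation stages can be illustrated by computing the pixel boxes involved. The sketch below is an assumption-laden illustration rather than the study's code: it assumes the 1000-pixel functional band is centered vertically, that the eight cardinal tiles are contiguous and equal in width, and the function name `cardinal_boxes` is hypothetical. Boxes are expressed as (left, upper, right, lower), which is the order expected by common image libraries such as Pillow's `Image.crop`.

```python
def cardinal_boxes(width=8000, height=4000, band=1000, tiles=8):
    """Return (left, upper, right, lower) pixel boxes for each cardinal tile.

    The functional analysis area is a horizontal band of `band` pixels
    centered on the image height; it is then cut into `tiles` equal-width
    sub-images from left to right.
    """
    top = (height - band) // 2          # 1500 for a 4000-pixel-high image
    bottom = top + band                 # 2500
    tile_width = width // tiles         # 1000
    return [
        (i * tile_width, top, (i + 1) * tile_width, bottom)
        for i in range(tiles)
    ]


if __name__ == "__main__":
    for cardinal_id, box in enumerate(cardinal_boxes(), start=1):
        print(cardinal_id, box)
```

Keeping the geometry separate from the image I/O makes it straightforward to apply the same boxes to every photosphere in a storage container, whichever imaging library is used for the actual crop.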
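The cardinal center directions and their assignment to 32 directional classes can be sketched as follows. This is only an illustration: the actual labels of the lookup dictionary in Table 3 are not reproduced here, so the integer class index returned by `direction_class` is a hypothetical stand-in, and the assumption that class 0 starts at 0° (rather than being centered on it) is ours.

```python
def cardinal_centers(theta0, n_images=8):
    """Center azimuths of the cardinal images, left to right, for heading theta0."""
    step = 360.0 / n_images             # 45 degrees per cardinal image
    return [(theta0 + step / 2.0 + step * i) % 360.0 for i in range(n_images)]


def direction_class(theta, n_classes=32):
    """Index (0..31) of the 11.25-degree class that contains azimuth theta."""
    class_width = 360.0 / n_classes     # 11.25 degrees
    return int((theta % 360.0) // class_width)


if __name__ == "__main__":
    for i, center in enumerate(cardinal_centers(350.0), start=1):
        print(i, center, direction_class(center))
```

For a heading of 350°, the first cardinal image is centered at 12.5° and the last at 327.5°, which shows why the modulo step is needed when the heading is close to True North.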

Automated Production Workflows
We automated the entire production workflow and processing of the field data acquisition using Python programming. We designed four distinct and associated production workflow stages. These stages are shown diagrammatically in Figure 5 and are summarily described below.
Production stage 1 data processing workflows involve configuring and staging Azure blob storage and metadata operations, including custom metadata fields that hold ML object recognition output variables along with field sensor data variables. These operations are followed by analytical operations on cardinal image processing (see the detailed process in methodology subsections 3.2, 3.4 and 3.7 below) and by compiling structured geoJSON dictionaries from deep learning custom vision operations.
Production stage 2 algorithmic workflows involve calculating and processing positional spatio-temporal data from the obtained sensor information; processing imagery data for cardinal and pixel-directional associations, annotations, and other string construction operations; multiple operations related to AI and ML processing algorithmic tasks (REST construction, JSON response dictionary construction, object tagging and categorization, etc.); and post-AI object detection calculations (triangulating object position, object centroid calculation, and image-to-object distances and angles), as described in detail in subsequent sections.
Production stage 3 geoprocessing workflows programmatically follow the AI algorithmic object extraction stage and involve generating and constructing positional data and detected object geodatabases from geoJSON dictionary responses, including feature collections and feature classes with positional and ML detected object-related attributes. In this stage, feature class data are separated into feature datasets by cardinal direction, object category, and object location; these processes are followed by statistical clustering analysis (DBSCAN) and combinatorial positional calculations for multiple detections. The final operations in this stage involve pair-wise object coordinate extraction, within-cluster combinatorial averaging, and post-processing variational accuracy assessment.
Production stage 4 visualization workflows form the final stage, which involves programmatically processing spatial geodatabases and feature datasets through ArcGIS arcPy programming classes, and ArcGIS API processing for constructing spatial REST feature services, web maps, and web apps used in a public data portal, along with analytics and spatial detection accuracy metrics. All four production stages are fully automated and are processed by dataset, in such a way that the processing that begins with data acquisition carries through to final data portal production and visualization.
Two distinct Python classes were designed and implemented for the production workflows described here: the main data processing class acvml, shown in the class diagram of Figure 6, and the spatial geoprocessing class acvgis, shown in the class diagram of Figure 7.
Both Python classes contain multiple functions for sequential data processing, and the output geoJSON file of acvml is used in processing spatial feature classes and geodatabase operations in acvgis.
Class initialization and object instantiation are performed by sub-dataset (blob storage container) of the field data, and the automated production process writes the deep learning object recognition results to multiple outputs: (a) as metadata on each cardinal blob of photosphere images in the blob storage containers; (b) as geoJSON dictionary class members; and (c) as attributes in geodatabase feature classes and the resulting features in ArcGIS Online REST feature services.
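To make the geoJSON output concrete, the following sketch builds a GeoJSON feature for a single detection and wraps several of them into a feature collection. The overall structure follows the GeoJSON format (a `Feature` with a `Point` geometry in longitude, latitude order), but every property name, the helper function names, and the sample values are hypothetical placeholders; the actual attribute schema used by the acvml class is not given in the text.

```python
import json


def detection_feature(lon, lat, category, confidence, cardinal_id, image_name):
    """Build a GeoJSON Feature dictionary for one detected object.

    All property names below are illustrative placeholders.
    """
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {
            "category": category,          # e.g. "stop sign" or "fire hydrant"
            "confidence": confidence,      # model-estimated probability
            "cardinal_id": cardinal_id,    # which of the eight cardinal images
            "image": image_name,           # source cardinal blob name
        },
    }


def feature_collection(features):
    """Wrap an iterable of Feature dictionaries into a FeatureCollection."""
    return {"type": "FeatureCollection", "features": list(features)}


if __name__ == "__main__":
    # Placeholder coordinates and names, not values from the study.
    feature = detection_feature(-117.80, 33.75, "stop sign", 0.91, 1, "example.jpg")
    print(json.dumps(feature_collection([feature]), indent=2))
```

A plain dictionary structure like this serializes directly with the standard `json` module, which is one reason geoJSON is a convenient hand-off format between a detection stage and a geoprocessing stage.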

Machine Learning and Computer Vision Methods
We used and tested multiple ML algorithms for computer vision on production data, referred to below as models (a) through (d), with variants (i) and (ii) for models (a) and (b). Of these models, (a.i), (b.i), and (c) used object classifiers pre-trained on web images through convolutional deep neural network models. An example of the topological configuration of Mathematica's Image Processing model, showcasing the layout of the convolution layers in the deep neural network model, is shown in Figure 8. We used classification results from these models to perform additional, custom classifications to train models (a.ii) and (b.ii). Specifically, we assessed the accuracy of the object classification obtained by Azure's Cognitive Services Vision model and the Google Cloud Vision model by identifying images correctly classified by the models, and using approximately 50-100 of these images (depending on the object: 100 for stop signs, 50 for fire hydrants) for training the custom vision versions of the models in (a) through (d).

Detected Object Position Calculation
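The relationship between the slope of a line ($m$) and its angle in degrees ($\theta^\circ$) is:

$$m = \tan\theta \qquad (5)$$

and the slope between two coordinate points $(x_1, y_1)$ and $(x_2, y_2)$ is,

$$m = \frac{y_2 - y_1}{x_2 - x_1} \qquad (6)$$

We have two GPS points (from a vehicle) that are known, i.e., point $A = (x_A, y_A)$, and $B = (x_B, y_B)$. We want to find the unknown coordinates of a detected object, $C = (x_C, y_C)$. For such an object, we only know its detected directions (angles) from points $A$ and $B$ respectively, namely $\hat{\theta}_A$ and $\hat{\theta}_B$. We know that the two lines intersect at point $C$. Therefore, from the slope equations, we have,

$$\tan\hat{\theta}_A = \frac{y_C - y_A}{x_C - x_A}, \qquad \tan\hat{\theta}_B = \frac{y_C - y_B}{x_C - x_B} \qquad (7)$$

then,

$$y_C = y_A + \tan\hat{\theta}_A \cdot (x_C - x_A), \qquad y_C = y_B + \tan\hat{\theta}_B \cdot (x_C - x_B) \qquad (8)$$

and, dividing the two equations in (8), we have,

$$\frac{y_A + \tan\hat{\theta}_A \cdot (x_C - x_A)}{y_B + \tan\hat{\theta}_B \cdot (x_C - x_B)} = 1 \qquad (9)$$

and therefore,

$$x_C = \frac{y_B - y_A + \tan\hat{\theta}_A \cdot x_A - \tan\hat{\theta}_B \cdot x_B}{\tan\hat{\theta}_A - \tan\hat{\theta}_B} \qquad (10)$$

or, finally,

$$\begin{cases} x_C = \dfrac{y_B - y_A + \tan\hat{\theta}_A \cdot x_A - \tan\hat{\theta}_B \cdot x_B}{\tan\hat{\theta}_A - \tan\hat{\theta}_B} \\[2ex] y_C = y_A + \tan\hat{\theta}_A \cdot (x_C - x_A) \end{cases} \qquad (11)$$

Figure 6: Symbolic class diagram for the main data processing Python function acvml, implementing photosphere image processing, object recognition, attribute extraction and tagging.
The pair of values $x_C$ and $y_C$ in equation (11) represents the two GPS coordinates for point $C$, expressed in the State Plane California zone coordinate system [29] (Easting and Northing).
Once all three positions $A$, $B$, and $C$ are known, we can also calculate all relevant distances, since,

$$d_{AB} = \sqrt{(x_B - x_A)^2 + (y_B - y_A)^2}, \quad d_{AC} = \sqrt{(x_C - x_A)^2 + (y_C - y_A)^2}, \quad d_{BC} = \sqrt{(x_C - x_B)^2 + (y_C - y_B)^2} \qquad (12)$$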
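The intersection of two lines of sight can be sketched in Python as follows. This is an illustrative variant, not the study's code: it assumes the detected directions are azimuths measured clockwise from north in a projected (Easting, Northing) system, and it intersects the two rays in parametric form using direction vectors rather than tangents, which gives the same point while avoiding the undefined tangent at 90° and 270°. The function names are hypothetical. Under this azimuth convention, the slope form of the equations corresponds to using the angle 90° minus the azimuth.

```python
import math


def locate_object(point_a, azimuth_a, point_b, azimuth_b):
    """Intersect two lines of sight and return the object position (x, y).

    point_a, point_b: (easting, northing) of the two detection points.
    azimuth_a, azimuth_b: directions to the object in degrees, clockwise
    from north (an assumption of this sketch).
    """
    ax, ay = point_a
    bx, by = point_b
    # Unit direction vectors: east component = sin(az), north = cos(az).
    dax, day = math.sin(math.radians(azimuth_a)), math.cos(math.radians(azimuth_a))
    dbx, dby = math.sin(math.radians(azimuth_b)), math.cos(math.radians(azimuth_b))
    denom = dax * dby - day * dbx          # 2D cross product of the directions
    if abs(denom) < 1e-12:
        raise ValueError("lines of sight are parallel; no unique intersection")
    # Solve A + t * dA = B + s * dB for t.
    t = ((bx - ax) * dby - (by - ay) * dbx) / denom
    return ax + t * dax, ay + t * day


def distance(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


if __name__ == "__main__":
    a, b = (0.0, 0.0), (0.0, 10.0)                     # vehicle drives north
    obj = locate_object(a, math.degrees(math.atan2(10.0, 20.0)), b, 45.0)
    print(obj, distance(a, b), distance(a, obj), distance(b, obj))
```

A negative value of the internal parameter `t` would mean the lines cross behind the first detection point; a production implementation would probably want to reject such pairs, which this sketch does not do.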

Spatial Clustering of Object Detection
Following the object identification process using the data processing (stage 1), algorithmic (stage 2) and early stages of geoprocessing workflows (GDB construction and feature dataset compilation of stage 3) described in section 3.5 and Figure 5, a spatial clustering process is required to uniquely identify the location of each object. In summary, the process (a) heuristically identifies detection groups with statistical likelihood of referencing a single unique object and clusters them into a cluster group, and; (b) uses pair-wise combinatorial statistics for distance calculations (see also section 3.7 above) by calculating the statistical mean and standard deviation of object positioning coordinates. The use of spatial clustering algorithms has been well documented in the relevant geospatial literature and has been among the most powerful tools in the geostatistical toolset of analysis.
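The clustering step can be illustrated with a compact, unoptimized DBSCAN written in plain Python, followed by a per-cluster mean and standard deviation of the estimated coordinates. The text names DBSCAN but does not give its parameters, so `eps` and `min_pts` below are hypothetical, the sample points are made up, and the simple per-cluster summary stands in for the pair-wise combinatorial statistics of the actual workflow. In practice a library implementation would likely be preferable.

```python
import math
from statistics import mean, pstdev


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster label per point (-1 means noise)."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster        # border point joins the cluster
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:       # core point: keep expanding
                queue.extend(nbrs)
    return labels


def summarize_clusters(points, labels):
    """Mean and population standard deviation of x and y for each cluster."""
    summary = {}
    for label in sorted(set(labels) - {-1}):
        xs = [p[0] for p, l in zip(points, labels) if l == label]
        ys = [p[1] for p, l in zip(points, labels) if l == label]
        summary[label] = {
            "n": len(xs),
            "mean": (mean(xs), mean(ys)),
            "std": (pstdev(xs), pstdev(ys)),
        }
    return summary


if __name__ == "__main__":
    # Made-up position estimates (feet): two objects and one stray detection.
    pts = [(100.0, 200.0), (101.0, 201.0), (99.0, 199.0),
           (500.0, 500.0), (501.0, 499.0), (900.0, 900.0)]
    labels = dbscan(pts, eps=5.0, min_pts=2)
    print(labels)
    print(summarize_clusters(pts, labels))
```

Each cluster's mean then serves as the single location estimate for one physical object, while the standard deviation gives a first indication of how consistent the repeated detections were.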

Results
The analytical results obtained from the experimental and methodological processing described in the previous section are provided in the following subsections. For this experimental application, we used two unincorporated areas in Orange County, covering a total of 45 square miles, and traversed a total of 195 miles, collecting sample observation data (photospheres and 3D LiDAR imagery) every 10 feet for both driving directions. We subsequently processed the data according to the methodology described in the previous section and obtained object detection and locational estimates according to the implemented algorithms. We present the analysis of two types of objects: stop signs (1,562 detected) and fire hydrants (425 detected), with specific examples integrating 3D LiDAR imagery data. Finally, we provide an accuracy assessment of the detection and location estimates for the sampled object data.

Geographic Coverage in Collection Areas
As can be seen in Figure 10, the two collection areas are in the north-central part of Orange County. These areas are the Anaheim Hills, and the North Tustin areas, both part of the County of Orange unincorporated areas.
The two sample collection areas used for the classification cover a combined area of approximately 45 square miles, measured using an approximation of a minimum bounding rectangle containing all the field observations in each of the two regions. As can be seen in Table 4, the Anaheim Hills unincorporated area covers approximately 5 square miles, while the North Tustin area covers approximately 40 square miles. In terms of the number of collected photospheres, a total of 102,901 360° images (8,000×1,000 pixels each) were collected and processed, of which 27,347 photospheres were collected in the Anaheim Hills area and 75,554 photospheres were collected in the North Tustin area. From these 360° images, using the algorithmic processing described in the methodology section, we successfully obtained and processed a total of 822,492 cardinal photosphere images (1,000×1,000 pixels each) for both collection areas: 218,389 cardinal images for the Anaheim Hills area and 604,103 cardinal images for the North Tustin area. Given the approximate 10-foot distance between two consecutive data captures of our data collection instruments, we can estimate the distance covered within our collection areas: a total of 194.9 miles travelled, with 51.8 miles covered in the Anaheim Hills area and 143.1 miles covered in the North Tustin area. The summary of the collection statistics for the sample areas can be seen in Table 4.
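The distance estimates above follow directly from the photosphere counts and the approximate 10-foot capture spacing. The short sketch below reproduces that arithmetic; the function name is hypothetical, and the spacing is treated as exactly 10 feet, which is an approximation.

```python
FEET_PER_MILE = 5280
CAPTURE_SPACING_FT = 10        # approximate distance between consecutive captures


def miles_travelled(n_photospheres, spacing_ft=CAPTURE_SPACING_FT):
    """Estimate the distance covered from the number of captures."""
    return n_photospheres * spacing_ft / FEET_PER_MILE


if __name__ == "__main__":
    counts = {"Anaheim Hills": 27_347, "North Tustin": 75_554}
    for area, n in counts.items():
        print(f"{area}: {miles_travelled(n):.1f} miles")
    print(f"Total: {miles_travelled(sum(counts.values())):.1f} miles")
```

Rounded to one decimal place, this gives 51.8, 143.1, and 194.9 miles, matching the figures reported in the text.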

Example of Stop Sign Detection
An example of multiple detections for a single stop sign can be seen in Figure 11. Given that photosphere field data captures occurred approximately every 10 feet, we can see that the stop sign was detected at least 70 feet away (the last detection estimated the point-to-object distance at around 14 feet from the sensor).
The estimation statistics and basic accuracy data are provided in Table 6. While both the driving direction $\theta_0$ and the object direction $\theta$ increase incrementally, their absolute difference $(\theta - \theta_0)$ also increases, from 21.69° in the first detection to 43.92° in the last detection, indicating a widening angle between the sensor location and the object location (stop sign), as one would expect, and as illustrated graphically in Figure 9.
The mean driving distance of the multi-detection estimation was $\mu_0 = 17.15719$ feet, with a mean standard deviation of the estimation of $\sigma^2_0 = 10.93862$ feet. The mean estimated distance between the points of detection and the detected stop sign object was $\mu_1 = 13.95209$ feet, with a mean standard deviation of $\sigma^2_1 = 9.146422$ feet. We have verified the prediction accuracy of our model results using the associated LiDAR point cloud data captured simultaneously with the photosphere imagery. As can be seen in Figure 12(a), the detected stop sign and area correspond to an associated LAS point cloud data tile (top subgraph), and more specifically to the street intersection shown in subgraph Figure 12(b). The visual presents a point budget of 10 million points, visualized through a custom Potree application [27] using a LAS point cloud feature service. The coordinates obtained from the point cloud data, shown in subgraph Figure 12(c), are identical to the coordinates estimated by the detection algorithm, thus providing an additional validation of the accuracy of our model estimates.

Example of Fire Hydrant Detection
A similar example of multiple detections for a single fire hydrant can be seen in Figure 13, with the obtained statistics in Table 7. For this example, we registered and processed four detections of a fire hydrant in the North Tustin focal area. The first two detections occur around 50-70 feet away from the fire hydrant, and the last two within 10-30 feet. As with the stop sign detection example, the angle between the direction of movement and the detected fire hydrant increases from 5.76° (first detection) to 20.21° (last detection). The estimated mean driving distance among these detections was $\mu_0 = 11.93858$ feet, with a mean standard deviation of $\sigma^2_0 = 6.500228$ feet. The estimated mean distance between detection points and the fire hydrant object was $\mu_1 = 8.541106$ feet, with a mean standard deviation of $\sigma^2_1 = 11.08175$ feet. Again, we verified the accuracy of the estimation using the LiDAR LAS point cloud data for the detected location, and as can be seen in Figure 14, the LAS point coordinates in subgraph (c) match the detected and estimated coordinates of our algorithmic approach.

Accuracy Assessment of Spatial Object Detection Algorithm
Beyond the spatial visual and graphical verification of the accuracy of our detection approach in terms of the estimated location coordinates, we also evaluated the estimation accuracy of the estimation. We assessed the variability of the estimates produced by the detection algorithm, and evaluated the mean locational estimate values, along with the variance of their detection. Given the probabilistic nature of the machine learning and computer vision framework we employed for this analysis, one would expect to see variability increased with the detection distance between a driving point and the detected object. This variability is further compounded by the constraints imposed by the physical measurement instrumentation, specifically by the resolution of the imagery used for the estimation.  at/Lon: Latitude/Longitude (WGS84); Alt: altitude (sea level elevation) in feet; 0 : driving direction; : object direction (stop sign); conf : confidence (model-estimated probability); : mean value; 2 : standard deviation value.
As we can see in Figure 15, the standard deviation between multiple detections for each stop sign is relatively low for both latitude and longitude coordinates. More specifically, Table 8 presents the statistics for the mean standard deviation of latitude and longitude estimations for the detected objects (stop signs and fire hydrants) in each of the sampled areas. The mean standard deviation of latitude and longitude measurements derives from the variability observed in multiple detections for each unique object location. The computed statistics (mean value and standard deviation) show the additional variability across all object detections in each area. The mean standard deviation for all stop signs is (0.0001336, 0.0000955) and the mean standard deviation for all fire hydrants is (0.000094, 0.0000110) for the (latitude, longitude) measurements. Thus, all measurements present approximate values in the 1 × 10⁻⁴ range. Such a mean deviation in terms of WGS84 decimal degrees corresponds to values around ±8 feet, at a mean point-of-detection to object distance of about 14.33 feet across all detected objects. These results are further bounded by an additional set of confounding factors:
a. The number of sequential detections for each object. More detections further from the object increase the variability of the estimates (although they potentially improve locational estimates).
b. The image's minimum pixel accuracy values, as shown in (2).
c. The relative mean distance between point-of-detection and object. The longer this distance, the larger the within-pixel location resolution error (thus, pairs of far-object detections are bound to exhibit more variance than pairs of near-object detections).
d. The role of vertical distortion error exhibited in 360° photosphere imagery, as can be seen in Figure 3 (although our methodology for cardinal photosphere extraction attempts to minimize this error).
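The two-level statistic described above (a standard deviation per object across its multiple detections, then a mean of those values across all objects in an area) can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the function names, the input layout (a dictionary mapping an object identifier to its estimated coordinates), and the use of the sample standard deviation are all assumptions, and the coordinates shown are hypothetical placeholders rather than measured data.

```python
from statistics import mean, stdev


def per_object_std(detections):
    """Return {object_id: (lat_std, lon_std)} for objects with >= 2 detections.

    `detections` maps an object id to a list of (lat, lon) estimates,
    one per detection of that object (hypothetical layout).
    """
    result = {}
    for object_id, points in detections.items():
        if len(points) < 2:
            continue  # a sample standard deviation needs at least two estimates
        lats = [lat for lat, _ in points]
        lons = [lon for _, lon in points]
        result[object_id] = (stdev(lats), stdev(lons))
    return result


def area_summary(detections):
    """Mean of the per-object (lat, lon) standard deviations for one area.

    Raises statistics.StatisticsError if no object has two or more detections.
    """
    stds = per_object_std(detections)
    lat_stds = [lat_std for lat_std, _ in stds.values()]
    lon_stds = [lon_std for _, lon_std in stds.values()]
    return mean(lat_stds), mean(lon_stds)


# Hypothetical placeholder estimates (WGS84 decimal degrees), not study data.
example_area = {
    "hydrant_1": [(33.7500, -117.8000), (33.7501, -117.8001), (33.7502, -117.8000)],
    "hydrant_2": [(33.7600, -117.8100), (33.7600, -117.8102)],
    "hydrant_3": [(33.7700, -117.8200)],  # single detection, skipped
}
print(area_summary(example_area))
```

Objects detected only once carry no information about within-object variability under this scheme, so the sketch skips them; whether the study handled such cases the same way is not stated in the text.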

Conclusions and Discussion
The results and methodology described in this study provide a concise, scientifically accurate, and reliable approach for automatic and spatially explicit object detection using deep learning computer vision algorithms. Our results demonstrate how a systematic data collection methodology that utilizes high-resolution photosphere imagery, coupled with both high-end, centimeter-accuracy GPS sensors and LiDAR point-cloud datasets, can enhance our ability to sense and detect focal objects such as street signs, fire hydrants, and other objects of public interest. We show how the use of convolutional artificial neural networks for deep learning computer vision tasks can provide reliable and consistent pattern recognition results. Both traditional (pretrained) models trained on generalized images from the web and custom computer vision models (for example, Azure and Cloud custom vision models, and the native custom models for TensorFlow/Keras, Matlab, Mathematica, and other applications) achieve acceptable and reliable levels of recognition performance.
The detection methods described in this paper offer significant benefits and potential for some applications, while presenting challenges and drawbacks for others. For example, the stop-sign detection algorithms can play a critical role in improving the efficiency, competency, and accuracy of spatial and geospatial asset inventories. Such applications may reduce cost in labor and time, yield more productive and efficient workflows, and eventually deliver public service benefits. Traditional workflows for geospatial asset inventories involve laborious and costly field crews with mobile GPS sensors entering data manually. Our showcased methodology allows for accurate, near real-time detection, positional estimation, and geospatial feature dataset construction.
On the other hand, even for real-time detection algorithms, such models do not have the necessary speed to serve in self-driving and autonomous vehicle settings. As we saw in the example showcased in section IV.B, the maximum detection distance was estimated at around 70 feet. Assuming for the sake of this argument that the first detection occurs in real time, a vehicle must come to a complete stop within the next 70 feet. For a vehicle traveling around 20 miles per hour (29.33 feet per second), this distance translates to a response time of a little over 2 seconds (about 2.4 seconds). As can be seen in relevant and widely accepted road standards and studies [34,17,26,13], more likely than not, such a response time might not meet the minimum stopping distance (MSD) requirements, even if we ignore human response timing factors. Algorithmic modeling technology for such applications may need to incorporate heuristic, probabilistic, and GIS methods in addition to object detection tasks to achieve acceptable levels of performance that meet driving safety standards.

Figure 15: Frequency distribution histograms of the standard deviation in the detected stop sign and fire hydrant objects' latitude (subgraph group a) and longitude (subgraph group b). For each subgraph group, the first row represents the stop sign detections and the second row the fire hydrant detections, while the first column represents the Anaheim Hills sample area and the second the North Tustin sample area.
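The response-time arithmetic in the autonomous-vehicle discussion above can be reproduced with a short sketch. It only converts the quoted speed to feet per second and divides the quoted 70-foot detection distance by it; it assumes constant speed and does not model braking, perception-reaction time, or any specific road standard. The function names are introduced here for illustration.

```python
FEET_PER_MILE = 5280
SECONDS_PER_HOUR = 3600


def mph_to_fps(speed_mph):
    """Convert miles per hour to feet per second."""
    return speed_mph * FEET_PER_MILE / SECONDS_PER_HOUR


def time_budget_s(distance_ft, speed_mph):
    """Seconds needed to cover `distance_ft` at a constant `speed_mph`."""
    return distance_ft / mph_to_fps(speed_mph)


# Values quoted in the text: 70 ft maximum detection distance, 20 mph.
speed_fps = mph_to_fps(20)
budget = time_budget_s(70, 20)
print(f"{speed_fps:.2f} ft/s, {budget:.2f} s to cover 70 ft")
```

Because the time budget scales inversely with speed, doubling the speed to 40 miles per hour halves the available time for the same detection distance.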