Photogrammetry for Unconstrained Optical Satellite Imagery With Combined Neural Radiance Fields

We introduce a novel method tailored for unconstrained multiview optical satellite photogrammetry under time-varying illumination and reflection conditions. Our approach uses continuous radiance fields to represent surface radiance and albedo based on radiometry principles, integrating both static and transient components for satellite photogrammetry. In addition, an innovative self-supervised mechanism is introduced to optimize the learning process, leveraging dark-region accentuation, transient-static composition, and shadow regularization. Evaluations on multidate WorldView-3 images affirm that our model consistently surpasses state-of-the-art techniques.

The accumulation of satellite image data has spurred research combining remote sensing with computer vision methods, emphasizing the generation of digital surface models (DSMs) and digital elevation models (DEMs) using multiview satellite photogrammetry [9]. Some studies use binocular stereo matching and bundle adjustment based on selected pixels, subsequently fusing point clouds or depth maps with camera matrices [8], [11]. Conversely, others use an encoder-decoder framework to establish correspondences between imaging patterns and 3-D volumes [19]. Nevertheless, these conventional methods often struggle under unconstrained conditions and require extensive manual processing alongside substantial computational resources. Furthermore, passive remote sensing, characteristic of visible imaging sensors, contrasts with active sensors such as LiDAR [10]. The strong correlation with lighting conditions poses formidable challenges for visual processing.
Digital Object Identifier 10.1109/LGRS.2023.3337352

Unconstrained satellite images are typically captured with limited viewpoints, often under a wide range of seasonal and meteorological conditions. These variations are further exacerbated by uncertain factors such as the solar angle, atmospheric transmission, and the sky's diffuse reflection, which influence the radiation intensity received by the sensor. Moreover, intricate terrain and structures, including buildings, can obscure both direct sunlight and diffusely reflected light. This occlusion manifests as brightness disparities across surfaces and unevenly distributed shadows, complicating high-fidelity reconstruction. Finally, temporary objects such as vehicles and crowds are unpredictable and appear in standalone scenes, confusing learning-based reconstruction efforts and causing localized image blurring.
Recent advancements have seen techniques delivering intricate 3-D reconstructions using end-to-end methods to represent color and volume in stereoscopic domains. A prime exemplar is the neural radiance field (NeRF) [12], which restores pixel color in single- or multiview scenarios and achieves inverse rendering along visible light. Based on input images, NeRF uses an intrinsic continuous field function to model volumetric representations, facilitating realistic reconstructions from multidirectional viewpoints. This approach inherently models environmental and radiometric parameters and can independently segment transient spatial elements, demonstrating its significant potential for photogrammetry in unconstrained satellite imagery. While a few prescient studies have explored NeRF's utilization in remote sensing, these efforts are better suited to constant illumination conditions and handle reflections and shadows poorly [15], [16]. To address all these issues, we introduce a model termed photogrammetry for unconstrained optical satellite imagery with combined NeRFs (OSI-NeRF). The primary contributions of our research can be delineated as follows.
1) We represent the surface radiance and albedo intersected by extensive rays using continuous radiance fields based on radiometry principles. This allows us to learn satellite photogrammetry by integrating static and transient components.
2) We enhance photogrammetric precision using a novel self-supervised hybrid loss function for constrained training. The proposed method encourages the model to accentuate dark regions, integrate transient and static elements, and suppress shadows.
3) We evaluate the proposed OSI-NeRF on multidate WorldView-3 images. Both the qualitative and quantitative results distinctly showcase its superior performance.
II. PROPOSED MODEL

Our proposed OSI-NeRF addresses challenges in handling unconstrained multiview aerial and satellite photogrammetry under time-varying conditions of illumination and reflection. Specifically, we construct implicit scene representations integrating both the static and transient components at the same spatial location, as depicted in Fig. 1. For simplicity, we model the scene without considering external disturbances such as atmospheric radiation attenuation, which facilitates the determination of imaging parameters for surface albedo and irradiance given variable sunlight angles. We introduce a novel self-supervised mechanism, implemented by a hybrid loss function, to further refine photogrammetric details, thereby elevating the overall quality of scene reconstructions.

A. Combined NeRF
To initiate our approach, we commence with the fundamental NeRF model, incorporating a learnable and continuous space radiation field, denoted by a multilayer perceptron F_S, for constructing time-invariant static objects. This field consists of a density value σ_s ∈ R and radiance c_s = (r, g, b) ∈ R^3 computed by multilayer perceptrons, whose inputs are the spatial coordinates x = (x, y, z) ∈ R^3 and view direction d = (d_x, d_y, d_z) ∈ R^3 with ||d|| = 1. Both x and d are processed through the well-designed positional encoding function γ. The mathematical formula is as follows:

(σ_s, c_s) = F_S(γ(x), γ(d)).   (1)

In optical remote sensing applications, multiview imagery of the same scene exhibits notable variability and cannot be directly modeled by NeRF under differing illumination conditions and photometric postprocessing. However, satellite photogrammetry theory allows us to invert the irradiance l and surface albedo α from the radiance received by the satellite. Operating under the assumption of an ideal Lambertian Earth surface within a simplified atmospheric radiative transfer model [18], the irradiance is primarily derived from atmospheric scattering l_s and direct sunlight l_d, with isotropic irradiance in the atmosphere [7]. Drawing inspiration from [16], the image-independent static color ĉ_s can be decomposed as the product of irradiance and albedo. The irradiance, encompassing the atmospheric scattering intensity and the direct sunlight intensity, is derived from F_l; shadows, resulting from light obstructions, are modeled by the shadow field F_ρ; and the surface albedo is formulated through F_α.
For precise rendering of time-dependent transient objects within a scene, we adapt the concept of a "transient" head from [12]. Analogous to the computation for static entities, this head emits its own color, density, and uncertainty fields, facilitating accurate synthesis under opaque occlusions without prior knowledge. The rendered pixel color Ĉ(r, τ_i) associated with the ith image is obtained by compositing the static and transient fields along each ray. We assume that transient objects occupy the transparent space above the surface, where they block the passage of light. For precise and efficient computation in Earth observation scenarios, we use numerical quadrature for the integral evaluation and derive discrete points through altitude-based sampling within the range of absolute z-axis values [t_n, t_f]. In addition, the model adopts hierarchical volume sampling to improve efficiency by fine-tuning the sampling process, as implemented in [13]. By integrating the continuous fields derived from photogrammetry along the ray path, both the dynamic and static elements can be accurately measured and reconstructed in remote sensing imagery.
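As a concrete illustration, the Lambertian color decomposition and the combined static-transient compositing described above can be sketched in NumPy. This is a simplified sketch under the stated assumptions: the function names and the NeRF-W-style density-weighted blending are our own illustrative choices, not the authors' exact implementation.

```python
import numpy as np

def static_color(albedo, l_s, l_d, rho):
    # Lambertian sketch: irradiance = sky scattering + shadow-scaled
    # direct sunlight; static color = irradiance * albedo, per band.
    return albedo * (l_s + rho * l_d)

def render_ray(sigma_s, c_s, sigma_t, c_t, deltas):
    # Numerical quadrature along one ray of K samples, combining the
    # static and transient densities (NeRF-W-style alpha compositing).
    sigma = sigma_s + sigma_t                      # (K,) total density
    alpha = 1.0 - np.exp(-sigma * deltas)          # per-sample opacity
    trans = np.cumprod(np.concatenate(([1.0], np.exp(-sigma * deltas))))[:-1]
    weights = trans * alpha                        # (K,) quadrature weights
    # Each sample's color mixes static and transient by density share.
    mix = (sigma_s[:, None] * c_s + sigma_t[:, None] * c_t) / (sigma[:, None] + 1e-8)
    return (weights[:, None] * mix).sum(axis=0)    # (3,) pixel color
```

A fully opaque static sample at the front of the ray dominates the composited color, mirroring how occluders above the surface determine the rendered pixel.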

B. Hybrid Loss Function
To effectively train multiple continuous field function parameters, we propose a hybrid loss function for self-supervised training across diverse aspects.

1) Scene Reconstruction
To ensure that the static and transient heads align their outputs with real scenes, OSI-NeRF is trained on the difference between the predicted pixel colors Ĉ(r, τ_i) ∈ R^3 and the authentic sensor-derived pixel colors C(r, τ_i) ∈ R^3. The formula is as follows:

L_c = Σ_{r∈R} ||Ĉ(r, τ_i) − C(r, τ_i)||^2.   (6)

In real situations, time-dependent shadows appear in multiview remote sensing images. These shadows, often small, exhibit distinct color contrasts relative to the surrounding areas. As depicted in the NeRF test images in Fig. 2, the conventional L2 loss yields a vague reconstruction of darker regions because brighter regions dominate the loss contribution. Inspired by the RawNeRF research [17] on high dynamic range (HDR) images, we use the gradient of the tone-mapping curve function ψ(z) = ng(Ĉ(r, τ_i)) to emphasize these dark regions. The function ψ represents the nonuniform µ-law conversion commonly used in audio signal compression and HDR image processing [17]. The argument of ng(·) is treated as a constant, thus truncating the gradient flow during backward propagation. Furthermore, to improve the synthesized DSM's quality, we use the ℓ2 distance between the estimated altitude Ĥ(r, τ_i), integrated along the ray direction r, and the altitude H(r, τ_i) obtained from LiDAR measurements as a supervised loss. In the corresponding expression, h(t) ∈ [h_min, h_max] represents the mapping of points in sunlight to the altitude range of the scene. Finally, we define the scene reconstruction loss as the combination of these terms.
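A minimal NumPy sketch of these two terms follows. The 1/(sg(Ĉ)+ε) weighting is our illustrative stand-in for the stop-gradient tone-curve slope (the paper's µ-law gradient), not the authors' exact form:

```python
import numpy as np

def tonemapped_rgb_loss(pred, gt, eps=1e-3):
    # Residuals are scaled by the tone-curve slope evaluated at the
    # prediction, treated as a constant (stop-gradient), so darker
    # pixels contribute more to the total loss.
    w = 1.0 / (np.clip(pred, 0.0, None) + eps)
    return float(np.mean((w * (pred - gt)) ** 2))

def altitude_loss(h_pred, h_lidar):
    # l2 distance between ray-integrated altitude and LiDAR altitude.
    return float(np.mean((h_pred - h_lidar) ** 2))
```

With equal residuals, a dark prediction now incurs a much larger penalty than a bright one, which is the intended accentuation of shadowed regions.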

2) Density Refinement
The proposed network combines the volume densities of both the static and transient objects. As illustrated in the test images of the related S-NeRF in Fig. 2, voxel overlap of static and transient objects occurs in routine scenes, contradicting the Pauli exclusion principle [20] in the physical world. To mitigate this issue, we use a density binary entropy loss, taking inspiration from the work in [14], to penalize the coexistence of transient and static objects, where w ∈ [0, 1] determines the transient-static ratio along the ray path. The entropy term D constrains w to obey a {0, 1} distribution, while the exponent P > 1 right-skews the distribution, preventing the model from overproducing transient scenes.
Moreover, we impose a penalty on the maximum value of ω for each ray, which restricts the occlusion caused by the transient component. This approach is analogous to real-world phenomena and helps improve our model's accuracy. We then define the overall loss of this part as the sum of these penalties. Even so, due to the attenuated gradient near the coordinate origin, traces of transient feature mist remain in the static area [14]. For this reason, we adopt the curriculum learning idea [19], using a monotone decreasing concave function to control the proportion of the transient part within the rendering function. The variable l denotes the current training epoch, while L represents the maximum number of training epochs.
During the initial stages of training, the peak of the curve is skewed to the right, which penalizes blurry voxels to attain the utmost clarity in static object density. As training progresses, the curve's shape gradually shifts toward the negative direction, pushing the small-gradient density components that were difficult to train toward the origin. This strategy prioritizes securing stable representations of consistent static and transient scenes, subsequently fine-tuning the transient component for enhanced aggregate outcomes.
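The density refinement terms above can be sketched as follows. Both the exact entropy form and the cosine schedule are illustrative guesses consistent with the description (binary entropy skewed by P > 1; a monotone decreasing concave curriculum), not the published formulas:

```python
import numpy as np

def skewed_entropy_loss(w, P=1.5, eps=1e-7):
    # Binary-entropy-style penalty on the transient/static ratio w:
    # mixed values near 0.5 are penalized most, pushing w toward {0, 1};
    # raising the transient branch to P > 1 right-skews the penalty,
    # discouraging the model from overproducing transient density.
    w = np.clip(w, eps, 1.0 - eps)
    return float(np.mean(-(w ** P) * np.log(w) - (1.0 - w) * np.log(1.0 - w)))

def transient_proportion(epoch, max_epochs):
    # Monotone decreasing concave schedule for the transient share of
    # the rendering function (curriculum learning over training epochs).
    return float(np.cos(0.5 * np.pi * epoch / max_epochs))
```

The schedule starts at 1 and decays concavely toward 0, so the static representation stabilizes first before the transient head is gradually suppressed.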

3) Shadow Regularization
To remove the influence of direct sunlight in shadowed areas, we use pointwise reductions in (2) and (3). Yet, as depicted in the bottom row of Fig. 3, ablation models may overexplain dark elements as shadow regions in both the transient and static cases. To rectify this issue, we design a mean-squared loss L_ρ along the rays to restrict the scale of the shadow term. The overall hybrid loss of our method is

L = L_rec + λ_ω L_ω + λ_ρ L_ρ

where λ_ω and λ_ρ represent the empirical ratios of the regularization terms. To determine appropriate parameter values, we use the grid search technique, considering the appearance distributions inherent in the datasets.
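The weighted combination and the grid search over the regularization ratios can be sketched as below; the function names and the toy validation callback are our own illustration, with default weights taken from Section III-A:

```python
import itertools

def hybrid_loss(l_rec, l_omega, l_rho, lam_omega=5e-4, lam_rho=5e-5):
    # Weighted sum of the scene reconstruction, density refinement,
    # and shadow regularization terms.
    return l_rec + lam_omega * l_omega + lam_rho * l_rho

def grid_search(validate, omega_grid, rho_grid):
    # Pick the weight pair that maximizes a validation score
    # (e.g. PSNR on held-out views), as done to tune the ratios.
    return max(itertools.product(omega_grid, rho_grid),
               key=lambda pair: validate(*pair))
```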

III. EXPERIMENTS
In this section, we evaluate the practical performance of the proposed model. We use surface scenes from five distinct datasets captured at different periods using the WorldView-3 optical sensor [21]. Each crop has a resolution of 0.3 m/pixel with selected visible RGB bands, covering an area of 256 × 256 m for each area of interest (AOI). The elevation data ranges from −29 to 30 m, and specific details for each scenario can be found in Table I.

A. Implementation Details
In our experiments, the NeRF model served as the foundation for our implementation under PyTorch. In all, 64 coarse and 64 fine points are sampled from each ray, and the solar azimuth is obtained from the metadata file. For scenes with transient objects and shadows, values of λ_ψ = 5e−1, λ_ω2 = 5e−2, λ_ω = 5e−4, and λ_ρ = 5e−5 are assigned based on validation outcomes, whereas other parameters are ascertained via grid search. The optimization procedure consists of 100k iterations with a batch size of 2048 and an exponentially decayed learning rate ranging from 10^−3 to 10^−5. The training process takes about 10 h on two NVIDIA 3090 GPUs. Our method is easily reproducible, and we will make the code and datasets public upon publication.
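The exponential decay schedule implied by the endpoints above can be written in one line; the function name is ours, and only the endpoints (10^−3 to 10^−5 over 100k iterations) come from the text:

```python
def learning_rate(step, total_steps=100_000, lr_start=1e-3, lr_end=1e-5):
    # Exponential decay interpolating the stated endpoints:
    # lr(0) = 1e-3, lr(total_steps) = 1e-5.
    return lr_start * (lr_end / lr_start) ** (step / total_steps)
```

At the halfway point the rate is the geometric mean of the endpoints (10^−4), which is the defining property of an exponential schedule.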

B. Evaluation
We execute a multiperspective optical remote sensing image synthesis task. The process generates images using known sunlight angles, and we evaluate the model's efficacy by comparing the synthesized imagery with actual scenes. For comparison, we use image evaluation metrics and DSM-derived 3-D quality indicators, notably PSNR/SSIM and altitude MAE (A.MAE). Furthermore, we qualitatively scrutinize the algorithm's results using practical examples.
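For reference, the two scalar metrics can be sketched as below (SSIM is omitted here since it requires windowed statistics); these are standard definitions, not code from the paper:

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    # Peak signal-to-noise ratio in dB between rendered and real images.
    mse = np.mean((pred - gt) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

def altitude_mae(dsm_pred, dsm_gt):
    # Mean absolute altitude error (A.MAE) between DSMs, in meters.
    return float(np.mean(np.abs(dsm_pred - dsm_gt)))
```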

C. Baselines
Previous satellite stereo pipelines predominantly use stereo matching for DSM construction. However, these methods often lack a way to model inconsistencies, and the ground texture and details typically require additional processing, complicating direct color rendering. Analogous to [22], we use the S2P method, selecting the ten pairs with the closest acquisition dates as a baseline. We then evaluate the quality of the DSM by comparing the A.MAE values. Furthermore, a few recent studies have focused on leveraging neural rendering techniques for remote sensing images, such as S-NeRF [16] and Sat-NeRF [15]. Specifically, S-NeRF is a pioneering approach, the first we identify that applies the NeRF technique to remote sensing imagery. This method innovatively uses solar correction to establish consistency between the incoming light ratio and spatial transparency. Extending this, Sat-NeRF incorporates a model for transient objects and tailors the camera model representation more fittingly to satellite data. In addition, we also consider the original NeRF model as a baseline. Our proposed OSI-NeRF, based on the NeRF framework like the aforementioned baselines, separately computes static and transient elements while calculating additional loss constraints along the light paths. These enhancements do not result in significant changes in computational cost.
In addition, we design three variants for ablation experiments. The first model, OSI-NeRF-ω, deletes the transient-static density refinement loss L_ω. The second model, OSI-NeRF-ψ, removes the tone-mapping function, which promotes the representation quality of dark regions, from the denominator of the L_ψ loss. Finally, the third model, OSI-NeRF-ρ, removes the auxiliary shadow scale regularization L_ρ. We confirm the significance of each module through these ablation models.

D. Results
The experimental outcomes are presented in Table II, which reports the quantitative indexes across the datasets. Among the baselines, NeRF underperforms notably. Specifically, on dataset Num.068, there is a pronounced disparity. Moreover, on other datasets, the PSNR drops by 2-6 dB and SSIM reduces by 0.1-0.3. This inferior performance primarily arises from the limited perspective of satellite imagery, combined with fluctuating lighting and surface conditions that violate NeRF's single-scene presumption. Sat-NeRF and S-NeRF approximate the shadow regions caused by occlusion to spatial voxels while solely processing surface reflections. These strategies manifest certain enhancements but fall short in generalized adaptability across multidate scenarios, leading to diminished performance metrics, particularly evident in dataset Num.131. Nonetheless, Sat-NeRF incorporates an altitude adjustment, yielding comparable A.MAE scores. The global altitude derived from OSI-NeRF demonstrates a marginal improvement over S2P DSMs, which are influenced by point cloud anomalies and candidate pairings. The ablations OSI-NeRF-ψ and OSI-NeRF-ω lack dark-region accentuation and precise decoupling of static and transient components, significantly decreasing model performance. In OSI-NeRF-ρ, the removal of the occlusion and shadow regularization similarly degrades the results.
The qualitative results of image rendering are illustrated in Fig. 2, showcasing results from both the training and testing datasets across two chosen scenarios. The transient latent codes of the test set are determined through optimal results, which verifies our ability to independently process and reconstruct static and transient components. Yet, the optimization of NeRF results in fragmentary or indistinct images, evident in Num.113. Unseen scenes lack street details and exhibit a misty appearance. Sat-NeRF produces unclear details, such as zebra crossings, and conceals transient objects like moving vehicles. S-NeRF occasionally introduces vague spots in some scenes, attributable to its limited capacity to handle intense changes in lighting and albedo. In contrast, our OSI-NeRF effectively handles unseen scenes and restores transient elements as much as possible, while also reconstructing details obstructed by buildings.
In Fig. 3, visual results from the various ablations are further investigated. OSI-NeRF-ω removes the refinement of static and transient components, resulting in the loss of static details and cluttered transient components. It induces augmented blurriness, especially on architectural features. OSI-NeRF-ψ blurs out areas with insufficient illumination, like those with vegetation cover, due to its inability to accentuate dark areas. Further, OSI-NeRF-ρ mistakenly identifies redundant areas, such as rooftops, as shadow regions, causing some raised structures to become uneven. Our introduced approach, OSI-NeRF, synergistically incorporates the capabilities of the distinct modules to realize optimal reconstruction outcomes.
IV. CONCLUSION

We propose OSI-NeRF, an effective NeRF-based approach for reconstructing scenes in satellite imagery. Our approach is adept at modeling optical satellite images under time-varying conditions of illumination and reflection. Moreover, it seamlessly incorporates a self-supervised mechanism to address challenges such as chiaroscuro, transient and static composition, and occlusion and shadow. Extensive experiments on real-world datasets indicate prominent qualitative and quantitative improvements over previous state-of-the-art approaches. Nonetheless, OSI-NeRF does exhibit certain limitations, such as irregularities in the DSM structural model, imperfect reconstruction of water surfaces, and restricted scene scope, which necessitate further refinement.

Manuscript received 27 May 2023; revised 25 September 2023; accepted 7 November 2023. Date of publication 28 November 2023; date of current version 19 December 2023. This work was supported in part by the Laboratory Fund of Chinese Academy of Sciences under Grant CXJJ-22S032, in part by the Strategic Priority Research Program of Chinese Academy of Sciences under Grant XDA0310502, and in part by the Future Star of Aerospace Information Research Institute under Grant E3Z10701. (Corresponding author: Zide Fan.)

Fig. 1 .
Fig. 1. Architecture of OSI-NeRF. The combined neural radiance is structured from five distinct neural functions. Given the solar angle, spatial coordinate, and transient embedding, the atmospheric scattering l_s and direct sunlight l_d are produced by F_l. The shadow field function F_ρ calculates the nonstatic shadows. Static albedo and density are derived from the perceptrons F_α and F_s, respectively. Meanwhile, transient scenes are modeled through the volumetric function F_t. The hybrid loss function, encompassing scene reconstruction loss, density refinement loss, and shadow regularization loss, ensures consistency between the photogrammetric imagery and the actual scene.
Light intensities across different bands are denoted by l_s, l_d ∈ R^3 and can be approximated based on the sunlight angle. The solar orientation, defined by the azimuth and elevation angles (ζ, φ), informs the ray direction θ for each input image. The shadow parameter ρ ∈ [0, 1] indicates whether a specific terrestrial location is obstructed from light and is activated by the sigmoid function. The band-related surface albedo is defined by α ∈ R^3, solely dependent on the coordinate positions. The resultant pixel color ĉ_s ∈ R^3 is derived by multiplying l with α.
Here, τ_i ∈ R^m represents the per-image transient latent code of the ith image, while ρ signifies the likelihood of direct sunlight obstruction. The refinement of the transient calculation is discussed in detail later. To ascertain pixel color, we leverage the volume rendering technique to combine the static and transient heads along the ray r(t) = o + td, originating from the camera's projection center o with direction d.

Fig. 2 .
Fig. 2. Qualitative rendering results obtained from experiments conducted on the train and test images in two typical scenes. Our OSI-NeRF model is capable of simultaneously modeling surface appearance, illumination, shadow, and static/transient components.

Fig. 3 .
Fig. 3. Visualization comparison of different photogrammetry features in the ablation experimental group. These results demonstrate the effectiveness and versatility of our approach's different modules in generating high-quality synthesized DSMs.

TABLE I
Statistics of the Datasets