Applications of Deep Learning, Machine Learning, and Remote Sensing to Improving Air Quality and Solar Energy Production

Exposure to higher PM2.5 concentrations can increase the risk of mortality; however, the spatial distribution of PM2.5 is not well characterized, even in megacities, because regulatory air quality monitoring (AQM) stations are sparse. This motivates novel low-cost methods to estimate ground-level PM2.5 at a fine spatial resolution so that PM2.5 exposure in epidemiological research can be better quantified and local PM2.5 hotspots at the community level can be automatically identified. Wireless low-cost particulate matter sensor networks (WLPMSNs) are among these novel low-cost methods; they transform air quality monitoring by providing PM information at finer spatial and temporal resolutions. However, calibrating and maintaining large-scale WLPMSNs remains a challenge for three reasons: the manual labor involved in initial calibration by collocation and in routine recalibration is intensive; the transferability of calibration models determined from initial collocation to new deployment sites is questionable, as calibration factors typically vary with the urban heterogeneity of operating conditions and aerosol optical properties; and low-cost sensors can drift or degrade over time. This work presents a simultaneous Gaussian process regression (GPR) and simple linear regression pipeline to calibrate and monitor dense WLPMSNs on the fly by leveraging all available reference monitors across an area without resorting to pre-deployment collocation calibration. We evaluated our method for Delhi, where PM2.5 measurements from all 22 regulatory reference nodes and 10 low-cost nodes were available for 59 days between January 1, 2018 and March 31, 2018 (PM2.5 averaged 138 ± 31 μg m⁻³ among the 22 reference stations), using leave-one-out cross-validation (CV) over the 22 reference nodes.
We showed that our approach can achieve an overall 30% prediction error (RMSE: 33 μg m⁻³) at a 24 h scale and is robust, as underscored by the small variability in the GPR model parameters and in the model-produced calibration factors for the low-cost nodes across the 22-fold CV. Of the 22 reference stations, high-quality predictions were observed for stations whose PM2.5 means were close to the Delhi-wide mean (i.e., 138 ± 31 μg m⁻³), and relatively poor predictions for stations whose means differed substantially from the Delhi-wide mean (particularly on the lower end). We also observed washed-out local variability in PM2.5 across the 10 low-cost sites after calibration with our approach, in marked contrast to the true wide variability across the reference sites. These observations revealed that our proposed technique (and geostatistical techniques more generally) requires high spatial homogeneity in pollutant concentrations to be fully effective. We further demonstrated that our algorithm’s performance is insensitive to training window size: the mean prediction error and the standard error of the mean (SEM) for the 22 reference stations remained consistent at ~30% and ~3–4%, respectively, as data were added to the model training in 2-day increments. Because the algorithm needs so little training data, the models can be kept nearly up to date in the field, which realizes the algorithm’s full potential for dynamically surveilling large-scale WLPMSNs by detecting malfunctioning low-cost nodes and tracking drift with little latency. Our algorithm showed similarly stable mean prediction errors of 26–34% and SEMs of ~3–7% over the sampling period when trained on the current week’s data and used to predict 1 week ahead, and is therefore suitable for online calibration.
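The calibration idea described above can be illustrated with a minimal sketch. It is a simplified, one-pass stand-in for the simultaneous GPR and linear regression pipeline, not the dissertation's implementation: a GPR fitted on the reference nodes estimates PM2.5 at a low-cost node's location for each day, and a simple linear regression of those estimates on the node's raw readings yields its calibration factors. The function names, the squared-exponential kernel, its hyperparameters, the coordinates, and the toy data are all assumptions made for illustration.

```python
import math


def rbf(p, q, length_scale, variance):
    """Squared-exponential covariance between two (x, y) locations."""
    d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return variance * math.exp(-0.5 * d2 / length_scale ** 2)


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def gpr_predict(train_xy, train_pm, target_xy,
                length_scale=10.0, variance=1.0, noise=0.1):
    """GPR posterior mean at target_xy (hyperparameters are hypothetical)."""
    mean = sum(train_pm) / len(train_pm)  # constant prior mean for the day
    k = [[rbf(p, q, length_scale, variance) + (noise if i == j else 0.0)
          for j, q in enumerate(train_xy)]
         for i, p in enumerate(train_xy)]
    alpha = solve(k, [v - mean for v in train_pm])
    return mean + sum(rbf(target_xy, p, length_scale, variance) * a
                      for p, a in zip(train_xy, alpha))


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def calibrate_node(ref_xy, ref_pm_by_day, node_xy, node_raw_by_day):
    """Regress daily GPR estimates at the node on the node's raw readings."""
    gpr_est = [gpr_predict(ref_xy, day_pm, node_xy) for day_pm in ref_pm_by_day]
    return fit_line(node_raw_by_day, gpr_est)


# Toy data: four reference nodes and a spatially homogeneous field on 3 days.
ref_xy = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
daily_pm = [100.0, 150.0, 200.0]
ref_pm_by_day = [[d] * len(ref_xy) for d in daily_pm]
# Hypothetical low-cost node whose raw signal obeys true = 2 * raw + 20.
node_raw_by_day = [(d - 20.0) / 2.0 for d in daily_pm]

slope, intercept = calibrate_node(ref_xy, ref_pm_by_day, (5.0, 5.0),
                                  node_raw_by_day)
print(slope, intercept)  # close to 2.0 and 20.0 for this homogeneous toy field
```

In this sketch, a node whose readings carry no real signal would produce a slope near zero and an intercept near the citywide mean, because the regression target is driven entirely by the reference stations.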
Simulations conducted using our algorithm suggest that, in addition to dynamic calibration, the algorithm can also be adapted for automated monitoring of large-scale WLPMSNs. In these simulations, the algorithm was able to identify malfunctioning low-cost nodes within a network (whether caused by hardware failure or by heavy influence of local sources) from their aberrant model-generated calibration factors (i.e., slopes close to zero and intercepts close to the Delhi-wide mean of true PM2.5). The algorithm was also able to track the drift of low-cost nodes accurately, within 4% error, for all simulation scenarios. The simulation results showed that ~20 reference stations are optimal for our solution in Delhi and confirmed that low-cost nodes can extend the spatial precision of a network by decreasing the extent of pure interpolation among only reference stations. Our solution has substantial implications for reducing the manual labor needed to calibrate and surveil extensive WLPMSNs, improving the spatial comprehensiveness of PM evaluation, and enhancing the accuracy of WLPMSNs.
Satellite-based ground-level PM2.5 modeling is another such low-cost method. Satellite-retrieved aerosol products in particular are widely used to estimate the spatial distribution of ground-level PM2.5. However, these aerosol products can be subject to large uncertainties because of the many approximations and assumptions made at multiple stages of their retrieval algorithms. Estimating ground-level PM2.5 directly from satellite observations (e.g., satellite images), skipping the intermediate step of aerosol retrieval, is therefore desirable compared with current ground-level PM2.5 retrieval methods: it can potentially yield lower errors because it prevents retrieval error from propagating into the PM2.5 estimate.
Additionally, the spatial resolution of estimated PM2.5 is usually constrained by that of the aerosol products and is currently, for the most part, a comparatively coarse 1 km or greater. Such coarse spatial resolutions cannot support scientific studies that depend on highly spatially resolved PM2.5. These limitations motivated us to devise a computer vision algorithm that estimates ground-level PM2.5 at a high spatiotemporal resolution by directly processing the global-coverage, daily, near-real-time, 3 m/pixel, three-band micro-satellite imagery available from Planet Labs, with each image covering an area significantly smaller than 1 × 1 km (e.g., 200 × 200 m). In this study, we employ a deep convolutional neural network (CNN) to extract image features that characterize the day-to-day dynamic changes in the built environment and, more importantly, the image colors related to aerosol loading, and a random forest (RF) regressor to estimate PM2.5 from the extracted image features together with meteorological conditions. We conducted the experiment on 35 AQM stations in Beijing over a period of ~3 years from 2017 to 2019. We trained our CNN-RF model on 10,400 available daily images of the AQM stations labeled with the corresponding ground-truth PM2.5 and evaluated the model on 2,622 holdout images. Our model estimates ground-level PM2.5 accurately at a 200 m spatial resolution, with a mean absolute error (MAE) as low as 10.1 μg m⁻³ (equivalent to 23.7% error) and Pearson and Spearman r scores up to 0.91 and 0.90, respectively. The CNN trained on Beijing was then applied to Shanghai, a similar urban area. By quickly retraining only the RF, not the CNN, on the new Shanghai imagery dataset, our model estimates PM2.5 at 10 AQM stations in Shanghai accurately, with an MAE of 7.7 μg m⁻³ (18.6% error) and Pearson and Spearman r scores both of 0.85.
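For readers who want to reproduce the kind of evaluation metrics quoted above, the following is a minimal sketch. It assumes that the percentage error is the MAE divided by the mean observed PM2.5, which is consistent with the reported figures but is not stated in the abstract. The function names and the toy holdout values are hypothetical.

```python
import math


def mae(obs, pred):
    """Mean absolute error between observations and predictions."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def normalized_mae(obs, pred):
    """MAE divided by the mean observation (assumed definition of % error)."""
    return mae(obs, pred) / (sum(obs) / len(obs))


def pearson_r(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Toy holdout set (hypothetical values, in ug/m3).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 41.0]

print("MAE:", mae(observed, predicted))
print("Normalized MAE:", normalized_mae(observed, predicted))
print("Pearson r:", pearson_r(observed, predicted))
print("Spearman r:", spearman_r(observed, predicted))
```

Reporting both correlation coefficients is useful because Pearson r measures linear agreement, whereas Spearman r only checks that the predicted and observed values are ordered the same way.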
The finest spatial resolution of our model’s ground-level PM2.5 estimates in this study, 200 m, is higher than that of the vast majority of existing state-of-the-art satellite-based PM2.5 retrieval methods, and the estimation performance of our 200 m model is at the high end of these methods. Our results highlight the potential of augmenting existing spatial predictors of PM2.5 with high-resolution satellite imagery to enhance the spatial resolution of PM2.5 estimates for a wide range of applications, including pollutant emission hotspot determination, PM2.5 exposure assessment, and fusion of satellite remote sensing and low-cost air quality sensor network information. We later found, however, that this CNN-RF sequential model, despite effectively capturing spatial variations, yields higher average PM2.5 prediction errors than its RF part alone using only meteorological conditions, most likely because the sequential model cannot fully use the information in satellite images in the presence of meteorological conditions. To break this bottleneck in PM2.5 prediction performance, we reformulated the CNN-RF sequential model into an RF-CNN joint model that adopts a residual learning approach, forcing the CNN part to exploit only the information in satellite images that is “orthogonal” to meteorology. The RF-CNN joint model achieved a low normalized root mean square error for PM2.5 of within ~31% and a normalized mean absolute error of within ~19% on the holdout samples in both Delhi and Beijing, outperforming both the CNN-RF sequential model and the RF part alone using only meteorological conditions.
To date, few studies have used their simulated ambient PM2.5 to detect hotspots. Furthermore, the hotspots studied in these very limited works are all “global” hotspots that have the absolute highest PM2.5 levels in the whole study region.
Little is known about “local” hotspots that have the highest PM2.5 only relative to their neighbors at fine-scale community levels, even though the disparities in outdoor PM2.5 exposure, and in the associated risk of mortality, between populations in local hotspots and coolspots within the same community can be rather large. These limitations motivated us to append a local contrast normalization (LCN) algorithm to the end of the RF-CNN joint model to automatically reveal local PM2.5 hotspots from the estimated PM2.5 maps. The RF-CNN-LCN pipeline reasonably predicts urban PM2.5 local hotspots and coolspots by capturing both the main intra-urban spatial trends in PM2.5 and the local variations in PM2.5 with urban landscape, with local hotspots corresponding to compact urban spatial structures and coolspots to open areas and green spaces. Based on 20 sampled representative neighborhoods in Delhi, our pipeline revealed that, on average, a significant long-term PM2.5 exposure difference of 9.2 ± 4.0 μg m⁻³ existed between the local hotspots and coolspots within the same community, with the Indira Gandhi International Airport area having the steepest increase, 20.3 μg m⁻³, from the coolest spot (the residential area immediately outside the airport) to the hottest spot (the airport runway). This work provides a possible means of automatically identifying local PM2.5 hotspots at 300 m in heavily polluted megacities. It highlights the potential existence of substantial health inequalities in long-term outdoor PM2.5 exposure between local hotspots and coolspots, even within the same neighborhood.
Apart from posing serious health risks, deposition of dust and anthropogenic particulate matter (PM) on solar photovoltaics (PVs), known as soiling, can diminish solar energy production appreciably. As of 2018, global cumulative PV capacity had crossed 500 GW, of which at least 3–4% was estimated to be lost due to soiling, equivalent to ~4–6 billion USD in revenue losses.
In the context of a projected ~16-fold increase in global solar capacity to 8.5 TW by 2050, soiling will play an increasingly important part in estimating and forecasting the performance and economics of solar PV installations. However, reliable soiling information is currently lacking because existing soiling monitoring systems are expensive. This work presents a low-cost remote sensing algorithm that estimates the daily solar energy loss of utility-scale solar farms due to PV soiling by directly processing the daily (near real-time updated), 3 m/pixel resolution, global-coverage micro-satellite surface reflectance (SR) analytic product from the commercial satellite company Planet. We demonstrate that our approach can estimate daily soiling loss for a solar farm in Pune, India over three years, during which soiling caused on average a ~5.4% reduction in solar energy production. We further estimated that around 437 MWh of solar energy was lost in total at this solar farm over the 3 years, equivalent to ~11,799 USD. Our approach’s average soiling estimate closely matches the ~5.3% soiling loss reported by a previously published model for this solar farm site. Compared with other state-of-the-art PV soiling modeling approaches, the proposed unsupervised approach has the benefit of estimating PV soiling at precisely the solar farm level (in contrast to coarse regional modeling for only the large spatial grids in which a solar farm resides) and at an unprecedentedly high temporal resolution (i.e., 1 day), without resorting to solar farms’ proprietary solar energy generation data or knowledge of the specific components of deposited PM or these species’ dry deposition flux and other physical properties. Our approach allows solar farm owners to keep close track of the intensity of soiling at their sites and to perform panel cleaning operations more strategically rather than on a fixed schedule.
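The distinction between “global” and “local” hotspots drawn in the air quality part of this abstract can be made concrete with a minimal sketch of local contrast normalization. It assumes a simplified variant that uses a uniform square window (each cell's value minus the window mean, divided by the window standard deviation); the dissertation's actual LCN formulation, window size, and map resolution are not given here, so the function name, the `radius` and `eps` parameters, and the toy PM2.5 map are hypothetical.

```python
import math


def local_contrast_normalize(grid, radius=1, eps=1e-6):
    """Subtract the local mean and divide by the local standard deviation."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Square window around (r, c), clipped at the map edges.
            window = [grid[i][j]
                      for i in range(max(0, r - radius), min(rows, r + radius + 1))
                      for j in range(max(0, c - radius), min(cols, c + radius + 1))]
            mean = sum(window) / len(window)
            std = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window))
            out[r][c] = (grid[r][c] - mean) / (std + eps)
    return out


def argmax_cell(grid):
    """Return the (row, col) index of the largest value in a 2D grid."""
    cells = ((r, c) for r in range(len(grid)) for c in range(len(grid[0])))
    return max(cells, key=lambda rc: grid[rc[0]][rc[1]])


# Toy PM2.5 map: a west-to-east trend plus one locally elevated cell.
pm_map = [[100.0 + 10.0 * c for c in range(5)] for _ in range(5)]
pm_map[2][1] += 15.0

global_hotspot = argmax_cell(pm_map)  # driven by the city-wide trend
local_hotspot = argmax_cell(local_contrast_normalize(pm_map))  # stands out locally

print("global hotspot (row, col):", global_hotspot)
print("local hotspot (row, col):", local_hotspot)
```

In this toy map the highest raw value lies on the eastern edge, where the trend peaks, while the normalized map singles out the cell that is elevated relative to its immediate neighbors.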





Zheng, Tongshu (2021). Applications of Deep Learning, Machine Learning, and Remote Sensing to Improving Air Quality and Solar Energy Production. Dissertation, Duke University. Retrieved from


Duke's student scholarship is made available to the public using a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.