Estimating Black Carbon Levels With Proxy Variables and Low-Cost Sensors

We develop a portable and affordable solution for estimating personal exposure to black carbon (BC) using low-cost sensors and machine learning. Our approach uses other pollutants and environmental variables as proxies for estimating the concentrations of BC and combines this with machine learning-based sensor calibration to improve the quality of the inputs that are used as proxies in the modeling. We extensively validate the feasibility of our approach and demonstrate its benefits with benchmarks conducted on real-world data from two different urban locations with different population densities and characteristics. Our results demonstrate that our approach can accurately estimate BC ($R^{2}$ higher than 0.9) without relying on a dedicated sensor. The results also highlight how calibration is essential for ensuring accurate modeling on low-cost sensor measurements. Our results offer a novel affordable and portable solution that can be used to estimate personal exposure to BC and, more generally, demonstrate how low-cost sensors and proxy modeling can increase the spatiotemporal scale at which information about BC level is available.


I. INTRODUCTION
Portable personal air quality sensors facilitating the monitoring and mitigation of personal pollution exposure are becoming increasingly affordable and available thanks to advances in miniaturization and sensor technology. As exposure to pollutants has been linked with many acute and chronic diseases [1]-[6], several types of cancer [3], [7], [8] and even the occurrence and severity of COVID-19 [9]-[12], these devices provide essential information for mitigating personal health risks and contribute toward improved health. Besides benefits to individuals, portable air quality sensors help to improve the resolution and coverage of air quality information [13], offering policymakers and other stakeholders information about current pollutant characteristics at high spatial and temporal resolution. This helps to mitigate the global cost of pollution and evaluate the effectiveness of countermeasures designed to tackle pollution. Indeed, estimates suggest that 2-5% of global GDP is spent on the treatment of diseases linked to poor air quality [14], [15], making pollution a truly global problem.
Submitted for review September 15, 2023. This research was supported in part by Nokia Center for Advanced Research (NCAR). It was also supported in part by the Academy of Finland projects with grant numbers 324576, 345008, 335934, and 339614. The work is also supported in part by the funding from Technology Industries of Finland Centennial Foundation to Urban Air Quality 2.0 project.
Among the different pollutants, black carbon (BC), also known as soot, is one of the most harmful to individuals. BC is linked with several chronic health conditions, including cancer and respiratory diseases [16], [17], and it contributes to global climate change by absorbing heat [18]. What makes BC particularly problematic is that BC particles linger in the air for a long time while bonding with chemicals and other substances [17], [19]. This makes soot not only a harmful substance in its own right but also a carrier of harmful compounds, including airborne viruses.
While portable and affordable (i.e., costing $100-$1000) sensors for many pollutants, such as particulate matter (PM) and common gases (e.g., NO2 and CO2), are widely available [13], [20]-[24], this is unfortunately not the case for BC. Indeed, the dominant approach for estimating the BC level is currently to rely on professional-grade measurement technology, which is typically highly expensive to operate and maintain, with a single sensor typically costing over $50 000 [25], [26]. This lack of affordable and portable BC sensors limits both the information available about personal exposure to harmful pollutants and the resolution of that information. The limited resolution is also a challenge for policymakers, as they must base their decisions on aggregate information without a detailed view of how BC levels vary across different parts of the urban environment.
We contribute a portable and affordable solution for estimating BC levels using low-cost air quality monitoring sensors and machine learning. The key idea in our approach is to take advantage of proxy variables that are integrated into a single sensor unit and that measure environmental variables and the concentrations of other pollutants. The proxy variables are used as input to a machine learning model that estimates the current BC level. The sensors on portable air quality monitoring devices are affected by cross-sensitivities, which results in the measurements of different pollutants being correlated. This suggests that the concentration of one pollutant can be estimated with at least reasonable accuracy from the concentrations of other pollutants. This is particularly useful for BC, due to its tendency to linger in the air. Naïvely modeling relationships between different pollutants, however, is not sufficient, as the sensors on portable devices often suffer from significant inaccuracies [27], [28]. To overcome these inaccuracies, we combine proxy variables with machine learning-based calibration, which helps to improve the quality of the sensor measurements used for proxy modeling. Machine learning-based calibration has recently emerged as a powerful solution for improving the accuracy of low-cost sensors and as an alternative to laboratory-based calibration. Machine learning techniques are effective in dealing with air quality and environmental data, which are often nonlinear, and machine learning algorithms can learn from large amounts of environmental data and identify complex patterns to perform low-cost sensor calibration [28], [29]. As we demonstrate in this paper, the integration of machine learning-based calibration with proxy modeling is essential for achieving accurate BC level estimation.
We validate our approach through extensive experiments carried out on measurements from two locations with differing urban densities and characteristics, collected over long periods of time (20 months and 26.5 months of measurements). The results demonstrate that our approach can reliably estimate BC levels with an $R^{2}$ higher than 0.9 without relying on a dedicated sensor. The results also highlight how sensor calibration is essential for improving the quality of the measurements that are used as input for the proxy modeling. Taken together, our work offers a novel portable and affordable solution for estimating BC concentrations, significantly extending the scope, scale, and spatiotemporal resolution at which information about air pollutants can be captured with low-cost sensors.

Summary of Contributions
• Feasibility: We demonstrate the feasibility of estimating BC levels from proxy variables using various machine learning models and pollution data collected over a long period at two reference stations.
• Novel approach: We estimate BC levels using proxy variables by combining machine learning with sensor calibration. Specifically, our solution combines low-cost sensor data with pre-calibrated proxy variables from low-cost air quality sensors for estimating BC levels.
• High accuracy: The BC levels estimated on low-cost sensors achieve an accuracy close to that provided by expensive high-quality instruments.

B. BC estimation and proxy variables
BC estimation is typically carried out using expensive professional-grade measurement stations. Most of the works on modeling BC concentrations focus on a global scale and use aggregate-level estimates [45]-[49], which offer limited spatiotemporal resolution and are unable to offer insights into personal exposure. The few works to consider a finer resolution have focused on specialized micro-environments, such as transportation systems or high-density city blocks, and used professional-grade measurements as inputs for estimation [50]-[52]. There has also been some limited work on developing mobile platforms for capturing BC concentrations, but these remain proprietary and limited in use [46].
Proxy variables are defined as variables that are not directly relevant but can be utilized to serve in place of an unobservable or immeasurable variable. Proxies have been used, e.g., to forecast pollutant concentrations [53] or to fill in missing values in observations [29]. The feasibility of using air quality measurements as proxy variables for BC estimation has been demonstrated in our earlier research. Fung et al. [54] developed a simple linear regression white-box model for estimating BC by using an input adaptive approach, which searches for the best combination of proxy variables for the estimation using ordinary least squares (OLS). Contrary to linear regression, Rovira et al. [55] focused on the non-linear properties of BC by exploring two black-box models, i.e., support vector regression and random forest. Zaidan et al. [56] and Fung et al. [57] compared and evaluated BC estimation using white-box and black-box models using proxy variables measured at reference stations.
Compared to previous research, we use proxy variables for estimating BC concentrations instead of measuring BC directly. We also incorporate sensor calibration as part of the estimation process to improve performance, and consider a broader range of input variables. Our work extends the literature by providing a new affordable approach for BC level estimation that integrates low-cost sensor calibration with proxy modeling. Our experiments demonstrate that this combination of techniques helps to improve the accuracy of BC estimates significantly.

III. SENSOR MEASUREMENTS AND SYSTEM IMPLEMENTATION
The focus of our research is on providing an affordable, accurate, and portable solution for estimating BC concentrations. We accomplish this by combining modeling that uses proxy variables and intelligent sensor calibration. In this section we describe the measurements that we use to develop and evaluate our methodology (III-A and III-B) and detail a prototype implementation of our system (III-C).

A. Reference sensing stations
We develop our BC proxy models using air quality data extracted from two high-quality measurement stations, namely the SMEAR III and Mäkelänkatu stations. The SMEAR III station is located in a suburban area, in an open front yard whose surroundings include buildings, car parking, roads, and vegetation, whereas the Mäkelänkatu station is located in a street canyon just beside Mäkelänkatu street. Using the measurements from these locations with different air pollution profiles enables developing proxies for estimating the BC concentrations that can work in different environments, and thus helps to generalize our BC proxy model.
The high-quality reference stations are equipped with accurate professional-grade sensors measuring important pollutants and environmental factors. The important pollutants mainly include PM and gases. Environmental factors include wind direction (WD), wind speed (WS), pressure (P), relative humidity (RH), temperature (T), and, depending on the sensor unit, other related measurements. The sensor types and corresponding measured variables from these two reference stations and the low-cost sensor packages (LCPs) are presented in Table I. The reference measurements are used to develop the BC proxy model and to explore the feasibility of estimating BC levels from proxy variables using machine learning methods. The two reference stations are described below.

SMEAR III:
The reference station is operated by the Institute for Atmospheric and Earth System Research (INAR) and is located in the Kumpula Campus area of the University of Helsinki, Finland. The station is located in an open front yard about 150 m from a main street in the Kumpula district, about 4 kilometers north-east of the Helsinki city center [58]. The station is intended for research and scientific exploration, and it is designed to measure the relationship between forest and atmosphere in the boreal climate zone [59]. This site is categorized as a semi-urban area, with a distinct surface covered with buildings, roads, and vegetation areas. The station consists of high-quality sensors mounted on a 31 m tall tower, with its base located on a rocky hill at 26 m above sea level. Its sensors can measure PM, gases, and meteorological and radiation variables.

Mäkelänkatu station:
The reference station is located in the Mäkelänkatu district in Helsinki and is operated by the Helsinki Region Environmental Services Authority (HSY). Mäkelänkatu is one of the main streets leading to the city center. The street is 42 m wide and lined with apartment buildings. It consists of six lanes, two tramlines, two rows of trees, and two pavements. Mäkelänkatu is one of the arterial roads of the city, used every day by different kinds of vehicles such as cars, buses, trams, and trucks, which often cause traffic congestion [60] and, in turn, high levels of PM2.5 and BC pollution. Traffic is especially heavy during rush hours, at 8 a.m. and at 5 p.m., and it is the main source of BC on this street. This is the main reason that the reference station is placed in the vicinity of the street and why measuring air quality there is of interest. The sensing station consists of a container equipped with standard air quality measurement instruments. Most of the inlets for the measuring devices are located on top of the container, at a height of approximately 2.8 m above ground level.
[...] System on Module (SoM), which is powered with a 3500 mAh battery and enclosed in a 3D-printed case made of ESD-PETG filament. Typical battery life before recharging via the micro USB interface is about 26 hours. Each LCP is connected to a mobile phone via Bluetooth Low Energy (BLE) for transmitting the measured data to the back-end (server layer). The mobile phones are connected to the server through the 4G network or Wi-Fi. Each LCP reports measurements periodically. The reported readings include T, RH, P, carbon monoxide (CO), nitrogen dioxide (NO2), ozone (O3), PM of various masses and sizes, the amount of ultraviolet (UV) light, the GPS position, and the timestamp [61].
To collect data for calibration and validation purposes, we install four LCPs near high-quality reference stations. By installing four LCPs close to each other in the same environment, we aim to verify sensor consistency and to tolerate sensor failures. Sensor consistency means that the LCPs generate similar measurements while operating in the same environment [24]. In addition, installing multiple LCPs allows us to recover from sensor failures: if one sensor stops operating due to power drainage or other reasons, the other LCPs continue measuring. As shown in Figure 1b, we install the LCPs in pairs facing each other under a rain cover, mounted onto SMEAR III. The LCP presented in Figure 1a is a portable low-cost sensor that can be attached to a citizen's bag for tracking air pollutant measurements. The microsensors installed inside the LCP in Figure 1a are the same as those inside the four LCPs installed near the high-quality reference station in Figure 1b, except that the LCPs in Figure 1b are powered from the electricity grid. The LCPs are set up to transmit their readings every 2 minutes, and the LCP measurement campaign was carried out intermittently between November 2019 and February 2020. In the main experiment, we demonstrate how calibration and proxy models can operate together using measurements from low-cost sensors for sensor calibration and BC estimation.

C. System Implementation
The overall methodology has been implemented following the framework shown in Figure 2. The framework consists of three layers: a sensing layer, a server layer, and an application layer. In the sensing layer, low-cost mini-sensors and stationary reference station sensors continuously measure the air quality and transmit the measurements to the server layer (i.e., the backend). The difference between mobile and stationary mini-sensors is that the portable mini-sensors are connected to mobile phones for transmitting the air quality data to the server (deployed on the edge or in the cloud), while the mini-sensors at reference stations transmit the air quality data directly to the server using available 4G/5G connections through a REST Application Programming Interface (API). The backend links the measurements with those collected from a professional-grade reference station and is responsible for learning the calibration and proxy models used by our approach. Data transmission can take place through any type of wireless medium, including short-range communication systems, e.g., Wi-Fi and BLE, or long-range cellular or IoT communication protocols [62]. The data transmission interval can be configured to match the desired update rate. Our current deployment uses a 2-minute cycle for sampling air quality measurements, and these are sent to a server hourly.
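The sampling and upload cadence described above (2-minute samples, hourly uploads) can be illustrated with a small buffering sketch. This is only an illustration under stated assumptions: the `UploadBuffer` class, its field names, and the synthetic readings are hypothetical and are not taken from the deployed system, whose transport code is not described here.

```python
from dataclasses import dataclass, field

SAMPLE_INTERVAL_S = 120    # 2-minute sampling cycle (from the text)
UPLOAD_INTERVAL_S = 3600   # hourly upload to the server (from the text)


@dataclass
class UploadBuffer:
    """Hypothetical helper: collects readings and releases them in hourly batches."""
    last_upload_ts: int = 0
    pending: list = field(default_factory=list)

    def add(self, timestamp: int, reading: dict) -> list:
        """Store one reading; return the batch if an upload is due, else an empty list."""
        self.pending.append({"timestamp": timestamp, **reading})
        if timestamp - self.last_upload_ts >= UPLOAD_INTERVAL_S:
            batch, self.pending = self.pending, []
            self.last_upload_ts = timestamp
            return batch
        return []


def simulate(hours: int) -> list:
    """Feed synthetic readings at the sampling interval and record each batch size."""
    buffer = UploadBuffer()
    batch_sizes = []
    for ts in range(SAMPLE_INTERVAL_S, hours * 3600 + 1, SAMPLE_INTERVAL_S):
        batch = buffer.add(ts, {"T": 20.0, "RH": 50.0})  # placeholder values
        if batch:
            batch_sizes.append(len(batch))
    return batch_sizes


batch_sizes = simulate(hours=2)
```

With these two intervals, each hourly upload carries 30 samples per device, which gives a rough sense of the payload the server layer has to preprocess and synchronize.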
The server layer is responsible for processing the data. First, the air quality data are preprocessed and synchronized, and the quality of the data is checked. Next, the processing pipeline calibrates the measurements of the low-cost sensors. The calibrated data are used both on the portable device to provide information about the air quality and as input to the proxy modeling to estimate BC levels. The BC proxy estimates are then transmitted to the application layer, where they can be accessed by the end user. Hence, end users can not only observe accurate concentrations of the pollutants measured by the portable device but also access information about their personal BC exposure. The calibration models are currently deployed on portable low-cost sensors and used in two projects to correct the measured pollutant concentrations. Finally, on the application layer, pollution hotspot maps are created from the data of the portable devices, and these can be used to support further applications, such as route planning.

IV. BC ESTIMATION WITH PROXY VARIABLES
We first develop the proxy modeling approach and demonstrate its overall feasibility using data from two reference stations. Specifically, in each location, we estimate the BC levels using the measurements of other pollutants and environmental variables in the same location. Testing on two locations with different characteristics is used to demonstrate the generality of the approach. We then build on this result to develop the proxy modeling and calibration for low-cost sensors, and demonstrate the feasibility and benefits of our overall approach in the subsequent sections.

A. Proxy Estimation Pipeline
The proxy modeling pipeline follows a traditional machine learning pipeline. First, the preprocessing step uses a multivariate imputation by chained equations (MICE) imputer [63] to fill in missing values. In the experiments, the imputer is trained separately for each training set split, and the same imputer is then used for both the training and testing data. Next, the features are scaled using standardization. Similarly to the imputer, the scaler is learned separately for each training data split. Training the imputer and the scaler only on the training data, and separately for each fold, prevents data leakage and keeps bias to a minimum.
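The leakage-free preprocessing described above can be sketched with a scikit-learn pipeline that is refit on every training split. This is a minimal sketch under assumptions, not the exact implementation used in the experiments: `IterativeImputer` is scikit-learn's chained-equation imputer and stands in for the MICE imputer, the regressor is a plain linear model, and the data are synthetic.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cross_validated_mae(X, y, n_splits=10):
    """Return the per-fold MAE, refitting imputer and scaler on each training split."""
    fold_mae = []
    # KFold without shuffling keeps each fold a contiguous block of samples.
    for train_idx, test_idx in KFold(n_splits=n_splits).split(X):
        model = make_pipeline(
            IterativeImputer(random_state=0),  # chained-equation imputation
            StandardScaler(),                  # standardization
            LinearRegression(),                # placeholder regressor (assumption)
        )
        model.fit(X[train_idx], y[train_idx])  # statistics come from training data only
        pred = model.predict(X[test_idx])      # the same fitted steps transform test data
        fold_mae.append(float(np.mean(np.abs(pred - y[test_idx]))))
    return fold_mae


# Synthetic stand-in data: four proxy features with about 5% missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([0.8, -0.3, 0.5, 0.1]) + rng.normal(scale=0.1, size=500)
X[rng.random(X.shape) < 0.05] = np.nan
scores = cross_validated_mae(X, y)
```

Wrapping the imputer and scaler in one pipeline object means that calling `fit` on a training split can never see test-split statistics, which is the property the text relies on.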
After preprocessing, we perform feature selection using recursive feature elimination (RFE) on the whole set of features, until only one feature is left. In our experiments, we train the models using the highest ranked feature, then add the rest of the features one by one following the ranking, computing the performance for each set of features. This procedure is performed separately for every training split. For estimating BC concentrations, we test machine learning models that have shown robust performance for other pollutants: MLR, SVR, decision tree regression (DTR), AdaBoost regression (ABR), gradient boosting regression (GBR), RFR, and MLP. All the machine learning models used in our work are regression models. Since we do not expect variations in model architecture to yield significant performance differences, we use the default settings provided by the scikit-learn machine learning library [64].
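The ranking-then-growing procedure can be sketched as follows. The sketch assumes a linear model as the RFE estimator and as the evaluated regressor, and uses synthetic data; the estimator used for ranking in the experiments is not stated in the text, so this choice is purely illustrative.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error


def incremental_feature_scores(X_train, y_train, X_test, y_test):
    """Rank features with RFE, then score models on growing feature sets."""
    # Eliminating down to one feature makes ranking_ a full ordering (1 = best).
    rfe = RFE(LinearRegression(), n_features_to_select=1).fit(X_train, y_train)
    order = np.argsort(rfe.ranking_)
    scores = []
    for k in range(1, len(order) + 1):
        cols = order[:k]  # top-k features according to the ranking
        model = LinearRegression().fit(X_train[:, cols], y_train)
        mae = mean_absolute_error(y_test, model.predict(X_test[:, cols]))
        scores.append((cols.tolist(), mae))
    return scores


# Synthetic example: feature 0 matters most, feature 2 is irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=400)
results = incremental_feature_scores(X[:300], y[:300], X[300:], y[300:])
```

Because the ranking is computed from the training split only, repeating this per fold can produce a different feature order in each fold.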

B. Performance Evaluation
We evaluate the performance of the different models using 10-fold cross-validation. The folds are generated using the KFold function in the scikit-learn library in Python, and the dataset is divided into equally sized folds according to time, reflecting the time-series nature of the data. We consider common error measures, namely root mean squared error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), mean bias error (MBE), and R-squared ($R^{2}$), listed in Table II, where $y_i$, $\hat{y}_i$, and $\bar{y}$ represent the target value, the predicted value, and the mean of the observed target values, respectively, and $n$ is the number of samples. We use multiple measures because different measures can rank models differently, as they focus on different aspects of performance [28]. RMSE emphasizes outliers, MAE focuses on the average performance, MAPE expresses the error in proportion to the target values, MBE measures bias, and $R^{2}$ measures the correlation. We use MAE as the cost function for training the models, as we are interested in a model that approximates the target data well on average without focusing on outliers. For every error measure, we obtain the overall score by averaging the scores obtained in every fold. We also plot target diagrams, which allow us to quickly compare the performance of different models visually, in line with best practice in air quality research [65], [66]. Target diagrams are plotted using the centered root mean squared error (CRMSE) and MBE, both divided by the standard deviation of the target values. CRMSE is computed similarly to RMSE, but after subtracting from the predicted and target values their respective means.
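For reference, the error measures and the target-diagram coordinates named above can be computed as in the sketch below. Table II is not reproduced here, so the exact conventions are assumptions: MBE is taken as the mean of predicted minus target, MAPE is expressed in percent, $R^{2}$ is computed as the coefficient of determination, and the CRMSE term is placed on the horizontal axis of the target diagram.

```python
import math


def error_measures(target, predicted):
    """Compute RMSE, MAE, MAPE, MBE, R2, CRMSE and normalized target-diagram coordinates."""
    n = len(target)
    errors = [p - t for t, p in zip(target, predicted)]
    mean_t = sum(target) / n
    mean_p = sum(predicted) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    # CRMSE: RMSE after removing the respective means from predictions and targets.
    crmse = math.sqrt(
        sum(((p - mean_p) - (t - mean_t)) ** 2 for t, p in zip(target, predicted)) / n
    )
    mbe = sum(errors) / n
    sigma_t = math.sqrt(ss_tot / n)  # standard deviation of the target values
    return {
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": 100.0 * sum(abs(e / t) for e, t in zip(errors, target)) / n,
        "MBE": mbe,
        "R2": 1.0 - ss_res / ss_tot,
        "CRMSE": crmse,
        "target_x": crmse / sigma_t,  # normalized CRMSE
        "target_y": mbe / sigma_t,    # normalized bias
    }


# A prediction with a constant offset: all of the error is bias, none is CRMSE.
measures = error_measures([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
```

Under these definitions RMSE² = CRMSE² + MBE², so the distance of a point from the origin of the target diagram is the RMSE normalized by the standard deviation of the target.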

C. Comparing the models
The results of proxy estimation are shown in Table III. For simplicity, Station 1 and Station 2 denote the SMEAR III and Mäkelänkatu reference stations, respectively. On the Station 2 measurements, MLP performs better than MLR, whereas for Station 1 the reverse is true. The Station 1 measurements are from a low-density urban area and from a high altitude above traffic, which results in the relationship between pollutants being simple. In contrast, the Station 2 measurements come from an area with high traffic and from a station that is closer to the street level, which results in a more complex relationship between the variables. SVR has a relatively low error, but it suffers from high bias. GBR and RFR are accurate with low bias. Overall, these two models have very similar performance across the two reference stations. ABR has by far the worst performance. This is due to the regression models used by ABR being too simple to capture the complex relationship between variables; hence, models that can simultaneously capture relationships between multiple variables are needed for BC estimation.
The target diagrams for Station 1 and Station 2 are shown in Figure 3. In the target diagram for Station 1, every model except ABR is inside the circle, and all the models inside the circle are very close to each other. In the target diagram for Station 2, the three points that represent GBR, RFR, and MLP overlap. RFR has the lowest MAPE, but it is the most biased of those three models. MLP appears to be the best model, since it has the lowest RMSE and MAE, suggesting good performance on both average values and outliers. It also has the lowest bias of the three and the highest overall correlation with the target variable. Overall, MLR is the least biased model, the best model for Station 1 seems to be GBR, and the best model for Station 2 seems to be MLP.
We also evaluated the performance of using different sets of features. The results are shown in Table IV. [...] is different for different folds. Hence, in practice using all available features is sufficient, as any potential improvements in performance come at the cost of generality. For this reason, in the remainder of the paper we use all input features for proxy estimation.
Overall, the results show that proxy variables can be used to estimate BC levels. The correlation in the estimates is consistently high, suggesting the proxy variables capture the overall trend in the measurements. The absolute error in the estimates is slightly higher, as evidenced by the MAE values. From the target diagrams, we can observe that this is due to a bias in the estimates, i.e., there is a systematic error in the estimates. This result further motivates the use of calibration as part of the pipeline, as it enables eliminating the bias in the estimates in addition to overcoming inaccuracy in the low-cost sensor measurements.

V. BC ESTIMATION ON LOW COST SENSORS
The results in the previous section demonstrated that proxy variables can be used to estimate BC concentrations with reasonable accuracy, at least if we use data from reference stations to train the models. In this section we further demonstrate that these results generalize to low-cost sensors.

A. Estimation pipeline
The BC estimation pipeline operates similarly to the proxy variable estimation, first using MICE to impute missing values, followed by feature scaling. As input features for calibration we use all available features from the low-cost sensors: CO, O3, all available PM measurements of different sizes, and weather measurements: T, RH, and P.

B. Experiments
We conduct our experiment on the SMEAR III reference station and the LCPs installed nearby (Figure 1b). There are four LCP sensors, marked LCP1, LCP2, LCP3, and LCP4. The correlations between those four LCP sensors are presented in Table V. The LCPs overall have very high consistency, but there were periodic hardware failures on three of the four sensors that affected the O3 measurements. The O3 measurements of LCP3 are the most consistent and best match the reference station, and hence we use this low-cost sensor as the basis for the experiment and evaluation. Note that there are also periods where some data from the reference station (or the LCPs) is not available due to maintenance, hardware failures, connectivity failures, and other factors. Smaller gaps in the data are handled using imputation, whereas longer gaps have been excluded from the analysis.
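The pairwise consistency check between co-located LCPs can be illustrated with a plain Pearson correlation that skips samples where either device has a gap. This is a sketch under assumptions: the text does not state which correlation coefficient underlies Table V, and the readings below are made-up placeholder values rather than measured data.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation over the samples where both series have a value."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


def consistency_table(readings):
    """Return the correlation for every pair of devices, keyed by device names."""
    return {
        (a, b): pearson(readings[a], readings[b])
        for a, b in combinations(sorted(readings), 2)
    }


# Placeholder series for one variable; None marks a missing sample.
readings = {
    "LCP1": [1.0, 2.0, 3.0, 4.0],
    "LCP2": [2.0, 4.0, 6.0, 8.0],
    "LCP3": [4.0, 3.0, None, 1.0],
}
table = consistency_table(readings)
```

A device whose correlations with all the others drop for a single variable, as happened for O3 on three of the four LCPs, stands out immediately in such a table.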
The evaluation uses a similar process as in the previous section, i.e., separate training and testing splits and a wide range of error measures. We validate the calibration pipeline and the overall BC estimation pipeline using a single train-test split (50/50) instead of cross-validation because less data is available, for the reasons described above. For a short period of time (around a week), the ground truth BC measurements are missing due to a hardware failure that occurred during a holiday period. We impute the values for these missing days by training a proxy model on the reference station measurements only and using the output of this model on the measurements from the low-cost sensors as the BC estimate for this period. Air quality measurements are heavily correlated in time [28], and hence removing these measurements would result in a discontinuity that breaks the models trained on the data. For validating the calibration results, we can only consider periods where the calibrated sensor values of the proxy variables can be compared to the reference values. As shown in Section IV, most of the co-pollutants are only available intermittently. The use of a 50/50 split ensures there are sufficient measurements for training and testing the calibration models for all the variables considered in the evaluation. The reference stations we have used are among the leading observation stations worldwide, and hence missing data is unfortunately a reality that any data modeling approach must address, which is also why we incorporate it into our evaluation. As error metrics we use the same measures as in Section IV, namely RMSE, MAE, MAPE, MBE, and $R^{2}$, and as models we consider MLR, SVR, DTR, ABR, GBR, RFR, and MLP.
To obtain a baseline for comparison, we also evaluate the performance of a model trained on LCP3 against a model trained on SMEAR III. To make the models as similar as possible, instead of using all available features from the reference station, we select features as similar as possible to the ones we select from LCP3, namely CO, O3, PM2.5, PM10, T, RH, and P. To ensure consistency in the comparison of models trained on SMEAR III and LCP3, we use the same train-test split for every model. This means that the timestamp indices of the training data used to train a Station 1 model are the same as the indices in the training data used to train an LCP3 model, and the same is true for the test data.
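One way to guarantee identical timestamp indices for both data sources is to derive the split from the shared timestamps first and only then select rows from each source. The sketch below assumes timestamp-keyed dictionaries and a chronological 50/50 cut; the text specifies the 50/50 ratio but not how the split point is chosen, and the helper name `matched_split` is hypothetical.

```python
def matched_split(reference_rows, lcp_rows, train_fraction=0.5):
    """Split two timestamp-keyed datasets on exactly the same timestamps."""
    # Only timestamps present in both sources can be compared.
    shared = sorted(set(reference_rows) & set(lcp_rows))
    cut = int(len(shared) * train_fraction)
    train_ts, test_ts = shared[:cut], shared[cut:]

    def pick(rows, stamps):
        return [rows[ts] for ts in stamps]

    return {
        "train_ts": train_ts,
        "test_ts": test_ts,
        "reference": (pick(reference_rows, train_ts), pick(reference_rows, test_ts)),
        "lcp": (pick(lcp_rows, train_ts), pick(lcp_rows, test_ts)),
    }


# Placeholder rows keyed by timestamp (seconds); the LCP is missing t=240.
reference_rows = {0: "ref0", 120: "ref1", 240: "ref2", 360: "ref3", 480: "ref4"}
lcp_rows = {0: "lcp0", 120: "lcp1", 360: "lcp3", 480: "lcp4"}
split = matched_split(reference_rows, lcp_rows)
```

Deriving both training sets from one list of timestamps keeps the reference-trained and LCP-trained models comparable sample by sample.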

C. Model Comparison
Table VI compares the BC proxy models trained on the low-cost sensor data to those trained on the reference sensor data. The results are generally very similar, and the best performing models with the lowest bias are GBR and RFR. When trained on LCP3 data, the performance of these two models is close to the performance obtained with reference station data (i.e., SMEAR III). The MAE is slightly higher for the low-cost sensors and the correlation is smaller, due to noise in the measurements. Nevertheless, the same general trend remains and the differences in performance are not significant. Figure 4 further demonstrates this point by comparing the estimated BC levels between the models trained on LCP3 data and reference station data. The general trend is accurately captured, and both models are capable of distinguishing between harmful and non-harmful levels of BC. Indeed, the main difference between the two models occurs during the highest BC levels, where the lower sensitivity of the low-cost sensors may result in underestimating the overall level of BC.
We also explore the potential of transferring the models trained on one low-cost sensor to other low-cost devices. The sensors often vary across devices, and in practice it is not possible to use every device to train the proxy model, as this would require co-locating each of them next to the reference station for a sufficiently long period of time. If calibration transfer is possible, only a small set of sensors needs to be placed close to a reference station, and the other devices can simply use the model trained on these measurements [13]. To test this, we evaluate the models trained on LCP3 against the measurements obtained on the other low-cost sensors. As mentioned, the O3 sensor on these devices had some hardware issues, and hence O3 was excluded from the model. The experiment is performed using the two best models: GBR and RFR.
Figure 5 and Figure 6 show the results for the GBR and RFR models. The target diagrams align for all devices and all points are inside the target. This suggests that a model trained on an LCP works well on other LCPs without the need to retrain it.

SENSORS
The previous section demonstrates that BC concentrations can be estimated with reasonable accuracy from proxy variables and on low-cost sensors. As the final step we demonstrate how calibrating the low-cost sensor measurements that are

A. Calibration of proxy variables
As calibration targets, we select variables from Station 1 that correspond to variables available from our LCPs and that have a low percentage of missing values in the period of the LCP measurement campaign, namely T, RH, P, CO, PM2.5, and PM10. We remove the samples where values are missing from these target variables, as imputing them would result in data leakage and bias. We perform variable scaling as in the previous parts of our study. We test proxy calibration using GBR and RFR, since these have consistently been the best performing models. For simplicity, we performed the experiments of this part of our study on a single low-cost sensor only, as the results of the previous section demonstrated that models transfer across low-cost sensors.
In Table VII, we show the results obtained by calibrating the low-cost sensor measurements to predict pollutants to be used as proxy variables. For T we do not provide MAPE and MAE, as temperature is measured using an interval scale (°C), on which proportions do not make sense. As shown in Table VII, RFR performs better than GBR for every variable; we therefore use RFR for the remaining experiments in this part of our study. The errors are low across the board and the correlation is very high, suggesting that the calibration performs well on the measurements.
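The remark about interval scales can be illustrated with a short worked example. The sketch below uses hypothetical temperature readings and shows that a percentage error depends on where the zero of the scale is placed: the same prediction errors give very different MAPE values in °C and in kelvin. The function name and the numbers are illustrative assumptions and are not taken from Table VII.

```python
def mape(observed, predicted):
    """Mean absolute percentage error, in percent."""
    ratios = [abs((p - o) / o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


def to_kelvin(values_celsius):
    """Shift Celsius values to the kelvin scale (same unit size, different zero)."""
    return [v + 273.15 for v in values_celsius]


# Hypothetical winter temperatures (degrees Celsius); every prediction is off by 1 degree.
observed_c = [-5.0, 0.5, 2.0, 10.0]
predicted_c = [-4.0, 1.5, 3.0, 11.0]

mape_celsius = mape(observed_c, predicted_c)
mape_kelvin = mape(to_kelvin(observed_c), to_kelvin(predicted_c))

# Identical errors, but the percentage changes with the arbitrary zero point,
# and it grows without bound as an observed value approaches 0 degrees Celsius.
print(f"MAPE in Celsius: {mape_celsius:.2f} %")
print(f"MAPE in kelvin:  {mape_kelvin:.2f} %")
```

The Celsius value is dominated by the reading closest to 0 °C, which is why ratio-based scores are not informative for temperature during a winter campaign.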

B. Using calibrated proxy variables to improve BC estimation
Using low-cost sensor measurements as proxies to estimate BC levels is beneficial, as low-cost sensors help to increase the spatial and temporal resolution of information due to their higher deployment density. To compare the performance of estimating BC concentrations with and without calibrated proxy variables, we test multiple combinations of features and calibrated proxy variables using an RFR model. In Table VIII, we show the results of selected combinations. The best results are obtained with a combination of the regular variables plus the calibrated proxy features T, RH, P, and PM$_{10}$. The second best results are obtained by adding calibrated PM$_{2.5}$. PM$_{10}$ is chosen over PM$_{2.5}$ plausibly because the measured BC underwent an aging process at a high rate, which increases its coating thickness and hence results in a larger diameter [67]. We can also notice that the proxy variable CO does not improve the results and even worsens them compared to using LCP features only. This could be because CO emissions from vehicular traffic have decreased to a background level in Helsinki due to three-way catalysts in vehicles. Values close to the background level are not beneficial in predicting BC in this study [68]. Using calibrated proxy features only is worse than using a combination of LCP features and calibrated proxy features together.
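The combination search described above can be sketched as follows. This is only an illustration of how candidate feature sets might be enumerated and ranked, not the procedure actually used to produce Table VIII: the column names are hypothetical, and `score_fn` stands for any user-supplied function that trains a model (for example an RFR) on the given columns and returns a held-out MAE.

```python
from itertools import combinations

# Hypothetical column names: raw LCP measurements and calibrated proxy outputs.
LCP_FEATURES = ("T", "RH", "P", "CO", "NO2", "PM2.5", "PM10")
CALIBRATED = ("T_cal", "RH_cal", "P_cal", "CO_cal", "PM2.5_cal", "PM10_cal")


def candidate_feature_sets(raw, calibrated):
    """Enumerate the feature sets to compare:
    raw features only, raw features plus every non-empty subset of the
    calibrated proxies, and calibrated proxies only."""
    raw, calibrated = tuple(raw), tuple(calibrated)
    candidates = [raw]
    for size in range(1, len(calibrated) + 1):
        for subset in combinations(calibrated, size):
            candidates.append(raw + subset)
    candidates.append(calibrated)
    return candidates


def rank_feature_sets(candidates, score_fn):
    """Score every candidate with score_fn (lower is better, e.g. MAE)
    and return (score, features) pairs sorted from best to worst."""
    scored = [(score_fn(features), features) for features in candidates]
    return sorted(scored, key=lambda pair: pair[0])


candidates = candidate_feature_sets(LCP_FEATURES, CALIBRATED)
print(len(candidates))
```

Exhaustive enumeration is practical here because the number of calibrated proxies is small; with n proxies the search covers 2^n + 1 candidate sets.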

VII. DISCUSSION AND ROADMAP
Firstly, in Section IV, we have shown that BC concentrations can be reliably estimated from proxy variables, and we have identified the best performing machine learning models for this task, which are GBR, RFR, and MLP. Secondly, in Section V, we have shown that estimating BC concentration using measurements from low-cost air quality sensors as proxy variables is also feasible and that, similarly to the results of the first perspective, the best models are GBR and RFR, with RFR the best overall. We have also compared the results obtained with low-cost air quality sensors to the results obtained from Station 1 with the same data, and seen that a model built on data from low-cost air quality sensors has a performance close to that of the same model built on high-quality data. This result is particularly significant, as it suggests that portable devices carried by citizens could supplement professional-grade stations and offer detailed insights into the BC concentrations in urban environments. We have also shown that a model trained on one low-cost sensor is transferable to other sensors without the need to retrain it. Thirdly, in Section VI, we have shown that prior calibration of low-cost air quality sensors, and adding the calibrated variables to the main model for estimating BC concentration, further helps to improve performance.
Naturally, our study also has some limitations. First, as the results in Section IV indicate, low-cost sensing components are prone to failures. In our case, the O$_3$ sensors of three low-cost devices failed and their measurements needed to be removed from the data. In actual deployments, it would be essential to have mechanisms that can automatically validate the measurements and detect such failures, at least without needing to manually inspect the values. Second, there were also some limitations in terms of analysis. We could not perform imputation on Station 1 because it would lead to bias in the calibration of the LCPs, and we could not calibrate O$_3$ because too many values were missing from the target variable. Nevertheless, the results were consistent across all combinations that were tested. Indeed, for all imputed combinations and for those cases where O$_3$ values were available, the results were in line with the results of other variables and data sources, suggesting that the results are robust. Another limitation is the somewhat short measurement campaign for the low-cost sensors. Ideally, a measurement campaign should last at least a year, so that measurements span every season and the sensor can be tested in every condition it can encounter outdoors. However, we only had access to sparse data spanning three months during winter for this study. Ensuring sufficient retention for sensor use is a critical issue for low-cost sensors, and any measurement campaign is likely to suffer from the same issue of sparsity and limited data as our measurements. Thus, the limitations in our data reflect characteristics of real-world datasets.

VIII. CONCLUSION
We contributed a novel, affordable and portable solution for estimating BC concentrations using low-cost air quality monitoring devices and machine learning techniques. Our approach uses other pollutants and environmental variables as proxies to estimate the overall BC concentration. As low-cost sensors tend to suffer from noisy measurements and inaccuracies, we further incorporate sensor calibration to improve the quality of the measurements that are used as inputs for proxy modeling, enabling robust and accurate modeling on low-cost sensors. We conducted experiments using a combination of ground-truth measurements from a high-quality measurement station and low-cost sensor measurements from two locations with different urban characteristics. Our results showed that a model trained on low-cost data from sensors measuring PM of various sizes, CO, NO$_2$, O$_3$, and weather variables (T, RH and P) approximates the true concentration of BC well, almost as accurately as a similar model trained on high-quality data from an atmospheric station. The best performing machine learning models are GBR and RFR. The results also show that the performance of BC estimates can be improved by adding calibrated proxy variables as features, i.e., the output of models that calibrate low-cost air quality sensors to predict pollutants that correlate with BC. Overall, our research offers a new way to estimate BC using low-cost air quality sensors. This allows BC to be monitored more densely than with conventional methods, which in turn allows better estimation of the health risks faced by individuals, the generation of high-resolution pollution maps, and the provision of detailed information to support policy making.
X. Liu*, F. Concas, N. H. Motlagh, S. Varjonen, P. Nurmi and S. Tarkoma are with the Department of Computer Science, University of Helsinki, Finland (Corresponding author: xiaoli.liu@helsinki.fi).
M. A. Zaidan is with the Joint International Research Laboratory of Atmospheric & Earth System Sciences, Nanjing University, China, the Institute for Atmospheric & Earth System Research (INAR), University of Helsinki, Finland, and Nanjing Atmospheric Environment & Green Development Research Institute (NAGR), China.
H. Timonen is with the Finnish Meteorological Institute, Helsinki, Finland.
P. L. Fung, T. Hussein, T. Petäjä and M. Kulmala are with the Institute for Atmospheric & Earth System Research (INAR), University of Helsinki, Finland.

Fig. 2: Framework for the implementation of a BC proxy.

Fig. 3: Target diagrams showing the performance of selected machine learning models for estimating BC concentration using proxy variables on high-quality data.

Fig. 6: Target diagrams showing the performance of an RFR model, trained on one LCP and tested on every LCP.

TABLE I: Available instruments measuring variables used in BC proxy development and the available LCP variables.

TABLE II: Error measurements used for performance evaluation.

TABLE III: Performance of selected machine learning models for the estimation of BC concentration. RMSE, MAE and MBE values are expressed in ng/m$^3$. Results have been obtained with 10-fold cross-validation on the whole available data. All the available features are used.

TABLE IV: Results of feature selection for each cross-validation fold, after training two selected models on data from Mäkelänkatu. All the performance scores are calculated using MAE, expressed in ng/m$^3$.

TABLE V: Correlations between LCP sensors. Each row indicates the variable on which a correlation is computed; each column indicates the pair of LCP sensors on which the correlation is computed.

TABLE VI: Performance of selected machine learning models for the estimation of BC concentration using proxy variables from low-cost air quality sensors. RMSE, MAE and MBE values are expressed in ng/m$^3$. Results have been obtained with a 50-50 validation split with random sampling on the available data. All available features are used.

Fig. 4: Target BC values vs. values predicted by a BC proxy model trained on LCP 3.

Fig. 5: Target diagrams showing the performance of a GBR model, trained on one LCP and tested on every LCP.

TABLE VII: Calibration of proxy variables on LCP 3. MAPE and MBE are missing from T as it is measured using an interval scale (°C), on which ratios and percentages do not make sense.

TABLE VIII: Estimation of BC on LCP 3 with an RFR model, using combinations of LCP features and calibrated proxy variables. Only results of noteworthy combinations are shown.