Mining Public Datasets for Modeling Intra-City PM2.5...

Mining Public Datasets for Modeling Intra-City PM2.5Concentrations at a Fine Spatial Resolution

Yijun LinSpatial Sciences Institute

University of Southern [email protected]

Fan PanSpatial Sciences Institute


Yao-Yi ChiangSpatial Sciences Institute


Dimitris StripelisInformation Sciences Institute


Jose Luis AmbiteInformation Sciences Institute


Sandrah P. EckelDepartment of Preventive MedicineUniversity of Southern California

[email protected]

Rima HabreDepartment of Preventive MedicineUniversity of Southern California

[email protected]

ABSTRACTAir quality models are important for studying the impact of airpollutant on health conditions at a �ne spatiotemporal scale. Exist-ing work typically relies on area-speci�c, expert-selected pollutionemissions, dispersion and transport features (e.g., meteorology) forbuilding the model for each combination of study areas, pollutanttypes, and spatiotemporal scales. In this paper, we present a datamining approach that utilizes publicly available OpenStreetMap(OSM) data to automatically generate an air quality model for theconcentrations of �ne particulate ma�er less than 2.5 m in aerody-namic diameter at various temporal scales. Our experiment showsthat our (domain-) expert-free model could generate accurate PM2.5concentration estimates, which can be used to improve air qual-ity models that traditionally rely on expert-selected input. Ourapproach also quanti�es the impact on air quality from a vari-ety of geographic features (i.e., how various types of geographicfeatures such as parking lots and commercial buildings a�ect airquality and from what distance) representing mobile, stationaryand area natural and anthropogenic air pollution sources. �isapproach is particularly important for enabling the construction ofcontext-speci�c spatiotemporal models of air pollution, allowinginvestigations of the impact of air pollution exposures on sensitivepopulations such as children with asthma at scale.

CCS CONCEPTS•Information systems →Spatial-temporal systems;

KEYWORDSPM2.5 Concentration, Air �ality Model, Geospatial Data Mining

ACM Reference format:Yijun Lin, Fan Pan, Yao-Yi Chiang, Dimitris Stripelis, Jose Luis Ambite, San-drah P. Eckel, and Rima Habre. 2017. Mining Public Datasets for ModelingIntra-City PM2.5 Concentrations at a Fine Spatial Resolution. In Proceedingsof ACM SIGSPATIAL, Redondo Beach, CA, USA, November, 2017 (SIGSPA-TIAL’17), 10 pages.DOI: 10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTIONFine particulate ma�er (PM2.5) consists of particles less than 2.5m in aerodynamic diameter that once inhaled can penetrate therespirable region of the lungs and contribute to respiratory andcardiovascular disease. Typical primary sources of contributionsto �ne particulate ma�er include stationary and moving vehicleexhausts, burning sources (e.g., wood-burning stoves and wild�res),re�neries, and power plants. Secondary PM2.5 is also formed as aresult of photochemical reactions in the atmosphere in the pres-ence of precursor gases and solar radiation. Epidemiological studieshave shown associations between exposure to PM2.5 and varioushealth conditions, including lung and respiratory disease [21], suchas asthma [5, 6]. In the 1993 landmark air pollution “Harvard SixCities Study” [4] and other recent studies [16], researchers reportedassociations between the levels of exposure to �ne particulate con-centrations and the risk of mortality and morbidity for cities allover the world. As a result of this scienti�c evidence and severalother similar studies, many countries have set national ambientmass-based air quality standards for PM2.5. In the United States,the Environmental Protection Agency (US EPA) set PM2.5 stan-dards and established the air quality index (AQI) to communicaterelative health risk levels of current pollution levels compared tothe standard, which is converted by PM2.5 concentration. �e USAQI ranges from 0 to 500 and consists of six categories: “Good”,“Moderate”, “Unhealthy for Sensitive Groups”, “Unhealthy”, “VeryUnhealthy”, and “Hazardous”. From an ambient pollutant concen-tration value, one can calculate the corresponding AQI and itshealth risk category for each type of regulated air pollutant. Forexample, a 20 g/m3 PM2.5 measurement corresponds to an AQIof 68 and is in the “Moderate” category, which means that “Airquality is acceptable; however, for some pollutants there may be amoderate health concern for a very small number of people who areunusually sensitive to air pollution.” A 60 g/m3 PM2.5 measurementcorresponds to an AQI of 153 and is in the “Unhealthy” category.

In the US, the EPA’s ambient air monitoring network provideshourly PM2.5 measurements at its regulatory air monitoring sta-tions through the “Air �ality System” (AQS). �ese monitoringstations are established for regulatory purposes with strict sitingcriteria to capture regional and urban scale contributions to airpollution levels within an area. �ese air monitoring stations alsoexist in many other countries. Scientists and government agen-cies use measurement data from these stations to build and val-idate air quality models (AQMs) to explain and predict the pastand future air pollution levels for unmonitored locations (e.g.,[1, 3, 12, 14, 15, 18, 19, 23, 24, 27, 28]). Predictions from these modelscan then be used to study the associations between long-term airpollution exposure and health impact at �ner spatial scales (thansimply using the monitored data) [25, 26].

One popular approach to predicting long-term spatial variationsin air pollution levels is land-use regression (LUR) (e.g., [2, 9, 13])while more recent work uses machine learning techniques (e.g.,[3, 11, 18, 20, 22]) and big data (e.g., [27, 28]). Existing air qualitymodels typically consider expert-selected (unique) characteristicsin a neighborhood including various types of geographical fea-tures (e.g., elevation), proximity to roadways and tra�c conditions,population density, and meteorological data. �e idea is that airpollutants in “nearby” locations could be spatially autocorrelatedor demonstrate comparable concentrations at a given time. �isis because geographically proximate locations are surrounded bysimilar human-made and natural features (emissions and disper-sion pa�erns), including mountains, oceans, roads, factories, andvarious land-use types. However, building an air quality modelthat produces accurate air quality concentration predicts at a �nespatiotemporal scale to capture the intra-city air pollution surfaceis challenging because there are no universal means to de�ne andquantify location neighborhood of highest in�uence on local air qual-ity, especially across various cities and regions. Speci�cally, separatemodels require expert-selected location characteristics before themodel ��ing process to achieve the best regression or machinelearning results. (e.g., distance to the ocean has a high correlationto air quality in San Diego but not in every coastal city). �e impactof each neighboring location characteristic on air quality can varysigni�cantly across di�erent types air pollutants, time, and space.Moreover, some of the data used in previous studies can be di�cultor expensive to obtain and are not frequently available, such as�ne-scale, and real-time meteorological data and tra�c volumes.(See Section 5 for a review on related work)

�is paper presents a novel data mining approach that buildsan accurate PM2.5 model from publicly available geospatial data,OpenStreetMap (OSM), without using expert knowledge in select-ing air quality predictors. Our approach utilizes the PRISMS-DSCICinfrastructure [17] as the data integration and analytics platformto investigate the AQS data of PM2.5 concentrations and OSM data.�e PRISMS-DSCIC (Pediatric Research using Integrated SensorMonitoring Systems - Data and So�ware Coordination and Inte-gration Center) is an NIH-NIBIB (National Institutes of Health- National Institute of Biomedical Imaging and Bioengineering)funded initiative to address pediatric asthma as a chronic diseaseof childhood. PRISMS-DSCIC is responsible for collecting, storing,integrating, and analyzing real-time environmental, physiological

and behavioral data obtained from heterogeneous sensors and tra-ditional data sources to help researchers to predict and preventasthma a�acks e�ciently. Using publicly available data that have aglobal coverage with �ne details (in many countries), such as theOSM data, has the advantage that the same approach can applyto many areas across the globe without manual tuning to accom-modate available datasets for every study area. Similarly, a recentproject using OSM data to generate pa�erns of human activities inVienna, Austria demonstrated promising results [10].

Our approach uses the AQS data from twelve SCAQMD (SouthCoast Air �ality Management District) monitoring stations in theLos Angeles Metropolitan Area (LAMA) and geographic data fromOSM to automatically build an air quality model. �e model demon-strates on how di�erent types of OSM features impact PM2.5 AQIsand from what distance at a given time in LAMA. OSM contains mil-lions of geographic features in LAMA, including points-of-interest,land-use areas, water areas, and road networks (see Section 2). Ouralgorithm �rst identi�es the air monitoring stations that have a sim-ilar temporal pa�ern of PM2.5 AQIs on a temporal resolution. �enusing the temporal similarity, the algorithm trains a random forestmodel to generate the “importance” of individual OSM features(represented by points, lines, and polygons) together with theirgeographic distances to the monitoring stations (from 100-meter to3,000-meter radii). For example, suppose the stations that have asimilar temporal pa�ern of PM2.5 AQI all have a large factory within1,000 meters but other stations do not, then the feature-distancepair (factory, 1000-meter bu�er) could have a high importance onpredicting PM2.5 concentrations. We call the geographic character-istics (e.g., factory within 1,000 meters) weighted by the importancethe “geo-context”. In short, the geo-context represents how eachtype of OSM features impact PM2.5 AQIs in LAMA and from whatdistance during the period when the AQI data are available.

To predict the PM2.5 concentration at a location, P, at a giventime, our algorithm �rst generates the geo-context of P and thegeo-context of all available monitor stations in the study area. �enthe algorithm trains a second random forest model using the geo-context and the PM2.5 AQIs at available monitor stations to predictthe PM2.5 concentration at the location P. �is process works like arecommendation system and helps reduce the prediction errors byconsidering the temporal e�ect on the geo-context. For example,a large university campus within 1,500 meters can have a highimpact on the PM2.5 concentration during rush hours but not atnight. �e result is an expert-free air quality model for intra-cityPM2.5 predictions. Our �ndings can be used to improve air qualitymodels that traditionally rely on geographically weighted interpo-lations or regressions from (spatially) sparse monitoring stationsand can 1) highlight important features or nonlinear interactionsamongst them that might have been previously missed with moretraditional supervised approaches and 2) be incorporated into moresophisticated prediction models to select and quantify importantgeographic features related to air quality. �is �nding is partic-ularly important in the study of air pollution and the impact onrelevant populations, such as children with asthma.

�e remainder of this paper is organized into four additionalsections. Section 2 presents an overview of the data source. Section

2

3 describes our approach for modeling PM2.5 concentrations. Sec-tion 4 presents an experiment and evaluation of the results. Finally,Section 5 concludes the paper with a discussion of future work.

2 DATA SOURCESAQS (Air �ality System) Data

We use the AQS data collected in PRISMS-DSCIC. PRISMS-DSCICqueries the EPA�s AirNow web service every hour using multiplezip codes to retrieve the AQS data. For every zip code, PRISMS-DSCIC queries the AirNow service and stores the associated spa-tiotemporal observations. �e observations contain two parts: theenvironmental air quality indexes (AQI) and the pollution category(Good, Moderate, Unhealthy for Sensitive Groups, Unhealthy, VeryUnhealthy, Hazardous) that each AQI measurement correspond. OnPRISMS-DSCIC, there are 12 monitoring stations provide observa-tions of PM2.5 AQI in the Los Angeles Metropolitan Area (Figure 1).In this paper, our approach uses the PM2.5 AQI observations from2016-10-30 12:00:00 to 2017-06-10 12:00:00 with one-hour intervals.

Figure 1: Monitoring Station Locations

Geographic Data

OpenStreetMap (OSM) is an open source, crowdsourced map,which allows people to edit and access global geographic data freely.OSM provides a variety of geographic data types with detaileddatasets covering many areas in the world. PRISMS-DSCIC containsa copy of OSM data from Metro Extracts1 that covers the entire LosAngeles County, including the locations of all the available PM2.5monitoring stations. Example OSM data types (map or geographicfeatures) include land uses, roads, water areas, buildings, aero ways,ocean, etc.2 OSM land-use polygons describe the primary use ofland by the human, such as industrial, residential, and commercialuse. OSM road lines include many types of roads, streets, or pathssuch as motorways, living streets, and footways. OSM water areasare bodies of water, such as lakes or ponds. OSM building types,like point locations of apartments, factories, commercial structures,could re�ect the population density and tra�c volumes in a local1h�ps://mapzen.com/data/metro-extracts/2h�p://wiki.openstreetmap.org/wiki/Map Features

area. OSM aero ways are linear features that represent the physicalinfrastructure used to support aircra�, air travels, spacecra�, andspace �ights, which is a large air pollution source.

3 MODELING PM2.5 CONCENTRATIONFigure 2 shows our overall approach for building a PM2.5 con-centration model from OSM and AQS data automatically. A�er apreprocessing step for data cleaning (Section 3.1), our approachgroups available monitoring stations to identify similar tempo-ral pa�erns on PM2.5 AQIs for di�erent time resolutions (hourly,daily, monthly) using the K-means clustering (i.e., each station is apoint in the multidimensional space where each dimension is anhour/day/month) (Section 3.2). Our approach uses the clusteringresult in the next step to quantify the impact of a geographic featuretype to PM2.5 AQIs. �en the approach generates a “geographicabstraction” for each monitoring station automatically (Section 3.3).�e geographic abstraction is a summary of various geographicfeatures for the location using neighborhoods of various sizes. Forexample, the geographic abstraction can contain the length of dif-ferent road types (e.g., primary and secondary roads), the countsof various location types (e.g., commercial and residential build-ings), the area size of open spaces (e.g., parks, golf courses), andhydrography (e.g., rivers and ocean) within neighborhoods of 100-meter to 3,000-meter radii. Next, the approach trains a randomforest model to quantify the importance of individual componentsin the geographic abstraction based on their supports in groupingmonitoring stations of similar temporal pa�erns on PM2.5 AQIs(Section 3.4). We call the geographic abstraction weighted usingthe calculated importance the “geo-context”. Finally, the approachuses the geo-context to compute the similarity of the surroundingcharacteristics for producing the PM2.5 concentration predictionfor locations that do not have a monitoring station (Section 3.5).�e following subsections explain each component in our approachin details.

3.1 Data PreprocessingIn practice, data are generally incomplete (lacking values) and noisy(containing outliers), especially for streaming data. �e AQS dataquality also su�ers from unknown measurement uncertainty andexceptional events that might a�ect the measurement process. Miss-ing values and errors can have a large impact on the performanceof analytic algorithms. �erefore, the �rst step of our approachis data preprocessing including removing outliers and eliminatingmissing values in the AQS data.

3.1.1 Removing Outliers. �ere are several ways to remove dataoutliers such as computing a sliding window value, clustering todetect and remove outliers, and applying regression analysis tosmooth the data. To handle streaming data with a temporal au-tocorrelation, using a sliding window to �lter out noisy data ise�ective. Our approach calculates the median of a six-hour slidingwindow. For example, suppose we have a series of streaming PM2.5AQIs with the interval of one hour, [· · · , 20, 30, 35, 3, 50, 60, 55,· · · ], the sudden drop of AQI of 3 is considered as an outlier. Byapplying a six-hour sliding window, we replace the sudden drop bythe median of the window [20, 30, 35, 3, 50, 60, 55], that is 35.

3

Figure 2: Overall approach for automatically building aPM2.5 concentration model from OSM and AQS data

3.1.2 Eliminating Missing Values. �e simplest way to eliminat-ing missing values is just ignoring the data tuple when the valueis missing. Imputation methods such as using the a�ribute meanto �ll in the missing value or predicting for the missing ones bymachine learning algorithms can also achieve satisfactory results,especially in building a recommendation system where lots of thedependent values are not available. Our approach eliminates themissing values by removing the timestamp that does not have avalue of PM2.5 AQI because �lling missing values would require anaccurate prediction of the temporal autocorrelation, which mightnot be robust if the input data are not representative. In our case,the timespan of our AQS data is less than one year.

3.2 Grouping Stations on PM2.5 AQIsIn this section, our goal is to identify monitoring stations thathave “similar” time-series PM2.5 AQIs. We use this information togenerate the geo-context in a later step. We de�ne “similar” as insimilar temporal pa�ern on the PM2.5 AQIs. Our algorithm clustersthose monitoring stations with similar temporal pa�erns in thesame group. For example, urban areas would show a higher PM2.5AQI during workdays than rural areas, so urban areas could begrouped together in one cluster, and rural areas are together inanother.

Our approach uses K-means to cluster the available monitoringstations based on the collected time-series PM2.5 AQIs. K-meansclustering is a common method to identify groups in the dataset,with the number of groups represented by the input variable K. �ealgorithm works iteratively to assign each data point to one of theK groups. �us, data points are clustered based on the similarity oftheir feature vector in the Euclidean space.

We construct a feature vector for each monitoring station usingtheir time-series PM2.5 AQIs. Table 1 shows an example of thePM2.5 AQIs for the monitoring station, Central LA CO for �ve hours.From the example, our approach generates the feature vector as [50,53, 55, 57, 61]. In our dataset, we have the AQS data covering 5,352hours, so for clustering hourly PM2.5 AQIs, each feature vector hasa total of 5,352 components. In the 5,352-multidimensional space,we have 12 points where each point corresponding to a monitorstation.

K-means is a type of unsupervised learning technique, and weneed to de�ne the number of groups, K, beforehand. However, thecorrect choice of K is o�en unknown in advance. Increasing Kwithout a penalty will always reduce the amount of error in theresulting clustering, to the extreme case of zero errors if each datapoint is a cluster (i.e., when K equals the number of data points).In our approach, we use the elbow method to determine the valueof K. �e idea of the elbow method is to run K-means clusteringon the dataset for a range of values of K (e.g., K from 1 to 12 in ourexperiment). For each value of K, we calculate the within set sum ofsquared errors (WSSSE), which is the sum of the distances betweeneach point and centroid in each K partition. �en we plot a linechart of the WSSSE for each K value. �e line chart would looklike an arm, and the “elbow” of the arm is the best choice of K. Forexample, Figure 3 shows that when K equals to 8, the trend becomesslow. �erefore, we choose K equals to 8 as the number of clusters.Figure 4 shows the clustering result of twelve locations using hourlyAQIs. We can �nd that all the coastal areas are grouping togetherwhile Central LA is itself in group because it has a very di�erenttemporal pa�erns of the PM2.5 AQI. A�er determining the best K,our approach uses the K-means results of the identi�ed best K tolabel the monitoring stations. For example, two monitoring stationsthat in the same cluster will have the same group label. In the nextstep, our approach uses the group label of each monitoring stationto quantify how each OSM feature supports this station clusteringresult.

Table 1: Example for 5-hour PM2.5 AQI in Central LA CO

Monitorinд Station Timestamp PM2.5 AQICentral LACO 2017 − 03 − 04 12 : 00 : 00 50Central LACO 2017 − 03 − 04 13 : 00 : 00 53Central LACO 2017 − 03 − 04 14 : 00 : 00 55Central LACO 2017 − 03 − 04 15 : 00 : 00 57Central LACO 2017 − 03 − 04 16 : 00 : 00 61

3.3 Generating Geographic AbstractionPM2.5 concentrations are in�uenced by its surrounding geographicfeatures [2]. In this section, our approach computes a geographic

4

Figure 3: �e elbow point for choosing the best K

Figure 4: Clustering result of twelve monitoring stations

abstraction to describe the surrounding environment for a location.We use the available geographic data from OpenStreetMap, whichincludes land use, roads, buildings, water ways, aero ways, ocean,etc. For each monitoring station, we construct a series of concentriccircles (bu�ers) with radii from 100 meters to 3,000 meters withinterval of 100 meters.Length of Line Features

Our approach computes the sum of lengths of di�erent featuretypes to create a geographic abstraction for line geographic featuresfrom OSM, e.g., roads and aero ways. Figure 5 shows an example ofroads around the monitoring station A with the 100-meter and 200-meter bu�ers. In the example, there is a total of three roads of twotypes, the pedestrian and motorway roads. (Both “Pedestrian” and“Motorway” are OSM feature types.) For each type, we sum up thelength of road segments within the bu�er. As Figure 5 shows, withinthe 100-meter bu�er, the monitoring station A contains 23-meter(m) Pedestrian and 30m Motorway. Within the 200-meter bu�er,it has 43m Pedestrian and 200m Motorway. �us, our approach

generates the components for the geographic abstraction vectorfor the station A as:

[23, 30, 43, 200, · · · ]Each component represents an abstraction of a unique geo-

graphic feature type within a speci�c distance to the monitoringstation. �e example contributes four components to the abstrac-tion vector: the “Pedestrian” road length in the 100-meter bu�er, the“Pedestrian” road length in the 200-meter bu�er, the “Motorway”length in the 100-meter bu�er, and the “Motorway” length in the200-meter bu�er. Our approach iterates through all available lineOSM features to generate an abstraction for every feature type foreach bu�er size.

Figure 5: Example for roads in the 100-meter and 200-meterbu�ers

Area of Polygon Geographic FeaturesFor polygon geographic features such as land uses and water

areas, our approach computes the sum of the overlapping areasbetween each type of the features and the bu�ers. Figure 6 showsan example of the land use around the monitoring station A withinthe 100-meter and 200-meter bu�ers. In this example, there is atotal of four area features of two land-use types, park and industrialland-use. (Both “Park” and “Industrial” are OSM feature types.) Foreach type, our approach calculates the sum of the overlapping areasof the feature type and the bu�ers (i.e., we only compute the arealocated within the bu�er). As in Figure 6, for the 100-meter bu�er,the station A contains 500-square-meter (m2) of park areas. For the200-meter bu�er, it has 950m2 park areas and 740m2 industrial areas.�us, our approach generates the components for the geographicabstraction vector for the station A as:

[500, 0, 950, 740, · · · ]�e example contributes four components to the abstraction

vector: “Park” areas in the 100m bu�er, “Park” areas in the 200mbu�er, “Industrial” areas in the 100m bu�er, and “Industrial” areain the 200m bu�er. Our approach iterates through all availablepolygon OSM features to generate an abstraction for every featuretype for each bu�er size.

5

Figure 6: Example for land use in the 100-meter and 200-meter bu�er

Count for Point FeaturesOur approach computes the count of individual types of point

features (e.g., building types) to represent the geographic abstrac-tion for point OSM features. �e number of buildings in an areacould re�ect population density and tra�c pa�erns. Figure 7 showsan example of some buildings around the monitoring station Awithin the 100-meter and 200-meter bu�ers. In the example, thereis a total of 12 buildings of two building types, apartment and fac-tory buildings. (Both “Apartment” and “Factory” are OSM featuretypes.) In this example, there are two apartments within the 100-meter bu�er and eight apartments and three factories within the200-meter bu�er. �us, our approach generates the componentsfor the geographic abstraction vector for the station A as:

[2, 0, 8, 3, · · · ]�e example contributes four components to the abstraction

vector: “Apartment” counts in the 100m bu�er, “Apartment” countsin the 200m bu�er, “Factory” counts in the 100m bu�er, “Factory”counts in the 200m bu�er. Our approach iterates through all avail-able point OSM features to generate an abstraction for every featuretype for each bu�er size.Distance to the ocean

�e geographic abstraction vector also includes a component ofthe geographic distance from a location to the ocean. For example,suppose the distance from the monitoring station A to ocean is4000m, the approach generates a feature vector component as:

[4000, · · · ]Generating Vector as Geographic Abstraction

Our approach generates a vector as geographic abstraction foreach location. For example, to construct a geographic abstractionvector for the monitoring station A, we combine all the componentsmentioned above to form a new vector as,

[23, 30, 43, 200, 500, 0, 950, 740, 2, 0 , 8, 3, 4000]Each column of the vector represents the value of a unique

geographic feature type with a speci�c bu�er size. Our approach

creates bu�ers from 100 meters to 3,000 meters with an interval of100 meters. �ere are more than 3,500 columns in each geographicabstraction vector. Our approach generates the vectors for the 12monitoring stations and together the vectors constitute a matrix(Figure 8). In the next step, our approach quanti�es the importanceof individual components in the geographic abstraction vectors(column in the matrix).

Figure 7: Example for buildings in the 100-meter and 200-meter bu�er

Figure 8: Geographic abstraction matrix

3.4 Computing geo-context3.4.1 Computing Geographic Importance. In many cases, build-

ing an air quality model requires air quality experts to decide whichgeographic types and what bu�er sizes should be considered inthe modeling process. However, this process is expensive and timeconsuming. In this section, we present a method to automaticallyidentify which geographic types with what bu�er size have themost impact on PM2.5 concentration.

Our approach uses the random forest technique to quantify theimportance of individual components in the geographic abstractionvector. Random forest techniques provide an easy way to assess fea-ture importance for classi�cation or regression tasks. Our approach

6

uses the grouped monitoring stations as the label (the dependentvariable) and their geographic abstractions as the predictor featuresto train a random forest model. We use the random forest imple-mentation provided in Spark MLlib on PRISMS-DSCIC to derivethe feature importance for each component in the geographic ab-straction. �e idea is that the feature components with a higherimportance indicate that the components have a higher impact onthe clustering result (based on the PM2.5 AQI temporal pa�erns).�e features with zero importance means they are not important atall in classifying the temporal PM2.5 AQI pa�erns of the monitoringstations used in the K-means process.

3.4.2 Constructing Geo-Context. �e feature importance helpsus identify which types and what bu�er size ma�ers in predictingPM2.5 AQI. For each component in the geographic abstraction, wemultiply its value by its importance. In this way, we can rewardthose important features and penalize unimportant ones. For in-stance, suppose there is a large university area (e.g., 3000m2) ina 100-meter bu�er, but it has zero importance (i.e., it has no rela-tionship with the similarity of PM2.5 AQI at di�erent locations),we eliminate its value as it does not exist. We call this weightedgeographic abstraction the “geo-context”. �e geo-context replacesthe original geographic abstraction and become a description ofthe geographic environment for a location for predicting PM2.5concentration.

3.5 Predicting PM2.5 ConcentrationTo predict PM2.5 concatenation at a certain time for locations thatdo not have air quality sensors, our approach trains a second ran-dom forest model with the geo-context (as the predictors) and thePM2.5 AQI (as the dependent variable) at that time from all availablemonitoring stations. �en we construct the geographic abstrac-tion (Section 3.2) for the locations for prediction and compute thegeo-context by applying the feature importance (Section 3.3). Next,we use the trained random forest model to predict the PM2.5 AQIfor the targeting locations and �nally convert the predictd AQIs toPM2.5 concentrations.

4 EXPERIMENTWe utilized the AQS data (AirNow) and OpenStreetMap data col-lected in PRISMS-DSCIC for the experiment. We conducted theexperiment using the Apache Hue interface, which can be used tooperate on an interactive session with the Spark cluster on PRISMS-DSCIC. �e platform applies Spark 2.1.0 and Scala 2.11.8. All geospa-tial computing was done in PostGIS and statistical analysis wasdone in Scala, version 2.11.8 and Spark MLlib, version 2.1.0.

We performed data preprocessing on the AQS data includingremoving outliers, eliminating missing values, and aggregating datato lower temporal resolutions (from hourly to daily and monthly).�e timespan is seven entire months, 233 days, and 5,352 hours.Figure 9 shows an example result before (a) and a�er (b) removingoutliers using a sliding window of six hours. Our approach also re-moved the timestamp that did not contain PM2.5 AQIs. To conductour experiment for di�erent temporal resolutions, we computedthe mean for daily and monthly PM2.5 AQIs for each monitoringstation to generate the daily and monthly data.

4.1 Experimental SettingsIn the experiment, we tested the performance of both the geo-graphic abstraction and geo-context for generating hourly, daily,and monthly predictions of PM2.5 in in Los Angeles MetropolitanArea. We veri�ed our results using leave-one-out cross-validationand compared our results with the inverse-distance weighing (IDW)method. We started by taking one monitoring station out (i.e., thetarget station) and using the remaining 11 stations to predict PM2.5concentration at the hourly, daily, and monthly temporal resolution.We then used the le�-out station as the ground truth to calculatethe prediction accuracy.

(a)

(b)

Figure 9: Example result of removing outlier before (a) anda�er (b) for Monitoring Station in Central LA

4.2 Experimental Results and DiscussionWe compared the result of our approach using both geographicabstraction and geo-context to IDW using the same dataset. IDW isthe most frequently used deterministic models in spatial interpola-tion. We evaluated the performance using RMSE and MAE. RMSE(Root-mean-square deviation) measures the di�erences betweenvalues predicted by a model and the actual values. MAE (Meanabsolute error) measures the absolute di�erence between two con-tinuous variables. We tested on 7 months with monthly data, 233days with daily data, and 168 hours (one day, including 24 hours,randomly chosen in every month) with hourly data. We computedthe overall RMSE and MAE for all target locations (12 monitoringstations). Table 2 presents the evaluation result for month, daily,and hourly, respectively. Figure 10 shows the monthly predictionerror for all monitoring stations using three methods. Our approachachieves the best performance with the smallest errors.

Our approach using either geo-context or geographic abstractiongenerated competitive low RMSE and MAE as other expert-curatedmodels (See Section 5). Using geo-context generated more accurateresults from using the geographic abstraction, and both methodswere comparable to IDW while IDW cannot provide �ne-scalepredictions (see next paragraph). We also performed the pairedt-test and found that the geo-context MAE results and IDW are

7

statistically di�erent with 95% con�dence. For hourly, daily, andmonthly predictions, the t-test results were p = 1.73212E-15, p =1.31243E-05, and p = 0.002, respectively.

Table 2: Result for monthly, daily, and hourly predictionevaluation

Geo − context Geo −Abstraction IDW

RMSE (Monthly) 2.53984 2.62391 2.88263MAE (Monthly) 1.86657 1.93673 2.18675RMSE (Daily) 4.33786 4.35857 4.10172MAE (Daily) 3.26140 3.28176 3.10185

RMSE (Hourly) 7.38823 7.59260 6.66106MAE (Hourly) 5.06559 5.12406 4.54779

To demonstrate our results in predicting �ne scale predictionsof PM2.5 concentration, we created a 1-mile apart �shnet coveringmost of the City of Los Angeles (604 points). We used our approachto predict the PM2.5 AQI monthly mean for each location on the�shnet using geo-context. Our approach generated a list of featureimportance based on the monitoring stations. Table 3 shows thetop 15 features ranked by importance. �e results demonstratethat the identi�ed the feature types with high importance usingthe geo-context are similar to other studies in analyzing PM2.5concentration (See Section 5). Motorway, primary and tertiaryroads re�ect tra�c volume. “village green”, “farmland”, “pitch”are open spaces of green area. “University”, “residential”, “retail”are the places a�ract tra�c and people. “Wetland”, “Industrial’,“garages” are the sources of water pollutants and air pollutants. Byautomatically quantifying the importance of individual geographicfeature types, we could easily explain those geographic featuretypes a�ect PM2.5 concentrations and from what distance. Figure 11shows the PM2.5 AQI predictions of our approach and IDW for Dec2016 and Jan 2017. As expected, IDW could not generate �ne-scalepredictions while our approach successfully identi�ed intra-cityareas where the air quality is typically poor (e.g., the south part ofthe city near the port of San Pedro and downtown Los Angeles).

Table 3: Example for feature importance

Geo Name Bu f f er Size (meter ) Geo type Importance (%)land use 1100 wetland 0.0051177land use 1300 university 0.004450road 600 rail 0.0044327

land use 1200 villaдe дreen 0.0037241road 700 primary 0.0035520

land use 1900 f armland 0.0031458land use 2700 villaдe дreen 0.0030063road 800 residential 0.0028980

buildinд 2000 retail 0.0027980buildinд 900 industrial 0.0027576road 500 tertiary 0.0027357

land use 900 pitch 0.0026613buildinд 2900 school 0.0025681buildinд 1700 дaraдes 0.0025361road 1300 motorway 0.0023724

Figure 10: prediction errors for all monitoring stations

8

Figure 11: PM2.5 AQIs prediction for Dec 2016 and Jan 2017using geo-context (le�) and IDW (right)

5 RELATEDWORK�ere exists an abundant literature on air quality modeling andprediction. (�e reader is referred to [9] for a review on land-useregression (LUR) methods and [25, 26] for comprehensive reviewson air quality predictions using various methods). �e basic and theleast computationally expensive methods use spatial interpolations,such as inverse distance weighting (IDW) and Kriging. �e methodsdo not explicitly consider neighborhood characteristics and cannotgenerate results at a �ne-scale with sparse monitoring stations.

Sophisticated and more accurate air quality modeling and pre-diction methods typically include two steps. First, a domain expertuses knowledge in previous studies on air quality models (AQMs)and statistical methods to test and select the independent variables(predictors). �is variable selection step includes choosing a pre-dictor type (e.g., the length of the primary roads, time-lagged airquality data, and regional average humidity) and a spatial distancefor which the predictor would have a high impact on the dependentvariable (pollutant concentrations). �is step largely depends onthe availability of a dataset and previous studies of similar pollu-tants. For example, some studies used crowdsourced data [7] orarea speci�c data (e.g., dense sensors on public transport vehicles[8]) that are not available elsewhere. If a dependent variable is notavailable for a study area, a common approach is to compute surro-gate variables. For example, if tra�c dynamics are not available,

the domain expert would use remotely sensed data or availableroad data as the surrogate [9]. �is step needs to repeat for eachstudy areas and pollutant types [9].

Once the predictor variables are chosen, the second step is to se-lect and test the modeling and prediction methods. �e mainstreammethods include the classical dispersion models, LUR models, andmore recently machine learning and data mining models. Disper-sion models o�en require very detailed data (e.g., building heightsand distances between neighboring buildings) and area speci�c pa-rameters [25], which can be di�cult to generalize and transferredto other locations. Also, dispersion models are usually computa-tionally expensive. In comparison, LUR models have the advantagethat 1) the resulting models are human-explainable and 2) theyhave the less computational requirement (than dispersion modelsand machine learning methods).

Since the �rst LUR study on air pollution modeling in 1997 [2],many LUR models and features are used to study air pollutionmodeling and predictions (e.g., [9, 13]). However, they heavilyrely on expert-selected predictors including predictor types andtheir �nite spatial radii, and every study area requires a domainexpert to select and �ne-tune the variables. For example, the sameradius selected for transportation features in one area might meansomething else in another. (Road density within 500 meters inLos Angeles likely captures very di�erent processes than the samevariable in rural Montana.) �e result is that LUR models in theliterature demonstrated signi�cantly di�erent error ranges in theirpredicts (see [9, 25, 26]). For example, Liu et al. [12] reported highR2 for their NO and PM2.5 LUR models for Shanghai, China, buttheir RMSE for the 35 veri�ed locations was 194.59 (g/m3). �eLUR studies reviewed in Hoek et al. [25] had their RMSE rangedfrom 1.6 to 9.8 (g/m3) for various types of air pollutants in studyareas across the globe, and their temporal resolutions are commonlylow (e.g., seasons). Moreover, LUR models rarely deal with spatiale�ects (e.g., spatial non-stationarity) except a more recent studythat built a wind model to improve traditional LUR and had a 10-20%improvement on the prediction [1].

With more datasets and so�ware tools becoming available, manystudies start to adopt machine learning techniques for building airquality models or predictors [3, 18, 20, 22, 27, 28]. �e advantageof using machine learning techniques include 1) the capability tohandle large volumes and varieties of data types and formats (e.g.,categorical vs. numerical data), 2) having more accurate predictionresults because machine learning methods are less in�uenced bythe choice of parameters or speci�c dataset (e.g., see [3] for a com-parison of LUR and Random Forest), and 3) requiring less experte�orts in selecting input features. Among others, a notable work isthe Microso� Urban Air system [27, 28] that generates air qualitypredictions covering large areas. While these machine learningmethods could achieve more accurate results than the popular LURmodels, the price to pay is that the machine learning models areo�en not easily translatable to policy makers or urban planners(e.g., prediction results from multiple machine learning models).Also, many of the existing studies still tested with region speci�cdata sets that are di�cult to obtain in many places.

In comparison, our approach is similar to LUR models in thatthe results are explainable (i.e., the geo-context), but our approachdoes not require expert-selected predictors. Contrast to dispersion

9

models and the advanced machine learning models (e.g., [27, 28]),our approach is less computationally expensive but generate lessaccurate results because of the limitation in our predictors. Forexample, using a static geographic data source, currently our modelcaptures spatial variability but not temporal variability, and futureextensions of this work will aim to incorporate meteorology tocapture temporality be�er. More types of globally available datasuch as the WorldClim (global climate data with 1 km2 resolution)and satellite imagery could be helpful in improving predictionresults of our approach in the future. In sum, the previous studiestypically rely on expert-selected and regional available predictors,and our approach is expert-free and can generate an accurate modelfor predicting intra-city PM2.5 concentrations from OSM data.

6 DISCUSSIONS AND FUTUREWORK�is paper presented a data mining approach to build an accuratemodel to predict PM2.5 concentration by automatically selecting im-portant geographic features without using expert knowledge. �eadvantages of our approach include 1) it can quantify the in�uenceof geographic features on air quality, which helps us do geographicfeature selection for air quality analysis without using the domainknowledge; 2) we use the easily accessible OpenStreetMap to con-struct geographic abstraction instead of using data that is expensiveand di�cult to obtain; 3) the model performed well in predictingPM2.5 concentration and could generate �ne-scale predictions. Weplan to improve the work presented in this paper in several ways.First, we plan to test the approach for other cities. �an We plan toincorporate other time-series data, such as weather information totackle the challenges in modeling spatial e�ects.

ACKNOWLEDGMENTSWe thank Yi Chen for generating the maps of PM2.5 predictions.

REFERENCES[1] Stefania Bertazzon, Markey Johnson, Kristin Eccles, and Gilaad G Kaplan. 2015.

Accounting for spatial e�ects in land use regression for urban air pollutionmodeling. Spatial and spatio-temporal epidemiology 14 (2015), 9–21.

[2] David J Briggs, Susan Collins, Paul Ellio�, Paul Fischer, Simon Kingham, Erik Le-bret, Karel Pryl, Hans van Reeuwijk, Kirsty Smallbone, and Andre Van Der Veen.1997. Mapping urban air pollution using GIS: a regression-based approach.International Journal of Geographical Information Science 11, 7 (1997), 699–718.

[3] Cole Brokamp, Roman Jandarov, MB Rao, Grace LeMasters, and Patrick Ryan.2017. Exposure assessment models for elemental components of particulatema�er in an urban environment: A comparison of regression and random forestapproaches. Atmospheric Environment 151 (2017), 1–11.

[4] Douglas W Dockery, C Arden Pope, Xiping Xu, John D Spengler, James H Ware,Martha E Fay, Benjamin G Ferris Jr, and Frank E Speizer. 1993. An associationbetween air pollution and mortality in six US cities. New England journal ofmedicine 329, 24 (1993), 1753–1759.

[5] Jingchun Fan, Shulan Li, Chunling Fan, Zhenggang Bai, and Kehu Yang. 2016.�e impact of PM2. 5 on asthma emergency department visits: a systematicreview and meta-analysis. Environmental Science and Pollution Research 23, 1(2016), 843–850.

[6] Michael Guarnieri and John R Balmes. 2014. Outdoor air pollution and asthma.�e Lancet 383, 9928 (2014), 1581–1592.

[7] David Hasenfratz, Olga Saukh, Silvan Sturzenegger, and Lothar �iele. 2012.Participatory air pollution monitoring using smartphones. Mobile Sensing (2012),1–5.

[8] David Hasenfratz, Olga Saukh, Christoph Walser, Christoph Hueglin, MartinFierz, and Lothar �iele. 2014. Pushing the spatio-temporal resolution limit ofurban air pollution maps. In Pervasive Computing and Communications (PerCom),2014 IEEE International Conference on. IEEE, 69–77.

[9] Gerard Hoek, Rob Beelen, Kees De Hoogh, Danielle Vienneau, John Gulliver,Paul Fischer, and David Briggs. 2008. A review of land-use regression models to

assess spatial variation of outdoor air pollution. Atmospheric environment 42, 33(2008), 7561–7578.

[10] Jamal Jokar Arsanjani, Marco Helbich, Mohamed Bakillah, Julian Hagenauer,and Alexander Zipf. 2013. Toward mapping land-use pa�erns from volunteeredgeographic information. International Journal of Geographical Information Science27, 12 (2013), 2264–2278.

[11] Lianfa Li, Fred Lurmann, Rima Habre, Robert Urman, Ed Rappaport, Beate Ritz,Jiu-Chiuan Chen, Frank Gilliland, and Jun Wu. [n. d.]. Constrained Mixed-E�ectModels with Ensemble Learning for Prediction of Nitrogen Oxides Concentra-tions at a High Spatiotemporal Resolution. Environmental Science Technology.Under review.

[12] Chao Liu, Barron H Henderson, Dongfang Wang, Xinyuan Yang, and Zhong-renPeng. 2016. A land use regression application into assessing spatial variationof intra-urban �ne particulate ma�er (PM 2.5) and nitrogen dioxide (NO 2)concentrations in City of Shanghai, China. Science of �e Total Environment 565(2016), 607–615.

[13] DK Moore, M Jerre�, WJ Mack, and N Kunzli. 2007. A land use regression modelfor predicting ambient �ne particulate ma�er across Los Angeles, CA. Journalof Environmental Monitoring 9, 3 (2007), 246–252.

[14] Zev Ross, Paul B English, Rusty Scalf, Robert Gunier, Svetlana Smorodinsky,Steve Wall, and Michael Jerre�. 2006. Nitrogen dioxide prediction in SouthernCalifornia using land use regression modeling: potential for environmentalhealth analyses. Journal of Exposure Science and Environmental Epidemiology 16,2 (2006), 106–114.

[15] Zev Ross, Michael Jerre�, Kazuhiko Ito, Barbara Tempalski, and George D�urston. 2007. A land use regression for predicting �ne particulate ma�erconcentrations in the New York City region. Atmospheric Environment 41, 11(2007), 2255–2269.

[16] Jonathan M Samet, Francesca Dominici, Frank C Curriero, Ivan Coursac, andSco� L Zeger. 2000. Fine particulate air pollution and mortality in 20 US cities,1987–1994. New England journal of medicine 343, 24 (2000), 1742–1749.

[17] Dimitris Stripelis, Jose Luis Ambite, Yao-Yi Chiang, Sandrah P Eckel, and RimaHabre. 2017. A Scalable Data Integration and Analysis Architecture for SensorData of Pediatric Asthma. In Data Engineering (ICDE), 2017. 1407–1408.

[18] Wei Sun, Hao Zhang, Ahmet Palazoglu, Angadh Singh, Weidong Zhang, andShiwei Liu. 2013. Prediction of 24-hour-average PM 2.5 concentrations using ahidden Markov model with di�erent emission distributions in Northern Califor-nia. Science of the total environment 443 (2013), 93–103.

[19] Darren Wilton, Adam Szpiro, Timothy Gould, and Timothy Larson. 2010. Im-proving spatial concentration estimates for nitrogen oxides using a hybrid mete-orological dispersion/land use regression model in Los Angeles, CA and Sea�le,WA. Science of the total environment 408, 5 (2010), 1120–1130.

[20] Xia Xi, Zhao Wei, Rui Xiaoguang, Wang Yijie, Bai Xinxin, Yin Wenjun, and DonJin. 2015. A comprehensive evaluation of air pollution prediction improvement bya machine learning method. In Service Operations And Logistics, And Informatics(SOLI), 2015. 176–181.

[21] Yu-Fei Xing, Yue-Hua Xu, Min-Hua Shi, and Yi-Xin Lian. 2016. �e impact ofPM2. 5 on the human respiratory system. Journal of thoracic disease 8, 1 (2016),E69.

[22] Wei Xu, Cheng Cheng, Danhuai Guo, Xin Chen, Hui Yuan, Rui Yang, and Yi Liu.2014. PM2. 5 Air �ality Index Prediction Using an Ensemble Learning Model.In International Conference on Web-Age Information Management. Springer, 119–129.

[23] Xiaofan Yang, Yixuan Zheng, Guannan Geng, Huan Liu, Hanyang Man, ZhaofengLv, Kebin He, and Kees de Hoogh. 2017. Development of PM 2.5 and NO 2 modelsin a LUR framework incorporating satellite remote sensing and air quality modeldata in Pearl River Delta region, China. Environmental Pollution 226 (2017),143–153.

[24] Liang Zhai, Bin Zou, Xin Fang, Yanqing Luo, Neng Wan, and Shuang Li. 2016.Land Use Regression Modeling of PM2. 5 Concentrations at Optimized SpatialScales. Atmosphere 8, 1 (2016), 1.

[25] Yang Zhang, Marc Bocquet, Vivien Mallet, Christian Seigneur, and AlexanderBaklanov. 2012. Real-time air quality forecasting, part I: History, techniques, andcurrent status. Atmospheric Environment 60 (2012), 632–655.

[26] Yang Zhang, Marc Bocquet, Vivien Mallet, Christian Seigneur, and AlexanderBaklanov. 2012. Real-time air quality forecasting, part II: State of the science,current research needs, and future prospects. Atmospheric Environment 60 (2012),656–676.

[27] Yu Zheng, Furui Liu, and Hsun-Ping Hsieh. 2013. U-air: When urban air qualityinference meets big data. In Proceedings of the 19th ACM SIGKDD internationalconference on Knowledge discovery and data mining. 1436–1444.

[28] Yu Zheng, Xiuwen Yi, Ming Li, Ruiyuan Li, Zhangqing Shan, Eric Chang, andTianrui Li. 2015. Forecasting �ne-grained air quality based on big data. InProceedings of the 21th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining. 2267–2276.

10

Mining Public Datasets for Modeling Intra-City PM2.5...

Documents

Transcript of Mining Public Datasets for Modeling Intra-City PM2.5...