A Novel Methodology for Georeferenced Address...

Page 1: A Novel Methodology for Georeferenced Address Listingweb.ccs.miami.edu/~nkumar/Manuscripts/Rev_Geocoding_Final_IJPOQ.pdf · A Novel Methodology for Georeferenced Address Listing Abstract:

A Novel Methodology for Georeferenced Address Listing

Abstract: This paper presents a novel methodology for address listing. The methodology builds on publicly available georeferenced databases and geospatial methods, including spatial sampling, identification of site domains, reverse geocoding, web mining, and validation against the US Postal Service (USPS) address database. To demonstrate the methodology, four methods of spatial sampling were employed to draw four sets of sample sites in the Chicago Metropolitan Statistical Area (MSA). The domain (or area) of each sample site was determined using local spatial autocorrelation. Within the domain of each site, a random list of residential addresses was developed with the aid of reverse geocoding, web mining, and publicly available datasets. The sampled residential addresses were then compared with the USPS address database. Of the 951 sampled addresses, 938 (98.6%) were valid residential addresses. These results suggest that survey researchers can develop location-based address lists in a cost-effective manner. Because the addresses are georeferenced, multi-level socio-physical contexts can be developed around respondents without adding to their burden of effort and time.

Keywords: reverse geocoding, spatial sampling, geo-information revolution.

INTRODUCTION: Recent advances in geospatial technologies allow us to georeference a vast amount of the (non-geographic) data we collect, from bank transactions to phone calls. For example, we can locate an airplane or predict the weather at a given location in real time, identify where a phone call originates, and trace a person's interactions across geographic space and time through a smartphone. Most of these data are stored and maintained digitally, and some are publicly available. With the availability of these georeferenced data, we have entered a new era of geo-information revolution. This, in turn, has had a phenomenal impact on our day-to-day lives, from finding our neighbors' names and phone numbers to planning a trip. It is also changing the way we collect survey data and conduct social science research. This article demonstrates how three publicly available data sources, namely Google Earth, White Pages, and the US Postal Service (USPS) database, can be used for address listing at or around a given location. Once the locations are identified (with the aid of spatial sampling), the proposed methodology can be used to develop a list of addresses at or around them.

A major challenge of conventional sampling and survey methods is that a finite, complete, and up-to-date sampling frame of respondents is rarely available. Geospatial methodologies, by contrast, allow us to account for every inch of a geographic area along with its socio-physical attributes. This means that spatial sampling is better suited to sampling areas than populations. Spatial sampling has been used extensively for sampling natural resources, but its application to social surveys has been limited by the lack of a complete sampling frame of the human population with locational (georeferenced) information. Traditionally, a pseudo-sampling frame of geographic areas is created to implement spatial sampling, because a finite and complete sampling frame of natural resources is difficult to construct, especially for large geographic areas. Such a frame is constructed by overlaying a geometric grid on the area of interest, referred to as the sampling domain. Once the sampling frame is constructed and characterized, any classical sampling method (such as systematic random or stratified random sampling) can be employed to draw a sample of locations or areas. The selected locations or areas, however, do not necessarily represent respondents, residential units, or households.

This paper demonstrates the application of reverse geocoding, coupled with web mining and the USPS database, to construct and validate a list of residential addresses at or around a given location. Spatial sampling (and its integration with reverse geocoding) offers several advantages over non-spatial social survey methodology. First, it ensures better spatial coverage and representation of the population distribution across geographic space, because it takes into account the spatial distribution of population at a fine geographic scale, such as 90m spatial resolution. The human population is not distributed homogeneously; its density varies significantly between rural and urban areas and within urban areas. Generally, population data are aggregated, analyzed, and made available at higher-order geographic units (such as census blocks in the US) because of confidentiality concerns, and the shapes and sizes of these units can vary greatly. Integrating satellite remote sensing with existing secondary datasets, such as the US Census, allows us to develop population estimates at a very fine spatial resolution; LandScan population concentration data at 90m spatial resolution for the continental US, and at 400m and 1km spatial resolution worldwide, are examples of such datasets (ORNL, 2008). The availability of these data provides a basis for drawing a spatially representative sample of the human population.

Second, spatial sampling can capture spatial heterogeneity in the distribution of socio-economic and demographic characteristics, which is critically important for social surveys. The distribution of the human population and its socio-economic characteristics exhibits significant spatial disparity and segregation (Osborne & Rose, 1999). Spatial analytical methods, such as local spatial autocorrelation and semivariance, can be used to quantify and characterize geographic space by socio-economic and demographic attributes and to construct strata that are homogeneous in socio-economic characteristics. This, in turn, can ensure socio-economic representation of the population across geographic space. Third, spatial sampling is likely to ensure better population representation than the traditional methods used for social surveys. A finite and complete sampling frame of the population is needed to draw a representative sample, but an up-to-date and complete frame is rarely available. Spatial sampling, however, relies on a sampling frame of areas, which can be updated frequently and constructed at any spatial resolution. Once the sampled areas and/or locations are identified, the reverse geocoding presented in this paper can generate a list of addresses within the selected areas and/or at or around the sample sites. Fourth, reverse geocoding adds a locational reference (georeference) to the selected sample, which makes it possible to attach multi-level socio-physical environmental contexts to the sample. With the increasing importance of place and space, these contextual data are becoming important for studying social, behavioral, and health outcomes, because people's long-term exposure to their immediate place-specific socio-physical and chemical environment is likely to influence their attitudes, behavior, social and economic outcomes, and health (Caughy & O'Campo, 2006; Chen, Gong, & Paaswell, 2008; Frumkin, 2003). Integrating survey data with multi-level socio-physical contextual data from multiple sources is likely to augment the scope of the survey data and advance interdisciplinary, multi-level social science research.

The main objectives of this paper are: (a) to develop, test, and validate reverse geocoding for identifying addresses at or around given locations, (b) to extract a list of residential addresses with the aid of publicly available datasets, and (c) to validate the residential list using the USPS database. The remainder of this article is organized into three sections. The first describes the study area, the data used, and the methods employed. The second presents the implementation of reverse geocoding with the spatial sampling design. The final section discusses the findings of this research along with their implications and limitations.

METHODS AND MATERIALS

Study Area: The pilot survey was conducted in the urban Chicago MSA. Because the area has a significant gradient in its socio-physical and chemical environment, it is well suited for this experiment: it demonstrates how effectively the proposed method produces an enumeration list in an area whose population is very diverse in socio-economic characteristics and population concentration.

Data: The data for this research come from a variety of sources, namely Google Earth, White Pages, the US Census, and Oak Ridge National Laboratory (ORNL). The latter two datasets were used to construct a pseudo-sampling frame of the human population and to characterize it by socio-economic characteristics. The Google Earth dataset was used for reverse geocoding to develop a valid list of addresses at or around the selected sites. The White Pages database was used to extract a list of residential units from the valid addresses extracted from Google Earth.

Method: Four methods of sampling, namely clustered random (adopted for the General Social Survey, GSS) (Harter, Eckman, English, & O'Muircheartaigh, 2010), optimized clustered random, spatially random, and optimized spatial, were employed to draw four sets of sample sites. The first two methods were implemented in two stages: census tracts were drawn first, and random points were then simulated within the selected tracts. The first set of census tracts was selected using the GSS design, in which the 1692 urban census tracts were implicitly stratified by income and race; 18 census tracts were then systematically sampled with probability proportional to their 2000 population size. The second set of census tracts was selected using an optimization method to better represent the socio-economic characteristics of the target population. Specifically, we used census data on population, income, education, and race/ethnicity to construct a composite index, and then selected the 18 census tracts that maximized the semivariance in the composite index while controlling for spatial autocorrelation among the selected tracts. Once the census tracts were selected, 250 random points were simulated within them, allocated in inverse proportion to tract population size.
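
The first-stage design described above, systematic sampling with probability proportional to size from an implicitly stratified list, can be sketched as follows. This is a minimal illustration under stated assumptions, not the GSS implementation: the tract records, their sort order, and the function name `systematic_pps` are hypothetical.

```python
import random


def systematic_pps(units, n, seed=None):
    """Draw n units by systematic sampling with probability proportional to size.

    `units` is a list of (unit_id, size) pairs that the caller has already
    sorted by the stratification variables (implicit stratification).
    A unit whose size exceeds the sampling interval can be hit more than once.
    """
    rng = random.Random(seed)
    total = sum(size for _, size in units)
    step = total / n                     # sampling interval
    start = rng.random() * step          # random start in [0, step)
    targets = [start + i * step for i in range(n)]

    selected = []
    cumulative = 0.0
    idx = 0
    for unit_id, size in units:
        cumulative += size
        # Select this unit for every target that falls inside its size range.
        while idx < n and targets[idx] < cumulative:
            selected.append(unit_id)
            idx += 1
    return selected


# Hypothetical frame: 1692 tracts with made-up populations, assumed to be
# pre-sorted by income and race categories.
frame_rng = random.Random(0)
tracts = [(f"tract_{i:04d}", frame_rng.randint(1000, 8000)) for i in range(1692)]
sample = systematic_pps(tracts, n=18, seed=1)
```

Because the list is sorted before the systematic pass, the fixed interval spreads the selections across the income and race groupings without forming explicit strata.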

For the spatial random sampling, the study domain (urban Chicago MSA) was partitioned into 400m x 400m pixels, each characterized by income, education, and race/ethnicity using the 2000 US Census data. A composite index was then constructed for each pixel. These pixels served as a pseudo-sampling frame. Although a classical sampling method was feasible, we sampled the pixels randomly while controlling for spatial autocorrelation. Specifically, no two pixels lying within the extent of local spatial autocorrelation of one another were both selected, which avoids redundant sample sites. In the optimized design, by contrast, the selected set of pixels captured the maximum semivariance in the composite index and minimized spatial autocorrelation (Fig 1) (Kumar, 2007, 2009).
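
The random design with a control for spatial autocorrelation can be illustrated with a simple rejection rule. The sketch below assumes a single fixed separation distance in place of the location-specific autocorrelation extent used in the study, and a hypothetical 50 x 50 grid of pixel centres; the function name is illustrative.

```python
import math
import random


def sample_with_min_separation(pixels, n, min_dist, seed=None):
    """Pick up to n pixel centres at random so that no two are closer than min_dist.

    `min_dist` is an assumed stand-in for the extent of local spatial
    autocorrelation; the study derives that extent from the data.
    """
    rng = random.Random(seed)
    candidates = list(pixels)
    rng.shuffle(candidates)              # random order gives random selection
    chosen = []
    for p in candidates:
        # Reject a candidate that falls within min_dist of a selected pixel.
        if all(math.dist(p, q) >= min_dist for q in chosen):
            chosen.append(p)
            if len(chosen) == n:
                break
    return chosen


# Hypothetical frame: 50 x 50 grid of 400m pixel centres (coordinates in metres).
pixels = [(col * 400.0, row * 400.0) for row in range(50) for col in range(50)]
sites = sample_with_min_separation(pixels, n=20, min_dist=2000.0, seed=42)
```

The optimized design would replace the random ordering with a search that maximizes semivariance in the composite index; that search is not shown here.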

IMPLEMENTATION: Unlike traditional social surveys, for the spatial social survey we

sample locations/areas first and then we sample respondents at/around the sampled

locations/areas (Fig 2). The spatial sampling integrated with reverse geocoding and web-mining

is implemented in four steps as illustrated in Fig 3. These steps include: selection of sample sites

using a spatial sampling method, developing a list of valid addresses at/around the sample sites,

extraction of residential units at the selected addresses and sampling residential units, and

validation of sampled residential addresses.

Step 1: Selection of sample sites using a spatial sampling design: Because population representation is an important requirement for social surveys, the first step selects a sample of locations/areas that represents the population distribution. A spatially organized sampling frame of the human population, however, is rarely available. We use high-resolution LandScan (ORNL, 2008) data to construct a pseudo-sampling frame of population (organized at 400m x 400m spatial resolution). Because socio-economic status (SES) data were not available at the spatial resolution of the LandScan population data, the coarser-resolution (tract-level) 2000 US Census data were used to impute SES at the LandScan resolution. Values from several adjacent census tracts were used to impute the value for a given pixel, even when the pixel lies completely within one census tract. This approach gives higher weight to nearer census tracts and takes into account the location of the pixel relative to the other (adjacent) census tracts. The result was a pseudo-sampling frame of population with socio-economic characteristics. The four methods of spatial sampling described above were employed to draw four sets of sample sites from this frame (Fig 1).
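
One common way to give nearer census tracts a higher weight is inverse-distance weighting. The text does not name the weighting scheme the authors used, so the sketch below is an assumption; the tract centroids, the values, and the `power` parameter are hypothetical.

```python
import math


def idw_impute(pixel_xy, tracts, power=2.0):
    """Impute a pixel value as an inverse-distance-weighted mean of tract values.

    `tracts` is a list of (x, y, value) tuples for the centroids of nearby
    census tracts. Inverse-distance weighting is an assumed scheme, chosen
    only because it weights nearer tracts more heavily.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in tracts:
        d = math.hypot(pixel_xy[0] - x, pixel_xy[1] - y)
        if d == 0.0:
            return value                 # pixel sits exactly on a centroid
        w = 1.0 / d ** power
        weighted_sum += w * value
        weight_total += w
    return weighted_sum / weight_total


# Two hypothetical tract centroids 100m and 200m from the pixel.
ses_value = idw_impute((0.0, 0.0), [(100.0, 0.0, 10.0), (200.0, 0.0, 40.0)])
```

With `power=2.0` the tract at 100m carries four times the weight of the tract at 200m, so the imputed value is pulled toward the nearer tract.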

Step 2: Sampling domain for address listing: Because a site selected by a spatial sampling method may not coincide with a respondent (or household or address), the domain, or area, within which addresses are extracted must be defined. The site domain can be demarcated in several ways: as the address closest to the site, as the pixel within which the site is located, or as the jurisdiction or census unit within which the site is located. If only one respondent is to be selected around each site and non-response is not a concern, the first option may be appropriate. Because the shapes and sizes of administrative/census units vary, local spatial autocorrelation can instead be used to determine the domain within which to select addresses. Local spatial autocorrelation determines the geographic extent within which values of the chosen attribute are similar (Kumar, 2007, 2009; Kumar, Liang, Linderman, & Chen, 2010). Because this extent can vary geographically, the domain (or area) that a site represents can vary as well (Harter, Eckman, English, & O'Muircheartaigh, 2010). For our survey in the Chicago MSA, we computed a site-specific extent using local spatial autocorrelation.

Step 3: Developing a list of addresses at/around the selected sample sites: Using Google Earth, we construct a valid list of addresses within the sampling domain of each site. The process of locating an address on the earth's surface is called geocoding (Rushton et al., 2006). For spatial social surveys, however, we start with locations and need to extract the addresses at or around them; this process is called reverse geocoding (Curtis, Mills, & Leitner, 2006). It involves two important decisions: what geographic extent (or spatial domain) to search for addresses around a sample site, and how many addresses to extract.

If the goal is to sample residential addresses, address listing by reverse geocoding alone may not be sufficient, because in commercial areas all of the addresses could be commercial. With this constraint in mind, we integrated reverse geocoding with web mining, which verifies whether valid residential units are present within the domain of a site. If the required number of residential units is not present within the domain, the domain can be expanded iteratively until a valid list of residential units is prepared.

Ideally, all addresses within the domain of a sample site would be included in the list from which respondents are drawn. This process, however, can be computationally intensive, especially in a densely populated area. The number of residential addresses to include in the list must therefore be chosen with care, depending on the number of residential units to be drawn around each sample site and the computation time available. This number can be set to 10-20 times the number of respondents to be drawn from the list. In this study, we listed 20 randomly selected valid residential addresses within the domain of each site and chose one of these 20 at random.

We use the Google Application Programming Interface (API) to reverse geocode addresses. The API allows us to pass a location (as longitude and latitude) and returns an address at or around that location. For example, passing a location with longitude ~ -91.53615 and latitude ~ 41.661335 returns 35 E. Jefferson St., Iowa City, IA 52245 (Fig 4). In addition, the API can return ten different levels of information at or around a location, including street name, city, county, state, and country. Because the location can be a business, an institution, or a residential building (such as a house, an apartment complex, or a condominium), it is important to determine whether valid residential addresses are present within the site domain. The White Pages database allows us to determine whether an address has residential units.
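
A reverse-geocoding request of this kind can be sketched as below. The sketch assumes the present-day Google Geocoding web service (the `latlng` and `key` query parameters, and the `status`, `results`, `formatted_address`, and `address_components` response fields), which may differ from the interface available when the study was conducted. The API key is a placeholder and the response payload is hand-written for illustration, not real API output.

```python
import json
from urllib.parse import urlencode

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_reverse_geocode_url(lat, lng, api_key):
    """Build a reverse-geocoding request URL for a latitude/longitude pair."""
    query = urlencode({"latlng": f"{lat},{lng}", "key": api_key})
    return f"{GEOCODE_URL}?{query}"


def parse_first_address(payload):
    """Return the first formatted address and its component levels, or None."""
    data = json.loads(payload)
    if data.get("status") != "OK" or not data.get("results"):
        return None
    top = data["results"][0]
    levels = {}
    for component in top.get("address_components", []):
        for level in component.get("types", []):
            levels[level] = component["long_name"]
    return {"address": top.get("formatted_address"), "levels": levels}


url = build_reverse_geocode_url(41.661335, -91.53615, "YOUR_API_KEY")

# Hand-written payload in the assumed response shape (not real API output).
sample_payload = json.dumps({
    "status": "OK",
    "results": [{
        "formatted_address": "35 E Jefferson St, Iowa City, IA 52245, USA",
        "address_components": [
            {"long_name": "Iowa City", "short_name": "Iowa City",
             "types": ["locality", "political"]},
            {"long_name": "Iowa", "short_name": "IA",
             "types": ["administrative_area_level_1", "political"]},
        ],
    }],
})
parsed = parse_first_address(sample_payload)
```

Sending the request (for example with `urllib.request.urlopen`) is left out so that the sketch runs without network access or credentials.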

Because there can be multiple valid addresses within the geographic extent (domain) of a sample site, these addresses must be chosen randomly unless all addresses within the extent are extracted. An ideal way to select addresses randomly is to simulate random locations within the geographic extent of a sample site and select the addresses at or around the simulated locations. In this study, we simulate random points within the domain of a sample site, find an address at or around each simulated point, validate these addresses, and then pass them to White Pages to extract the valid residential units at these addresses. Because the point locations are simulated randomly within the extent of the selected site and the residential addresses for the survey are then selected randomly, each residential address has an equal likelihood of selection without replacement. Our implementation also ensured, through a built-in constraint, that each address within the extent of a site has only one chance of selection. Proceeding iteratively, we develop a list of twenty residential addresses around each site.
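
The iterative listing procedure can be sketched as a loop over simulated points. The sketch assumes a circular site domain in planar coordinates. It takes the reverse-geocoding and residential-check steps as caller-supplied functions, and the toy versions shown here are hypothetical stand-ins for the Google and White Pages lookups.

```python
import math
import random


def random_point_in_circle(cx, cy, radius, rng):
    """Return a point drawn uniformly from a circle (assumed domain shape)."""
    r = radius * math.sqrt(rng.random())   # sqrt keeps the density uniform
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return cx + r * math.cos(theta), cy + r * math.sin(theta)


def build_address_list(site, radius, reverse_geocode, is_residential,
                       target=20, max_tries=5000, seed=None):
    """List up to `target` distinct residential addresses around a site."""
    rng = random.Random(seed)
    seen = set()                            # each address gets one chance only
    listed = []
    for _ in range(max_tries):
        if len(listed) == target:
            break
        x, y = random_point_in_circle(site[0], site[1], radius, rng)
        address = reverse_geocode(x, y)
        if address is None or address in seen:
            continue
        seen.add(address)
        if is_residential(address):
            listed.append(address)
    return listed


def toy_reverse_geocode(x, y):
    """Hypothetical lookup: snap the point to a 50m lot grid."""
    return f"lot_{round(x / 50)}_{round(y / 50)}"


def toy_is_residential(address):
    """Hypothetical check: treat roughly two thirds of lots as residential."""
    _, i, j = address.split("_")
    return (int(i) + int(j)) % 3 != 0


addresses = build_address_list((1000.0, 1000.0), 500.0,
                               toy_reverse_geocode, toy_is_residential, seed=7)
```

If the loop ends with fewer than `target` addresses, the caller would enlarge `radius` and try again, which corresponds to the domain expansion described earlier.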

Step 4: Selecting the required sample from the address list: The proposed method produces a list of addresses (Ni) at or within the domain of a site. Once the list is prepared, any conventional sampling method can be employed to select the desired number of residential units (ni) around each site. The choice of ni, however, can be influenced by a number of factors. For our survey, we inflated ni to account for anticipated non-response and ineligibility (non-residential addresses and households that did not speak a supported language) (Harter et al., 2010). Assuming a 10% response rate and a 70% eligibility rate, obtaining one completed interview per site requires sampling ni = 1/(0.70 × 0.10) ≈ 14.3, rounded up to 15, households around each site.
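
The inflation arithmetic can be checked with a few lines of code. The function name and the interpretation of the numerator as one completed interview per site are assumptions made for illustration.

```python
import math


def required_sample_size(target_completes, response_rate, eligibility_rate):
    """Inflate the target for non-response and ineligibility, rounding up."""
    return math.ceil(target_completes / (response_rate * eligibility_rate))


# One completed interview per site at a 10% response and 70% eligibility rate.
n_i = required_sample_size(1, response_rate=0.10, eligibility_rate=0.70)
```

Rounding up rather than to the nearest integer keeps the expected number of completed interviews at or above the target.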

For parameter estimation and for deriving robust inferences from the survey data, the probability of sampling each address must be known. Let k=1,…,Ni index the residential addresses, with geo-location (xk,yk), within the domain (the area that site i represents). Let j=1,…,Li index the LandScan pixels that the ith sample site represents (including the pixel in which the sample site is located). Suppose MOSj is the measure of size of pixel j, that is, the number of addresses within pixel j. The probability that an address will be selected from the jth pixel can be expressed as

$$P(j \mid i) = \frac{MOS_j}{\sum_{j'=1}^{L_i} MOS_{j'}} \quad \text{for } j = 1, \ldots, L_i$$

Since the number of valid addresses present within each pixel (Nji) can vary, the overall selection probability of an address within the jth pixel can be adjusted by the number of valid addresses present in that pixel, as

$$\frac{MOS_j}{\sum_{j'=1}^{L_i} MOS_{j'}} \times \frac{1}{N_{ji}}$$
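
A small numerical sketch may help fix ideas. It assumes that the selection probability of an address is the pixel's share of the total measure of size, MOSj divided by the sum of MOS over the site's pixels, multiplied by 1/Nji for the Nji valid addresses in that pixel. That reading of the formulas and all the numbers below are assumptions for illustration.

```python
def address_selection_probabilities(mos, n_valid):
    """Return the per-address selection probability for each pixel of one site.

    mos[j]     : measure of size of pixel j (assumed: number of addresses)
    n_valid[j] : number of valid addresses found in pixel j
    Assumed form: (mos[j] / sum(mos)) * (1 / n_valid[j]).
    """
    total = sum(mos)
    probabilities = []
    for size, count in zip(mos, n_valid):
        if count == 0:
            probabilities.append(0.0)    # no valid address to select
        else:
            probabilities.append((size / total) * (1.0 / count))
    return probabilities


# Hypothetical site represented by three pixels.
probs = address_selection_probabilities(mos=[10, 30, 60], n_valid=[5, 10, 20])
```

Under this assumed form, the probabilities summed over every valid address in the site's pixels equal one, and the inverse of each probability could serve as a base sampling weight.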

Step 5: Validation of the selected sample using the US Postal Service database: To demonstrate the validity of reverse geocoding, we created a list of twenty addresses around each of 951 sample sites (chosen using the four methods of spatial sampling described above) and drew one residential unit at random from each list. We validated these addresses using AccuZIP6, a comprehensive mailing list management system (AccuZIP, 2010). AccuZIP6 is certified by the United States Postal Service (USPS) under its Coding Accuracy Support System (CASS) and Presort Accuracy Verification and Evaluation (PAVE) programs. Of these 951 addresses, 938 (98.6%) were valid residential addresses. The thirteen addresses that were not validated are shown in Fig 5. As the figure shows, these addresses are distributed randomly, with no evident systematic bias. Further investigation of these addresses provided insight into the limitations of reverse geocoding. The thirteen addresses failed to match for four reasons: outdated data on dismantled buildings (Fig 6a), wrong nomenclature for building type (an apartment building listed as a town house) (Fig 6b), a mismatch in address format between the USPS database and the White Pages database (Fig 6c), and duplicate listing of an address in multiple towns (Fig 6d). Although the number of invalid addresses retrieved using reverse geocoding is small, it remains important to validate all addresses and to evaluate the reasons for invalid ones. Doing so serves two purposes: it gives a sense of the completeness of the addresses in the list, and it helps us understand why addresses are invalid.

DISCUSSION: A finite and complete enumeration listing of the population in question is critically important for social surveys. Field listing and existing database lists (such as the US Census and USPS addresses) are widely used and accepted listing methods. Using publicly available datasets, this paper presents a novel method of constructing a list of addresses. Given computation and time constraints, we do not develop the entire list of residential addresses. Simulating random locations within the extent of a sample site and then identifying the residential address(es) closest to each simulated point ensured an equal likelihood of selection for each address. The sampling strategy also ensured that each address within the extent of a site has only one chance of selection. The proposed method uses two datasets, Google Earth (Google, 2010) and White Pages (WhitePages, 2010), and can develop a list of addresses, residential units, and household units by geographic location or by census/administrative unit. Although the method can develop a list of residential addresses for any geographic and/or administrative unit and can be used for traditional social surveys, its unique benefits are realized when it is combined with spatial sampling methods.

The proposed method has several important implications for social surveys. First, it can help overcome the problems of over- and under-coverage of the population from which traditional listing methods suffer. Over-coverage involves listing units that are not in the area of interest; under-coverage involves missing residential units that are of interest for the study. If residential units of interest are missing, or undesired units are included unintentionally, sampling bias can result. Field listing has been accepted as a standard method for survey research. It involves training an enumerator to identify residential units in the study area. The enumerator is instructed to count a variety of residential units, including multi-family dwelling units, for example by counting mailboxes, doorbells, or utility meters. Field-listed data are usually validated against census data.

Besides its well-known problem of under-coverage, field listing is expensive and time consuming. Researchers have therefore recently begun to use the database listing method. For example, the United States Postal Service (USPS) maintains a database of all mail delivery points in the country, and commercial firms are licensed to use it (AccuZIP, 2010; Valassis, 2010). The National Opinion Research Center (NORC) uses these data to draw samples for the GSS. These services can provide a list of mailing addresses (with names in many cases) by different census units, including the census block. Metadata about the type of each address are also available. Although a census block is a fine spatial unit, limited locational accuracy in geocoding these data can cause under-coverage, because addresses that cannot be geocoded are eliminated. In addition, the cost of acquiring these data can be an important issue.

As discussed earlier, the proposed method of address listing augments the scope of spatial sampling for social surveys. Spatial sampling can ensure better spatial coverage and population representation than traditional sampling methods. Because the enumeration list is drawn using a geographic extent, it can help assess the population (by SES and demographic characteristics) that a list of residential addresses represents. The availability of precise locations also allows the survey data to be contextualized by socio-physical environment, such as the number of recreational facilities, fast food stores, schools, or pharmacies within the identified spatial extent of a sample site. This, in turn, can attract interest in these data from researchers in a wide variety of disciplines, including public health, economics, ecology, and sociology.

Although reverse geocoding integrated with White Pages offers a unique and unprecedented opportunity to develop a location-based address list, the methodology has several limitations. First, the list is developed based on a location, a geographic extent, or a boundary, which means that locations or geographic units must be defined before the list can be developed. Although we suggest using local spatial autocorrelation (in the chosen attribute) to define this extent (or site domain), pre-defined census or administrative units can also serve as the site domain. Extracting a complete list of residential addresses within the site domain can be computationally intensive, especially for densely populated urban areas. To make our application computationally efficient, we simulate random points within the site domain, reverse geocode these points, and then identify the valid residential addresses closest to the random points. Because the address list comprises addresses around the random points and the final respondents are drawn randomly from the list, each address has an equal likelihood of selection.

The reliability of enumeration units developed using the proposed method is largely dictated by
the quality and completeness of the georeferenced address dataset. If these data are not kept up
to date, missing new addresses and lingering obsolete or duplicate addresses could cause
under-coverage and over-coverage, respectively. Although 98.7% of the sampled addresses were
valid residential addresses, a few were not, because some buildings (or addresses) had been
dismantled. In addition, access to these publicly available datasets is limited: for example,
Google Earth allows extraction of 15,000 addresses/day, and White Pages allows two
addresses/second and 200 addresses/day. A commercial license to these datasets may help overcome
these limits. The validation results presented in this study apply to the Chicago MSA, and the
quality and completeness of the georeferenced data used are likely to vary geographically. Future
research will be geared towards implementing and validating this methodology of enumeration
listing nationally.
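
Access limits of the kind quoted above (two addresses/second and 200 addresses/day) can be respected with a simple client-side throttle. The sketch below is one possible approach and not a description of any provider's interface: the class and function names are hypothetical, the daily quota is assumed to be a 24-hour window starting at the first request, and `lookup` stands for whatever address query the application performs.

```python
import time


class QuotaThrottle:
    """Enforce a minimum spacing between requests and a cap per 24-hour window."""

    def __init__(self, per_second, per_day, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / per_second
        self.per_day = per_day
        self.clock = clock  # injectable so the logic can be tested without waiting
        self.sleep = sleep
        self.window_start = None
        self.used_in_window = 0
        self.last_request = None

    def acquire(self):
        """Return True once a request may be sent, False if the daily quota is spent."""
        now = self.clock()
        if self.window_start is None or now - self.window_start >= 86400:
            self.window_start = now  # assumption: window opens at the first request
            self.used_in_window = 0
        if self.used_in_window >= self.per_day:
            return False
        if self.last_request is not None:
            wait = self.min_interval - (now - self.last_request)
            if wait > 0:
                self.sleep(wait)
                now = self.clock()
        self.last_request = now
        self.used_in_window += 1
        return True


def lookup_all(points, lookup, throttle):
    """Query points in order; return (results, points left over for the next window)."""
    results = []
    for i, point in enumerate(points):
        if not throttle.acquire():
            return results, points[i:]
        results.append(lookup(point))
    return results, []
```

With `QuotaThrottle(per_second=2, per_day=200)`, a batch larger than the daily quota is processed up to the cap and the remainder is returned so that it can be resumed later.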

Intensifying competition is putting pressure on commercial firms to maintain complete and
reliable datasets and to develop more advanced features, such as API services, to attract
customers and businesses. A few years ago, Google Earth and Mapquest were the only publicly
available sources of georeferenced addresses. In recent years, however, Microsoft and Yahoo! have
also ventured into this area and are making similar datasets available. Since georeferenced
address datasets are still lacking in many other countries, especially developing countries, the
proposed method of listing is currently of little use there. With modern advances in inexpensive
geospatial technologies, however, it is likely that georeferenced address databases can be
developed for these countries and that the proposed method can be extended to them as well.

Fig 1: Sample sites selected using four spatial sampling methods in urban Chicago, MSA.

Fig 2: A contrast between social surveys and spatial social surveys.

[Flow diagram with two tracks, "Social Survey" and "Spatial Social Survey". Box labels: Defining
and characterizing population; Characterizing Sampling Domain; Sampling Frame; Pseudo Sampling
Frame; Sampling Method; A sample of locations; Reverse Geocoding; Enumeration list of addresses
at/around sample locations; A sample of Individuals/Addresses/Households; A Sample of
Addresses/Buildings/]

Fig 3: Schematics of integrating reverse geocoding and white pages.

[Flow diagram. Box labels: Sampling Method; Contextualized Spatial Sampling; Pseudo Frame; A
sample of geo-referenced locations; Reverse Geocoding Application; Location Evaluation; Invalid
Location; Location adjustment; Valid Street Addresses at/around Sample Sites; Street Address
Mining Application; Publicly available address database (white pages/yellow pages); Addresses
validation and filtering; List of residential addresses at/around sample sites; Final sample of
residential addresses around sample sites]

Fig 4: An example of residential address listing around a sample site.

Fig 5: Spatial distribution of invalid addresses in Urban Chicago, MSA.

Fig 6: Some examples of invalid addresses.

a. Dismantled building: “468 Duane St, Glen Ellyn, IL, 60137” is a valid address, but it
represents a dismantled building.

b. Wrong nomenclature: The address above seems to be an apartment building, but it is listed as a
house.

c. Address format: The address is “1 N640 State Route 59, West Chicago, IL, 60185”. In Google
Maps the address is “1N640 Rte 59, West Chicago, IL 60185”. The WhitePages address is “1 N640
State Route 59 West Chicago, IL 60185”.

d. Double listing: some addresses are listed in two towns, e.g. “43 Ickenham Ln, Elgin, IL 60124”
and “43 Ickenham Ln, Campton Hills, IL 60124”.