This tutorial will guide the user through the creation of population-weighted wealth maps and statistics for administrative areas of interest. After an explanation of micro-level wealth and population estimates, the user will learn to obtain this data for a country of interest alongside the boundaries for the country's administrative areas. These population and wealth data are next joined by location, aggregated to administrative areas, and the wealth data weighted by population. The user is finally guided through validation of their results and exporting the results as maps and tabular statistics. Throughout, the user will be introduced to widely useful Python packages and code as well as Google Earth Engine, a powerful tool for analyzing geographic and satellite imagery.
This tutorial follows the code and techniques developed by Emily Aiken and Joshua Blumenstock at the University of California, Berkeley. It provides an introduction to the data sources and code required to create population-weighted wealth measures and maps for any country or region of interest.
For policymakers and NGOs trying to assist the most vulnerable with limited resources, it is of primary importance to have an accurate view of their populations. Traditionally, surveys have been the only way to obtain this information, but the time and expense required for surveys - as well as the speed at which surveys can become outdated due to changing circumstances - make it challenging to identify all individuals in need.
However, the increasing prevalence of technologies which collect data at scale - including satellites, mobile phones, and WiFi - is creating new possibilities for the use of this data. Machine learning models, which can consider and make sense of data on a scale impossible for humans, are being used to combine these observational data with traditional survey responses to expand their usefulness. Given data such as satellite imagery, phone networks, and WiFi connectivity alongside ground-truth data collected by surveys, a model can learn which features best identify characteristics such as wealth and population, and then create predictions for other locations where survey information is unavailable.
For the wealth data component of this tutorial, we'll use an innovative geographic wealth dataset created from non-traditional sources. The Relative Wealth Index utilizes satellite imagery as well as mobile phone data, topographic maps, and Facebook connectivity data alongside wealth surveys to develop fine-grained (2.4km resolution) estimates of the relative standard of living within countries. Developed as part of a collaboration between UC Berkeley’s Center for Effective Global Action and Facebook’s Data for Good, these wealth estimates for 93 low and middle income countries are made freely available to aid in the work of policymakers and NGOs. See Micro-Estimates of Wealth for all Low- and Middle-Income Countries by Guanghua Chi, Han Fang, Sourav Chatterjee, and Joshua Blumenstock for details of the data creation and validation. Recent work such as this working paper by Emily Aiken, Suzanne Bellue, Dean Karlan, Christopher Udry, and Joshua Blumenstock demonstrates the value of such ML-derived wealth maps.
To overcome the limitations of survey data, many are working to create comprehensive, granular, and accurate population estimates for countries across the globe. One such source is the High Resolution Population Density Maps and Demographic Estimates, which was created as a collaboration between the Center for International Earth Science Information Network (CIESIN) at Columbia University and Facebook Data for Good. These provide estimates of relative population for most countries in the world at 30m resolution and are developed from satellite imagery and census data.
WorldPop's Population Count datasets are another source of nearly-global population data. WorldPop provides an entirely open-access collection of spatial demographic datasets for Central and South America, Asia, and Africa. There are multiple different Population Count datasets provided through WorldPop, with data for individual countries and globally available using either a top-down or bottom-up modelling approach. The top-down approach utilizes collected census data - typically at the level of a particular administrative area - and combines these with geospatial datasets to disaggregate population estimates to 100m or 1km resolution. These have the benefit of rolling up to administrative level counts which match existing census counts, though countries without a recent census or countries with high mobility will likely have inaccuracies in these counts that will be reflected in the WorldPop estimates. The bottom-up approach instead uses all recent surveys available for an area alongside geospatial datasets to build estimated population counts at 100m resolution. These can provide greater accuracy where census data is outdated or where ground conditions are rapidly changing. For more information on the difference between top-down and bottom-up maps, see WorldPop's explanation of the differences.
A final consideration if using the top-down estimates is whether to use data built with a constrained or unconstrained modelling approach. The unconstrained approach produces estimates over the entire land surface, while the constrained approach first limits its estimation to areas already mapped as containing settlements. The constrained approach can result in more accurate population allocation for areas of no population or high population as compared to the unconstrained approach, but it is dependent on accurate settlement mapping, which is not always available. For further discussion on the benefits and costs to using each method, see WorldPop's explanation of the differences.
Compared to the High Resolution Population Density Maps described above, WorldPop's Population Counts have the additional benefit of providing population data for different years, from 2000-2020. This can be useful if performing any historical analyses requiring population values.
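To make the top-down idea concrete, here is a toy sketch of disaggregating one administrative area's census total across grid cells. It is not WorldPop's actual model: the census total, the per-cell weights, and the settlement mask are all made-up values chosen for illustration, and the proportional split stands in for a far more sophisticated statistical procedure.

```python
import numpy as np

# Hypothetical inputs for a single administrative area
census_total = 1000.0                                # census count for the area
weights = np.array([1.0, 2.0, 4.0, 1.0, 2.0])        # assumed per-cell weights from geospatial covariates
settled = np.array([False, True, True, False, True])  # assumed settlement mask

def disaggregate(total, cell_weights):
    """Split an area total across grid cells in proportion to their weights."""
    return total * cell_weights / cell_weights.sum()

# Unconstrained: every cell on the land surface can receive population
unconstrained = disaggregate(census_total, weights)

# Constrained: cells outside mapped settlements get a weight of zero first
constrained = disaggregate(census_total, np.where(settled, weights, 0.0))

print(unconstrained)
print(constrained)
print(unconstrained.sum(), constrained.sum())
```

In both variants the cell values add back up to the census total, which is the property that lets top-down estimates match existing administrative counts. The constrained variant only moves people out of cells that the settlement layer marks as empty, so its quality depends on that layer being right.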
It is often the case that countries are governed and services provided according to hierarchical administrative areas. Recognizing and planning interventions around these areas is thus often useful and necessary for leaders both within and external to each country. There are a few organizations which aim to produce accurate, updated geographic datasets for administrative areas which are available for the use of researchers and policymakers. GADM is one of the most comprehensive of these sources, providing maps of all countries and associated sub-divisions. The Food and Agriculture Organization's (FAO's) Global Administrative Unit Layers (FAO GAUL) also provides geographic boundaries for administrative units across many countries in the world, with the goal of creating an accurate, standardized source of historical and current administrative areas.
In geospatial data analysis, data can be classified into two categories: raster and vector data. A graphic comparison between raster and vector data can be found in the World Bank Nighttime Lights Tutorial module 2, section 1.
In this tutorial, we will use vector and raster data. Geospatial data in vector format are often stored in a shapefile, a popular format for storing vector data developed by ESRI. The shapefile format is actually composed of multiple individual files which make up the entire data. At a minimum, there will be 3 file types included with this geographic data (.shp, .shx, .dbf), but there are often other files included which store additional information. In order to be read and used as a whole, all file types must have the same name and be in the same folder. Because the structure of points, lines, and polygons are different, each shapefile can only contain one vector type (all points, all lines, or all polygons). You will not find a mixture of point, line, and polygon objects in a single shapefile, so in order to work with these different types in the same analysis, multiple shapefiles will need to be used and layered. For more details on shapefiles and file types, see this documentation.
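Because a shapefile only works when its companion files sit side by side under the same name, it can be worth checking for them before trying to read the data. The helper below is a minimal sketch using only the standard library; the function name and the example file name are hypothetical, and the check is case-sensitive on file systems that distinguish case.

```python
from pathlib import Path

# The three file types every shapefile needs at a minimum
REQUIRED_EXTENSIONS = ('.shp', '.shx', '.dbf')

def missing_shapefile_parts(shp_path):
    """Return the required extensions that are absent next to shp_path.

    Companion files must share the .shp file's name and folder, so each one
    is looked up by swapping the extension on the same path.
    """
    shp_path = Path(shp_path)
    return [ext for ext in REQUIRED_EXTENSIONS
            if not shp_path.with_suffix(ext).exists()]

# Hypothetical file name; an empty list means all required parts are present
print(missing_shapefile_parts('jordan_admin1.shp'))
```

An empty list means the shapefile is complete enough to open; any extension in the result points to a file that was lost when the data was copied or unzipped.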
Raster data, on the other hand, is stored in Tagged Image File Format (TIFF or TIF). A GeoTIFF is a TIFF file that follows a specific standard for structuring meta-data. The meta-data stored in a TIFF is called a tif tag and GeoTIFFs often contain tags including spatial extent, coordinate reference system, resolution, and number of layers.
More information and examples can be found in sections 3 & 4 of the Earth Analytics Course.
One option we'll consider for sourcing administrative areas will be Google Earth Engine. For all necessary Python setup and an introduction to our use of the GEE Python API, see the World Bank Nighttime Light Tutorial, module 2 sections 2-5. In particular, before proceeding you will need to have geemap installed on your machine, and you will need to apply for a Google Earth Engine account here. It may take a day or longer for your Google Earth Engine account to be granted access.
Two of the primary packages we'll be using, Pandas and GeoPandas, must be installed according to their installation instructions: Pandas Installation and GeoPandas Installation. If you're on Windows, GeoPandas installation can occasionally be temperamental - using an environment, as described in the World Bank Nighttime Lights Tutorial, can often circumvent any issues, but if you're still having problems, there are a number of guides online, such as this Practical Data Science guide or this Medium post by Nayane Maia, which provide installation help. Using Windows Subsystem for Linux (WSL) can also make working with tricky packages like GeoPandas easier.
For this tutorial, we'll demonstrate the process of creating population-weighted wealth estimates for all administrative levels in the country of Jordan. This process can be replicated for any country or region of interest, and the sources of data can be swapped as appropriate to use the most accurate sources of population, wealth, and administrative areas.
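Before working with real data, it may help to see what "population-weighted" means on a toy table. The sketch below assumes grid cells have already been assigned to administrative areas; the column names `admin_area`, `rwi`, and `population` and all values are invented for illustration and are not taken from the Jordan data.

```python
import pandas as pd

# Hypothetical grid cells, each already matched to an administrative area
cells = pd.DataFrame({
    'admin_area': ['A', 'A', 'B', 'B'],
    'rwi':        [1.0, -1.0, 0.5, 0.0],
    'population': [100, 300, 50, 50],
})

def population_weighted_rwi(df):
    """Population-weighted mean RWI per area: sum(rwi * pop) / sum(pop)."""
    weighted_sum = (df['rwi'] * df['population']).groupby(df['admin_area']).sum()
    total_pop = df.groupby('admin_area')['population'].sum()
    return weighted_sum / total_pop

weighted = population_weighted_rwi(cells)
unweighted = cells.groupby('admin_area')['rwi'].mean()
print(weighted)
print(unweighted)
```

In area A the simple average of the two cells is 0, but three quarters of the people live in the poorer cell, so the weighted value is pulled down to -0.5. That gap is the reason for weighting: the statistic describes the wealth of the typical resident, not the typical patch of land.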
The relative wealth indices created by Chi et al., which we'll use in this tutorial, can be downloaded as csv files by country from the Humanitarian Data Exchange. To download, find the file associated with your country of interest ('jor_relative_wealth_index.csv' for this example) and select 'download'.
Save the relative wealth index file in the same folder as this Python script for easiest use. Once downloaded, we can read our file into Python using the Pandas Python package and convert it to geographic data using GeoPandas. jor_relative_wealth_index.csv is the csv file of Jordan downloaded from the Humanitarian Data Exchange - to utilize a different country's data, replace the file path with the path and name of your file. (Note: only the name of the file is included in this example because the csv file is in the same folder as this Python code. This is leveraging relative paths instead of absolute paths - for more information if you're unfamiliar with relative paths, see this Earth Data Science tutorial).
import pandas as pd
import geopandas as gpd

rwi = pd.read_csv('jor_relative_wealth_index.csv')
# Use latitude and longitude to create geographic points
rwi = gpd.GeoDataFrame(rwi, geometry=gpd.points_from_xy(rwi['longitude'], rwi['latitude']))
rwi.set_crs('epsg:4326', inplace=True)
| | latitude | longitude | rwi | error | geometry |
|---|---|---|---|---|---|
| 0 | 32.110495 | 35.804443 | 0.602 | 0.548 | POINT (35.80444 32.11050) |
| 1 | 31.700129 | 35.584717 | 0.454 | 0.555 | POINT (35.58472 31.70013) |
| 2 | 31.400535 | 35.716553 | -0.189 | 0.498 | POINT (35.71655 31.40053) |
| 3 | 31.942840 | 35.848389 | 1.432 | 0.487 | POINT (35.84839 31.94284) |
| 4 | 32.184911 | 36.331787 | 0.464 | 0.549 | POINT (36.33179 32.18491) |
| ... | ... | ... | ... | ... | ... |
| 2782 | 32.685619 | 35.958252 | 0.103 | 0.564 | POINT (35.95825 32.68562) |
| 2783 | 30.817346 | 36.068115 | -0.239 | 0.424 | POINT (36.06812 30.81735) |
| 2784 | 30.078601 | 35.211182 | -0.258 | 0.544 | POINT (35.21118 30.07860) |
| 2785 | 31.475524 | 36.002197 | -0.215 | 0.489 | POINT (36.00220 31.47552) |
| 2786 | 31.886886 | 35.584717 | 0.284 | 0.546 | POINT (35.58472 31.88689) |

2787 rows × 5 columns
We can see a snapshot of our relative wealth index in tabular form above. We can view this data as a map using matplotlib, a Python package widely used for creating visualizations.
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1, figsize=(10,10))
# column='rwi' defines which column from our dataframe to color the points by
rwi.plot(ax=ax, column='rwi', legend=True)
ax.axis('off')
ax.set_title('Relative Wealth Index Values')
plt.show()
While we don't yet have a background map defining the country's borders, we can see the fine grain of the relative wealth index data in the map of Jordan above. The darker colors indicate the lowest wealth areas, relative to the rest of the country, while the yellow points represent the highest wealth areas.
The population maps from WorldPop can be accessed via their API, which can enable greater reproducibility and ease of use, especially if multiple datasets are needed. The gather_worldpop_data() function below will allow the user to access data from WorldPop via the API by simply selecting the desired WorldPop data (population for this example), country, and year they are interested in. However, if desired, the datasets can also be downloaded as GeoTIFF files from the WorldPop website.
You'll need to install the packages requests, which will allow us to use WorldPop's API to access the data, and rioxarray, which will allow us to read in the GeoTIFF file we download from WorldPop and convert it to a GeoPandas GeoDataFrame. To install these packages, follow their respective installation instructions: requests Installation and rioxarray Installation.
import requests
import rioxarray

def gather_worldpop_data(data_type, country_iso=None, year=2015):
    """
    Build the url to pull WorldPop data from the API

    Inputs:
        data_type (string): Data type options are 'pop' (population),
            'births', 'pregnancies', and 'urban_change'.
        country_iso (string): The 3-letter country code, if desired.
            Default will be global.
        year (int): the 4-digit year of interest for data. Default will be 2015.

    Return (str, rioxarray DataArray): returns the name of the .tif file
        downloaded onto your computer containing the data and the DataArray
        containing the population counts read in using rioxarray.
    """
    # Build the API url according to user selection
    url_base = "https://www.worldpop.org/rest/data"
    url = url_base + '/' + data_type + '/wpgp'
    if country_iso:
        url = url + '?iso3=' + country_iso

    # Request the desired data
    records = requests.post(url).json()['data']

    # Filter by year, comparing as strings. Indexing with the expression
    # ['popyear' == year] would evaluate to [False], i.e. [0], and always
    # return the first record regardless of the year requested.
    matches = [r for r in records if str(r.get('popyear')) == str(year)]
    if not matches:
        raise ValueError('No WorldPop record found for year ' + str(year))

    # Obtain exact .geotiff file name for the desired data
    # ('files' may be a single URL or a list of URLs; take the first if a list)
    geotiff_file = matches[0]['files']
    if isinstance(geotiff_file, (list, tuple)):
        geotiff_file = geotiff_file[0]
    print('Obtaining file', geotiff_file)
    geotiff_data = requests.get(geotiff_file)

    # Fall back to 'global' in the file name when no country is given
    file_name = 'worldpop_' + (country_iso or 'global') + '_' + str(year) + '.tif'
    print('Writing to', file_name)
    with open(file_name, 'wb') as f:
        f.write(geotiff_data.content)

    # Read in the WorldPop data as a GeoTIFF
    worldpop_raster = rioxarray.open_rasterio(file_name)
    return file_name, worldpop_raster

jordan_pop_file, jordan_pop = gather_worldpop_data('pop', 'JOR', 2019)

Obtaining file https://data.worldpop.org/GIS/Population/Global_2000_2020/2019/JOR/jor_ppp_2019.tif
Writing to worldpop_JOR_2019.tif
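Since the API returns one record per year, it is worth confirming that the record being downloaded is really the year that was asked for. The sketch below runs on a mock response rather than the live API: the `example.org` URLs are placeholders, and the assumption that `popyear` arrives as a string and `files` as a list is exactly that, an assumption. The helper name `select_worldpop_record` is hypothetical.

```python
def select_worldpop_record(records, year):
    """Return the first record whose 'popyear' matches year (compared as strings)."""
    matches = [r for r in records if str(r.get('popyear')) == str(year)]
    if not matches:
        raise ValueError('No record found for year ' + str(year))
    return matches[0]

# Mock response shaped like the 'data' list used above (placeholder URLs)
mock_response = {'data': [
    {'popyear': '2000', 'files': ['https://example.org/jor_ppp_2000.tif']},
    {'popyear': '2019', 'files': ['https://example.org/jor_ppp_2019.tif']},
]}

record = select_worldpop_record(mock_response['data'], 2019)
print(record['files'][0])

# Pitfall: 'popyear' == 2019 is simply False, and False indexes a list as 0,
# so this expression silently returns the first (year 2000) record.
wrong = mock_response['data']['popyear' == 2019]
print(wrong['popyear'])
```

A quick check that the year appears in the downloaded file's URL is an easy guard against analysing the wrong year's population without noticing.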
We can now convert this data to a GeoDataFrame to be able to easily work with it in Python and join it with our other datasets. We'll do this according to the process described in this Stack Overflow post.
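# Drop the single band dimension and the non-data coordinates
# (drop_vars replaces the deprecated DataArray.drop)
jordan_pop = jordan_pop.squeeze().drop_vars(['spatial_ref', 'band'])
jordan_pop.name = 'population'
# One row per raster cell with columns y, x, and population
worldpop_pop_df = jordan_pop.to_dataframe().reset_index()
worldpop_pop_df.head()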
We now have our coordinates - latitude and longitude - along with the associated population values. We can see some data cleansing is needed to remove coordinates with no population value (denoted by -99999.0 population), and we'll also want to convert these coordinates into a geographic type we can use with GeoPandas.
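# keep only cells with a positive population
# (this removes the -99999.0 no-data cells as well as cells with zero population)
worldpop_pop_df = worldpop_pop_df[worldpop_pop_df['population'] > 0].copy()
# convert lat/long to point geometry and build the GeoDataFrame;
# WorldPop rasters use WGS84 (EPSG:4326), the same CRS set on the wealth data
worldpop_pop_gdf = gpd.GeoDataFrame(
    worldpop_pop_df,
    geometry=gpd.points_from_xy(worldpop_pop_df['x'], worldpop_pop_df['y']),
    crs='EPSG:4326',
)
worldpop_pop_gdf.head()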
| | y | x | population | geometry |
|---|---|---|---|---|
| 4606 | 33.367500 | 38.795833 | 0.003340 | POINT (38.79583 33.36750) |
| 4607 | 33.367500 | 38.796667 | 0.003368 | POINT (38.79667 33.36750) |
| 9818 | 33.366667 | 38.794167 | 0.003384 | POINT (38.79417 33.36667) |
| 9819 | 33.366667 | 38.795000 | 0.003340 | POINT (38.79500 33.36667) |
| 9820 | 33.366667 | 38.795833 | 0.003384 | POINT (38.79583 33.36667) |
Excellent, we can now use this population data with our relative wealth values to create our population-weighted wealth estimates. As a final step, let's view these population values on a map.
Warning: This is a very slow, intensive map to build due to the gran