This tutorial draws inspiration for the analyses of the Subways data from “Subways and Urban Growth: Evidence from Earth” (2018), by Marco Gonzalez-Navarro and Matthew Turner. However, it does not directly follow the methodology used in this paper for all calculations. As such, the user should not expect all values and results to align directly with Gonzalez-Navarro and Turner’s.
These subways data provide the location and opening date of all subway stations worldwide for which records are available. In these data, a subway is defined as ‘as an electric-powered urban rail system isolated from interactions with automobile traffic and pedestrians’ (Gendron-Carrier et al. 2021). Under this definition, heavy rail commuter lines and streetcars are excluded to focus on intra-city subway systems. These records were compiled manually for use in Subways and Urban Growth: Evidence from Earth, Gonalez-Navarro and Turner (2018), using a variety of online sources including http://www.urbanrail.net/. The initial dataset was created between January 2012 and February 2014 by Farhan Yahya, Mahdy Saddradini, Mohamed Salat, and Fern Ramoutar, and it was subsequently updated in 2020 to include data through December 2017. Latitude, Longitude, Station Name, Subway Line, Country, City, and Year Completed are included in the main dataset.
These data were used further to approximate the route of each subway line by connecting stations along the same line by their shortest route. The geographic representations of these routes, stored individually at 5-year intervals from 1935-2000, are provided for use as well.
Subway ridership was also compiled for these public transit systems. While ridership is not available for all systems, the team obtained data for 77 subway and 40 bus systems. These data were collected from a variety of transit organizations and government sources, and they include counts of total public transit rides, heavy rail rides, bus rides and light rail rides by year, country, and city for dates between 1963 and 2014.
In geospatial data analysis, data can be classified into two categories: raster and vector data. A graphic comparison between raster and vector data can be found in this page.
More information and examples can be found in sections 3 & 4 of the Earth Analytics Course.
In this tutorial, we will primarily use vector data. Geospatial data in vector format are often stored in a shapefile, a popular format for storing vector data developed by ESRI. The shapefile format is actually composed of multiple individual files which make up the entire data. At a minimum, there will be 3 file types included with this geographic data (.shp, .shx, .dbf), but there are often other files included which store additional information. In order to be read and used as a whole, all file types must have the same name (eg, subway_stations2.shp, subway_stations2.shx) and be in the same folder. For more details on shapefiles and file types, see this documentation.
Because the structure of points, lines, and polygons are different, each shapefile can only contain one vector type (all points, all lines, or all polygons). You will not find a mixture of point, line, and polygon objects in a single shapefile, so in order to work with these different types in the same analysis, multiple shapefiles will need to be used and layered. We’ll see an example of this in section 2.
To get started with R, we provide instructions on how to download and install R on your computer. R is an open source software, which means users like you can also inspect, modify, and improve its source code.
RStudio is an integrated development environment for R. R provides the engine for running code, while RStudio is a user friendly control panel to perform various tasks. RStudio facilitates R code writing and debugging and provides tools for workspace management. RStudio can be downloaded from the RStudio IDE page.
There are numerous posts, tutorials, and courses on the internet. Once you have installed R and RStudio, pick any of the following resources to get familiar with R:
In order to perform data manipulation, we need to attach packages. The first step is downloading R packages from CRAN. For this exercise, we are going to use the packages tidyverse and sf for data manipulation, ggplot and terra for creating visualizations, haven for reading in Stata data, knitr and stargazer for creating tables, and lfe for running our usage example regression. We’ll also use remotes to download the package geodata to obtain background maps for our subways data. To download these packages, in the R or RStudio console, type the following code:
If you are not familiar with the tidyverse workflow, please refer to the R for Data Science book we suggested in the previous section.
We now attach these packages.
library(tidyverse) library(sf) library(ggplot2) library(terra) library(geodata) library(haven) library(knitr) library(stargazer) library(lfe) remotes::install_github("rspatial/geodata")
The Subways data are stored as shapefiles and available to download as a 7Z file here. 7Z files are compressed using the open-source 7-Zip tool and typically require a third-party app to access. There are many apps available to open 7Z files, but an easy option for Windows users is to download the 7-Zip app directly. MacOS users can use a tool like the Unarchiver. For a more detailed description of working with 7Z files, see this tutorial for Windows or this guide for macOS.
Within the subway_census folder, you’ll find the three subway datasets described above:
ridership. The main data is located in
station_points. We can read this into R and view the first 5 rows of the dataset with:
subway_stations <- st_read('./subway_census_v1/station_points/subway_stations2.shp', crs=4326)
## Reading layer `subway_stations2' from data source ## `C:\Users\natra\Documents\DIME-CEGA\Gonzalez-Navarro and Turner (2019)-- Subways and Urban Growth Evidence from Earth\subway_census_v1\station_points\subway_stations2.shp' ## using driver `ESRI Shapefile' ## Simple feature collection with 7886 features and 15 fields ## Geometry type: POINT ## Dimension: XY ## Bounding box: xmin: -123.1798 ymin: -34.64285 xmax: 141.4738 ymax: 60.23895 ## Geodetic CRS: WGS 84
## Simple feature collection with 5 features and 15 fields ## Geometry type: POINT ## Dimension: XY ## Bounding box: xmin: -58.44937 ymin: -34.62368 xmax: -58.37385 ymax: -34.57022 ## Geodetic CRS: WGS 84 ## CONTINENT COUNTRY1 CITY1 URBANCODE LINE_NAME LINE_NAME_ LOOP ## 1 south america argentina buenosaires 20058 Line D Line D <NA> ## 2 south america argentina buenosaires 20058 Line B Line B <NA> ## 3 south america argentina buenosaires 20058 Line D Line D <NA> ## 4 south america argentina buenosaires 20058 Line A Line A <NA> ## 5 south america argentina buenosaires 20058 Line D Line D <NA> ## STATION_NA FULLNAME STATIONID ## 1 Olleros Buenos Aires-Olleros-Line D 1047 ## 2 Carlos Gardel Buenos Aires-Carlos Gardel-Line B 1012 ## 3 Catedral Buenos Aires-Catedral-Line D 1017 ## 4 Puan Buenos Aires-Puan-Line A 1060 ## 5 Facultad de Medicina Buenos Aires-Facultad de Medicina-Line D 1025 ## TERMINAL TRANSFER_S YEAR_COMPL SUBWAYLAT SUBWAYLONG ## 1 0 0 1937 -34.5702 -58.4442 ## 2 0 0 1930 -34.6042 -58.4119 ## 3 1 1 1937 -34.6084 -58.3739 ## 4 0 0 1913 -34.6237 -58.4494 ## 5 0 0 1937 -34.5999 -58.3980 ## geometry ## 1 POINT (-58.44417 -34.57022) ## 2 POINT (-58.41186 -34.60419) ## 3 POINT (-58.37385 -34.60843) ## 4 POINT (-58.44937 -34.62368) ## 5 POINT (-58.39798 -34.59988)
We see we have precise geographic information stored in individual columns -
SUBWAYLONG - and combined into a geographic representation in
geometry. We also have additional information for each subway point to use in any analysis or visualizations. As an example, we can visualize the stations in Athens, Greece, colored by Line and include a background map of Athens for context.
First, we can select our Athens subway data and visualize these locations:
subways_athens <- subway_stations %>% filter(CITY1 == 'athens') plot(subways_athens$geometry)
We can see a definite pattern in these stations, but it would be helpful to have some geographic context in the form of a background map. We can download map data from a variety of sources, but for this example we’ll use the Global Administrative Area Database (GADM). You can download map data directly from GADM, as in this post, or you can use R to obtain GADM map data - we’ll be using R.
Geographic levels in GADM are defined as:
For our example, we are interested viewing the city of Athens alongside our subways data. We can download the map of Greece and its level 3 administrative areas in order to get this appropriate level of detail:
greece = geodata::gadm("Greece", level=3, path="./data")
The boundary data is downloaded to the path that you specified in the
path argument. The downloaded data through
gadm() will be in the PackedSpatVector class. If you want to convert it to another class (in our case, the sf class, so we can use it alongside our subways data), you can first read it using
readRDS(), then convert to a SpatVector via
vect() from the terra package, and finally convert it to a sf object.
greece = readRDS("./data/gadm36_GRC_3_pk.rds") %>% vect() %>% st_as_sf(greece)
To see what data we’ve pulled graphically, we can visualize our entire map of Greece’s level 3 administrative boundaries colored by the Level 1 regions.
Since we want to use this map alongside our subways data, we’ll need to select only the areas which the subways service. Though this subway system belongs to Athens, it extends outside of the official city limits, meaning we can’t simply filter for the Athens city limits (which would be
NAME_3 == 'Athens'), as this won’t capture the full extent of our subways. Instead, we can use the geographic subway data itself as a filter, selecting only those areas of our map of Greece which have a station along the Athens subway line. We’ll do this with the sf package’s
st_join() function, which will compare the
geometry field in our Subways data with the
geometry field in our Greece data and can return only the areas where the geometries overlap. We use
left=FALSE to ensure that this join acts as a filter rather than returning back all of our Greece data.
athens_area <- st_join(greece, subways_athens, join=st_intersects, left=FALSE) # plot the areas which are reached by the Athens subway line plot(athens_area$geometry, col=as.factor(athens_area$NAME_3)) # use as.factor() so plot() recognizes areas as categories # add a legend to identify our areas legend('left', legend=levels(as.factor(athens_area$NAME_3)),fill=unique(as.factor(athens_area$NAME_3)))
We’re now ready to combine this map of Athens with our Athens subway points. We can use
ggplot() for this map, as it enables more advanced visualizations.
ggplot() + geom_sf(data = athens_area) + geom_point(data = subways_athens, aes(x=SUBWAYLONG, y=SUBWAYLAT, color=factor(as.factor(subways_athens$LINE_NAME)))) + labs(color = 'Subway Lines', title = 'Athens Subway Lines') + xlab("Longitude") + ylab("Latitude")
The Subways data also contains route maps at five-year intervals for each subway system, which are located in the
route_maps folder. For our example, we can start with the most recent routes, from 2010, and see what the data looks like:
subway_route2010 <- st_read('./subway_census_v1/route_maps/subway_map_2010.shp', crs=4326)
## Reading layer `subway_map_2010' from data source ## `C:\Users\natra\Documents\DIME-CEGA\Gonzalez-Navarro and Turner (2019)-- Subways and Urban Growth Evidence from Earth\subway_census_v1\route_maps\subway_map_2010.shp' ## using driver `ESRI Shapefile' ## Simple feature collection with 137 features and 1 field ## Geometry type: MULTILINESTRING ## Dimension: XY ## Bounding box: xmin: -123.1798 ymin: -34.64285 xmax: 141.4738 ymax: 60.23895 ## Geodetic CRS: WGS 84
## Simple feature collection with 5 features and 1 field ## Geometry type: MULTILINESTRING ## Dimension: XY ## Bounding box: xmin: -58.47463 ymin: -34.64285 xmax: 49.95372 ymax: 50.89736 ## Geodetic CRS: WGS 84 ## urbancode geometry ## 1 20053 MULTILINESTRING ((49.9535 4... ## 2 20058 MULTILINESTRING ((-58.40205... ## 3 20107 MULTILINESTRING ((16.33526 ... ## 4 20140 MULTILINESTRING ((44.49242 ... ## 5 20144 MULTILINESTRING ((4.397926 ...
We can see that, unlike the
station_points data above, this file only contains the
urbancode as an identifying field alongside the geographic representation of the subway routes. If we wanted to work solely with the routes data, we could use
urbancode to match the appropriate countries and cities from the
station_points data with our route data. We first will select the unique urban codes with associated cities, countries, and continents from our
station_points data. Then, we can use
join() to associate these location names with our route data.
# find list of all urban codes with associated continent, country, and city names urbancode_info <- subway_stations %>% as.data.frame() %>% distinct(CONTINENT, COUNTRY1, CITY1, URBANCODE) # join urban code info with route data to ease route identification subway_route2010_add <- merge(subway_route2010, urbancode_info, by.x = 'urbancode', by.y = 'URBANCODE') # view top 5 rows of updated route data subway_route2010_add[0:5,]
## Simple feature collection with 5 features and 4 fields ## Geometry type: MULTILINESTRING ## Dimension: XY ## Bounding box: xmin: -58.47463 ymin: -34.64285 xmax: 49.95372 ymax: 50.89736 ## Geodetic CRS: WGS 84 ## urbancode CONTINENT COUNTRY1 CITY1 geometry ## 1 20053 europe azerbaijan baku MULTILINESTRING ((49.9535 4... ## 2 20058 south america argentina buenosaires MULTILINESTRING ((-58.40205... ## 3 20107 europe austria vienna MULTILINESTRING ((16.33526 ... ## 4 20140 europe armenia yerevan MULTILINESTRING ((44.49242 ... ## 5 20144 europe belgium bruxelles MULTILINESTRING ((4.397926 ...
We now have City, Country, and Continent information for each route, which will enable us to more easily select the routes we’re interested in. For example, if we’d like to view our Athens route from above:
athens_route2010 <- subway_route2010_add %>% filter(CITY1 == 'athens') plot(athens_route2010$geometry)
The Subways data also provides information on Ridership for the subway systems where it is available. This is provided as a stata data file type (.dta), so we will need to use the haven package’s
read_dta() function to bring the data into R.
ridership <- read_dta('./subway_census_v1/ridership/ridership.dta') ridership[0:5,]
## # A tibble: 5 x 8 ## urbanname urbancode country year ridership_all ridershipheavyr~ ## <chr> <chr> <chr> <dbl> <dbl> <dbl> ## 1 Buenos A~ 20058 Argent~ 1996 NA 413547371 ## 2 Wien (Vi~ 20107 Austria 1997 NA 375700000 ## 3 Bruxelle~ 20144 Belgium 2000 170100000 NA ## 4 Belo Hor~ 20183 Brazil 1999 NA 25641000 ## 5 Recife 20268 Brazil 2004 NA 53680000 ## # ... with 2 more variables: ridershipbuses <dbl>, ridershiplightrail <dbl>
We find we have values for all rides taken in a city each year along with a breakdown of rides which were heavy rail, light rail, or only buses. The
urbancode field in this data aligns with the urban code fields in our station points and routes data sets above, which will allow us to use this ridership data in our analysis of the geographic subway data sets.
We can now turn to using these Subways data for some initial analysis.
We can first present overall summary statistics for cities which had subways in 2010. We can create values for stations, subway lines, routes, and ridership worldwide and for different world regions.
Note: If comparing these values to the values in Table 1 of “Subways and Urban Growth: Evidence From Earth” (2018), Gonzalez-Navarro and Turner, you will notice differences in values. While this dataset includes the Middle East as a
CONTINENT in the data, in the paper the countries in the Middle East are assigned to Asia (UAE, Iran, Saudi Arabia, Turkey, and Uzbekistan). Further, there are some countries assigned as Europe in this data but assigned as Asia in the paper (Armenia, Azerbaijan, Russia). Thus, the values for Asia and Africa will differ between this output and Table 1 in the paper.
subway_stations_pre2010 <- filter(subway_stations, YEAR_COMPL <= 2010)
# calculate number of stations and lines by city city_stats_stations <- subway_stations %>% as.data.frame() %>% dplyr::select(-c('geometry')) %>% group_by(CONTINENT, COUNTRY1, CITY1) %>% summarize( stations = n_distinct(STATIONID), lines = n_distinct(LINE_NAME)) # calculate mean and total stations and lines by continent region_stats_stations <- city_stats_stations %>% group_by(CONTINENT) %>% summarize( total_cities = n_distinct(CITY1), total_stations = sum(stations), mean_stations = mean(stations), total_lines = sum(lines), mean_lines = mean(lines) ) total_stats_stations <- region_stats_stations %>% summarize( total_cities = sum(total_cities), total_stations = sum(total_stations), mean_stations = mean(mean_stations), total_lines = sum(total_lines), mean_lines = mean(mean_lines) ) # calculate length of routes by city city_stats_routes <- subway_route2010_add %>% group_by(CONTINENT, COUNTRY1, CITY1) %>% summarize( route_length = st_length(geometry)) %>% as.data.frame() %>% dplyr::select(-c('geometry')) # calculate mean and total route length by continent region_stats_routes <- city_stats_routes %>% group_by(CONTINENT) %>% summarize( total_route_length = sum(route_length), mean_route_length = mean(route_length) ) total_stats_routes <- region_stats_routes %>% summarize( total_route_length = sum(total_route_length), mean_route_length = mean(mean_route_length) ) # join statistics for regions and cities full_region_stats <- merge(region_stats_stations, region_stats_routes, by='CONTINENT') full_total_stats <- cbind(total_stats_stations, total_stats_routes) full_total_stats$CONTINENT = 'Total' full_stats <- rbind(full_region_stats, full_total_stats) full_city_stats <- merge(city_stats_stations, city_stats_routes, by=c('CONTINENT','COUNTRY1','CITY1')) kable(full_stats)
|africa||1||51||51.00000||2||2.000000||59805.87 [m]||59805.87 [m]|
|asia||34||2466||72.52941||147||4.323529||3349023.95 [m]||101485.57 [m]|
|europe||51||3116||61.09804||204||4.000000||3304650.68 [m]||64797.07 [m]|
|middle east||8||177||22.12500||15||1.875000||223246.39 [m]||27905.80 [m]|
|north america||30||1598||53.26667||92||3.066667||1934802.07 [m]||64493.40 [m]|
|south america||14||478||34.14286||36||2.571429||523089.15 [m]||37363.51 [m]|
|Total||138||7886||49.02700||496||2.972771||9394618.10 [m]||59308.54 [m]|
kable(filter(full_city_stats, COUNTRY1 == 'greece'))
Ridership statistics are also available for use from this dataset. To better focus our analysis, we can include only ridership from a single year, and we can use 2010 for our example:
ridership_2010 <- filter(ridership, year==2010) subway_stats_2010 <- ridership_2010 %>% filter(!is.na(ridershipheavyrail)) %>% summarize( mean_ridership = mean(ridershipheavyrail), sd_ridership = sd(ridershipheavyrail), countries = n_distinct(country), cities = n_distinct(urbanname) ) %>% mutate(transit = 'Subway') bus_stats_2010 <- ridership_2010 %>% filter(!is.na(ridershipbuses)) %>% summarize( mean_ridership = mean(ridershipbuses), sd_ridership = sd(ridershipbuses), countries = n_distinct(country), cities = n_distinct(urbanname) ) %>% mutate(transit = 'Bus') ridership_stats_2010 <- rbind(subway_stats_2010, bus_stats_2010) %>% select(transit,everything()) ridership_stats_2010
## # A tibble: 2 x 5 ## transit mean_ridership sd_ridership countries cities ## <chr> <dbl> <dbl> <int> <int> ## 1 Subway 376836173. 639732567. 34 77 ## 2 Bus 241605938. 343003474. 17 40
We can also consider charts to display these summary statistics graphically.
ggplot(data=full_region_stats) + geom_bar(aes(x=CONTINENT,y=total_stations,fill=CONTINENT), stat='identity') + geom_bar(aes(x=CONTINENT,y=mean_stations),stat='identity') + scale_fill_brewer(palette = 'Dark2', guide=FALSE) + labs(color = 'Continent', title = 'Total and Mean Stations by Continent') + xlab("Continent") + ylab("Stations")
We can visualize the creation of subways over time as well:
stations_yearly <- subway_stations %>% as.data.frame() %>% dplyr::select(-c('geometry')) %>% group_by(YEAR_COMPL) %>% summarize( stations = n_distinct(STATIONID)) stations_yearly$total_stations = cumsum(stations_yearly$stations) plot(x=stations_yearly$YEAR_COMPL, y=stations_yearly$total_stations,type='l',xlab='Year',ylab='Operational Stations')