This tutorial draws inspiration for the analyses of the Subways data from “Subways and Urban Growth: Evidence from Earth” (2018), by Marco Gonzalez-Navarro and Matthew Turner. However, it does not directly follow the methodology used in this paper for all calculations. As such, the user should not expect all values and results to align directly with Gonzalez-Navarro and Turner’s.

1.1 Introduction to Subways Data

These subways data provide the location and opening date of all subway stations worldwide for which records are available. In these data, a subway is defined as ‘as an electric-powered urban rail system isolated from interactions with automobile traffic and pedestrians’ (Gendron-Carrier et al. 2021). Under this definition, heavy rail commuter lines and streetcars are excluded to focus on intra-city subway systems. These records were compiled manually for use in Subways and Urban Growth: Evidence from Earth, Gonalez-Navarro and Turner (2018), using a variety of online sources including http://www.urbanrail.net/. The initial dataset was created between January 2012 and February 2014 by Farhan Yahya, Mahdy Saddradini, Mohamed Salat, and Fern Ramoutar, and it was subsequently updated in 2020 to include data through December 2017. Latitude, Longitude, Station Name, Subway Line, Country, City, and Year Completed are included in the main dataset.

These data were used further to approximate the route of each subway line by connecting stations along the same line by their shortest route. The geographic representations of these routes, stored individually at 5-year intervals from 1935-2000, are provided for use as well.

Subway ridership was also compiled for these public transit systems. While ridership is not available for all systems, the team obtained data for 77 subway and 40 bus systems. These data were collected from a variety of transit organizations and government sources, and they include counts of total public transit rides, heavy rail rides, bus rides and light rail rides by year, country, and city for dates between 1963 and 2014.


2.1 Data Structure

In geospatial data analysis, data can be classified into two categories: raster and vector data. A graphic comparison between raster and vector data can be found in this page.

  • Raster data: Data stored in a raster format is arranged in a regular grid of cells, without storing the coordinates of each point (namely, a cell, or a pixel). The coordinates of the corner points and the spacing of the grid can be used to calculate (rather than to store) the coordinates of each location in the grid. Any given pixel in the grid stores one or more values (in one or more bands).
  • Vector data: Data in a vector format is stored such that the X and Y coordinates are stored for each point. Data can be represented, for example, as points, lines and polygons. A point has only one coordinate (X and Y), a line has two coordinates (at the start and end of the line) and a polygon is essentially a line that closes on itself to enclose a region. Polygons are usually used to represent the area and perimeter of continuous geographic features. Vector data stores features in their original resolution, without aggregation.

More information and examples can be found in sections 3 & 4 of the Earth Analytics Course.

In this tutorial, we will primarily use vector data. Geospatial data in vector format are often stored in a shapefile, a popular format for storing vector data developed by ESRI. The shapefile format is actually composed of multiple individual files which make up the entire data. At a minimum, there will be 3 file types included with this geographic data (.shp, .shx, .dbf), but there are often other files included which store additional information. In order to be read and used as a whole, all file types must have the same name (eg, subway_stations2.shp, subway_stations2.shx) and be in the same folder. For more details on shapefiles and file types, see this documentation.

Because the structure of points, lines, and polygons are different, each shapefile can only contain one vector type (all points, all lines, or all polygons). You will not find a mixture of point, line, and polygon objects in a single shapefile, so in order to work with these different types in the same analysis, multiple shapefiles will need to be used and layered. We’ll see an example of this in section 2.

2.2 Tools

2.2.1 Installation of R

To get started with R, we provide instructions on how to download and install R on your computer. R is an open source software, which means users like you can also inspect, modify, and improve its source code.

The Comprehensive R Archive Network (CRAN) provides links to install R under different operating systems. RStudio page provides a brief guide for troubleshooting.

2.2.2 RStudio

RStudio is an integrated development environment for R. R provides the engine for running code, while RStudio is a user friendly control panel to perform various tasks. RStudio facilitates R code writing and debugging and provides tools for workspace management. RStudio can be downloaded from the RStudio IDE page.

There are numerous posts, tutorials, and courses on the internet. Once you have installed R and RStudio, pick any of the following resources to get familiar with R:

2.2.3 Setting up Environment

In order to perform data manipulation, we need to attach packages. The first step is downloading R packages from CRAN. For this exercise, we are going to use the packages tidyverse and sf for data manipulation, ggplot and terra for creating visualizations, haven for reading in Stata data, knitr and stargazer for creating tables, and lfe for running our usage example regression. We’ll also use remotes to download the package geodata to obtain background maps for our subways data. To download these packages, in the R or RStudio console, type the following code:


If you are not familiar with the tidyverse workflow, please refer to the R for Data Science book we suggested in the previous section.

We now attach these packages.


2.2.4 Accessing the Subways Data in R

The Subways data are stored as shapefiles and available to download as a 7Z file here. 7Z files are compressed using the open-source 7-Zip tool and typically require a third-party app to access. There are many apps available to open 7Z files, but an easy option for Windows users is to download the 7-Zip app directly. MacOS users can use a tool like the Unarchiver. For a more detailed description of working with 7Z files, see this tutorial for Windows or this guide for macOS. Station Points

Within the subway_census folder, you’ll find the three subway datasets described above: station_points, route_maps, and ridership. The main data is located in station_points. We can read this into R and view the first 5 rows of the dataset with:

subway_stations <- st_read('./subway_census_v1/station_points/subway_stations2.shp', crs=4326)
## Reading layer `subway_stations2' from data source 
##   `C:\Users\natra\Documents\DIME-CEGA\Gonzalez-Navarro and Turner (2019)-- Subways and Urban Growth Evidence from Earth\subway_census_v1\station_points\subway_stations2.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 7886 features and 15 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -123.1798 ymin: -34.64285 xmax: 141.4738 ymax: 60.23895
## Geodetic CRS:  WGS 84
## Simple feature collection with 5 features and 15 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -58.44937 ymin: -34.62368 xmax: -58.37385 ymax: -34.57022
## Geodetic CRS:  WGS 84
## 1 south america argentina buenosaires     20058    Line D     Line D <NA>
## 2 south america argentina buenosaires     20058    Line B     Line B <NA>
## 3 south america argentina buenosaires     20058    Line D     Line D <NA>
## 4 south america argentina buenosaires     20058    Line A     Line A <NA>
## 5 south america argentina buenosaires     20058    Line D     Line D <NA>
##             STATION_NA                                 FULLNAME STATIONID
## 1              Olleros              Buenos Aires-Olleros-Line D      1047
## 2        Carlos Gardel        Buenos Aires-Carlos Gardel-Line B      1012
## 3             Catedral             Buenos Aires-Catedral-Line D      1017
## 4                 Puan                 Buenos Aires-Puan-Line A      1060
## 5 Facultad de Medicina Buenos Aires-Facultad de Medicina-Line D      1025
## 1        0          0       1937  -34.5702   -58.4442
## 2        0          0       1930  -34.6042   -58.4119
## 3        1          1       1937  -34.6084   -58.3739
## 4        0          0       1913  -34.6237   -58.4494
## 5        0          0       1937  -34.5999   -58.3980
##                      geometry
## 1 POINT (-58.44417 -34.57022)
## 2 POINT (-58.41186 -34.60419)
## 3 POINT (-58.37385 -34.60843)
## 4 POINT (-58.44937 -34.62368)
## 5 POINT (-58.39798 -34.59988)

We see we have precise geographic information stored in individual columns - SUBWAYLAT and SUBWAYLONG - and combined into a geographic representation in geometry. We also have additional information for each subway point to use in any analysis or visualizations. As an example, we can visualize the stations in Athens, Greece, colored by Line and include a background map of Athens for context.

First, we can select our Athens subway data and visualize these locations:

subways_athens <- subway_stations %>% filter(CITY1 == 'athens')

We can see a definite pattern in these stations, but it would be helpful to have some geographic context in the form of a background map. We can download map data from a variety of sources, but for this example we’ll use the Global Administrative Area Database (GADM). You can download map data directly from GADM, as in this post, or you can use R to obtain GADM map data - we’ll be using R.

Geographic levels in GADM are defined as:

  • level 0: National
  • level 1: State/province/equivalent
  • level 2: County/district/equivalent
  • level 3/4: Smaller administrative levels

For our example, we are interested viewing the city of Athens alongside our subways data. We can download the map of Greece and its level 3 administrative areas in order to get this appropriate level of detail:

greece = geodata::gadm("Greece", level=3, path="./data")

The boundary data is downloaded to the path that you specified in the path argument. The downloaded data through gadm() will be in the PackedSpatVector class. If you want to convert it to another class (in our case, the sf class, so we can use it alongside our subways data), you can first read it using readRDS(), then convert to a SpatVector via vect() from the terra package, and finally convert it to a sf object.

greece = readRDS("./data/gadm36_GRC_3_pk.rds") %>% vect() %>% st_as_sf(greece)

To see what data we’ve pulled graphically, we can visualize our entire map of Greece’s level 3 administrative boundaries colored by the Level 1 regions.

plot(greece$geometry, col=as.factor(greece$NAME_1))

Since we want to use this map alongside our subways data, we’ll need to select only the areas which the subways service. Though this subway system belongs to Athens, it extends outside of the official city limits, meaning we can’t simply filter for the Athens city limits (which would be NAME_3 == 'Athens'), as this won’t capture the full extent of our subways. Instead, we can use the geographic subway data itself as a filter, selecting only those areas of our map of Greece which have a station along the Athens subway line. We’ll do this with the sf package’s st_join() function, which will compare the geometry field in our Subways data with the geometry field in our Greece data and can return only the areas where the geometries overlap. We use left=FALSE to ensure that this join acts as a filter rather than returning back all of our Greece data.

athens_area <- st_join(greece, subways_athens, join=st_intersects, left=FALSE)

# plot the areas which are reached by the Athens subway line
plot(athens_area$geometry, col=as.factor(athens_area$NAME_3)) # use as.factor() so plot() recognizes areas as categories
# add a legend to identify our areas
legend('left', legend=levels(as.factor(athens_area$NAME_3)),fill=unique(as.factor(athens_area$NAME_3)))

We’re now ready to combine this map of Athens with our Athens subway points. We can use ggplot() for this map, as it enables more advanced visualizations.

ggplot() +
  geom_sf(data = athens_area)  +
  geom_point(data = subways_athens, aes(x=SUBWAYLONG, y=SUBWAYLAT, color=factor(as.factor(subways_athens$LINE_NAME)))) +
  labs(color = 'Subway Lines',
       title = 'Athens Subway Lines') +
  xlab("Longitude") +
  ylab("Latitude") Route Maps

The Subways data also contains route maps at five-year intervals for each subway system, which are located in the route_maps folder. For our example, we can start with the most recent routes, from 2010, and see what the data looks like:

subway_route2010 <- st_read('./subway_census_v1/route_maps/subway_map_2010.shp', crs=4326)
## Reading layer `subway_map_2010' from data source 
##   `C:\Users\natra\Documents\DIME-CEGA\Gonzalez-Navarro and Turner (2019)-- Subways and Urban Growth Evidence from Earth\subway_census_v1\route_maps\subway_map_2010.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 137 features and 1 field
## Geometry type: MULTILINESTRING
## Dimension:     XY
## Bounding box:  xmin: -123.1798 ymin: -34.64285 xmax: 141.4738 ymax: 60.23895
## Geodetic CRS:  WGS 84
## Simple feature collection with 5 features and 1 field
## Geometry type: MULTILINESTRING
## Dimension:     XY
## Bounding box:  xmin: -58.47463 ymin: -34.64285 xmax: 49.95372 ymax: 50.89736
## Geodetic CRS:  WGS 84
##   urbancode                       geometry
## 1     20053 MULTILINESTRING ((49.9535 4...
## 2     20058 MULTILINESTRING ((-58.40205...
## 3     20107 MULTILINESTRING ((16.33526 ...
## 4     20140 MULTILINESTRING ((44.49242 ...
## 5     20144 MULTILINESTRING ((4.397926 ...

We can see that, unlike the station_points data above, this file only contains the urbancode as an identifying field alongside the geographic representation of the subway routes. If we wanted to work solely with the routes data, we could use urbancode to match the appropriate countries and cities from the station_points data with our route data. We first will select the unique urban codes with associated cities, countries, and continents from our station_points data. Then, we can use join() to associate these location names with our route data.

# find list of all urban codes with associated continent, country, and city names
urbancode_info <- subway_stations %>% as.data.frame() %>% distinct(CONTINENT, COUNTRY1, CITY1, URBANCODE)

# join urban code info with route data to ease route identification
subway_route2010_add <- merge(subway_route2010, urbancode_info, by.x = 'urbancode', by.y = 'URBANCODE')

# view top 5 rows of updated route data
## Simple feature collection with 5 features and 4 fields
## Geometry type: MULTILINESTRING
## Dimension:     XY
## Bounding box:  xmin: -58.47463 ymin: -34.64285 xmax: 49.95372 ymax: 50.89736
## Geodetic CRS:  WGS 84
##   urbancode     CONTINENT   COUNTRY1       CITY1                       geometry
## 1     20053        europe azerbaijan        baku MULTILINESTRING ((49.9535 4...
## 2     20058 south america  argentina buenosaires MULTILINESTRING ((-58.40205...
## 3     20107        europe    austria      vienna MULTILINESTRING ((16.33526 ...
## 4     20140        europe    armenia     yerevan MULTILINESTRING ((44.49242 ...
## 5     20144        europe    belgium   bruxelles MULTILINESTRING ((4.397926 ...

We now have City, Country, and Continent information for each route, which will enable us to more easily select the routes we’re interested in. For example, if we’d like to view our Athens route from above:

athens_route2010 <- subway_route2010_add %>% filter(CITY1 == 'athens')
plot(athens_route2010$geometry) Ridership Data

The Subways data also provides information on Ridership for the subway systems where it is available. This is provided as a stata data file type (.dta), so we will need to use the haven package’s read_dta() function to bring the data into R.

ridership <- read_dta('./subway_census_v1/ridership/ridership.dta')
## # A tibble: 5 x 8
##   urbanname urbancode country  year ridership_all ridershipheavyr~
##   <chr>     <chr>     <chr>   <dbl>         <dbl>            <dbl>
## 1 Buenos A~ 20058     Argent~  1996            NA        413547371
## 2 Wien (Vi~ 20107     Austria  1997            NA        375700000
## 3 Bruxelle~ 20144     Belgium  2000     170100000               NA
## 4 Belo Hor~ 20183     Brazil   1999            NA         25641000
## 5 Recife    20268     Brazil   2004            NA         53680000
## # ... with 2 more variables: ridershipbuses <dbl>, ridershiplightrail <dbl>

We find we have values for all rides taken in a city each year along with a breakdown of rides which were heavy rail, light rail, or only buses. The urbancode field in this data aligns with the urban code fields in our station points and routes data sets above, which will allow us to use this ridership data in our analysis of the geographic subway data sets.


We can now turn to using these Subways data for some initial analysis.

3.1 Summary Statistics

We can first present overall summary statistics for cities which had subways in 2010. We can create values for stations, subway lines, routes, and ridership worldwide and for different world regions.

Note: If comparing these values to the values in Table 1 of “Subways and Urban Growth: Evidence From Earth” (2018), Gonzalez-Navarro and Turner, you will notice differences in values. While this dataset includes the Middle East as a CONTINENT in the data, in the paper the countries in the Middle East are assigned to Asia (UAE, Iran, Saudi Arabia, Turkey, and Uzbekistan). Further, there are some countries assigned as Europe in this data but assigned as Asia in the paper (Armenia, Azerbaijan, Russia). Thus, the values for Asia and Africa will differ between this output and Table 1 in the paper.

subway_stations_pre2010 <- filter(subway_stations, YEAR_COMPL <= 2010)
# calculate number of stations and lines by city
city_stats_stations <- subway_stations %>% 
  as.data.frame() %>% dplyr::select(-c('geometry')) %>%
  group_by(CONTINENT, COUNTRY1, CITY1) %>%
    stations = n_distinct(STATIONID),
    lines = n_distinct(LINE_NAME))

# calculate mean and total stations and lines by continent
region_stats_stations <- city_stats_stations %>%
  group_by(CONTINENT) %>%
    total_cities = n_distinct(CITY1),
    total_stations = sum(stations),
    mean_stations = mean(stations),
    total_lines = sum(lines),
    mean_lines = mean(lines)

total_stats_stations <- region_stats_stations %>%
    total_cities = sum(total_cities),
    total_stations = sum(total_stations),
    mean_stations = mean(mean_stations),
    total_lines = sum(total_lines),
    mean_lines = mean(mean_lines)

# calculate length of routes by city
city_stats_routes <- subway_route2010_add %>%
  group_by(CONTINENT, COUNTRY1, CITY1) %>%
    route_length = st_length(geometry)) %>%
  as.data.frame() %>% dplyr::select(-c('geometry'))

# calculate mean and total route length by continent
region_stats_routes <- city_stats_routes %>%
  group_by(CONTINENT) %>%
    total_route_length = sum(route_length),
    mean_route_length = mean(route_length)

total_stats_routes <- region_stats_routes %>%
    total_route_length = sum(total_route_length),
    mean_route_length = mean(mean_route_length)

# join statistics for regions and cities
full_region_stats <- merge(region_stats_stations, region_stats_routes, by='CONTINENT')
full_total_stats <- cbind(total_stats_stations, total_stats_routes)
full_total_stats$CONTINENT = 'Total'
full_stats <- rbind(full_region_stats, full_total_stats)
full_city_stats <- merge(city_stats_stations, city_stats_routes, by=c('CONTINENT','COUNTRY1','CITY1'))
CONTINENT total_cities total_stations mean_stations total_lines mean_lines total_route_length mean_route_length
africa 1 51 51.00000 2 2.000000 59805.87 [m] 59805.87 [m]
asia 34 2466 72.52941 147 4.323529 3349023.95 [m] 101485.57 [m]
europe 51 3116 61.09804 204 4.000000 3304650.68 [m] 64797.07 [m]
middle east 8 177 22.12500 15 1.875000 223246.39 [m] 27905.80 [m]
north america 30 1598 53.26667 92 3.066667 1934802.07 [m] 64493.40 [m]
south america 14 478 34.14286 36 2.571429 523089.15 [m] 37363.51 [m]
Total 138 7886 49.02700 496 2.972771 9394618.10 [m] 59308.54 [m]
kable(filter(full_city_stats, COUNTRY1 == 'greece'))
CONTINENT COUNTRY1 CITY1 stations lines route_length
europe greece athens 52 3 51629.7 [m]

Ridership statistics are also available for use from this dataset. To better focus our analysis, we can include only ridership from a single year, and we can use 2010 for our example:

ridership_2010 <- filter(ridership, year==2010)

subway_stats_2010 <- ridership_2010 %>%
  filter(!is.na(ridershipheavyrail)) %>%
    mean_ridership = mean(ridershipheavyrail),
    sd_ridership = sd(ridershipheavyrail),
    countries = n_distinct(country),
    cities = n_distinct(urbanname)
  ) %>% mutate(transit = 'Subway')
bus_stats_2010 <- ridership_2010 %>%
  filter(!is.na(ridershipbuses)) %>%
    mean_ridership = mean(ridershipbuses),
    sd_ridership = sd(ridershipbuses),
    countries = n_distinct(country),
    cities = n_distinct(urbanname)
  ) %>% mutate(transit = 'Bus')

ridership_stats_2010 <- rbind(subway_stats_2010, bus_stats_2010) %>% select(transit,everything())
## # A tibble: 2 x 5
##   transit mean_ridership sd_ridership countries cities
##   <chr>            <dbl>        <dbl>     <int>  <int>
## 1 Subway      376836173.   639732567.        34     77
## 2 Bus         241605938.   343003474.        17     40

3.2 Charts

We can also consider charts to display these summary statistics graphically.

ggplot(data=full_region_stats) +
  geom_bar(aes(x=CONTINENT,y=total_stations,fill=CONTINENT), stat='identity') +
  geom_bar(aes(x=CONTINENT,y=mean_stations),stat='identity') +
  scale_fill_brewer(palette = 'Dark2', guide=FALSE) +
  labs(color = 'Continent',
       title = 'Total and Mean Stations by Continent') +
  xlab("Continent") +

We can visualize the creation of subways over time as well:

stations_yearly <- subway_stations %>%
  as.data.frame() %>% dplyr::select(-c('geometry')) %>%
  group_by(YEAR_COMPL) %>%
    stations = n_distinct(STATIONID))

stations_yearly$total_stations = cumsum(stations_yearly$stations)

plot(x=stations_yearly$YEAR_COMPL, y=stations_yearly$total_stations,type='l',xlab='Year',ylab='Operational Stations')