1 INTRODUCTION

1.1 Introduction to Remote Sensing

This section is adopted from module 1, section 1 of Open Nighttime Lights. Please refer to it for more details and a video introduction of remote sensing.

Remote sensing is the science of identifying, observing, collecting, and measuring objects without coming into direct contact with them. This can be accomplished through many devices that carry sensors and capture the characteristics of Earth remotely. Sensors on board satellites also record the electromagnetic energy that is reflected or emitted from objects on Earth. There are two types of sensors:

  • Passive sensors record the natural energy that is (naturally) reflected or emitted from the Earth’s surface (e.g. sunlight, moonlight, city lights).
  • Active sensors provide their own energy source for illumination (e.g. RADAR, LIDAR).

1.2 Introduction to Forest Cover Data

In this exercise, we will primarily work with the Vegetation Continuous Fields (VCF) data provided by the Land Processes Distributed Active Archive Center (LP DAAC), a component of NASA’s Earth Observing System Data and Information System (EOSDIS). The MOD44B Version 6 VCF is a yearly representation of surface vegetation from 2000 to 2020 at 250 m resolution. Each pixel stores a percentage of three ground cover components: percent tree cover, percent non-tree cover, and percent bare.

The ground cover percentages are estimates from a machine learning model based on the combination of the Moderate Resolution Imaging Spectroradiometer (MODIS) data and other high resolution data from NASA and Google Earth. The machine learning model incorporates the visible bandwidth as well as other bandwidth such as brightness temperature (from MODIS bands 20, 31, 32).

The VCF data utilize thermal signatures and other correlates to distinguish forest and non-forest plantation, which is an improvement compared to the Normalized Differenced Vegetation Index (NDVI). For this use case, VCF also improves on the Global Forest Cover (GFC) data set, another source used to study deforestation, which only provides binary data points. GFC records baseline forest cover in the year 2000 and includes a binary indicator for the year of deforestation for each 30m × 30m pixel. If over 90% of the forest cover was lost in a pixel by a given year, then the pixel is marked as deforested, while a pixel is marked as reforested if the forest cover went from 0 in 2000 to a positive value. The VCF records continuous changes in percent of ground cover components, which provides more details than the GFC data.

2 INTRODUCTION TO GEOSPATIAL DATA AND TOOLS

2.1 Data Structure

In geospatial data analysis, data can be classified into two categories: raster and vector data. A graphic comparison between raster and vector data can be found in this page.

  • Raster data: Data stored in a raster format is arranged in a regular grid of cells, without storing the coordinates of each point (namely, a cell, or a pixel). The coordinates of the corner points and the spacing of the grid can be used to calculate (rather than to store) the coordinates of each location in the grid. Any given pixel in the grid stores one or more values (in one or more bands).
  • Vector data: Data in a vector format is stored such that the X and Y coordinates are stored for each point. Data can be represented, for example, as points, lines and polygons. A point has only one coordinate (X and Y), a line has two coordinates (at the start and end of the line) and a polygon is essentially a line that closes on itself to enclose a region. Polygons are usually used to represent the area and perimeter of continuous geographic features. Vector data stores features in their original resolution, without aggregation.

In this tutorial, we will use both vector data and raster data. Geospatial data in vector format are often stored in a shapefile format. Because the structure of points, lines, and polygons are different, each individual shapefile can only contain one vector type (all points, all lines, or all polygons). You will not find a mixture of point, line, and polygon objects in a single shapefile. Raster data, on the other hand, is stored in Tagged Image File Format (TIFF or TIF). A GeoTIFF is a TIFF file that follows a specific standard for structuring meta-data. The meta-data stored in a TIFF is called a tif tag and GeoTIFFs often contain tags including spatial extent, coordinate reference system, resolution, and number of layers. We will see examples in section 2.2.

More information and examples can be found in sections 3 & 4 of the Earth Analytics Course.

2.2 Tools

2.2.1 Installation of R

To get started with R, we provide instructions on how to download and install R on your computer. R is an open source software, which means users like you can also inspect, modify, and improve its source code.

The Comprehensive R Archive Network (CRAN) provides links to install R under different operating systems. RStudio page provides a brief guide for troubleshooting.

2.2.2 RStudio

RStudio is an integrated development environment for R. R provides the engine for running code, while RStudio is a user friendly control panel to perform various tasks. RStudio facilitates R code writing and debugging and provides tools for workspace management. RStudio can be downloaded from the RStudio IDE page.

There are numerous posts, tutorials, and courses on the internet. Once you have installed R and RStudio, pick any of the following resources to get familiar with R:

2.2.3 Setting up Environment

In order to perform data manipulation, we need to attach packages. The first step is downloading R packages from CRAN. For this exercise, we are going to use the package luna to download data from MODIS and the packages terra, tidyverse, raster, and sf for data manipulation. To download these packages, in the R or RStudio console, type the following code (you will need the remotes package in order to download luna):

install.packages(c("terra", "remotes", "tidyverse", "raster", "sf"))
remotes::install_github("rspatial/luna")

We follow this tutorial to get MODIS data with luna. For details of the terra package, please refer to the package manuscript and this tutorial. If you are not familiar with the tidyverse workflow, please refer to the R for Data Science book we suggested in the previous section.

Note: If you receive the message ‘These packages have more recent versions available. It is recommended to update all of them. Which would you like to update?’, it is best to update all packages. If some of these updates fail (potentially due to a permission issue: ‘cannot remove prior installation of package ’), you may need to manually update them below.

We now attach these packages.

library(terra)
library(luna)
library(tidyverse)
library(raster)
library(sf)

Troubleshooting Note: If you receive the message “package or namespace load failed”, it may be because of incompatible versions for dependent packages. If this is the case, you will see a message like: “namespace ‘dplyr’ 1.0.2 is already loaded, but >= 1.0.4 is required”. To proceed, you will need to manually update the package identified, which can be done using install.packages() (for example: install.packages("dplyr")). Once the update completes, try to re-attach the package which failed (for example: library(tidyverse)). Continue repeating this process if you see further errors until you are able to successfully attach all packages.

2.2.4 Accessing VCF in R

Once the required packages have been attached, we can access VCF in R. We prefer using R for its ability to download large numbers of files and enable regular, automated updates. For other tools to access MODIS data, see this NASA website.

We can first use luna to check the list of data products available from MODIS. Since luna can also access data from the LANDSAT and SENTINEL platforms, we add "^MOD|^MYD|^MCD" to narrow our scope to MODIS data. The printed results below list six products from MODIS.

MODIS.product = getProducts("^MOD|^MYD|^MCD")
head(MODIS.product)
##        provider             concept_id short_name version
## 806  LPDAAC_ECS C1000000120-LPDAAC_ECS     MOD44B     051
## 1433 LPDAAC_ECS C1000000400-LPDAAC_ECS   MCD43D10     006
## 1447 LPDAAC_ECS C1000000401-LPDAAC_ECS   MCD43D33     006
## 1452 LPDAAC_ECS C1000000402-LPDAAC_ECS   MCD43D45     006
## 1454 LPDAAC_ECS C1000000403-LPDAAC_ECS   MCD43D26     006
## 1456 LPDAAC_ECS C1000000404-LPDAAC_ECS   MCD43D49     006

The product name for VCF is MOD44B. We can use the function productInfo to launch the information page of VCF.

productInfo("MOD44B")

We can query MODIS and only download a subset of the data. We need to specify the start and end dates and our area of interest (AOI). The date format is “yyyy-mm-dd”. Suppose here we want to subset data from 2010 to 2012.

start.date = "2010-01-01"
end.date   = "2012-12-31"

In order to subset your area of interest, you need to provide a “map” to getModis(). This can be obtained from online databases such as the global administrative area database (GADM). You can download map data directly from GADM, as in this post, or you can use R to obtain GADM map data. We will use R below, which requires first installing the package geodata.

remotes::install_github("rspatial/geodata")

Geographic levels in GADM are defined as:

  • level 0: National
  • level 1: State/province/equivalent
  • level 2: County/district/equivalent
  • level 3/4: Smaller administrative levels

For our example, we are interested in India at the district level. We can download the map of India and its level 2 administrative areas with the following code:

india = geodata::gadm("India", level=2, path="./data")

The boundary data is downloaded to the path that you specified in the path argument. The downloaded data through gadm() will be in the PackedSpatVector class. If you want to convert it to another class (for example, the sf class, which is easier to work with in R), you can first read it using readRDS(), then convert to a SpatVector via vect() from the terra package, and finally convert it to a sf object.

india = readRDS("./data/gadm36_IND_2_pk.rds") %>% vect() %>% st_as_sf(india)

The map we downloaded is at the district level (level 2). Assume our AOI is the state of Odisha. Each row of the data represents a county in Odisha, and the geospatial information for each county is stored in the last column: geometry. We can filter to obtain the boundaries for our AOI, which will return aoi in vector format, stored as a data frame in R.

aoi = india %>% filter(NAME_1 == "Odisha")
head(aoi)
## Simple feature collection with 6 features and 13 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: 82.63184 ymin: 20.14274 xmax: 87.48266 ymax: 21.97341
## CRS:           BOUNDCRS[
##     SOURCECRS[
##         GEOGCRS["unknown",
##             DATUM["World Geodetic System 1984",
##                 ELLIPSOID["WGS 84",6378137,298.257223563,
##                     LENGTHUNIT["metre",1]],
##                 ID["EPSG",6326]],
##             PRIMEM["Greenwich",0,
##                 ANGLEUNIT["degree",0.0174532925199433],
##                 ID["EPSG",8901]],
##             CS[ellipsoidal,2],
##                 AXIS["longitude",east,
##                     ORDER[1],
##                     ANGLEUNIT["degree",0.0174532925199433,
##                         ID["EPSG",9122]]],
##                 AXIS["latitude",north,
##                     ORDER[2],
##                     ANGLEUNIT["degree",0.0174532925199433,
##                         ID["EPSG",9122]]]]],
##     TARGETCRS[
##         GEOGCRS["WGS 84",
##             DATUM["World Geodetic System 1984",
##                 ELLIPSOID["WGS 84",6378137,298.257223563,
##                     LENGTHUNIT["metre",1]]],
##             PRIMEM["Greenwich",0,
##                 ANGLEUNIT["degree",0.0174532925199433]],
##             CS[ellipsoidal,2],
##                 AXIS["latitude",north,
##                     ORDER[1],
##                     ANGLEUNIT["degree",0.0174532925199433]],
##                 AXIS["longitude",east,
##                     ORDER[2],
##                     ANGLEUNIT["degree",0.0174532925199433]],
##             ID["EPSG",4326]]],
##     ABRIDGEDTRANSFORMATION["Transformation from unknown to WGS84",
##         METHOD["Geocentric translations (geog2D domain)",
##             ID["EPSG",9603]],
##         PARAMETER["X-axis translation",0,
##             ID["EPSG",8605]],
##         PARAMETER["Y-axis translation",0,
##             ID["EPSG",8606]],
##         PARAMETER["Z-axis translation",0,
##             ID["EPSG",8607]]]]
##   GID_0 NAME_0    GID_1 NAME_1 NL_NAME_1      GID_2    NAME_2
## 1   IND  India IND.26_1 Odisha        NA IND.26.1_1    Anugul
## 2   IND  India IND.26_1 Odisha        NA IND.26.2_1  Balangir
## 3   IND  India IND.26_1 Odisha        NA IND.26.3_1 Baleshwar
## 4   IND  India IND.26_1 Odisha        NA IND.26.4_1   Bargarh
## 5   IND  India IND.26_1 Odisha        NA IND.26.5_1     Bauda
## 6   IND  India IND.26_1 Odisha        NA IND.26.6_1   Bhadrak
##            VARNAME_2 NL_NAME_2   TYPE_2 ENGTYPE_2 CC_2   HASC_2
## 1                 NA        NA District  District   NA IN.OR.AN
## 2           Balangir        NA District  District   NA IN.OR.BL
## 3 Balasore, Baleswar        NA District  District   NA IN.OR.BW
## 4                 NA        NA District  District   NA IN.OR.BR
## 5                 NA        NA District  District   NA IN.OR.BD
## 6                 NA        NA District  District   NA IN.OR.BH
##                         geometry
## 1 POLYGON ((84.74501 20.60569...
## 2 POLYGON ((82.81747 20.16052...
## 3 MULTIPOLYGON (((87.12389 21...
## 4 POLYGON ((83.29642 20.84884...
## 5 POLYGON ((84.78773 20.57517...
## 6 MULTIPOLYGON (((87.01651 20...
ggplot(data = aoi) +
  geom_sf()

Now that we have our AOI as well as time frame, we can filter the MODIS VCF data on these values and see what is available.

vcf.files = getModis("MOD44B", start.date, end.date, aoi, download = F)
head(vcf.files)
## [1] "MOD44B.A2009065.h26v07.006.2017081034835.hdf"
## [2] "MOD44B.A2009065.h25v07.006.2017081034600.hdf"
## [3] "MOD44B.A2009065.h26v06.006.2017081034818.hdf"
## [4] "MOD44B.A2009065.h25v06.006.2017081034537.hdf"
## [5] "MOD44B.A2010065.h26v07.006.2017081045754.hdf"
## [6] "MOD44B.A2010065.h25v07.006.2017081045519.hdf"

The products we are going to download are tiled products. For details of tiled products, the tilling system, and the naming convention, please refer to the MODIS overview page here. In essence, we will be downloading grids of maps that cover our AOI.

To actually download these files from the NASA server, you will need a username and password. Please register on NASA Earth Data if you haven’t done so.

The following code will download the files. Replace the path value with the location on your computer where you would like to store these files. Replace the username and password values with your NASA Earth Data credentials from above. (Example: getModis("MOD44B", start.date, end.date, aoi, download = T, path = "./data/VCFexample", username = "myusername", password = "mypwd")).

getModis("MOD44B", start.date, end.date, aoi, download = T, path = YourPathHere,  username = YourNASAUserName, password = YourPassWord)

The data format from MODIS is HDF and may include sub-datasets. We can use terra to read these files and create raster files. For example,

hdf.example = rast("./data/VCFexample/MOD44B.A2009065.h25v06.006.2017081034537.hdf")
hdf.example
## class       : SpatRaster 
## dimensions  : 4800, 4800, 7  (nrow, ncol, nlyr)
## resolution  : 231.6564, 231.6564  (x, y)
## extent      : 7783654, 8895604, 2223901, 3335852  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=sinu +lon_0=0 +x_0=0 +y_0=0 +R=6371007.181 +units=m +no_defs 
## sources     : MOD44B.A2009065.h25v06.006.2017081034537.hdf:MOD44B_250m_GRID:Percent_Tree_Cover  
##               MOD44B.A2009065.h25v06.006.2017081034537.hdf:MOD44B_250m_GRID:Percent_NonTree_Vegetation  
##               MOD44B.A2009065.h25v06.006.2017081034537.hdf:MOD44B_250m_GRID:Percent_NonVegetated  
##               ... and 4 more source(s)
## names       : MOD44~Cover, MOD44~ation, MOD44~tated, MOD44~ality, MOD44~er_SD, MOD44~ed_SD, ...

We can find basic information such as the coordinate reference system, number of cells, and resolution from the above output. There are 7 layers in each of the VCF tiled files. We are interested in the percent tree coverage layer.

names(hdf.example)
## [1] "MOD44B_250m_GRID:Percent_Tree_Cover"        
## [2] "MOD44B_250m_GRID:Percent_NonTree_Vegetation"
## [3] "MOD44B_250m_GRID:Percent_NonVegetated"      
## [4] "MOD44B_250m_GRID:Quality"                   
## [5] "MOD44B_250m_GRID:Percent_Tree_Cover_SD"     
## [6] "MOD44B_250m_GRID:Percent_NonVegetated_SD"   
## [7] "MOD44B_250m_GRID:Cloud"

A quick plot of the data can be done with the plotRBG() function. We will dive into data processing and basic operations in the next section.

plotRGB(hdf.example, stretch = "lin")