This page was generated from classification.ipynb.
Interactive online version:
Catchment Classification#
[1]:
import itertools
from collections import defaultdict
from pathlib import Path
import cytoolz.curried as tlz
import numpy as np
import pandas as pd
import hydrosignatures as hsg
import pygeohydro as hgh
import pynhd
import pynldas2 as nldas
from pynhd import NLDI, StreamCat
In this tutorial, we use HyRiver libraries to classify catchments based on their hydrological characteristics from drought propagation mechanisms prospective. We follow Apurv et al., (2017). They classify catchments into three types as shown in the figure below.
In this figure \(P\) is precipitation, \(E_p\) is potential evapotranspiration.
We follow a similar approach, however instead of using physical modeling, we employ a machine learning algorithm. Moreover, we carry out our analysis for the 531 USGS stations that was used in Newman et al. (2017) (a subset of CAMELS). However, instead of using the limited catchment attributes available in the CAMELS dataset, we use StreamCat that includes over 600 attributes.
We use pygeohydro.get_camels
function to get the entire CAMELS dataset and then use its Newman_2017
variable to get our stations of interest.
[2]:
basins, qobs = hgh.get_camels()
Note that both basins
and qobs
include all catchment attributes of CAMELS, the only difference is that basins
has geometries of the stations’ drainage area, and qobs
has streamflow observations. For the purpose of this tutorial, we only need the location of stations and their IDs.
[3]:
stations = basins.loc[basins["Newman_2017"], ["gauge_lon", "gauge_lat"]]
For the purpose of this tutorial, we compute three climate indices: aridity, phase, and seasonality. Thus, we need to retrieve precipitation and potential evapotranspiration time series. We use PyNLDAS2. Note that NLDAS2 has a spatial resolution of 0.125 degree, so drainage area of most of our stations, are probably going to be within few pixels. Therefore, it’s not really necessary to get gridded data for each station watershed and a single pixel, would be enough. Although, using 20 or 30 years of climate data works better, we use only 10 years (2000-2010) so the notebook runs faster. Feel free to increase the time period.
Since we are getting 10 years of hourly climate data for 531 stations, it’s going to be memory intensive, and of course, time-consuming. Considering that we only need the aridity, phase, and seasonality index values, and not the actual climate data, it’s more efficient to retrieve stations’ climate data in chunks, compute the indices, and store them.
We compute the aridity index using Budyko, 1974, the phase index using a simple Spearman correlation coefficient between mean monthly summaries of \(P\) and \(E_p\), and seasonality index using Walsh and Lawler, 1981 method.
If mean monthly correlation is positive, we assume in-phase, otherwise out-of-phase. Although, technically, the threshold for in-phase is 0.3, for out-of-phase -0.3, and in-between no-phase, we use 0.0 as the threshold for simplicity and, more importantly, matching the table from Apurv et al., (2017). They used a bit more complex method that detects in/out of phase and there’s no no-phase.
For precipitation seasonality index, the threshold is 0.6, based on the following table from Walsh and Lawler, 1981.
For aridity index, we want to use UNEP’s classification table which has a threshold of 0.65 for arid/humid classification. However, hydrosignatures.compute_ai
computes the inverse of UNEP’s definition which is common in the hydrology literature. So, we need to invert the index before using it.
Note that the first time you run the following cell, it’s going to take a while and might fail due to service issues, since we are requesting for large amount of data. However, if failed you can rerun, and it should work. After the first time running the cell successfully, if you run it again, since the NLDAS service calls have been cached, the cell should run in just a few minutes.
[4]:
sig_paath = Path("cache/signatures.parquet")
if sig_paath.exists():
sig = pd.read_parquet(sig_paath)
else:
sig = defaultdict(dict)
station_its = zip(stations.index, zip(stations["gauge_lon"], stations["gauge_lat"]))
for coord_ids in tlz.partition_all(50, station_its):
sid, coords = zip(*coord_ids)
clm = nldas.get_bycoords(coords, "2000-01-01", "2010-01-01", sid, variables=["prcp", "pet"])
mean_monthly = hsg.compute_mean_monthly(clm, index_abbr=True)
corr = mean_monthly.corr("spearman").xs("pet", level=1, axis=1).xs("prcp", level=1, axis=0)
sig["phase"].update(dict(zip(corr.index, np.diag(corr) > 0))) # if in-phase
ai = hsg.compute_ai(clm.xs("pet", level=1, axis=1), clm.xs("prcp", level=1, axis=1))
sig["aridity"].update((1 / ai > 0.65).to_dict()) # if humid
si = hsg.compute_si_walsh(clm.xs("prcp", level=1, axis=1))
sig["seasonality"].update((si > 0.6).to_dict()) # if seasonal
sig = pd.DataFrame(sig)
sig.index.name = "station_id"
sig.to_parquet("cache/signatures.parquet")
sig.head()
[4]:
phase | aridity | seasonality | type | |
---|---|---|---|---|
station_id | ||||
01022500 | False | True | False | type1 |
01031500 | True | True | False | type3 |
01047000 | True | True | False | type3 |
01052500 | True | True | False | type3 |
01054200 | True | True | False | type3 |
Now, we need to determine the type of each station based on the table.
[5]:
def get_class(inphase: bool, humid: bool, seasonal: bool) -> str:
if not inphase:
return "type1"
if inphase and not humid:
return "type2"
if inphase and humid and not seasonal:
return "type3"
return "type1"
sig_itr = sig[["phase", "aridity", "seasonality"]].itertuples(index=False, name=None)
sig["type"] = list(itertools.starmap(get_class, sig_itr))
sig.head()
[5]:
phase | aridity | seasonality | type | |
---|---|---|---|---|
station_id | ||||
01022500 | False | True | False | type1 |
01031500 | True | True | False | type3 |
01047000 | True | True | False | type3 |
01052500 | True | True | False | type3 |
01054200 | True | True | False | type3 |
Next, we need to get watersheds’ characteristics. For this purpose, first we find their associated NHDPlus ComID using pynhd.NLDI
.
[6]:
nldi = NLDI()
comids = nldi.getfeature_byid("nwissite", [f"USGS-{i}" for i in stations.index])
StreamCat contains around 600 characteristics from a variety of sources. Since, we are interested in physical characteristics of these watersheds, first we need to filter out those characteristics that are not related. To do so, we use StreamCat.metrics_df
. Note that, in PyNHD, the main function for querying StreamCat is streamcat
. There is also a class called StreamCat
that contains useful information about the dataset such as available options for characteristic (called metrics in
StreamCat), their descriptions, and so on. So, it’s always a good idea to start with instantiating this class and figure out the available options.
[7]:
sc = StreamCat()
sc.metrics_df.head()
[7]:
AOI | DATE_DOWNLOADED | FINAL_TABLE | INDICATOR_CATEGORY | METADATA | METRIC_DESCRIPTION | METRIC_NAME | METRIC_UNITS | SOURCE_NAME | SOURCE_URL | UUID | YEAR | SLOPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Cat, Ws | 02-13-2016 | Dams | Anthropogenic | https://edg.epa.gov/metadata/rest/document?id=... | Total possible volume of all reservoirs (NID_S... | DamNIDStor | cubic meters/square kilometer | USGS NAWQA | ftp://ftpext.usgs.gov/pub/er/va/reston/NAWQA_E... | 46EB14C8-FC84-4BA8-9C7F-CC4938FD1FF3 | NaN | NaN |
1 | Cat, Ws | 02-13-2016 | Dams | Anthropogenic | https://edg.epa.gov/metadata/rest/document?id=... | Normal (most common) volume of all reservoirs ... | DamNrmStor | cubic meters/square kilometer | USGS NAWQA | ftp://ftpext.usgs.gov/pub/er/va/reston/NAWQA_E... | 46EB14C8-FC84-4BA8-9C7F-CC4938FD1FF3 | NaN | NaN |
2 | Cat, Ws | NaN | Elevation | Natural | https://edg.epa.gov/metadata/rest/document?id=... | Mean AOI elevation in meters. | Elev | Meters | NHDPlusV2 | http://www.horizon-systems.com/NHDPlus/NHDPlus... | 1730F0BC-7019-4821-8B31-4A6E7B3DA625 | NaN | NaN |
3 | Cat, Ws | 04-16-2015 | GeoChemPhys1 | Natural | https://edg.epa.gov/metadata/rest/document?id=... | Mean % of lithological ferric oxide (Fe2O3) co... | Fe2O3 | Percent | USGS Sciencebase | https://www.sciencebase.gov/catalog/folder/534... | 46EB14C8-FC84-4BA8-9C7F-CC4938FD1FF3 | NaN | NaN |
4 | Cat, Ws | 03-17-2016 | AgriculturalNitrogen | Anthropogenic | https://edg.epa.gov/metadata/rest/document?id=... | Mean rate of synthetic nitrogen fertilizer app... | Fert | mean rate kilogram Nitrogen/hectare/year | EnviroAtlas | https://enviroatlas.epa.gov/enviroatlas/DataFa... | 46EB14C8-FC84-4BA8-9C7F-CC4938FD1FF3 | NaN | NaN |
We narrow down the list of metrics, by first looking at FINAL_TABLE
column of the metrics’ dataframe.
By going over these options, we select those that are related to physical characteristics of the catchments.
[8]:
sc.metrics_df[sc.metrics_df.METRIC_DESCRIPTION.str.contains("index", case=False)]
[8]:
AOI | DATE_DOWNLOADED | FINAL_TABLE | INDICATOR_CATEGORY | METADATA | METRIC_DESCRIPTION | METRIC_NAME | METRIC_UNITS | SOURCE_NAME | SOURCE_URL | UUID | YEAR | SLOPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | Cat, Ws | 04-01-2016 | Kffact | Natural | https://edg.epa.gov/metadata/rest/document?id=... | Mean soil erodibility (Kf) factor (unitless) o... | AgKffact | mean surface soil erodibility factors | Penn State Soil Info | http://www.soilinfo.psu.edu/index.cgi?soil_dat... | EBB945ED-0151-4DBF-8FE0-806165A58973 | NaN | NaN |
9 | Cat, Ws | 03-10-2015 | BFI | Natural | https://edg.epa.gov/metadata/rest/document?id=... | Baseflow is the component of streamflow that c... | BFI | Percent | USGS Water Mission Area | http://water.usgs.gov/GIS/metadata/usgswrd/XML... | 66C0ED41-2707-4732-A906-E9D89E8F5A6B | NaN | NaN |
35 | Other | 09-04-2017 | NRSA_PredictedBioCondition | Natural | https://edg.epa.gov/metadata/rest/document?id=... | Predicted probability that a stream segment is... | prG_BMMI[Year] | Percent | Hill et al. 2017 | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5... | E4DB0FCC-4CB4-43CD-BE0B-C82858D85D36 | 2008, 2009 | NaN |
46 | Cat, Ws | 06-14-2016 | Kffact | Natural | https://edg.epa.gov/metadata/rest/document?id=... | Mean soil erodibility (Kf) factor (unitless) o... | Kffact | index value | Penn State Soil Info | http://www.soilinfo.psu.edu/index.cgi?soil_dat... | AA3ADCFD-0758-47FC-A069-31E95A116A3D | NaN | NaN |
116 | Cat, Ws | 10-01-2018 | WetIndex | Natural | https://edg.epa.gov/metadata/rest/document?id=... | Mean Composite Topographic Index (CTI) [Wetnes... | WetIndex | Avg. rate (mm) | EnviroAtlas | http://enviroatlas.epa.gov/enviroatlas/datafac... | E4DB0FCC-4CB4-43CD-BE0B-C82858D85D36 | NaN | NaN |
[9]:
themes = (
"Dams",
"Elevation",
"Kffact",
"BFI",
"STATSGO_Set1",
"NLCD",
"STATSGO_Set2",
"PRISM_0809",
"Runoff",
"Precip_Minus_EVT",
"WetIndex",
)
metrics = sc.metrics_df[sc.metrics_df.FINAL_TABLE.isin(themes)]
metrics.head()
[9]:
AOI | DATE_DOWNLOADED | FINAL_TABLE | INDICATOR_CATEGORY | METADATA | METRIC_DESCRIPTION | METRIC_NAME | METRIC_UNITS | SOURCE_NAME | SOURCE_URL | UUID | YEAR | SLOPE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Cat, Ws | 02-13-2016 | Dams | Anthropogenic | https://edg.epa.gov/metadata/rest/document?id=... | Total possible volume of all reservoirs (NID_S... | DamNIDStor | cubic meters/square kilometer | USGS NAWQA | ftp://ftpext.usgs.gov/pub/er/va/reston/NAWQA_E... | 46EB14C8-FC84-4BA8-9C7F-CC4938FD1FF3 | NaN | NaN |
1 | Cat, Ws | 02-13-2016 | Dams | Anthropogenic | https://edg.epa.gov/metadata/rest/document?id=... | Normal (most common) volume of all reservoirs ... | DamNrmStor | cubic meters/square kilometer | USGS NAWQA | ftp://ftpext.usgs.gov/pub/er/va/reston/NAWQA_E... | 46EB14C8-FC84-4BA8-9C7F-CC4938FD1FF3 | NaN | NaN |
2 | Cat, Ws | NaN | Elevation | Natural | https://edg.epa.gov/metadata/rest/document?id=... | Mean AOI elevation in meters. | Elev | Meters | NHDPlusV2 | http://www.horizon-systems.com/NHDPlus/NHDPlus... | 1730F0BC-7019-4821-8B31-4A6E7B3DA625 | NaN | NaN |
7 | Cat, Ws | 04-01-2016 | Kffact | Natural | https://edg.epa.gov/metadata/rest/document?id=... | Mean soil erodibility (Kf) factor (unitless) o... | AgKffact | mean surface soil erodibility factors | Penn State Soil Info | http://www.soilinfo.psu.edu/index.cgi?soil_dat... | EBB945ED-0151-4DBF-8FE0-806165A58973 | NaN | NaN |
9 | Cat, Ws | 03-10-2015 | BFI | Natural | https://edg.epa.gov/metadata/rest/document?id=... | Baseflow is the component of streamflow that c... | BFI | Percent | USGS Water Mission Area | http://water.usgs.gov/GIS/metadata/usgswrd/XML... | 66C0ED41-2707-4732-A906-E9D89E8F5A6B | NaN | NaN |
By looking at these metrics, we realize that some metrics are annual summaries and require specifying the year. We can use StreamCat.valid_years
to check the available years. For example, let’s get the last year of the available years for each metric. Note that, we need to replace [Year]
in METRIC_NAME
column of metrics_df
.
Now, we use these metrics and ComIDs to obtain their characteristics.
[10]:
names = metrics[metrics.YEAR.isna()].METRIC_NAME.to_list()
names_yr = metrics[~metrics.YEAR.isna()].METRIC_NAME
names += [n.replace("[Year]", str(sc.valid_years[n][-1])) for n in names_yr]
Now, to facilitate cross walking between the catchment attributes and climate indices datasets, we change the index of attrs
dataframe to be the station ID, and append the catchment types to the dataframe.
[11]:
attrs = pynhd.streamcat(names, comids=comids.comid, metric_areas="watershed")
attrs = attrs.dropna(axis=1)
attrs.columns = attrs.columns.str.lower()
attrs.head()
[11]:
comid | wsareasqkm | clayws | sandws | elevws | wetindexws | runoffws | tmean09ws | precip09ws | precip_minus_evtws | ... | pcturbmd2019ws | pcthbwet2019ws | pctconif2019ws | pcturbop2019ws | pctshrb2019ws | pctcrop2019ws | pcthay2019ws | pctmxfst2019ws | pctgrs2019ws | pcturbhi2019ws | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 272589 | 488.3202 | 5.699805 | 81.573543 | 239.6469 | 966.4871 | 362.4216 | 4.454809 | 753.066947 | 23.1532 | ... | 0.03 | 1.80 | 10.52 | 4.31 | 1.16 | 0.0 | 0.05 | 13.05 | 0.84 | 0.00 |
1 | 399452 | 38.6820 | 37.971573 | 21.478252 | 405.2075 | 783.9151 | 376.5604 | 13.602834 | 1175.682583 | 26.0265 | ... | 0.87 | 0.00 | 0.02 | 5.14 | 2.85 | 0.0 | 59.54 | 0.46 | 2.70 | 0.43 |
2 | 400496 | 808.0020 | 34.487492 | 24.813596 | 346.2187 | 746.4028 | 379.1795 | 14.033619 | 1476.165119 | 32.1875 | ... | 0.28 | 0.08 | 0.73 | 4.24 | 1.89 | 0.0 | 37.90 | 4.28 | 1.82 | 0.11 |
3 | 400822 | 106.6815 | 29.019383 | 33.615368 | 407.6582 | 755.9538 | 378.7824 | 13.712378 | 1502.645026 | 32.5037 | ... | 0.41 | 0.00 | 0.31 | 4.00 | 0.92 | 0.0 | 42.16 | 3.43 | 0.90 | 0.15 |
4 | 487460 | 158.6061 | 21.920000 | 23.720000 | 469.7755 | 580.5784 | 519.3887 | 12.217670 | 1512.615391 | 45.3002 | ... | 0.21 | 0.00 | 0.04 | 3.73 | 4.20 | 0.0 | 0.38 | 7.37 | 2.13 | 0.02 |
5 rows × 30 columns
[13]:
sid_comid = comids[["identifier", "comid"]].copy()
sid_comid["identifier"] = sid_comid["identifier"].str.replace("USGS-", "")
sid_comid["comid"] = sid_comid["comid"].astype(int)
attrs = attrs.merge(sid_comid, left_on="comid", right_on="comid")
attrs = attrs.set_index("identifier")
attrs.index.name = "station_id"
attrs["type"] = sig["type"]
attrs = attrs.drop("comid", axis=1)
attrs.head()
type_mapping = {"type1": 0, "type2": 1, "type3": 2}
attrs["type"] = attrs["type"].map(type_mapping).astype(int)
attrs["type"].value_counts().plot.bar()
[13]:
<Axes: >

[14]:
conus = pynhd.streamcat(names, conus=True, metric_areas="watershed")
conus = conus.dropna(axis=1, how="all")
conus = conus.dropna(axis=0)
conus.columns = conus.columns.str.lower()
conus.head()
[14]:
comid | wsareasqkm | clayws | sandws | elevws | wetindexws | kffactws | agkffactws | runoffws | tmean09ws | ... | pcturbmd2019ws | pcthbwet2019ws | pctconif2019ws | pcturbop2019ws | pctshrb2019ws | pctcrop2019ws | pcthay2019ws | pctmxfst2019ws | pctgrs2019ws | pcturbhi2019ws | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 23796953 | 37.4472 | 36.008147 | 13.272119 | 150.9796 | 731.3332 | 0.3098 | 0.1078 | 1189.0 | 10.946465 | ... | 0.30 | 0.39 | 14.17 | 2.65 | 3.60 | 12.06 | 33.03 | 23.93 | 0.93 | 0.04 |
1 | 23797225 | 6.2469 | 40.545943 | 15.330458 | 421.2429 | 681.2311 | 0.2559 | 0.0000 | 1189.0 | 10.275770 | ... | 0.00 | 0.00 | 70.47 | 4.37 | 1.05 | 0.00 | 0.04 | 17.42 | 5.36 | 0.00 |
2 | 23797735 | 7.4619 | 21.825341 | 27.091912 | 462.4695 | 570.2685 | 0.2507 | 0.0000 | 1189.0 | 9.841584 | ... | 0.02 | 0.00 | 75.07 | 6.27 | 1.30 | 0.00 | 0.00 | 10.73 | 6.43 | 0.00 |
3 | 23797721 | 0.4617 | 20.324308 | 28.433392 | 867.2295 | 568.2242 | 0.2406 | 0.0000 | 1189.0 | 9.772891 | ... | 0.00 | 0.00 | 84.99 | 9.36 | 0.00 | 0.00 | 0.00 | 0.00 | 5.65 | 0.00 |
4 | 23801020 | 23.5935 | 19.553445 | 27.926475 | 780.2219 | 646.2592 | 0.2440 | 0.0000 | 1406.0 | 7.964721 | ... | 0.00 | 0.00 | 92.06 | 0.00 | 5.87 | 0.00 | 0.00 | 0.26 | 1.51 | 0.00 |
5 rows × 32 columns