

| Morphometric characteristic | Formula | Units | Reference |
|---|---|---|---|
| 1. Drainage network | | | |
| Stream order (u) | Hierarchical rank | dimensionless | Strahler (1964) |
| Number of stream orders (Nu) | $N_u = N_1 + N_2 + \cdots + N_n$ | dimensionless | Horton (1945) |
| Length of stream orders (Lu) | $L_u = L_1 + L_2 + \cdots + L_n$ | km | Horton (1945) |
| Bifurcation ratio (Rb) | $R_b = N_u / N_{u+1}$ | dimensionless | Schumm (1956) |
| Mean bifurcation ratio (Rbm) | $R_{bm} = \dfrac{N_1/N_2 + N_2/N_3 + \cdots + N_{n-1}/N_n}{n - 1}$ | dimensionless | Strahler (1957) |
| 2. Basin geometry | | | |
| Total basin area (A) | Plan area enclosed by basin boundary | km² | |
| Total basin surface area (As) | Surface area enclosed by basin boundary | km² | |
| Basin perimeter (P) | Length of the drainage basin boundary | km | Schumm (1956) |
| Basin length (Lb) | Distance from outlet to the farthest point on basin boundary | km | Schumm (1956) |
| Main channel length (Lc) | Length of longest network from outlet to upstream | km | Ayad (2015) |
| Fitness ratio (Rf) | $R_f = L_c / P$ | dimensionless | Melton (1957) |
| Form factor (Ff) | $F_f = A / L_b^2$ | dimensionless | Horton (1932) |
| Relative perimeter (Pr) | $P_r = A / P$ | dimensionless | Schumm (1956) |
| Length area relation (Lar) | $L_{ar} = 1.4 \, A^{0.6}$ | km^1.2 | Hack (1957) |
| Rotundity coefficient (R) | $R = \pi L_b^2 / (4A)$ | dimensionless | Strahler (1964) |
| Mean basin width (W) | $W = A / L_b$ | km | Horton (1932) |
| Compactness coefficient (C) | $C = 0.282 \, P / \sqrt{A}$ | dimensionless | Horton (1945) |
| Circularity ratio (Rc) | $R_c = 4 \pi A / P^2$ | dimensionless | Strahler (1964); Miller (1953) |
| Elongation ratio (Re) | $R_e = 1.129 \, \sqrt{A} / L_b$ | dimensionless | Schumm (1956) |
| 3. Drainage texture | | | |
| Drainage texture (Dt) | $D_t = N_u / P$ | No./km | Horton (1945) |
| Drainage density (Dd) | $D_d = L_u / A$ | km/km² | Horton (1945) |
| Stream frequency (Fs) | $F_s = N_u / A$ | No./km² | Horton (1945) |
| Constant of channel maintenance (Cm) | $C_m = 1 / D_d = A / L_u$ | km²/km | Strahler (1964); Schumm (1956) |
| Infiltration number (In) | $I_n = F_s \, D_d$ | No./km³ | Faniran (1968); Pareta and Pareta (2011) |
| Drainage intensity (Di) | $D_i = F_s / D_d$ | No./km | Faniran (1968); Pareta and Pareta (2011) |
| Average length of overland flow (Lo) | $L_o = 1 / (2 D_d)$ | km | Horton (1945) |
| 4. Basin relief | | | |
| Height of basin outlet | The outlet height from DEM | m | |
| Maximum height of basin | The maximum basin height from DEM | m | |
| Basin relief (R) | $R = H - h$, where H is the maximum elevation and h is the minimum elevation of a basin | m | Schumm (1956) |
| Relief ratio (Rr) | $R_r = R / L_b$ | dimensionless | Schumm (1956) |
| Relative relief ratio (Rrr) | $R_{rr} = 100 \, R / P$ | dimensionless | Melton (1957) |
| Ruggedness number (Rn) | $R_n = R \, D_d$ | dimensionless | Strahler (1958) |
| Terrain undulation index (T) | $T = A_s / A$ | dimensionless | Ayad (2015) |

A machine learning approach to geomorphometry-extreme flood links in the Lower Colorado River Basin

Lin Ji, Victor R. Baker, Hoshin V. Gupta, P.A. Ty Ferré, and Tao Liu

Department of Hydrology and Atmospheric Sciences, the University of Arizona

INTRODUCTION

Extreme flood hazards are common in the Lower Colorado River Basin (LCRB) because of its complex terrain and entrenched river channels. Evaluating basin morphometry helps explain the physical behavior of watersheds with respect to extreme flood events. However, extracting basin morphometric characteristics is computationally expensive and time-consuming. Conventional approaches lack effective tools for linking morphometric indices to extreme floods, which poses a great challenge for extreme flood prediction.

In this study, we extracted 41 basin morphometric parameters for 372 watersheds in the LCRB from a 10 m DEM using ArcGIS with Python scripts. We then employed Random Forest (RF) regression with the GridSearchCV algorithm and Out-of-Bag (OOB) error estimation to link these morphometric features to two extreme flood records: the maximum annual peak discharge (MAP) and the peak discharge per unit area (UP). The model can also be used to understand the relative importance of geomorphometric variables in predicting extreme floods.

DATA

There are 695 USGS stream gages located in the LCRB. We excluded redundant sites (gages a short distance apart on the same stream), gages in heavily urbanized areas, sites influenced by upstream reservoirs or dams, and sites with large errors in the calculated stream orders. The remaining 372 gaging stations, each with at least 10 years of annual peak discharge records, were used as the research watersheds.
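The record-length filter and the two response variables can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: the `Gage` record, its field names, and the sample numbers are hypothetical, and UP is assumed here to be MAP divided by the contributing area.

```python
from dataclasses import dataclass

MIN_YEARS = 10  # record-length threshold stated in the text


@dataclass
class Gage:
    site_id: str         # hypothetical identifier
    area_km2: float      # contributing area of the watershed
    annual_peaks: list   # one annual peak discharge per year of record


def flood_responses(gages):
    """Return {site_id: (MAP, UP)} for gages with a long enough record."""
    out = {}
    for g in gages:
        if len(g.annual_peaks) < MIN_YEARS:
            continue  # fewer than 10 annual peaks: site is excluded
        map_q = max(g.annual_peaks)   # maximum annual peak discharge (MAP)
        up_q = map_q / g.area_km2     # assumed definition of UP: MAP / area
        out[g.site_id] = (map_q, up_q)
    return out


# Made-up example data, not values from the study.
gages = [
    Gage("A", 50.0, [12.0, 30.0, 8.0, 100.0, 45.0, 20.0, 15.0, 60.0, 33.0, 10.0]),
    Gage("B", 20.0, [5.0, 7.0, 9.0]),  # only 3 years of record
]
responses = flood_responses(gages)
print(responses)
```

The other exclusion criteria (urbanization, regulation, stream-order errors) need site metadata and are left out of this sketch.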

The watersheds were delineated with the ArcGIS Hydrology Toolbox. The morphometric characteristics of each basin were then calculated with the Morphometric Toolbox and Python scripts, following the methods and formulas in Tab. 1.

Tab. 1 Formulas and methods in calculation of morphometric parameters
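As an illustration of how the formulas in Tab. 1 turn into code, the sketch below computes a subset of the geometry and texture indices from six basic measurements. It is not the Morphometric Toolbox itself: the function name and the input values are hypothetical, the constants 0.282 and 1.129 are the usual rounded forms of the compactness and elongation coefficients, and the relief indices are omitted because their unit convention (m versus km) is not stated.

```python
import math


def basin_indices(A, P, Lb, Lc, Nu, Lu):
    """Selected Tab. 1 indices.

    A: basin area (km^2), P: perimeter (km), Lb: basin length (km),
    Lc: main channel length (km), Nu: number of streams, Lu: stream length (km).
    """
    Dd = Lu / A   # drainage density (km/km^2)
    Fs = Nu / A   # stream frequency (No./km^2)
    return {
        "Rf": Lc / P,                       # fitness ratio
        "Ff": A / Lb**2,                    # form factor
        "W": A / Lb,                        # mean basin width (km)
        "C": 0.282 * P / math.sqrt(A),      # compactness coefficient
        "Rc": 4 * math.pi * A / P**2,       # circularity ratio
        "Re": 1.129 * math.sqrt(A) / Lb,    # elongation ratio
        "Dt": Nu / P,                       # drainage texture (No./km)
        "Dd": Dd,
        "Fs": Fs,
        "Cm": 1 / Dd,                       # constant of channel maintenance
        "In": Fs * Dd,                      # infiltration number
        "Di": Fs / Dd,                      # drainage intensity
        "Lo": 1 / (2 * Dd),                 # average length of overland flow (km)
    }


# Made-up basin, not one of the 372 study watersheds.
idx = basin_indices(A=100.0, P=50.0, Lb=20.0, Lc=25.0, Nu=40, Lu=80.0)
for name, value in idx.items():
    print(f"{name}: {value:.4f}")
```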

MODEL

Tab. 2 Summary of the best training and evaluation results for the random forest model.

1. Random Forest algorithm (Breiman, 2001)

Fig. 1 Locations of the 372 USGS stream gages and their contributing areas, delineated from the USGS NED, in the Lower Colorado River Basin.

RESULTS

Fig. 2 GridSearchCV and OOB error rate results used to validate the trained RF model for the two response variables, MAP and UP. Panels 2a and 2b show how the average MSE from 10-fold GridSearchCV changes with n_estimators for different max_features settings, for MAP and UP, respectively. Panels 2c and 2d show how the OOB error rate changes with n_estimators for different max_features settings, for MAP and UP, respectively.

Fig. 4 Feature importance of the input variables in predicting MAP (left) and UP (right).
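A ranking like the one in Fig. 4 can be obtained from a fitted forest. The sketch below assumes scikit-learn and its impurity-based `feature_importances_` attribute; the poster does not say which importance measure was used. The data are synthetic, and the four feature names are only a hypothetical subset of the 41 predictors.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
names = ["A", "P", "Lb", "Dd"]   # hypothetical subset of predictors
X = rng.uniform(0.0, 1.0, size=(200, len(names)))
# Synthetic response: only the first feature carries signal.
y = 10.0 * X[:, 0] + rng.normal(0.0, 0.1, size=200)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1.
ranking = sorted(zip(names, rf.feature_importances_),
                 key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances can favor correlated or high-variance predictors, which matters here because many morphometric indices are derived from the same few measurements (A, P, Lb).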

CONCLUSION

The results indicate that the RF model estimates UP better than MAP. They also suggest that a significant improvement in predicting MAP is achieved with the relative perimeter, total basin area, and length area relation. A similar improvement in predicting UP is achieved with the maximum height of basin, total basin relief, and relief ratio. This initial application of RF shows that data-driven machine learning can help link morphometry to measures of extreme flooding, thereby advancing our understanding of regional large flood behavior and improving flood risk analyses for the Southwestern U.S.

REFERENCES

[1]. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

[2]. Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning (Vol. 1, No. 10). New York: Springer series in statistics.

[3]. Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18-22.

[4]. Sadler, J. M., Goodall, J. L., Morsy, M. M., & Spencer, K. (2018). Modeling urban coastal flood severity from crowd-sourced flood reports using Poisson regression and Random Forest. Journal of Hydrology, 559, 43-55.

2. Hyperparameter Tuning

Two important hyperparameters need to be optimized in the Python script:

• n_estimators: the number of trees in the forest, each grown on a bootstrapped sample of the observations.

• max_features: the number of features to consider when looking for the best split. The options are "auto", where max_features = n_features; "sqrt", where max_features = sqrt(n_features); and "log2", where max_features = log2(n_features). Here n_features is the number of features in the data.

GridSearchCV and the OOB error rate are used to evaluate the performance of the trained model.
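The tuning procedure described above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the authors' script. The grid values are arbitrary. `max_features=1.0` stands in for "auto", which recent scikit-learn releases no longer accept. The OOB error is taken here as 1 minus the OOB R², which is an assumption about how the poster defines it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
X = rng.uniform(size=(150, 8))   # synthetic stand-in for morphometric features
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] ** 2 + rng.normal(0.0, 0.1, size=150)

param_grid = {
    "n_estimators": [25, 50, 100],
    "max_features": ["sqrt", "log2", 1.0],   # 1.0 means all features
}

# 10-fold cross-validated grid search, scored by (negated) MSE.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=10,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
best_mse = -search.best_score_
print("best parameters:", search.best_params_)
print("mean CV MSE:", round(best_mse, 4))

# OOB error as a function of the number of trees, no hold-out set required.
oob_error = {}
for n in param_grid["n_estimators"]:
    rf = RandomForestRegressor(
        n_estimators=n,
        max_features=search.best_params_["max_features"],
        oob_score=True,      # uses the samples left out of each bootstrap
        random_state=0,
    ).fit(X, y)
    oob_error[n] = 1.0 - rf.oob_score_   # oob_score_ is R^2 on OOB samples
print("OOB error by n_estimators:", oob_error)
```

The two checks are complementary: cross-validation refits the model on each fold, while the OOB estimate comes from a single fit and is therefore cheaper when scanning many values of n_estimators.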

Fig. 3 Observed versus estimated values for the training and evaluation samples using Random Forest regression. Left: prediction of MAP. Right: prediction of UP.

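| Response | R² (Training) | R² (Testing) | MAE (Training) | MAE (Testing) | RMSE (Training) | RMSE (Testing) | std (Training) | std (Testing) | Explained variance (Training) | Explained variance (Testing) |
|---|---|---|---|---|---|---|---|---|---|---|
| MAP | 0.918 | 0.706 | 0.022 | 0.047 | 0.198 | 0.172 | 0.002 | 0.006 | 0.885 | 0.336 |
| UP | 0.902 | 0.754 | 0.024 | 0.037 | 0.202 | 0.154 | 0.002 | 0.004 | 0.871 | 0.512 |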
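For reference, the sketch below computes four of the scores reported in the table using their standard textbook definitions. The poster does not state its own definitions, so these are assumptions, and the "std" column is left out because its meaning is not given. The sample values are made up.

```python
import math


def regression_metrics(y_true, y_pred):
    """R^2, MAE, RMSE and explained variance (standard definitions)."""
    n = len(y_true)
    resid = [t - p for t, p in zip(y_true, y_pred)]
    mean_y = sum(y_true) / n
    mean_r = sum(resid) / n
    ss_res = sum(r * r for r in resid)                  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)     # total sum of squares
    var_res = sum((r - mean_r) ** 2 for r in resid) / n # variance of residuals
    var_y = ss_tot / n                                  # variance of observations
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": sum(abs(r) for r in resid) / n,
        "RMSE": math.sqrt(ss_res / n),
        "explained_variance": 1.0 - var_res / var_y,
    }


# Predictions with a constant bias of +0.5.
m = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
print(m)
```

In this example the constant bias lowers R² to 0.8 but leaves the explained variance at 1.0, because explained variance ignores the mean of the residuals. Under these definitions explained variance is never smaller than R², so the values in the table may rest on different definitions.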