HGWR Model and How to Use It
A brief tutorial to Hierarchical and Geographically Weighted Regression model and hgwrr package in R.
Introduction to HGWR Model
What is HGWR model?
Hierarchical and Geographically Weighted Regression, shorted for HGWR, is a spatial modelling method designed for data of spatial hierarchical structures. Just as its name implies, this is a combination of Hierarchical Linear Model (HLM, also known as Multilevel Model, Raudenbush 1993) and Geographically Weighted Regression (GWR, Brunsdon, Fotheringham, and Charlton 1996). In this model, spatial effects are divided into three types: global fixed, local fixed and random. Formally, it is expressed as \[ y = G\gamma + X\beta + Z\mu + \epsilon \] with \(y\) the dependent variable, \(G\) the group level independent variables, \(\gamma\) the local fixed effects, \(X\) also the group level independent variables, \(\beta\) the global fixed effects, \(Z\) the individual level independent variables, \(\mu\) the random effects, \(\epsilon\) the individual errors.
Why HGWR model?
As we know, hierarchical structure is commonly existing in spatial data. For example, cities can be grouped by provinces or other higher-level administrative district they belong to; house prices may share some factors from the block; and students in one school have different access to education resources with those in another school. When dealing with this type of data, we usually choose HLM to address the within-group homogeneity and the between-group heterogeneity. And there are usually two types of variables: group-level variables and sample-level variables. The formal ones are used to describe the properties of groups (such as the provinces, blocks and schools); the latter ones are observations of individual samples (such as the cities, houses and students). The effect of some sample-level variables are similar in all groups, thus they are modelled with fixed coefficients (effects). For others, they are modelled individually, i.e., with random effects.
However, for group-level variables, they can only be modelled with fixed effects. For spatial data, we would encounter some problems. According to the Tobler’s first law of Geography “Everything is related to everything else, but near things are more related than distant things” (Tobler 1970). If the model is calibrated with equally weighted samples, spatial heterogeneity would be overlooked (Fotheringham, Brunsdon, and Charlton 2002). Thus, it requires us to distinguish “local fixed effects” from “global fixed effects” to discover spatial heterogeneity in group-level variables.
But why not GWR or Multiscale GWR (Fotheringham, Yang, and Kang 2017, LuBrunsdon–2017) Because when dealing with data of hierarchical structures, GWR is problematic (Hu et al. 2022). We know that GWR calibrate a model with unique coefficients on each sample by borrowing data from its neighbours. And it uses a parameter “bandwidth” to control how many neighbours are included. If samples are not hierarchically structured, everything works well. However, just imagine a situation like Figure 1. For the two samples of red color and blue color, we take the same number of their neighbours, but actually the spatial extents are not the same. In extreme cases, spatial extends of some samples could be too small to hold more than one or two location, but some are large enough. This would lead to the failure of bandwidth optimization and reduce the reliability of the optimized bandwidth.
As shown in this video, bandwidths have inequal spatial scale for two samples (represented by cubes). Both the samples represented by large red cubes and large blue cubes take 41 neighbour samples to calibrate GWR models. For the red one, neighbours on 8 nearest locations are taken. But the figure for the blue one is only 6. This situation means estimated coefficients are more smoothed for the red samples. In other words, estimations for the blue samples are much local.
To solve the problems mentioned above, we need to use HGWR model. It is able to modelling spatial hierarchical structure and spatial heterogeneity simultaneously. Examples below can show that it works well for spatial hierarchical data.
Modelling with HGWR Model
The R package hgwrr is built for calibrating HGWR model. In this section, we are going to show how to use it.
Installation
Package hgwrr is available on CRAN. Simply type the following codes to install it.
install.packages("hgwrr")Or download latest released source package and run the following command to install this package.
R CMD INSTALL hgwrr_0.2-0.tar.gzNote that RTools is required on Windows.
Usage
We are going to show the usage of hgwrr package with a simulated data.
First, we need to load this package in an R session.
library(hgwrr)Then we can calibrate an HGWR model via hgwr()
function.
hgwr(
formula, data, local.fixed, coords, bw,
alpha = 0.01, eps_iter = 1e-06, eps_gradient = 1e-06, max_iters = 1e+06,
max_retries = 10, ml_type = HGWR_ML_TYPE_D_ONLY, verbose = 0
)The first five arguments are mandatory.
formulaaccepts a formula object in R. Its format follows lme4 package. As there are two types of effects: fixed effects and random effects, we use the following format to specify both of them:dependent ~ fixed1 + fixed2 + (random1 + random2 | group)dataaccepts a DataFrame object in R. All variables specified informulaare extracted fromdata. In this stage,Spatial*DataFrameis not supported.local.fixedaccepts a list of character specifying which fixed effects are local. For example, iffixed1needs to be local fixed, then setlocal.fixedtoc("fixed1").coordsaccepts a matrix of 2 columns. Each row is the longitude and latitude of each group.bwaccepts a integer or numeric number to specify the bandwidth used in geographically weighted process. Currently it can only be adaptive bandwidth.
Other arguments are optional which is used to control the backfitting
maximum likelihood algorithm. On most occassions the default values are
fine. If the default values cause some problems and you want change some
of them, please check the documentation of function hgwr()
for more infomation.
Example: A Small Simulated Data Set
This example is used to show the usage of this package and to test whether it works. We don’t care about how good the fitness of this model is with this data set.
A data set “multisampling” is provided with this package,
data(multisampling)
head(multisampling$data)| y | g1 | g2 | z1 | x1 | group |
|---|---|---|---|---|---|
| 1.2311965 | 0.1706889 | -0.2246718 | 1.4808437 | 0.7930132 | 1 |
| 2.7154442 | 0.1706889 | -0.2246718 | 0.4890035 | 0.5222513 | 1 |
| 1.9980754 | 0.1706889 | -0.2246718 | -0.2261288 | 1.7462222 | 1 |
| 3.7671728 | 0.1706889 | -0.2246718 | 0.3268472 | -1.2713361 | 1 |
| 4.4938533 | 0.1706889 | -0.2246718 | 1.8754945 | 2.1973895 | 1 |
| 0.7256683 | 0.1706889 | -0.2246718 | -0.3023764 | 0.4331308 | 1 |
head(multisampling$coords)| U | V |
|---|---|
| 2940.897 | 2851.943 |
| 3002.659 | 3157.717 |
| 2848.345 | 2904.326 |
| 2863.735 | 2907.999 |
| 3117.849 | 2800.236 |
| 2906.585 | 2972.770 |
where y is the dependent variable, g1 and
g2 are two group-level variables, z1 and
x1 are two sample-level variables, group are
the labels of the groups they belong to, and U,
V are longitude and latitude coordinate values of all
groups.
We regards g1 and g2 have local fixed
effects, x1 have global fixed effects and z1
have random effects. Then we can calibrate an HGWR model with like
this
ms_hgwr <- hgwr(
formula = y ~ g1 + g2 + x1 + (z1 | group),
data = multisampling$data,
local.fixed = c("g1", "g2"),
coords = multisampling$coords,
bw = 10
)
ms_hgwr## Hierarchical and geographically weighted regression model
## =========================================================
## Formula: y ~ g1 + g2 + x1 + (z1 | group)
## Method: Back-fitting and Maximum likelihood
## Data: multisampling$data
##
## Global Fixed Effects
## -------------------
## Intercept x1
## -4538.623106 0.964530
##
## Local Fixed Effects
## -------------------
## Coefficient Min 1st Quartile Median 3rd Quartile Max
## Intercept 4540.169533 4540.258191 4540.536449 4540.825440 4541.111718
## g1 5.919438 6.366335 7.282849 7.877878 9.261846
## g2 -0.935852 0.052836 0.978779 1.494663 2.178014
##
## Random Effects
## --------------
## Groups Name Std.Dev. Corr
## group Intercept 778.316317
## z1 49.552648 0.063573
## Residual 1.900196
##
## Other Information
## -----------------
## Number of Obs: 484
## Groups: group , 16
The output of the model shows estimations of global fixed effects, summary of those of local fixed effects. Also there are the standard deviations of random effects and correlation coefficients between them.
Then we can have a look on the coefficient estimations.
coef(ms_hgwr)| Intercept | g1 | g2 | x1 | z1 | group |
|---|---|---|---|---|---|
| 0.7575213 | 8.073992 | -0.0488209 | 0.9645297 | -0.1886598 | 1 |
| 3.4042278 | 7.035509 | 1.3211308 | 0.9645297 | 0.3153482 | 2 |
| 0.7623931 | 7.660647 | 0.8469575 | 0.9645297 | 1.3234052 | 3 |
| 2.4643282 | 7.591925 | 0.9277784 | 0.9645297 | -0.6226628 | 4 |
| 5.8517603 | 9.261846 | -0.9358521 | 0.9645297 | -1.5012307 | 5 |
| 0.6818849 | 6.228258 | 1.7987301 | 0.9645297 | 0.7493044 | 6 |
| 2.6272027 | 8.301746 | -0.0470795 | 0.9645297 | -1.5212598 | 7 |
| 3.5158609 | 7.969104 | 0.0441887 | 0.9645297 | 0.3311799 | 8 |
| 1.0563234 | 6.262002 | 1.3913978 | 0.9645297 | 0.6206638 | 9 |
| 1.1177829 | 5.919438 | 2.1780138 | 0.9645297 | -0.8921036 | 10 |
| 1.0775400 | 6.968977 | 1.3304763 | 0.9645297 | 0.3229058 | 11 |
| 0.4547059 | 7.786652 | 0.0614841 | 0.9645297 | -0.4205407 | 12 |
| 0.2454580 | 7.517574 | 1.0297799 | 0.9645297 | 1.6211414 | 13 |
| 3.2664446 | 6.117750 | 1.5979289 | 0.9645297 | 0.0050019 | 14 |
| 1.8603374 | 6.470669 | 1.7181931 | 0.9645297 | 0.0483338 | 15 |
| 2.2202882 | 7.048124 | 0.5651645 | 0.9645297 | -0.0496753 | 16 |
With ggplot2 or other packages, we can create some figures.
library(ggplot2)
ms_hgwr_coef <- as.data.frame(cbind(multisampling$coords, coef(ms_hgwr)))
ggplot(ms_hgwr_coef, aes(x = U, y = V)) +
geom_point(aes(color = g1)) +
coord_fixed() + theme_bw()
We can also convert it to spatial data and use tmap to visualize.
library(sf)
library(tmap)
ms_hgwr_coef_sf <- st_as_sf(ms_hgwr_coef,
coords = names(multisampling$coords),
crs = 27700)
tm_shape(ms_hgwr_coef_sf) + tm_dots(col = c("g1", "g2"), size = 0.5)
And we can also fetch the fitted and residuals.
head(data.frame(
real = multisampling$data$y,
fitted = fitted(ms_hgwr),
residuals = residuals(ms_hgwr)
))| real | fitted | residuals |
|---|---|---|
| 1.2311965 | 2.6321401 | -1.4009436 |
| 2.7154442 | 2.5581026 | 0.1573416 |
| 1.9980754 | 3.8735757 | -1.8755003 |
| 3.7671728 | 0.8587268 | 2.9084461 |
| 4.4938533 | 3.9122480 | 0.5816052 |
| 0.7256683 | 2.6214449 | -1.8957766 |
The summary() function will give some statistical
information about this model.
summary(ms_hgwr)## Hierarchical and geographically weighted regression model
## =========================================================
## Formula: y ~ g1 + g2 + x1 + (z1 | group)
## Method: Back-fitting and Maximum likelihood
## Data: multisampling$data
##
## Diagnostics
## -----------
## Rsquared
## 0.647278
##
## Scaled residuals
## ----------------
## Min 1Q Median 3Q Max
## -5.457712 -1.200908 -0.066369 1.264277 5.293872
##
## Other Information
## -----------------
## Number of Obs: 484
## Groups: group , 16
On the current stage, only pesudo \(R^2\) is available. In the future, more diagnostic information will be provided in this package.
Example: Large Scale Simulated Data
In the former example, there are only 484 observations and 16 groups.
They are not adequate enough to get precises estimations. Here, we are
going to use a large scale simulated data set to show the performance of
HGWR model. As the true value of coefficients are already known (stored
in variable msl_beta), closeness between estimated and true
values is an appliable performance metric.
The data set is provided here. Its
structure is similar to data multisampling.
data(multisampling.large)
head(multisampling.large$data)| y | g1 | g2 | x1 | z1 | group |
|---|---|---|---|---|---|
| 2.090256 | 0.1933745 | -0.057836 | 0.7930132 | 1.1432041 | 1 |
| 3.456452 | 0.1933745 | -0.057836 | 0.5222513 | 1.0198745 | 1 |
| 2.876627 | 0.1933745 | -0.057836 | 1.7462222 | -0.7071740 | 1 |
| 4.510162 | 0.1933745 | -0.057836 | -1.2713361 | 0.8431381 | 1 |
| 5.583738 | 0.1933745 | -0.057836 | 2.1973895 | -0.1603318 | 1 |
| 1.601511 | 0.1933745 | -0.057836 | 0.4331308 | -0.7634945 | 1 |
head(multisampling.large$coords)## V1 V2
## [1,] 2940.897 2851.943
## [2,] 3002.659 3157.717
## [3,] 2848.345 2904.326
## [4,] 2863.735 2907.999
## [5,] 3117.849 2800.236
## [6,] 2906.585 2972.770
msl_beta <- multisampling.large$beta
head(msl_beta)| Intercept | g1 | g2 | z1 | x1 |
|---|---|---|---|---|
| 0.7114818 | 11.309720 | 3.336207 | -0.1359245 | 1 |
| 1.6523106 | 12.438025 | 4.972805 | -0.0407970 | 1 |
| 1.4783712 | 6.876733 | -5.069080 | 1.0105390 | 1 |
| 3.2734732 | 6.088872 | -3.877653 | -0.1582624 | 1 |
| 3.8245206 | 21.341706 | 3.814638 | -2.1566375 | 1 |
| 0.4886921 | 1.243364 | -2.469626 | 0.4986468 | 1 |
In this data, we also regards g1 and g2 as
two group-level variables, z1 and x1 as two
sample-level variables, group as the labels of the groups
they belong to, and U, V as longitude and
latitude coordinate values of all groups. Then we calibrate an HGWR
model.
As the data is large (13862 observations), it may take some time to get results.
msl_hgwr <- hgwr(
formula = y ~ g1 + g2 + x1 + (z1 | group),
data = multisampling.large$data,
local.fixed = c("g1", "g2"),
coords = multisampling.large$coords,
bw = 32, kernel = "bisquared"
)
msl_hgwr## Hierarchical and geographically weighted regression model
## =========================================================
## Formula: y ~ g1 + g2 + x1 + (z1 | group)
## Method: Back-fitting and Maximum likelihood
## Data: multisampling.large$data
##
## Global Fixed Effects
## -------------------
## Intercept x1
## 3.142118 1.010699
##
## Local Fixed Effects
## -------------------
## Coefficient Min 1st Quartile Median 3rd Quartile Max
## Intercept -2.099433 -1.501617 -1.330840 -1.160251 -0.637086
## g1 -3.495954 0.264949 3.452187 6.340162 11.766479
## g2 -11.003983 -2.416328 -0.251891 1.704997 8.757358
##
## Random Effects
## --------------
## Groups Name Std.Dev. Corr
## group Intercept 1.743385
## z1 1.841777 -0.178200
## Residual 1.991062
##
## Other Information
## -----------------
## Number of Obs: 13862
## Groups: group , 200
Fitness Assessment
Then we check the estimations of intercept, g1,
g2 and z1 via some scatter plots.
library(dplyr)
library(purrr)
coef(msl_hgwr) %>%
select(Intercept, g1, g2, z1) %>%
list(label = names(.), Truth = msl_beta[names(.)], Estimated = .) %>%
pmap_df(data.frame) %>%
mutate(label = factor(label, levels = c("Intercept", "g1", "g2", "z1"))) %>%
ggplot(aes(x = Truth, y = Estimated)) + geom_point() +
coord_fixed() + scale_y_continuous(limits = c(-25, 25)) +
facet_grid(cols = vars(label)) +
theme_bw()
In addition to these scatter plots, root mean squared errors (RMSE) and mean absolute errors of estimations and true values are also very useful to assess the fitting performance.
msl_hgwr_err <- coef(msl_hgwr) %>%
select(Intercept, g1, g2, z1, x1) %>% {
as.data.frame(rbind(
MAE = map2_dbl(msl_beta, ., ~ mean(abs(.x - .y))),
RMSE = map2_dbl(msl_beta, ., ~ sqrt(mean((.x - .y)^2)))
))
}
msl_hgwr_err| Intercept | g1 | g2 | z1 | x1 | |
|---|---|---|---|---|---|
| MAE | 0.3529462 | 2.437674 | 2.134569 | 0.2027014 | 0.0106994 |
| RMSE | 0.4856966 | 3.703679 | 2.862071 | 0.2575985 | 0.0106994 |
Comparison of HGWR, GWR and HLM
As a comparison, we can also calibrate a GWR model and HLM model and have a look at their fitting performance. THe GWR model can be calibrated with the following codes.
### GWR model
library(GWmodel)
msl_gwr_data <- multisampling.large$data
coordinates(msl_gwr_data) <- with(multisampling.large, coords[data$group, ])
##### Get optimized bandwidth via golden minimization algorithm.
msl_gwr_bw <- bw.gwr(
formula = y ~ g1 + g2 + x1 + z1, data = msl_gwr_data, approach = "AIC",
adaptive = TRUE, parallel.method = "omp"
)## Take a cup of tea and have a break, it will take a few minutes.
## -----A kind suggestion from GWmodel development group
## Adaptive bandwidth (number of nearest neighbours): 8574 AICc value: 64882.39
## Adaptive bandwidth (number of nearest neighbours): 5307 AICc value: 64454.28
## Adaptive bandwidth (number of nearest neighbours): 3286 AICc value: 64047.9
## Adaptive bandwidth (number of nearest neighbours): 2039 AICc value: 63620.4
## Adaptive bandwidth (number of nearest neighbours): 1266 AICc value: 62999.43
## Adaptive bandwidth (number of nearest neighbours): 790 AICc value: 62173.75
## Adaptive bandwidth (number of nearest neighbours): 494 AICc value: 61359.34
## Adaptive bandwidth (number of nearest neighbours): 313 AICc value: 60434.33
## Adaptive bandwidth (number of nearest neighbours): 199 AICc value: 1071693
## Adaptive bandwidth (number of nearest neighbours): 381 AICc value: 60780.34
## Adaptive bandwidth (number of nearest neighbours): 268 AICc value: 1013623
## Adaptive bandwidth (number of nearest neighbours): 338 AICc value: 60513.49
## Adaptive bandwidth (number of nearest neighbours): 295 AICc value: 60292.35
## Adaptive bandwidth (number of nearest neighbours): 286 AICc value: 60250.93
## Adaptive bandwidth (number of nearest neighbours): 278 AICc value: 1013623
## Adaptive bandwidth (number of nearest neighbours): 288 AICc value: 60255.59
## Adaptive bandwidth (number of nearest neighbours): 281 AICc value: 60239.58
## Adaptive bandwidth (number of nearest neighbours): 282 AICc value: 60239.58
##### Calibrate GWR model with optimized bandwidth.
msl_gwr <- gwr.basic(
formula = y ~ g1 + g2 + x1 + z1, data = msl_gwr_data, bw = msl_gwr_bw,
adaptive = TRUE, parallel.method = "omp"
)
#### Get coefficient estimations.
#### As samples in one group have equal estimations,
#### we use mean value of each group to represent them.
msl_gwr_coef <- cbind(msl_gwr$SDF@data, group = msl_gwr_data$group) %>%
select(Intercept, g1, g2, x1, z1, group) %>%
group_by(group) %>%
summarise_all(mean)
##### Calculate RMSE and MAE of estimations.
msl_gwr_err <- msl_gwr_coef %>%
select(Intercept, g1, g2, z1, x1) %>% {
as.data.frame(rbind(
MAE = map2_dbl(msl_beta, ., ~ mean(abs(.x - .y))),
RMSE = map2_dbl(msl_beta, ., ~ sqrt(mean((.x - .y)^2)))
))
}
msl_gwr_err| Intercept | g1 | g2 | z1 | x1 | |
|---|---|---|---|---|---|
| MAE | 5.27123 | 54.27019 | 206.523 | 0.4871780 | 0.1620288 |
| RMSE | 44.07530 | 478.03679 | 2667.250 | 0.6847527 | 0.1980524 |
And also a HLM model calibrated with the following codes.
library(lme4)
msl_hlm <- lmer(
formula = y ~ g1 + g2 + x1 + (z1 | group),
data = multisampling.large$data
)
msl_hlm_coef <- coef(msl_hlm)$group
colnames(msl_hlm_coef)[which(colnames(msl_hlm_coef) == "(Intercept)")] <- "Intercept"
msl_hlm_err <- msl_hlm_coef %>%
select(Intercept, g1, g2, z1, x1) %>% {
as.data.frame(rbind(
MAE = map2_dbl(msl_beta, ., ~ mean(abs(.x - .y))),
RMSE = map2_dbl(msl_beta, ., ~ sqrt(mean((.x - .y)^2)))
))
}
msl_hlm_err| Intercept | g1 | g2 | z1 | x1 | |
|---|---|---|---|---|---|
| MAE | 0.5403015 | 3.961721 | 3.183228 | 0.2036397 | 0.0102346 |
| RMSE | 0.8043435 | 5.423963 | 4.610203 | 0.2584218 | 0.0102346 |
A bar plot will be helpful here to compare the fitness of these three models.
list(HGWR = msl_hgwr_err, GWR = msl_gwr_err, HLM = msl_hlm_err) %>%
map2_dfr(., names(.), function(x, nx) {
map2_dfr(x, colnames(x), function(i, ni) {
data.frame(Coefficient = ni, Label = rownames(x), Value = i)
}) %>% cbind(Algorithm = nx, .)
}) %>%
ggplot(aes(x = Coefficient, y = Value, group = Algorithm, fill = Algorithm)) +
geom_col(position = "dodge") +
geom_text(aes(label = round(Value, 2), y = pmin(Value, 10), vjust = -0.2),
position = position_dodge(0.9)) +
facet_grid(rows = vars(Label)) +
scale_y_continuous(limits = c(0, 10), oob = scales::squish,
expand = expansion(add = c(0.5, 1))) +
theme_bw()
Note that in this figure we limited the scale of y axis to make the bar for
x1andz1more obvious. The actual numbers are labelled above bars.
We can say that HGWR has best fitness among these three models. It is able to give more precise estimations for local fixed effects. And it can estimate global fixed effects and also random effects as precisely as HLM does, but more precisely than GWR does.
Case Study: Impact Factors of House Price in Wuhan
As the fitness of HGWR has been demonstrated with simulation data, we are thing about applying it in collected data sets. Here we are going to use a case study to show the usage of HGWR on this type of data. The data set is provided in this file.
data(wuhan.hp)
head(wuhan.hp$data)| Price | Floor.High | Floor.Low | Decoration.Fine | PlateTower | Steel | BuildingArea | Fee | d.Commercial | d.GreenLand | d.Water | d.University | d.Kindergarten | d.PrimarySchool | d.MiddleSchool | d.HighSchool | d.SubwayStation | d.Supermarket | d.ShoppingMall | lon | lat | group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9.458528 | 0 | 1 | 0 | 0 | 0 | 4.887639 | 0.1823216 | 0.0741679 | 0.0521697 | 0.0350293 | -3.083466 | -5.237382 | -5.694951 | -2.686764 | -2.677074 | -2.956452 | -5.874491 | -6.23154 | 114.1187 | 30.55188 | 1 |
| 9.355739 | 0 | 1 | 0 | 0 | 0 | 4.990433 | 0.1823216 | 0.0741679 | 0.0521697 | 0.0350293 | -3.083466 | -5.237382 | -5.694951 | -2.686764 | -2.677074 | -2.956452 | -5.874491 | -6.23154 | 114.1187 | 30.55188 | 1 |
| 9.435402 | 0 | 0 | 0 | 0 | 0 | 4.880906 | 0.1823216 | 0.0741679 | 0.0521697 | 0.0350293 | -3.083466 | -5.237382 | -5.694951 | -2.686764 | -2.677074 | -2.956452 | -5.874491 | -6.23154 | 114.1187 | 30.55188 | 1 |
| 9.573524 | 0 | 1 | 0 | 0 | 0 | 4.883862 | 0.1823216 | 0.0741679 | 0.0521697 | 0.0350293 | -3.083466 | -5.237382 | -5.694951 | -2.686764 | -2.677074 | -2.956452 | -5.874491 | -6.23154 | 114.1187 | 30.55188 | 1 |
| 9.657779 | 0 | 1 | 0 | 0 | 0 | 4.364753 | 0.1823216 | 0.0741679 | 0.0521697 | 0.0350293 | -3.083466 | -5.237382 | -5.694951 | -2.686764 | -2.677074 | -2.956452 | -5.874491 | -6.23154 | 114.1187 | 30.55188 | 1 |
| 9.662498 | 0 | 1 | 0 | 0 | 0 | 4.810557 | 0.1823216 | 0.0741679 | 0.0521697 | 0.0350293 | -3.083466 | -5.237382 | -5.694951 | -2.686764 | -2.677074 | -2.956452 | -5.874491 | -6.23154 | 114.1187 | 30.55188 | 1 |
head(wuhan.hp$coords)## X Y
## [1,] 511387.8 3381299
## [2,] 512928.1 3387722
## [3,] 513303.3 3392168
## [4,] 513462.6 3391487
## [5,] 513777.9 3388737
## [6,] 513813.1 3392252
This data set, collected in 2018, has 19599 properties located in 776 communities. In each observation, there are fields of house properties, locational and neighbourhood variables. In details, they are distances to nearby educational resources (kindergartens, primary schools, middle schools, high schools, universities), commercial areas (business districts, shopping malls and supermarkets), transportation facilities (metro stations and bus stations), rivers, lakes, and green lands. The following figure shows the distribution of these properties, and the table gives information about all available variables.
Map of second-hand properties
| Variables | Value | Level | Type | Logarithm |
|---|---|---|---|---|
| Price | House price per square metre. | Sample | Dependent | Yes |
| Floor.High | 1 if a property is on a high floor, otherwise 0. | Sample | Structural | |
| Floor.Low | 1 if a property is on a low floor, otherwise 0. | Sample | Structural | |
| Decoration.Fine | 1 if a property is well decorated, otherwise 0. | Sample | Structural | |
| PlateTower | 1 if a property is of the plate-tower type, otherwise 0. | Sample | Structural | |
| Steel | 1 if a property is of ‘steel’ structure, otherwise 0. | Sample | Structural | |
| BuildingArea | Building area in square metres. | Sample | Structural | Yes |
| Fee | Management fee per square meter per month. | Group | Neighbourhood | Yes |
| d.Commercial | Distance to the nearest commercial area. | Group | Locational | Yes |
| d.Greenland | Distance to the nearest green land. | Group | Locational | Yes |
| d.Water | Distance to the nearest river or lake. | Group | Locational | Yes |
| d.University | Distance to the nearest university. | Group | Locational | Yes |
| d.HighSchool | Distance to the nearest high school. | Group | Locational | Yes |
| d.MiddleSchool | Distance to the nearest middle school. | Group | Locational | Yes |
| d.PrimarySchool | Distance to the nearest primary school. | Group | Locational | Yes |
| d.Kindergarten | Distance to the nearest kindergarten. | Group | Locational | Yes |
| d.SubwayStation | Distance to the nearest subway station. | Group | Locational | Yes |
| d.Supermarket | Distance to the nearest supermarket. | Group | Locational | Yes |
| d.ShoppingMall | Distance to the nearest shopping mall. | Group | Locational | Yes |
A variable selection process (Hu et al. 2022) suggests us to use the following variables to build a HGWR model:
- Fee
- d.Water
- d.Commercial
- d.PrimarySchool
- d.Kindergarten
- BuildingArea
- Floor.High
But we are going to append a variable d.ShoppingMall which is usually concerned by customers and researchers. And regards the locational and Fee as local fixed effects, the BuildingArea as global fixed effects and Floor.High as random effects. Then we can calibrate a HGWR model with the following codes.
whhp_hgwr <- hgwr(
formula = Price ~ d.Water + d.Commercial + d.PrimarySchool + d.ShoppingMall +
d.Kindergarten + Fee + BuildingArea + (Floor.High | group),
data = wuhan.hp$data,
local.fixed = c("d.Water", "d.Commercial", "d.PrimarySchool",
"d.ShoppingMall", "d.Kindergarten", "Fee"),
coords = wuhan.hp$coords,
bw = 50,
kernel = "bisquared",
eps_gradient = 1e-16,
eps_iter = 1e-16
)
whhp_hgwr## Hierarchical and geographically weighted regression model
## =========================================================
## Formula: Price ~ d.Water + d.Commercial + d.PrimarySchool + d.ShoppingMall + d.Kindergarten + Fee + BuildingArea + (Floor.High | group)
## Method: Back-fitting and Maximum likelihood
## Data: wuhan.hp$data
##
## Global Fixed Effects
## -------------------
## Intercept BuildingArea
## 10.271682 -0.019036
##
## Local Fixed Effects
## -------------------
## Coefficient Min 1st Quartile Median 3rd Quartile Max
## Intercept -5.655311 -0.753470 -0.335763 -0.007905 2.481812
## d.Water -29.363638 -7.449347 -1.706057 2.879884 16.308855
## d.Commercial -56.248990 -7.554371 -2.721221 2.859166 90.103118
## d.PrimarySchool -0.215276 -0.048748 -0.025535 0.004450 0.203624
## d.ShoppingMall -0.204345 -0.039703 -0.006851 0.026344 0.092649
## d.Kindergarten -1.094823 -0.032577 0.003839 0.017043 0.669662
## Fee 0.000741 0.099061 0.142559 0.187511 0.381820
##
## Random Effects
## --------------
## Groups Name Std.Dev. Corr
## group Intercept 0.081694
## Floor.High 0.081694 0.000000
## Residual 0.081694
##
## Other Information
## -----------------
## Number of Obs: 19599
## Groups: group , 776
Coefficient Visualization
We can also make some maps to visualize the estimated coefficients. Boundary data of Wuhan is provided. It is in GeoJSON format and can be loaded by package rgdal. For coefficients, we can simply combine it with coordinates of houses. Because coefficients, coordinates and groups can be one-to-one matched.
wuhan <- rgdal::readOGR("data/wuhan.geojson")## OGR data source with driver: GeoJSON
## Source: "C:\Users\rd21411\Develop\hlmgwr-backfitting-ml-doc\data\wuhan.geojson", layer: "Wuhan"
## with 13 features
## It has 6 fields
wuhp_coef_sf <- cbind(coef(whhp_hgwr), as.data.frame(wuhan.hp$coords)) %>%
st_as_sf(coords = c("X", "Y"), crs = 4547)
wh_basemap <- tm_shape(wuhan) + tm_polygons(col = "white")Then visualize coefficients estimations with the basemap.
with(whhp_hgwr$effects, c(local.fixed, random, global.fixed, "Intercept")) %>%
map(function(var) {
wh_basemap +
tm_shape(wuhp_coef_sf, is.master = T) +
tm_dots(col = var, size = 0.1, midpoint = 0,
palette = "-RdBu", legend.col.reverse = T)
}) %>%
tmap_arrange(ncol = 3)
From this figure, we can see that as BuildingArea is regarded as global fixed effects, for all samples there is only one estimations. As Floor.High is regarded as random effects, each group has a unique estimation but no spatial relationship can be seen. For local fixed effects, their estimations seems to be locally related. This is suggested by the first law of geography. Besides, spatial heterogeneity is also obvious in their estimations.
Residual Analysis
Via the standard function residuals(), we have the
access to residuals estimated by this model. Then we can visualize them
by combining the summary statistics of residuals with coordinates. For
example, here we created a map showing mean residual of each group
together with standard deviation.
whhp_hgwr_res <- data.frame(
residuals = residuals(whhp_hgwr),
group = wuhan.hp$data$group
) %>%
group_by(group) %>%
summarise(res.abs.mean = mean(abs(residuals)),
res.sd = sd(residuals)) %>%
cbind(wuhan.hp$coords) %>%
st_as_sf(coords = c("X", "Y"), crs = 4547)
wh_basemap +
tm_shape(whhp_hgwr_res, is.master = T) +
tm_dots(col = "res.abs.mean", size = "res.sd", midpoint = 0,
palette = "-RdBu", legend.col.reverse = T)
And we can also convery a global Moran test on the residuals to to find well estimated points.
library(spdep)
whhp_hgwr_res_listw <- st_coordinates(whhp_hgwr_res) %>%
knearneigh(k = 20) %>% knn2nb() %>% nb2listw() %>% listw2U()
moran.test(whhp_hgwr_res$res.abs.mean, whhp_hgwr_res_listw)##
## Moran I test under randomisation
##
## data: whhp_hgwr_res$res.abs.mean
## weights: whhp_hgwr_res_listw
##
## Moran I statistic standard deviate = 6.4658, p-value = 5.038e-11
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.066815400 -0.001290323 0.000110949
From the p-value and Moran’s I value, we can find that although it is suggested to reject the null hypothesis, the global spatial autocorrelation is too weak to take any effects. Additionally, a local Moran test is also useful.
whhp_hgwr_res_localmoran <- localmoran(whhp_hgwr_res$res.abs.mean, whhp_hgwr_res_listw)
wh_basemap +
tm_shape(cbind(whhp_hgwr_res, whhp_hgwr_res_localmoran), is.master = T) +
tm_dots(col = "Ii", size = 0.2, palette = "-RdBu", style = "quantile",
midpoint = 0, title = "Local Moran")
whhp_hgwr_res$res_c <- scale(whhp_hgwr_res$res.abs.mean, scale = F)
whhp_hgwr_res$lmoran_c <- scale(whhp_hgwr_res_localmoran[,1], scale = F)
whhp_hgwr_res$quadrant <- with(whhp_hgwr_res, {
quadrant <- integer(length(res_c))
quadrant[res_c < 0 & lmoran_c < 0] <- 1
quadrant[res_c < 0 & lmoran_c > 0] <- 2
quadrant[res_c > 0 & lmoran_c < 0] <- 3
quadrant[res_c > 0 & lmoran_c > 0] <- 4
quadrant[whhp_hgwr_res_localmoran[,5] > 0.1] <- 0
as.factor(quadrant)
})
LISA_palette <- list(
"Insignificant" = "green", "Low-Low" = "darkblue", "Low-High" = "blue",
"High-Low" = "red", "High-High" = "darkred"
)
tm_layout(aes.palette = "cat") +
wh_basemap +
tm_shape(whhp_hgwr_res, is.master = T) +
tm_dots(col = "quadrant", size = 0.2,
palette = as.vector(unlist(LISA_palette)),
labels = names(LISA_palette))

It can be seen that most residuals are not spatially clustered. Most of those showing spatial clusters locate near the boundary of this study area.