(he/him) · Ph.D Student at Arizona State University in Geography
As mentioned in the final blog post, these two notebooks showcase the user-oriented designs I created during GSoC 2022.
formula_example.ipynb
: a Jupyter notebook demonstrating the syntax and usage of the spatial regression formula syntax in pysal/spreg
.sklearn_example.ipynb
: a Jupyter notebook demonstrating the usage of the scikit-learn
style interface to pysal/spreg
.tdhoffman/spreg
, api-dev
branch: this branch of my forked copy of spreg
contains all the deliverables I created over the summer. I did some experimental work in other repositories (see the blog and links at the end of this page), but the vast majority of the work was focused on creating different interfaces for spreg
.Posted 10 September 2022
This summer, I implemented two new ways of interfacing with the spreg
module of the Python Spatial Analysis Library. In the first half of the summer, I created a Wilkinson formula interface available through the spreg.from_formula()
function. Wilkinson formulas are used extensively in R and Python packages like statsmodels
that are oriented towards quantitative social scientists and econometricians. These libraries emphasize the inference side of quantitative analyses and value interpretability and expressibility of models. Formulas offer a natural path for achieving these goals: by writing in a mathematical syntax, users can easily fit a model and interpret the results. It’s a clear, direct way of doing statistical modeling that requires very little code on the user end. I extended the Wilkinson formula syntax to admit spatial regression components (spatial lags and spatial errors) and wrapped the formula parser from formulaic
in a spatial preprocessor (that handles the spatial components) and dispatcher (that sends the constructed model matrices to the modeling classes in spreg
). As a result of my work, users can now fit nonspatial and spatial regressions using a PySAL function. This contribution improves the accessibility and ease of use of spreg
’s modeling classes. To see how they work, read the blog entries below or check out the formula_example.ipynb
Jupyter notebook linked above.
In the second half of the summer, I designed an alternative way of interfacing with the modeling classes in spreg
. In a submodule called spreg.sklearn
, I translated the spatial regression modeling classes from spreg
into scikit-learn
style estimators. scikit-learn
is the dominant Python library for statistical and machine learning (anything short of deep learning) and is widely used in the data science community. It’s focused primarily on prediction, which makes it a great complement to the formula interface implemented in the first half of the summer. By conforming spreg
’s modeling classes to scikit-learn
’s style, the models can now be used in scikit-learn
workflows. This construction makes spatial models available to a wide community of modelers who previously may not have included spatial models in their data science workflow. Once merged with the main branch of spreg
, it may be useful to reach out to the scikit-learn
development team or others in the community to raise awareness about this connection and begin promoting the use of spatial models in the data science community. Finally, to preserve backward-compatibility, the spreg.sklearn
submodule is not imported by default when importing spreg
– it must be directly imported (e.g., import spreg.sklearn
) so as not to pollute the namespace.
I am immensely thankful for the advice and guidance of my mentors throughout this project. Their encyclopedic knowledge of the design and structure of PySAL proved invaluable time and time again while I was building these interfaces. Our discussions about design philosophy and code organization were illuminating for me: I’ve used open source code for years, but prior to this summer had not spent time thinking about its construction and development. Gaining insights into how the library and its modules were structured has made me introspect about the usage of code as an expression of thought. I’m eager to build on these insights in my dissertation and to continue working with the PySAL development team to expand the library!
scikit-learn
StructurePosted 11 August 2022
Unfortunately, the syntax for the regimes models proved to be too difficult to set up. My mentors and I initially thought that formulaic
’s pipe operator would be sufficient to create the groupings needed for a multilevel regimes model, but it turns out that the pipe in formulaic
works differently. It’s designed for creating multiple models, like for a two-stage least squares. This finding makes it significantly more difficult to implement the regimes, and it’s not as high of a priority at this point in the project. The pull request for the formula implementation has been submitted and regimes can come at some point in the future.
After receiving confirmation that the spreg
code must remain backwards-compatible, my mentors and I have shifted to designing a new package with a sleeker API which could exist in parallel. To do this, we’ve elected to revisit the scikit-learn
design pattern and create spatial models which plug directly into it. As a result, the new package doesn’t need the ordinary least squares class (already implemented in sklearn.linear_model
), so I’ve gotten to work rewriting the spatial lag and spatial error model classes in the scikit-learn
design. They’ll each use the appropriate mixins and base classes for linear models, and should serve to plug and play directly with existing scikit-learn
pipelines.
Posted 4 August 2022
The spreg.from_formula()
function has been steadily getting closer and closer to library-grade code. Per the spreg
Feature Enhancement Proposal instructions, I’ve added doc tests, unit tests, and a Jupyter notebook demonstrating the usage and verifying the code’s correctness. I’ve also polished up the code a lot, adding a few more features:
I’m now working on developing formula syntax for the regimes models that exist in spreg
. Hopefully, I’ll be able to have that ready by tomorrow’s PySAL developer meeting so I can include it in my presentation. If not, then it will be something I can easily develop in the future now that the base code is prepared. Tomorrow I’ll initiate the bigger review process to get this code added to PySAL!
Posted 27 July 2022
This week I was primarily occupied by moving to a new apartment, but I did manage to handle some small issues with the formula parser (see the commits in tdhoffman/spreg
). The big news this week is in splitting the scope of this project: in the next week or so, I will move forward with polishing and testing the formula parser for spreg
as it exists. I’ll submit that code for formal review and aim to get it accepted in the repository by September.
Separately, I’ll create a new package (name TBD, will add to Ongoing Links) that will serve as a reimagining of the spreg
API. My goal with this package is to cleanly reexpress the functionality of spreg
in a user-oriented way. Ideally, by the end of the second half of the summer I’ll have a full-fledged novel implementation of global spatial regression models, paving the way for its inclusion in PySAL and broader rethinking of the spatial model class structures.
from_formula
Posted 18 July 2022
Progress on the spreg.from_formula()
function has been steady. I’ve built a parser that accepts a new syntax for spatial lag and spatial error models. This syntax comes in two parts:
<...>
operator:
<FIELD>
adds spatial lag of that field from df to model matrix&
symbol:
&
is not a reserved char in formulaic
, so there are no conflicts.<...>
and &
: <FIELD + ... + FIELD> + &
is the most general possible spatial model available.Importantly, all terms and operators MUST be space delimited in order for the parser to properly pick up on the tokens. The current design also requires the user to have constructed a weights matrix first, which I think makes sense as the weights functionality is well-documented and external to the actual running of the model.
The ordinary least squares and spatial lag classes have been converted to the new GenericModel
template, but have not been tested and almost certainly have some errors. The spatial error class is partway converted and also needs to be tested. Once these are completed, I will convert the combo models (those accepting a mixture of lag and error terms) and build out the dispatcher in spreg.from_formula
to select the correct combo class.
Posted 7 July 2022
While conceptualizing this project, the twin streams of formula implementation and library standardization seemed quite separate to me. Naturally, a solid base class spec would lead to easier implementations of formulas, but I didn’t see them as being much more related than that. This week, however, the commonalities between these two objectives merged.
First, I began designing a generic base class for PySAL models. I based my design off of the novel framework developed in Guo et al. (2022), which unifies a variety of spatial models under one paradigm (see Table 1 and Figure 1 for more information). The structure is: data, model, objective function, optimization method, and output. Such a structure broadly characterizes nearly every model in PySAL, and would offer a sleek way to standardize the user-facing model classes. Drafts of this idea can be found in my pysal_base
repository; in the coming weeks I will be building on these ideas to create minimal working examples to showcase to the PySAL developers at large.
In parallel, I created a demo of the formulaic
library (called formula_test.py
) in my forked copy of spreg
for implementing spatial lag of X models. This demo showcases how to create a model matrix for spatially lagged covariates using native features of the formulaic
library. All that remains to implement spatial lag and spatial error models is to add preprocessing of formula strings that recognizes when the dependent variable is being lagged and when a spatial error term is being added. These will be implemented for next week.
Together, these two streams are starting to converge to form a proposal for PySAL’s next-gen model API. Ideally, I’ll equip spreg
with a from_formula
function that generates and runs models similar to R’s lm
function simply given a formula and a dataframe. This external-facing method will connect nicely to updated versions of the modeling classes, which will conform to the new base template.
References
Guo, H., Python, A., and Liu, Y. (2022) “A generalized regionalization framework for geographical modelling and its application in spatial regression.” arXiv:2206.09429.
Posted 1 July 2022
This week, I accomplished my two goals of implementing scikit-learn
-compatible versions of Moran_Local
and GM_Lag
and starting discussions about converting PySAL’s style. In doing this, I got a better picture of the potential advantages and disadvantages of the paradigm.
One issue I ran into while building out the demo classes was the location of the spatial weights matrix in a scikit-learn
style estimator. Putting the matrix in the __init__
method of a class implies that the matrix is a parameter rather than data, whereas putting the matrix in the fit
method of a class implies that it is data rather than a parameter. Each choice has implications on the format and usage of the class. Personally, I prefer placing the spatial weights matrix in the __init__
method as that choice encourages users to tune hyperparameters of the estimator for each spatial domain they are handling. However, this could lead to memory issues if users need to analyze many spatial domains. This dilemma was also noticed by Martin Fleischmann in the esda
issue thread.
In spreg
, Eli Knaap and Luc Anselin discussed how PySAL and scikit-learn
have generally different objectives in their modeling frameworks: the former is typically focused on inference and interpretation, while the latter is centered around prediction. These different use cases affect the design patterns of the two libraries, and as a result it may not make sense to mimic the design pattern from scikit-learn
. Of course, Wilkinson formulas were viewed as being useful to users on this thread as they reflect how social scientists think about models, thus making the library more accessible to a wider base of users.
For these reasons, I think it may be more useful to design a set of PySAL-specific base classes. While scikit-learn
and multiple dispatch might not be great fits for the library, it still needs consistent standards (whether or not they conform to external rules). Creating a PySAL-specific framework would have several key advantages, like in-house maintainability and customizability. This week, my goals are to begin prototyping Wilkinson formulas (as they are universally popular) and to think of design constraints for a set of PySAL-specific base classes.
Posted 26 June 2022
At our first meeting, my mentors and I discussed the options we have for implementing new interfaces to PySAL’s model and exploratory statistic classes. We discussed using multiple dispatch or ducktyping from scikit-learn
and chose to begin with ducktyping scikit-learn
as multiple dispatch has been deemed non-Pythonic. While developing functionality for Wilkinson formulas is an ancillary goal of this project, we are keeping it in mind as we make interface edits as those structural changes affect the formula implementation.
My first two goals are as follows:
Moran_Local
) and a supervised statistic (GM_Lag
) and implement them using the scikit-learn
paradigm. This will give us two concrete, minimal working examples that we can share to package leads.scikit-learn
interface and to extend it to Wilkinson formulas.Package | Requirements for ducktyping scikit-learn |
Requirements for extending to Wilkinson formulas |
---|---|---|
spreg |
||
gwr |
||
esda |
||
spopt |
||
weights |
||
spglm |
||
… |
Finally, we aim to include automatic attribution tools to all the new interfaces so that users may easily generate the proper citations for the code they are using.
pysal_base
, a repository of draft base classes for PySAL.tdhoffman/spreg
, my forked copy of spreg
. This is a testing ground for the spreg
formula dispatcher and new model APIs.tdhoffman/esda
, my forked copy of esda
. This is a testing ground for base classes in descriptive or unsupervised settings.