An OOP approach to hierarchically indexed data tables

This notebook defines the schema and relations that will be implemented to support workflows that project global data sets onto a hierarchical index and manipulate different layers together. The approach mainly aims to address the following challenges:

  • User friendly : ideally have a syntax familiar to pandas users

  • Scalable : be able to work with national, continental and even global data sets of fine spatial resolution (e.g. 100 m or 1000 m grids)

  • Flexible : provide an easily exportable data type that can be used across different types of processes and analyses.

The main tools chosen to do this are :

  • Using H3 : https://uber.github.io/h3-py/

  • Using DuckDB : https://duckdb.org/docs/guides/python/execute_sql

  • Using ibis : https://ibis-project.org

## Base workflow

One of the main contributions of this workflow is a method to efficiently project raster grids of any size into a format that is easier to work with and to combine with other data types through geometric operations.

Minimal input

[ polars.DataFrame, pyarrow.Table, pandas.DataFrame, geopandas.GeoDataFrame(not recommended), ibis.table ]

With unambiguous columns for the coordinates, e.g. ('x', 'y') or ('lon', 'lat'), and one column per variable named after its band: {band}_var for numeric values and {band}_cat for categorical values.

| lon | lat | {band}_var |
| --- | --- | --- |
| float | float | [float, int] |

| lon | lat | {band}_cat |
| --- | --- | --- |
| float | float | str |
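The column conventions above can be checked before any projection takes place. The following is a minimal sketch, not part of the planned implementation: the function name `classify_columns` and the list of accepted coordinate pairs are assumptions made for illustration (the text only gives ('x', 'y') and ('lon', 'lat') as examples). It works on plain column names, so it applies equally to any of the input table types.

```python
from typing import Dict, Iterable, List

# Assumed coordinate aliases; extend as needed.
COORD_PAIRS = [("lon", "lat"), ("x", "y")]


def classify_columns(columns: Iterable[str]) -> Dict[str, List[str]]:
    """Split column names into coordinates, numeric and categorical bands."""
    cols = list(columns)
    # Keep every coordinate pair that is fully present in the table.
    found = [pair for pair in COORD_PAIRS if all(c in cols for c in pair)]
    if len(found) != 1:
        # Zero pairs means no coordinates; more than one is ambiguous.
        raise ValueError(f"expected exactly one coordinate pair, found {found}")
    return {
        "coords": list(found[0]),
        "numeric": [c for c in cols if c.endswith("_var")],
        "categorical": [c for c in cols if c.endswith("_cat")],
    }


schema = classify_columns(["lon", "lat", "elevation_var", "landcover_cat"])
print(schema)
```

With pandas, for instance, the same check could be run as `classify_columns(df.columns)`.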

Minimal output

The data is projected onto the h3 grid at a resolution that is referred to later on as the native resolution. It should be the highest resolution that makes sense for a given data layer. For point data, for example, any resolution is possible, and the choice will most likely depend on the type of processing applied to it. For polygons, the resolution should be high enough to give a good description of the original shape, but coarse enough to be efficient to work with. Once again, this is context specific and will most likely depend on the type of analysis.

| lon | lat | {band}_cat | h3_id |
| --- | --- | --- | --- |
| float | float | str | [str, int] |

| lon | lat | {band}_var | h3_id |
| --- | --- | --- | --- |
| float | float | [float, int] | [str, int] |

A mix of the two previous types of variables, numeric and categorical, can be present as long as the coordinate and h3_id columns are unambiguously identified.
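As an illustration of the step from minimal input to minimal output, the sketch below adds an `h3_id` column to a list of row dictionaries. It is a row-wise toy version under stated assumptions, not the scalable method this workflow aims for: the names `add_h3_id` and `h3_cell_fn` are hypothetical, and the cell function is passed in as an argument so that the logic can be read and run without h3-py installed. The helper relies on h3-py's `latlng_to_cell` (version 4) or `geo_to_h3` (version 3), both of which take latitude first.

```python
from typing import Callable, Dict, List

# Signature shared by h3-py's point-to-cell functions: (lat, lng, res) -> cell id
CellFn = Callable[[float, float, int], str]


def h3_cell_fn() -> CellFn:
    """Return h3-py's point-to-cell function (v4 name, with a v3 fallback)."""
    import h3  # imported lazily so the sketch loads without h3-py installed

    return getattr(h3, "latlng_to_cell", None) or h3.geo_to_h3


def add_h3_id(
    rows: List[Dict],
    res: int,
    cell_fn: CellFn,
    lon_col: str = "lon",
    lat_col: str = "lat",
) -> List[Dict]:
    """Return copies of `rows` with an `h3_id` column at resolution `res`."""
    return [
        # Note the argument order: latitude first, then longitude.
        {**row, "h3_id": cell_fn(row[lat_col], row[lon_col], res)}
        for row in rows
    ]


rows = [{"lon": 2.35, "lat": 48.85, "temp_var": 11.5, "cover_cat": "urban"}]
# With h3-py installed:
# projected = add_h3_id(rows, res=9, cell_fn=h3_cell_fn())
```

The other columns are carried over untouched, so numeric and categorical bands can be mixed freely in the same table.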

The object that holds the data has additional attributes and methods, described next:

Class

Attributes

  • base_res : base resolution, the resolution at which the data was originally projected onto the h3 grid

  • current_res : current resolution, the resolution the data is at after any processing.

Methods

  • set_res(res) : int(4,18), set the resolution of the data to a specified one.

  • change_res(level) : int, change the resolution of the data by the value provided. The resulting resolution will be equal to current_res+level

  • add_layer()
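The resolution bookkeeping described by these attributes and methods can be sketched as follows. This is only an outline under assumptions: the class name `H3Layer` is hypothetical, the data itself is left out, and no re-aggregation of cells is performed when the resolution changes. The bounds default to H3's documented range of 0 to 15, which is an assumption here; the `int(4,18)` note on `set_res` may intend different limits, and they can be passed explicitly. `add_layer()` is omitted because its behaviour is not specified yet.

```python
class H3Layer:
    """Hypothetical container tracking the resolution state of a data layer."""

    def __init__(self, base_res: int, min_res: int = 0, max_res: int = 15):
        # Defaults follow H3's own resolution range (0 to 15); assumption.
        self.min_res = min_res
        self.max_res = max_res
        self._check(base_res)
        self.base_res = base_res      # resolution of the original projection
        self.current_res = base_res   # resolution after any processing

    def _check(self, res: int) -> None:
        if not isinstance(res, int) or not self.min_res <= res <= self.max_res:
            raise ValueError(
                f"resolution must be an int in [{self.min_res}, {self.max_res}], got {res!r}"
            )

    def set_res(self, res: int) -> None:
        """Set the current resolution to an absolute value."""
        self._check(res)
        self.current_res = res

    def change_res(self, level: int) -> None:
        """Shift the current resolution by `level` (current_res + level)."""
        self.set_res(self.current_res + level)


layer = H3Layer(base_res=9)
layer.change_res(-2)
print(layer.base_res, layer.current_res)  # prints: 9 7
```

Routing `change_res` through `set_res` keeps the bounds check in one place, and `base_res` is never modified after construction, so the native resolution stays available for reference.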