OpenStreetMap provides a rich data set for working with a geospatial database. It consists of several geometry types and attaches additional metadata to each individual object. This repository provides a set of methods to construct a Spatio-Temporal Knowledge Graph from an .osm.pbf file and write its structure to a Delta Table.
OpenStreetMap distributes its data as .osm.pbf files. The format is an alternative to XML and is more compact: according to the OpenStreetMap wiki, a PBF planet file is about half the size of a gzipped XML planet file and about 30% smaller than a bzip2-compressed one. OpenStreetMap divides its geometry features into three different elements:
| OpenStreetMap features | Vector Geometries |
|---|---|
| Node | Point |
| Way | LineString, Polygon |
| Relation | MultiLineString, MultiPolygon, MultiPoint, GeometryCollection |
In addition to the geometry, OpenStreetMap uses map features (tags) to attach further information to each individual object. Tags are organized as key-value pairs: the key generally provides the main category, and the value specifies that category more precisely, for instance amenity=restaurant. The commonly used tags are listed on the following web page: https://wiki.openstreetmap.org/wiki/Map_features.
The code in this repository was developed with Python 3.10.13, Spark 3.5.0 and Sedona 1.5.1, using their respective Python bindings. The scripts can generally run on a local laptop, and because Apache Sedona and Apache Spark are involved, parts of the repository can also run on compute clusters. For the Planet file, a laptop is not recommended, as the memory footprint may be too large.
To install all dependencies, an environment.yml file is provided that creates the Conda environment on which the code is based. Run the following command to install all Conda packages:
conda env create -f environment.yml
The scripts share a set of configuration variables that are used across this repository. They are defined in the constants.py script.
| Variable names | Description |
|---|---|
| helper_path_directory | Path where the different helper scripts and files can be found |
| osm_data_path | Folder path where the OpenStreetMap .osm.pbf files are stored |
| osm_parquet_path | Folder path where the OpenStreetMap parquet files are stored |
| osm_start_date | Start date of the Overpass API retrieval |
| osm_end_date | End date of the Overpass API retrieval |
| osm_area | Area used for the Overpass API and potential clipping for the created h3 grid |
| osm_clipping | Boolean value to perform area/geometry clipping for OpenStreetMap geometries |
| ogr_temporary | Folder to store temporary files that are produced by the transformation between .osm.pbf files and GeoParquet |
| cpu_cors | Number of CPU cores used to parallelize the coding snippets |
| spark_master | Spark master for local deployment with a defined number of cores |
| spark_temp_directory | Spark temporary directory where Shuffle Writes are performed and stored |
| spark_driver_memory | Memory size for the Spark Driver that is used for an individual Spark session |
| spark_executor_memory | Memory size for the Spark Executor that is used for an individual Spark session |
| kg_output_path | Output path for the constructed Knowledge Graph |
| grid_clipping | Boolean value to perform area/geometry clipping for base World map geometries |
| grid_level | Level (resolution) of the h3 grid that is used for the grid construction. The grid cell size associated with each level is listed in the h3 documentation |
| grid_compaction | Usage of the grid compaction algorithm implemented by the h3 API |
| grid_parquet_path | Output path of the constructed h3 grid |
| geo_hash_level | Geohash level, which controls the resolution of the constructed geohashes |
| sedona_packages | List of compatible Apache Sedona packages that are used together with Delta Table jar files |
| geometry_file_path | Folder path where the unified OpenStreetMap parquet files are stored |
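
To make the table above more concrete, the following is a minimal sketch of how a few of these variables might look in constants.py and how the Spark-related ones could map onto Spark configuration keys. All values are illustrative assumptions and not the repository's defaults, and the helper `spark_conf` is hypothetical; only the Spark property names (`spark.master`, `spark.local.dir`, `spark.driver.memory`, `spark.executor.memory`) are standard Spark settings.

```python
# Illustrative values only; the real constants.py may use different ones.
osm_data_path = "data/osm"            # folder with .osm.pbf files (assumed path)
osm_parquet_path = "data/parquet"     # folder with converted parquet files (assumed path)
kg_output_path = "data/kg"            # output folder of the Knowledge Graph (assumed path)

cpu_cors = 4                          # variable name as used in the table above
spark_master = f"local[{cpu_cors}]"   # local deployment with a fixed number of cores
spark_temp_directory = "/tmp/spark"   # scratch space for shuffle writes (assumed path)
spark_driver_memory = "8g"
spark_executor_memory = "8g"

grid_level = 6                        # h3 resolution (assumed value)
geo_hash_level = 6                    # geohash precision (assumed value)


def spark_conf():
    """Hypothetical helper: map the constants onto standard Spark property names."""
    return {
        "spark.master": spark_master,
        "spark.local.dir": spark_temp_directory,
        "spark.driver.memory": spark_driver_memory,
        "spark.executor.memory": spark_executor_memory,
    }
```

A dictionary like this could be passed key by key to a Spark session builder; how the repository actually wires these values is not shown here.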
For the OSM data in this repository, we base our code on the .osm.pbf files provided by Geofabrik. As a basis for the downloads used in our work, we have provided a file that outlines the individual .osm.pbf file links we have downloaded. In addition, the script osm_api_retrieval.py retrieves OpenStreetMap data via the Overpass API and stores the API response in GeoParquet files.
For the data preparation pipeline, the source is an .osm.pbf file that is converted into a Delta Table, which represents a Spatio-Temporal Knowledge Graph. In the following picture, the complete pipeline for data preparation is outlined.
For OpenStreetMap, we base our analysis on .osm.pbf files. The script osm_parquet_transform.py transforms .osm.pbf files to GeoParquet. This allows us to interact with OpenStreetMap data directly from Spark, a distributed computation engine, to construct our Knowledge Graph. The transformation is based on the ogr2ogr tool from the GDAL library, which transforms an OpenStreetMap file into five separate files, each representing a separate layer constructed by GDAL. The conversion is configured in the file osmconf.ini, which can be adapted to the needs of the user. A change in the configuration file might require changes to the subsequent scripts. In the current implementation, the files are processed sequentially, so the available hardware resources are not fully utilized.
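The conversion step can be sketched as follows. This is not the code of osm_parquet_transform.py; it only builds one ogr2ogr command per layer, assuming the five layer names exposed by the GDAL OSM driver, the `Parquet` output driver of a recent GDAL build, and the driver's `CONFIG_FILE` open option. The function name, the output file naming scheme and the example paths are hypothetical.

```python
from pathlib import Path

# The five layers the GDAL OSM driver exposes for an .osm.pbf file.
OSM_LAYERS = ("points", "lines", "multilinestrings", "multipolygons", "other_relations")


def build_ogr2ogr_commands(pbf_file, output_dir, osmconf="osmconf.ini"):
    """Return one ogr2ogr argument list per OSM layer (hypothetical helper)."""
    pbf = Path(pbf_file)
    stem = pbf.name.removesuffix(".osm.pbf")
    commands = []
    for layer in OSM_LAYERS:
        # Assumed naming scheme: <input name>_<layer>.parquet
        target = Path(output_dir) / f"{stem}_{layer}.parquet"
        commands.append([
            "ogr2ogr",
            "-f", "Parquet",                  # requires GDAL built with Parquet support
            "-oo", f"CONFIG_FILE={osmconf}",  # OSM driver open option
            str(target),                      # destination dataset
            str(pbf),                         # source dataset
            layer,                            # layer to export
        ])
    return commands
```

Each argument list could be executed with `subprocess.run(cmd, check=True)`. Because the five commands are independent of each other, they would also be natural candidates for parallel execution.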
The part files created by the conversion script are unified by the Python script geoparquet_sedona.py. In addition to unifying the GeoParquet files, we calculate the geohash for each individual geometry; the geohash serves as additional metadata for the respective GeoParquet files.
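To illustrate what a geohash is, the following pure-Python sketch encodes a single latitude/longitude pair with the standard geohash algorithm. It is an illustration only: the repository computes geohashes with Apache Sedona, and how it derives a geohash for non-point geometries is not shown here. The function name is hypothetical, and the `precision` argument is assumed to correspond to what the configuration calls geo_hash_level.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=6):
    """Encode a WGS84 coordinate as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits form one base32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)
```

Each additional character narrows the cell, so nearby geometries share a common prefix. This is why a geohash column can be useful for sorting or partitioning spatial data.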
For the creation of our Knowledge Graph, we align the OpenStreetMap geometries using the h3 discrete global grid (DGG) created by Uber. Each grid cell is represented as a regular hexagon, except for 12 grid cells per level, which are pentagons. To generate the grid cells, we use a base world map, which can be found under the following folder in different data formats (Shapefiles, GeoParquet or GeoJSON). We fill each world geometry with grid cells according to the input parameters used in this repository. Similar to the OpenStreetMap data, we store the generated grid as GeoParquet together with geohashes.
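When choosing a grid level, it helps to know how quickly the number of cells grows. The sketch below uses the published h3 cell counts: 122 base cells at level 0, roughly a sevenfold increase per level, and 12 pentagons at every level, which gives 2 + 120 * 7^level cells worldwide. The function names are hypothetical, and the numbers refer to the whole globe, not to a clipped area.

```python
def h3_cell_count(level):
    """Total number of h3 cells covering the globe at a given level (0 to 15)."""
    if not 0 <= level <= 15:
        raise ValueError("h3 levels range from 0 to 15")
    return 2 + 120 * 7 ** level


def h3_hexagon_count(level):
    """Number of hexagonal cells; the remaining 12 cells per level are pentagons."""
    return h3_cell_count(level) - 12


if __name__ == "__main__":
    for level in range(0, 7):
        print(level, h3_cell_count(level))
```

Since each level multiplies the cell count by about seven, a higher level on a large area increases memory use and the size of the stored grid accordingly.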
As a computational engine, we use Apache Sedona to scale the Knowledge Graph creation process. The creation is divided into several parts. The first part transforms the tag structure of OpenStreetMap into a triple structure. If a tag is marked as a commonly used tag, we transform it into a subclass relation for the Knowledge Graph. For OpenStreetMap tags that are not commonly used, we use the OpenStreetMap ID as the subject, the tag key as the predicate, and the tag value as the object. The OpenStreetMap geometries and the h3 grid cell geometries are also added to the constructed Knowledge Graph and stored in the WKT format.
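The tag-to-triple step can be sketched in plain Python as follows. This is an illustration under assumptions, not the repository's Sedona implementation: the set of common keys is a small hypothetical subset, the predicate names `rdf:type` and `rdfs:subClassOf` are assumed, and so is the choice to model a common tag as "entity is of type value, value is a subclass of key".

```python
# Hypothetical subset of commonly used tag keys.
COMMON_TAG_KEYS = {"amenity", "building", "highway"}


def tags_to_triples(osm_id, tags, common_keys=COMMON_TAG_KEYS):
    """Turn the tags of one OpenStreetMap object into (subject, predicate, object) triples."""
    triples = []
    for key, value in tags.items():
        if key in common_keys:
            # Assumed modelling of a common tag as a small class hierarchy.
            triples.append((osm_id, "rdf:type", value))
            triples.append((value, "rdfs:subClassOf", key))
        else:
            # Uncommon tag: OSM ID as subject, key as predicate, value as object.
            triples.append((osm_id, key, value))
    return triples


if __name__ == "__main__":
    example = tags_to_triples("node/1", {"amenity": "restaurant", "name": "Example"})
    for triple in example:
        print(triple)
```

In a Spark job the same logic would typically be applied per row, with the tag map exploded into one row per key-value pair before the triples are written out.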
For the grid cell-based relationships, we compare the grid cells to each other. We extract three different relations: the neighborhood relation (isAdjacentTo), the child relation (isChildCellOf) and the parent relation (isParentCellOf). These relations take the hierarchical dependencies of the h3 grid into account and connect the individual grid cells with each other.
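A minimal sketch of how the three relations could be turned into triples is shown below. It assumes the parent and neighbor lookups have already been computed (the h3 library offers functions for both, whose names differ between its major versions) and are passed in as plain dictionaries. The function name and the cell identifiers in the example are hypothetical placeholders, not real h3 indexes.

```python
def grid_relation_triples(parent_of, neighbours):
    """Build grid relation triples.

    parent_of:  dict mapping a child cell ID to its parent cell ID
    neighbours: dict mapping a cell ID to an iterable of adjacent cell IDs
    """
    triples = []
    for child, parent in parent_of.items():
        # The hierarchy is emitted in both directions.
        triples.append((child, "isChildCellOf", parent))
        triples.append((parent, "isParentCellOf", child))
    for cell, adjacent_cells in neighbours.items():
        for other in adjacent_cells:
            triples.append((cell, "isAdjacentTo", other))
    return triples


if __name__ == "__main__":
    # Placeholder IDs for illustration only.
    print(grid_relation_triples({"cell_a": "cell_p"}, {"cell_a": ["cell_b"]}))
```

Emitting both isChildCellOf and isParentCellOf makes the hierarchy traversable in either direction without requiring inverse-property reasoning at query time.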
Furthermore, we extract the spatial predicates between individual h3 grid cells and OpenStreetMap entities. The extraction is based on the DE-9IM model, and each spatial relationship is expressed from the h3 grid cell to the OpenStreetMap entity.
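DE-9IM describes the relation between two geometries as a nine-character matrix, and named predicates correspond to patterns over that matrix. The sketch below matches a matrix string (as returned, for example, by Shapely's `relate` method) against the standard patterns for three predicates. It is an illustration of the model, not the repository's extraction code; the function names are hypothetical, and which predicates the repository actually extracts is not stated here.

```python
# Standard DE-9IM patterns for a few named predicates.
PREDICATE_PATTERNS = {
    "within": "T*F**F***",
    "contains": "T*****FF*",
    "disjoint": "FF*FF****",
}


def matches_de9im(matrix, pattern):
    """Check a 9-character DE-9IM matrix against a pattern.

    Pattern symbols: '*' matches anything, 'T' matches any intersection
    (dimension 0, 1 or 2), 'F' and digits must match exactly.
    """
    if len(matrix) != 9 or len(pattern) != 9:
        raise ValueError("matrix and pattern must both have 9 characters")
    for cell, wanted in zip(matrix.upper(), pattern.upper()):
        if wanted == "*":
            continue
        if wanted == "T":
            if cell not in "012":
                return False
        elif cell != wanted:
            return False
    return True


def named_predicates(matrix):
    """Return the names of all predicates whose pattern matches the matrix."""
    return [name for name, pattern in PREDICATE_PATTERNS.items()
            if matches_de9im(matrix, pattern)]
```

With the grid cell as the first geometry and the OpenStreetMap entity as the second, a matrix of `2FF1FF212` would yield `within`, meaning the grid cell lies within the entity.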
