Commit 762f151

fix: add fix for slow full count in postgresql
This commit adds a fix for the slow performance of full counts in PostgreSQL (Issue #1969). Two new PostgreSQL provider settings have been added (postgresql_pseudo_count_enabled and postgresql_pseudo_count_start), allowing the use of pseudo counts to be configured individually on each use of the PostgreSQL provider.

The fix uses the PostgreSQL EXPLAIN command to "guess" the number of rows that will be returned by a given request. This does not affect all queries equally, because pseudo counts cannot be run on the following types of query:

- Requests with a Result Type of Hits.
- Requests with a CQL filter.
- Requests with a BBOX filter.
- Requests with a Temporal filter.

Also, you can use the postgresql_pseudo_count_start setting to tell the system to run a full count if the row estimate is too small, meaning there is enough time for a full count to be run.

This commit also adds the required documentation and PostgreSQL provider test changes, including adding a building_type and datetime column to the dummy_data.sql file.
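The exclusions listed in the commit message can be sketched as a small eligibility check. This is an illustrative sketch only, not the code from this commit: the function name `pseudo_count_allowed` and its boolean parameters are hypothetical, chosen to mirror the four request types named above.

```python
def pseudo_count_allowed(
    resulttype: str = "results",
    has_cql_filter: bool = False,
    has_bbox_filter: bool = False,
    has_temporal_filter: bool = False,
) -> bool:
    """Return True when none of the listed exclusions applies.

    Hypothetical helper: the names are assumptions made for illustration.
    """
    # Requests with a Result Type of Hits always get a full count.
    if resulttype == "hits":
        return False
    # Any CQL, BBOX, or temporal filter also forces a full count.
    return not (has_cql_filter or has_bbox_filter or has_temporal_filter)


print(pseudo_count_allowed())                      # unfiltered results request
print(pseudo_count_allowed(resulttype="hits"))     # hits request
print(pseudo_count_allowed(has_bbox_filter=True))  # BBOX-filtered request
```

Only an unfiltered results request passes the check; every other case falls back to a full count.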
1 parent 0081549 commit 762f151

4 files changed

Lines changed: 267 additions & 23 deletions

docs/source/publishing/ogcapi-features.rst

Lines changed: 26 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -450,15 +450,15 @@ Connection
450450
password: geo_test
451451
# external_auth: wallet
452452
# tns_name: XEPDB1
453-
# tns_admin /opt/oracle/client/network/admin
453+
# tns_admin /opt/oracle/client/network/admin
454454
# init_oracle_client: True
455455
456456
id_field: id
457457
table: lakes
458458
geom_field: geometry
459459
title_field: name
460460
461-
The provider supports connection over host and port with SID, SERVICE_NAME or TNS_NAME. For TNS naming, the system
461+
The provider supports connection over host and port with SID, SERVICE_NAME or TNS_NAME. For TNS naming, the system
462462
environment variable TNS_ADMIN or the configuration parameter tns_admin must be set.
463463

464464
The provider supports external authentication. At the moment only wallet authentication is implemented.
@@ -484,7 +484,7 @@ SDO options
484484
title_field: name
485485
sdo_operator: sdo_relate # defaults to sdo_filter
486486
sdo_param: mask=touch+coveredby # defaults to mask=anyinteract
487-
487+
488488
The provider supports two different SDO operators, sdo_filter and sdo_relate. When not set, the default is sdo_filter.
489489
Furthermore, it is possible to set the sdo_param option. When sdo_relate is used, the default is mask=anyinteract.
490490
`See Oracle Documentation for details <https://docs.oracle.com/en/database/oracle/oracle-database/23/spatl/spatial-operators-reference.html>`_.
@@ -509,7 +509,7 @@ Mandatory properties
509509
mandatory_properties:
510510
- example_group_id
511511
512-
On large tables it could be useful to disallow a query on the complete dataset. For this reason it is possible to
512+
On large tables it could be useful to disallow a query on the complete dataset. For this reason it is possible to
513513
configure mandatory properties. When this is activated, the provider throws an exception when the parameter
514514
is not in the query uri.
515515

@@ -556,13 +556,13 @@ Extra_params
556556
""""""""""""
557557
The Oracle provider accepts additional request parameters that are not defined in ``pygeoapi-config.yml`` and passes them to a custom SQL-Manipulator-Plugin. An example use case is advanced filtering without exposing the filtered columns, for example ``.../collections/some_data/items?is_recent=true``. The ``SqlManipulator`` plugin's ``process_query`` method would receive ``extra_params = {'is_recent': 'true'}`` and could dynamically add a custom condition to the SQL query, such as ``AND SYSDATE - create_date < 30``.
558558

559-
The ``include_extra_query_parameters`` has to be set to ``true`` for the collection in ``pygeoapi-config.yml``. This ensures that the additional request parameters (e.g. ``is_recent=true``) are not discarded.
559+
The ``include_extra_query_parameters`` has to be set to ``true`` for the collection in ``pygeoapi-config.yml``. This ensures that the additional request parameters (e.g. ``is_recent=true``) are not discarded.
560560

561561

562562
Custom SQL Manipulator Plugin
563563
"""""""""""""""""""""""""""""
564564
The provider supports a SQL-Manipulator-Plugin class. With this, the SQL statement could be manipulated. This is
565-
useful e.g. for authorization at row level or manipulation of the explain plan with hints.
565+
useful e.g. for authorization at row level or manipulation of the explain plan with hints.
566566

567567
More information and examples about this feature can be found in ``tests/provider/test_oracle_provider.py``.
568568

@@ -584,14 +584,14 @@ To publish a GeoParquet file (with a geometry column) the geopandas package is a
584584
providers:
585585
- type: feature
586586
name: Parquet
587-
data:
587+
data:
588588
source: ./tests/data/parquet/random.parquet
589589
id_field: id
590590
time_field: time
591591
x_field:
592592
- minlon
593593
- maxlon
594-
y_field:
594+
y_field:
595595
- minlat
596596
- maxlat
597597
@@ -663,6 +663,24 @@ These are optional and if not specified, the default from the engine will be use
663663
table: hotosm_bdi_waterways
664664
geom_field: foo_geom
665665
666+
Because of PostgreSQL's multi-version concurrency control (MVCC), row counts for queries that return a large number of rows can take a significant amount of time to complete. To address this issue, the following settings can be configured on your PostgreSQL provider:
667+
668+
.. csv-table::
669+
:header: Name, Type, Default Value, Description
670+
:align: left
671+
672+
postgresql_pseudo_count_enabled, Boolean, false, Enables pseudo count.
673+
postgresql_pseudo_count_start, Integer, 5000000, Sets the minimum estimated row count at which a pseudo count is returned when pseudo counts are enabled.
674+
675+
This solution uses the built-in PostgreSQL EXPLAIN command to “guess” the number of rows a given query will return. If that value is greater than or equal to the postgresql_pseudo_count_start value, the “guessed” value from EXPLAIN is returned in the response. If the “guessed” value is lower, a full count is run and its result is returned in the response. These settings do not affect the following types of requests:
676+
677+
* Requests with a Result Type of Hits.
678+
* Requests with a CQL filter.
679+
* Requests with a BBOX filter.
680+
* Requests with a Temporal filter.
681+
682+
These pseudo count options let you configure each PostgreSQL provider individually to get the best performance out of your API. When choosing whether or not to use them, it is important to understand the trade-off you are making. By enabling pseudo counts you are choosing speed over accuracy: some pseudo counts may be higher and others lower than the true row count for a given query. With some datasets, this trade-off might be better than not being able to provide the data at all. Although both settings have default values, pseudo counts are only used when postgresql_pseudo_count_enabled is set to true.
683+
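The threshold behaviour described in the documentation above can be sketched as follows. This is a minimal sketch under stated assumptions, not the provider's implementation: `PseudoCountSettings`, `read_pseudo_count_settings`, and `number_matched` are hypothetical names, and the sketch deliberately ignores the request types that are excluded from pseudo counts. Only the two setting names and their defaults are taken from the documentation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping


@dataclass(frozen=True)
class PseudoCountSettings:
    """Hypothetical container for the two documented settings."""
    enabled: bool = False
    start: int = 5_000_000


def read_pseudo_count_settings(provider_def: Mapping[str, Any]) -> PseudoCountSettings:
    """Read both settings from a provider definition, applying the documented defaults."""
    return PseudoCountSettings(
        enabled=bool(provider_def.get("postgresql_pseudo_count_enabled", False)),
        start=int(provider_def.get("postgresql_pseudo_count_start", 5_000_000)),
    )


def number_matched(
    settings: PseudoCountSettings,
    estimated_rows: int,
    run_full_count: Callable[[], int],
) -> int:
    """Return the estimate when it reaches the threshold, else the exact count."""
    if settings.enabled and estimated_rows >= settings.start:
        return estimated_rows
    # The exact count only runs when the estimate is not used.
    return run_full_count()


settings = read_pseudo_count_settings({"postgresql_pseudo_count_enabled": True})
print(number_matched(settings, 6_200_000, lambda: 6_187_432))  # estimate is used
print(number_matched(settings, 12, lambda: 12))                # full count is used
```

Passing the full count as a callable keeps the expensive query from running when the estimate is already large enough, which is the point of the trade-off described above.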
666684
The PostgreSQL provider is also able to connect to Cloud SQL databases.
667685

668686
.. code-block:: yaml

pygeoapi/provider/sql.py

Lines changed: 124 additions & 1 deletion
@@ -57,6 +57,7 @@
5757
from decimal import Decimal
5858
import functools
5959
import logging
60+
from re import search
6061
from typing import Optional
6162

6263
from geoalchemy2 import Geometry # noqa - this isn't used explicitly but is needed to process Geometry columns
@@ -71,7 +72,8 @@
7172
PrimaryKeyConstraint,
7273
asc,
7374
desc,
74-
delete
75+
delete,
76+
text
7577
)
7678
from sqlalchemy.engine import URL
7779
from sqlalchemy.exc import (
@@ -127,6 +129,12 @@ def __init__(
127129
self.id_field = provider_def['id_field']
128130
self.geom = provider_def.get('geom_field', 'geom')
129131
self.driver_name = driver_name
132+
self.postgresql_pseudo_count_enabled = provider_def.get(
133+
'postgresql_pseudo_count_enabled', False
134+
)
135+
self.postgresql_pseudo_count_start = provider_def.get(
136+
'postgresql_pseudo_count_start', 5000000
137+
)
130138

131139
LOGGER.debug(f'Name: {self.name}')
132140
LOGGER.debug(f'Table: {self.table}')
@@ -738,6 +746,121 @@ def _get_bbox_filter(self, bbox: list[float]):
738746

739747
return bbox_filter
740748

749+
def query(
750+
self,
751+
offset=0,
752+
limit=10,
753+
resulttype='results',
754+
bbox=[],
755+
datetime_=None,
756+
properties=[],
757+
sortby=[],
758+
select_properties=[],
759+
skip_geometry=False,
760+
q=None,
761+
filterq=None,
762+
crs_transform_spec=None,
763+
**kwargs
764+
):
765+
"""
766+
Query sql database for all the content.
767+
e.g.: http://localhost:5000/collections/hotosm_bdi_waterways/items?
768+
limit=1&resulttype=results
769+
770+
:param offset: starting record to return (default 0)
771+
:param limit: number of records to return (default 10)
772+
:param resulttype: return results or hit limit (default results)
773+
:param bbox: bounding box [minx,miny,maxx,maxy]
774+
:param datetime_: temporal (datestamp or extent)
775+
:param properties: list of tuples (name, value)
776+
:param sortby: list of dicts (property, order)
777+
:param select_properties: list of property names
778+
:param skip_geometry: bool of whether to skip geometry (default False)
779+
:param q: full-text search term(s)
780+
:param filterq: CQL query as text string
781+
:param crs_transform_spec: `CrsTransformSpec` instance, optional
782+
783+
:returns: GeoJSON FeatureCollection
784+
"""
785+
786+
LOGGER.debug('Preparing filters')
787+
property_filters = self._get_property_filters(properties)
788+
cql_filters = self._get_cql_filters(filterq)
789+
bbox_filter = self._get_bbox_filter(bbox)
790+
time_filter = self._get_datetime_filter(datetime_)
791+
order_by_clauses = self._get_order_by_clauses(sortby, self.table_model)
792+
selected_properties = self._select_properties_clause(
793+
select_properties, skip_geometry
794+
)
795+
796+
LOGGER.debug('Querying Database')
797+
# Execute query within self-closing database Session context
798+
with Session(self._engine) as session:
799+
results = (
800+
session.query(self.table_model)
801+
.filter(property_filters)
802+
.filter(cql_filters)
803+
.filter(bbox_filter)
804+
.filter(time_filter)
805+
.options(selected_properties)
806+
)
807+
808+
LOGGER.debug(f'PostgreSQL pseudo count enabled: {self.postgresql_pseudo_count_enabled}') # noqa
809+
LOGGER.debug(f'PostgreSQL pseudo count start: {self.postgresql_pseudo_count_start}') # noqa
810+
811+
if self.postgresql_pseudo_count_enabled:
812+
# This if statement uses is not True for cql_filters, bbox_filter, and time_filter because even when no value is provided they are True. This is because an empty object is always set for each of these values if no value is provided by the user. # noqa
813+
if resulttype == 'hits' or cql_filters is not True or bbox_filter is not True or time_filter is not True: # noqa
814+
matched = results.count()
815+
LOGGER.debug('Full count executed (hits or filters)')
816+
else:
817+
compiled_query = results.statement.compile(
818+
self._engine,
819+
compile_kwargs={"literal_binds": True}
820+
)
821+
explain_query = f"EXPLAIN {compiled_query}"
822+
query_explanation = session.execute(text(explain_query))
823+
explanation_overview = query_explanation.fetchone()
824+
match = (
825+
search(r'rows=(\d+)', str(explanation_overview[0]))
826+
if explanation_overview else None
827+
)
828+
matched = int(match.group(1)) if match else 0
829+
LOGGER.debug('Pseudo count executed')
830+
831+
if matched < self.postgresql_pseudo_count_start:
832+
matched = results.count()
833+
LOGGER.debug('Full count executed (too few features)')
834+
835+
else:
836+
matched = results.count()
837+
838+
LOGGER.debug(f'Found {matched} result(s)')
839+
840+
LOGGER.debug('Preparing response')
841+
response = {
842+
'type': 'FeatureCollection',
843+
'features': [],
844+
'numberMatched': matched,
845+
'numberReturned': 0
846+
}
847+
848+
if resulttype == 'hits' or not results:
849+
return response
850+
851+
crs_transform_out = get_transform_from_spec(crs_transform_spec)
852+
853+
for item in (
854+
results.order_by(*order_by_clauses).offset(offset).limit(limit)
855+
):
856+
response['numberReturned'] += 1
857+
response['features'].append(
858+
self._sqlalchemy_to_feature(item, crs_transform_out,
859+
select_properties)
860+
)
861+
862+
return response
863+
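The row-estimate extraction used in the method above can be tried in isolation. The sketch below is a hedged illustration, not part of the commit: the helper name `estimated_rows` and the sample plan line are assumptions, although the sample follows the usual `(cost=... rows=... width=...)` shape of the first line of PostgreSQL EXPLAIN output. Unlike the method above, which falls back to 0 when nothing matches, this sketch returns `None` so that a missing estimate stays distinguishable from an estimate of zero.

```python
import re
from typing import Optional

# Same pattern as in the diff: the planner's estimate appears as "rows=<n>".
ROWS_PATTERN = re.compile(r"rows=(\d+)")


def estimated_rows(explain_first_line: Optional[str]) -> Optional[int]:
    """Extract the planner's row estimate from the first line of EXPLAIN output.

    Hypothetical helper for illustration; returns None when no estimate is found.
    """
    if not explain_first_line:
        return None
    match = ROWS_PATTERN.search(explain_first_line)
    return int(match.group(1)) if match else None


# Illustrative plan line (an assumption, not output captured from a real database).
sample = "Seq Scan on buildings  (cost=0.00..1.12 rows=12 width=100)"
print(estimated_rows(sample))  # 12
```

A caller could treat `None` the same way the commit treats a low estimate and fall back to a full count.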
741864

742865
class MySQLProvider(GenericSQLProvider):
743866
"""

tests/data/dummy_data.sql

Lines changed: 16 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -6,37 +6,39 @@ CREATE EXTENSION IF NOT EXISTS postgis WITH SCHEMA dummy;
66
CREATE TABLE IF NOT EXISTS dummy.buildings(
77
gid serial PRIMARY KEY,
88
centroid geometry(POINT, 25833),
9-
contours geometry(POLYGON, 25833)
9+
contours geometry(POLYGON, 25833),
10+
building_type varchar,
11+
datetime timestamp
1012
);
1113

12-
INSERT INTO dummy.buildings(centroid, contours)
14+
INSERT INTO dummy.buildings(centroid, contours, building_type, datetime)
1315
VALUES (ST_GeomFromText('POINT (473449 7463146)', 25833),
14-
ST_GeomFromText('POLYGON ((473447.9967755177 7463140.685534775, 473453.51980463834 7463143.029921546, 473450.0032244818 7463151.314465227, 473444.4801953612 7463148.970078456, 473447.9967755177 7463140.685534775))', 25833)),
16+
ST_GeomFromText('POLYGON ((473447.9967755177 7463140.685534775, 473453.51980463834 7463143.029921546, 473450.0032244818 7463151.314465227, 473444.4801953612 7463148.970078456, 473447.9967755177 7463140.685534775))', 25833), 'commercial', '2021-10-31 09:00:00.000'),
1517
(ST_GeomFromText('POINT (473458 7463104)', 25833),
16-
ST_GeomFromText('POLYGON ((473460.9359104787 7463106.762323238, 473457.1106914547 7463107.931810057, 473455.06408952177 7463101.237676765, 473458.88930854574 7463100.068189946, 473460.9359104787 7463106.762323238))', 25833)),
18+
ST_GeomFromText('POLYGON ((473460.9359104787 7463106.762323238, 473457.1106914547 7463107.931810057, 473455.06408952177 7463101.237676765, 473458.88930854574 7463100.068189946, 473460.9359104787 7463106.762323238))', 25833), 'commercial', '2021-10-31 09:00:00.000'),
1719
(ST_GeomFromText('POINT (473446 7463144)', 25833),
18-
ST_GeomFromText('POLYGON ((473446.09474694915 7463138.853056925, 473450.31999101397 7463146.79958526, 473445.9052530499 7463149.146943075, 473441.6800089851 7463141.20041474, 473446.09474694915 7463138.853056925))', 25833)),
20+
ST_GeomFromText('POLYGON ((473446.09474694915 7463138.853056925, 473450.31999101397 7463146.79958526, 473445.9052530499 7463149.146943075, 473441.6800089851 7463141.20041474, 473446.09474694915 7463138.853056925))', 25833), 'commercial', '2021-10-31 09:00:00.000'),
1921
(ST_GeomFromText('POINT (473449 7463142)', 25833),
20-
ST_GeomFromText('POLYGON ((473452.3381955018 7463138.820935548, 473452.65221123956 7463144.812712757, 473445.6618044963 7463145.179064451, 473445.3477887586 7463139.187287242, 473452.3381955018 7463138.820935548))', 25833)),
22+
ST_GeomFromText('POLYGON ((473452.3381955018 7463138.820935548, 473452.65221123956 7463144.812712757, 473445.6618044963 7463145.179064451, 473445.3477887586 7463139.187287242, 473452.3381955018 7463138.820935548))', 25833), 'commercial', '2021-10-31 09:00:00.000'),
2123
(ST_GeomFromText('POINT (473443 7463137)', 25833),
22-
ST_GeomFromText('POLYGON ((473447.7083111685 7463135.5571535295, 473440.9159249468 7463141.46168479, 473438.2916888306 7463138.44284647, 473445.0840750523 7463132.538315209, 473447.7083111685 7463135.5571535295))', 25833)),
24+
ST_GeomFromText('POLYGON ((473447.7083111685 7463135.5571535295, 473440.9159249468 7463141.46168479, 473438.2916888306 7463138.44284647, 473445.0840750523 7463132.538315209, 473447.7083111685 7463135.5571535295))', 25833), 'residential', '2021-11-23 09:00:00.000'),
2325
(ST_GeomFromText('POINT (473433 7463125)', 25833),
24-
ST_GeomFromText('POLYGON ((473432.73905580025 7463120.082489641, 473436.8249702975 7463128.10154836, 473433.2609442007 7463129.917510359, 473429.1750297034 7463121.898451641, 473432.73905580025 7463120.082489641))', 25833)),
26+
ST_GeomFromText('POLYGON ((473432.73905580025 7463120.082489641, 473436.8249702975 7463128.10154836, 473433.2609442007 7463129.917510359, 473429.1750297034 7463121.898451641, 473432.73905580025 7463120.082489641))', 25833), 'commercial', '2021-10-31 09:00:00.000'),
2527
(ST_GeomFromText('POINT (473451 7463140)', 25833),
26-
ST_GeomFromText('POLYGON ((473454.99435667787 7463139.456755368, 473453.4959303038 7463143.165490787, 473447.00564332213 7463140.543244633, 473448.5040696962 7463136.834509214, 473454.99435667787 7463139.456755368))', 25833)),
28+
ST_GeomFromText('POLYGON ((473454.99435667787 7463139.456755368, 473453.4959303038 7463143.165490787, 473447.00564332213 7463140.543244633, 473448.5040696962 7463136.834509214, 473454.99435667787 7463139.456755368))', 25833), 'commercial', '2021-10-31 09:00:00.000'),
2729
(ST_GeomFromText('POINT (473438 7463144)', 25833),
28-
ST_GeomFromText('POLYGON ((473438.99554283824 7463137.7143898895, 473444.28561010957 7463144.995542839, 473437.00445716083 7463150.28561011, 473431.7143898895 7463143.00445716, 473438.99554283824 7463137.7143898895))', 25833)),
30+
ST_GeomFromText('POLYGON ((473438.99554283824 7463137.7143898895, 473444.28561010957 7463144.995542839, 473437.00445716083 7463150.28561011, 473431.7143898895 7463143.00445716, 473438.99554283824 7463137.7143898895))', 25833), 'commercial', '2021-10-31 09:00:00.000'),
2931
(ST_GeomFromText('POINT (473474 7463101)', 25833),
30-
ST_GeomFromText('POLYGON ((473474.83006438427 7463097.491297516, 473477.55805782415 7463100.416712323, 473473.1699356148 7463104.508702483, 473470.441942174 7463101.583287676, 473474.83006438427 7463097.491297516))', 25833)),
32+
ST_GeomFromText('POLYGON ((473474.83006438427 7463097.491297516, 473477.55805782415 7463100.416712323, 473473.1699356148 7463104.508702483, 473470.441942174 7463101.583287676, 473474.83006438427 7463097.491297516))', 25833), 'commercial', '2021-10-31 09:00:00.000'),
3133
-- gid 10
3234
(NULL,
33-
ST_GeomFromText('POLYGON ((473464.1495667333 7463116.574655892, 473461.1307284124 7463119.1988920085, 473457.85043326765 7463115.425344108, 473460.8692715885 7463112.8011079915, 473464.1495667333 7463116.574655892))', 25833)),
35+
ST_GeomFromText('POLYGON ((473464.1495667333 7463116.574655892, 473461.1307284124 7463119.1988920085, 473457.85043326765 7463115.425344108, 473460.8692715885 7463112.8011079915, 473464.1495667333 7463116.574655892))', 25833), 'commercial', '2021-10-31 09:00:00.000'),
3436
-- gid 11
3537
(ST_GeomFromText('POINT (473461 7463116)', 25833),
36-
NULL),
38+
NULL, 'commercial', '2021-10-31 09:00:00.000'),
3739
-- gid 12
3840
(NULL,
39-
NULL);
41+
NULL, 'commercial', '2021-10-31 09:00:00.000');
4042

4143
/* Two tables which create a naming conflict
4244
