Hi @mmcclenn & @jpjenk, I just want to clarify the discussion we had about flat data structures in the API response.
Right now, regardless of data format (json, xml, csv), we are returning data as a flat table.
I understand the motivation for doing this for csv formats, but the JSON and XML formats are designed to return structured data, so I'm not clear why we wouldn't use this in that case.
For example, the bibJSON schema for publications is designed to support (for example) variable length author lists, or sets of publications with differing reference structures.
Given the extent of repetition and the potentially large size of some of our responses, it might make sense to consider structured data formats for some of the responses, particularly since we're making our users define the response type they're expecting.
For example, a publication response in JSON would use the bibJSON standard, while in CSV it would be a wide table that could be saved as csv.
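To make the publication case concrete, here is a minimal Python sketch of one internal record set serialized two ways. The field names (`title`, `year`, `author`) are only loosely modelled on bibJSON, and the records, the `to_json`/`to_csv` helpers, and the `"; "` author separator are all placeholders I made up for illustration, not what the API currently does.

```python
import csv
import io
import json

# Hypothetical publication records, loosely modelled on bibJSON.
# The author list can have a different length for each record.
publications = [
    {"title": "Example paper one", "year": "2001",
     "author": [{"name": "Author A"}, {"name": "Author B"}, {"name": "Author C"}]},
    {"title": "Example paper two", "year": "2010",
     "author": [{"name": "Author D"}]},
]


def to_json(pubs):
    """Structured response: the author list stays a list."""
    return json.dumps({"records": pubs})


def to_csv(pubs):
    """Flat response: one row per publication, authors collapsed into one column."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["title", "year", "authors"], lineterminator="\n"
    )
    writer.writeheader()
    for pub in pubs:
        writer.writerow({
            "title": pub["title"],
            "year": pub["year"],
            # Assumed convention: join author names with "; "
            "authors": "; ".join(a["name"] for a in pub["author"]),
        })
    return buf.getvalue()


print(to_json(publications))
print(to_csv(publications))
```

The point is only that the structured form is the natural source, and the wide table can be derived from it when the user asks for CSV.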
My thinking is two-fold:
- I want to avoid repetition in the response as much as possible. Even structuring the API response for occurrences:
{
"elapsed_time":14.8,
"warnings":[
"Neotoma: Request failed",
"Neotoma: WKT not properly formatted: Polygon((-180 -90,10 -90,10 180,-180 180,-180 -90))"
],
"records": [
{"Database":"PaleoBioDB","OccurrenceID":"pbdb:occ:94749","RecordType":"Occurrence","TaxonName":"Busycon","TaxonID":"pbdb:txn:10874","AgeOlder":2.588,"AgeYounger":0.0117,"AgeUnit":"Ma","SiteID":"pbdb:col:7108"},
. . .
{"Database":"PaleoBioDB","OccurrenceID":"pbdb:occ:94752","RecordType":"Occurrence","TaxonName":"Busycotypus canaliculatus","TaxonID":"pbdb:txn:94432","AgeOlder":2.588,"AgeYounger":0.0117,"AgeUnit":"Ma","SiteID":"pbdb:col:7108"}]}
versus:
{
"elapsed_time":14.8,
"records": [
{"Database":"PaleoBioDB","occurrences":[{"OccurrenceID":"pbdb:occ:94749","RecordType":"Occurrence","TaxonName":"Busycon","TaxonID":"pbdb:txn:10874","AgeOlder":2.588,"AgeYounger":0.0117,"AgeUnit":"Ma","SiteID":"pbdb:col:7108"},
. . .
{"OccurrenceID":"pbdb:occ:94752","RecordType":"Occurrence","TaxonName":"Busycotypus canaliculatus","TaxonID":"pbdb:txn:94432","AgeOlder":2.588,"AgeYounger":0.0117,"AgeUnit":"Ma","SiteID":"pbdb:col:7108"}]}]}
saves us an astounding 24 bytes per row :) Which isn't that much, I suppose, but then we could add a bit more structure, returning a taxon table for multi-taxon responses that would link the taxon IDs to the names, so we wouldn't need to repeat those either. I think we'd see performance improvements in the downstream applications that use the API, particularly web-based services that use JSON natively.
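As a rough sketch of the restructuring I have in mind, the Python below groups flat occurrence rows by `Database` and pulls taxon names into a separate lookup table. The two input rows are the ones from the example above; the `nest_records` function and the top-level `taxa` key are hypothetical names of mine, not part of the current API.

```python
import json

# The two flat occurrence rows shown in the example above.
flat_records = [
    {"Database": "PaleoBioDB", "OccurrenceID": "pbdb:occ:94749",
     "RecordType": "Occurrence", "TaxonName": "Busycon",
     "TaxonID": "pbdb:txn:10874", "AgeOlder": 2.588, "AgeYounger": 0.0117,
     "AgeUnit": "Ma", "SiteID": "pbdb:col:7108"},
    {"Database": "PaleoBioDB", "OccurrenceID": "pbdb:occ:94752",
     "RecordType": "Occurrence", "TaxonName": "Busycotypus canaliculatus",
     "TaxonID": "pbdb:txn:94432", "AgeOlder": 2.588, "AgeYounger": 0.0117,
     "AgeUnit": "Ma", "SiteID": "pbdb:col:7108"},
]


def nest_records(records):
    """Group occurrences by database and move taxon names into a lookup table."""
    taxa = {}          # TaxonID -> TaxonName, stored once per taxon
    by_database = {}   # Database -> list of occurrence dicts
    for rec in records:
        taxa[rec["TaxonID"]] = rec["TaxonName"]
        # Drop the fields that are now stored once at a higher level.
        occ = {k: v for k, v in rec.items()
               if k not in ("Database", "TaxonName")}
        by_database.setdefault(rec["Database"], []).append(occ)
    return {
        # "taxa" is a hypothetical key name for the taxon table.
        "taxa": [{"TaxonID": tid, "TaxonName": name}
                 for tid, name in taxa.items()],
        "records": [{"Database": db, "occurrences": occs}
                    for db, occs in by_database.items()],
    }


nested = nest_records(flat_records)

# The per-row saving from hoisting the database name out of each record.
per_row_saving = len('"Database":"PaleoBioDB",')

print(json.dumps(nested, indent=2))
print(per_row_saving)
```

Whether the taxon table pays off depends on how often a taxon repeats in a response: with every taxon appearing only once, the lookup table adds overhead instead of removing it.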
Tagging @spatialit as well.