Clarifying flat vs. structured data responses #7

@SimonGoring

Hi @mmcclenn & @jpjenk, I just want to clarify the discussion we had about flat data structures in the API response.

Right now, regardless of data format (json, xml, csv), we are returning data as a flat table.

I understand the motivation for doing this for csv formats, but the JSON and XML formats are designed to return structured data, so I'm not clear why we wouldn't use this in that case.

For example, the bibJSON schema for publications is designed to support variable-length author lists, or sets of publications with differing reference structures.

Given the extent of repetition and the potentially large size of some of our responses, it might make sense to consider structured data formats for some of the responses, particularly since we're making our users define the response type they're expecting.

For example, a publication response in JSON would use the bibJSON standard, while in CSV it would be a wide table that could be saved as csv.
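To make the contrast concrete, here is a minimal Python sketch of the same publication data in both shapes. The records are made-up placeholders that are only loosely bibJSON-shaped (an `author` list of objects with a `name` key), and the `author_1`, `author_2`, ... column names and the helper functions are hypothetical, not something the API currently returns.

```python
import csv
import io

# Hypothetical publication records, loosely bibJSON-shaped.
# Titles, years, and author names are placeholders, not real data.
publications = [
    {"title": "Example paper A", "year": "2001",
     "author": [{"name": "Author One"}, {"name": "Author Two"}, {"name": "Author Three"}]},
    {"title": "Example paper B", "year": "2010",
     "author": [{"name": "Author Four"}]},
]


def to_wide_rows(pubs):
    """Flatten nested records into a wide table with one column per author position."""
    max_authors = max(len(p["author"]) for p in pubs)
    header = ["title", "year"] + [f"author_{i + 1}" for i in range(max_authors)]
    rows = []
    for p in pubs:
        names = [a["name"] for a in p["author"]]
        # Pad short author lists so every row has the same width.
        names += [""] * (max_authors - len(names))
        rows.append([p["title"], p["year"]] + names)
    return header, rows


def to_csv(pubs):
    """Serialize the wide table as CSV text."""
    header, rows = to_wide_rows(pubs)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()
```

In the structured form the author list can simply vary in length; in the wide form every row has to be padded out to the longest author list in the response.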

My thinking is two-fold:

  1. I want to avoid repetition in the response as much as possible. Even structuring the API response for occurrences:
{
"elapsed_time":14.8,
"warnings":[
"Neotoma: Request failed",
"Neotoma:  WKT not properly formatted: Polygon((-180 -90,10 -90,10 180,-180 180,-180 -90))"
],
"records": [
{"Database":"PaleoBioDB","OccurrenceID":"pbdb:occ:94749","RecordType":"Occurrence","TaxonName":"Busycon","TaxonID":"pbdb:txn:10874","AgeOlder":2.588,"AgeYounger":0.0117,"AgeUnit":"Ma","SiteID":"pbdb:col:7108"},
. . . 
{"Database":"PaleoBioDB","OccurrenceID":"pbdb:occ:94752","RecordType":"Occurrence","TaxonName":"Busycotypus canaliculatus","TaxonID":"pbdb:txn:94432","AgeOlder":2.588,"AgeYounger":0.0117,"AgeUnit":"Ma","SiteID":"pbdb:col:7108"}]}

versus:

{
"elapsed_time":14.8,
"records": [
{"Database":"PaleoBioDB","occurrences":[{"OccurrenceID":"pbdb:occ:94749","RecordType":"Occurrence","TaxonName":"Busycon","TaxonID":"pbdb:txn:10874","AgeOlder":2.588,"AgeYounger":0.0117,"AgeUnit":"Ma","SiteID":"pbdb:col:7108"},
. . . 
{"OccurrenceID":"pbdb:occ:94752","RecordType":"Occurrence","TaxonName":"Busycotypus canaliculatus","TaxonID":"pbdb:txn:94432","AgeOlder":2.588,"AgeYounger":0.0117,"AgeUnit":"Ma","SiteID":"pbdb:col:7108"}]}]}

saves us an astounding 24 bytes per row :) Which isn't that much, I suppose, but then we could add a bit more structure, returning a taxon table for multi-taxon responses that would link the taxon IDs to the names, so we wouldn't need to repeat those either. I think we'd see performance improvements in the downstream applications that use the API, particularly web-based services that use JSON natively.
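As a rough sketch of what that restructuring could look like, the Python below groups the two occurrence records shown above by `Database` and factors the taxon names out into a separate lookup table. The `restructure` and `compact_size` helpers and the top-level `taxa` key are hypothetical names for illustration only, not part of the current API.

```python
import json

# The two flat occurrence records from the example response above.
flat_records = [
    {"Database": "PaleoBioDB", "OccurrenceID": "pbdb:occ:94749",
     "RecordType": "Occurrence", "TaxonName": "Busycon",
     "TaxonID": "pbdb:txn:10874", "AgeOlder": 2.588, "AgeYounger": 0.0117,
     "AgeUnit": "Ma", "SiteID": "pbdb:col:7108"},
    {"Database": "PaleoBioDB", "OccurrenceID": "pbdb:occ:94752",
     "RecordType": "Occurrence", "TaxonName": "Busycotypus canaliculatus",
     "TaxonID": "pbdb:txn:94432", "AgeOlder": 2.588, "AgeYounger": 0.0117,
     "AgeUnit": "Ma", "SiteID": "pbdb:col:7108"},
]


def restructure(records):
    """Group occurrences by database and move taxon names into a lookup table."""
    taxa = {}     # TaxonID -> TaxonName, stored once per taxon
    grouped = {}  # Database -> list of occurrences without the repeated fields
    for rec in records:
        taxa[rec["TaxonID"]] = rec["TaxonName"]
        occ = {k: v for k, v in rec.items() if k not in ("Database", "TaxonName")}
        grouped.setdefault(rec["Database"], []).append(occ)
    return {
        # "taxa" is a hypothetical key name for the taxon table.
        "taxa": [{"TaxonID": tid, "TaxonName": name} for tid, name in taxa.items()],
        "records": [{"Database": db, "occurrences": occs}
                    for db, occs in grouped.items()],
    }


def compact_size(obj):
    """Length in characters of the compact JSON serialization."""
    return len(json.dumps(obj, separators=(",", ":")))


structured = restructure(flat_records)
```

Whether the taxon table actually shrinks a response depends on how often the same taxon repeats; `compact_size` can be run on both shapes of a real response to check before committing to the extra structure.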

Tagging @spatialit as well.
