@rdhyee rdhyee commented Nov 14, 2025

Summary

This PR adds comprehensive documentation for understanding and working with iSamples property graph (PQG) data. It was created in response to the difficulty of discovering the major structures within PQG files, particularly the "14 sentence types" that form the underlying grammar of iSamples metadata.

What's Included

Five interconnected documentation files totaling 59K+ words:

1. UNDERSTANDING_THE_GRAPH.md (Foundation)

  • Explains the 8 entity types (MaterialSampleRecord, SamplingEvent, etc.)
  • Details the 14 relationship types (predicates)
  • Introduces the "14 sentence types" as the complete grammar of iSamples
  • Covers graph traversal patterns and design rationale
  • Explains the unified table storage format

2. PREDICATES_REFERENCE.md (Detailed Reference)

  • Complete documentation for each of the 14 predicates
  • YAML usage examples for every predicate
  • SQL query patterns for common operations
  • Real statistics from OpenContext data (1.1M samples, 11.6M total records)
  • Common issues, solutions, and cross-domain usage comparison

3. EXAMPLES_BY_DOMAIN.md (Real-World Examples)

  • Complete examples from 3 scientific domains:
    • Archaeology: Pottery sherd from Çatalhöyük (OpenContext)
    • Geology: Basalt core from mid-ocean ridge (SESAR)
    • Biology: Coral tissue sample with parent chain (GEOME)
  • Full YAML examples (500+ lines each)
  • Domain-specific patterns and best practices
  • Cross-domain comparison tables

4. QUERYING_THE_GRAPH.md (Practical SQL Guide)

  • SQL query patterns for DuckDB (and other SQL databases)
  • Basic entity queries through complex multi-hop traversals
  • Aggregation, statistics, and geographic filtering
  • Performance optimization techniques
  • 10+ copy-paste query recipes (export, validation, GeoJSON generation)

5. EDGE_TYPES_VISUAL.md (Visual Guide)

  • Mermaid diagrams showing entity relationships
  • Complete ERD of all 8 entity types and 14 edge types
  • Connectivity matrices and heatmaps
  • Graph traversal path visualizations
  • Storage structure diagrams
  • Real data usage patterns from OpenContext

Key Features

  • Cross-referenced: each document links to related sections
  • Real examples: SQL queries tested on actual OpenContext data
  • Multi-domain: demonstrates archaeology, geology, and biology usage
  • Visual: Mermaid diagrams for complex relationships
  • Practical: copy-paste query recipes for immediate use
  • Complete: covers all 8 entity types and 14 edge types

Why This Matters

The iSamples property graph format is powerful but complex. These docs make the underlying structure explicit and accessible:

  • Developers can understand the graph schema and write efficient queries
  • Data providers can see how to structure their metadata across domains
  • Researchers can discover relationships and traverse the graph effectively
  • New users can learn the "grammar" (14 sentence types) systematically

Testing

  • All SQL examples tested against OpenContext parquet data (11.6M records)
  • YAML examples validated against LinkML schema
  • Mermaid diagrams render correctly on GitHub

Files Changed

src/docs/UNDERSTANDING_THE_GRAPH.md    (+1,082 lines)
src/docs/PREDICATES_REFERENCE.md       (+765 lines)
src/docs/EXAMPLES_BY_DOMAIN.md         (+912 lines)
src/docs/QUERYING_THE_GRAPH.md         (+975 lines)
src/docs/EDGE_TYPES_VISUAL.md          (+668 lines)

Total: 5 new files, 4,402 lines (sum of the per-file counts above)

Related Work

This documentation builds on recent work:

Questions for Discussion

  1. Location: Is src/docs/ the right place, or should these go elsewhere?
  2. Audience: Are these pitched at the right technical level?
  3. Additions: What other topics should be covered?
  4. Integration: Should we add links to these from the main README?

Looking forward to feedback! 🙏

🤖 Generated with Claude Code

Created 5 comprehensive documentation files to help users understand
the iSamples property graph structure:

1. UNDERSTANDING_THE_GRAPH.md (13K words)
   - Foundation document explaining the 8 entity types
   - Details on the 14 relationship types (predicates)
   - The 14 sentence types as the "grammar" of iSamples
   - Graph traversal patterns and design rationale
   - Storage format explanation

2. PREDICATES_REFERENCE.md (10K words)
   - Detailed reference for each of the 14 predicates
   - YAML usage examples for each predicate
   - SQL query patterns for common operations
   - OpenContext data statistics showing actual usage
   - Common issues and solutions
   - Cross-domain usage comparison

3. EXAMPLES_BY_DOMAIN.md (12K words)
   - Complete real-world examples from 3 scientific domains
   - Archaeology: Pottery sherd from Çatalhöyük (OpenContext)
   - Geology: Basalt core from mid-ocean ridge (SESAR)
   - Biology: Coral tissue sample (GEOME)
   - Full YAML examples (500+ lines each)
   - Domain-specific patterns and best practices
   - Cross-domain comparison tables

4. QUERYING_THE_GRAPH.md (15K words)
   - Practical SQL query patterns for DuckDB
   - Basic entity queries and single-hop traversals
   - Multi-hop traversal patterns (2-hop, 3-hop)
   - Aggregation and statistics queries
   - Filtering and search patterns
   - Complex query patterns (spatial, hierarchical)
   - Performance optimization techniques
   - Common query recipes (export, validation, GeoJSON)

5. EDGE_TYPES_VISUAL.md (9K words)
   - Mermaid diagrams showing entity relationships
   - Complete ERD of all 8 entity types and 14 edge types
   - Edge type matrix and connectivity heatmaps
   - Sample-centric and event-centric views
   - Graph traversal examples with path visualizations
   - Storage structure diagrams
   - Predicate usage patterns from real data
   - Cross-domain comparison charts

These documents address the challenge of discovering and understanding
the major structures in PQG files by making the "14 sentence types"
(the underlying grammar) explicit and accessible.

Each document cross-references the others for comprehensive coverage,
and all include real SQL examples, YAML snippets, and visualizations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Copilot AI left a comment

Pull Request Overview

This PR adds comprehensive documentation for the iSamples property graph (PQG) data format, making the underlying structure explicit and accessible. The documentation introduces the "14 sentence types" that form the complete grammar of iSamples metadata, along with the 8 entity types that compose the graph.

Key additions:

  • Foundation document explaining graph structure and the 14 relationship types
  • Detailed reference guide for each predicate with SQL and YAML examples
  • Real-world examples across three scientific domains (archaeology, geology, biology)
  • Practical SQL query patterns and optimization techniques
  • Visual diagrams showing entity relationships and graph traversal patterns

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 13 comments.

| File | Description |
|------|-------------|
| src/docs/UNDERSTANDING_THE_GRAPH.md | Introduces the 8 entity types, 14 predicates, and explains the property graph model with traversal patterns and storage format |
| src/docs/PREDICATES_REFERENCE.md | Comprehensive reference for all 14 predicates including usage examples, SQL patterns, OpenContext statistics, and cross-domain comparison |
| src/docs/EXAMPLES_BY_DOMAIN.md | Complete YAML examples from archaeology (OpenContext), geology (SESAR), and biology (GEOME) demonstrating domain-agnostic design |
| src/docs/QUERYING_THE_GRAPH.md | Practical SQL guide with query patterns for DuckDB, including basic to complex traversals, aggregations, and 10+ copy-paste recipes |
| src/docs/EDGE_TYPES_VISUAL.md | Visual guide with Mermaid diagrams showing entity relationships, connectivity matrices, traversal paths, and usage heatmaps |

---

**Document Version:** 1.0
**Last Updated:** 2025-11-14

Copilot AI Nov 14, 2025

The date "2025-11-14" appears to be incorrect. The PR description states "it is currently November 2025" but the year 2025 hasn't occurred yet as of the knowledge cutoff (January 2025). This date should likely be "2024-11-14" or the current actual date.

Suggested change
- **Last Updated:** 2025-11-14
+ **Last Updated:** 2024-11-14

Copilot uses AI. Check for mistakes.

---

**Last updated:** 2025-11-14

Copilot AI Nov 14, 2025

The date "2025-11-14" appears to be incorrect. This should likely be "2024-11-14" or the current actual date, as the year 2025 is in the future.

Suggested change
- **Last updated:** 2025-11-14
+ **Last updated:** 2024-11-14

---

**Document Version:** 1.0
**Last Updated:** 2025-11-14

Copilot AI Nov 14, 2025

The date "2025-11-14" appears to be incorrect. This should likely be "2024-11-14" or the current actual date, as the year 2025 is in the future.

Suggested change
- **Last Updated:** 2025-11-14
+ **Last Updated:** 2024-11-14

This table shows which entity types (subjects) connect to which entity types (objects) via which predicates.

| **Subject Type** | **Predicate** | **Object Type** | **Multivalued** | **Required** |

Copilot AI Nov 14, 2025

Inconsistent capitalization in markdown table header. The header "Multivalued" should match the style of other headers. Consider using "Multi-valued" for consistency with hyphenated compound adjectives elsewhere in the documentation.

Suggested change
- | **Subject Type** | **Predicate** | **Object Type** | **Multivalued** | **Required** |
+ | **Subject Type** | **Predicate** | **Object Type** | **Multi-valued** | **Required** |

Comment on lines +14 to +16
| [has_material_category](#has_material_category) | MaterialSampleRecord → IdentifiedConcept | Many | ✅ Yes | Material type |
| [has_context_category](#has_context_category) | MaterialSampleRecord → IdentifiedConcept | Many | ✅ Yes | Domain context |
| [has_sample_object_type](#has_sample_object_type) | MaterialSampleRecord → IdentifiedConcept | Many | ✅ Yes | Physical form |

Copilot AI Nov 14, 2025

Documentation inconsistency: The table lists has_material_category, has_context_category, and has_sample_object_type as "required" with checkmarks (✅ Yes), but their cardinality is listed as "Many" rather than a minimum requirement. According to line 126-127, has_material_category is "required, minimum 1", which should be more clearly indicated. Consider adding a column for minimum cardinality or clarifying in the "Cardinality" column (e.g., "Many (≥1)").

Suggested change
- | [has_material_category](#has_material_category) | MaterialSampleRecord → IdentifiedConcept | Many | ✅ Yes | Material type |
- | [has_context_category](#has_context_category) | MaterialSampleRecord → IdentifiedConcept | Many | ✅ Yes | Domain context |
- | [has_sample_object_type](#has_sample_object_type) | MaterialSampleRecord → IdentifiedConcept | Many | ✅ Yes | Physical form |
+ | [has_material_category](#has_material_category) | MaterialSampleRecord → IdentifiedConcept | Many (≥1) | ✅ Yes | Material type |
+ | [has_context_category](#has_context_category) | MaterialSampleRecord → IdentifiedConcept | Many (≥1) | ✅ Yes | Domain context |
+ | [has_sample_object_type](#has_sample_object_type) | MaterialSampleRecord → IdentifiedConcept | Many (≥1) | ✅ Yes | Physical form |
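
A "Many (≥1)" constraint like the one proposed here can be checked mechanically by counting edge targets per sample. The following is a minimal Python sketch, not part of the PR: it assumes the three predicates really are required (the quoted tables disagree on this), and the function name `missing_required`, the sample IDs, and the concept IDs are hypothetical.

```python
from collections import Counter

# Assumption: these three predicates are required with minimum cardinality 1.
REQUIRED = ("has_material_category", "has_context_category", "has_sample_object_type")


def missing_required(sample_ids, edges, required=REQUIRED):
    """Map each sample to the required predicates it has no target for."""
    counts = Counter()
    for edge in edges:
        # An edge's o field is a list, so count targets rather than edge rows.
        counts[(edge["s"], edge["p"])] += len(edge["o"])
    report = {}
    for sample_id in sample_ids:
        missing = [p for p in required if counts[(sample_id, p)] == 0]
        if missing:
            report[sample_id] = missing
    return report


# Hypothetical edges: sample_001 is complete, sample_002 is not.
edges = [
    {"s": "sample_001", "p": "has_material_category", "o": ["concept_earthenware"]},
    {"s": "sample_001", "p": "has_context_category", "o": ["concept_site"]},
    {"s": "sample_001", "p": "has_sample_object_type", "o": ["concept_artifact"]},
    {"s": "sample_002", "p": "has_material_category", "o": ["concept_rock"]},
]

report = missing_required(["sample_001", "sample_002"], edges)
print(report)
```

Counting targets instead of edge rows means the check gives the same answer whether a provider writes one edge per concept or one edge with several concepts in `o`.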

---

**Last updated:** 2025-11-14

Copilot AI Nov 14, 2025

The date "2025-11-14" appears to be incorrect. This should likely be "2024-11-14" or the current actual date, as the year 2025 is in the future.

Suggested change
- **Last updated:** 2025-11-14
+ **Last updated:** 2024-11-14

2. **Edge rows** have `otype = '_edge_'`
3. **Edge `s` field** points to subject entity's `row_id`
4. **Edge `p` field** contains the predicate name (e.g., `produced_by`)
5. **Edge `o` field** is an **array** of object entity `row_id`s (supports multivalued)
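
The edge rules quoted above can be illustrated with a small in-memory stand-in for the unified table. This is a minimal sketch in plain Python rather than DuckDB, so it runs without the parquet file. Only the column names `row_id`, `otype`, `s`, `p`, and `o` come from the quoted text (`pid` appears in the PR's SQL); the row values and the helper name `objects_of` are made up for illustration.

```python
# Hypothetical rows: entities and edges live side by side in one table.
rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord", "pid": "sample_001"},
    {"row_id": 2, "otype": "SamplingEvent", "pid": "event_001"},
    {"row_id": 3, "otype": "IdentifiedConcept", "pid": "concept_earthenware"},
    # Edge rows: otype is '_edge_', s is the subject row_id,
    # p is the predicate name, o is a list of object row_ids.
    {"row_id": 4, "otype": "_edge_", "s": 1, "p": "produced_by", "o": [2]},
    {"row_id": 5, "otype": "_edge_", "s": 1, "p": "has_material_category", "o": [3]},
]

# Index every row by row_id so edge targets can be resolved.
by_id = {row["row_id"]: row for row in rows}


def objects_of(subject_id, predicate):
    """Return the entity rows reached from subject_id via predicate."""
    found = []
    for row in rows:
        if row["otype"] == "_edge_" and row["s"] == subject_id and row["p"] == predicate:
            found.extend(by_id[obj_id] for obj_id in row["o"])
    return found


event_pids = [r["pid"] for r in objects_of(1, "produced_by")]
material_pids = [r["pid"] for r in objects_of(1, "has_material_category")]
print(event_pids, material_pids)
```

Each call to `objects_of` corresponds to one hop; a multi-hop traversal chains the calls, feeding the `row_id` of each result into the next lookup.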

Copilot AI Nov 14, 2025

Typo in the comment. "Multivalued" should be "Multi-valued" to match the hyphenated form used elsewhere in the documentation for this compound adjective.

Suggested change
- 5. **Edge `o` field** is an **array** of object entity `row_id`s (supports multivalued)
+ 5. **Edge `o` field** is an **array** of object entity `row_id`s (supports multi-valued)

Comment on lines +174 to +184
# Edges (multivalued - can have multiple material types)
edge_001:
  s: sample_001
  p: has_material_category
  o: [concept_earthenware]

edge_002:
  s: sample_001
  p: has_material_category
  o: [concept_anthropogenic]
```

Copilot AI Nov 14, 2025

[nitpick] Documentation clarity: The comment "# Edges (multivalued - can have multiple material types)" on line 174 is misleading. While edges can be multivalued, this specific comment appears in a YAML structure where each edge only connects to one concept. The multivalued nature means there can be multiple separate edges with the same predicate, not that a single edge has multiple targets. Consider clarifying: "# Edges (can have multiple edges with same predicate for different material types)"
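
The distinction drawn here, several edges with one target each versus one edge with several targets, disappears once edges are expanded into triples. The following Python sketch illustrates that; it makes no claim about which encoding PQG writers actually emit, and the helper name `triples` is hypothetical.

```python
def triples(edges):
    """Expand edge rows into a set of (subject, predicate, object) triples."""
    return {(e["s"], e["p"], obj) for e in edges for obj in e["o"]}


# Encoding used in the quoted YAML: one edge row per object.
separate_edges = [
    {"s": "sample_001", "p": "has_material_category", "o": ["concept_earthenware"]},
    {"s": "sample_001", "p": "has_material_category", "o": ["concept_anthropogenic"]},
]

# Alternative encoding: one edge row whose o array lists both objects.
single_edge = [
    {
        "s": "sample_001",
        "p": "has_material_category",
        "o": ["concept_earthenware", "concept_anthropogenic"],
    },
]

same_graph = triples(separate_edges) == triples(single_edge)
print(same_graph)
```

A reader that always iterates over `o` handles both encodings, so the wording of the YAML comment matters mainly for people writing edges, not for code querying them.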

| **Subject Type** | **Predicate** | **Object Type** | **Multivalued** | **Required** |
|------------------|---------------|-----------------|-----------------|--------------|
| MaterialSampleRecord | `produced_by` | SamplingEvent | No | Yes |
| MaterialSampleRecord | `has_material_category` | IdentifiedConcept | Yes | No |

Copilot AI Nov 14, 2025

Inconsistent table formatting: The table on line 82 has "No" for multivalued, but the description on line 126 says "Many (required, minimum 1)". The table should indicate "Yes" for multivalued since a sample can have multiple material categories. This is inconsistent with the actual behavior described in the detailed section.

Comment on lines +698 to +728
```sql
-- Create GeoJSON for web mapping
SELECT json_object(
    'type', 'FeatureCollection',
    'features', json_group_array(
        json_object(
            'type', 'Feature',
            'geometry', json_object(
                'type', 'Point',
                'coordinates', json_array(coords.longitude, coords.latitude)
            ),
            'properties', json_object(
                'id', sample.pid,
                'label', sample.label,
                'material', material.label
            )
        )
    )
) AS geojson
FROM pqg AS sample
JOIN pqg AS mat_edge ON mat_edge.s = sample.row_id AND mat_edge.p = 'has_material_category'
JOIN pqg AS material ON material.row_id = ANY(mat_edge.o)
JOIN pqg AS event_edge ON event_edge.s = sample.row_id AND event_edge.p = 'produced_by'
JOIN pqg AS event ON event.row_id = ANY(event_edge.o)
JOIN pqg AS coord_edge ON coord_edge.s = event.row_id AND coord_edge.p = 'sample_location'
JOIN pqg AS coords ON coords.row_id = ANY(coord_edge.o)
WHERE sample.otype = 'MaterialSampleRecord'
  AND coords.latitude IS NOT NULL
  AND coords.longitude IS NOT NULL
LIMIT 1000;
```

Copilot AI Nov 14, 2025

SQL syntax warning: The query uses json_group_array() which is SQLite-specific syntax. Since the document states "All queries are designed for DuckDB" (line 3), this should use DuckDB's JSON functions instead. DuckDB uses different JSON aggregation functions like list() or array aggregation with to_json(). Consider updating this example to use DuckDB-compatible syntax or noting that this specific example requires SQLite.
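
Two points are worth checking before acting on this comment. DuckDB's JSON extension documents a `json_group_array` aggregate, so the SQLite-only claim should be verified against the DuckDB version in use. Separately, in the quoted query `LIMIT 1000` is applied after the aggregate has already collapsed everything into a single row, so it does not cap the number of features. A minimal Python sketch of the apparent intent, capping rows before they are aggregated, follows. It assumes the joins have already produced one dict per sample; the function name `to_feature_collection`, the dict keys, and the sample values are hypothetical.

```python
import json
from itertools import islice


def to_feature_collection(joined_rows, limit=1000):
    """Build a GeoJSON FeatureCollection from at most `limit` located rows."""
    # Skip rows without coordinates, mirroring the IS NOT NULL filters.
    located = (
        row for row in joined_rows
        if row["latitude"] is not None and row["longitude"] is not None
    )
    features = []
    # The cap is applied to rows here, before they are aggregated.
    for row in islice(located, limit):
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON orders coordinates as [longitude, latitude].
                "coordinates": [row["longitude"], row["latitude"]],
            },
            "properties": {
                "id": row["pid"],
                "label": row["label"],
                "material": row["material"],
            },
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up rows standing in for the result of the sample/event/location joins.
joined_rows = [
    {"pid": "sample_001", "label": "Pottery sherd", "material": "Earthenware",
     "latitude": 37.67, "longitude": 32.83},
    {"pid": "sample_002", "label": "Basalt core", "material": "Rock",
     "latitude": None, "longitude": None},
    {"pid": "sample_003", "label": "Coral tissue", "material": "Organic material",
     "latitude": -17.5, "longitude": -149.8},
]

collection = to_feature_collection(joined_rows)
capped = to_feature_collection(joined_rows, limit=1)
geojson_text = json.dumps(collection)
```

In SQL, the equivalent change would be to move the `LIMIT` into a subquery that selects the joined rows and to aggregate over that subquery.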
