KeyError: 'id' in generate_concept() due to orphan nodes created by add_edge() #45

@FredDsR

Hello, I'm using your KG construction pipeline (atlas-rag 0.0.5.post1) in my experiments and I'm very grateful for your work. While running concept generation on a ~13k-node graph, I hit the following issue in the concept generation module.

Description

generate_concept() in concept_generation.py crashes with KeyError: 'id' when accessing neighbor node attributes. The root cause is in csvs_to_temp_graphml() (csv_to_graphml.py): when edges reference nodes not present in the triple_nodes CSV, nx.DiGraph.add_edge() auto-creates those nodes without any attributes (id, type). Later, generate_concept() accesses temp_kg.nodes[neighbor]['id'] unconditionally and crashes.
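The mechanism can be reproduced in isolation. The snippet below is a minimal sketch with made-up node keys and labels ("n1", "n2", "Alice"); it is not the pipeline code and only shows how add_edge() leaves an endpoint with an empty attribute dict.

```python
import networkx as nx

g = nx.DiGraph()
# Node added explicitly with attributes, as csv_to_graphml.py does for CSV nodes.
g.add_node("n1", id="Alice", type="Entity")
# "n2" was never added; add_edge() creates it with no attributes at all.
g.add_edge("n1", "n2", relation="knows")

# Nodes that lack the 'id' attribute are the orphans described above.
orphans = [n for n, attrs in g.nodes(data=True) if "id" not in attrs]
print(orphans)

try:
    g.nodes["n2"]["id"]
except KeyError as exc:
    print("KeyError:", exc)
```

The same list comprehension can be run against the real temp graph to count how many nodes are affected.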

Reproducing

This occurs on larger, real-world KGs where the edges CSV contains :START_ID or :END_ID values not present in the nodes CSV name:ID column. In our case, ~1,700 out of 13,376 nodes were orphans.
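As a rough pre-check, the two CSVs can be compared directly. This is a sketch under assumptions: the function name find_orphan_endpoints and the sample data are hypothetical, and it compares raw names, whereas the pipeline maps names through get_node_id() first, so it is only a proxy for what ends up in the graph.

```python
import csv
import io

def find_orphan_endpoints(nodes_file, edges_file):
    """Return edge endpoint names that are missing from the nodes CSV."""
    known = {row["name:ID"] for row in csv.DictReader(nodes_file)}
    orphans = set()
    for row in csv.DictReader(edges_file):
        for column in (":START_ID", ":END_ID"):
            if row[column] not in known:
                orphans.add(row[column])
    return orphans

# Tiny in-memory stand-ins for the two CSV files (made-up data).
nodes_csv = io.StringIO("name:ID,type\nAlice,Entity\nBob,Entity\n")
edges_csv = io.StringIO(
    ":START_ID,:END_ID,relation,:TYPE\n"
    "Alice,Bob,knows,Relation\n"
    "Alice,Carol,knows,Relation\n"
)
missing = find_orphan_endpoints(nodes_csv, edges_csv)
print(sorted(missing))
```

With real files, the two arguments would be file objects opened with newline="" as the csv module recommends.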

Root Cause

csv_to_graphml.py lines 51 vs 62:

# Line 51 — adds node WITH attributes
g.add_node(mapped_id, id=node_id, type=row["type"])

# Line 62 — add_edge implicitly creates nodes WITHOUT attributes
g.add_edge(start_id, end_id, relation=row["relation"], type=row[":TYPE"])

concept_generation.py lines 212, 216 — no guard for missing attributes:

context += ", ".join([
    f"{temp_kg.nodes[neighbor]['id']} {temp_kg[neighbor][node_id]['relation']}"
    for neighbor in random_two_neighbors
])

Suggested Fix

In csvs_to_temp_graphml(), ensure edge endpoints exist as fully-attributed nodes before adding the edge:

for row in reader:
    start_id = get_node_id(row[":START_ID"], entity_to_id)
    end_id = get_node_id(row[":END_ID"], entity_to_id)
    # Ensure both endpoints exist with attributes
    if start_id not in g.nodes:
        g.add_node(start_id, id=row[":START_ID"], type="Entity")
    if end_id not in g.nodes:
        g.add_node(end_id, id=row[":END_ID"], type="Entity")
    if not g.has_edge(start_id, end_id):
        g.add_edge(start_id, end_id, relation=row["relation"], type=row[":TYPE"])

And/or add a defensive guard in generate_concept():

if node_id not in temp_kg or 'id' not in temp_kg.nodes.get(node_id, {}):
    continue
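Since the traceback is raised on the neighbor lookup, a guard on node_id alone may not cover every case. The sketch below filters the neighbors instead. The helper name neighbor_context and the toy graph are hypothetical, and how generate_concept() actually picks random_two_neighbors is not shown here, so treat this as an illustration rather than a drop-in patch.

```python
import networkx as nx

def neighbor_context(kg, node_id, neighbors):
    """Join '<neighbor id> <relation>' for neighbors that carry an 'id'."""
    parts = []
    for neighbor in neighbors:
        label = kg.nodes[neighbor].get("id")
        if label is None:
            continue  # orphan created implicitly by add_edge(); skip it
        parts.append(f"{label} {kg[neighbor][node_id]['relation']}")
    return ", ".join(parts)

kg = nx.DiGraph()
kg.add_node("a", id="Alice", type="Entity")
kg.add_node("c", id="Carol", type="Entity")
kg.add_edge("a", "c", relation="knows")
kg.add_edge("b", "c", relation="likes")  # "b" has no attributes

context = neighbor_context(kg, "c", list(kg.predecessors("c")))
print(context)
```

Skipping orphans silently drops some context, so fixing the graph construction is probably still the better primary fix.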

Additional: Cache bug in get_node_id()

There's also a cache inefficiency in get_node_id() (csv_to_graphml.py:29-38). The function appends '_entity' to entity_name before storing in entity_to_id, but the cache lookup on line 31 checks the original (unmodified) name, so the cache is never hit:

def get_node_id(entity_name, entity_to_id={}):
    if entity_name not in entity_to_id:       # checks "foo"
        entity_name = entity_name + '_entity'  # mutates to "foo_entity"
        ...
        entity_to_id[entity_name] = hash_hex   # stores under "foo_entity"
    return entity_to_id[entity_name]           # looks up "foo_entity" — works only because
                                               # entity_name was already mutated above

This doesn't cause incorrect hashes (the suffix is appended before hashing), but every call recomputes the hash: "foo" is never stored as a key, only "foo_entity", so the membership check on the unmodified name always fails.
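One possible shape for a fix is to build the suffixed key first and use it for both the lookup and the store. This is a sketch only: the hashing step is elided in the snippet above, so hashlib.sha256 is a stand-in assumption, not the library's actual hash, and the cache is passed in explicitly instead of relying on a mutable default argument.

```python
import hashlib

def get_node_id(entity_name, entity_to_id):
    """Return a cached ID for entity_name, hashing only on first sight."""
    key = entity_name + "_entity"  # same key for lookup and storage
    if key not in entity_to_id:
        # Assumption: sha256 stands in for the elided hashing logic.
        entity_to_id[key] = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return entity_to_id[key]

cache = {}
first = get_node_id("foo", cache)
second = get_node_id("foo", cache)  # served from the cache this time
print(first == second, list(cache))
```

Keying consistently also means a name that already ends in "_entity" gets its own entry rather than matching another entity's cached key.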

Environment

  • atlas-rag 0.0.5.post1
  • Python 3.13
  • NetworkX (latest)
  • Model: qwen3:14b
  • ~13,000 nodes, ~9,500 edges

Traceback

concept_generation.py:212 in generate_concept
    context += ", ".join([f"{temp_kg.nodes[neighbor]['id']} ..."
KeyError: 'id'
