Hello, I'm using your KG construction pipeline (atlas-rag 0.0.5.post1) in my experiments and I'm very grateful for your work. While running concept generation on a ~13k-node graph, I hit an issue in the concept generation module, as follows.
Description
generate_concept() in concept_generation.py crashes with KeyError: 'id' when accessing neighbor node attributes. The root cause is in csvs_to_temp_graphml() (csv_to_graphml.py): when edges reference nodes not present in the triple_nodes CSV, nx.DiGraph.add_edge() auto-creates those nodes without any attributes (id, type). Later, generate_concept() accesses temp_kg.nodes[neighbor]['id'] unconditionally and crashes.
Reproducing
This occurs on larger, real-world KGs where the edges CSV contains :START_ID or :END_ID values not present in the nodes CSV name:ID column. In our case, ~1,700 out of 13,376 nodes were orphans of this kind (referenced by an edge but missing from the nodes CSV).
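To check whether a given pair of CSVs is affected, something like the following could be run before building the graph. It is only a sketch: the column names (name:ID, :START_ID, :END_ID) are the ones mentioned above, while the function name find_missing_endpoints and the in-memory sample data are made up for illustration.

```python
import csv
import io


def find_missing_endpoints(nodes_file, edges_file):
    """Return edge endpoint names that never appear in the nodes CSV."""
    known = {row["name:ID"] for row in csv.DictReader(nodes_file)}
    missing = set()
    for row in csv.DictReader(edges_file):
        for column in (":START_ID", ":END_ID"):
            if row[column] not in known:
                missing.add(row[column])
    return missing


# Tiny in-memory stand-ins for the real CSV files (made-up data).
nodes_csv = io.StringIO("name:ID,type\nalice,Entity\nbob,Entity\n")
edges_csv = io.StringIO(
    ":START_ID,:END_ID,relation,:TYPE\n"
    "alice,bob,knows,Relation\n"
    "alice,carol,knows,Relation\n"
)

missing = find_missing_endpoints(nodes_csv, edges_csv)
print(sorted(missing))  # ['carol']
```

With real files, the two StringIO objects would be replaced by open(..., newline="") handles for the nodes and edges CSVs.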
Root Cause
csv_to_graphml.py lines 51 vs 62:
# Line 51 — adds node WITH attributes
g.add_node(mapped_id, id=node_id, type=row["type"])
# Line 62 — add_edge implicitly creates nodes WITHOUT attributes
g.add_edge(start_id, end_id, relation=row["relation"], type=row[":TYPE"])
concept_generation.py lines 212, 216 — no guard for missing attributes:
context += ", ".join([
    f"{temp_kg.nodes[neighbor]['id']} {temp_kg[neighbor][node_id]['relation']}"
    for neighbor in random_two_neighbors
])
Suggested Fix
In csvs_to_temp_graphml(), ensure edge endpoints exist as fully-attributed nodes before adding the edge:
for row in reader:
    start_id = get_node_id(row[":START_ID"], entity_to_id)
    end_id = get_node_id(row[":END_ID"], entity_to_id)
    # Ensure both endpoints exist with attributes
    if start_id not in g.nodes:
        g.add_node(start_id, id=row[":START_ID"], type="Entity")
    if end_id not in g.nodes:
        g.add_node(end_id, id=row[":END_ID"], type="Entity")
    if not g.has_edge(start_id, end_id):
        g.add_edge(start_id, end_id, relation=row["relation"], type=row[":TYPE"])
And/or add a defensive guard in generate_concept():
if node_id not in temp_kg or 'id' not in temp_kg.nodes.get(node_id, {}):
    continue
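Note that the lookup that fails in the traceback is on the neighbor, not on node_id, so the neighbor list probably needs the same check. A minimal sketch of that filter, assuming temp_kg.nodes behaves like a mapping from node key to attribute dict (as NetworkX's node view does); the helper name neighbors_with_id is hypothetical and plain dicts stand in for the graph here:

```python
def neighbors_with_id(nodes, neighbors):
    """Keep only neighbors whose attribute dict carries an 'id' key."""
    return [n for n in neighbors if "id" in nodes.get(n, {})]


# Plain dicts standing in for temp_kg.nodes (made-up data):
# "n1" is fully attributed, "n2" was auto-created by add_edge, "n3" is absent.
nodes = {"n1": {"id": "alice", "type": "Entity"}, "n2": {}}

usable = neighbors_with_id(nodes, ["n1", "n2", "n3"])
print(usable)  # ['n1']
```

In generate_concept() this would be applied to the candidate neighbors before the join, so attribute-less nodes are simply left out of the context string.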
Additional: Cache bug in get_node_id()
There's also a cache inefficiency in get_node_id() (csv_to_graphml.py:29-38). The function appends '_entity' to entity_name before storing in entity_to_id, but the cache lookup on line 31 checks the original (unmodified) name, so the cache is never hit:
def get_node_id(entity_name, entity_to_id={}):
    if entity_name not in entity_to_id:  # checks "foo"
        entity_name = entity_name + '_entity'  # mutates to "foo_entity"
        ...
        entity_to_id[entity_name] = hash_hex  # stores under "foo_entity"
    return entity_to_id[entity_name]  # looks up "foo_entity"; works only because
                                      # entity_name was already mutated above
This doesn't cause incorrect hashes (the mutation happens before hashing), but it means every call recomputes the hash since "foo" is never found as a key — only "foo_entity" is stored.
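One possible way to make the cache effective is to keep the suffixed name in a separate variable and store the result under the original name. This is a sketch, not the library's code: the hashing step is elided in the excerpt above, so sha256 is used here purely as a stand-in, and the mutable default argument is dropped so the cache is always passed explicitly.

```python
import hashlib


def get_node_id(entity_name, entity_to_id):
    """Return a stable hash ID for entity_name, computing it at most once."""
    if entity_name not in entity_to_id:
        # Hash the suffixed name, but cache under the ORIGINAL name so the
        # membership test above can succeed on later calls.
        suffixed = entity_name + "_entity"
        # Assumption: sha256 stands in for the elided hashing step.
        hash_hex = hashlib.sha256(suffixed.encode("utf-8")).hexdigest()
        entity_to_id[entity_name] = hash_hex
    return entity_to_id[entity_name]


cache = {}
first = get_node_id("foo", cache)
second = get_node_id("foo", cache)  # served from the cache this time
print(first == second, list(cache))  # True ['foo']
```

The trade-off is that the keys of entity_to_id change from "foo_entity" to "foo", so any other code that reads the dict by suffixed name would need to be adjusted.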
Environment
- atlas-rag 0.0.5.post1
- Python 3.13
- NetworkX (latest)
- Model: qwen3:14b
- ~13,000 nodes, ~9,500 edges
Traceback
concept_generation.py:212 in generate_concept
context += ", ".join([f"{temp_kg.nodes[neighbor]['id']} ..."
KeyError: 'id'