KeyError: 'id' in `generate_concept()` due to orphan nodes created by `add_edge()`

Hello, I'm using your KG construction pipeline (atlas-rag 0.0.5.post1) in my experiments and I'm very grateful for your work. While running concept generation on a ~13k-node graph, I hit a crash in the concept generation module, described below.

### Description

`generate_concept()` in `concept_generation.py` crashes with `KeyError: 'id'` when accessing neighbor node attributes. The root cause is in `csvs_to_temp_graphml()` (`csv_to_graphml.py`): when edges reference nodes not present in the `triple_nodes` CSV, `nx.DiGraph.add_edge()` auto-creates those nodes **without any attributes** (`id`, `type`). Later, `generate_concept()` accesses `temp_kg.nodes[neighbor]['id']` unconditionally and crashes.

### Reproducing

This occurs whenever the edges CSV contains `:START_ID` or `:END_ID` values that are not present in the `name:ID` column of the nodes CSV, which is common on larger, real-world KGs. In our case, ~1,700 of 13,376 nodes were orphans.
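To check whether a dataset is affected before building the graph, the orphan endpoints can be listed directly from the two CSVs. This is a minimal sketch that assumes the column names mentioned above (`name:ID`, `:START_ID`, `:END_ID`). `find_orphan_endpoints` is a hypothetical helper and not part of atlas-rag, and the sample rows are made up for illustration:

```python
import csv
import io


def find_orphan_endpoints(nodes_file, edges_file):
    """Return edge endpoint IDs that never appear in the nodes CSV.

    Hypothetical helper, not part of atlas-rag. Both arguments are
    open text file objects.
    """
    known = {row["name:ID"] for row in csv.DictReader(nodes_file)}
    orphans = set()
    for row in csv.DictReader(edges_file):
        for col in (":START_ID", ":END_ID"):
            if row[col] not in known:
                orphans.add(row[col])
    return orphans


# Tiny in-memory example: "c" is referenced by an edge but has no node row.
nodes_csv = io.StringIO("name:ID,type\na,Entity\nb,Entity\n")
edges_csv = io.StringIO(
    ":START_ID,:END_ID,relation,:TYPE\n"
    "a,b,knows,Relation\n"
    "b,c,knows,Relation\n"
)
print(find_orphan_endpoints(nodes_csv, edges_csv))  # {'c'}
```

For real files, pass `open(path, newline="", encoding="utf-8")` handles instead of the `StringIO` objects.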

### Root Cause

**`csv_to_graphml.py` lines 51 vs 62:**

```python
# Line 51 — adds node WITH attributes
g.add_node(mapped_id, id=node_id, type=row["type"])

# Line 62 — add_edge implicitly creates nodes WITHOUT attributes
g.add_edge(start_id, end_id, relation=row["relation"], type=row[":TYPE"])
```

**`concept_generation.py` lines 212, 216** — no guard for missing attributes:

```python
context += ", ".join([
    f"{temp_kg.nodes[neighbor]['id']} {temp_kg[neighbor][node_id]['relation']}"
    for neighbor in random_two_neighbors
])
```
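The two snippets above interact as follows. This standalone sketch uses only NetworkX, with made-up node keys and values, to show that `add_edge()` creates a missing endpoint with an empty attribute dict, and that the same access pattern used in `generate_concept()` then raises `KeyError`:

```python
import networkx as nx

g = nx.DiGraph()
g.add_node("n1", id="Alice", type="Entity")

# "n2" was never added explicitly, so add_edge() creates it
# with an empty attribute dict.
g.add_edge("n2", "n1", relation="knows")

print(g.nodes["n2"])  # {}

try:
    # Same access pattern as in generate_concept()
    label = f"{g.nodes['n2']['id']} {g['n2']['n1']['relation']}"
except KeyError as exc:
    print(f"KeyError: {exc}")  # KeyError: 'id'
```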

### Suggested Fix

In `csvs_to_temp_graphml()`, ensure edge endpoints exist as fully-attributed nodes before adding the edge:

```python
for row in reader:
    start_id = get_node_id(row[":START_ID"], entity_to_id)
    end_id = get_node_id(row[":END_ID"], entity_to_id)
    # Ensure both endpoints exist with attributes
    if start_id not in g.nodes:
        g.add_node(start_id, id=row[":START_ID"], type="Entity")
    if end_id not in g.nodes:
        g.add_node(end_id, id=row[":END_ID"], type="Entity")
    if not g.has_edge(start_id, end_id):
        g.add_edge(start_id, end_id, relation=row["relation"], type=row[":TYPE"])
```

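And/or add a defensive guard in `generate_concept()`. Because the crash comes from `temp_kg.nodes[neighbor]['id']`, the guard has to cover the sampled neighbors as well as the node itself:

```python
if node_id not in temp_kg or 'id' not in temp_kg.nodes[node_id]:
    continue
# Also drop neighbors that add_edge() auto-created without an 'id'
random_two_neighbors = [n for n in random_two_neighbors if 'id' in temp_kg.nodes[n]]
```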

### Additional: Cache bug in `get_node_id()`

There's also a cache inefficiency in `get_node_id()` (`csv_to_graphml.py:29-38`). The function appends `'_entity'` to `entity_name` before storing in `entity_to_id`, but the cache lookup on line 31 checks the original (unmodified) name, so the cache is never hit:

```python
def get_node_id(entity_name, entity_to_id={}):
    if entity_name not in entity_to_id:       # checks "foo"
        entity_name = entity_name + '_entity'  # mutates to "foo_entity"
        ...
        entity_to_id[entity_name] = hash_hex   # stores under "foo_entity"
    return entity_to_id[entity_name]           # looks up "foo_entity" — works only because
                                               # entity_name was already mutated above
```

This doesn't cause incorrect hashes (the mutation happens before hashing), but it means every call recomputes the hash since `"foo"` is never found as a key — only `"foo_entity"` is stored.
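One possible fix is to build the suffixed key first and use it for both the lookup and the store. This is only a sketch: the hashing step is elided in the snippet above, so SHA-256 is a placeholder assumption here, not necessarily what atlas-rag uses. It also replaces the mutable default argument with `None`, which is my own change rather than part of the original function:

```python
import hashlib


def get_node_id(entity_name, entity_to_id=None):
    """Return a hashed ID for entity_name, caching under the suffixed key.

    Sketch only: SHA-256 stands in for the elided hashing step.
    """
    if entity_to_id is None:
        entity_to_id = {}
    key = entity_name + "_entity"   # build the cache key first
    if key not in entity_to_id:     # look up the same key that gets stored
        entity_to_id[key] = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return entity_to_id[key]


cache = {}
first = get_node_id("foo", cache)
second = get_node_id("foo", cache)  # served from the cache this time
print(first == second, list(cache))  # True ['foo_entity']
```

With `None` as the default, a caller that omits `entity_to_id` gets no caching across calls, so the cache dict should be passed explicitly, as the loop in `csvs_to_temp_graphml()` already does.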

### Environment

- `atlas-rag` 0.0.5.post1
- Python 3.13
- NetworkX (latest)
- Model: qwen3:14b
- ~13,000 nodes, ~9,500 edges

### Traceback

```
concept_generation.py:212 in generate_concept
    context += ", ".join([f"{temp_kg.nodes[neighbor]['id']} ..."
KeyError: 'id'
```

