Commit 04f4699

nerpaula and Simran-B authored
DOC-766 | Add new optional parameters to GraphML (profiles, enableGpu). (#833)
* add new parameters to ui, update screenshots
* add parameters to notebook-api file
* Optimize and rename screenshots

---------

Co-authored-by: Simran Spiller <simran@arangodb.com>
1 parent 309f2d6 commit 04f4699

15 files changed, +30 -14 lines changed

site/content/ai-suite/graphml/notebooks-api.md

Lines changed: 10 additions & 0 deletions
@@ -165,6 +165,7 @@ but you can substitute them as follows for a schema description in terms of JSON

 - `jobConfiguration` (dict, _optional): A set of configurations that are applied to the job.
 - `batchSize` (int): The number of documents to process in a single batch. Default is `32`.
+- `profiles` (list): One or more profiles to specify pod configurations for the project (e.g., `["gpu-g4dn-xlarge"]`). Default is `None`.
 - `runAnalysisChecks` (bool): Whether to run analysis checks, used to perform a high-level analysis of the data quality before proceeding. Default is `true`.
 - `skipLabels` (bool): Skips the featurization process for attributes marked as `label`. Default is `false`.
 - `useFeatureStore` (bool): Enables the use of the Feature Store database, which allows you to store features separately from your Source Database. Default is `false`, therefore features are written to the source graph.
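
As a rough illustration of the `jobConfiguration` keys in this hunk, the following sketch builds such a dictionary in Python. The key names and defaults are taken from the list above; the helper `build_job_configuration` and the constant `JOB_CONFIGURATION_DEFAULTS` are hypothetical names introduced only for this example and are not part of the GraphML API.

```python
# Defaults copied from the parameter list above; everything else is illustrative.
JOB_CONFIGURATION_DEFAULTS = {
    "batchSize": 32,
    "profiles": None,
    "runAnalysisChecks": True,
    "skipLabels": False,
    "useFeatureStore": False,
}


def build_job_configuration(**overrides):
    """Hypothetical helper: merge overrides into the documented defaults."""
    unknown = set(overrides) - set(JOB_CONFIGURATION_DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown jobConfiguration keys: {sorted(unknown)}")
    profiles = overrides.get("profiles")
    if profiles is not None and not isinstance(profiles, list):
        raise TypeError("profiles must be a list of profile names, or None")
    return {**JOB_CONFIGURATION_DEFAULTS, **overrides}


# Request a GPU-capable pod profile (the profile name is the example from the docs).
job_configuration = build_job_configuration(profiles=["gpu-g4dn-xlarge"])
print(job_configuration)
```

Passing `profiles` as a list, even for a single profile, matches the `(list)` type stated in the added line.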
@@ -417,11 +418,17 @@ The Training Service depends on a **Training Specification**:
 - `inputFeatures` (str): The name of the feature to be used as input.
 - `labelField` (str): The name of the attribute to be predicted.
 - `batchSize` (int): The number of documents to process in a single training batch. Default is `64`.
+- `dataLoadBatchSize` (int): The number of documents loaded from ArangoDB into memory in a single batch during the data loading phase. Default is `50000`.
+- `dataLoadParallelism` (int): The number of parallel processes used when loading data from ArangoDB into memory for training. Default is `10`.
+- `enableGpu` (bool): Enables GPU-accelerated training using GPU-capable profiles configured for the project. Default is `false`.
 - `graphEmbeddings` (dict): Dictionary to describe the Graph Embedding Task Specification.
 - `targetCollection` (str): The ArangoDB collection used to generate the embeddings.
 - `embeddingSize` (int): The size of the embedding vector. Default is `128`.
 - `batchSize` (int): The number of documents to process in a single training batch. Default is `64`.
 - `generateEmbeddings` (bool): Whether to generate embeddings on the training dataset. Default is `false`.
+- `dataLoadBatchSize` (int): The number of documents loaded from ArangoDB into memory in a single batch during the data loading phase. Default is `50000`.
+- `dataLoadParallelism` (int): The number of parallel processes used when loading data from ArangoDB into memory for training. Default is `10`.
+- `enableGpu` (bool): Enables GPU-accelerated training using GPU-capable profiles configured for the project. Default is `false`.

 - `metagraph` (dict): Metadata to represent the node & edge collections of the graph. If `featureSetID` is provided, this can be omitted.
 - `graph` (str): The ArangoDB graph name.
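
The new training parameters could be written out as plain Python dictionaries, sketched below. The key names and defaults come from this hunk. The variable names, the placeholder values, and the way the first group is held in `classification_params` are assumptions, because the enclosing keys of the Training Specification are not visible in this part of the diff.

```python
# Data-loading and GPU settings added in this commit (defaults from the hunk above).
data_load_settings = {
    "dataLoadBatchSize": 50000,
    "dataLoadParallelism": 10,
    "enableGpu": True,  # documented default is false; enabled here for illustration
}

# Parameters listed before `graphEmbeddings`; the values are placeholders.
classification_params = {
    "inputFeatures": "my_features",  # placeholder feature name
    "labelField": "my_label",        # placeholder attribute name
    "batchSize": 64,
    **data_load_settings,
}

# Parameters listed under `graphEmbeddings`; the collection name is a placeholder.
graph_embeddings_params = {
    "targetCollection": "my_collection",
    "embeddingSize": 128,
    "batchSize": 64,
    "generateEmbeddings": False,
    **data_load_settings,
}

print(classification_params)
print({"graphEmbeddings": graph_embeddings_params})
```

Note that `batchSize` (documents per training batch) and `dataLoadBatchSize` (documents loaded from ArangoDB per batch) are separate settings with very different defaults.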
@@ -736,6 +743,9 @@ The Prediction Service depends on a **Prediction Specification**:
 - `modelID` (str): The model ID to use for generating predictions.
 - `featurizeNewDocuments` (bool): Boolean for enabling or disabling the featurization of new documents. Useful if you don't want to re-train the model upon new data. Default is `false`.
 - `featurizeOutdatedDocuments` (bool): Boolean for enabling or disabling the featurization of outdated documents. Outdated documents are those whose features have changed since the last featurization. Default is `false`.
+- `dataLoadBatchSize` (int): The number of documents to load in a single batch. Default is `500000`.
+- `dataLoadParallelism` (int): The number of parallel threads used to process the prediction workload. Default is `10`.
+- `enableGpu` (bool): Enables GPU-accelerated prediction using GPU-capable profiles configured for the project. Default is `false`.
 - `schedule` (str): A cron expression to schedule the prediction job. The cron syntax is a set of
 five fields in a line, indicating when the job should be executed. The format must follow
 the following order: `minute` `hour` `day-of-month` `month` `day-of-week`
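
A minimal sketch of the prediction parameters from this hunk, including a five-field `schedule`, might look as follows. The key names and defaults are from the list above. The `modelID` value is a placeholder, `has_five_cron_fields` is a hypothetical helper that only counts fields (it does not validate their ranges), and the flat layout of the dictionary is an assumption.

```python
def has_five_cron_fields(expression):
    """Hypothetical check: a cron expression here has exactly five fields."""
    return len(expression.split()) == 5


prediction_params = {
    "modelID": "<model-id>",  # placeholder, not a real model ID
    "featurizeNewDocuments": False,
    "featurizeOutdatedDocuments": False,
    "dataLoadBatchSize": 500000,
    "dataLoadParallelism": 10,
    "enableGpu": False,
    # minute hour day-of-month month day-of-week: every day at 02:00
    "schedule": "0 2 * * *",
}

if not has_five_cron_fields(prediction_params["schedule"]):
    raise ValueError("schedule must have exactly five cron fields")
print(prediction_params["schedule"])
```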

site/content/ai-suite/graphml/ui.md

Lines changed: 20 additions & 14 deletions
@@ -23,7 +23,7 @@ To create a new GraphML project using the Arango Data Platform web interface, fo

 1. From the left-hand sidebar, select the database where you want to create the project.
 2. In the left-hand sidebar, click **AI Suite** to open the GraphML project management interface, then click **Run GraphML**.
-![Create GraphML Project](../../images/create-graphml-project-ui.png)
+![Create GraphML Project](../../images/graphml-ui-create-project.png)
 3. In the **GraphML projects** view, click **Add new project**.
 4. The **Create ML project** modal opens. Enter a **Name** for your machine learning project.
 5. Click the **Create project** button to finalize the creation.
@@ -54,6 +54,8 @@ format on the right side of the screen for transparency.
 In the **Configuration** tab, you can control the overall featurization job and
 how features are stored.
 - **Batch size**: The number of documents to process in a single batch.
+- **Profiles**: Add one or more profiles to specify pod configurations for the
+project (e.g., `gpu-g4dn-xlarge`).
 - **Run analysis checks**: Whether to run analysis checks to perform a high-level
 analysis of the data quality before proceeding. The default value is `true`.
 - **Skip labels**: Skip the featurization process for attributes marked as labels.
@@ -73,20 +75,20 @@ Real-world datasets often contain missing values or mismatched data types. Use
 the strategies below to control how each feature type (**Text**, **Numeric**,
 **Category**, **Label**) handles these issues during featurization.

-| **Strategy type** | **Option** | **Description** | **When to use** |
-|-------------------|-----------------------|-----------------------------------------------------------------------------------------------------|---------------------------------------------------------------|
-| Missing | **Raise** | Stops the job and reports an error when a value is missing. | When missing data indicates a critical issue. |
-| | **Replace** | Substitutes missing values with a default you provide (e.g., `0` for numbers, `"unknown"` for text). | When missing values are expected. |
-| Mismatch | **Raise** | The strictest option. Stops the job on any data type mismatch. | When any data type mismatch indicates a critical error. |
-| | **Replace** | Replaces mismatched values with a default you provide, without trying to convert it first. | When mismatched values are unreliable, and you prefer to substitute it directly. |
-| | **Coerce and Raise** | Attempts to convert (coerce) the value to the correct type (e.g. string "123" to number `123`). If the conversion is successful, it uses the new value. If it fails, the job stops. | A balanced approach, often the best default strategy. |
-| | **Coerce and Replace**| The most forgiving option. The system first tries to convert the value. If it fails, it replaces the value with the specified default and continues the job. | For very dirty datasets where completing the job is the highest priority. |
+| **Strategy type** | **Option** | **Description** | **When to use** |
+|-------------------|------------------------|------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------|
+| Missing | **Raise** | Stops the job and reports an error when a value is missing. | When missing data indicates a critical issue. |
+| | **Replace** | Substitutes missing values with a default you provide (e.g., `0` for numbers, `"unknown"` for text). | When missing values are expected. |
+| Mismatch | **Raise** | The strictest option. Stops the job on any data type mismatch. | When any data type mismatch indicates a critical error. |
+| | **Replace** | Replaces mismatched values with a default you provide, without trying to convert it first. | When mismatched values are unreliable, and you prefer to substitute it directly. |
+| | **Coerce and Raise** | Attempts to convert (coerce) the value to the correct type (e.g. string "123" to number `123`). If the conversion is successful, it uses the new value. If it fails, the job stops. | A balanced approach, often the best default strategy. |
+| | **Coerce and Replace** | The most forgiving option. The system first tries to convert the value. If it fails, it replaces the value with the specified default and continues the job. | For very dirty datasets where completing the job is the highest priority. |
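
To make the four mismatch options in the table more concrete, here is a small Python sketch of how they might behave for a numeric feature. It only mirrors the descriptions in the table and is not the GraphML implementation. The function name `apply_numeric_mismatch_strategy` and the strategy strings are made up for this example.

```python
STRATEGIES = {"raise", "replace", "coerce_and_raise", "coerce_and_replace"}


def apply_numeric_mismatch_strategy(value, strategy, default=None):
    """Illustrative only: resolve a value that should be numeric."""
    if strategy not in STRATEGIES:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    # Values that are already numeric need no handling (bool is excluded on purpose).
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return value
    if strategy == "raise":
        # Strictest option: any type mismatch stops the job.
        raise ValueError(f"Type mismatch for value {value!r}")
    if strategy == "replace":
        # Substitute the default without attempting a conversion.
        return default
    try:
        # Both coerce options first try to convert the value.
        return float(value)
    except (TypeError, ValueError):
        if strategy == "coerce_and_raise":
            raise ValueError(f"Cannot coerce {value!r} to a number")
        return default  # coerce_and_replace: fall back and continue


print(apply_numeric_mismatch_strategy("123", "coerce_and_raise"))
print(apply_numeric_mismatch_strategy("abc", "coerce_and_replace", default=0.0))
```

In this sketch, **Replace** discards even a convertible value such as `"123"`, which is the difference from **Coerce and Replace** that the table points out.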

 Once you’ve set your strategies, click **Begin featurization** to start the node
 embedding-compatible featurization job. When the job status updates to
 **Ready for training**, proceed to the **Training** step.

-![Navigate to Featurization](../../images/graph-ml-ui-featurization.png)
+![Navigate to Featurization](../../images/graphml-ui-featurization.png)

 ## Training phase

@@ -112,10 +114,12 @@ features and structural connections within the graph.
 - **Batch Size**: The number of documents processed in a single training iteration. (e.g. `256`)
 - **Data Load Batch Size**: The number of documents loaded from ArangoDB into memory in a single batch during the data loading phase (e.g. `50000`).
 - **Data Load Parallelism**: The number of parallel processes used when loading data from ArangoDB into memory for training (e.g. `10`).
+- **Enable GPU**: Enables GPU-accelerated training using GPU-capable profiles
+configured for the project (e.g., `gpu-g4dn-xlarge`).

 After setting these values, click the **Begin training** button to start the job.

-![Node Classification](../../images/ml-nodeclassification.png)
+![Node Classification](../../images/graphml-ui-node-classification.png)

 #### Node embeddings

@@ -135,7 +139,7 @@ The target collection is where the model's predictions are stored when running a

 Once the configuration is complete, click **Begin training** to start the embedding job.

-![Node Embeddings](../../images/ml-node-embedding.png)
+![Node Embeddings](../../images/graphml-ui-node-embedding.png)

 ## Model selection phase

@@ -147,7 +151,7 @@ A list of trained models is displayed, along with performance metrics
 (**Accuracy**, **Precision**, **Recall**, **F1 score**, **Loss**). Review the results of different
 model runs and configurations.

-![GraphML Model Selection](../../images/graph-ml-model.png)
+![GraphML Model Selection](../../images/graphml-ui-model.png)

 Select the best performing model suitable for your prediction task. You can also
 open the **Confusion Matrix** to compare predicted values versus actual values.
@@ -186,8 +190,10 @@ predictions relevant without repeating the entire ML workflow.
 - **Data load parallelism**: The number of parallel threads used to process
 the prediction workload (e.g. `10`).
 - **Prediction field**: The field in the documents where the predicted values are stored.
+- **Enable GPU**: Enables GPU-accelerated prediction using GPU-capable profiles
+configured for the project (e.g., `gpu-g4dn-xlarge`).

-![GraphML prediction phase](../../images/graph-prediction.png)
+![GraphML prediction phase](../../images/graphml-ui-prediction.png)

 ### Configuration options

Binary files not shown.
