DOC-766 | Add new optional parameters to GraphML (profiles, enableGpu). (#833)
* add new parameters to ui, update screenshots
* add parameters to notebook-api file
* Optimize and rename screenshots
---------
Co-authored-by: Simran Spiller <simran@arangodb.com>
site/content/ai-suite/graphml/notebooks-api.md (10 additions, 0 deletions)
@@ -165,6 +165,7 @@ but you can substitute them as follows for a schema description in terms of JSON
 - `jobConfiguration` (dict, _optional_): A set of configurations that are applied to the job.
 - `batchSize` (int): The number of documents to process in a single batch. Default is `32`.
+- `profiles` (list): One or more profiles to specify pod configurations for the project (e.g., `["gpu-g4dn-xlarge"]`). Default is `None`.
 - `runAnalysisChecks` (bool): Whether to run analysis checks, used to perform a high-level analysis of the data quality before proceeding. Default is `true`.
 - `skipLabels` (bool): Skips the featurization process for attributes marked as `label`. Default is `false`.
 - `useFeatureStore` (bool): Enables the use of the Feature Store database, which allows you to store features separately from your Source Database. Default is `false`, therefore features are written to the source graph.
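For orientation, here is a minimal Python sketch of how the new `profiles` key could be passed inside `jobConfiguration`. The enclosing dict and the variable name are illustrative assumptions; only the keys and default values come from the parameters documented above.

```python
# Illustrative only: a jobConfiguration dict using the documented keys,
# including the new optional `profiles` parameter.
featurization_job_config = {
    "batchSize": 32,                  # documents processed per batch
    "runAnalysisChecks": True,        # run the high-level data-quality analysis first
    "skipLabels": False,              # do not skip featurization of attributes marked as `label`
    "useFeatureStore": False,         # write features to the source graph, not a Feature Store
    "profiles": ["gpu-g4dn-xlarge"],  # new: pod configuration profiles for the project
}
```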
@@ -417,11 +418,17 @@ The Training Service depends on a **Training Specification**:
 - `inputFeatures` (str): The name of the feature to be used as input.
 - `labelField` (str): The name of the attribute to be predicted.
 - `batchSize` (int): The number of documents to process in a single training batch. Default is `64`.
+- `dataLoadBatchSize` (int): The number of documents loaded from ArangoDB into memory in a single batch during the data loading phase. Default is `50000`.
+- `dataLoadParallelism` (int): The number of parallel processes used when loading data from ArangoDB into memory for training. Default is `10`.
+- `enableGpu` (bool): Enables GPU-accelerated training using GPU-capable profiles configured for the project. Default is `false`.
 - `graphEmbeddings` (dict): Dictionary to describe the Graph Embedding Task Specification.
 - `targetCollection` (str): The ArangoDB collection used to generate the embeddings.
 - `embeddingSize` (int): The size of the embedding vector. Default is `128`.
 - `batchSize` (int): The number of documents to process in a single training batch. Default is `64`.
 - `generateEmbeddings` (bool): Whether to generate embeddings on the training dataset. Default is `false`.
+- `dataLoadBatchSize` (int): The number of documents loaded from ArangoDB into memory in a single batch during the data loading phase. Default is `50000`.
+- `dataLoadParallelism` (int): The number of parallel processes used when loading data from ArangoDB into memory for training. Default is `10`.
+- `enableGpu` (bool): Enables GPU-accelerated training using GPU-capable profiles configured for the project. Default is `false`.
 - `metagraph` (dict): Metadata to represent the node & edge collections of the graph. If `featureSetID` is provided, this can be omitted.
 - `graph` (str): The ArangoDB graph name.
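As a rough illustration, the new training keys could sit alongside the existing ones as shown below. The `training_spec` fragment is hypothetical and omits the rest of the Training Specification; keys and defaults are taken from the list above.

```python
# Hypothetical fragment of a Training Specification showing the new keys.
training_spec = {
    "batchSize": 64,             # documents per training batch
    "dataLoadBatchSize": 50000,  # documents loaded from ArangoDB into memory per batch
    "dataLoadParallelism": 10,   # parallel processes for the data loading phase
    "enableGpu": True,           # requires a GPU-capable profile configured for the project
    # remaining Training Specification fields (metagraph, graphEmbeddings, ...) omitted
}
```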
@@ -736,6 +743,9 @@ The Prediction Service depends on a **Prediction Specification**:
 - `modelID` (str): The model ID to use for generating predictions.
 - `featurizeNewDocuments` (bool): Boolean for enabling or disabling the featurization of new documents. Useful if you don't want to re-train the model upon new data. Default is `false`.
 - `featurizeOutdatedDocuments` (bool): Boolean for enabling or disabling the featurization of outdated documents. Outdated documents are those whose features have changed since the last featurization. Default is `false`.
+- `dataLoadBatchSize` (int): The number of documents to load in a single batch. Default is `500000`.
+- `dataLoadParallelism` (int): The number of parallel threads used to process the prediction workload. Default is `10`.
+- `enableGpu` (bool): Enables GPU-accelerated prediction using GPU-capable profiles configured for the project. Default is `false`.
 - `schedule` (str): A cron expression to schedule the prediction job. The cron syntax is a set of
   five fields in a line, indicating when the job should be executed. The format must follow
   the following order: `minute` `hour` `day-of-month` `month` `day-of-week`
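Similarly, a hedged sketch of a Prediction Specification fragment with the new keys; the cron value is just one example of the five-field syntax, and the surrounding structure is an assumption.

```python
# Hypothetical fragment of a Prediction Specification with the new keys.
prediction_spec = {
    "featurizeNewDocuments": False,
    "featurizeOutdatedDocuments": False,
    "dataLoadBatchSize": 500000,   # documents loaded per batch
    "dataLoadParallelism": 10,     # parallel prediction threads
    "enableGpu": True,             # requires a GPU-capable profile configured for the project
    "schedule": "0 3 * * *",       # example cron expression: every day at 03:00
}
```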
-| Missing |**Raise**| Stops the job and reports an error when a value is missing. | When missing data indicates a critical issue. |
-||**Replace**| Substitutes missing values with a default you provide (e.g., `0` for numbers, `"unknown"` for text). | When missing values are expected.|
-| Mismatch |**Raise**| The strictest option. Stops the job on any data type mismatch. | When any data type mismatch indicates a critical error.|
-||**Replace**| Replaces mismatched values with a default you provide, without trying to convert it first. | When mismatched values are unreliable, and you prefer to substitute it directly.|
-||**Coerce and Raise**| Attempts to convert (coerce) the value to the correct type (e.g. string "123" to number `123`). If the conversion is successful, it uses the new value. If it fails, the job stops. | A balanced approach, often the best default strategy.|
-||**Coerce and Replace**| The most forgiving option. The system first tries to convert the value. If it fails, it replaces the value with the specified default and continues the job. | For very dirty datasets where completing the job is the highest priority. |
+|**Strategy type**|**Option**|**Description**|**When to use**|
+|---|---|---|---|
+| Missing |**Raise**| Stops the job and reports an error when a value is missing. | When missing data indicates a critical issue.|
+||**Replace**| Substitutes missing values with a default you provide (e.g., `0` for numbers, `"unknown"` for text). | When missing values are expected.|
+| Mismatch |**Raise**| The strictest option. Stops the job on any data type mismatch. | When any data type mismatch indicates a critical error. |
+||**Replace**| Replaces mismatched values with a default you provide, without trying to convert it first. | When mismatched values are unreliable, and you prefer to substitute it directly. |
+||**Coerce and Raise**| Attempts to convert (coerce) the value to the correct type (e.g. string "123" to number `123`). If the conversion is successful, it uses the new value. If it fails, the job stops. | A balanced approach, often the best default strategy.|
+||**Coerce and Replace**| The most forgiving option. The system first tries to convert the value. If it fails, it replaces the value with the specified default and continues the job. | For very dirty datasets where completing the job is the highest priority. |
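To make the Mismatch options concrete, this small Python sketch mimics the behavior described for **Coerce and Replace** on a numeric field. It is illustrative only, not the service's actual implementation; the function name and default value are assumptions.

```python
# Illustrative sketch of the "Coerce and Replace" strategy for a numeric field.
def coerce_and_replace(value, default=0):
    try:
        return float(value)   # try to coerce, e.g. "123" -> 123.0
    except (TypeError, ValueError):
        return default        # on failure, replace with the provided default and continue

print(coerce_and_replace("123"))      # 123.0 (coercion succeeded)
print(coerce_and_replace("oops", 0))  # 0 (replaced with the default)
```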
 Once you’ve set your strategies, click **Begin featurization** to start the node
 embedding-compatible featurization job. When the job status updates to
 **Ready for training**, proceed to the **Training** step.
 
-
+
 
 ## Training phase
@@ -112,10 +114,12 @@ features and structural connections within the graph.
 - **Batch Size**: The number of documents processed in a single training iteration. (e.g. `256`)
 - **Data Load Batch Size**: The number of documents loaded from ArangoDB into memory in a single batch during the data loading phase (e.g. `50000`).
 - **Data Load Parallelism**: The number of parallel processes used when loading data from ArangoDB into memory for training (e.g. `10`).
+- **Enable GPU**: Enables GPU-accelerated training using GPU-capable profiles
+  configured for the project (e.g., `gpu-g4dn-xlarge`).
 
 After setting these values, click the **Begin training** button to start the job.