From 3a225063374cbbd541b855b6a0343160da51062f Mon Sep 17 00:00:00 2001
From: Samuel Johnson
Date: Mon, 11 May 2026 19:45:18 -0400
Subject: [PATCH 1/8] readme overhaul

---
 PYPI.md                 | 779 ------------------
 README.md               | 751 +-----------------
 docs/.nojekyll          |   0
 docs/PYPI.md            | 367 +++++++++
 docs/README.md          |  13 +
 docs/_sidebar.md        |  31 +
 .../build_executable.md |  40 +-
 docs/cli-reference.md   | 309 +++++++
 docs/contributing.md    |  83 ++
 docs/development.md     | 186 +++++
 docs/faq.md             | 165 ++++
 docs/index.html         |  48 ++
 docs/index.md           |  52 ++
 docs/quick-start.md     | 162 ++++
 14 files changed, 1449 insertions(+), 1537 deletions(-)
 delete mode 100644 PYPI.md
 create mode 100644 docs/.nojekyll
 create mode 100644 docs/PYPI.md
 create mode 100644 docs/README.md
 create mode 100644 docs/_sidebar.md
 rename README_Build_Executable.md => docs/build_executable.md (94%)
 create mode 100644 docs/cli-reference.md
 create mode 100644 docs/contributing.md
 create mode 100644 docs/development.md
 create mode 100644 docs/faq.md
 create mode 100644 docs/index.html
 create mode 100644 docs/index.md
 create mode 100644 docs/quick-start.md

diff --git a/PYPI.md b/PYPI.md
deleted file mode 100644
index 915c69169..000000000
--- a/PYPI.md
+++ /dev/null
@@ -1,779 +0,0 @@
-# CDISC Rules Engine Guide
-
-## Step 0: Install the Library
-
-```
-pip install cdisc-rules-engine
-```
-
-In addition to installing the library, you'll also want to download the rules cache (found in the resources/cache folder of this repository) and store them somewhere in your project.
Notably, when pip install is run, it will install the USDM and dataset-JSON dataset schemas should you decide to implement the dataset reader classes in cdisc_rules_engine/services/data_readers or the metadata_readers in cdisc_rules_engine/services - -## Step 1: Load the Rules - -The rules can be loaded into an in-memory cache by doing the following: - -```python -import os -import pathlib -import pickle - -from multiprocessing.managers import SyncManager -from cdisc_rules_engine.services.cache import InMemoryCacheService - -class CacheManager(SyncManager): - pass - -# If you're working from a terminal you may need to -# use SyncManager directly rather than define CacheManager -CacheManager.register("InMemoryCacheService", InMemoryCacheService) - - -def load_rules_cache(path_to_rules_cache): - cache_path = pathlib.Path(path_to_rules_cache) - manager = CacheManager() - manager.start() - cache = manager.InMemoryCacheService() - - files = next(os.walk(cache_path), (None, None, []))[2] - - for fname in files: - with open(cache_path / fname, "rb") as f: - cache.add_all(pickle.load(f)) - - return cache -``` - -Rules in this cache can also be accessed by standard and version using the get_rules_cache_key function. - -```python -from cdisc_rules_engine.utilities.utils import get_rules_cache_key - -cache = load_rules_cache("path/to/rules/cache") -# Note that the standard version is separated by a dash, not a period -cache_key_prefix = get_rules_cache_key("sdtmig", "3-4") -rules = cache.get_all_by_prefix(cache_key_prefix) -``` - -`rules` will now be a list of dictionaries with the following keys: - -- core_id (e.g. "CORE-000252") -- domains (e.g. {'Include': ['DM'], 'Exclude': []} or {'Include': ['ALL']}) -- author -- reference -- sensitivity -- executability -- description -- authorities -- standards -- classes -- rule_type -- conditions -- actions -- datasets -- output_variables - -A rule using JSON/dict containing the rule keys above. 
- -```python -from cdisc_rules_engine.models.rule import Rule -rule_metadata = { - "Authorities": [ - ], - "Check": { - }, - "Core": { - "Id": "CORE-000659", - "Status": "Published", - "Version": "1" - }, - # ... rest of your rule metadata -} - -# Convert the CDISC metadata format to the internal format -rule_dict = Rule.from_cdisc_metadata(rule_metadata) - -# Create the Rule object -rule_obj = Rule(rule_dict) -``` - -# Implementation Options - -You can run rules using the business rules logic or using CDISC RulesEngine() class. The RulesEngine class provides a higher-level interface with additional features like dataset preprocessing, rule operations, and more comprehensive error handling but requires more setup and physical data files. Using the business rules directly uses the business_rules package to run each rule individually against your dataset. - -# Option A: Direct Use of Business Rules Engine - -## Step 2: Prepare Your Data - -In order to pass your data through the business rules, it must be a pandas dataframe of an SDTM dataset. For example: - -```python ->>> data -STUDYID DOMAIN USUBJID AESEQ AESER AETERM ... AESDTH AESLIFE AESHOSP -0 AE 001 0 Y Headache ... 
N N N - -[1 rows x 19 columns] -``` - -Before passing this into the rules engine, we need to wrap it in a DatasetVariable, which requires first wrapping it in a PandasDataset: - -```python -import pandas as pd -from cdisc_rules_engine.models.dataset.pandas_dataset import PandasDataset -from cdisc_rules_engine.models.dataset_variable import DatasetVariable -from cdisc_rules_engine.models.sdtm_dataset_metadata import SDTMDatasetMetadata - -# First, create a PandasDataset from your DataFrame -pandas_dataset = PandasDataset(data=data) - -# Create dataset metadata (needed for column_prefix_map) -dataset_metadata = SDTMDatasetMetadata( - name="AE", - label="Adverse Events", - first_record=data.iloc[0].to_dict() if not data.empty else None -) - -# Then create the DatasetVariable -dataset_variable = DatasetVariable( - pandas_dataset, - column_prefix_map={"--": dataset_metadata.domain}, -) -``` - -NOTE: DatasetVariable has several arguments that can be instantiated but dataset and column prefix are the most vital for rule execution. 
- -```python -dataset_variable = DatasetVariable( - dataset, - column_prefix_map={"--": dataset_metadata.domain}, - value_level_metadata=value_level_metadata, - column_codelist_map=variable_codelist_map, - codelist_term_maps=codelist_term_maps, -) -``` - -## Step 3: Run Rules - -```python -from business_rules.engine import run -from cdisc_rules_engine.models.actions import COREActions - -# Get the rules for the domain AE -ae_rules = [ - rule for rule in rules - if "AE" in rule.get("domains", {}).get("Include", []) or - "ALL" in rule.get("domains", {}).get("Include", []) -] - -print(f"Found {len(ae_rules)} rules applicable to AE domain") - -all_results = [] -for idx, rule in enumerate(ae_rules): - print(f"\nProcessing rule {idx+1}/{len(ae_rules)}: {rule.get('core_id')}") - - results = [] - core_actions = COREActions( - output_container=results, - variable=dataset_variable, - dataset_metadata=dataset_metadata, - rule=rule, - value_level_metadata=None # This is optional - ) - - try: - was_triggered = run( - rule=rule, - defined_variables=dataset_variable, - defined_actions=core_actions, - ) - - if was_triggered and results: - print(f" Rule {rule.get('core_id')} was triggered - issues found!") - all_results.extend(results) - else: - print(f" Rule {rule.get('core_id')} passed - no issues found") - except Exception as e: - print(f" Error processing rule {rule.get('core_id')}: {str(e)}") -``` - -all_results now contains your validation result. You can print it to a text file using - -```python -import os -OUTPUT_DIR = os.getcwd() -with open(os.path.join(OUTPUT_DIR, 'results.txt'), 'w') as f: - for result in all_results: - f.write(str(result) + '\n') -``` - -# Option B: Using the RulesEngine Class - -## Step 2: Prepare Your Data - -We will use an XPT files (SAS transport format) for this example. 
Other dataset readers classes to model your implementation can be found in cdisc_rules_engine/services/data_readers - -```python -import pandas as pd -import pyreadstat -from cdisc_rules_engine.models.sdtm_dataset_metadata import SDTMDatasetMetadata - -def create_dataset_metadata(file_path): - """Create dataset metadata from an XPT file""" - try: - # Read the XPT file - data, meta = pyreadstat.read_xport(file_path) - - # Extract domain from first record if available - first_record = data.iloc[0].to_dict() if not data.empty else None - - # Create metadata object - return SDTMDatasetMetadata( - name=os.path.basename(file_path).split('.')[0].upper(), - label=meta.file_label if hasattr(meta, 'file_label') else "", - filename=os.path.basename(file_path), - full_path=file_path, - file_size=os.path.getsize(file_path), - record_count=len(data), - first_record=first_record - ) - except Exception as e: - print(f"Error creating metadata for {file_path}: {e}") - return None - -# Create dataset metadata for all XPT files in a directory -def get_datasets_metadata(directory): - datasets = [] - for file in os.listdir(directory): - if file.lower().endswith('.xpt'): - file_path = os.path.join(directory, file) - metadata = create_dataset_metadata(file_path) - if metadata: - datasets.append(metadata) - return datasets - -# Get metadata for all datasets -datasets = get_datasets_metadata("path/to/dataset/directory") -``` - -For the RulesEngine approach, you DON'T need to manually create PandasDataset or DatasetVariable objects - the engine handles this -Multiple SDTMDatasetMetadata datasets can be made and loaded into the `datasets = []` list which will be fed into engine - -### Step 3: Initialize Library Metadata - -Library metadata provides essential information about standards, models, variables, and controlled terminology: - -```python -from cdisc_rules_engine.models.library_metadata_container import LibraryMetadataContainer -from cdisc_rules_engine.utilities.utils import ( - 
get_library_variables_metadata_cache_key, - get_model_details_cache_key_from_ig, - get_standard_details_cache_key, - get_variable_codelist_map_cache_key, -) - -standard = "sdtmig" -standard_version = "3-4" -standard_substandard = None - -# Get cache keys for metadata -standard_details_cache_key = get_standard_details_cache_key( - standard, standard_version, standard_substandard -) -variable_details_cache_key = get_library_variables_metadata_cache_key( - standard, standard_version, standard_substandard -) - -# Get standard metadata from cache -standard_metadata = cache.get(standard_details_cache_key) - -# Get model metadata based on standard metadata -model_metadata = {} -if standard_metadata: - model_cache_key = get_model_details_cache_key_from_ig(standard_metadata) - model_metadata = cache.get(model_cache_key) - -# Get variable-codelist mapping -variable_codelist_cache_key = get_variable_codelist_map_cache_key( - standard, standard_version, standard_substandard -) - -# Load controlled terminology packages -ct_packages = ["sdtmct-2021-12-17"] # Replace with your CT package versions -ct_package_metadata = {} -for codelist in ct_packages: - ct_package_metadata[codelist] = cache.get(codelist) - -# Create the library metadata container -library_metadata = LibraryMetadataContainer( - standard_metadata=standard_metadata, - model_metadata=model_metadata, - variables_metadata=cache.get(variable_details_cache_key), - variable_codelist_map=cache.get(variable_codelist_cache_key), - ct_package_metadata=ct_package_metadata, -) -``` - -### Step 4: Initialize Data Service - -The data service handles reading and processing dataset files: - -```python -from cdisc_rules_engine.config import config as default_config -from cdisc_rules_engine.services.data_services import DataServiceFactory - -max_dataset_size = max(datasets, key=lambda x: x.file_size).file_size # set to 0 for Dask implementationt - -# Create data service factory -data_service_factory = DataServiceFactory( - 
config=default_config, - cache_service=cache, - standard=standard, - standard_version=standard_version, - standard_substandard=standard_substandard, - library_metadata=library_metadata, - max_dataset_size=max_dataset_size, -) - -# Set dataset implementation (imported from cdisc_rules_engine.models.dataset) -dataset_implementation = PandasDataset or DaskDataset - -# Get data service -data_service = data_service_factory.get_data_service(dataset_paths) -``` - -### Step 5: Initialize Rules Engine - -Now you can create the rules engine with all required components: - -```python -from cdisc_rules_engine.rules_engine import RulesEngine - -# Initialize the rules engine -rules_engine = RulesEngine( - cache=cache, - data_service=data_service, - config_obj=default_config, - external_dictionaries=None, # see bottom for implementation tips - standard=standard, - standard_version=standard_version, - standard_substandard=None, # substandard if indicated - library_metadata=library_metadata, - max_dataset_size=max_dataset_size, - dataset_paths=dataset_paths, - ct_packages=ct_packages, - define_xml_path="path/to/define.xml", # Optional - validate_xml=False, # Whether to validate XML against schema -) -``` - -### Step 6: Run Validation - -Now you can iterate through the rules and validate them against your datasets: - -```python -import time -import itertools -from cdisc_rules_engine.models.rule_conditions import ConditionCompositeFactory -from cdisc_rules_engine.models.rule_validation_result import RuleValidationResult - -# Process each rule -start_time = time.time() -validation_results = [] - -for rule in rules: - try: - print(f"Validating rule {rule['core_id']}...") - if isinstance(rule["conditions"], dict): - rule["conditions"] = ConditionCompositeFactory.get_condition_composite(rule["conditions"]) - results = rules_engine.validate_single_rule(rule, datasets) - flattened_results = [] - for domain_results in results.values(): - flattened_results.extend(domain_results) - 
validation_results.append(RuleValidationResult(rule, flattened_results)) - except Exception as e: - print(f"Error validating rule {rule.get('core_id')}: {str(e)}") -end_time = time.time() -elapsed_time = end_time - start_time -``` - -### Step 7: Generate Report - -The simplest way to output your validation results is to write them to a text file: - -```python -import os -import json - -output_dir = os.getcwd() -output_file = os.path.join(output_dir, "validation_results.txt") -with open(output_file, "w") as f: - for result in validation_results: - rule_id = result.rule.get("core_id", "Unknown") - f.write(f"Rule: {rule_id}\n") - if hasattr(result, 'violations') and result.violations: - f.write(f"Found {len(result.violations)} violations\n") - for violation in result.violations: - f.write(f" - {json.dumps(violation, default=str)}\n") - else: - f.write(" No violations found\n") - f.write("\n") -print(f"Results written to {output_file}") -``` - -For more advanced reporting, you can implement custom reports using the ReportFactory class - -```python -reporting_factory = ReportFactory( - datasets=datasets, - validation_results=validation_results, - elapsed_time=elapsed_time, - args=args, # Provide validation args - data_service=data_service - ) -reporting_services = reporting_factory.get_report_services() -``` - -### Other Information - -## Interpret the Results - -The return value of a run will tell us if the rule was triggered. - -- A False value means that there were no errors -- A True value means that there were errors - -If there were errors, they will have been appended to the results array passed into your COREActions instance. 
Here's an example error: - -```python -{ - 'executionStatus': 'success', - 'domain': 'AE', - 'variables': ['AESLIFE'], - 'message': 'AESLIFE is completed, but not equal to "N" or "Y"', - 'errors': [ - {'value': {'AESLIFE': 'Maybe'}, 'row': 1} - ] -} -``` - -## Understanding Dataset Abstraction - -The CDISC Rules Engine uses an abstraction layer for datasets, which allows for flexibility but requires properly initializing your data before validation. Here's how to work with the PandasDataset class: - -```python -from cdisc_rules_engine.models.dataset.pandas_dataset import PandasDataset -import pandas as pd - -# Create or load your DataFrame -my_dataframe = pd.DataFrame({ - 'STUDYID': ['STUDY1', 'STUDY1'], - 'USUBJID': ['001', '002'], - 'DOMAIN': ['DM', 'DM'], - # Add other columns as needed -}) - -# Create a PandasDataset instance -dataset = PandasDataset(data=my_dataframe) - -# Now 'dataset' can be used with DatasetVariable: -dataset_variable = DatasetVariable( - dataset, - column_prefix_map={"--": dataset_metadata.domain}, - value_level_metadata=value_level_metadata, - column_codelist_map=variable_codelist_map, - codelist_term_maps=codelist_term_maps, -) -``` - -There is also a DaskDataset which can be utilized. - -## Understanding DatasetMetadata and column_prefix_map - -The column_prefix_map is often dynamically set using the domain information from SDTMDatasetMetadata. 
Here's how these components work together: - -### SDTMDatasetMetadata - -The SDTMDatasetMetadata class provides essential information about SDTM datasets: - -```python -from dataclasses import dataclass -from typing import Union -from cdisc_rules_engine.models.dataset_metadata import DatasetMetadata -from cdisc_rules_engine.constants.domains import SUPPLEMENTARY_DOMAINS - -@dataclass -class SDTMDatasetMetadata(DatasetMetadata): - """ - This class is a container for SDTM dataset metadata - """ - @property - def domain(self) -> Union[str, None]: - return (self.first_record or {}).get("DOMAIN", None) - - @property - def rdomain(self) -> Union[str, None]: - return (self.first_record or {}).get("RDOMAIN", None) if self.is_supp else None - - @property - def is_supp(self) -> bool: - """ - Returns true if name starts with SUPP or SQ - """ - return self.name.startswith(SUPPLEMENTARY_DOMAINS) - - @property - def unsplit_name(self) -> str: - if self.domain: - return self.domain - if self.name.startswith("SUPP"): - return f"SUPP{self.rdomain}" - if self.name.startswith("SQ"): - return f"SQ{self.rdomain}" - return self.name - - @property - def is_split(self) -> bool: - return self.name != self.unsplit_name -``` - -### How column_prefix_map uses domain Information - -In the rule execution flow, the column_prefix_map is typically set using the domain from the dataset metadata: - -```python -dataset_variable = DatasetVariable( - dataset, - column_prefix_map={"--": dataset_metadata.domain}, - value_level_metadata=value_level_metadata, - column_codelist_map=variable_codelist_map, - codelist_term_maps=codelist_term_maps, -) -``` - -This dynamic mapping allows the engine to correctly interpret variable names based on their domain context. 
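
As a rough illustration of the idea (not the engine's actual implementation), the sketch below shows how a `--` placeholder in a generic variable name could be expanded using a `column_prefix_map`. The helper `resolve_variable_name` is hypothetical and exists only for this example; the engine's own resolution logic may differ.

```python
def resolve_variable_name(name: str, column_prefix_map: dict[str, str]) -> str:
    """Hypothetical helper: expand prefix placeholders such as "--" in a variable name."""
    for placeholder, prefix in column_prefix_map.items():
        # Leave the name untouched when no domain is available (e.g. domain is None)
        if prefix:
            name = name.replace(placeholder, prefix)
    return name


# Assumed example values: in practice the prefix would come from dataset_metadata.domain
column_prefix_map = {"--": "AE"}

for generic_name in ["--SEQ", "--TERM", "USUBJID"]:
    print(generic_name, "->", resolve_variable_name(generic_name, column_prefix_map))
```

Under these assumptions, a rule written against `--SEQ` would be read as `AESEQ` for an AE dataset, while names without the placeholder (such as `USUBJID`) pass through unchanged.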
- -## Complete End-to-End Example - -Here's a comprehensive example showing how to set up your environment with dataset metadata and execute a rule using the business rule implementation: - -```python -import os -import pathlib -import pickle -import pandas as pd -import pyreadstat -from multiprocessing import freeze_support -from multiprocessing.managers import SyncManager -from cdisc_rules_engine.services.cache import InMemoryCacheService -from cdisc_rules_engine.models.dataset.pandas_dataset import PandasDataset -from cdisc_rules_engine.models.dataset_variable import DatasetVariable -from cdisc_rules_engine.models.sdtm_dataset_metadata import SDTMDatasetMetadata -from cdisc_rules_engine.utilities.utils import get_rules_cache_key -from cdisc_rules_engine.models.actions import COREActions -from business_rules.engine import run - -# Step 1: Load the Rules Cache -class CacheManager(SyncManager): - pass - -CacheManager.register("InMemoryCacheService", InMemoryCacheService) - -def load_rules_cache(path_to_rules_cache): - cache_path = pathlib.Path(path_to_rules_cache) - manager = CacheManager() - manager.start() - cache = manager.InMemoryCacheService() - - # Get all files in the cache directory - files = next(os.walk(cache_path), (None, None, []))[2] - - for fname in files: - with open(cache_path / fname, "rb") as f: - cache.add_all(pickle.load(f)) - - return cache - -def main(): - # Define file paths - current_dir = os.getcwd() - cache_path = os.path.join(current_dir, "cache") - ae_file_path = os.path.join(current_dir, "ae.xpt") - - print(f"Looking for rules cache in: {cache_path}") - print(f"Using AE dataset from: {ae_file_path}") - - try: - # Step 1: Load the rules cache - cache = load_rules_cache(cache_path) - print("Successfully loaded rules cache") - - # Get SDTMIG 3.4 rules - cache_key_prefix = get_rules_cache_key("sdtmig", "3-4") - rules = cache.get_all_by_prefix(cache_key_prefix) - print(f"Found {len(rules)} rules for SDTMIG 3.4") - - # Step 2: Load and 
prepare the AE dataset - # Read the XPT file - ae_data, meta = pyreadstat.read_xport(ae_file_path) - print(f"Successfully loaded AE dataset with {len(ae_data)} records") - print(f"Columns: {', '.join(ae_data.columns)}") - - # Convert to PandasDataset - pandas_dataset = PandasDataset(data=ae_data) - - # Create dataset metadata - dataset_metadata = SDTMDatasetMetadata( - name="AE", - label=meta.file_label if hasattr(meta, 'file_label') else "Adverse Events", - first_record=ae_data.iloc[0].to_dict() if not ae_data.empty else None - ) - - # Create the DatasetVariable - dataset_variable = DatasetVariable( - pandas_dataset, - column_prefix_map={"--": dataset_metadata.domain}, - ) - - # Step 3: Run the AE Domain Rules - ae_rules = [ - rule for rule in rules - if "AE" in rule.get("domains", {}).get("Include", []) or - "ALL" in rule.get("domains", {}).get("Include", []) - ] - - print(f"Found {len(ae_rules)} rules applicable to AE domain") - - # Process the rules - all_results = [] - - for idx, rule in enumerate(ae_rules): - print(f"\nProcessing rule {idx+1}/{len(ae_rules)}: {rule.get('core_id')} - {rule.get('description')[:80]}...") - - results = [] - # Updated initialization of COREActions with correct parameters - core_actions = COREActions( - output_container=results, - variable=dataset_variable, - dataset_metadata=dataset_metadata, - rule=rule, - value_level_metadata=None # This is optional - ) - - try: - was_triggered = run( - rule=rule, - defined_variables=dataset_variable, - defined_actions=core_actions, - ) - - if was_triggered and results: - print(f" ❌ Rule {rule.get('core_id')} was triggered - issues found!") - for result in results: - if isinstance(result, dict) and 'errors' in result and result['errors']: - error_count = len(result.get('errors', [])) - print(f" Message: {result.get('message')} ({error_count} errors)") - # Only show up to 3 errors for brevity - for error in result.get('errors', [])[:3]: - print(f" Row: {error.get('row')}, Value: 
{error.get('value')}") - if error_count > 3: - print(f" ... and {error_count - 3} more errors") - else: - print(f" Message: {result}") - all_results.extend(results) - else: - print(f" ✓ Rule {rule.get('core_id')} passed - no issues found") - except Exception as e: - print(f" ⚠️ Error processing rule {rule.get('core_id')}: {str(e)}") - - # Summary of results - print("\n===== VALIDATION SUMMARY =====") - print(f"Total rules checked: {len(ae_rules)}") - print(f"Rules with issues: {len(all_results)}") - - # Count total issues, handling different result formats - issue_count = 0 - for result in all_results: - if isinstance(result, dict) and 'errors' in result: - issue_count += len(result['errors']) - else: - issue_count += 1 - - print(f"Total issues found: {issue_count}") - - except Exception as e: - print(f"Error: {str(e)}") - import traceback - traceback.print_exc() - -if __name__ == "__main__": - freeze_support() - main() -``` - -# Troubleshooting - -If you're seeing errors related to the dataset integration, check that: - -- Your DataFrame contains all the required columns for the validation rules -- The column_prefix_map correctly maps variable prefixes to domains (e.g., {"--": "AE"} for Adverse Events) -- Your column_codelist_map includes all the necessary codelists for variables that have controlled terminology -- Any metadata passed to the DatasetVariable constructor is correctly formatted -- The dataset object is an instance of PandasDataset, not a raw pandas DataFrame -- All required parameters are provided to the DatasetVariable constructor -- Ensure dataset metadata has the correct domain property (usually taken from the first record's DOMAIN column) -- Check that full_path is provided in dataset_metadata when using RulesEngine approach -- Verify the rule's domain inclusion criteria matches your dataset domain (in the rule's domains.Include array) -- Make sure your cache contains the appropriate controlled terminology if validating against standard 
terminologies -- Confirm the standard_version format is consistent -- If using external dictionaries, verify all file paths are correct and accessible -- When working with define.xml, ensure the define_xml_path is valid and accessible and the file is named `define.xml` - -## For Windows compatibility - -you will need freeze_support() for multiprocessing compatibility - -```python -from multiprocessing import freeze_support - -if __name__ == "__main__": - freeze_support() - main() -``` - -## External Dictionaries - -To feed external dictionaries into engine, you will need to instantiate the container. - -```python -from cdisc_rules_engine.models.external_dictionaries_container import ExternalDictionariesContainer -from cdisc_rules_engine.models.dictionaries.dictionary_types import DictionaryTypes - -# Create a dictionary path mapping with the dictionary you are providing -external_dictionaries = ExternalDictionariesContainer( - { - DictionaryTypes.UNII.value: unii_path, - DictionaryTypes.MEDRT.value: medrt_path, - DictionaryTypes.MEDDRA.value: meddra_path, - DictionaryTypes.WHODRUG.value: whodrug_path, - DictionaryTypes.LOINC.value: loinc_path, - DictionaryTypes.SNOMED.value: { - "edition": snomed_edition, - "version": snomed_version, - "base_url": snomed_url, - }, - } -) - -# Instantiate the container with the paths -external_dictionaries = ExternalDictionariesContainer(dictionary_path_mapping=dictionary_paths) - -# Then pass it to RulesEngine -rules_engine = RulesEngine( - cache=cache, - dataset_paths=[os.path.dirname(ae_file_path)], - standard="sdtmig", - standard_version="3.4", - external_dictionaries=external_dictionaries -) -``` - -see [External Dictionary Reference](https://cdisc-org.github.io/conformance-rules-editor/#/exdictionary) for more information diff --git a/README.md b/README.md index 4340249dc..769e5ed68 100644 --- a/README.md +++ b/README.md @@ -10,751 +10,20 @@ Open source offering of the CDISC Rules Engine, a tool designed for validating 
clinical trial data against data standards. -## Quick Start +## Quick Start Documentation -**Need help?** Jump to [Troubleshooting & Support](#troubleshooting--support) +Full documentation lives in the [`docs/`](docs/index.md) directory and is hosted at: -### Option 1: Use the Pre-built Executable (Recommended for Most Users) +> **[https://cdisc-org.github.io/cdisc-rules-engine](https://cdisc-org.github.io/cdisc-rules-engine)** -**Best for:** Users who want to run CORE without installing Python or dependencies. +| | | +|---|---| +| [Quick Start](docs/quick-start.md) | Download the executable or run from source | +| [CLI Reference](docs/cli-reference.md) | All commands and flags | +| [Development](docs/development.md) | PyPI integration, building, testing | +| [Contributing](docs/contributing.md) | Code style, tests, rule contributions | +| [FAQ & Troubleshooting](docs/faq.md) | Common issues and questions | -1. Download the latest executable for your operating system from [Releases](https://github.com/cdisc-org/cdisc-rules-engine/releases) -2. Unzip the downloaded file -3. Open a terminal in the unzipped directory -4. **Verify the installation** (optional but recommended): - - **Windows (PowerShell):** - -```bash - .\core.exe --help -``` - -**Linux/Mac:** - -```bash - # First, make it executable (one-time setup) - chmod +x ./core - - # Then verify - ./core --help -``` - -This displays all available commands and confirms the executable is working properly. - -**Mac users:** If you encounter security warnings, you may need to remove the quarantine attribute first: - -```bash - xattr -rd com.apple.quarantine . -``` - -5. **Run a test validation** (optional): - - CORE includes built-in test commands to verify core functionality: - -```bash - # Windows - .\core.exe test-validate json - - # Linux/Mac - ./core test-validate json -``` - -This confirms the executable is working correctly. 
Test results are automatically cleaned up after completion and cannot be accessed by the user. - -6. **Run your first validation:** - - **Windows (PowerShell):** - -```bash - .\core.exe validate -s sdtmig -v 3-4 -d C:\path\to\datasets -``` - -**Linux:** - -```bash - ./core validate -s sdtmig -v 3-4 -d /path/to/datasets -``` - -**Mac:** - -```bash - ./core validate -s sdtmig -v 3-4 -d /path/to/datasets -``` - ---- - -### Option 2: Run from Source Code - -**Best for:** Developers, contributors, or users who need the latest features. - -**Prerequisites:** Python 3.12 installed on your system. - -1. Clone the repository: - - ```bash - git clone https://github.com/cdisc-org/cdisc-rules-engine.git - cd cdisc-rules-engine - ``` - -2. Install dependencies: - - ```bash - python -m pip install -r requirements-dev.txt - ``` - -3. Run validation: - ```bash - python core.py validate -s sdtmig -v 3-4 -d /path/to/datasets - ``` - ---- - -## Command-line Interface - -All examples below use `python core.py` for source code users. **If you're using the executable**, replace `python core.py` with: - -- **Windows:** `.\core.exe` -- **Linux/Mac:** `./core` - -### Running a validation (`validate`) - -Clone the repository and run: - -```bash -python core.py --help -``` - -This will display the full list of commands. - -Run: - -```bash -python core.py validate --help -``` - -This will show the list of validation options. - -``` - -ca, --cache TEXT Relative path to cache files containing pre - loaded metadata and rules - -ps, --pool-size INTEGER Number of parallel processes for validation - -dep, --dotenv-path Path to the .env file used to set environment variables. - -d, --data TEXT Path to directory containing data files. Should only be provided once. If provided more than once, only the last value will be recorded. - -dp, --dataset-path TEXT Absolute path to dataset file. Can be specified multiple times. - -dxp, --define-xml-path TEXT Path to Define-XML. 
DEFINE_XML environment variable can be used to pass value. - -l, --log-level [info|debug|error|critical|disabled|warn] - Sets log level for engine logs, logs are - disabled by default - -rt, --report-template TEXT File path of report template to use for - excel output - -s, --standard TEXT CDISC standard to validate against. PRODUCT environment variable can be used to pass value. - [required] - -v, --version TEXT Standard version to validate against. VERSION environment variable can be used to pass value. - [required] - -ss, --substandard TEXT Substandard to validate against. SUBSTANDARD environment variable can be used to pass value. - "SDTM", "SEND", "ADaM", or "CDASH" - [required for TIG] - -uc, --use-case TEXT Use Case for TIG Validation - "INDH", "PROD", "NONCLIN", or "ANALYSIS" - [required for TIG] - USE_CASE environment variable can be used to pass value. - -ct, --controlled-terminology-package TEXT - Controlled terminology package to validate - against, can provide more than one - NOTE: if a defineXML is provided, if it is version 2.1 - engine will use the CT laid out in the define. If it is - version 2.0, -ct is expected to specify the CT package. - CT environment variable can be used to pass values separated by ':' on Unix and ';' for Windows. - -o, --output TEXT Report output file destination and name. Path will be - relative to the validation execution directory - and should end in the desired output filename - without file extension. The file extension will be - automatically added based on the output format. - Example: 'reports/result' will create 'reports' directory - with the filename 'result.json' (or 'result.xlsx' for Excel). - Note: Provide a valid, writable path. Absolute paths must - be valid for your operating system. - -of, --output-format [JSON|XLSX] - Output file format - -rr, --raw-report Report in a raw format as it is generated by - the engine. This flag must be used only with - --output-format JSON. 
- -mr, --max-report-rows INTEGER Maximum rows for 'Issue Details' per Excel report. When exceeded, - issues beyond the limit will not be reported. - Defaults to 1,000 'Issue Details' rows - Can be set via MAX_REPORT_ROWS env variable; - if both .env and -mr are specified, the larger value will be used. - If set to 0, no maximum will be enforced. - Excel row limit is 1,048,576 rows - -me, --max-errors-per-rule INTEGER BOOLEAN - Imposes a maximum number of errors per rule to enforce. - Usage: -me - Example: -me 100 True (make sure to use capital for True/False) - - : Maximum number of errors (integer) - - : - - false (default): Cumulative soft limit across all datasets. - After each dataset is validated for a single rule, - the limit is checked and if met or exceeded, - validation for that rule will cease for remaining datasets. - - true: Non-cumulative per-dataset limit. - Limits reported issues to per dataset per rule. - The rule continues to execute on all datasets, but only - the first issues per dataset are included in the report. - - Can be set via MAX_ERRORS_PER_RULE env variable; - if both .env and -me are specified, the larger value will be used. If either sets the per_dataset_flag to true, it will be true - If limit is set to 0, no maximum will be enforced. - No maximum is the default behavior. - -dxp, --define-xml-path Path to define-xml file. - -vx, --validate-xml Enable XML validation (default 'y' to enable, otherwise disable). - --whodrug TEXT Path to directory with WHODrug dictionary - files - --meddra TEXT Path to directory with MedDRA dictionary - files - --loinc TEXT Path to directory with LOINC dictionary - files - --medrt TEXT Path to directory with MEDRT dictionary - files - --unii TEXT Path to directory with UNII dictionary - files - --snomed-version TEXT Version of snomed to use. (ex. 2024-09-01) - --snomed-url TEXT Base url of snomed api to use. (ex. 
https://snowstorm.snomedtools.org/snowstorm/snomed-ct) - --snomed-edition TEXT Edition of snomed to use. (ex. SNOMEDCT-US) - -r, --rules TEXT Specify rule core ID ex. CORE-000001. Can be specified multiple times. - -er, --exclude-rules TEXT Specify rule core ID to exclude, ex. CORE-000001. Can be specified multiple times. - -lr, --local-rules TEXT Specify relative path to directory or file containing - local rule yml and/or json rule files. - -cs, --custom-standard Adding this flag tells engine to use a custom standard specified with -s and -v - that has been uploaded to the cache using update-cache - -cse, --custom-standard-encoding TEXT - Explicitly specify the file encoding to use - when reading custom standard files (JSON). - If not provided, the engine will attempt to - automatically detect the encoding by trying - common options (utf-8-sig, utf-8, system - default). - -vo, --verbose-output Specify this option to print rules as they - are completed - -p, --progress [verbose_output|disabled|percents|bar] - Defines how to display the validation - progress. By default a progress bar like - "[████████████████████████████--------] - 78%"is printed. - -jcf, --jsonata-custom-functions Pair containing a variable name and a Path to directory containing a set of custom JSONata functions. Can be specified multiple times - -e, --encoding TEXT File encoding for reading datasets. If not specified, defaults to utf-8. Supported encodings: utf-8, utf-16, utf-32, cp1252, latin-1, etc. - -ft, --filetype TEXT File extension to filter datasets. Has higher priority than --dataset-path parameter. - -vcp, --variables-csv-path Path to _variables.csv. Used when multiple dataset paths are provided and refer to different folders. - Not required if _variables.txt exists in all -dp directories. - -dcp, --datasets-csv-path Path to _datasets.csv. Required when multiple dataset paths are provided and refer to different folders. - --help Show this message and exit. 
-``` - -#### Available log levels - -- `debug` - Display all logs -- `info` - Display info, warnings, and error logs -- `warn` - Display warnings and errors -- `error` - Display only error logs -- `critical` - Display critical logs - -#### Validate folder - -To validate a folder using rules for SDTM-IG version 3.4 use the following command: - -```bash -python core.py validate -s sdtmig -v 3-4 -d /path/to/datasets -``` - -**_NOTE:_** Before running a validation in the CLI, you must first populate the cache with rules to validate against. See the update-cache command below. - -#### Supported Dataset Formats - -CORE supports the following dataset file formats for validation: - -- **XPT** - SAS Transport Format (version 5) -- **JSON** - Dataset-JSON > v1.1 (CDISC standard format) -- **NDJSON** - Newline Delimited JSON datasets -- **XLSX** - Excel format (Microsoft Excel files) - -**Important Notes:** - -- Define-XML files should be provided via the `--define-xml-path` (or `-dxp`) option, not through the dataset directory (`-d` or `-dp`). -- If you point to a folder containing unsupported file formats, CORE will display an error message indicating which formats are supported. - -#### File Encoding - -CORE defaults to utf-8 encoding when reading datasets. If your files use a different encoding, you must specify it using the `-e` or `--encoding` flag: - -```bash -python core.py validate -s sdtmig -v 3-4 -dp path/to/dataset.xpt -e cp1252 -``` - -The encoding name must be a valid Python codec name. Common encodings include: - -- `utf-8`, `utf-16`, `utf-32` - Unicode encodings -- `cp1252` - Windows-1252 (commonly used for files exported from Excel or SAS) -- `latin-1` - ISO-8859-1 - -If an invalid encoding is specified, CORE will display an error message with the supported encoding names. 
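Because the encoding must be a valid Python codec name, one way to check a name before passing it to `-e` is the standard library's `codecs.lookup`. The following is a minimal sketch; the helper name `is_valid_encoding` is hypothetical and not part of CORE.

```python
import codecs


def is_valid_encoding(name: str) -> bool:
    """Return True if Python recognizes `name` as a codec name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        # codecs.lookup raises LookupError for unknown codec names
        return False


# Check a few candidate names before handing one to `-e` / `--encoding`
for candidate in ("utf-8", "cp1252", "latin-1", "not-a-codec"):
    print(candidate, is_valid_encoding(candidate))
```

A name that passes this check is only known to be a Python codec; whether it matches the actual encoding of a given dataset file still has to be confirmed separately.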
- -#### Validate single rule - -```bash -python core.py validate -s sdtmig -v 3-4 -dp /path/to/dataset.json -lr /path/to/rule.json --meddra /path/to/meddra/ --whodrug /path/to/whodrug/ -``` - -Note: JSON dataset should match the format provided by the rule editor: - -```json -{ - "datasets": [ - { - "filename": "cm.xpt", - "label": "Concomitant/Concurrent medications", - "domain": "CM", - "variables": [ - { - "name": "STUDYID", - "label": "Study Identifier", - "type": "Char", - "length": 10 - } - ], - "records": { - "STUDYID": ["CDISC-TEST", "CDISC-TEST", "CDISC-TEST", "CDISC-TEST"] - } - } - ] -} -``` - -#### **Understanding the Rules Report** - -The rules report tab displays the run status of each rule selected for validation - -The possible rule run statuses are: - -- `SUCCESS` - The rule ran, data was validated, and no issues were reported. -- `SKIPPED` - The rule was unable to be run for one of the following reasons: - - Column not found in data - - Domain not found - - Schema validation is off - - Outside scope -- `ISSUE REPORTED` - The rule ran, data was validated, and issues were reported -- `EXECUTION ERROR` - The validation failed for an unknown reason caused by rule evaluation or execution. Error details are found in the `Issue Details` tab. - -#### Setting DATASET_SIZE_THRESHOLD for Large Datasets - -The CDISC Rules Engine respects the `DATASET_SIZE_THRESHOLD` environment variable to determine when to use Dask for large dataset processing. Setting this to 0 coerces Dask usage over Pandas. A .env in the root directory with this variable set will cause this implementation coercion for the CLI. 
This can also be done with the executable releases via multiple methods: - -##### Windows (Command Prompt) - -```cmd -set DATASET_SIZE_THRESHOLD=0 && core.exe validate -rest -of -config -commands -``` - -##### Windows (PowerShell) - -```bash -$env:DATASET_SIZE_THRESHOLD=0; core.exe validate -rest -of -config -commands -``` - -##### Linux/Mac (Bash) - -```bash -DATASET_SIZE_THRESHOLD=0 ./core -rest -of -config -commands -``` - -##### .env File (Alternative) - -Create a `.env` file in the root directory of the release containing: - -``` -DATASET_SIZE_THRESHOLD=0 -``` - -Then run normally: - -```bash -core.exe validate -rest -of -config -commands -``` - ---- - -**Note:** Setting `DATASET_SIZE_THRESHOLD=0` tells the engine to use Dask processing for all datasets regardless of size, size threshold defaults to 1/4 of available RAM so datasets larger than this will use Dask. See env.example to see what the CLI .env file should look like - -### Updating the Cache (`update-cache`) - -Update locally stored cache data (An api-key can be provided through the environment variable - `CDISC_LIBRARY_API_KEY`. This is stored in the .env file in the root directory, the API key does not need quotations around it.) When running a validation, CORE uses rules in the cache unless -lr is specified. Running the above command populates the cache with controlled terminology, rules, metadata, etc. - -**Note:** If a valid api key is not provided, the rules in the cache will still be updated as they are accessible without a key. All other cache data requires a valid api key. - -```bash - python core.py update-cache -``` - -**Firewall Note:** If you encounter an SSL certificate verification error (e.g., `[SSL: CERTIFICATE_VERIFY_FAILED]`), this is typically caused by corporate firewall/proxy SSL inspection. The application connects to `api.library.cdisc.org` on port 443. Contact your IT department to request either the corporate CA certificate bundle or whitelisting for this hostname. 
- -To obtain an api key, please follow the instructions found here: . Please note it can take up to an hour after sign up to have an api key issued - -The update-cache command options are: - -``` - -c, --cache-path TEXT Relative path to cache. Optional. Only required if the cache has been - moved from its default location. - --apikey TEXT CDISC Library api key. - Can also be provided as an environment - variable CDISC_LIBRARY_API_KEY - -crd, --custom-rules-directory TEXT Relative path to directory containing local - rules in yaml or JSON formats to be added - to the cache - -cr, --custom-rule TEXT Relative path to rule file in yaml or JSON - formats to be added to the cache. - Can be specified multiple times. - -rcr, --remove-custom-rules TEXT Remove rules from the cache. Can be a single - rule ID, a comma-separated list of IDs, - or 'ALL' to remove all custom rules. - -ucr, --update-custom-rule TEXT Relative path to rule file in yaml or JSON - formats. Rule will be updated in cache - with this file. - -cs, --custom-standard TEXT Relative path to JSON file containing custom - standard details. Will update the standard - if it already exists. - -cse, --custom-standard-encoding TEXT Encoding for custom standard details. - -rcs, --remove-custom-standard TEXT Removes a custom standard and version from - the cache. Can be specified multiple times. - --help Show this message and exit. -``` - -##### Custom Standards and Rules - -###### Custom Rules Management - -- **Custom rules** are stored in a flat file in the cache, indexed by their core ID (e.g., 'COMPANY-000123' or 'CUSTOM-000123'). -- Each rule is stored independently in this file, allowing for efficient lookup and management. - -###### Custom Standards Management - -- **Custom standards** act as a lookup mechanism that maps a standard identifier to a list of applicable rule IDs. 
-- When adding a custom standard, you need to provide a JSON file with the following structure: - - ```json - { - "standard_id/version": ["RULE_ID1", "RULE_ID2", "RULE_ID3", ...] - } - ``` - - For example: - - ```json - { - "cust_standard/1-0": [ - "CUSTOM-000123", - "CUSTOM-000456", - "CUSTOM-001", - "CUSTOM-002" - ] - } - ``` - -- To add or update a custom standard, use: - - ```bash - python core.py update-cache --custom-standard 'path/to/standard.json' - ``` - -- To remove custom standards, use the `--remove-custom-standard` or `-rcs` flag: - - ```bash - python core.py update-cache --remove-custom-standard 'mycustom/1-0' - ``` - -- When executing validation against a custom standard, the system will use the standard as a lookup to determine which rules to apply from the rule cache. Custom standards which match CDISC standard names and versions can be used to get library metadata for the standard while still utilizing custom rules. If a custom name does not match a CDISC standard, library metadata will not be populated. - - ```json - { - "sdtmig/3-4": ["CUSTOM-000123", "CUSTOM-000456", "CUSTOM-001", "CUSTOM-002"] - } - ``` - - This rule will get metadata from SDTMIG version 3.4 but utilize the custom rules listed in the custom standard that need this library metadata. - -##### Relationship Between Custom Rules and Standards - -- You should first add your custom rules to the cache, then create a custom standard that references those rules. -- Custom standards can reference both core CDISC rules and your own custom rules in the same standard definition. -- This two-level architecture allows for flexible rule reuse across multiple standards. 
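The custom standard file described above is plain JSON, so it can be generated with a short script. This is a minimal sketch that assumes the referenced rule IDs have already been added to the cache; the function name `write_custom_standard` and the temporary output location are illustrative and not part of CORE.

```python
import json
import tempfile
from pathlib import Path


def write_custom_standard(path, standard_id, version, rule_ids):
    """Write a {"standard_id/version": [rule IDs]} mapping to a JSON file."""
    key = f"{standard_id}/{version}"  # version uses dashes, e.g. "1-0"
    payload = {key: list(rule_ids)}
    Path(path).write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return payload


# Illustrative usage: write the file into a temporary directory
output_path = Path(tempfile.mkdtemp()) / "standard.json"
standard = write_custom_standard(
    output_path, "cust_standard", "1-0", ["CUSTOM-000123", "CUSTOM-000456"]
)
print(output_path)
print(standard)
```

The resulting file can then be passed to `python core.py update-cache --custom-standard`.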
- -##### Custom Rules Management - -- **Add custom rules**: Use the `--custom-rules-directory` or `-crd` flag to specify a directory containing local rules, or `--custom-rule` or `-cr` flag to specify a single rule file: - ```bash - python core.py update-cache --custom-rules-directory 'path/to/directory' - python core.py update-cache --custom-rule 'path/to/rule.json' --custom-rule 'path/to/rule.yaml' - ``` -- **Update a custom rule**: Use the `--update-custom-rule` or `-ucr` flag to update an existing rule in the cache: - - ```bash - python core.py update-cache --update-custom-rule 'path/to/updated_rule.yaml' - ``` - -- **Remove custom rules**: Use the `--remove-custom-rules` or `-rcr` flag to remove rules from the cache. Can be a single rule ID, a comma-separated list of IDs, or ALL to remove all custom rules: - ```bash - python core.py update-cache --remove-custom-rules 'RULE_ID' - python core.py update-cache --remove-custom-rules 'RULE_ID1,RULE_ID2,RULE_ID3' - python core.py update-cache --remove-custom-rules ALL - ``` - -### List Rules (`list-rules`) - -List published rules available in the cache - -- list all published rules: - - `python core.py list-rules` - -- list rules for standard: - - `python core.py list-rules -s sdtmig -v 3-4` - -- list rules for integrated standard (substandard: "SDTM", "SEND", "ADaM", "CDASH"): - - `python core.py list-rules -s tig -v 1-0 -ss SDTM` - -- list rules by ID: - - `python core.py list-rules -r CORE-000351 -r CORE-000591` - -- List all custom rules: - - ```bash - python core.py list-rules --custom-rules - ``` - -- List custom rules with a specific ID: - ```bash - python core.py list-rules --custom-rules -s custom_standard -v 1-0 - ``` - -### List Rule Sets (`list-rule-sets`) - -Lists all standards and versions for which rules are available: - -```bash -python core.py list-rule-sets -``` - -To list custom standards and versions instead: - -```bash -python core.py list-rule-sets --custom -# or using the short form: -python 
core.py list-rule-sets -o -``` - -**Options:** - -- `-c, --cache-path` - Relative path to cache files containing pre-loaded metadata and rules -- `-o, --custom` - Flag to list all custom standards and versions in the cache instead of CDISC standards & rules - -### List CT (`list-ct`) - -List ct packages available in the cache - -``` -Usage: python core.py list-ct [OPTIONS] - - Command to list the ct packages available in the cache. - -Options: - -c, --cache-path TEXT Relative path to cache files containing pre loaded - metadata and rules - -s, --subsets TEXT CT package subset type. Ex: sdtmct. Multiple values - allowed - --help Show this message and exit. -``` - -## Development - -### PyPI Integration - -The CDISC Rules Engine is available as a Python package through PyPI. This allows you to: - -- Import the rules engine library directly into your Python projects -- Validate data without requiring .xpt format files -- Integrate rules validation into your existing data pipelines - -```python -pip install cdisc-rules-engine -``` - -For implementation instructions, see [PYPI.md](PYPI.md). - -### Source Code - -In the terminal, navigate to the directory you intend to install CORE rules engine in - -1. Clone the repository: - - ``` - git clone https://github.com/cdisc-org/cdisc-rules-engine - ``` - -2. **IMPORTANT: Python 3.12 is required** - - CORE Rules Engine requires Python 3.12. Other versions are not supported and may cause unexpected errors or incorrect validation results. - - Check your Python version: - - ``` - python --version - ``` - - If you don't have Python 3.12, please download and install it from [python.org](https://www.python.org/downloads/) or using your system's package manager. - -### Installing dependencies - -These steps should be run before running any tests or core commands using the non compiled version. 
- -- Create a virtual environment: - - ```bash - python -m venv - ``` - - NOTE: if you have multiple versions of python on your machine, you can call python 3.12 for the virtual environment's creation instead of the above command: - - ```bash - python3.12 -m venv - ``` - -- Activate the virtual environment: - - **Linux/Mac:** - - ```bash - .//bin/activate - ``` - - **Windows:** - - ```bash - .\\Scripts\Activate - ``` - -- Install the requirements: - - ```bash - python -m pip install -r requirements-dev.txt - ``` - - Run this from the root directory. - -### Creating an executable version - -**Note:** Further directions to create your own executable are contained in [README_Build_Executable.md](README_Build_Executable.md) if you wish to build an unofficial release executable for your own use. - -**Linux:** - -```bash -pyinstaller core.py --add-data=venv/lib/python3.12/site-packages/xmlschema/schemas:xmlschema/schemas --add-data=resources/cache:resources/cache --add-data=resources/templates:resources/templates --add-data=resources/jsonata:resources/jsonata -``` - -**Windows:** - -```bash -pyinstaller core.py --add-data=".venv/Lib/site-packages/xmlschema/schemas;xmlschema/schemas" --add-data="resources/cache;resources/cache" --add-data="resources/templates;resources/templates" --add-data="resources/jsonata;resources/jsonata" -``` - -_Note .venv should be replaced with path to python installation or virtual environment_ - -This will create an executable version in the `dist` folder. The version does not require having Python installed and -can be launched by running `core` script with all necessary CLI arguments. - -### Creating .whl file - -All non-python files should be listed in `MANIFEST.in` to be included in the distribution. -Files must be in python package. 
- -**Unix/MacOS:** - -```bash -python3 -m pip install --upgrade build -python3 -m build -``` - -To install from dist folder: - -```bash -pip3 install {path_to_file}/cdisc_rules_engine-{version}-py3-none-any.whl -``` - -To upload built distributive to pypi: - -```bash -python3 -m pip install --upgrade twine -python3 -m twine upload --repository {repository_name} dist/* -``` - -**Windows (Untested):** - -```bash -py -m pip install --upgrade build -py -m build -``` - -To install from dist folder: - -```bash -pip install {path_to_file}/cdisc_rules_engine-{version}-py3-none-any.whl -``` - -To upload built distributive to pypi: - -```bash -py -m pip install --upgrade twine -py -m twine upload --repository {repository_name} dist/* -``` - -## Contributing - -### Code formatter - -This project uses the `black` code formatter, `flake8` linter for python and `prettier` for JSON, YAML and MD. -It also uses `pre-commit` to run `black`, `flake8` and `prettier` when you commit. -Both dependencies are added to _requirements-dev.txt_. - -Setting up `pre-commit` requires one extra step. After installing it you have to run: - -```bash -pre-commit install -``` - -This installs `pre-commit` in your `.git/hooks` directory. - -### Running The Tests - -From the root of the project run the following command (this will run both the unit and regression tests): - -```bash -python -m pytest tests -``` - -### Updating USDM JSON Schema - -Currently, the engine supports USDM JSON Schema validation against versions 3.0 and 4.0. The schema definition files are located at: - -- `resources/cache/usdm-3-0-schema.pkl` -- `resources/cache/usdm-4-0-schema.pkl` - -These schema definitions were derived from the OpenAPI specs located in the `https://github.com/cdisc-org/DDF-RA` repo, so in order to update the existing schemas or create a new one, run: - -1. 
`git --no-pager --git-dir DDF-RA.git show --format=format:"%B" {required tag (example: v3.0.0)}:Deliverables/API/USDM_API.json > USDM_API_{required version}.json` -2. Use `scripts/openapi-to-json.py` script to convert the OpenAPI spec to JSON schema definition -3. Use `scripts/json_pkl_converter.py` script to convert the JSON file to `.pkl` -4. Place the `.pkl` file to `resources/cache` ## Troubleshooting & Support diff --git a/docs/.nojekyll b/docs/.nojekyll new file mode 100644 index 000000000..e69de29bb diff --git a/docs/PYPI.md b/docs/PYPI.md new file mode 100644 index 000000000..21b066419 --- /dev/null +++ b/docs/PYPI.md @@ -0,0 +1,367 @@ +# PyPI Integration + +CORE is available as a Python package for direct integration into your own pipelines and tooling. + +```bash +pip install cdisc-rules-engine +``` + +This installs the engine underlying the CLI and executable, but **does not include `core.py`** or the CLI entrypoints. If you need the full CLI, use the [executable or source code](quick-start.md) instead. + +--- + +## What You'll Need + +Installing the package alone is not enough to run validations. You also need: + +1. **The rules cache** — download the contents of `resources/cache/` from the [repository](https://github.com/cdisc-org/cdisc-rules-engine) and store them somewhere in your project. Keep this in sync with the package version you're using. +2. **A CDISC Library API key** — required for controlled terminology and library metadata. See [update-cache](cli-reference.md#updating-the-cache-update-cache) for how to obtain one. + +The package also includes the USDM and Dataset-JSON schemas, available if you use the dataset reader classes in `cdisc_rules_engine/services/data_readers` or the metadata readers in `cdisc_rules_engine/services`. 
+ +--- + +## Choosing an Approach + +| | Option A: Business Rules Engine | Option B: RulesEngine Class | +|---|---|---| +| **Interface** | Low-level, rule-by-rule | High-level, dataset-oriented | +| **Data input** | pandas DataFrame | XPT or other file-based datasets | +| **Setup** | Minimal | More configuration required | +| **Best for** | Simple in-memory validation | Full multi-domain validation pipelines | + +--- + +## Loading the Rules Cache + +Both options start by loading the cache: + +```python +import os +import pathlib +import pickle +from multiprocessing.managers import SyncManager +from cdisc_rules_engine.services.cache import InMemoryCacheService + +class CacheManager(SyncManager): + pass + +CacheManager.register("InMemoryCacheService", InMemoryCacheService) + +def load_rules_cache(path_to_rules_cache): + cache_path = pathlib.Path(path_to_rules_cache) + manager = CacheManager() + manager.start() + cache = manager.InMemoryCacheService() + + files = next(os.walk(cache_path), (None, None, []))[2] + for fname in files: + with open(cache_path / fname, "rb") as f: + cache.add_all(pickle.load(f)) + + return cache +``` + +Retrieve rules for a standard and version: + +```python +from cdisc_rules_engine.utilities.utils import get_rules_cache_key + +cache = load_rules_cache("path/to/rules/cache") +# Note: version uses dashes, not dots +rules = cache.get_all_by_prefix(get_rules_cache_key("sdtmig", "3-4")) +``` + +Each rule is a dict with keys: `core_id`, `domains`, `author`, `reference`, `sensitivity`, `executability`, `description`, `authorities`, `standards`, `classes`, `rule_type`, `conditions`, `actions`, `datasets`, `output_variables`. 
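Because each rule is a plain dictionary, ordinary Python is enough to slice the list. As a minimal sketch, the hypothetical helper below selects the rules that apply to one domain using the `domains` key. The sample rules are made-up stand-ins for what the cache returns, and skipping rules whose `Exclude` list names the domain is an assumption about how that field is meant to be read.

```python
def rules_for_domain(rules, domain):
    """Select rule dicts whose `domains` entry covers `domain`."""
    selected = []
    for rule in rules:
        domains = rule.get("domains") or {}
        include = domains.get("Include", [])
        exclude = domains.get("Exclude", [])
        if domain in exclude:
            continue  # assumption: an explicit exclusion wins over "ALL"
        if domain in include or "ALL" in include:
            selected.append(rule)
    return selected


# Made-up rules shaped like the dictionaries described above
sample_rules = [
    {"core_id": "EXAMPLE-1", "domains": {"Include": ["DM"], "Exclude": []}},
    {"core_id": "EXAMPLE-2", "domains": {"Include": ["ALL"]}},
    {"core_id": "EXAMPLE-3", "domains": {"Include": ["ALL"], "Exclude": ["AE"]}},
]

print([r["core_id"] for r in rules_for_domain(sample_rules, "AE")])
```

With real data, `sample_rules` would be replaced by the list returned from `cache.get_all_by_prefix(...)`.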
+ +If you have rules in raw CDISC metadata format, convert them first: + +```python +from cdisc_rules_engine.models.rule import Rule + +rule_dict = Rule.from_cdisc_metadata(rule_metadata) +rule_obj = Rule(rule_dict) +``` + +--- + +## Option A: Business Rules Engine + +Minimal setup — good for validating a single domain against an in-memory DataFrame. + +### Prepare Your Data + +```python +from cdisc_rules_engine.models.dataset.pandas_dataset import PandasDataset +from cdisc_rules_engine.models.dataset_variable import DatasetVariable +from cdisc_rules_engine.models.sdtm_dataset_metadata import SDTMDatasetMetadata + +pandas_dataset = PandasDataset(data=df) +dataset_metadata = SDTMDatasetMetadata( + name="AE", + label="Adverse Events", + first_record=df.iloc[0].to_dict() if not df.empty else None +) +dataset_variable = DatasetVariable( + pandas_dataset, + column_prefix_map={"--": dataset_metadata.domain}, +) +``` + +`DatasetVariable` accepts additional optional arguments for richer validation: + +```python +dataset_variable = DatasetVariable( + pandas_dataset, + column_prefix_map={"--": dataset_metadata.domain}, + value_level_metadata=value_level_metadata, + column_codelist_map=variable_codelist_map, + codelist_term_maps=codelist_term_maps, +) +``` + +### Run Rules + +```python +from business_rules.engine import run +from cdisc_rules_engine.models.actions import COREActions + +ae_rules = [ + r for r in rules + if "AE" in r.get("domains", {}).get("Include", []) + or "ALL" in r.get("domains", {}).get("Include", []) +] + +all_results = [] +for rule in ae_rules: + results = [] + core_actions = COREActions( + output_container=results, + variable=dataset_variable, + dataset_metadata=dataset_metadata, + rule=rule, + value_level_metadata=None, + ) + try: + was_triggered = run(rule=rule, defined_variables=dataset_variable, defined_actions=core_actions) + if was_triggered: + all_results.extend(results) + except Exception as e: + print(f"Error in {rule.get('core_id')}: {e}") 
+``` + +`was_triggered` is `True` if issues were found. Each result in `all_results` looks like: + +```python +{ + 'executionStatus': 'success', + 'domain': 'AE', + 'variables': ['AESLIFE'], + 'message': 'AESLIFE is completed, but not equal to "N" or "Y"', + 'errors': [{'value': {'AESLIFE': 'Maybe'}, 'row': 1}] +} +``` + +--- + +## Option B: RulesEngine Class + +More setup, but handles dataset reading, preprocessing, and multi-domain validation. The source code in `cdisc_rules_engine/rules_engine.py` and the existing CLI implementation in `core.py` are the best reference for wiring this together — the initializer arguments map closely to the CLI flags documented in the [CLI Reference](cli-reference.md). + +### Step 1: Prepare Dataset Metadata + +```python +import os +import pyreadstat +from cdisc_rules_engine.models.sdtm_dataset_metadata import SDTMDatasetMetadata + +def create_dataset_metadata(file_path): + data, meta = pyreadstat.read_xport(file_path) + first_record = data.iloc[0].to_dict() if not data.empty else None + return SDTMDatasetMetadata( + name=os.path.basename(file_path).split('.')[0].upper(), + label=meta.file_label if hasattr(meta, 'file_label') else "", + filename=os.path.basename(file_path), + full_path=file_path, + file_size=os.path.getsize(file_path), + record_count=len(data), + first_record=first_record, + ) + +datasets = [ + create_dataset_metadata(os.path.join(directory, f)) + for f in os.listdir(directory) + if f.lower().endswith('.xpt') +] +``` + +You don't need to manually create `PandasDataset` or `DatasetVariable` objects for Option B — the engine handles this internally. 
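The later steps refer to a `dataset_paths` variable that this guide does not define. A plausible reading, and it is only an assumption, is that it holds the full paths of the same files used to build `datasets`. The hypothetical helper `find_xpt_paths` below sketches one way to collect them from a directory.

```python
from pathlib import Path


def find_xpt_paths(directory):
    """Return sorted full paths of the .xpt files directly inside `directory`."""
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == ".xpt"
    )


# Assumed usage, mirroring the `directory` variable from the snippet above:
# dataset_paths = find_xpt_paths(directory)
```

Sorting keeps the order stable between runs, which makes results easier to compare; nothing in the source states that the engine requires a particular order.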
+ +### Step 2: Initialize Library Metadata + +```python +from cdisc_rules_engine.models.library_metadata_container import LibraryMetadataContainer +from cdisc_rules_engine.utilities.utils import ( + get_library_variables_metadata_cache_key, + get_model_details_cache_key_from_ig, + get_standard_details_cache_key, + get_variable_codelist_map_cache_key, +) + +standard = "sdtmig" +standard_version = "3-4" +standard_substandard = None + +standard_metadata = cache.get(get_standard_details_cache_key(standard, standard_version, standard_substandard)) +model_metadata = cache.get(get_model_details_cache_key_from_ig(standard_metadata)) if standard_metadata else {} + +ct_packages = ["sdtmct-2021-12-17"] # replace with your CT package versions +ct_package_metadata = {pkg: cache.get(pkg) for pkg in ct_packages} + +library_metadata = LibraryMetadataContainer( + standard_metadata=standard_metadata, + model_metadata=model_metadata, + variables_metadata=cache.get(get_library_variables_metadata_cache_key(standard, standard_version, standard_substandard)), + variable_codelist_map=cache.get(get_variable_codelist_map_cache_key(standard, standard_version, standard_substandard)), + ct_package_metadata=ct_package_metadata, +) +``` + +### Step 3: Initialize Data Service + +```python +from cdisc_rules_engine.config import config as default_config +from cdisc_rules_engine.services.data_services import DataServiceFactory + +max_dataset_size = max(datasets, key=lambda x: x.file_size).file_size +# Set max_dataset_size=0 to force Dask processing for all datasets + +data_service_factory = DataServiceFactory( + config=default_config, + cache_service=cache, + standard=standard, + standard_version=standard_version, + standard_substandard=standard_substandard, + library_metadata=library_metadata, + max_dataset_size=max_dataset_size, +) + +data_service = data_service_factory.get_data_service(dataset_paths) +``` + +### Step 4: Initialize Rules Engine + +```python +from cdisc_rules_engine.rules_engine 
import RulesEngine + +rules_engine = RulesEngine( + cache=cache, + data_service=data_service, + config_obj=default_config, + external_dictionaries=None, + standard=standard, + standard_version=standard_version, + standard_substandard=None, + library_metadata=library_metadata, + max_dataset_size=max_dataset_size, + dataset_paths=dataset_paths, + ct_packages=ct_packages, + define_xml_path="path/to/define.xml", # optional + validate_xml=False, +) +``` + +### Step 5: Run Validation + +Note the `ConditionCompositeFactory` conversion step — this is required before passing rules to `validate_single_rule`: + +```python +import time +from cdisc_rules_engine.models.rule_conditions import ConditionCompositeFactory +from cdisc_rules_engine.models.rule_validation_result import RuleValidationResult + +start_time = time.time() +validation_results = [] + +for rule in rules: + try: + if isinstance(rule["conditions"], dict): + rule["conditions"] = ConditionCompositeFactory.get_condition_composite(rule["conditions"]) + results = rules_engine.validate_single_rule(rule, datasets) + flattened = [r for domain_results in results.values() for r in domain_results] + validation_results.append(RuleValidationResult(rule, flattened)) + except Exception as e: + print(f"Error validating rule {rule.get('core_id')}: {e}") + +elapsed_time = time.time() - start_time +``` + +### Step 6: Generate Report + +Simple text output: + +```python +import json + +with open("validation_results.txt", "w") as f: + for result in validation_results: + rule_id = result.rule.get("core_id", "Unknown") + f.write(f"Rule: {rule_id}\n") + if hasattr(result, 'violations') and result.violations: + f.write(f"Found {len(result.violations)} violations\n") + for violation in result.violations: + f.write(f" - {json.dumps(violation, default=str)}\n") + else: + f.write(" No violations found\n") + f.write("\n") +``` + +For structured output, use `ReportFactory`: + +```python +reporting_factory = ReportFactory( + datasets=datasets, + 
validation_results=validation_results, + elapsed_time=elapsed_time, + args=args, + data_service=data_service, +) +reporting_services = reporting_factory.get_report_services() +``` + +--- + +## Notes + +**Cache key format** — always use dashes in version strings (`3-4`, not `3.4`). + +**`column_prefix_map`** — maps the `--` variable prefix to the dataset domain (e.g. `{"--": "AE"}`), resolving placeholders like `--SEQ` → `AESEQ`. + +**External dictionaries** — pass an `ExternalDictionariesContainer` to `RulesEngine` if validating rules that require MedDRA, WHODrug, LOINC, UNII, MedRT, or SNOMED. See the [External Dictionary Reference](https://cdisc-org.github.io/conformance-rules-editor/#/exdictionary). + +**Dask** — set `max_dataset_size=0` when initializing `DataServiceFactory` to force Dask processing for all datasets. + +**Windows compatibility** — add `freeze_support()` for multiprocessing: + +```python +from multiprocessing import freeze_support + +if __name__ == "__main__": + freeze_support() + main() +``` + +--- + +## Troubleshooting + +- Ensure the DataFrame contains all required columns for the rules being run +- `column_prefix_map` must correctly map `"--"` to the domain (e.g. 
`{"--": "AE"}`) +- The dataset object must be a `PandasDataset` instance, not a raw pandas DataFrame +- `full_path` must be set in `SDTMDatasetMetadata` when using the `RulesEngine` approach +- The rule's `domains.Include` must match your dataset's domain +- `standard_version` format must be consistent throughout (`3-4`, not `3.4`) +- CT package metadata must be present in the cache if validating against controlled terminology +- When using `define.xml`, the file must be named `define.xml` and the path must be valid +- If using external dictionaries, verify all file paths are correct and accessible +- Don't forget the `ConditionCompositeFactory` conversion before calling `validate_single_rule` (Option B) diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 000000000..62fe5567e --- /dev/null +++ b/docs/README.md @@ -0,0 +1,13 @@ +

+ + CORE Logo + +

+ +[![Python 3.12](https://img.shields.io/badge/python-3.12-blue.svg)](https://www.python.org/downloads/release/python-3120) +[![PyPI](https://img.shields.io/pypi/v/cdisc-rules-engine.svg)](https://pypi.org/project/cdisc-rules-engine) +[![Docker](https://img.shields.io/docker/v/cdiscdocker/cdisc-rules-engine?label=docker)](https://hub.docker.com/r/cdiscdocker/cdisc-rules-engine) + +# CDISC Rules Engine (CORE) + +Open source offering of the CDISC Conformance Rules Engine — a tool for validating clinical trial data against CDISC data standards. CORE validates study data structure and conformance against both published CDISC conformance rules for the various CDISC standards and custom rules authored in the CORE rule format. \ No newline at end of file diff --git a/docs/_sidebar.md b/docs/_sidebar.md new file mode 100644 index 000000000..22eb27f0a --- /dev/null +++ b/docs/_sidebar.md @@ -0,0 +1,31 @@ + + +- [**Home**](index.md) + +- **Getting Started** + - [Quick Start](quick-start.md) + - [Supported Formats](quick-start.md#supported-dataset-formats) + +- **CORE References** + - [CLI Reference](cli-reference.md) + - [validate](cli-reference.md#validate) + - [update-cache](cli-reference.md#updating-the-cache-update-cache) + - [list-rules](cli-reference.md#list-rules) + - [list-rule-sets](cli-reference.md#list-rule-sets) + - [list-ct](cli-reference.md#list-ct) + - [Environment Variables](cli-reference.md#environment-variables) + +- **Development** + - [Development Guide](development.md) + - [Building & Testing](development.md#running-tests) + - [USDM Schema](development.md#updating-the-usdm-json-schema) + - [PyPI Integration](PYPI.md) + - [Building an Executable](build_executable.md) + +- **Community** + - [Contributing](contributing.md) + - [FAQ & Troubleshooting](faq.md) + - [GitHub Discussions](https://github.com/cdisc-org/cdisc-rules-engine/discussions/categories/q-a) + - [GitHub Issues](https://github.com/cdisc-org/cdisc-rules-engine/issues) + - [CDISC Open 
Rules](https://github.com/cdisc-org/cdisc-open-rules) + - [Rules Editing Reference](https://cdisc-org.github.io/cdisc-open-rules/#/) diff --git a/README_Build_Executable.md b/docs/build_executable.md similarity index 94% rename from README_Build_Executable.md rename to docs/build_executable.md index 047667270..a0bd10502 100644 --- a/README_Build_Executable.md +++ b/docs/build_executable.md @@ -1,24 +1,26 @@ # Building CDISC Rules Engine Executable +Pre-built executables for each release are available on the Releases page. If you need to build your own there are two approaches. + ## Option 1: Using GitHub Actions (Recommended) -### Step 1: Fork and Setup +### Step 1: Fork the Repository and Setup 1. Fork the repository: https://github.com/cdisc-org/cdisc-rules-engine 2. The workflow file `.github/workflows/build-version.yml` is already included in the main repository. It is contained within our .gitignore so you can customize it as you see fit. -### Step 2: Run the Build +### Step 2: Add your API Key 1. Go to the top bar of the fork, click Settings > Security > Secrets and Variables > Actions 2. Click **New Repository Secret** and set an action secret named CDISC_LIBRARY_API_KEY and secret as your API key -3. Go to **Actions** tab in your forked repository -4. Click "Build Custom Executable" -5. Click **Run workflow** -6. Download the artifact when complete + +### Step 3: Run the Build + +Go to the **Actions** tab → **Build Custom Executable** → **Run workflow**. Download the artifact when complete. ### Step 3: Automated Builds (Optional) -To run builds automatically, uncomment the schedule section in the workflow: +To run builds automatically, uncomment the `schedule` section in the workflow: ```yaml schedule: @@ -27,16 +29,7 @@ schedule: - cron: "0 2 * * *" # Daily at 2 AM UTC ``` -## Troubleshooting - -### Architecture Issues - -You can build executables for different operating systems using GitHub's hosted runners. 
This creates platform-specific executables that work on different environments. See: - -- https://docs.github.com/en/actions/concepts/runners/github-hosted-runners -- https://github.com/actions/runner-images - -The runner in our workflow currently builds for ubuntu-22.04 but this can be changed to your particular OS, as well as CPU architectures (This will be different for Apple M chips that use ARM architecture versus Intel chips) +--- ## Option 2: Using Docker Locally @@ -213,3 +206,16 @@ RUN chmod +x /app/dist/output/core-ubuntu-22.04/core/core && \ ### Cross-Platform Alternative For the most consistent experience across all platforms, consider using the **GitHub Actions approach (Option 1)**, which handles platform differences automatically and doesn't require local Docker setup. + + +## Troubleshooting + +### Architecture Issues + +You can build executables for different operating systems using GitHub's hosted runners. This creates platform-specific executables that work on different environments. See: + +- https://docs.github.com/en/actions/concepts/runners/github-hosted-runners +- https://github.com/actions/runner-images + +The runner in our workflow currently builds for ubuntu-22.04 but this can be changed to your particular OS, as well as CPU architectures (This will be different for Apple M chips that use ARM architecture versus Intel chips) + diff --git a/docs/cli-reference.md b/docs/cli-reference.md new file mode 100644 index 000000000..efd1f25e8 --- /dev/null +++ b/docs/cli-reference.md @@ -0,0 +1,309 @@ +# CLI Reference + +> Throughout this reference, examples use `python core.py`. If you're using the pre-built executable, replace this with `.\core.exe` (Windows) or `./core` (Linux/Mac). + +--- + +## `validate` + +Run conformance validation against a CDISC standard. + +```bash +python core.py validate --help +``` + +### Required Flags + +| Flag | Description | +|---|---| +| `-s, --standard TEXT` | CDISC standard to validate against (e.g. 
`sdtmig`, `tig`). Also via `PRODUCT` env var. | +| `-v, --version TEXT` | Standard version (e.g. `3-4`). Also via `VERSION` env var. | +| `-ss, --substandard TEXT` | **Required for TIG.** One of `SDTM`, `SEND`, `ADaM`, `CDASH`. Also via `SUBSTANDARD` env var. | +| `-uc, --use-case TEXT` | **Required for TIG.** One of `INDH`, `PROD`, `NONCLIN`, `ANALYSIS`. Also via `USE_CASE` env var. | + +### Dataset Input + +| Flag | Description | +|---|---| +| `-d, --data TEXT` | Path to directory containing dataset files. Only the last value is used if specified multiple times. | +| `-dp, --dataset-path TEXT` | Absolute path to a single dataset file. Can be specified multiple times. | +| `-dxp, --define-xml-path TEXT` | Path to Define-XML. Also via `DEFINE_XML` env var. | +| `-ft, --filetype TEXT` | File extension filter applied to the `-d` directory (e.g. `xpt`). Takes priority over `--dataset-path` when both are provided. | +| `-e, --encoding TEXT` | File encoding for reading datasets (default: `utf-8`). Common values: `cp1252`, `latin-1`, `utf-16`. | +| `-vcp, --variables-csv-path` | Path to `_variables.csv` when using multiple `-dp` paths across different folders. | +| `-dcp, --datasets-csv-path` | Path to `_datasets.csv`. Required when multiple `-dp` paths refer to different folders. | + +### Rules Selection + +| Flag | Description | +|---|---| +| `-r, --rules TEXT` | Validate only specific rule(s) by CORE ID (e.g. `CORE-000001`). Repeatable. | +| `-er, --exclude-rules TEXT` | Exclude specific rule(s) by CORE ID. Repeatable. | +| `-lr, --local-rules TEXT` | Path to directory or file containing local rule YAML/JSON files. | +| `-cs, --custom-standard` | Use a custom standard (uploaded to cache via `update-cache`). | +| `-cse, --custom-standard-encoding TEXT` | Encoding for custom standard JSON files (auto-detected if omitted). 
| + +### Controlled Terminology + +| Flag | Description | +|---|---| +| `-ct, --controlled-terminology-package TEXT` | CT package(s) to validate against. Repeatable. If Define-XML 2.1 is provided, CT is taken from the define. Also via `CT` env var (`:` separated on Unix, `;` on Windows). | + +### External Dictionaries + +| Flag | Description | +|---|---| +| `--whodrug TEXT` | Path to WHODrug dictionary files. | +| `--meddra TEXT` | Path to MedDRA dictionary files. | +| `--loinc TEXT` | Path to LOINC dictionary files. | +| `--medrt TEXT` | Path to MedRT dictionary files. | +| `--unii TEXT` | Path to UNII dictionary files. | +| `--snomed-version TEXT` | SNOMED CT version (e.g. `2024-09-01`). | +| `--snomed-url TEXT` | SNOMED CT API base URL. | +| `--snomed-edition TEXT` | SNOMED CT edition (e.g. `SNOMEDCT-US`). | + +### Output + +| Flag | Description | +|---|---| +| `-o, --output TEXT` | Output file path (without extension). Extension is added automatically based on format. | +| `-of, --output-format [JSON\|XLSX]` | Output format. | +| `-rr, --raw-report` | Raw output format (JSON only). | +| `-mr, --max-report-rows INTEGER` | Max rows in the Issue Details tab of Excel output (default: 1000; 0 = unlimited). Also via `MAX_REPORT_ROWS` env var. | +| `-me, --max-errors-per-rule INTEGER BOOLEAN` | Limit errors per rule. Format: `-me `. See below. | +| `-rt, --report-template TEXT` | Path to a custom Excel report template. | + +#### `--max-errors-per-rule` Detail + +```bash +-me 100 False # Cumulative soft limit across all datasets (default per_dataset=False) +-me 100 True # Hard limit per dataset per rule +``` + +- `False` (default): After each dataset, if cumulative errors for a rule meet the limit, that rule stops processing further datasets. +- `True`: Limits reported issues to `` per dataset per rule. The rule still executes on all datasets. +- Can also be set via `MAX_ERRORS_PER_RULE` env var. The larger of the env var and CLI values is used. 
If either sets `per_dataset_flag` to `True`, it will be `True`. + +### Performance & Behavior + +| Flag | Description | +|---|---| +| `-ca, --cache TEXT` | Relative path to cache files. | +| `-ps, --pool-size INTEGER` | Number of parallel processes. | +| `-dep, --dotenv-path` | Path to `.env` file for environment variables. | +| `-l, --log-level` | Log verbosity: `debug`, `info`, `warn`, `error`, `critical`, `disabled`. | +| `-vx, --validate-xml` | XML validation toggle (default: enabled). Pass a value other than `y` to disable. | +| `-vo, --verbose-output` | Print each rule as it completes. | +| `-p, --progress` | Progress display: `verbose_output`, `disabled`, `percents`, or `bar` (default). | +| `-jcf, --jsonata-custom-functions` | Variable name + path to directory of custom JSONata functions. Repeatable. | +| `--help` | Show the help message and exit. | + +### Understanding Rule Run Statuses + +The **Rules Report** tab in the output summarizes the outcome for each rule: + +| Status | Meaning | +|---|---| +| `SUCCESS` | Rule ran; no issues found. | +| `SKIPPED` | Rule could not run (column/domain not found, schema validation off, out of scope). | +| `ISSUE REPORTED` | Rule ran; issues were found. | +| `EXECUTION ERROR` | Rule failed for an unexpected reason. Details in the Issue Details tab. | + +### Large Dataset Processing (Dask) + +CORE uses Dask instead of pandas for datasets exceeding 1/4 of available RAM. To force Dask for all datasets: + +**Linux/Mac:** +```bash +DATASET_SIZE_THRESHOLD=0 ./core validate -s sdtmig -v 3-4 -d /path/to/datasets +``` + +**Windows (PowerShell):** +```powershell +$env:DATASET_SIZE_THRESHOLD=0; .\core.exe validate -s sdtmig -v 3-4 -d C:\path\to\datasets +``` + +Or create a `.env` file in the root directory: +``` +DATASET_SIZE_THRESHOLD=0 +``` + +--- + +## `update-cache` + +Download and refresh locally cached rules, controlled terminology, and metadata. 
+ +```bash +python core.py update-cache +``` + +An API key is required for metadata and CT. Rules are accessible without a key. Set your key via the `CDISC_LIBRARY_API_KEY` environment variable or in a `.env` file in the root directory (no quotes needed around the value). + +To obtain an API key: [wiki.cdisc.org — Getting Started](https://wiki.cdisc.org/display/LIBSUPRT/Getting+Started%3A+Access+to+CDISC+Library+API+using+API+Key+Authentication) + +> **Firewall note:** CORE connects to `api.library.cdisc.org` on port 443. If you see SSL certificate verification errors, contact your IT department to obtain the corporate CA bundle or request whitelisting for this hostname. + +### Options + +| Flag | Description | +|---|---| +| `-c, --cache-path TEXT` | Relative path to cache (only needed if cache has moved from its default location). | +| `--apikey TEXT` | CDISC Library API key. Also via `CDISC_LIBRARY_API_KEY` env var. | +| `-crd, --custom-rules-directory TEXT` | Path to a directory of local rule YAML/JSON files to add to the cache. | +| `-cr, --custom-rule TEXT` | Path to a single local rule file. Repeatable. | +| `-rcr, --remove-custom-rules TEXT` | Remove rules by ID, comma-separated list, or `ALL`. | +| `-ucr, --update-custom-rule TEXT` | Path to an updated rule file. Replaces the existing rule in cache. | +| `-cs, --custom-standard TEXT` | Path to a JSON file defining a custom standard. | +| `-cse, --custom-standard-encoding TEXT` | Encoding for the custom standard JSON file. | +| `-rcs, --remove-custom-standard TEXT` | Remove a custom standard by `standard/version`. Repeatable. | + +### Custom Rules + +Custom rules are stored in the cache indexed by their CORE ID (e.g. `COMPANY-000123`). 
+ +```bash +# Add a directory of rules +python core.py update-cache --custom-rules-directory path/to/rules/ + +# Add a single rule +python core.py update-cache --custom-rule path/to/rule.yaml + +# Update an existing rule +python core.py update-cache --update-custom-rule path/to/updated_rule.yaml + +# Remove rules +python core.py update-cache --remove-custom-rules RULE-000001 +python core.py update-cache --remove-custom-rules RULE-000001,RULE-000002 +python core.py update-cache --remove-custom-rules ALL +``` + +### Custom Standards + +Custom standards map a standard identifier to a list of applicable rule IDs. Add rules to the cache first, then create a standard that references them. + +**Standard JSON format:** +```json +{ + "cust_standard/1-0": [ + "CUSTOM-000123", + "CUSTOM-000456" + ] +} +``` + +Custom standards can also reference CDISC standard names to inherit library metadata while using custom rules: + +```json +{ + "sdtmig/3-4": ["CUSTOM-000123", "CUSTOM-000456"] +} +``` + +```bash +# Add or update a custom standard +python core.py update-cache --custom-standard path/to/standard.json + +# Remove a custom standard +python core.py update-cache --remove-custom-standard mycustom/1-0 +``` + +--- + +## `list-rules` + +List conformance rules available in the cache. + +```bash +# All published rules +python core.py list-rules + +# Rules for a specific standard and version +python core.py list-rules -s sdtmig -v 3-4 + +# Rules for a TIG substandard +python core.py list-rules -s tig -v 1-0 -ss SDTM + +# Specific rules by ID +python core.py list-rules -r CORE-000351 -r CORE-000591 + +# All custom rules +python core.py list-rules --custom-rules + +# Custom rules for a specific standard +python core.py list-rules --custom-rules -s custom_standard -v 1-0 +``` + +### Options + +| Flag | Description | +|---|---| +| `-s, --standard TEXT` | Filter by standard (e.g. `sdtmig`, `tig`). | +| `-v, --version TEXT` | Filter by standard version (e.g. `3-4`). 
| +| `-ss, --substandard TEXT` | Filter by substandard for integrated standards (e.g. `SDTM`, `ADaM`). | +| `-r, --rules TEXT` | List specific rule(s) by CORE ID. Repeatable. | +| `--custom-rules` | List custom rules instead of published CDISC rules. | +| `-c, --cache-path TEXT` | Relative path to cache. | +| `--help` | Show the help message and exit. | + +--- + +## `list-rule-sets` + +List all standards and versions for which rules are available in the cache. + +```bash +# CDISC standards +python core.py list-rule-sets + +# Custom standards only +python core.py list-rule-sets --custom +``` + +**Options:** + +| Flag | Description | +|---|---| +| `-c, --cache-path TEXT` | Relative path to cache. | +| `-o, --custom` | List custom standards and versions instead of CDISC standards. | + +--- + +## `list-ct` + +List controlled terminology packages available in the cache. + +```bash +python core.py list-ct + +# Filter by subset type +python core.py list-ct -s sdtmct +``` + +**Options:** + +| Flag | Description | +|---|---| +| `-c, --cache-path TEXT` | Relative path to cache. | +| `-s, --subsets TEXT` | CT subset type filter (e.g. `sdtmct`). Multiple values allowed. | + +--- + +## Environment Variables + +Key environment variables that can substitute for or supplement CLI flags: + +| Variable | Equivalent Flag | +|---|---| +| `CDISC_LIBRARY_API_KEY` | `--apikey` | +| `PRODUCT` | `-s` / `--standard` | +| `VERSION` | `-v` / `--version` | +| `SUBSTANDARD` | `-ss` / `--substandard` | +| `USE_CASE` | `-uc` / `--use-case` | +| `DEFINE_XML` | `-dxp` / `--define-xml-path` | +| `CT` | `-ct` (`:` separated on Unix, `;` on Windows) | +| `MAX_REPORT_ROWS` | `-mr` / `--max-report-rows` | +| `MAX_ERRORS_PER_RULE` | `-me` / `--max-errors-per-rule` | +| `DATASET_SIZE_THRESHOLD` | Dask threshold (set to `0` to force Dask) | + +These can be set in a `.env` file in the root directory. See `env.example`. 
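As a rough illustration of the environment variables above, the sketch below assembles an argument list and environment for a `validate` run from Python. `build_validate_invocation` is a hypothetical helper, not part of CORE, and the CT package name is a placeholder. It only builds values and executes nothing. Joining CT packages with `os.pathsep` follows the separator convention described above (`:` on Unix, `;` on Windows).

```python
import os


def build_validate_invocation(standard, version, data_dir, ct_packages=(), force_dask=False):
    """Hypothetical helper: return (argv, env) for a `core.py validate` run."""
    env = dict(os.environ)
    if ct_packages:
        # os.pathsep is ":" on Unix and ";" on Windows, matching the CT row above.
        env["CT"] = os.pathsep.join(ct_packages)
    if force_dask:
        # A threshold of 0 forces Dask processing for every dataset.
        env["DATASET_SIZE_THRESHOLD"] = "0"
    argv = ["python", "core.py", "validate", "-s", standard, "-v", version, "-d", data_dir]
    return argv, env


argv, env = build_validate_invocation(
    "sdtmig",
    "3-4",
    "/path/to/datasets",
    ct_packages=["sdtmct-2023-12-15"],  # placeholder package name
    force_dask=True,
)
```

The resulting pair could then be handed to `subprocess.run(argv, env=env)` from the directory containing `core.py`.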
diff --git a/docs/contributing.md b/docs/contributing.md new file mode 100644 index 000000000..5de318b49 --- /dev/null +++ b/docs/contributing.md @@ -0,0 +1,83 @@ +# Contributing + +Thank you for your interest in contributing to CORE! There are two main ways to contribute: **rule contributions** (via `cdisc-open-rules`) and **engine contributions** (code, tests, documentation in this repository). + +--- + +## Rule Contributions + +Conformance rules are maintained separately in [`cdisc-open-rules`](https://github.com/cdisc-org/cdisc-open-rules). If you want to: + +- Propose a new conformance rule +- Report an issue with an existing rule's logic +- Contribute a rule implementation + +Please open an issue or pull request in that repository. Rule authoring can also be done through the hosted [CORE Rule Editor](https://cdisc-org.github.io/conformance-rules-editor). + +For questions about rule contribution workflows, post in [GitHub Discussions](https://github.com/cdisc-org/cdisc-rules-engine/discussions). + +--- + +## Engine Contributions + +### Setting Up the Development Environment + +Follow the [Development → Environment Setup](development.md#environment-setup) guide to clone the repository and install dependencies. + +### Code Style + +This project enforces consistent formatting and linting via pre-commit hooks. + +**Tools used:** +- [`black`](https://black.readthedocs.io/) — Python code formatter +- [`flake8`](https://flake8.pycqa.org/) — Python linter +- [`prettier`](https://prettier.io/) — JSON, YAML, and Markdown formatter + +Both `black` and `flake8` are included in `requirements-dev.txt`. After installing dependencies, install the pre-commit hooks: + +```bash +pre-commit install +``` + +This installs the hooks into `.git/hooks/` so formatting and linting run automatically on each commit. 
+ +To run the checks manually: +```bash +pre-commit run --all-files +``` + +### Running Tests + +```bash +python -m pytest tests +``` + +This runs both unit and regression tests. All tests must pass before submitting a pull request. + +### Submitting a Pull Request + +1. Fork the repository and create a feature branch from `main`. +2. Make your changes, following the code style guidelines above. +3. Add or update tests for any changed behavior. +4. Ensure all tests pass locally. +5. Open a pull request with a clear description of the change and the motivation behind it. + +For larger changes or new features, consider opening a GitHub Discussion or issue first to align on the approach. + +--- + +## Reporting Bugs & Requesting Features + +Use [GitHub Issues](https://github.com/cdisc-org/cdisc-rules-engine/issues) to report bugs or request features. + +When reporting a bug, please include: +- A clear description of the problem +- Steps to reproduce +- Your operating system and Python version (or executable version) +- Relevant logs or error messages + +--- + +## Questions & Discussion + +For general questions, use the [Q&A discussion board](https://github.com/cdisc-org/cdisc-rules-engine/discussions/categories/q-a). Please search existing discussions before opening a new one. diff --git a/docs/development.md b/docs/development.md new file mode 100644 index 000000000..422a68aaa --- /dev/null +++ b/docs/development.md @@ -0,0 +1,186 @@ +# Development + +This page covers integrating CORE as a library, building from source, running tests, creating executables, and packaging. + +--- + +## PyPI Integration + +CORE is available as a Python package for direct integration into data pipelines. 
+ +```bash +pip install cdisc-rules-engine +``` + +This allows you to: +- Import the rules engine library into your Python projects +- Validate data without requiring XPT format files +- Integrate rules validation into existing pipelines + +For implementation details, see [PYPI.md](./PYPI.md). + +--- + +## Environment Setup + +**Python 3.12 is required.** Other versions are not supported and may produce unexpected errors or incorrect validation results. + +```bash +# Check your Python version +python --version + +# Clone the repository +git clone https://github.com/cdisc-org/cdisc-rules-engine.git +cd cdisc-rules-engine + +# Create a virtual environment +python -m venv venv + +# If you have multiple Python versions, be explicit: +# python3.12 -m venv venv + +# Activate (Linux/Mac) +source venv/bin/activate + +# Activate (Windows) +.\venv\Scripts\Activate + +# Install dependencies +python -m pip install -r requirements-dev.txt +``` + +--- + +## Running Tests + +From the project root, run both unit and regression tests: + +```bash +python -m pytest tests +``` + +--- + +## Creating an Executable + +Pre-built executables are available on the [Releases page](https://github.com/cdisc-org/cdisc-rules-engine/releases). If you need to build your own, see [Building an Executable](build_executable.md). 
+ +For reference, the PyInstaller commands are: + +**Linux:** +```bash +pyinstaller core.py \ + --add-data=venv/lib/python3.12/site-packages/xmlschema/schemas:xmlschema/schemas \ + --add-data=resources/cache:resources/cache \ + --add-data=resources/templates:resources/templates \ + --add-data=resources/jsonata:resources/jsonata +``` + +**Windows:** +```bash +pyinstaller core.py ^ + --add-data=".venv/Lib/site-packages/xmlschema/schemas;xmlschema/schemas" ^ + --add-data="resources/cache;resources/cache" ^ + --add-data="resources/templates;resources/templates" ^ + --add-data="resources/jsonata;resources/jsonata" +``` + +The executable is created in the `dist/` folder and does not require Python to be installed on the target machine. + +--- + +## Creating a Python Package (.whl) + +All non-Python files must be listed in `MANIFEST.in` to be included in the distribution. + +**Unix / Mac:** +```bash +python3 -m pip install --upgrade build +python3 -m build +``` + +Install locally from the `dist/` folder: +```bash +pip3 install dist/cdisc_rules_engine-{version}-py3-none-any.whl +``` + +Upload to PyPI: +```bash +python3 -m pip install --upgrade twine +python3 -m twine upload --repository {repository_name} dist/* +``` + +**Windows:** +```bash +py -m pip install --upgrade build +py -m build +``` + +Install locally: +```bash +pip install dist\cdisc_rules_engine-{version}-py3-none-any.whl +``` + +Upload to PyPI: +```bash +py -m pip install --upgrade twine +py -m twine upload --repository {repository_name} dist/* +``` + +--- + +## Updating the USDM JSON Schema + +CORE validates against USDM JSON Schema versions 3.0 and 4.0. Schema definitions are stored as `.pkl` files in `resources/cache/`: + +- `resources/cache/usdm-3-0-schema.pkl` +- `resources/cache/usdm-4-0-schema.pkl` + +These are derived from the OpenAPI specs in [`cdisc-org/DDF-RA`](https://github.com/cdisc-org/DDF-RA). To update or add a schema version: + +1. 
Extract the OpenAPI spec for the target tag: + ```bash + git --no-pager --git-dir DDF-RA.git show --format=format:"%B" {tag}:Deliverables/API/USDM_API.json > USDM_API_{version}.json + ``` + Example tag: `v3.0.0` + +2. Convert the OpenAPI spec to JSON Schema: + ```bash + python scripts/openapi-to-json.py + ``` + +3. Convert the JSON Schema to `.pkl`: + ```bash + python scripts/json_pkl_converter.py + ``` + +4. Place the resulting `.pkl` file in `resources/cache/`. + +--- + +## Dataset Format Reference (JSON) + +When validating a single rule with `--local-rules`, JSON datasets must match the Dataset-JSON format used by the rule editor: + +```json +{ + "datasets": [ + { + "filename": "cm.xpt", + "label": "Concomitant/Concurrent medications", + "domain": "CM", + "variables": [ + { + "name": "STUDYID", + "label": "Study Identifier", + "type": "Char", + "length": 10 + } + ], + "records": { + "STUDYID": ["CDISC-TEST", "CDISC-TEST", "CDISC-TEST", "CDISC-TEST"] + } + } + ] +} +``` diff --git a/docs/faq.md b/docs/faq.md new file mode 100644 index 000000000..af0bc16dc --- /dev/null +++ b/docs/faq.md @@ -0,0 +1,165 @@ +# FAQ & Troubleshooting + +> Still stuck? Post in the [Q&A discussion board](https://github.com/cdisc-org/cdisc-rules-engine/discussions/categories/q-a) or [open an issue](https://github.com/cdisc-org/cdisc-rules-engine/issues). + +--- + +## Installation & Setup + +### Which Python version does CORE require? + +**Python 3.12 is required.** Other versions are not supported and may cause unexpected errors or incorrect validation results. + +```bash +python --version +``` + +Download Python 3.12 from [python.org](https://www.python.org/downloads/) if needed. + +### The Mac executable won't open due to a security warning + +Remove the quarantine attribute: +```bash +xattr -rd com.apple.quarantine . +``` + +### I get `[SSL: CERTIFICATE_VERIFY_FAILED]` when running `update-cache` + +This is typically caused by a corporate firewall performing SSL inspection. 
CORE connects to `api.library.cdisc.org` on port 443. Contact your IT department to either obtain the corporate CA certificate bundle or request whitelisting for that hostname. + +--- + +## Cache & API Key + +### Do I need an API key? + +An API key is required for controlled terminology and library metadata. **Rules are accessible without a key.** Running `update-cache` without a key will still populate conformance rules. + +To obtain a key: [wiki.cdisc.org — Getting Started](https://wiki.cdisc.org/display/LIBSUPRT/Getting+Started%3A+Access+to+CDISC+Library+API+using+API+Key+Authentication) + +> Note: It can take up to an hour after sign-up for a key to be issued. + +### Where do I put my API key? + +Set it as the `CDISC_LIBRARY_API_KEY` environment variable, or add it (without quotes) to a `.env` file in the project root directory: + +``` +CDISC_LIBRARY_API_KEY=your_key_here +``` + +### My validation returned no results or unexpected rules +- **Console output / logs:** By default, engine logs are disabled. Use `-l` / `--log-level` to enable them. Available levels: `info`, `debug`, `warn`, `error`, `critical`. +- **The output report:** Open the results file and review the **Rule Report** tab (XLSX) or the top-level `Rules_Report` array (JSON). Rules with a status of `SKIPPED` will include a reason in the Issue Details — this is often the cause of unexpectedly absent results. +- **Scope flags:** Confirm that your `-s`, `-v`, and for TIG `-ss` arguments match the standard, version, and substandard you intended to validate against. A mismatch will cause rules to be silently out of scope. + +If you're still not seeing expected results after checking the above, post in the [Q&A discussion board](https://github.com/cdisc-org/cdisc-rules-engine/discussions/categories/q-a) and include the relevant log output and the rule IDs you expected to run. + +--- + +## Validation + +### What does each rule run status mean? 
+ +| Status | Meaning | +|---|---| +| `SUCCESS` | Rule ran; no issues found. | +| `SKIPPED` | Rule did not run (column/domain not found, schema validation off, outside scope). | +| `ISSUE_REPORTED` | Rule ran; issues were found. | +| `EXECUTION_ERROR` | Rule failed unexpectedly. Check the Issue Details tab for details. | + +### My dataset fails to load / wrong encoding + +CORE defaults to `utf-8`. If your files use a different encoding, specify it: + +```bash +python core.py validate -s sdtmig -v 3-4 -dp path/to/dataset.xpt -e cp1252 +``` + +> NOTE: You may notice a `'utf-8' codec can't decode byte` error in the logs. This is usually caused by Windows smart quotes, typically introduced by Excel, which are CP1252 encoded rather than utf-8. Because such a file is mostly utf-8 with a few CP1252 bytes for the smart quotes, the `-e` option will not resolve the error. You will need to locate these quotes and replace them manually before rerunning validation on this data. + +### Will using -d pointed at my data directory cause CORE to include my Define-XML file in the validation? + +No. Define-XML must be provided separately via `--define-xml-path` (`-dxp`). + +### Validation is very slow on large files + +Set `DATASET_SIZE_THRESHOLD=0` to force Dask processing for all datasets regardless of size: + +```bash +# Linux/Mac +DATASET_SIZE_THRESHOLD=0 ./core validate -s sdtmig -v 3-4 -d /path/to/datasets + +# Windows PowerShell +$env:DATASET_SIZE_THRESHOLD=0; .\core.exe validate -s sdtmig -v 3-4 -d C:\path\to\datasets +``` + +Or add to a `.env` file in the root directory: +``` +DATASET_SIZE_THRESHOLD=0 +``` + +By default the engine uses Dask automatically for datasets exceeding 1/4 of available RAM. + +### How do I validate against TIG? + +TIG requires `--substandard` and, in the case of custom domains, `--use-case` to identify which use case the custom domains apply to. 
+ +```bash +python core.py validate -s tig -v 1-0 -ss SDTM -uc INDH -d /path/to/datasets +``` + +Valid substandards: `SDTM`, `SEND`, `ADaM`, `CDASH` +Valid use cases: `INDH`, `PROD`, `NONCLIN`, `ANALYSIS` + +### How do I run only specific rules? + +Use `-r` to include specific CORE IDs, or `-er` to exclude them: + +```bash +# Only these rules +python core.py validate -s sdtmig -v 3-4 -d /data -r CORE-000001 -r CORE-000002 + +# All rules except these +python core.py validate -s sdtmig -v 3-4 -d /data -er CORE-000001 +``` + +You can view and clone the CDISC CORE rules at [cdisc-open-rules](https://github.com/cdisc-org/cdisc-open-rules) + + + +## Privacy & Data Protection + +### Does CORE transmit my study data anywhere during validation? + +**No. All input data remains local to the machine where CORE is executed.** Specifically: + +- Study files are read directly from the local filesystem (or a specified input path) and are never uploaded or transmitted anywhere. +- Validation runs entirely in-process on the local machine (or whatever environment CORE is deployed to — on-premises server, cloud VM, container, etc.). +- The output report is written locally upon completion (or to a specified output directory). +- All metadata required for rule execution (controlled terminology packages, standard metadata, etc.) is pre-fetched via `update-cache` and bundled at release time. Rule Validation execution itself requires no outbound network calls carrying study data. + +**No patient or personal data ever leaves the environment where CORE is installed**, supporting compliance with data protection requirements such as HIPAA, GDPR, and sponsor data governance policies. + +--- + +## Custom Rules & Standards + +### What's the difference between custom rules and custom standards? + +- **Custom rules** are individual rule definitions stored in the cache by CORE ID. +- **Custom standards** map a standard identifier to a list of rule IDs, acting as a lookup for which rules apply. 
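To make the distinction concrete, here is a minimal sketch that writes a custom standard file in the JSON shape shown in the CLI reference (a `standard/version` key mapped to a list of rule IDs). The helper name `write_custom_standard`, the output location, and the rule IDs are illustrative assumptions; the referenced rules would need to be in the cache already.

```python
import json
import tempfile
from pathlib import Path


def write_custom_standard(path, standard, version, rule_ids):
    """Hypothetical helper: write {"<standard>/<version>": [rule IDs]} as JSON."""
    key = f"{standard}/{version}"  # versions use dashes, e.g. "1-0"
    payload = {key: list(rule_ids)}
    Path(path).write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return payload


# Placeholder location and rule IDs, for illustration only.
out_file = Path(tempfile.gettempdir()) / "standard.json"
payload = write_custom_standard(
    out_file, "cust_standard", "1-0", ["CUSTOM-000123", "CUSTOM-000456"]
)
```

A file produced this way could then be registered with `python core.py update-cache --custom-standard path/to/standard.json`.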
+ +Add your custom rules first, then create a standard that references them. See [CLI Reference → update-cache](cli-reference.md#custom-rules) for details. Custom rules and standards are still a work in progress; tickets in CORE's GitHub Issues track fuller support for them. + +### Can a custom standard use CDISC library metadata? + +Yes. If you name your custom standard after an existing CDISC standard (e.g. `sdtmig/3-4`), CORE will fetch library metadata for that standard while applying your custom rules. + + +--- + +## Still Need Help? + +- **Search existing Q&A:** [GitHub Discussions](https://github.com/cdisc-org/cdisc-rules-engine/discussions/categories/q-a) +- **Open a new discussion:** For questions or usage help +- **Open an issue:** [GitHub Issues](https://github.com/cdisc-org/cdisc-rules-engine/issues) for bugs or feature requests diff --git a/docs/index.html b/docs/index.html new file mode 100644 index 000000000..a41b2c371 --- /dev/null +++ b/docs/index.html @@ -0,0 +1,48 @@ + + + + + + CDISC Rules Engine (CORE) + + + + + + +
Loading CORE documentation…
+ + + + + + + + diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 000000000..ea1600e1a --- /dev/null +++ b/docs/index.md @@ -0,0 +1,52 @@ +# CDISC Rules Engine (CORE) + +

+ CORE Logo +

+ +[![Python 3.12](https://img.shields.io/badge/python-3.12-blue.svg)](https://www.python.org/downloads/release/python-3120) +[![PyPI](https://img.shields.io/pypi/v/cdisc-rules-engine.svg)](https://pypi.org/project/cdisc-rules-engine) +[![Docker](https://img.shields.io/docker/v/cdiscdocker/cdisc-rules-engine?label=docker)](https://hub.docker.com/r/cdiscdocker/cdisc-rules-engine) + +CORE is the open-source offering of the CDISC Conformance Rules Engine — a tool for validating clinical trial data against CDISC data standards. + +--- + +## Scope + +CORE validates study datasets against published CDISC conformance rules for the following standards: + +| Standard | Description | +|---|---| +| **SDTMIG** | Study Data Tabulation Model Implementation Guide | +| **SENDIG** | Standard for Exchange of Nonclinical Data Implementation Guide | +| **ADaM** | Analysis Data Model | +| **TIG** | Tobacco Implementation Guide | +| **FDA Business Rules** | FDA submission conformance rules | +| **USDM** | Unified Study Definitions Model | + +CORE validates data *structure and conformance* against published rules. It is not a replacement for clinical review, statistical analysis, or submission readiness assessment. Rule logic is defined in [`cdisc-open-rules`](https://github.com/cdisc-org/cdisc-open-rules).
+ +--- + +## Getting Started + +| I want to… | Go to… | +|---|---| +| Run CORE without installing Python | [Quick Start → Executable](quick-start.md#option-1-pre-built-executable) | +| Run from source / contribute code | [Quick Start → From Source](quick-start.md#option-2-from-source-code) | +| See all CLI options | [CLI Reference](cli-reference.md) | +| Integrate CORE into my Python project | [Development → PyPI](development.md#pypi-integration) | +| Build or test CORE | [Development](development.md) | +| Contribute rules or code | [Contributing](contributing.md) | +| Get help or ask a question | [FAQ & Troubleshooting](faq.md) | + +--- + +## Community & Support + +- **Questions & Discussions:** [GitHub Discussions — Q&A](https://github.com/cdisc-org/cdisc-rules-engine/discussions/categories/q-a) +- **Bug Reports & Feature Requests:** [GitHub Issues](https://github.com/cdisc-org/cdisc-rules-engine/issues) +- **Rule Contributions:** [cdisc-open-rules](https://github.com/cdisc-org/cdisc-open-rules) +- **CDISC Library API:** [wiki.cdisc.org — Getting Started](https://wiki.cdisc.org/display/LIBSUPRT/Getting+Started%3A+Access+to+CDISC+Library+API+using+API+Key+Authentication) +- **Published CDISC Conformance Rules (GitHub):** [cdisc-open-rules](https://github.com/cdisc-org/cdisc-open-rules) diff --git a/docs/quick-start.md b/docs/quick-start.md new file mode 100644 index 000000000..f7be1edc9 --- /dev/null +++ b/docs/quick-start.md @@ -0,0 +1,162 @@ +# Quick Start + +> **Need help?** See [FAQ & Troubleshooting](faq.md) or post in [GitHub Discussions](https://github.com/cdisc-org/cdisc-rules-engine/discussions/categories/q-a). + +--- + +## Option 1: Pre-Built Executable + +**Best for:** Users who want to run CORE without installing Python or managing dependencies. + +### 1. Download + +Download the latest executable for your operating system from the [Releases page](https://github.com/cdisc-org/cdisc-rules-engine/releases) and unzip the downloaded file. + +### 2.
Verify the Installation + +Open a terminal in the unzipped directory and run: + +**Windows (PowerShell):** +```powershell +.\core.exe --help +``` + +**Linux / Mac:** +```bash +# Make it executable (one-time setup) +chmod +x ./core + +./core --help +``` + +> **Mac users:** If you see a security warning, remove the quarantine attribute first: +> ```bash +> xattr -rd com.apple.quarantine . +> ``` + +### 3. (Optional) Update the Cache + +Executable releases ship with a pre-populated cache, so you can skip this step and go straight to validation. If you want the latest published rules, see [CLI Reference → update-cache](cli-reference.md#updating-the-cache-update-cache) for API key setup and options. + +> **Note:** Rules published after a release may depend on engine features not present in that executable. When in doubt, wait for the next release rather than updating the cache manually. + +### 4. Run a Validation + +**Windows:** +```powershell +.\core.exe validate -s sdtmig -v 3-4 -d C:\path\to\datasets +``` + +**Linux / Mac:** +```bash +./core validate -s sdtmig -v 3-4 -d /path/to/datasets +``` + +### 5. (Optional) Run a Built-In Test + +CORE ships with a self-test command to confirm the executable is working: + +```bash +# Windows +.\core.exe test-validate json + +# Linux/Mac +./core test-validate json +``` + +Test files are cleaned up automatically after completion. + +--- + +## Option 2: From Source Code + +**Best for:** Developers, contributors, or users who need the latest features. + +### Prerequisites +- **Python 3.12** is required. Other versions are not supported. + Check your version: +```bash + python --version +``` + Install Python 3.12 from [python.org](https://www.python.org/downloads/) if needed. + +- **Git** is required to clone the repository. + Check your version: +```bash + git --version +``` + Install Git from [git-scm.com](https://git-scm.com/downloads) if needed. + +### 1. 
Clone the Repository + +```bash +git clone https://github.com/cdisc-org/cdisc-rules-engine.git +cd cdisc-rules-engine +``` + +### 2. Create and Activate a Virtual Environment + +**Linux / Mac:** +```bash +python -m venv venv +source venv/bin/activate +``` + +**Windows:** +```bash +python -m venv venv +.\venv\Scripts\Activate +``` + +> If you have multiple Python versions, use `python3.12 -m venv venv` explicitly. + +### 3. Install Dependencies + +```bash +python -m pip install -r requirements-dev.txt +``` + +### 4. Populate the Cache + +```bash +python core.py update-cache +``` + +### 5. Run a Validation + +```bash +python core.py validate -s sdtmig -v 3-4 -d /path/to/datasets +``` + +--- + +## Supported Dataset Formats + +CORE supports the following input formats: + +| Format | Description | +|---|---| +| **XPT** | SAS Transport Format (version 5) | +| **JSON** | Dataset-JSON ≥ v1.1 (CDISC standard) | +| **NDJSON** | Newline Delimited JSON | +| **XLSX** | Microsoft Excel | + +> **Note:** Define-XML files must be provided via `--define-xml-path` (`-dxp`), not through the dataset directory. + +--- + +## CLI Command Reference + +All commands and flags are documented in the [CLI Reference](cli-reference.md). + +Command summary: + +| Command | Purpose | +|---|---| +| `validate` | Run conformance validation | +| `update-cache` | Download/refresh rules, CT, and metadata | +| `list-rules` | List rules available in the cache | +| `list-rule-sets` | List standards and versions in the cache | +| `list-ct` | List controlled terminology packages in the cache | + +> Throughout these docs, examples use `python core.py`. If you're using the executable, replace this with `.\core.exe` (Windows) or `./core` (Linux/Mac). 
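The substitution rule in the note above can be sketched as a small Python helper for scripts that need to run on either setup. `core_command` is a hypothetical name used only for illustration, and it assumes the executable sits in the current working directory.

```python
import platform


def core_command(args, use_executable=True, system=None):
    """Build an argument list for invoking CORE.

    Hypothetical helper: picks `python core.py`, `.\\core.exe`, or `./core`
    following the substitution rule in the note above.
    """
    system = system or platform.system()
    if not use_executable:
        launcher = ["python", "core.py"]
    elif system == "Windows":
        launcher = [".\\core.exe"]
    else:
        launcher = ["./core"]
    return launcher + list(args)


print(core_command(
    ["validate", "-s", "sdtmig", "-v", "3-4", "-d", "/path/to/datasets"],
    system="Linux",
))
```

The returned list can be handed to `subprocess.run` unchanged, which avoids shell quoting issues with dataset paths.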
From 289f5d8460376ed7d7e0f5f787e73127b6f51e9f Mon Sep 17 00:00:00 2001 From: Samuel Johnson Date: Mon, 11 May 2026 19:48:43 -0400 Subject: [PATCH 2/8] prettier --- README.md | 15 ++- docs/PYPI.md | 12 +-- docs/README.md | 2 +- docs/build_executable.md | 2 - docs/cli-reference.md | 203 ++++++++++++++++++++------------------- docs/contributing.md | 3 + docs/development.md | 15 ++- docs/faq.md | 24 ++--- docs/index.md | 36 +++---- docs/quick-start.md | 42 +++++--- 10 files changed, 190 insertions(+), 164 deletions(-) diff --git a/README.md b/README.md index 769e5ed68..714f2cc5c 100644 --- a/README.md +++ b/README.md @@ -16,14 +16,13 @@ Full documentation lives in the [`docs/`](docs/index.md) directory and is hosted > **[https://cdisc-org.github.io/cdisc-rules-engine](https://cdisc-org.github.io/cdisc-rules-engine)** -| | | -|---|---| -| [Quick Start](docs/quick-start.md) | Download the executable or run from source | -| [CLI Reference](docs/cli-reference.md) | All commands and flags | -| [Development](docs/development.md) | PyPI integration, building, testing | -| [Contributing](docs/contributing.md) | Code style, tests, rule contributions | -| [FAQ & Troubleshooting](docs/faq.md) | Common issues and questions | - +| | | +| -------------------------------------- | ------------------------------------------ | +| [Quick Start](docs/quick-start.md) | Download the executable or run from source | +| [CLI Reference](docs/cli-reference.md) | All commands and flags | +| [Development](docs/development.md) | PyPI integration, building, testing | +| [Contributing](docs/contributing.md) | Code style, tests, rule contributions | +| [FAQ & Troubleshooting](docs/faq.md) | Common issues and questions | ## Troubleshooting & Support diff --git a/docs/PYPI.md b/docs/PYPI.md index 21b066419..d0d0ffd16 100644 --- a/docs/PYPI.md +++ b/docs/PYPI.md @@ -23,12 +23,12 @@ The package also includes the USDM and Dataset-JSON schemas, available if you us ## Choosing an Approach -| | Option A: 
Business Rules Engine | Option B: RulesEngine Class | -|---|---|---| -| **Interface** | Low-level, rule-by-rule | High-level, dataset-oriented | -| **Data input** | pandas DataFrame | XPT or other file-based datasets | -| **Setup** | Minimal | More configuration required | -| **Best for** | Simple in-memory validation | Full multi-domain validation pipelines | +| | Option A: Business Rules Engine | Option B: RulesEngine Class | +| -------------- | ------------------------------- | -------------------------------------- | +| **Interface** | Low-level, rule-by-rule | High-level, dataset-oriented | +| **Data input** | pandas DataFrame | XPT or other file-based datasets | +| **Setup** | Minimal | More configuration required | +| **Best for** | Simple in-memory validation | Full multi-domain validation pipelines | --- diff --git a/docs/README.md b/docs/README.md index 62fe5567e..dd8f821a1 100644 --- a/docs/README.md +++ b/docs/README.md @@ -10,4 +10,4 @@ # CDISC Rules Engine (CORE) -Open source offering of the CDISC Conformance Rules Engine — a tool for validating clinical trial data against CDISC data standards. CORE validates study data structure and conformance against both published CDISC conformance rules for the various CDISC standards and custom rules authored in the CORE rule format. \ No newline at end of file +Open source offering of the CDISC Conformance Rules Engine — a tool for validating clinical trial data against CDISC data standards. CORE validates study data structure and conformance against both published CDISC conformance rules for the various CDISC standards and custom rules authored in the CORE rule format. 
diff --git a/docs/build_executable.md b/docs/build_executable.md index a0bd10502..ccea09cce 100644 --- a/docs/build_executable.md +++ b/docs/build_executable.md @@ -207,7 +207,6 @@ RUN chmod +x /app/dist/output/core-ubuntu-22.04/core/core && \ For the most consistent experience across all platforms, consider using the **GitHub Actions approach (Option 1)**, which handles platform differences automatically and doesn't require local Docker setup. - ## Troubleshooting ### Architecture Issues @@ -218,4 +217,3 @@ You can build executables for different operating systems using GitHub's hosted - https://github.com/actions/runner-images The runner in our workflow currently builds for ubuntu-22.04 but this can be changed to your particular OS, as well as CPU architectures (This will be different for Apple M chips that use ARM architecture versus Intel chips) - diff --git a/docs/cli-reference.md b/docs/cli-reference.md index efd1f25e8..98ad08784 100644 --- a/docs/cli-reference.md +++ b/docs/cli-reference.md @@ -14,64 +14,64 @@ python core.py validate --help ### Required Flags -| Flag | Description | -|---|---| -| `-s, --standard TEXT` | CDISC standard to validate against (e.g. `sdtmig`, `tig`). Also via `PRODUCT` env var. | -| `-v, --version TEXT` | Standard version (e.g. `3-4`). Also via `VERSION` env var. | -| `-ss, --substandard TEXT` | **Required for TIG.** One of `SDTM`, `SEND`, `ADaM`, `CDASH`. Also via `SUBSTANDARD` env var. | -| `-uc, --use-case TEXT` | **Required for TIG.** One of `INDH`, `PROD`, `NONCLIN`, `ANALYSIS`. Also via `USE_CASE` env var. | +| Flag | Description | +| ------------------------- | ------------------------------------------------------------------------------------------------ | +| `-s, --standard TEXT` | CDISC standard to validate against (e.g. `sdtmig`, `tig`). Also via `PRODUCT` env var. | +| `-v, --version TEXT` | Standard version (e.g. `3-4`). Also via `VERSION` env var. 
| +| `-ss, --substandard TEXT` | **Required for TIG.** One of `SDTM`, `SEND`, `ADaM`, `CDASH`. Also via `SUBSTANDARD` env var. | +| `-uc, --use-case TEXT` | **Required for TIG.** One of `INDH`, `PROD`, `NONCLIN`, `ANALYSIS`. Also via `USE_CASE` env var. | ### Dataset Input -| Flag | Description | -|---|---| -| `-d, --data TEXT` | Path to directory containing dataset files. Only the last value is used if specified multiple times. | -| `-dp, --dataset-path TEXT` | Absolute path to a single dataset file. Can be specified multiple times. | -| `-dxp, --define-xml-path TEXT` | Path to Define-XML. Also via `DEFINE_XML` env var. | -| `-ft, --filetype TEXT` | File extension filter applied to the `-d` directory (e.g. `xpt`). Takes priority over `--dataset-path` when both are provided. | -| `-e, --encoding TEXT` | File encoding for reading datasets (default: `utf-8`). Common values: `cp1252`, `latin-1`, `utf-16`. | -| `-vcp, --variables-csv-path` | Path to `_variables.csv` when using multiple `-dp` paths across different folders. | -| `-dcp, --datasets-csv-path` | Path to `_datasets.csv`. Required when multiple `-dp` paths refer to different folders. | +| Flag | Description | +| ------------------------------ | ------------------------------------------------------------------------------------------------------------------------------ | +| `-d, --data TEXT` | Path to directory containing dataset files. Only the last value is used if specified multiple times. | +| `-dp, --dataset-path TEXT` | Absolute path to a single dataset file. Can be specified multiple times. | +| `-dxp, --define-xml-path TEXT` | Path to Define-XML. Also via `DEFINE_XML` env var. | +| `-ft, --filetype TEXT` | File extension filter applied to the `-d` directory (e.g. `xpt`). Takes priority over `--dataset-path` when both are provided. | +| `-e, --encoding TEXT` | File encoding for reading datasets (default: `utf-8`). Common values: `cp1252`, `latin-1`, `utf-16`. 
| +| `-vcp, --variables-csv-path` | Path to `_variables.csv` when using multiple `-dp` paths across different folders. | +| `-dcp, --datasets-csv-path` | Path to `_datasets.csv`. Required when multiple `-dp` paths refer to different folders. | ### Rules Selection -| Flag | Description | -|---|---| -| `-r, --rules TEXT` | Validate only specific rule(s) by CORE ID (e.g. `CORE-000001`). Repeatable. | -| `-er, --exclude-rules TEXT` | Exclude specific rule(s) by CORE ID. Repeatable. | -| `-lr, --local-rules TEXT` | Path to directory or file containing local rule YAML/JSON files. | -| `-cs, --custom-standard` | Use a custom standard (uploaded to cache via `update-cache`). | -| `-cse, --custom-standard-encoding TEXT` | Encoding for custom standard JSON files (auto-detected if omitted). | +| Flag | Description | +| --------------------------------------- | --------------------------------------------------------------------------- | +| `-r, --rules TEXT` | Validate only specific rule(s) by CORE ID (e.g. `CORE-000001`). Repeatable. | +| `-er, --exclude-rules TEXT` | Exclude specific rule(s) by CORE ID. Repeatable. | +| `-lr, --local-rules TEXT` | Path to directory or file containing local rule YAML/JSON files. | +| `-cs, --custom-standard` | Use a custom standard (uploaded to cache via `update-cache`). | +| `-cse, --custom-standard-encoding TEXT` | Encoding for custom standard JSON files (auto-detected if omitted). | ### Controlled Terminology -| Flag | Description | -|---|---| +| Flag | Description | +| -------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `-ct, --controlled-terminology-package TEXT` | CT package(s) to validate against. Repeatable. If Define-XML 2.1 is provided, CT is taken from the define. Also via `CT` env var (`:` separated on Unix, `;` on Windows). 
| ### External Dictionaries -| Flag | Description | -|---|---| -| `--whodrug TEXT` | Path to WHODrug dictionary files. | -| `--meddra TEXT` | Path to MedDRA dictionary files. | -| `--loinc TEXT` | Path to LOINC dictionary files. | -| `--medrt TEXT` | Path to MedRT dictionary files. | -| `--unii TEXT` | Path to UNII dictionary files. | -| `--snomed-version TEXT` | SNOMED CT version (e.g. `2024-09-01`). | -| `--snomed-url TEXT` | SNOMED CT API base URL. | +| Flag | Description | +| ----------------------- | --------------------------------------- | +| `--whodrug TEXT` | Path to WHODrug dictionary files. | +| `--meddra TEXT` | Path to MedDRA dictionary files. | +| `--loinc TEXT` | Path to LOINC dictionary files. | +| `--medrt TEXT` | Path to MedRT dictionary files. | +| `--unii TEXT` | Path to UNII dictionary files. | +| `--snomed-version TEXT` | SNOMED CT version (e.g. `2024-09-01`). | +| `--snomed-url TEXT` | SNOMED CT API base URL. | | `--snomed-edition TEXT` | SNOMED CT edition (e.g. `SNOMEDCT-US`). | ### Output -| Flag | Description | -|---|---| -| `-o, --output TEXT` | Output file path (without extension). Extension is added automatically based on format. | -| `-of, --output-format [JSON\|XLSX]` | Output format. | -| `-rr, --raw-report` | Raw output format (JSON only). | -| `-mr, --max-report-rows INTEGER` | Max rows in the Issue Details tab of Excel output (default: 1000; 0 = unlimited). Also via `MAX_REPORT_ROWS` env var. | -| `-me, --max-errors-per-rule INTEGER BOOLEAN` | Limit errors per rule. Format: `-me `. See below. | -| `-rt, --report-template TEXT` | Path to a custom Excel report template. | +| Flag | Description | +| -------------------------------------------- | --------------------------------------------------------------------------------------------------------------------- | +| `-o, --output TEXT` | Output file path (without extension). Extension is added automatically based on format. | +| `-of, --output-format [JSON\|XLSX]` | Output format. 
| +| `-rr, --raw-report` | Raw output format (JSON only). | +| `-mr, --max-report-rows INTEGER` | Max rows in the Issue Details tab of Excel output (default: 1000; 0 = unlimited). Also via `MAX_REPORT_ROWS` env var. | +| `-me, --max-errors-per-rule INTEGER BOOLEAN` | Limit errors per rule. Format: `-me `. See below. | +| `-rt, --report-template TEXT` | Path to a custom Excel report template. | #### `--max-errors-per-rule` Detail @@ -86,44 +86,47 @@ python core.py validate --help ### Performance & Behavior -| Flag | Description | -|---|---| -| `-ca, --cache TEXT` | Relative path to cache files. | -| `-ps, --pool-size INTEGER` | Number of parallel processes. | -| `-dep, --dotenv-path` | Path to `.env` file for environment variables. | -| `-l, --log-level` | Log verbosity: `debug`, `info`, `warn`, `error`, `critical`, `disabled`. | -| `-vx, --validate-xml` | XML validation toggle (default: enabled). Pass a value other than `y` to disable. | -| `-vo, --verbose-output` | Print each rule as it completes. | -| `-p, --progress` | Progress display: `verbose_output`, `disabled`, `percents`, or `bar` (default). | -| `-jcf, --jsonata-custom-functions` | Variable name + path to directory of custom JSONata functions. Repeatable. | -| `--help` | Show the help message and exit. | +| Flag | Description | +| ---------------------------------- | --------------------------------------------------------------------------------- | +| `-ca, --cache TEXT` | Relative path to cache files. | +| `-ps, --pool-size INTEGER` | Number of parallel processes. | +| `-dep, --dotenv-path` | Path to `.env` file for environment variables. | +| `-l, --log-level` | Log verbosity: `debug`, `info`, `warn`, `error`, `critical`, `disabled`. | +| `-vx, --validate-xml` | XML validation toggle (default: enabled). Pass a value other than `y` to disable. | +| `-vo, --verbose-output` | Print each rule as it completes. 
| +| `-p, --progress` | Progress display: `verbose_output`, `disabled`, `percents`, or `bar` (default). | +| `-jcf, --jsonata-custom-functions` | Variable name + path to directory of custom JSONata functions. Repeatable. | +| `--help` | Show the help message and exit. | ### Understanding Rule Run Statuses The **Rules Report** tab in the output summarizes the outcome for each rule: -| Status | Meaning | -|---|---| -| `SUCCESS` | Rule ran; no issues found. | -| `SKIPPED` | Rule could not run (column/domain not found, schema validation off, out of scope). | -| `ISSUE REPORTED` | Rule ran; issues were found. | -| `EXECUTION ERROR` | Rule failed for an unexpected reason. Details in the Issue Details tab. | +| Status | Meaning | +| ----------------- | ---------------------------------------------------------------------------------- | +| `SUCCESS` | Rule ran; no issues found. | +| `SKIPPED` | Rule could not run (column/domain not found, schema validation off, out of scope). | +| `ISSUE REPORTED` | Rule ran; issues were found. | +| `EXECUTION ERROR` | Rule failed for an unexpected reason. Details in the Issue Details tab. | ### Large Dataset Processing (Dask) CORE uses Dask instead of pandas for datasets exceeding 1/4 of available RAM. To force Dask for all datasets: **Linux/Mac:** + ```bash DATASET_SIZE_THRESHOLD=0 ./core validate -s sdtmig -v 3-4 -d /path/to/datasets ``` **Windows (PowerShell):** + ```powershell $env:DATASET_SIZE_THRESHOLD=0; .\core.exe validate -s sdtmig -v 3-4 -d C:\path\to\datasets ``` Or create a `.env` file in the root directory: + ``` DATASET_SIZE_THRESHOLD=0 ``` @@ -146,17 +149,17 @@ To obtain an API key: [wiki.cdisc.org — Getting Started](https://wiki.cdisc.or ### Options -| Flag | Description | -|---|---| -| `-c, --cache-path TEXT` | Relative path to cache (only needed if cache has moved from its default location). | -| `--apikey TEXT` | CDISC Library API key. Also via `CDISC_LIBRARY_API_KEY` env var. 
| -| `-crd, --custom-rules-directory TEXT` | Path to a directory of local rule YAML/JSON files to add to the cache. | -| `-cr, --custom-rule TEXT` | Path to a single local rule file. Repeatable. | -| `-rcr, --remove-custom-rules TEXT` | Remove rules by ID, comma-separated list, or `ALL`. | -| `-ucr, --update-custom-rule TEXT` | Path to an updated rule file. Replaces the existing rule in cache. | -| `-cs, --custom-standard TEXT` | Path to a JSON file defining a custom standard. | -| `-cse, --custom-standard-encoding TEXT` | Encoding for the custom standard JSON file. | -| `-rcs, --remove-custom-standard TEXT` | Remove a custom standard by `standard/version`. Repeatable. | +| Flag | Description | +| --------------------------------------- | ---------------------------------------------------------------------------------- | +| `-c, --cache-path TEXT` | Relative path to cache (only needed if cache has moved from its default location). | +| `--apikey TEXT` | CDISC Library API key. Also via `CDISC_LIBRARY_API_KEY` env var. | +| `-crd, --custom-rules-directory TEXT` | Path to a directory of local rule YAML/JSON files to add to the cache. | +| `-cr, --custom-rule TEXT` | Path to a single local rule file. Repeatable. | +| `-rcr, --remove-custom-rules TEXT` | Remove rules by ID, comma-separated list, or `ALL`. | +| `-ucr, --update-custom-rule TEXT` | Path to an updated rule file. Replaces the existing rule in cache. | +| `-cs, --custom-standard TEXT` | Path to a JSON file defining a custom standard. | +| `-cse, --custom-standard-encoding TEXT` | Encoding for the custom standard JSON file. | +| `-rcs, --remove-custom-standard TEXT` | Remove a custom standard by `standard/version`. Repeatable. | ### Custom Rules @@ -183,12 +186,10 @@ python core.py update-cache --remove-custom-rules ALL Custom standards map a standard identifier to a list of applicable rule IDs. Add rules to the cache first, then create a standard that references them. 
**Standard JSON format:** + ```json { - "cust_standard/1-0": [ - "CUSTOM-000123", - "CUSTOM-000456" - ] + "cust_standard/1-0": ["CUSTOM-000123", "CUSTOM-000456"] } ``` @@ -236,15 +237,15 @@ python core.py list-rules --custom-rules -s custom_standard -v 1-0 ### Options -| Flag | Description | -|---|---| -| `-s, --standard TEXT` | Filter by standard (e.g. `sdtmig`, `tig`). | -| `-v, --version TEXT` | Filter by standard version (e.g. `3-4`). | +| Flag | Description | +| ------------------------- | --------------------------------------------------------------------- | +| `-s, --standard TEXT` | Filter by standard (e.g. `sdtmig`, `tig`). | +| `-v, --version TEXT` | Filter by standard version (e.g. `3-4`). | | `-ss, --substandard TEXT` | Filter by substandard for integrated standards (e.g. `SDTM`, `ADaM`). | -| `-r, --rules TEXT` | List specific rule(s) by CORE ID. Repeatable. | -| `--custom-rules` | List custom rules instead of published CDISC rules. | -| `-c, --cache-path TEXT` | Relative path to cache. | -| `--help` | Show the help message and exit. | +| `-r, --rules TEXT` | List specific rule(s) by CORE ID. Repeatable. | +| `--custom-rules` | List custom rules instead of published CDISC rules. | +| `-c, --cache-path TEXT` | Relative path to cache. | +| `--help` | Show the help message and exit. | --- @@ -262,10 +263,10 @@ python core.py list-rule-sets --custom **Options:** -| Flag | Description | -|---|---| -| `-c, --cache-path TEXT` | Relative path to cache. | -| `-o, --custom` | List custom standards and versions instead of CDISC standards. | +| Flag | Description | +| ----------------------- | -------------------------------------------------------------- | +| `-c, --cache-path TEXT` | Relative path to cache. | +| `-o, --custom` | List custom standards and versions instead of CDISC standards. | --- @@ -282,10 +283,10 @@ python core.py list-ct -s sdtmct **Options:** -| Flag | Description | -|---|---| -| `-c, --cache-path TEXT` | Relative path to cache. 
| -| `-s, --subsets TEXT` | CT subset type filter (e.g. `sdtmct`). Multiple values allowed. | +| Flag | Description | +| ----------------------- | --------------------------------------------------------------- | +| `-c, --cache-path TEXT` | Relative path to cache. | +| `-s, --subsets TEXT` | CT subset type filter (e.g. `sdtmct`). Multiple values allowed. | --- @@ -293,17 +294,17 @@ python core.py list-ct -s sdtmct Key environment variables that can substitute for or supplement CLI flags: -| Variable | Equivalent Flag | -|---|---| -| `CDISC_LIBRARY_API_KEY` | `--apikey` | -| `PRODUCT` | `-s` / `--standard` | -| `VERSION` | `-v` / `--version` | -| `SUBSTANDARD` | `-ss` / `--substandard` | -| `USE_CASE` | `-uc` / `--use-case` | -| `DEFINE_XML` | `-dxp` / `--define-xml-path` | -| `CT` | `-ct` (`:` separated on Unix, `;` on Windows) | -| `MAX_REPORT_ROWS` | `-mr` / `--max-report-rows` | -| `MAX_ERRORS_PER_RULE` | `-me` / `--max-errors-per-rule` | -| `DATASET_SIZE_THRESHOLD` | Dask threshold (set to `0` to force Dask) | +| Variable | Equivalent Flag | +| ------------------------ | --------------------------------------------- | +| `CDISC_LIBRARY_API_KEY` | `--apikey` | +| `PRODUCT` | `-s` / `--standard` | +| `VERSION` | `-v` / `--version` | +| `SUBSTANDARD` | `-ss` / `--substandard` | +| `USE_CASE` | `-uc` / `--use-case` | +| `DEFINE_XML` | `-dxp` / `--define-xml-path` | +| `CT` | `-ct` (`:` separated on Unix, `;` on Windows) | +| `MAX_REPORT_ROWS` | `-mr` / `--max-report-rows` | +| `MAX_ERRORS_PER_RULE` | `-me` / `--max-errors-per-rule` | +| `DATASET_SIZE_THRESHOLD` | Dask threshold (set to `0` to force Dask) | These can be set in a `.env` file in the root directory. See `env.example`. 
diff --git a/docs/contributing.md b/docs/contributing.md index 5de318b49..d53a3315e 100644 --- a/docs/contributing.md +++ b/docs/contributing.md @@ -29,6 +29,7 @@ Follow the [Development → Environment Setup](development.md#environment-setup) This project enforces consistent formatting and linting via pre-commit hooks. **Tools used:** + - [`black`](https://black.readthedocs.io/) — Python code formatter - [`flake8`](https://flake8.pycqa.org/) — Python linter - [`prettier`](https://prettier.io/) — JSON, YAML, and Markdown formatter @@ -42,6 +43,7 @@ pre-commit install This installs the hooks into `.git/hooks/` so formatting and linting run automatically on each commit. To run the checks manually: + ```bash pre-commit run --all-files ``` @@ -71,6 +73,7 @@ For larger changes or new features, consider opening a GitHub Discussion or issu Use [GitHub Issues](https://github.com/cdisc-org/cdisc-rules-engine/issues) to report bugs or request features. When reporting a bug, please include: + - A clear description of the problem - Steps to reproduce - Your operating system and Python version (or executable version) diff --git a/docs/development.md b/docs/development.md index 422a68aaa..b1b5684f8 100644 --- a/docs/development.md +++ b/docs/development.md @@ -6,13 +6,14 @@ This page covers integrating CORE as a library, building from source, running te ## PyPI Integration -CORE is available as a Python package for direct integration into data pipelines. +CORE is available as a Python package for direct integration into data pipelines. 
```bash pip install cdisc-rules-engine ``` This allows you to: + - Import the rules engine library into your Python projects - Validate data without requiring XPT format files - Integrate rules validation into existing pipelines @@ -68,6 +69,7 @@ Pre-built executables are available on the [Releases page](https://github.com/cd For reference, the PyInstaller commands are: **Linux:** + ```bash pyinstaller core.py \ --add-data=venv/lib/python3.12/site-packages/xmlschema/schemas:xmlschema/schemas \ @@ -77,6 +79,7 @@ pyinstaller core.py \ ``` **Windows:** + ```bash pyinstaller core.py ^ --add-data=".venv/Lib/site-packages/xmlschema/schemas;xmlschema/schemas" ^ @@ -94,34 +97,40 @@ The executable is created in the `dist/` folder and does not require Python to b All non-Python files must be listed in `MANIFEST.in` to be included in the distribution. **Unix / Mac:** + ```bash python3 -m pip install --upgrade build python3 -m build ``` Install locally from the `dist/` folder: + ```bash pip3 install dist/cdisc_rules_engine-{version}-py3-none-any.whl ``` Upload to PyPI: + ```bash python3 -m pip install --upgrade twine python3 -m twine upload --repository {repository_name} dist/* ``` **Windows:** + ```bash py -m pip install --upgrade build py -m build ``` Install locally: + ```bash pip install dist\cdisc_rules_engine-{version}-py3-none-any.whl ``` Upload to PyPI: + ```bash py -m pip install --upgrade twine py -m twine upload --repository {repository_name} dist/* @@ -139,17 +148,21 @@ CORE validates against USDM JSON Schema versions 3.0 and 4.0. Schema definitions These are derived from the OpenAPI specs in [`cdisc-org/DDF-RA`](https://github.com/cdisc-org/DDF-RA). To update or add a schema version: 1. Extract the OpenAPI spec for the target tag: + ```bash git --no-pager --git-dir DDF-RA.git show --format=format:"%B" {tag}:Deliverables/API/USDM_API.json > USDM_API_{version}.json ``` + Example tag: `v3.0.0` 2. 
Convert the OpenAPI spec to JSON Schema: + ```bash python scripts/openapi-to-json.py ``` 3. Convert the JSON Schema to `.pkl`: + ```bash python scripts/json_pkl_converter.py ``` diff --git a/docs/faq.md b/docs/faq.md index af0bc16dc..11e23579c 100644 --- a/docs/faq.md +++ b/docs/faq.md @@ -19,6 +19,7 @@ Download Python 3.12 from [python.org](https://www.python.org/downloads/) if nee ### The Mac executable won't open due to a security warning Remove the quarantine attribute: + ```bash xattr -rd com.apple.quarantine . ``` @@ -48,7 +49,8 @@ CDISC_LIBRARY_API_KEY=your_key_here ``` ### My validation returned no results or unexpected rules -- **Console output / logs:** By default, engine logs are disabled. Use `-l` / `--log-level` to enable them. Available levels: `info`, `debug`, `warn`, `error`, `critical`. + +- **Console output / logs:** By default, engine logs are disabled. Use `-l` / `--log-level` to enable them. Available levels: `info`, `debug`, `warn`, `error`, `critical`. - **The output report:** Open the results file and review the **Rule Report** tab (XLSX) or the top-level `Rules_Report` array (JSON). Rules with a status of `SKIPPED` will include a reason in the Issue Details — this is often the cause of unexpectedly absent results. - **Scope flags:** Confirm that your `-s`, `-v`, and for TIG `-ss` arguments match the standard, version, and substandard you intended to validate against. A mismatch will cause rules to be silently out of scope. @@ -60,12 +62,12 @@ If you're still not seeing expected results after checking the above, post in th ### What does each rule run status mean? -| Status | Meaning | -|---|---| -| `SUCCESS` | Rule ran; no issues found. | -| `SKIPPED` | Rule did not run (column/domain not found, schema validation off, outside scope). | -| `ISSUE_REPORTED` | Rule ran; issues were found. | -| `EXECUTION_ERROR` | Rule failed unexpectedly. Check the Issue Details tab for details. 
| +| Status | Meaning | +| ----------------- | --------------------------------------------------------------------------------- | +| `SUCCESS` | Rule ran; no issues found. | +| `SKIPPED` | Rule did not run (column/domain not found, schema validation off, outside scope). | +| `ISSUE_REPORTED` | Rule ran; issues were found. | +| `EXECUTION_ERROR` | Rule failed unexpectedly. Check the Issue Details tab for details. | ### My dataset fails to load / wrong encoding @@ -75,7 +77,7 @@ CORE defaults to `utf-8`. If your files use a different encoding, specify it: python core.py validate -s sdtmig -v 3-4 -dp path/to/dataset.xpt -e cp1252 ``` ->NOTE: you may notice a `'utf-9' codec can't decode byte` error in the logs. This is usually due to Windows Smart Quotes, produced in excel, which are CP1252 encoded, not utf-8. Unfortunately, Windows Smart Quotes produce a file that is mostly utf-8 with some CP1252 for the smart quotes so the -e command will not work to resolve this. You will need to locate these quotes and manually change them before being able to rerun this data. +> NOTE: you may notice a `'utf-8' codec can't decode byte` error in the logs. This is usually due to Windows Smart Quotes, produced in Excel, which are CP1252 encoded, not utf-8. Unfortunately, Windows Smart Quotes produce a file that is mostly utf-8 with some CP1252 for the smart quotes, so the `-e` option will not resolve this. You will need to locate these quotes and manually change them before rerunning the validation. ### Will using -d pointed at my data directory cause CORE to include my Define-XML file in the validation?
@@ -94,6 +96,7 @@ $env:DATASET_SIZE_THRESHOLD=0; .\core.exe validate -s sdtmig -v 3-4 -d C:\path\t ``` Or add to a `.env` file in the root directory: + ``` DATASET_SIZE_THRESHOLD=0 ``` @@ -125,8 +128,6 @@ python core.py validate -s sdtmig -v 3-4 -d /data -er CORE-000001 You can view and clone the CDISC CORE rules at [cdisc-open-rules](https://github.com/cdisc-org/cdisc-open-rules) - - ## Privacy & Data Protection ### Does CORE transmit my study data anywhere during validation? @@ -149,13 +150,12 @@ You can view and clone the CDISC CORE rules at [cdisc-open-rules](https://github - **Custom rules** are individual rule definitions stored in the cache by CORE ID. - **Custom standards** map a standard identifier to a list of rule IDs, acting as a lookup for which rules apply. -Add your custom rules first, then create a standard that references them. See [CLI Reference → update-cache](cli-reference.md#custom-rules) for details. Custom Rules & Standards continue to be a work in progress, there are tickets within CORE's Issues to full implement further support for them in the future. +Add your custom rules first, then create a standard that references them. See [CLI Reference → update-cache](cli-reference.md#custom-rules) for details. Custom Rules & Standards remain a work in progress; there are tickets in CORE's Issues to fully implement further support for them. ### Can a custom standard use CDISC library metadata? Yes. If you name your custom standard after an existing CDISC standard (e.g. `sdtmig/3-4`), CORE will fetch library metadata for that standard while applying your custom rules. - --- ## Still Need Help?
diff --git a/docs/index.md b/docs/index.md index ea1600e1a..f8d6989a8 100644 --- a/docs/index.md +++ b/docs/index.md @@ -16,30 +16,30 @@ CORE is the open-source offering of the CDISC Conformance Rules Engine — a too CORE validates study datasets against published CDISC conformance rules for the following standards: -| Standard | Description | -|---|---| -| **SDTMIG** | Study Data Tabulation Model Implementation Guide | -| **SENDIG** | Standard for Exchange of Nonclinical Data | -| **ADaM** | Analysis Data Model | -| **TIG** | Therapeutic Area Implementation Guide | -| **FDA Business Rules** | FDA submission conformance rules | -| **USDM** | Unified Study Definitions Model | +| Standard | Description | +| ---------------------- | ------------------------------------------------ | +| **SDTMIG** | Study Data Tabulation Model Implementation Guide | +| **SENDIG** | Standard for Exchange of Nonclinical Data | +| **ADaM** | Analysis Data Model | +| **TIG** | Therapeutic Area Implementation Guide | +| **FDA Business Rules** | FDA submission conformance rules | +| **USDM** | Unified Study Definitions Model | -CORE validates data *structure and conformance* against published rules. It is not a replacement for clinical review, statistical analysis, or submission readiness assessment. Rule logic is defined in [`cdisc-open-rules`](https://github.com/cdisc-org/cdisc-open-rules). +CORE validates data _structure and conformance_ against published rules. It is not a replacement for clinical review, statistical analysis, or submission readiness assessment. Rule logic is defined in [`cdisc-open-rules`](https://github.com/cdisc-org/cdisc-open-rules). 
--- ## Getting Started -| I want to… | Go to… | -|---|---| -| Run CORE without installing Python | [Quick Start → Executable](quick-start.md#option-1-pre-built-executable) | -| Run from source / contribute code | [Quick Start → From Source](quick-start.md#option-2-from-source-code) | -| See all CLI options | [CLI Reference](cli-reference.md) | -| Integrate CORE into my Python project | [Development → PyPI](development.md#pypi-integration) | -| Build or test CORE | [Development](development.md) | -| Contribute rules or code | [Contributing](contributing.md) | -| Get help or ask a question | [FAQ & Troubleshooting](faq.md) | +| I want to… | Go to… | +| ------------------------------------- | ------------------------------------------------------------------------ | +| Run CORE without installing Python | [Quick Start → Executable](quick-start.md#option-1-pre-built-executable) | +| Run from source / contribute code | [Quick Start → From Source](quick-start.md#option-2-from-source-code) | +| See all CLI options | [CLI Reference](cli-reference.md) | +| Integrate CORE into my Python project | [Development → PyPI](development.md#pypi-integration) | +| Build or test CORE | [Development](development.md) | +| Contribute rules or code | [Contributing](contributing.md) | +| Get help or ask a question | [FAQ & Troubleshooting](faq.md) | --- diff --git a/docs/quick-start.md b/docs/quick-start.md index f7be1edc9..f834a3537 100644 --- a/docs/quick-start.md +++ b/docs/quick-start.md @@ -17,11 +17,13 @@ Download the latest executable for your operating system from the [Releases page Open a terminal in the unzipped directory and run: **Windows (PowerShell):** + ```powershell .\core.exe --help ``` **Linux / Mac:** + ```bash # Make it executable (one-time setup) chmod +x ./core @@ -30,6 +32,7 @@ chmod +x ./core ``` > **Mac users:** If you see a security warning, remove the quarantine attribute first: +> > ```bash > xattr -rd com.apple.quarantine . 
> ``` @@ -43,11 +46,13 @@ Executable releases ship with a pre-populated cache, so you can skip this step a ### 4. Run a Validation **Windows:** + ```powershell .\core.exe validate -s sdtmig -v 3-4 -d C:\path\to\datasets ``` **Linux / Mac:** + ```bash ./core validate -s sdtmig -v 3-4 -d /path/to/datasets ``` @@ -73,19 +78,24 @@ Test files are cleaned up automatically after completion. **Best for:** Developers, contributors, or users who need the latest features. ### Prerequisites + - **Python 3.12** is required. Other versions are not supported. Check your version: + ```bash python --version ``` - Install Python 3.12 from [python.org](https://www.python.org/downloads/) if needed. + +Install Python 3.12 from [python.org](https://www.python.org/downloads/) if needed. - **Git** is required to clone the repository. Check your version: + ```bash git --version ``` - Install Git from [git-scm.com](https://git-scm.com/downloads) if needed. + +Install Git from [git-scm.com](https://git-scm.com/downloads) if needed. ### 1. Clone the Repository @@ -97,12 +107,14 @@ cd cdisc-rules-engine ### 2. Create and Activate a Virtual Environment **Linux / Mac:** + ```bash python -m venv venv source venv/bin/activate ``` **Windows:** + ```bash python -m venv venv .\venv\Scripts\Activate @@ -134,12 +146,12 @@ python core.py validate -s sdtmig -v 3-4 -d /path/to/datasets CORE supports the following input formats: -| Format | Description | -|---|---| -| **XPT** | SAS Transport Format (version 5) | -| **JSON** | Dataset-JSON ≥ v1.1 (CDISC standard) | -| **NDJSON** | Newline Delimited JSON | -| **XLSX** | Microsoft Excel | +| Format | Description | +| ---------- | ------------------------------------ | +| **XPT** | SAS Transport Format (version 5) | +| **JSON** | Dataset-JSON ≥ v1.1 (CDISC standard) | +| **NDJSON** | Newline Delimited JSON | +| **XLSX** | Microsoft Excel | > **Note:** Define-XML files must be provided via `--define-xml-path` (`-dxp`), not through the dataset directory. 
@@ -151,12 +163,12 @@ All commands and flags are documented in the [CLI Reference](cli-reference.md). Command summary: -| Command | Purpose | -|---|---| -| `validate` | Run conformance validation | -| `update-cache` | Download/refresh rules, CT, and metadata | -| `list-rules` | List rules available in the cache | -| `list-rule-sets` | List standards and versions in the cache | -| `list-ct` | List controlled terminology packages in the cache | +| Command | Purpose | +| ---------------- | ------------------------------------------------- | +| `validate` | Run conformance validation | +| `update-cache` | Download/refresh rules, CT, and metadata | +| `list-rules` | List rules available in the cache | +| `list-rule-sets` | List standards and versions in the cache | +| `list-ct` | List controlled terminology packages in the cache | > Throughout these docs, examples use `python core.py`. If you're using the executable, replace this with `.\core.exe` (Windows) or `./core` (Linux/Mac). From 835a16dbf098fae6057040b39e70ddbca6a513e2 Mon Sep 17 00:00:00 2001 From: Samuel Johnson Date: Tue, 12 May 2026 15:16:38 -0400 Subject: [PATCH 3/8] moved logo --- README.md | 2 +- {resources/assets => docs}/CORE_logo_sm.png | Bin docs/README.md | 2 +- docs/index.html | 2 +- 4 files changed, 3 insertions(+), 3 deletions(-) rename {resources/assets => docs}/CORE_logo_sm.png (100%) diff --git a/README.md b/README.md index 714f2cc5c..e3a06d922 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@

- +

diff --git a/resources/assets/CORE_logo_sm.png b/docs/CORE_logo_sm.png similarity index 100% rename from resources/assets/CORE_logo_sm.png rename to docs/CORE_logo_sm.png diff --git a/docs/README.md b/docs/README.md index 7f69fe6ce..0dcf0e37c 100644 --- a/docs/README.md +++ b/docs/README.md @@ -1,6 +1,6 @@

- CORE Logo + CORE Logo

diff --git a/docs/index.html b/docs/index.html index 1761a7cc9..f529f8532 100644 --- a/docs/index.html +++ b/docs/index.html @@ -5,7 +5,7 @@ CDISC Rules Engine (CORE) - +