A sophisticated synthetic data generation system for Identity Governance and Administration (IGA) testing and analytics. Creates statistically realistic datasets with meaningful feature-entitlement associations that mirror production enterprise environments.
- Overview
- Key Features
- Architecture
- Installation
- Quick Start
- Configuration
- Machine Learning Concepts
- Output Files
- Validation
- Advanced Usage
- Technical Details
- Troubleshooting
The IGA Data Generator creates synthetic datasets that include:
- User Identities: 20+ attributes including department, job level, location, manager hierarchy
- Application Entitlements: Role catalogs for AWS, Salesforce, ServiceNow, SAP, and custom apps
- Access Assignments: User-to-entitlement mappings with confidence scores
- Statistical Associations: Feature-to-entitlement rules with configurable support/confidence
Unlike simple random data generators, this system:
- Avoids Overfitting: Uses schema-based rule generation instead of mining patterns from generated data
- Ensures Statistical Validity: Produces data with predictable Cramér's V, support, and confidence metrics
- Models Real Patterns: Cross-application rules reflect how enterprise access actually works
- Enables Testing: Provides ground truth for role mining, anomaly detection, and access analytics
Rules are generated before identities using abstract feature schemas:
Define Rule Schemas → Generate Rules → Generate Identities → Apply Rules → Validate
This prevents the circular dependency of needing data to create rules and needing rules to create realistic data.
Ensures mined confidence matches target confidence using the formula:
mined_confidence = freqUnion / freq
freqUnion_target = confidence × freq
For a rule [Department=Finance] → SAP_FI_001 with 85% confidence:
- 1000 Finance users (freq)
- Exactly 850 get the entitlement (freqUnion_target)
- Mined confidence = 850/1000 = 85% ✓
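The quota arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the generator's actual code: the `User` class and `apply_rule` helper are hypothetical names, and `round()` is an assumption made to avoid floating-point truncation.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    # Hypothetical minimal identity record used only for this sketch
    user_id: str
    department: str
    entitlements: set = field(default_factory=set)


def apply_rule(users, antecedent, entitlement, confidence):
    """Grant `entitlement` to exactly round(confidence * freq) matching users."""
    matching = sorted(
        (u for u in users if all(getattr(u, k) == v for k, v in antecedent.items())),
        key=lambda u: u.user_id,  # stable order keeps the selection deterministic
    )
    freq = len(matching)
    freq_union_target = round(confidence * freq)
    for user in matching[:freq_union_target]:
        user.entitlements.add(entitlement)
    return freq, freq_union_target


users = [User(f"U{i:07d}", "Finance") for i in range(1000)]
users += [User(f"U{i:07d}", "Sales") for i in range(1000, 1500)]

freq, granted = apply_rule(users, {"department": "Finance"}, "SAP_FI_001", 0.85)
holders = sum("SAP_FI_001" in u.entitlements for u in users if u.department == "Finance")
mined_confidence = holders / freq
print(freq, granted, mined_confidence)  # 1000 850 0.85
```

Because the number of grants is fixed rather than sampled, the mined confidence equals the target up to rounding of the quota.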
Models realistic enterprise access patterns:
{
"rule_id": "R001",
"antecedent": {"department": "Finance", "job_level": "Senior"},
"consequent": {
"SAP": ["FI_ACCOUNTANT", "FI_ANALYST"],
"AWS": ["PowerUserAccess"],
"Salesforce": ["Finance_User"]
},
"confidence": 0.85
}
Multi-stage filtering using:
- Cardinality Filter: Max 50 unique values
- Cramér's V: Association strength ≥ 0.1
- Chi-Square Test: Statistical significance p < 0.05
- Mutual Information: Predictive power ranking
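As a rough illustration of the Cramér's V stage, the statistic can be computed from a contingency table with the standard library alone. The function name and the sample data are assumptions made for this sketch. It applies no bias or continuity correction, which the real validator may or may not use.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)

    # Pearson chi-square statistic over every cell of the contingency table
    chi2 = 0.0
    for x, row_n in row_totals.items():
        for y, col_n in col_totals.items():
            expected = row_n * col_n / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Feature that fully determines the entitlement flag
strong = cramers_v(["Finance"] * 50 + ["Sales"] * 50, [1] * 50 + [0] * 50)
# Feature that is independent of the entitlement flag
none = cramers_v(["Finance", "Finance", "Sales", "Sales"] * 25, [1, 0, 1, 0] * 25)

MIN_CRAMERS_V = 0.1  # threshold quoted in the list above
print(strong >= MIN_CRAMERS_V, none >= MIN_CRAMERS_V)  # True False
```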
┌─────────────────────────────────────────────────────────────┐
│ Configuration (synthetic_iga_data_generator_config.json) │
└────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Phase 1: Schema Definition & Rule Generation │
│ • Load feature schemas (departments, job titles, etc.) │
│ • Generate association rules with target metrics │
│ • Configure cross-app vs per-app mode │
└────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Phase 2: Identity Generation │
│ • Extract rule patterns │
│ • Calculate quotas (support × √confidence weighting) │
│ • Generate identities matching patterns │
│ • Fill remaining with random identities │
└────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Phase 3: Entitlement Assignment │
│ • Compute rule quotas (freqUnion_target) │
│ • Deterministically assign entitlements │
│ • Calculate confidence scores │
│ • Enforce distribution targets (80/20 rule) │
└────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Phase 4: Validation & Output │
│ • Feature validation (Cramér's V, chi-square) │
│ • Distribution validation │
│ • Export CSV files │
│ • Generate QA summary │
└─────────────────────────────────────────────────────────────┘
- Python 3.8+
- pip
pip install -r requirements.txt
pandas>=1.5.0
numpy>=1.23.0
faker>=18.0.0
mlxtend>=0.21.0
scikit-learn>=1.2.0
scipy
python synthetic_iga_data_generator.py --config synthetic_iga_data_generator_config.json
python validate_generated_data.py --identities out/identities.csv --accounts-dir out/
python validate_zero_entitlements.py out/
{
"global": {
"seed": 42,
"output_directory": "./out",
"rules_directory": "./rules",
"log_level": "INFO"
}
}
{
"identity": {
"num_identities": 3500,
"pct_users_without_manager": 0.10,
"distribution_employee_contractor": {
"Employee": 0.70,
"Contractor": 0.20,
"Intern": 0.10
}
}
}
{
"dynamic_rules": {
"enabled": true,
"use_cross_app_rules": true,
"num_cross_app_rules": 700,
"num_rules_per_app": 20,
"confidence_distribution": {
"high": 0.50,
"medium": 0.35,
"low": 0.15
},
"confidence_ranges": {
"high": {"min": 0.80, "max": 1.00},
"medium": {"min": 0.50, "max": 0.79},
"low": {"min": 0.20, "max": 0.49}
},
"max_cardinality": 20
}
}
{
"confidence": {
"distribution": {
"high": 0.35,
"medium": 0.30,
"low": 0.30,
"none": 0.05
},
"thresholds": {
"high": {"min": 0.70},
"medium": {"min": 0.40},
"low": {"min": 0.01}
},
"pct_modelled_users": 0.93
}
}
{
"features": {
"mandatory_features": ["job_level", "business_unit", "department_type"],
"additional_features": ["location_country", "employment_type", "is_manager"],
"num_features_for_rules": 8,
"feature_selection_method": "cramers_v"
}
}
The system applies association rule mining (ARM) principles in reverse:
Traditional Approach:
Data → Mine Patterns → Discover Rules
This System:
Define Schemas → Generate Rules → Generate Data → Validate Patterns
- Support: P(features AND entitlements), the frequency of the pattern in the population
- Confidence: P(entitlements | features), the conditional probability
- Lift: How much more likely the consequent is given the antecedent vs. random
- Cramér's V: Effect size measuring association strength (0 to 1)
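To make the definitions concrete, here is a small sketch that computes support, confidence, and lift for a single rule. The record layout and the `rule_metrics` helper are hypothetical and chosen only for illustration.

```python
def rule_metrics(records, antecedent, entitlement):
    """Support, confidence, and lift of `antecedent -> entitlement`."""
    n = len(records)
    matches = [
        r for r in records
        if all(r["features"].get(k) == v for k, v in antecedent.items())
    ]
    freq = len(matches)                                            # antecedent count
    freq_union = sum(entitlement in r["entitlements"] for r in matches)
    freq_consequent = sum(entitlement in r["entitlements"] for r in records)

    support = freq_union / n
    confidence = freq_union / freq if freq else 0.0
    lift = confidence / (freq_consequent / n) if freq_consequent else 0.0
    return {"support": support, "confidence": confidence, "lift": lift}


# 4 Finance users (3 hold the entitlement) and 4 Sales users (1 holds it)
records = (
    [{"features": {"department": "Finance"}, "entitlements": {"SAP_FI_001"}}] * 3
    + [{"features": {"department": "Finance"}, "entitlements": set()}]
    + [{"features": {"department": "Sales"}, "entitlements": {"SAP_FI_001"}}]
    + [{"features": {"department": "Sales"}, "entitlements": set()}] * 3
)

metrics = rule_metrics(records, {"department": "Finance"}, "SAP_FI_001")
print(metrics)  # {'support': 0.375, 'confidence': 0.75, 'lift': 1.5}
```

A lift above 1 means Finance users hold the entitlement more often than the population as a whole.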
Stage 1: Cardinality Filter (≤50 unique values)
↓
Stage 2: Statistical Significance (Cramér's V ≥ 0.1 or p < 0.05)
↓
Stage 3: Low-Cardinality for MI (preparation)
↓
Stage 4: Mutual Information Calculation
↓
Stage 5: Aggregation & Ranking
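Stages 4 and 5 can be approximated as follows. This sketch computes mutual information in bits from empirical counts and ranks features by it. The function name, the choice of base 2, and the toy columns are assumptions, and the generator itself may rely on a library estimator instead.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # Sum p(x, y) * log2(p(x, y) / (p(x) * p(y))) over observed pairs
    return sum(
        (count / n) * log2(count * n / (px[x] * py[y]))
        for (x, y), count in joint.items()
    )


has_entitlement = [1, 1, 0, 0]
features = {
    "location_country": ["US", "GB", "US", "GB"],          # independent of the target
    "job_level": ["Senior", "Senior", "Junior", "Junior"],  # fully predictive
}

# Rank features by predictive power, strongest first
ranked = sorted(
    ((name, mutual_information(values, has_entitlement)) for name, values in features.items()),
    key=lambda item: item[1],
    reverse=True,
)
print(ranked)  # [('job_level', 1.0), ('location_country', 0.0)]
```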
# Right-skewed: most employees newer
beta_a = 2.0
beta_b = 5.0
tenure = beta(beta_a, beta_b) × 20 years
{
'Junior': 0.20,
'Mid': 0.30,
'Senior': 0.20,
'Manager': 0.10,
'Executive': 0.02
}
User profiles with 20+ attributes:
| Column | Description |
|---|---|
| user_id | Unique identifier (U0000001) |
| user_name | Username (firstnamelastname) |
| department | Department name |
| job_level | Junior/Mid/Senior/Manager/Director/VP/Executive |
| business_unit | Industry/business unit |
| location_country | Country code (US, GB, IN, DE, AU) |
| manager | Manager's user_id |
| is_manager | Y/N flag |
| tenure_years | Years of service (Beta distribution) |
Entitlement catalogs per application:
| Column | Description |
|---|---|
| entitlement_id | Unique entitlement identifier |
| entitlement_name | Human-readable name |
| app_name | Application name |
| entitlement_type | standard/License/PermissionSet/linkedTemplates |
| criticality | High/Medium/Low |
User-entitlement assignments:
| Column | Description |
|---|---|
| user_id | User identifier |
| user_name | Username |
| entitlement_grants | Pipe-delimited entitlement IDs |
| confidence_score | Numeric score 0.0-1.0 |
| confidence_bucket | High/Medium/Low/None |
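A consumer of the assignments file needs to split the pipe-delimited `entitlement_grants` column. The sketch below reads an in-memory sample with the columns from the table above. The sample rows, the `SAP_FI_002` identifier, and the `load_grants` helper are made up for illustration.

```python
import csv
import io


def load_grants(handle):
    """Map each user_id to its list of entitlement IDs."""
    grants = {}
    for row in csv.DictReader(handle):
        raw = row["entitlement_grants"]
        # An empty cell would otherwise split into [''], so treat it as no grants
        grants[row["user_id"]] = raw.split("|") if raw else []
    return grants


# Hypothetical sample rows in the documented column layout
sample = io.StringIO(
    "user_id,user_name,entitlement_grants,confidence_score,confidence_bucket\n"
    "U0000001,janedoe,SAP_FI_001|SAP_FI_002,0.85,High\n"
    "U0000002,johnroe,SAP_FI_001,0.45,Medium\n"
)

grants = load_grants(sample)
print(grants["U0000001"])  # ['SAP_FI_001', 'SAP_FI_002']
```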
Validation report including:
- Identity distribution statistics
- Entitlement coverage per app
- Confidence distribution breakdown
- Feature validation results
The system includes built-in validation:
# Full validation with confidence distribution check
python validate_generated_data.py \
--identities out/identities.csv \
--accounts-dir out/ \
--config synthetic_iga_data_generator_config.json \
--output validation_report.json
- Feature Quality
  - Cardinality within limits
  - Cramér's V ≥ threshold
  - Non-constant features
- Confidence Distribution
  - Actual vs. target bucket distribution
  - Tolerance: ±10%
- Data Integrity
  - No duplicate user_id per app
  - No users with zero entitlements
  - All mandatory entitlements used
- Statistical Associations
  - Mined rules match target confidence
  - Support levels within expected ranges
=== VALIDATION SUMMARY ===
Bucket Target Actual Difference Status
High 35.0% 34.2% -0.8% ✓ PASS
Medium 30.0% 31.1% +1.1% ✓ PASS
Low 30.0% 29.8% -0.2% ✓ PASS
None 5.0% 4.9% -0.1% ✓ PASS
✓ VALIDATION PASSED
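The bucket comparison behind a summary like this one can be sketched as below. The thresholds and targets come from the configuration section, while the helper names are hypothetical. Reading the ±10% tolerance as absolute percentage points is an assumption. The validator may define it differently.

```python
def bucket_for(score):
    """Map a confidence score to a bucket using the configured minimums."""
    if score >= 0.70:
        return "High"
    if score >= 0.40:
        return "Medium"
    if score >= 0.01:
        return "Low"
    return "None"


def check_distribution(target, actual, tolerance=0.10):
    """Compare actual bucket shares to targets (tolerance in absolute terms)."""
    report = {}
    for bucket, expected in target.items():
        diff = actual.get(bucket, 0.0) - expected
        report[bucket] = {"diff": diff, "passed": abs(diff) <= tolerance}
    return report


target = {"High": 0.35, "Medium": 0.30, "Low": 0.30, "None": 0.05}
actual = {"High": 0.342, "Medium": 0.311, "Low": 0.298, "None": 0.049}

report = check_distribution(target, actual)
all_passed = all(entry["passed"] for entry in report.values())
print("VALIDATION PASSED" if all_passed else "VALIDATION FAILED")  # VALIDATION PASSED
```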
Add applications beyond the mandatory four (AWS, Salesforce, ServiceNow, SAP):
{
"applications": {
"num_apps": 6,
"additional_app_pool": ["Workday", "Okta", "GitHub", "Slack"],
"apps": [
{
"app_name": "Workday",
"app_id": "APP_WORKDAY",
"enabled": true,
"num_entitlements": 75,
"criticality_distribution": {
"High": 0.20,
"Medium": 0.50,
"Low": 0.30
}
}
]
}
}
Generate rules that reuse the same feature patterns across apps:
{
"dynamic_rules": {
"coordinate_rules_across_apps": true,
"num_unique_feature_patterns": 10
}
}
This creates 10 feature patterns (e.g., {department=Finance, job_level=Senior}) and generates one rule per app using each pattern.
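One way to picture coordinated rules is to fan each feature pattern out across the application catalogs. The sketch below is illustrative only: `coordinated_rules`, the rule dictionary layout, and the random choice of one entitlement per app are assumptions, and the entitlement names are borrowed from the cross-application example earlier in this document.

```python
import itertools
import random


def coordinated_rules(patterns, catalogs, rng):
    """Emit one rule per (pattern, app) pair so every app reuses the same antecedents."""
    counter = itertools.count(1)
    rules = []
    for pattern in patterns:
        for app, entitlements in catalogs.items():
            rules.append({
                "rule_id": f"R{next(counter):03d}",
                "antecedent": pattern,
                "app": app,
                "consequent": [rng.choice(entitlements)],
            })
    return rules


patterns = [
    {"department": "Finance", "job_level": "Senior"},
    {"department": "Finance", "job_level": "Junior"},
]
catalogs = {
    "SAP": ["FI_ACCOUNTANT", "FI_ANALYST"],
    "AWS": ["PowerUserAccess"],
    "Salesforce": ["Finance_User"],
}

rules = coordinated_rules(patterns, catalogs, random.Random(42))
print(len(rules))  # 6
```

With 10 patterns and 6 apps, the same scheme would yield 60 rules that share only 10 distinct antecedents.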
Control which features are used in rules by adjusting cardinality limits:
{
"dynamic_rules": {
"max_cardinality": 20,
"min_cardinality": 2
}
}
Features with too many unique values (e.g., 198 departments) are automatically excluded.
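The cardinality filter can be sketched as a count of distinct values per column. The function name and the toy identity rows below are hypothetical. Only the `min_cardinality` and `max_cardinality` settings come from the configuration above.

```python
def select_rule_features(identities, min_cardinality=2, max_cardinality=20):
    """Split features into those usable in rules and those skipped, by cardinality."""
    kept, skipped = [], {}
    features = list(identities[0].keys()) if identities else []
    for feature in features:
        cardinality = len({row[feature] for row in identities})
        if min_cardinality <= cardinality <= max_cardinality:
            kept.append(feature)
        else:
            skipped[feature] = cardinality
    return kept, skipped


levels = ["Junior", "Mid", "Senior", "Manager", "Director", "VP", "Executive"]
identities = [
    {
        "department": f"Dept{i % 198}",  # 198 distinct values: too specific
        "job_level": levels[i % len(levels)],
        "company": "Acme",               # constant: carries no signal
    }
    for i in range(400)
]

kept, skipped = select_rule_features(identities)
print(kept, skipped)  # ['job_level'] {'department': 198, 'company': 1}
```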
Enforce that 80% of users have 3+ entitlements per application:
{
"grants": {
"pct_users_with_3_plus_per_app": 0.80
}
}
The system implements a two-phase approach:
Phase 1: Pattern-Based Generation (93% of users)
# Extract patterns from rules
patterns = extract_rule_patterns(rules)
# Calculate quotas using weighted allocation
weight = support * confidence ** 0.5  # Square root dampens high-confidence dominance
quota = budget * (weight / total_weight)
# Generate identities matching patterns
for pattern, quota in pattern_quotas:
    generate_identities_matching(pattern, quota)
Phase 2: Random Generation (7% of users)
Fills remaining slots with random identities to prevent 100% rule coverage.
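The weighted allocation in Phase 1 might look like the sketch below. The `allocate_quotas` helper and the sample rules are hypothetical, and the largest-remainder rounding used to make the quotas sum exactly to the budget is an assumption rather than the generator's documented behavior.

```python
from math import sqrt


def allocate_quotas(rules, budget):
    """Split `budget` identities across rules by support * sqrt(confidence)."""
    weights = [r["support"] * sqrt(r["confidence"]) for r in rules]
    total_weight = sum(weights)
    if total_weight == 0:
        return [0] * len(rules)

    raw = [budget * w / total_weight for w in weights]
    quotas = [int(q) for q in raw]

    # Largest-remainder rounding: hand leftover slots to the biggest fractions
    leftover = budget - sum(quotas)
    by_fraction = sorted(range(len(rules)), key=lambda i: raw[i] - quotas[i], reverse=True)
    for i in by_fraction[:leftover]:
        quotas[i] += 1
    return quotas


rules = [
    {"rule_id": "R001", "support": 0.10, "confidence": 0.90},
    {"rule_id": "R002", "support": 0.10, "confidence": 0.40},
    {"rule_id": "R003", "support": 0.05, "confidence": 0.90},
]

num_identities = 3500
budget = round(num_identities * 0.93)  # pattern-based share of the population
quotas = allocate_quotas(rules, budget)
random_fill = num_identities - budget  # Phase 2: remaining random identities
print(sum(quotas), random_fill)  # 3255 245
```

Using the square root means a rule with 0.90 confidence gets about 1.5 times the quota of an equally supported rule with 0.40 confidence, rather than 2.25 times.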
# 1. Compute exact quota
freq = count_users_matching_antecedent(rule)
freqUnion_target = int(confidence * freq)
# 2. Sort users by ID (CRITICAL for nested patterns)
matching_users = sorted(matching_users, key=lambda u: u.user_id)
# 3. Select first N users deterministically
selected = matching_users[:freqUnion_target]
# 4. Grant entitlements to selected users
for user in selected:
    assign_entitlements(user, rule.consequent)
Users are pre-determined to follow rules consistently across all apps:
# Single decision per user (not per user-app)
if random() < pct_modelled_users:
    rule_following_users.add(user_id)
# Applied consistently to ALL apps
for app in apps:
    if user_id in rule_following_users:
        apply_rules(user_id, app)
Only low-cardinality features are used in rules to ensure statistical power:
# Skip high-cardinality features
if feature_cardinality > max_cardinality:
    logger.info(f"Skipping '{feature}': {feature_cardinality} > {max_cardinality}")
    continue
# Example: department with 198 values → skipped
# Example: job_level with 8 values → included
This prevents overly specific rules that have poor mined confidence.
Symptoms: Validation shows 82% Low confidence instead of target 30%
Causes:
- High-cardinality features in rules
- Probabilistic assignment instead of quota-based
- Mismatched features between rule generation and validation
Solutions:
{
"dynamic_rules": {
"max_cardinality": 20, // Lower to exclude high-cardinality features
"confidence_ranges": {
"high": {"min": 0.80, "max": 1.00} // Raise minimum
}
}
}
Symptoms: Some users have no entitlements across all apps
Cause: Edge case in random assignment
Solution: System auto-corrects with final safeguard:
# Automatic fix in generate_all()
_ensure_no_users_without_entitlements()
Symptoms: Generated rules don't cover expected feature combinations
Causes:
- Too few rules configured
- Feature excluded due to cardinality
- Random selection didn't pick pattern
Solutions:
{
"dynamic_rules": {
"num_cross_app_rules": 700, // Increase to cover more patterns
"max_features_per_rule": 2, // Lower for broader coverage
"min_features_per_rule": 1
}
}
Symptoms: Multiple rows per user_id in accounts file
Cause: Bug in account generation
Solution: System includes deduplication:
# Automatic deduplication in generate_for_app()
accounts = _deduplicate_accounts(accounts, app_name)
Enable verbose logging:
python synthetic_iga_data_generator.py --config config.json --verbose
Or in configuration:
{
"global": {
"log_level": "DEBUG"
}
}
- 3,500 users: ~50 MB
- 10,000 users: ~150 MB
- 100,000 users: ~1.5 GB
- 3,500 users, 6 apps, 700 rules: ~2-3 minutes
- 10,000 users, 6 apps, 700 rules: ~5-7 minutes
- 100,000 users, 10 apps, 1000 rules: ~30-45 minutes
- Reduce MI calculation overhead:
  {"feature_validation": {"validation_sample_size": 5000}}
- Limit cross-app rules:
  {"dynamic_rules": {"num_cross_app_rules": 500}}
- Disable validation:
  {"feature_validation": {"enabled": false}}
- Association Rule Mining: Agrawal & Srikant (1994)
- Cramér's V: Cramér (1946) - Mathematical Methods of Statistics
- Feature Selection: Guyon & Elisseeff (2003) - An Introduction to Variable and Feature Selection
- Mutual Information: Cover & Thomas (2006) - Elements of Information Theory