|
| 1 | +# Hibernate ORM vs SQL Schema Analysis |
| 2 | + |
| 3 | +This document provides a comprehensive analysis of discrepancies between the Hibernate ORM data model and the SQL-based PostgreSQL schema in TreeBASE. |
| 4 | + |
| 5 | +## Executive Summary |
| 6 | + |
| 7 | +**Finding**: There are discrepancies between the Hibernate ORM layer (source of truth) and the SQL schema instantiated by the database initialization scripts. However, switching to Hibernate-based schema generation introduces significant issues due to seed data dependencies. |
| 8 | + |
| 9 | +**Key Issues Identified**: |
| 10 | +1. **Missing patch in CI/CD**: Patch `0011_increase-citation-column-lengths.sql` was not included in `init_db_uptodate.pg` (now fixed) |
| 11 | +2. **Column length mismatches**: Several columns in the SQL schema have different lengths than defined in Hibernate (fixed by patches) |
| 12 | +3. **Different initialization paths**: CI/CD and Docker use different initialization approaches |
| 13 | +4. **Patch idempotency issues**: Some patches failed when schema already had correct types (now fixed) |
| 14 | + |
| 15 | +**Recommendation**: **Keep the SQL-based schema with patches** (Option A) rather than switching to Hibernate-based generation. See "Test Impact Analysis" section for reasoning. |
| 16 | + |
| 17 | +## Current Architecture |
| 18 | + |
| 19 | +### CI/CD Database Initialization |
| 20 | +- Uses `treebase-core/db/schema/init_db_uptodate.pg` |
| 21 | +- Applies snapshot `0000_SCHEMA_before_patches_start.sql` + `0000_DATA_before_patches_start.sql` |
| 22 | +- Then applies patches 0001 through 0011 sequentially |
| 23 | +- **Fixed**: Patch 0011 is now included |
| 24 | + |
| 25 | +### Docker Database Initialization |
| 26 | +- Uses `docker-compose.yml` volume mounts |
| 27 | +- Applies: |
| 28 | + 1. `docker/00-init-roles.sql` - Role initialization |
| 29 | + 2. `treebase-core/src/main/resources/TBASE2_POSTGRES_CREATION.sql` - Schema creation |
| 30 | + 3. `treebase-core/src/main/resources/initTreebase.sql` - Initial data |
| 31 | + 4. `docker/03-migration-hibernate-sequence.sql` - Hibernate sequence migration |
| 32 | + |
| 33 | +### Hibernate Configuration |
| 34 | +- `hibernate.hbm2ddl.auto=` (empty/disabled) |
| 35 | +- Uses annotation-based mapping (`@Entity`, `@Table`, `@Column`) |
| 36 | +- Entities defined in `org.cipres.treebase.domain.*` |
| 37 | + |
| 38 | +## Detailed Schema Discrepancies |
| 39 | + |
| 40 | +### 1. Citation Table |
| 41 | + |
| 42 | +| Column | Hibernate Definition | SQL Schema (snapshot) | SQL Schema (TBASE2_POSTGRES_CREATION) | Patch Applied | |
| 43 | +|--------|---------------------|----------------------|---------------------------------------|---------------| |
| 44 | +| `title` | VARCHAR(500) | VARCHAR(500) | VARCHAR(500) | - | |
| 45 | +| `abstract` | VARCHAR(10000) | VARCHAR(10000) | VARCHAR(10000) | - | |
| 46 | +| `keywords` | VARCHAR(1000) | VARCHAR(255) | VARCHAR(1000) | Patch 0011 | |
| 47 | +| `journal` | VARCHAR(500) | VARCHAR(255) | VARCHAR(500) | Patch 0011 | |
| 48 | + |
| 49 | +**Notes**: |
| 50 | +- The snapshot has outdated column lengths for `keywords` (255 vs 1000) and `journal` (255 vs 500) |
| 51 | +- Patch 0011 fixes these in the snapshot-based initialization |
| 52 | +- `TBASE2_POSTGRES_CREATION.sql` already has correct values |
| 53 | + |
| 54 | +### 2. TaxonLabel Table |
| 55 | + |
| 56 | +| Column | Hibernate Definition | SQL Schema (snapshot) | Patch Applied | |
| 57 | +|--------|---------------------|----------------------|---------------| |
| 58 | +| `linked` | BOOLEAN | BOOLEAN | Patch 0010 | |
| 59 | + |
| 60 | +**Notes**: |
| 61 | +- Earlier SQL had `linked` as `smallint`, but this was fixed by Patch 0010 |
| 62 | +- Both snapshot (after patches) and TBASE2_POSTGRES_CREATION.sql now use BOOLEAN |
| 63 | + |
| 64 | +### 3. Help Table |
| 65 | + |
| 66 | +| Column | Hibernate Definition | SQL Schema | |
| 67 | +|--------|---------------------|------------| |
| 68 | +| `tag` | VARCHAR(255) (implicit) | VARCHAR(255) | |
| 69 | +| `helptext` | TEXT (LOB, 65536) | TEXT | |
| 70 | + |
| 71 | +**Notes**: Schema matches. |
| 72 | + |
| 73 | +### 4. PasswordResetToken Table |
| 74 | + |
| 75 | +| Column | Hibernate Definition | SQL Schema (Patch 0009) | |
| 76 | +|--------|---------------------|------------------------| |
| 77 | +| `token_id` | BIGINT (auto-increment) | BIGINT (sequence) | |
| 78 | +| `token` | VARCHAR(100), unique, NOT NULL | VARCHAR(100), unique, NOT NULL | |
| 79 | +| `user_id` | BIGINT, NOT NULL | BIGINT, NOT NULL, FK | |
| 80 | +| `expiry_date` | TIMESTAMP, NOT NULL | TIMESTAMP, NOT NULL | |
| 81 | +| `used` | BOOLEAN, NOT NULL | BOOLEAN, NOT NULL | |
| 82 | + |
| 83 | +**Notes**: |
| 84 | +- Hibernate uses `@GeneratedValue(strategy = GenerationType.IDENTITY)` |
| 85 | +- SQL uses a sequence - these are compatible in PostgreSQL |
| 86 | +- Schema matches structurally |
| 87 | + |
| 88 | +### 5. AnalysisStep Table |
| 89 | + |
| 90 | +| Column | Hibernate Definition | SQL Schema (snapshot) | |
| 91 | +|--------|---------------------|----------------------| |
| 92 | +| `tb_analysisid` | Not in Hibernate entity | VARCHAR(34) | |
| 93 | + |
| 94 | +**Notes**: The SQL schema has a `tb_analysisid` column that doesn't appear to be mapped in Hibernate. This is a legacy TB1 field. |
| 95 | + |
| 96 | +### 6. MatrixRow Table (CRITICAL - LOB Type Mismatch) |
| 97 | + |
| 98 | +| Column | Hibernate Definition (Before) | Hibernate Definition (After) | SQL Schema | |
| 99 | +|--------|-------------------------------|------------------------------|------------| |
| 100 | +| `symbolstring` | `@Lob` + `@Column(length=524288)` | `@Column(columnDefinition="text")` | TEXT | |
| 101 | + |
| 102 | +**Root Cause of "Bad value for type long" Error**: |
| 103 | +The `@Lob` annotation on `MatrixRow.symbolString` caused Hibernate to use OID-based CLOB handling in PostgreSQL. However, the data is inserted via direct JDBC in `DiscreteMatrixJDBC.batchUpdateRowSymbol()` using `setString()`, which writes plain text. When Hibernate tried to read the data back using `getClob()`, it attempted to interpret the text data as a CLOB OID, causing the error: |
| 104 | +``` |
| 105 | +PSQLException: Bad value for type long : 0002000000000000000000000-0-0000000---00-000000 |
| 106 | +``` |
| 107 | + |
| 108 | +**Solution**: Removed `@Lob` and used `@Column(columnDefinition = "text")` to ensure TEXT column type without CLOB semantics. |
| 109 | + |
| 110 | +### 7. PhyloTree Table (Preventive Fix) |
| 111 | + |
| 112 | +| Column | Hibernate Definition (Before) | Hibernate Definition (After) | SQL Schema | |
| 113 | +|--------|-------------------------------|------------------------------|------------| |
| 114 | +| `newickstring` | `@Lob` + `@Column(length=4194304)` | `@Column(columnDefinition="text")` | TEXT | |
| 115 | + |
| 116 | +**Notes**: Same pattern as MatrixRow. Removed `@Lob` to prevent potential similar issues. |
| 117 | + |
| 118 | +## Root Cause Analysis |
| 119 | + |
| 120 | +### Why Discrepancies Exist |
| 121 | + |
| 122 | +1. **Dual Maintenance**: The SQL schema and Hibernate annotations are maintained separately |
| 123 | +2. **Historical Evolution**: The SQL schema evolved over time with patches while Hibernate annotations were updated independently |
| 124 | +3. **Different Base Files**: |
| 125 | + - `TBASE2_POSTGRES_CREATION.sql` appears more aligned with Hibernate |
| 126 | + - The snapshot approach uses older schema + patches |
| 127 | +4. **Missing Patch**: Patch 0011 wasn't added to the patch inclusion list |
| 128 | + |
| 129 | +### The Two Initialization Paths |
| 130 | + |
| 131 | +``` |
| 132 | +CI/CD Path: |
| 133 | + snapshot → patches (0001-0011) → final schema |
| 134 | +
|
| 135 | +Docker Path: |
| 136 | + TBASE2_POSTGRES_CREATION.sql → initTreebase.sql → final schema |
| 137 | +``` |
| 138 | + |
| 139 | +These paths should produce equivalent schemas but use different mechanisms. |
| 140 | + |
| 141 | +## Recommendations |
| 142 | + |
| 143 | +### Option A: Continue with SQL-Based Schema (RECOMMENDED) |
| 144 | +**Pros**: |
| 145 | +- Known working approach - all tests pass |
| 146 | +- Explicit control over schema |
| 147 | +- Migration scripts for production |
| 148 | +- Seed data loading order is controlled |
| 149 | + |
| 150 | +**Cons**: |
| 151 | +- Dual maintenance burden |
| 152 | +- Potential for drift between Hibernate and SQL |
| 153 | +- Requires diligent patch management |
| 154 | + |
| 155 | +**Action Items**: |
| 156 | +1. ✅ Add patch 0011 to `init_db_uptodate.pg` (DONE) |
| 157 | +2. ✅ Make patches idempotent (DONE) |
| 158 | +3. Consider creating a new schema snapshot that includes all patches |
| 159 | +4. Consider adding `hibernate.hbm2ddl.auto=validate` in production |
| 160 | + |
| 161 | +### Option B: Switch to Hibernate-Based Schema Generation |
| 162 | +**Pros**: |
| 163 | +- Single source of truth (Hibernate entities) |
| 164 | +- No impedance mismatch |
| 165 | +- Automatic schema updates with `hbm2ddl.auto=update` |
| 166 | + |
| 167 | +**Cons**: |
| 168 | +- ❌ **NOT VIABLE** - 21 test failures/errors (see Test Impact Analysis) |
| 169 | +- Requires significant refactoring of seed data loading |
| 170 | +- Risk of data loss in production if not carefully managed |
| 171 | +- Less control over exact DDL |
| 172 | + |
| 173 | +### Option C: Hybrid Approach |
| 174 | +**Strategy**: Use Hibernate for tests/development, SQL for production |
| 175 | + |
| 176 | +**Status**: Not recommended due to: |
| 177 | +- Seed data dependency issues |
| 178 | +- Would require rewriting test data setup |
| 179 | +- Maintenance burden of two approaches |
| 180 | + |
| 181 | +## Test Impact Analysis |
| 182 | + |
| 183 | +### Actual Test Results |
| 184 | + |
| 185 | +#### SQL-Based Schema (Current Approach) |
| 186 | +**Result**: ✅ All tests pass |
| 187 | +``` |
| 188 | +Tests run: 301, Failures: 0, Errors: 0, Skipped: 43 |
| 189 | +BUILD SUCCESS |
| 190 | +``` |
| 191 | + |
| 192 | +#### Hibernate-Based Schema (`hbm2ddl.auto=create`) |
| 193 | +**Result**: ❌ 12 failures, 9 errors |
| 194 | +``` |
| 195 | +Tests run: 301, Failures: 12, Errors: 9, Skipped: 53 |
| 196 | +BUILD FAILURE |
| 197 | +``` |
| 198 | + |
| 199 | +**Failed Tests (Missing Seed Data)**: |
| 200 | +- `ItemDefinitionDAOTest.testFindByDescription` |
| 201 | +- `ItemDefinitionDAOTest.testFindPredefinedItemDefinition` |
| 202 | +- `MatrixDAOTest.testfindKindByDescription` |
| 203 | +- `MatrixDataTypeDAOTest.testFindByDescription` |
| 204 | +- `AlgorithmDAOTest.testFinalAllUniqueAlgorithmDescriptions` |
| 205 | +- `StudyStatusDAOTest.testFindStatusInProgress/Published/Ready` |
| 206 | +- `PhyloTreeDAOTest.testFindTypeByDescription/findKindByDescription/findQualityByDescription` |
| 207 | +- `SubmissionServiceImplTest.testProcessNexusFile` |
| 208 | + |
| 209 | +**Error Tests (Foreign Key Violations)**: |
| 210 | +- `EnvironmentTest.testGetGeneratedKey` - null value in column violation |
| 211 | +- `EnvironmentTest.testSelectFromInsert` - null value in column violation |
| 212 | +- `MatrixServiceImplTest.testAddDelete` - NullPointerException (missing service beans) |
| 213 | +- `AnalysisServiceImplTest.testAddDelete` - NullPointerException |
| 214 | +- `StudyServiceImplTest.*` - Multiple NullPointerExceptions |
| 215 | + |
| 216 | +### Root Cause Analysis |
| 217 | + |
| 218 | +The test failures with Hibernate-based schema generation are caused by: |
| 219 | + |
| 220 | +1. **Missing Seed Data**: Hibernate creates empty tables. Tests expect pre-populated reference data for: |
| 221 | + - `ItemDefinition` (predefined item definitions) |
| 222 | + - `MatrixDataType` (DNA, RNA, Protein, etc.) |
| 223 | + - `StudyStatus` (In Progress, Ready, Published) |
| 224 | + - `TreeType`, `TreeKind`, `TreeQuality` |
| 225 | + - `Algorithm` (reference algorithm types) |
| 226 | + - `User`, `Person` (test user accounts) |
| 227 | + |
| 228 | +2. **Foreign Key Ordering**: When loading `initTreebase.sql` after Hibernate schema creation: |
| 229 | + - Hibernate creates FK constraints immediately |
| 230 | + - SQL script inserts data in wrong order (e.g., `user` before `person`) |
| 231 | + - Results in FK constraint violations |
| 232 | + |
| 233 | +3. **Schema Already Exists Errors**: |
| 234 | + - `password_reset_token` table created by Hibernate conflicts with SQL script |
| 235 | + |
| 236 | +### Conclusion |
| 237 | + |
| 238 | +**Hibernate-based schema generation is NOT suitable** for the current test setup because: |
| 239 | +1. Tests depend on seed data that must be loaded in a specific order |
| 240 | +2. The `initTreebase.sql` script was designed for SQL-based schema creation |
| 241 | +3. Significant refactoring of test data setup would be required |
| 242 | + |
| 243 | +**Recommendation**: Continue with **Option A (SQL-Based Schema)** with the following improvements: |
| 244 | +1. ✅ Keep patches idempotent (fixed in this PR) |
| 245 | +2. ✅ Ensure all patches are included in `init_db_uptodate.pg` |
| 246 | +3. Consider adding `hibernate.hbm2ddl.auto=validate` in production to catch drift |
| 247 | + |
| 248 | +### Tests to Monitor |
| 249 | + |
| 250 | +Based on the codebase, these test categories should be monitored: |
| 251 | + |
| 252 | +1. **DAO Tests** (`org.cipres.treebase.dao.*`) |
| 253 | + - Test CRUD operations against database |
| 254 | + - May be affected by constraint differences |
| 255 | + |
| 256 | +2. **Service Tests** (`org.cipres.treebase.service.*`) |
| 257 | + - Test business logic with database |
| 258 | + - Should be unaffected by schema generation method |
| 259 | + |
| 260 | +3. **Domain Tests** (`org.cipres.treebase.domain.*`) |
| 261 | + - Test entity relationships |
| 262 | + - May be affected by cascade/fetch settings |
| 263 | + |
| 264 | +## Implementation Plan |
| 265 | + |
| 266 | +### Phase 1: Fix Immediate Issues (COMPLETED) |
| 267 | +1. ✅ Add patch 0011 to `init_db_uptodate.pg` |
| 268 | + |
| 269 | +### Phase 2: Validate Current State |
| 270 | +1. Run full test suite with current SQL-based initialization |
| 271 | +2. Document any existing test failures |
| 272 | + |
| 273 | +### Phase 3: Test Hibernate-Based Initialization |
| 274 | +1. Create test configuration with `hbm2ddl.auto=create` |
| 275 | +2. Run tests to identify failures |
| 276 | +3. Document failures and root causes |
| 277 | + |
| 278 | +### Phase 4: Implement Chosen Strategy |
| 279 | +1. Based on test results, implement Option A, B, or C |
| 280 | +2. Update CI/CD configuration |
| 281 | +3. Update Docker configuration |
| 282 | +4. Update documentation |
| 283 | + |
| 284 | +## Appendix: Column Length Constants |
| 285 | + |
| 286 | +From `org.cipres.treebase.domain.TBPersistable`: |
| 287 | + |
| 288 | +```java |
| 289 | +public static final int COLUMN_LENGTH_30 = 30; |
| 290 | +public static final int COLUMN_LENGTH_50 = 50; |
| 291 | +public static final int COLUMN_LENGTH_100 = 100; |
| 292 | +public static final int COLUMN_LENGTH_STRING = 255; |
| 293 | +public static final int COLUMN_LENGTH_500 = 500; |
| 294 | +public static final int COLUMN_LENGTH_STRING_1K = 1000; |
| 295 | +public static final int COLUMN_LENGTH_STRING_NOTES = 2000; |
| 296 | +public static final int COLUMN_LENGTH_STRING_MAX = 5000; |
| 297 | + |
| 298 | +public static final int CITATION_TITLE_COLUMN_LENGTH = 500; |
| 299 | +public static final int CITATION_ABSTRACT_COLUMN_LENGTH = 10000; |
| 300 | +public static final int CITATION_KEYWORDS_COLUMN_LENGTH = 1000; |
| 301 | +public static final int CITATION_JOURNAL_COLUMN_LENGTH = 500; |
| 302 | +``` |
| 303 | + |
| 304 | +These constants define the expected column lengths in Hibernate and should be used as the reference when comparing with SQL schemas. |
0 commit comments