Skip to content

Commit a7179ee

Browse files
authored
Merge pull request #420 from TreeBASE/master
Merge recent refactorings and bug fixes
2 parents 3433367 + fe65e6a commit a7179ee

70 files changed

Lines changed: 36101 additions & 15755 deletions

File tree

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

Dockerfile

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,7 @@ RUN mvn dependency:go-offline -B || true
1616

1717
# Copy source code
1818
COPY treebase-core/src treebase-core/src
19+
COPY treebase-core/lib treebase-core/lib
1920
COPY treebase-web/src treebase-web/src
2021
COPY treebase-web/lib treebase-web/lib
2122
COPY oai-pmh_data_provider oai-pmh_data_provider
@@ -45,8 +46,13 @@ COPY --from=builder /build/treebase-web/target/treebase-web.war /usr/local/tomca
4546
RUN curl -o /usr/local/tomcat/lib/postgresql.jar \
4647
https://jdbc.postgresql.org/download/postgresql-42.7.7.jar
4748

48-
# Create a directory for Mesquite (placeholder)
49-
RUN mkdir -p /usr/local/mesquite
49+
# Create a directory for Mesquite and copy the headless Mesquite library
50+
# The treebase-core/lib folder contains the headless Mesquite distribution with:
51+
# - mesquite/ - Mesquite core classes
52+
# - headless/ - Headless AWT implementation
53+
# - com/apple/ - Apple API stubs (required by Mesquite even on non-Mac platforms)
54+
# - Other supporting libraries
55+
COPY --from=builder /build/treebase-core/lib /usr/local/mesquite
5056

5157
# Set environment variables for Tomcat
5258
# Java 17 compatibility flags based on GitHub Actions workflow

Dockerfile.dev

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,8 @@ RUN curl -o /usr/local/tomcat/lib/postgresql.jar \
1919
https://jdbc.postgresql.org/download/postgresql-42.7.7.jar
2020

2121
# Create Mesquite directory placeholder
22+
# Note: In development mode, the entrypoint script will copy
23+
# the Mesquite library from the mounted /app/treebase-core/lib
2224
RUN mkdir -p /usr/local/mesquite
2325

2426
# Set up environment variables

WEB_UI_ANALYSIS.md

Lines changed: 0 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -416,9 +416,6 @@ styles/
416416
| `autocomplete.js` | 3.7KB | Autocomplete widget | High - Prototype-dependent |
417417
| `menuExpandable.js` | 6.1KB | Expandable menu navigation | Medium - Pure JS possible |
418418
| `ajaxProgress.js` | 1KB | Progress indicators | Medium - Uses DWR |
419-
| `d3.phylogram.js` | 12KB | D3-based tree rendering | Low - Uses D3 v7 |
420-
| `newick.js` | 3.2KB | Newick format parsing | Low - Pure JavaScript |
421-
| `sha1.js` | 4.4KB | SHA1 hashing | Low - Pure JavaScript |
422419
| `googleAnalytics.js` | 3.4KB | Analytics integration | Low - Standard GA |
423420
| `multiFileUpload.js` | 1KB | File upload handling | Medium |
424421
| `xp_progress.js` | 2.5KB | Progress bars | Medium |
Lines changed: 304 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,304 @@
1+
# Hibernate ORM vs SQL Schema Analysis
2+
3+
This document provides a comprehensive analysis of discrepancies between the Hibernate ORM data model and the SQL-based PostgreSQL schema in TreeBASE.
4+
5+
## Executive Summary
6+
7+
**Finding**: There are discrepancies between the Hibernate ORM layer (source of truth) and the SQL schema instantiated by the database initialization scripts. However, switching to Hibernate-based schema generation introduces significant issues due to seed data dependencies.
8+
9+
**Key Issues Identified**:
10+
1. **Missing patch in CI/CD**: Patch `0011_increase-citation-column-lengths.sql` was not included in `init_db_uptodate.pg` (now fixed)
11+
2. **Column length mismatches**: Several columns in the SQL schema have different lengths than defined in Hibernate (fixed by patches)
12+
3. **Different initialization paths**: CI/CD and Docker use different initialization approaches
13+
4. **Patch idempotency issues**: Some patches failed when schema already had correct types (now fixed)
14+
15+
**Recommendation**: **Keep the SQL-based schema with patches** (Option A) rather than switching to Hibernate-based generation. See "Test Impact Analysis" section for reasoning.
16+
17+
## Current Architecture
18+
19+
### CI/CD Database Initialization
20+
- Uses `treebase-core/db/schema/init_db_uptodate.pg`
21+
- Applies snapshot `0000_SCHEMA_before_patches_start.sql` + `0000_DATA_before_patches_start.sql`
22+
- Then applies patches 0001 through 0011 sequentially
23+
- **Fixed**: Patch 0011 is now included
24+
25+
### Docker Database Initialization
26+
- Uses `docker-compose.yml` volume mounts
27+
- Applies:
28+
1. `docker/00-init-roles.sql` - Role initialization
29+
2. `treebase-core/src/main/resources/TBASE2_POSTGRES_CREATION.sql` - Schema creation
30+
3. `treebase-core/src/main/resources/initTreebase.sql` - Initial data
31+
4. `docker/03-migration-hibernate-sequence.sql` - Hibernate sequence migration
32+
33+
### Hibernate Configuration
34+
- `hibernate.hbm2ddl.auto=` (empty/disabled)
35+
- Uses annotation-based mapping (`@Entity`, `@Table`, `@Column`)
36+
- Entities defined in `org.cipres.treebase.domain.*`
37+
38+
## Detailed Schema Discrepancies
39+
40+
### 1. Citation Table
41+
42+
| Column | Hibernate Definition | SQL Schema (snapshot) | SQL Schema (TBASE2_POSTGRES_CREATION) | Patch Applied |
43+
|--------|---------------------|----------------------|---------------------------------------|---------------|
44+
| `title` | VARCHAR(500) | VARCHAR(500) | VARCHAR(500) | - |
45+
| `abstract` | VARCHAR(10000) | VARCHAR(10000) | VARCHAR(10000) | - |
46+
| `keywords` | VARCHAR(1000) | VARCHAR(255) | VARCHAR(1000) | Patch 0011 |
47+
| `journal` | VARCHAR(500) | VARCHAR(255) | VARCHAR(500) | Patch 0011 |
48+
49+
**Notes**:
50+
- The snapshot has outdated column lengths for `keywords` (255 vs 1000) and `journal` (255 vs 500)
51+
- Patch 0011 fixes these in the snapshot-based initialization
52+
- `TBASE2_POSTGRES_CREATION.sql` already has correct values
53+
54+
### 2. TaxonLabel Table
55+
56+
| Column | Hibernate Definition | SQL Schema (snapshot) | Patch Applied |
57+
|--------|---------------------|----------------------|---------------|
58+
| `linked` | BOOLEAN | BOOLEAN | Patch 0010 |
59+
60+
**Notes**:
61+
- Earlier SQL had `linked` as `smallint`, but this was fixed by Patch 0010
62+
- Both snapshot (after patches) and TBASE2_POSTGRES_CREATION.sql now use BOOLEAN
63+
64+
### 3. Help Table
65+
66+
| Column | Hibernate Definition | SQL Schema |
67+
|--------|---------------------|------------|
68+
| `tag` | VARCHAR(255) (implicit) | VARCHAR(255) |
69+
| `helptext` | TEXT (LOB, 65536) | TEXT |
70+
71+
**Notes**: Schema matches.
72+
73+
### 4. PasswordResetToken Table
74+
75+
| Column | Hibernate Definition | SQL Schema (Patch 0009) |
76+
|--------|---------------------|------------------------|
77+
| `token_id` | BIGINT (auto-increment) | BIGINT (sequence) |
78+
| `token` | VARCHAR(100), unique, NOT NULL | VARCHAR(100), unique, NOT NULL |
79+
| `user_id` | BIGINT, NOT NULL | BIGINT, NOT NULL, FK |
80+
| `expiry_date` | TIMESTAMP, NOT NULL | TIMESTAMP, NOT NULL |
81+
| `used` | BOOLEAN, NOT NULL | BOOLEAN, NOT NULL |
82+
83+
**Notes**:
84+
- Hibernate uses `@GeneratedValue(strategy = GenerationType.IDENTITY)`
85+
- SQL uses a sequence - these are compatible in PostgreSQL
86+
- Schema matches structurally
87+
88+
### 5. AnalysisStep Table
89+
90+
| Column | Hibernate Definition | SQL Schema (snapshot) |
91+
|--------|---------------------|----------------------|
92+
| `tb_analysisid` | Not in Hibernate entity | VARCHAR(34) |
93+
94+
**Notes**: The SQL schema has a `tb_analysisid` column that doesn't appear to be mapped in Hibernate. This is a legacy TB1 field.
95+
96+
### 6. MatrixRow Table (CRITICAL - LOB Type Mismatch)
97+
98+
| Column | Hibernate Definition (Before) | Hibernate Definition (After) | SQL Schema |
99+
|--------|-------------------------------|------------------------------|------------|
100+
| `symbolstring` | `@Lob` + `@Column(length=524288)` | `@Column(columnDefinition="text")` | TEXT |
101+
102+
**Root Cause of "Bad value for type long" Error**:
103+
The `@Lob` annotation on `MatrixRow.symbolString` caused Hibernate to use OID-based CLOB handling in PostgreSQL. However, the data is inserted via direct JDBC in `DiscreteMatrixJDBC.batchUpdateRowSymbol()` using `setString()`, which writes plain text. When Hibernate tried to read the data back using `getClob()`, it attempted to interpret the text data as a CLOB OID, causing the error:
104+
```
105+
PSQLException: Bad value for type long : 0002000000000000000000000-0-0000000---00-000000
106+
```
107+
108+
**Solution**: Removed `@Lob` and used `@Column(columnDefinition = "text")` to ensure TEXT column type without CLOB semantics.
109+
110+
### 7. PhyloTree Table (Preventive Fix)
111+
112+
| Column | Hibernate Definition (Before) | Hibernate Definition (After) | SQL Schema |
113+
|--------|-------------------------------|------------------------------|------------|
114+
| `newickstring` | `@Lob` + `@Column(length=4194304)` | `@Column(columnDefinition="text")` | TEXT |
115+
116+
**Notes**: Same pattern as MatrixRow. Removed `@Lob` to prevent potential similar issues.
117+
118+
## Root Cause Analysis
119+
120+
### Why Discrepancies Exist
121+
122+
1. **Dual Maintenance**: The SQL schema and Hibernate annotations are maintained separately
123+
2. **Historical Evolution**: The SQL schema evolved over time with patches while Hibernate annotations were updated independently
124+
3. **Different Base Files**:
125+
- `TBASE2_POSTGRES_CREATION.sql` appears more aligned with Hibernate
126+
- The snapshot approach uses older schema + patches
127+
4. **Missing Patch**: Patch 0011 wasn't added to the patch inclusion list
128+
129+
### The Two Initialization Paths
130+
131+
```
132+
CI/CD Path:
133+
snapshot → patches (0001-0011) → final schema
134+
135+
Docker Path:
136+
TBASE2_POSTGRES_CREATION.sql → initTreebase.sql → final schema
137+
```
138+
139+
These paths should produce equivalent schemas but use different mechanisms.
140+
141+
## Recommendations
142+
143+
### Option A: Continue with SQL-Based Schema (RECOMMENDED)
144+
**Pros**:
145+
- Known working approach - all tests pass
146+
- Explicit control over schema
147+
- Migration scripts for production
148+
- Seed data loading order is controlled
149+
150+
**Cons**:
151+
- Dual maintenance burden
152+
- Potential for drift between Hibernate and SQL
153+
- Requires diligent patch management
154+
155+
**Action Items**:
156+
1. ✅ Add patch 0011 to `init_db_uptodate.pg` (DONE)
157+
2. ✅ Make patches idempotent (DONE)
158+
3. Consider creating a new schema snapshot that includes all patches
159+
4. Consider adding `hibernate.hbm2ddl.auto=validate` in production
160+
161+
### Option B: Switch to Hibernate-Based Schema Generation
162+
**Pros**:
163+
- Single source of truth (Hibernate entities)
164+
- No impedance mismatch
165+
- Automatic schema updates with `hbm2ddl.auto=update`
166+
167+
**Cons**:
168+
-**NOT VIABLE** - 21 test failures/errors (see Test Impact Analysis)
169+
- Requires significant refactoring of seed data loading
170+
- Risk of data loss in production if not carefully managed
171+
- Less control over exact DDL
172+
173+
### Option C: Hybrid Approach
174+
**Strategy**: Use Hibernate for tests/development, SQL for production
175+
176+
**Status**: Not recommended due to:
177+
- Seed data dependency issues
178+
- Would require rewriting test data setup
179+
- Maintenance burden of two approaches
180+
181+
## Test Impact Analysis
182+
183+
### Actual Test Results
184+
185+
#### SQL-Based Schema (Current Approach)
186+
**Result**: ✅ All tests pass
187+
```
188+
Tests run: 301, Failures: 0, Errors: 0, Skipped: 43
189+
BUILD SUCCESS
190+
```
191+
192+
#### Hibernate-Based Schema (`hbm2ddl.auto=create`)
193+
**Result**: ❌ 12 failures, 9 errors
194+
```
195+
Tests run: 301, Failures: 12, Errors: 9, Skipped: 53
196+
BUILD FAILURE
197+
```
198+
199+
**Failed Tests (Missing Seed Data)**:
200+
- `ItemDefinitionDAOTest.testFindByDescription`
201+
- `ItemDefinitionDAOTest.testFindPredefinedItemDefinition`
202+
- `MatrixDAOTest.testfindKindByDescription`
203+
- `MatrixDataTypeDAOTest.testFindByDescription`
204+
- `AlgorithmDAOTest.testFinalAllUniqueAlgorithmDescriptions`
205+
- `StudyStatusDAOTest.testFindStatusInProgress/Published/Ready`
206+
- `PhyloTreeDAOTest.testFindTypeByDescription/findKindByDescription/findQualityByDescription`
207+
- `SubmissionServiceImplTest.testProcessNexusFile`
208+
209+
**Error Tests (Foreign Key Violations)**:
210+
- `EnvironmentTest.testGetGeneratedKey` - null value in column violation
211+
- `EnvironmentTest.testSelectFromInsert` - null value in column violation
212+
- `MatrixServiceImplTest.testAddDelete` - NullPointerException (missing service beans)
213+
- `AnalysisServiceImplTest.testAddDelete` - NullPointerException
214+
- `StudyServiceImplTest.*` - Multiple NullPointerExceptions
215+
216+
### Root Cause Analysis
217+
218+
The test failures with Hibernate-based schema generation are caused by:
219+
220+
1. **Missing Seed Data**: Hibernate creates empty tables. Tests expect pre-populated reference data for:
221+
- `ItemDefinition` (predefined item definitions)
222+
- `MatrixDataType` (DNA, RNA, Protein, etc.)
223+
- `StudyStatus` (In Progress, Ready, Published)
224+
- `TreeType`, `TreeKind`, `TreeQuality`
225+
- `Algorithm` (reference algorithm types)
226+
- `User`, `Person` (test user accounts)
227+
228+
2. **Foreign Key Ordering**: When loading `initTreebase.sql` after Hibernate schema creation:
229+
- Hibernate creates FK constraints immediately
230+
- SQL script inserts data in wrong order (e.g., `user` before `person`)
231+
- Results in FK constraint violations
232+
233+
3. **Schema Already Exists Errors**:
234+
- `password_reset_token` table created by Hibernate conflicts with SQL script
235+
236+
### Conclusion
237+
238+
**Hibernate-based schema generation is NOT suitable** for the current test setup because:
239+
1. Tests depend on seed data that must be loaded in a specific order
240+
2. The `initTreebase.sql` script was designed for SQL-based schema creation
241+
3. Significant refactoring of test data setup would be required
242+
243+
**Recommendation**: Continue with **Option A (SQL-Based Schema)** with the following improvements:
244+
1. ✅ Keep patches idempotent (fixed in this PR)
245+
2. ✅ Ensure all patches are included in `init_db_uptodate.pg`
246+
3. Consider adding `hibernate.hbm2ddl.auto=validate` in production to catch drift
247+
248+
### Tests to Monitor
249+
250+
Based on the codebase, these test categories should be monitored:
251+
252+
1. **DAO Tests** (`org.cipres.treebase.dao.*`)
253+
- Test CRUD operations against database
254+
- May be affected by constraint differences
255+
256+
2. **Service Tests** (`org.cipres.treebase.service.*`)
257+
- Test business logic with database
258+
- Should be unaffected by schema generation method
259+
260+
3. **Domain Tests** (`org.cipres.treebase.domain.*`)
261+
- Test entity relationships
262+
- May be affected by cascade/fetch settings
263+
264+
## Implementation Plan
265+
266+
### Phase 1: Fix Immediate Issues (COMPLETED)
267+
1. ✅ Add patch 0011 to `init_db_uptodate.pg`
268+
269+
### Phase 2: Validate Current State
270+
1. Run full test suite with current SQL-based initialization
271+
2. Document any existing test failures
272+
273+
### Phase 3: Test Hibernate-Based Initialization
274+
1. Create test configuration with `hbm2ddl.auto=create`
275+
2. Run tests to identify failures
276+
3. Document failures and root causes
277+
278+
### Phase 4: Implement Chosen Strategy
279+
1. Based on test results, implement Option A, B, or C
280+
2. Update CI/CD configuration
281+
3. Update Docker configuration
282+
4. Update documentation
283+
284+
## Appendix: Column Length Constants
285+
286+
From `org.cipres.treebase.domain.TBPersistable`:
287+
288+
```java
289+
public static final int COLUMN_LENGTH_30 = 30;
290+
public static final int COLUMN_LENGTH_50 = 50;
291+
public static final int COLUMN_LENGTH_100 = 100;
292+
public static final int COLUMN_LENGTH_STRING = 255;
293+
public static final int COLUMN_LENGTH_500 = 500;
294+
public static final int COLUMN_LENGTH_STRING_1K = 1000;
295+
public static final int COLUMN_LENGTH_STRING_NOTES = 2000;
296+
public static final int COLUMN_LENGTH_STRING_MAX = 5000;
297+
298+
public static final int CITATION_TITLE_COLUMN_LENGTH = 500;
299+
public static final int CITATION_ABSTRACT_COLUMN_LENGTH = 10000;
300+
public static final int CITATION_KEYWORDS_COLUMN_LENGTH = 1000;
301+
public static final int CITATION_JOURNAL_COLUMN_LENGTH = 500;
302+
```
303+
304+
These constants define the expected column lengths in Hibernate and should be used as the reference when comparing with SQL schemas.

docker-compose.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -22,6 +22,7 @@ services:
2222
- ./docker/00-init-roles.sql:/docker-entrypoint-initdb.d/00-init-roles.sql
2323
- ./treebase-core/src/main/resources/TBASE2_POSTGRES_CREATION.sql:/docker-entrypoint-initdb.d/01-schema.sql
2424
- ./treebase-core/src/main/resources/initTreebase.sql:/docker-entrypoint-initdb.d/02-init.sql
25+
- ./docker/03-migration-hibernate-sequence.sql:/docker-entrypoint-initdb.d/03-migration-hibernate-sequence.sql
2526
healthcheck:
2627
test: ["CMD-SHELL", "pg_isready -U treebase"]
2728
interval: 10s
Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,14 @@
1+
-- Migration script to add hibernate_sequence if it doesn't exist
2+
-- This sequence is required by Hibernate's @CollectionId annotation
3+
-- for generating collection_id values in sub_matrix and sub_treeblock tables
4+
5+
DO $$
6+
BEGIN
7+
IF NOT EXISTS (SELECT FROM pg_catalog.pg_sequences WHERE sequencename = 'hibernate_sequence') THEN
8+
CREATE SEQUENCE hibernate_sequence;
9+
RAISE NOTICE 'Created hibernate_sequence';
10+
ELSE
11+
RAISE NOTICE 'hibernate_sequence already exists';
12+
END IF;
13+
END
14+
$$;

0 commit comments

Comments
 (0)