Bug: MS-SQL unquoted collation names cause ParseError in type parser #100

@myyong

Problem

When datafaker reads a table schema from an MS-SQL source, string columns that
carry a collation (e.g. VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS)
cause a parsy.ParseError when the orm.yaml is later loaded:

Failed to parse VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS
parsy.ParseError: ...

Root cause

The string_type parser in datafaker/serialize_metadata.py handled only
quoted collation names, the form the PostgreSQL dialect renders:

COLLATE "name"

MS-SQL renders collation names without quotes:

COLLATE SQL_Latin1_General_CP1_CI_AS

so the parser always failed on any MS-SQL string column with a collation.

Fix

Extended the collation clause in string_type to accept both forms using
parsy.alt:

collation: str | None = yield parsy.alt(
    # PostgreSQL: COLLATE "name" (quoted)
    parsy.string(' COLLATE "') >> parsy.regex(r'[^"]*') << parsy.string('"'),
    # MS-SQL: COLLATE name (unquoted identifier)
    parsy.string(" COLLATE ") >> parsy.regex(r'\S+'),
).optional()

The quoted path is tried first, so PostgreSQL behaviour is unchanged.
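
To illustrate why the order of the two alternatives matters, here is a minimal standard-library sketch of the same "quoted first, then unquoted" matching. It is not the datafaker implementation: split_collation and _COLLATION are hypothetical names, and a single re pattern stands in for the parsy.alt combinator shown above.

```python
from __future__ import annotations

import re

# Hypothetical stand-in for the collation clause (the real parser uses parsy).
# The quoted alternative comes first, mirroring the order given to parsy.alt.
_COLLATION = re.compile(r' COLLATE (?:"(?P<quoted>[^"]*)"|(?P<bare>\S+))')


def split_collation(type_text: str) -> tuple[str, str | None]:
    """Split a rendered column type into (base_type, collation or None)."""
    match = _COLLATION.search(type_text)
    if match is None:
        # No COLLATE clause at all, e.g. plain "VARCHAR(50)".
        return type_text, None
    name = match.group("quoted")
    if name is None:
        # The quoted form did not match, so the unquoted form did.
        name = match.group("bare")
    return type_text[: match.start()], name


if __name__ == "__main__":
    for text in (
        "VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS",  # MS-SQL style
        'VARCHAR(50) COLLATE "en_US"',                       # PostgreSQL style
        "VARCHAR(50)",                                       # no collation
    ):
        print(split_collation(text))
```

If the unquoted alternative were tried first, \S+ would also accept the quoted form and return the name with its surrounding quotes still attached, which is presumably why the quoted branch is listed first.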

Tests added

Four new tests in tests/test_serialize_metadata_mssql.py:

  • test_varchar_with_mssql_collation: VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS
  • test_nvarchar_with_mssql_collation: NVARCHAR(100) COLLATE Latin1_General_CI_AS
  • test_char_with_mssql_collation: CHAR(10) COLLATE SQL_Latin1_General_CP1_CI_AS
  • test_varchar_with_quoted_collation_still_works: regression test confirming the PostgreSQL quoted form still works
