add a quick way of tokenizing by character #21

@philipp

Description

Because the "tokenize" parameter is tested for existence (truthiness) rather than for being defined, it's challenging to tokenize on "nothing", which would split everything into individual characters.

Notably, there is also a difference in behavior between the Python and Perl implementations: distribution.py will successfully split on "0", while with "-t=0" the Perl version acts as though I hadn't passed a tokenize parameter at all.

The Perl-with-zero behavior should be easy to fix, but I'd suggest adding another special "tokenize" value (along with the existing "white" and "word") of "char" or something similar.

I'm not very experienced with Python, and while in Perl you can simply add a line like
elsif ($tokenize eq 'char') { $tokenize = ''; }
as far as I can tell, Python will not behave the same way when splitting on an empty regex. It's also beyond me how to properly test whether the parameter was defined on the command line at all, i.e. distinguishing "None" from a provided (but falsy) value like "0".

Anyway, there's always a workaround for now: split the input into characters before it even reaches the script, e.g.
cat theFile | perl -ne 'print join "\n", split //' | distribution
But it feels like something that should be available more easily.
