[ENC] Align .txt file uploads across String Similarity, Phonotactic Probability, and Neighbourhood Density #782

@kchall

String Similarity (SS), Phonotactic Probability (PP), and Neighbourhood Density (ND) each allow .txt file uploads, but they behave differently and accept different inputs. Ideally, all three would be aligned.

The current behaviour is:

  1. PhonProb
    Words in corpus, spelling: calculates
    Words in corpus, transcription: can’t do this — wants spelling
    Words not in corpus, spelling: can’t do this — not in corpus
    Words not in corpus, transcription: can’t do this — wants spelling and in corpus
    Words not in corpus, both: can’t do this, even though old docs said you could!
  • Ideally, PCT would calculate PP regardless of whether spelling or transcription is provided, and if there are words not in the corpus, it would skip them (reporting them to the user and returning N/A), while still calculating the rest of the list.
  2. ND
    Words in corpus, spelling: calculates (must specify that file contains spelling)
    Words in corpus, transcription: calculates (must specify that file contains transcription)
    Words not in corpus, spelling: calculates, giving NA for words not in corpus and telling you which they are
    Words not in corpus, transcription: calculates for all, explaining that some words aren’t in corpus
  • This one is currently the closest to the ideal solution for all!
  3. String Sim
    Word pairs in corpus, spelling: calculates
    Word pairs in corpus, transcription: gives NA for all, explaining that some words (all words) are not in corpus, and tells you which they are
    Word pairs not in corpus, spelling: calculates, giving either result if it can or NA for words not in corpus, and tells you which they are
    Word pairs not in corpus, transcription: gives NA for all, explaining that some words (all words) are not in corpus, and tells you which they are
  • This behaviour is basically fine, but there's no principled reason why the algorithm couldn’t calculate SS for word pairs given in transcription, even when the words are not in the corpus. The one place this might be problematic is phonological edit distance, but we could do what ND does and just grey out that option for that algorithm.
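
A rough sketch of the aligned behaviour described above (accept either spelling or transcription, return N/A for anything not in the corpus, and report those items while still calculating the rest). This is not PCT's actual code: `Entry`, `build_index`, and `run_on_upload` are hypothetical names, and it assumes transcriptions in the uploaded file are written as dot-delimited segments.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


class Entry:
    """Hypothetical stand-in for a corpus word; not PCT's real Word class."""

    def __init__(self, spelling: str, transcription: Tuple[str, ...]):
        self.spelling = spelling
        self.transcription = transcription


def build_index(corpus: Iterable[Entry], tier: str) -> Dict[str, Entry]:
    """Index the corpus by whichever tier the uploaded file contains."""
    if tier == "spelling":
        return {e.spelling: e for e in corpus}
    if tier == "transcription":
        # Assumption: uploaded transcriptions are dot-delimited, e.g. "k.æ.t".
        return {".".join(e.transcription): e for e in corpus}
    raise ValueError(f"unknown tier: {tier!r}")


def run_on_upload(
    lines: Iterable[str],
    corpus: Iterable[Entry],
    tier: str,
    measure: Callable[[Entry], float],
) -> Tuple[List[Tuple[str, Optional[float]]], List[str]]:
    """Apply `measure` to each uploaded item found in the corpus.

    Items not in the corpus get None (shown to the user as N/A) and are
    collected in `missing`; the remaining items are still calculated.
    """
    index = build_index(corpus, tier)
    results: List[Tuple[str, Optional[float]]] = []
    missing: List[str] = []
    for raw in lines:
        query = raw.strip()
        if not query:
            continue  # ignore blank lines in the upload
        entry = index.get(query)
        if entry is None:
            results.append((query, None))
            missing.append(query)
        else:
            results.append((query, measure(entry)))
    return results, missing


# Toy corpus and a placeholder measure (segment count), for illustration only.
corpus = [Entry("cat", ("k", "æ", "t")), Entry("dog", ("d", "ɔ", "g"))]
results, missing = run_on_upload(
    ["cat", "blick", ""], corpus, "spelling", lambda e: len(e.transcription)
)
```

With the toy corpus, "cat" is calculated and "blick" comes back as N/A and is listed in `missing`. The same wrapper could in principle sit in front of PP, ND, and SS (with pairs instead of single words for SS), which is the point of aligning them.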

Sample files to test all of this can be found in:
~/Dropbox/Phonological_CorpusTools_Public/PCT_text_file_upload_tests
