Skip to content

Support for Arabic #9

@spookyQubit

Description

@spookyQubit

Hi @bowbowbow, thanks a lot for putting this together. Was wondering if it will be easy to extend the content in main.py to support Arabic.

In my initial trials, I tried the following:

  1. Created data_list_arabic.csv file to include train/dev/test splits. An example of the first few lines of the file looks like the following:
type,path
train,nw/adj/ALH20001201.1900.0126
train,nw/adj/ALH20001201.1300.0071
dev,nw/adj/ALH20001128.1300.0081
test,nw/adj/ALH20001125.0700.0024
test,nw/adj/ALH20001124.1900.0127
  1. Built Arabic properties following the info in https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-arabic.properties:
arabic_properties = {'annotators': 'tokenize,ssplit,pos,lemma,parse',
                         'tokenize.language': 'ar',
                         'segment.model': 'edu/stanford/nlp/models/segmenter/arabic/arabic-segmenter-atb+bn+arztrain.ser.gz',
                         'ssplit.boundaryTokenRegex': '[.]|[!?]+|[!\u061F]+',
                         'pos.model': 'edu/stanford/nlp/models/pos-tagger/arabic/arabic.tagger',
                         'parse.model': 'edu/stanford/nlp/models/lexparser/arabicFactored.ser.gz'}
  1. Created the nlp_res_raw object as:
nlp_res_raw = nlp.annotate(item['sentence'], properties=arabic_properties)
  1. Downloaded the Arabic models:
cd stanford-corenlp-full-2018-10-05
wget http://nlp.stanford.edu/software/stanford-arabic-corenlp-2018-02-27-models.jar

Now when I run the script, I keep getting the following error: Failed to load segmenter edu/stanford/nlp/models/segmenter/arabic/arabic-segmenter-atb+bn+arztrain.ser.gz.

I must be making a mistake somewhere of not downloading the correct package or pointing an env_variable to the correct location. Any help to add support for Arabic is greatly appreciated.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions