
shcabin/sqlite-fts5-icu-tokenizer


Why Build Another Tokenizer?

GitHub hosts many tokenizers, but most have one of two limitations: they support only specific languages, or they require the language to be specified up front. My goal is a tokenizer that works when the language of the text is not known in advance, particularly for mixed-language text.

sqlite-fts5-icu-tokenizer

This SQLite FTS5 extension provides tokenization for full-text search based on International Components for Unicode (ICU). It supports languages that do not separate words with spaces, such as Chinese and Japanese.
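To illustrate why splitting on whitespace is not enough for mixed-language text, here is a minimal C++ sketch. It is not the extension's code and it does not use ICU: ICU segments Chinese and Japanese with dictionary-based word breaking, whereas this sketch substitutes a much simpler rule and emits every CJK character as its own token. The names `is_cjk`, `is_ascii_alnum` and `split_mixed` are hypothetical, and the code point ranges are deliberately incomplete.

```cpp
#include <string>
#include <vector>

// Hypothetical classifier covering only a few CJK blocks (not exhaustive).
bool is_cjk(char32_t cp) {
    return (cp >= 0x4E00 && cp <= 0x9FFF)    // CJK Unified Ideographs
        || (cp >= 0x3040 && cp <= 0x309F)    // Hiragana
        || (cp >= 0x30A0 && cp <= 0x30FF);   // Katakana
}

// ASCII letters and digits only; a real tokenizer would handle all scripts.
bool is_ascii_alnum(char32_t cp) {
    return (cp >= U'0' && cp <= U'9')
        || (cp >= U'A' && cp <= U'Z')
        || (cp >= U'a' && cp <= U'z');
}

// Splits UTF-32 text into tokens: runs of ASCII letters/digits form one
// token, each CJK character forms its own token, everything else separates.
std::vector<std::u32string> split_mixed(const std::u32string& text) {
    std::vector<std::u32string> tokens;
    std::u32string run;  // ASCII run currently being collected
    auto flush = [&]() {
        if (!run.empty()) {
            tokens.push_back(run);
            run.clear();
        }
    };
    for (char32_t cp : text) {
        if (is_ascii_alnum(cp)) {
            run.push_back(cp);
        } else {
            flush();  // any other character ends the current ASCII run
            if (is_cjk(cp)) {
                tokens.push_back(std::u32string(1, cp));
            }
        }
    }
    flush();
    return tokens;
}
```

With this rule, an input such as "SQLite" immediately followed by two CJK characters yields three tokens even though the input contains no spaces. A whitespace-only splitter would return the whole string as a single token.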

It also follows the 'remove_diacritics=2' behavior of the built-in unicode61 tokenizer: by default, diacritics are removed from all Latin script characters. For example, "A", "a", "À", "à", "Â" and "â" are all treated as equivalent.
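The equivalence described above can be sketched as a folding step applied to each token before it is indexed or matched. The following C++ is an illustration only, not the extension's implementation (which presumably relies on ICU rather than a hand-written table). The helper names `fold_code_point` and `fold_token` are hypothetical, and the lookup table covers just a handful of letters.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper: maps a code point to a lowercase, diacritic-free form.
// The table is a tiny illustrative subset of Latin letters with diacritics.
char32_t fold_code_point(char32_t cp) {
    static const std::unordered_map<char32_t, char32_t> kBaseLetter = {
        {U'\u00C0', U'a'},  // À
        {U'\u00C2', U'a'},  // Â
        {U'\u00E0', U'a'},  // à
        {U'\u00E2', U'a'},  // â
        {U'\u00C9', U'e'},  // É
        {U'\u00E9', U'e'},  // é
    };
    auto it = kBaseLetter.find(cp);
    if (it != kBaseLetter.end()) {
        return it->second;
    }
    if (cp >= U'A' && cp <= U'Z') {
        return static_cast<char32_t>(cp - U'A' + U'a');  // ASCII case folding
    }
    return cp;  // everything else passes through unchanged
}

// Folds every code point of a UTF-32 token.
std::u32string fold_token(const std::u32string& token) {
    std::u32string out;
    out.reserve(token.size());
    for (char32_t cp : token) {
        out.push_back(fold_code_point(cp));
    }
    return out;
}
```

Because the same folding is applied to indexed text and to query terms, a search for "a" can match text containing "À" or "â".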

The implementation fully complies with the FTS5 v2 API specifications and is written in C++.

Prerequisites

  1. Option 1: install packages
  • Install the ICU and SQLite3 development libraries. On Ubuntu, you can use:
sudo apt-get install libicu-dev libsqlite3-dev
  2. Option 2: build from source
tar -zxvf icu-release-78.1.tar.gz
cd icu-release-78.1/icu4c/source
./runConfigureICU Linux --enable-static=no --enable-shared=yes
make -j4
# Return to the directory that contains both source trees
cd ../../..
unzip sqlite-src-3500400.zip
cd sqlite-src-3500400
CPPFLAGS="-I ../icu-release-78.1/icu4c/source/common/ -I ../icu-release-78.1/icu4c/source/i18n/" LDFLAGS="-L ../icu-release-78.1/icu4c/source/lib" ./configure --enable-fts5 --with-icu-ldflags="-licui18n -licuuc -licudata"
make

Building

make 

output: libicu.so

Usage

Load the extension in SQLite

SQLite CLI

.load libicu
CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize=icu);

Demo Test

make test
LD_LIBRARY_PATH=./ ./test icu

TODO

  • Optional stopwords
  • Use locale-specific tokenizers

About

 SQLite FTS5 ICU tokenizer extension for CJK languages, with support for character normalization.
