SPDX-FileCopyrightText: 2024 PyThaiNLP Project
SPDX-License-Identifier: Apache-2.0

nlpO3

Thai natural language processing library in Rust,
with Python and Node bindings. Formerly oxidized-thainlp.

To use as a library in a Rust project:

cargo add nlpo3

To use as a library in a Python project:

pip install nlpo3

Features
Use
Build
Develop
License

Features

Thai word tokenizer
- Uses a maximal-matching, dictionary-based tokenization algorithm
  and honors Thai Character Cluster (TCC) boundaries
  - 2.5x faster
    than a similar pure-Python implementation (PyThaiNLP’s newmm)
- Loads a dictionary from a plain text file (one word per line)
  or from a Vec<String>
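
To illustrate what "maximal matching" means, here is a minimal, self-contained sketch that splits a string into the fewest possible tokens using dictionary words. It is not nlpO3's implementation: the function name `maximal_matching` is hypothetical, the sketch ignores Thai Character Cluster boundaries, and it treats each character that no word covers as a one-character token, which is an assumption made here for simplicity.

```rust
use std::collections::HashSet;

/// Segment `text` into the fewest possible tokens, preferring dictionary words.
/// A character that no dictionary word covers becomes a one-character token
/// (an assumption of this sketch, not documented nlpO3 behavior).
fn maximal_matching(text: &str, dict: &HashSet<String>) -> Vec<String> {
    // Work on chars so that slicing never splits a multi-byte Thai character.
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    let max_len = dict.iter().map(|w| w.chars().count()).max().unwrap_or(0);

    // best[i] = (fewest tokens needed for chars[i..], length of the first token)
    let mut best: Vec<(usize, usize)> = vec![(0, 0); n + 1];
    for i in (0..n).rev() {
        // Fallback: emit a single character as its own token.
        let mut choice = (best[i + 1].0 + 1, 1);
        for len in 1..=max_len.min(n - i) {
            let candidate: String = chars[i..i + len].iter().collect();
            if dict.contains(&candidate) {
                let count = best[i + len].0 + 1;
                // Prefer fewer tokens; on a tie, prefer the longer first word.
                if count < choice.0 || (count == choice.0 && len > choice.1) {
                    choice = (count, len);
                }
            }
        }
        best[i] = choice;
    }

    // Walk the table from the start of the text to recover the tokens.
    let mut tokens: Vec<String> = Vec::new();
    let mut i = 0;
    while i < n {
        let len = best[i].1;
        tokens.push(chars[i..i + len].iter().collect());
        i += len;
    }
    tokens
}
```

With a dictionary containing ห้อง, สมุด, ห้องสมุด, ประชา, ชน, and ประชาชน, this sketch splits ห้องสมุดประชาชน into the two longest words rather than four shorter ones, because two tokens is the minimum.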

Use

Node.js binding

See nlpo3-nodejs.

Python binding

Example:

from nlpo3 import load_dict, segment

load_dict("path/to/dict.file", "dict_name")
segment("สวัสดีครับ", "dict_name")

See more at nlpo3-python.

Rust library

Add as a dependency

To use as a library in a Rust project:

cargo add nlpo3

This adds nlpo3 to the dependencies section of Cargo.toml:

[dependencies]
# ...
nlpo3 = "1.4.0"

Example

Create a tokenizer from a dictionary file,
then use it to tokenize a string (safe mode = true, parallel mode = false):

use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;

let tokenizer = NewmmTokenizer::new("path/to/dict.file");
let tokens = tokenizer.segment("ห้องสมุดประชาชน", true, false).unwrap();

Create a tokenizer using a dictionary from a vector of Strings:

let words = vec!["ปาลิเมนต์".to_string(), "คอนสติติวชั่น".to_string()];
// Declared `mut` so that words can be added or removed later.
let mut tokenizer = NewmmTokenizer::from_word_list(words);

Add words to an existing tokenizer:

tokenizer.add_word(&["มิวเซียม"]);

Remove words from an existing tokenizer:

tokenizer.remove_word(&["กระเพรา", "ชานชลา"]);

Command-line interface

Example:

echo "ฉันกินข้าว" | nlpo3 segment

See more at nlpo3-cli.

Dictionary

In the interest of library size, nlpO3 does not come with a dictionary
and makes no assumption about which dictionary the user would like to use.
The dictionary-based word tokenizer needs one, so you must supply your own.
For a tokenization dictionary, try:
- words_th.txt from PyThaiNLP
  - ~62,000 words
  - CC0-1.0
- word break dictionary from libthai
  - consists of dictionaries in different categories, with a make script
  - LGPL-2.1
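
If you want to inspect or filter a dictionary before handing it to the tokenizer, the one-word-per-line format is simple to read with the standard library alone. The sketch below is only an illustration: the helper names `parse_word_list` and `read_word_list` are hypothetical, and trimming whitespace, skipping blank lines, and dropping a leading byte-order mark are assumptions made here, not documented nlpO3 behavior.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Parse dictionary text that holds one word per line.
/// Blank lines are skipped and surrounding whitespace is trimmed.
fn parse_word_list(contents: &str) -> Vec<String> {
    contents
        // Drop a UTF-8 byte-order mark if the file starts with one.
        .trim_start_matches('\u{feff}')
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect()
}

/// Read a dictionary file from disk and return its words.
fn read_word_list<P: AsRef<Path>>(path: P) -> io::Result<Vec<String>> {
    let contents = fs::read_to_string(path)?;
    Ok(parse_word_list(&contents))
}
```

The resulting Vec<String> has the shape that NewmmTokenizer::from_word_list takes in the Rust example above.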

Build

Requirements

Rust 2018 Edition

Steps

Generic test:

cargo test

Build API document and open it to check:

cargo doc --open

Build (remove --release to keep debug information):

cargo build --release

Check target/ for build artifacts.

Develop

Development document

Notes on custom string

Issues

Please report issues at https://github.com/PyThaiNLP/nlpo3/issues

License

nlpO3 is copyrighted by its authors
and licensed under the terms of the Apache Software License 2.0 (Apache-2.0).
See the LICENSE file for details.