EncodingRs

Character encoding and decoding for Elixir. Convert text between UTF-8 and legacy encodings like Shift_JIS, GBK, Big5, EUC-KR, Windows-1252, and more. Supports all 40 encodings from the WHATWG Encoding Standard (with 200+ label aliases).

Powered by Rust's encoding_rs - the same encoding library used by Firefox.

Use Cases

Processing Japanese text files - Shift_JIS, EUC-JP, ISO-2022-JP
Processing Chinese text files - GBK, GB18030, Big5
Processing Korean text files - EUC-KR
Importing legacy data - Windows-1252, ISO-8859-1, legacy code pages
Web scraping non-UTF-8 sites - decode HTML in any encoding
Converting file encodings - batch convert legacy files to UTF-8
Reading CSV/text with mixed encodings - detect and decode automatically

Supported Encodings

Japanese: Shift_JIS, EUC-JP, ISO-2022-JP

Chinese: GBK, GB18030, Big5

Korean: EUC-KR

Unicode: UTF-8, UTF-16LE, UTF-16BE

Western European: Windows-1252, ISO-8859-1, ISO-8859-15, macintosh

Central/Eastern European: Windows-1250, ISO-8859-2, Windows-1257

Cyrillic: Windows-1251, KOI8-R, KOI8-U, ISO-8859-5, x-mac-cyrillic

Greek: Windows-1253, ISO-8859-7

Turkish: Windows-1254, ISO-8859-9

Hebrew: Windows-1255, ISO-8859-8

Arabic: Windows-1256, ISO-8859-6

Vietnamese: Windows-1258

Thai: Windows-874

Baltic: ISO-8859-4, ISO-8859-13

And more - see the full list at encoding.spec.whatwg.org.

Features

High performance - SIMD-optimized Rust NIF, 3-15x faster than alternatives (see benchmarks)
Batch processing - encode/decode multiple items in a single NIF call for throughput
Streaming decoder - handle large files and chunked data without corrupting multibyte characters
BOM detection - automatically detect UTF-8, UTF-16LE, UTF-16BE from byte order marks
WHATWG compliant - implements the Encoding Standard used by browsers
Precompiled binaries - no Rust toolchain required for common platforms
Dirty schedulers - configurable threshold for offloading large operations (default 64KB)

Installation

def deps do
  [
    {:encoding_rs, "~> 0.2.3"}
  ]
end

The module is still named EncodingRs for API compatibility with the original package.

Precompiled binaries are available for common platforms. If a precompiled binary isn't available for your platform, you'll need Rust installed (use rustup).

Usage

One-Shot Encoding/Decoding

For complete binaries where all data is available at once:

# Decode from Shift_JIS to UTF-8
{:ok, string} = EncodingRs.decode(binary, "shift_jis")
string = EncodingRs.decode!(binary, "shift_jis")

# Encode from UTF-8 to Windows-1252
{:ok, binary} = EncodingRs.encode(string, "windows-1252")
binary = EncodingRs.encode!(string, "windows-1252")

# Check if encoding is supported
EncodingRs.encoding_exists?("utf-8")  # true

# Get canonical name for an alias
EncodingRs.canonical_name("latin1")  # {:ok, "windows-1252"}

Streaming Decoding

For chunked data (file streams, network data), use EncodingRs.Decoder to properly handle multibyte characters that may be split across chunk boundaries:

# Stream a Shift_JIS file to UTF-8
File.stream!("data.txt", [], 4096)
|> EncodingRs.Decoder.stream("shift_jis")
|> Enum.join()

# Manual chunked decoding
{:ok, decoder} = EncodingRs.Decoder.new("shift_jis")
{:ok, out1, _errors} = EncodingRs.Decoder.decode_chunk(decoder, chunk1, false)
{:ok, out2, _errors} = EncodingRs.Decoder.decode_chunk(decoder, chunk2, false)
{:ok, out3, _errors} = EncodingRs.Decoder.decode_chunk(decoder, final_chunk, true)
result = out1 <> out2 <> out3

Why streaming matters: Multibyte encodings like Shift_JIS encode many characters as more than one byte. If a chunk boundary splits such a character, a one-shot decode/2 call on each chunk sees an incomplete sequence, treats it as invalid, and produces replacement characters (�). The streaming decoder buffers incomplete sequences until the next chunk completes them.
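The boundary problem can be reproduced with the standard library alone. The sketch below is not how EncodingRs.Decoder is implemented: it uses UTF-8 as a stand-in for Shift_JIS (the standard library has no Shift_JIS codec) and a hypothetical ChunkBoundaryDemo module to show the buffering idea, where bytes that do not yet form a complete character are carried over and prepended to the next chunk.

```elixir
defmodule ChunkBoundaryDemo do
  # Hypothetical illustration only; not part of EncodingRs.
  # Returns {text, pending}: the text decodable so far and the trailing bytes
  # of an incomplete character that must wait for the next chunk.
  def feed(pending, chunk) do
    case :unicode.characters_to_binary(pending <> chunk, :utf8, :utf8) do
      text when is_binary(text) ->
        {text, <<>>}

      {:incomplete, text, rest} ->
        {text, rest}

      {:error, text, rest} ->
        raise ArgumentError, "invalid input after #{inspect(text)}: #{inspect(rest)}"
    end
  end

  # Feeds every chunk in order, carrying the pending bytes across boundaries.
  def decode_chunks(chunks) do
    {parts, pending} =
      Enum.map_reduce(chunks, <<>>, fn chunk, pending -> feed(pending, chunk) end)

    if pending != <<>>, do: raise(ArgumentError, "input ended mid-character")
    IO.iodata_to_binary(parts)
  end
end

# "あ" is three bytes in UTF-8; split it after the second byte.
<<head::binary-size(2), tail::binary>> = "あ"

String.valid?(head)
# false, because the first chunk on its own is not a complete character

ChunkBoundaryDemo.decode_chunks(["Hi ", head, tail])
# "Hi あ"
```

Decoding head and tail separately would lose the character, while carrying the two pending bytes forward recovers it. This is the same reason the final decode_chunk/3 call above passes true as its last argument: the decoder needs to know when no further bytes will arrive.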

BOM Detection

Detect encoding from a Byte Order Mark (BOM) at the start of a file:

# Detect BOM and get encoding
{:ok, "UTF-8", 3} = EncodingRs.detect_bom(<<0xEF, 0xBB, 0xBF, "hello">>)
{:ok, "UTF-16LE", 2} = EncodingRs.detect_bom(<<0xFF, 0xFE, ...>>)
{:ok, "UTF-16BE", 2} = EncodingRs.detect_bom(<<0xFE, 0xFF, ...>>)
{:error, :no_bom} = EncodingRs.detect_bom("no bom here")

# Detect and strip BOM in one step
{:ok, encoding, data_without_bom} = EncodingRs.detect_and_strip_bom(file_content)
{:ok, decoded} = EncodingRs.decode(data_without_bom, encoding)
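To make the byte patterns concrete, here is a minimal pure-Elixir sketch of BOM matching. BomSketch is a hypothetical module, not the library's implementation (which runs in the Rust NIF); it only mirrors the return shapes shown above so the three signatures and their lengths are easy to see.

```elixir
defmodule BomSketch do
  # Hypothetical stand-in, not the library's implementation.
  # Mirrors the return shape shown above: {:ok, encoding_name, bom_length}.
  def detect(<<0xEF, 0xBB, 0xBF, _rest::binary>>), do: {:ok, "UTF-8", 3}
  def detect(<<0xFF, 0xFE, _rest::binary>>), do: {:ok, "UTF-16LE", 2}
  def detect(<<0xFE, 0xFF, _rest::binary>>), do: {:ok, "UTF-16BE", 2}
  def detect(data) when is_binary(data), do: {:error, :no_bom}

  # Drops the BOM bytes when one is present; otherwise returns the error as-is.
  def detect_and_strip(data) do
    with {:ok, encoding, bom_length} <- detect(data) do
      <<_bom::binary-size(bom_length), rest::binary>> = data
      {:ok, encoding, rest}
    end
  end
end

# 0x68 0x00 is "h" in UTF-16LE, preceded by the FF FE byte order mark.
BomSketch.detect(<<0xFF, 0xFE, 0x68, 0x00>>)
# {:ok, "UTF-16LE", 2}

BomSketch.detect_and_strip(<<0xEF, 0xBB, 0xBF, "hello">>)
# {:ok, "UTF-8", "hello"}
```

Data without a BOM needs a fallback encoding chosen by the caller, since a missing BOM says nothing about which encoding was used.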

Batch Processing

For processing many items efficiently, use batch operations to amortize NIF dispatch overhead:

# Decode multiple binaries in one call
items = [
  {<<72, 101, 108, 108, 111>>, "windows-1252"},
  {<<0x82, 0xA0>>, "shift_jis"}
]
results = EncodingRs.decode_batch(items)
# => [{:ok, "Hello"}, {:ok, "あ"}]

# Encode multiple strings in one call
items = [{"Hello", "windows-1252"}, {"あ", "shift_jis"}]
results = EncodingRs.encode_batch(items)
# => [{:ok, "Hello"}, {:ok, <<130, 160>>}]

See the Batch Processing Guide for more details.
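Because batch calls return one result per item, callers usually need to separate successes from failures. The sketch below uses only the standard library and a hypothetical BatchResults helper (not part of EncodingRs); it assumes results are {:ok, value} or {:error, reason} tuples in input order, as in the examples above.

```elixir
defmodule BatchResults do
  # Hypothetical helper, not part of EncodingRs.
  # Splits a list of {:ok, value} | {:error, reason} tuples, keeping each
  # item's position so failures can be traced back to the input list.
  def partition(results) do
    {oks, errors} =
      results
      |> Enum.with_index()
      |> Enum.split_with(fn {result, _index} -> match?({:ok, _}, result) end)

    %{
      ok: Enum.map(oks, fn {{:ok, value}, index} -> {index, value} end),
      error: Enum.map(errors, fn {{:error, reason}, index} -> {index, reason} end)
    }
  end
end

# Results shaped like the output of decode_batch/1, with one rejected item.
results = [{:ok, "Hello"}, {:error, :input_too_large}, {:ok, "あ"}]

BatchResults.partition(results)
# %{error: [{1, :input_too_large}], ok: [{0, "Hello"}, {2, "あ"}]}
```

Keeping the index makes it possible to retry or report only the items that failed without re-running the whole batch.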

Dirty Schedulers

The BEAM VM has a limited number of normal schedulers, and long-running NIFs can block them, causing latency for other processes. Operations on binaries larger than the configured threshold automatically use dirty CPU schedulers, keeping the normal schedulers available for other work.

The default threshold is 64KB. You can configure it in your config.exs:

# Using multiplication for readability
config :encoding_rs, dirty_threshold: 128 * 1024

# Or using Elixir's underscore notation
config :encoding_rs, dirty_threshold: 131_072

Increasing the threshold reduces context switching overhead, which benefits batch processing and throughput-focused workloads. However, larger operations will block normal schedulers longer, potentially causing latency for other processes.

Decreasing the threshold keeps normal schedulers more available, which benefits latency-sensitive and high-concurrency applications. However, more frequent context switching adds overhead that may reduce throughput.
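As a rough mental model of the threshold, the decision reduces to a size comparison. The sketch below is a hypothetical SchedulerChoice module, not the library's dispatch code; it assumes the comparison is strictly "larger than", as the description above states, and uses the documented 64KB default.

```elixir
defmodule SchedulerChoice do
  # Documented default threshold: 64KB.
  @default_threshold 64 * 1024

  # Hypothetical illustration of the decision described above; the actual
  # dispatch happens inside the library.
  def pick(binary, threshold \\ @default_threshold) when is_binary(binary) do
    if byte_size(binary) > threshold, do: :dirty_cpu, else: :normal
  end
end

small = :binary.copy(<<0>>, 64 * 1024)
large = :binary.copy(<<0>>, 64 * 1024 + 1)

SchedulerChoice.pick(small)
# :normal

SchedulerChoice.pick(large)
# :dirty_cpu

# With the threshold raised to 128KB, the same input stays on a normal scheduler.
SchedulerChoice.pick(large, 128 * 1024)
# :normal
```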

Maximum Input Size

To prevent excessive memory allocation from untrusted or unexpectedly large inputs, encoding and decoding operations enforce a configurable maximum input size. Inputs exceeding this limit return {:error, :input_too_large} before reaching the NIF. Batch operations reject oversized items individually — valid items in the same batch still succeed.

The default limit is 100MB. This is a runtime setting — it can be changed in runtime.exs or dynamically without recompiling:

# Increase the limit
config :encoding_rs, max_input_size: 500 * 1024 * 1024

# Disable entirely (trusted inputs only)
config :encoding_rs, max_input_size: :infinity

Why a size limit? A single large input can cause up to 3x memory amplification in the NIF (input buffer + output buffer + BEAM binary copy). A 500MB input could transiently allocate over 1.5GB, potentially destabilizing the BEAM node even on a dirty scheduler.

For large files, use the streaming decoder (EncodingRs.Decoder.stream/2) with bounded chunk sizes instead of loading the entire file into memory. The streaming API is not subject to the input size limit since each chunk is validated independently.
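The size check and the memory arithmetic above can be sketched in a few lines. InputGuard is a hypothetical module, not the library's code; it assumes the limit is either a byte count or :infinity, and that an input is rejected only when it exceeds the limit.

```elixir
defmodule InputGuard do
  # Documented default limit: 100MB.
  @default_limit 100 * 1024 * 1024

  # Hypothetical pre-check mirroring the behaviour described above.
  def check(binary, limit \\ @default_limit)

  def check(binary, :infinity) when is_binary(binary), do: :ok

  def check(binary, limit) when is_binary(binary) and is_integer(limit) do
    if byte_size(binary) > limit, do: {:error, :input_too_large}, else: :ok
  end

  # Worst-case transient memory under the 3x amplification figure quoted above.
  def peak_bytes(input_bytes), do: 3 * input_bytes
end

InputGuard.check("hello", 4)
# {:error, :input_too_large}

InputGuard.check("hello", :infinity)
# :ok

InputGuard.peak_bytes(500 * 1024 * 1024)
# 1572864000, which is about 1.57 GB
```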

Benchmarks

Comparison against codepagex (pure Elixir) and iconv (Erlang NIF wrapping libiconv):

Encoding	Input Size	encoding_rs	codepagex	iconv
ISO-8859-1	100 B	347 ns	487 ns (1.4x)	2.0 μs (5.6x)
ISO-8859-1	10 KB	9.2 μs	118 μs (13x)	130 μs (14x)
ISO-8859-1	1 MB	3.0 ms	12.6 ms (4x)	13.1 ms (4x)
Shift_JIS	10 KB	13 μs	N/A	196 μs (15x)
UTF-16LE	10 KB	8.1 μs	N/A	98 μs (12x)

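Benchmarks were run on an Apple Silicon M1. Each cell is a measured time; the multiplier in parentheses is how many times slower that library was than encoding_rs on the same input, and N/A marks a library with no result for that encoding. See the comparison guide for full methodology, more encodings, and when to use each library.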

Quick Start

# Decode a Shift_JIS file to UTF-8
{:ok, content} = File.read("japanese.txt")
{:ok, utf8_string} = EncodingRs.decode(content, "shift_jis")

# Encode a UTF-8 string to Windows-1252
{:ok, binary} = EncodingRs.encode("Hello world", "windows-1252")

Acknowledgments

excoding - The original project by Kevin Seidel
encoding_rs - Mozilla's Rust encoding library

License

MIT License - see LICENSE file for details.
