Porting Kuzushiji Character Detection and Classical-Japanese OCR to JS/Next.js Without Python — ONNX Quantization and Library Extraction

This article is co-authored with a generative AI. Facts have been cross-checked against official documentation where possible, but errors may remain. Please verify against primary sources before making any important decisions.

What this is about

I had been running my kuzushiji / historical-document image pipeline in Python (a Gradio Hugging Face Space). This is a record of porting it to Python-free JS/Node (onnxruntime-node). Concretely, four pieces:

YOLOv11x character detection exported to ONNX and served from Next.js (an API Route)
NDL koten-OCR-Lite line detection (RTMDet) ported to JS
int8 quantization, compared against fp32 for accuracy, size, and speed
Extracting the NDL OCR core into a framework-agnostic library

Along the way I also hit an easily-misunderstood deployment point: I found that running this on Vercel (serverless) is difficult — so I'll lay out the reasons too (it runs fine on a container or a plain Node server).

The models used are nakamura196/yolov11x-codh-char (MIT, character bounding-box detection) and NDL koten-OCR-Lite (CC BY 4.0, rtmdet-s line detection + parseq character recognition).

1. YOLOv11 → ONNX export

Ultralytics has ONNX export built in, so the conversion itself is one line.

from ultralytics import YOLO
m = YOLO("best.pt")
m.export(format="onnx", imgsz=1280, opset=12,
         dynamic=True,   # make input H/W dynamic (baseline=1280, SAHI tiles=1024)
         simplify=True,
         nms=False)       # NMS is implemented on the JS side, so don't bake it in

The spec I confirmed:

input images [batch, 3, H, W] float32 (RGB, /255, letterboxed)
output output0 [1, 5, N] = (cx, cy, w, h, score) × N anchors (single class; N=33600 at imgsz=1280)
No accuracy loss: .pt and .onnx produce the same number of boxes (251 on a sample image)

One caveat: exporting as fp32 yields about twice the size of the original .pt (109MB) → 217MB. That becomes the deployment bottleneck later.

2. Server-side inference in Next.js (onnxruntime-node)

I went with "UI in Next.js, inference in the server's Node process." Pre-processing (letterbox), post-processing (NMS), SAHI (tiling), and reading-order grouping are all reimplemented in TypeScript.

Pre-processing: letterbox

Building the ONNX input tensor. This works with canvas on both the server and the browser.

const scale = imgsz / Math.max(W, H);
const newW = Math.round(W * scale), newH = Math.round(H * scale);
const padLeft = Math.floor((imgsz - newW) / 2);
const padTop = Math.floor((imgsz - newH) / 2);

const ctx = createCanvas(imgsz, imgsz).getContext("2d");
ctx.fillStyle = "rgb(114,114,114)";  // ultralytics' gray padding
ctx.fillRect(0, 0, imgsz, imgsz);
ctx.drawImage(src, padLeft, padTop, newW, newH);

const { data } = ctx.getImageData(0, 0, imgsz, imgsz); // RGBA
const plane = imgsz * imgsz;
const tensor = new Float32Array(3 * plane);
for (let i = 0; i < plane; i++) {
  tensor[i]             = data[i * 4]     / 255; // R
  tensor[plane + i]     = data[i * 4 + 1] / 255; // G
  tensor[2 * plane + i] = data[i * 4 + 2] / 255; // B
}

Post-processing walks the [1,5,N] output, keeps score >= conf, converts xywh→xyxy, undoes the letterbox back to original coordinates, and applies class-agnostic NMS. It matched the Python behavior almost exactly (canvas resizing interpolates slightly differently from PIL, so it's off by a handful — 251 → 249).

SAHI (tiling)

For high resolution (downscale = long-edge/1280 > 2.0), naively shrinking the image crushes small characters. So I split the image into 1024px tiles with 20% overlap, infer near native resolution, map back to original coordinates, and run a global NMS — a SAHI-style approach. Since tiles are at most 1024, inferring at 1024 avoids upscaling them to 1280 needlessly (84s → 58s on a 6144×4096 image).

Gotcha: `sharp` won't install on Node 25

I first intended to use the de-facto standard sharp for image processing, but npm install on Node 25.6 failed to fetch the prebuilt, fell back to a source build, failed, and rolled back the whole install. This is a known npm bug where it doesn't install sharp's optional platform packages (@img/*) — not that sharp itself is incompatible with Node 25 (on the library side described below, sharp runs fine on the same Node 25).

On this Next.js side I worked around it by switching image processing to @napi-rs/canvas. Its N-API prebuilts are robust, and because canvas uses the same API as the browser, the pre-processing code carries over almost unchanged.

Measured (CPU): baseline 2.3s, SAHI 58s.

3. Porting NDL koten-OCR-Lite line detection (RTMDet)

Separately from character boxes, I also ported NDL's RTMDet line/layout detection to JS. With it you can display line boxes and remove false detections that fall outside the line regions (rulers, color charts, etc.).

Key points from reading the original Python:

The filename says rtmdet-s-1280x1280.onnx, but the actual input is 1024×1024 (the code reads it from the model, so it worked anyway)
Outputs are dets [1,N,5]=(x1,y1,x2,y2,score) and labels [1,N], with NMS baked into the model (post-processing is just a score threshold)
Unlike YOLO, pre-processing pads to a square with black (top-left aligned) → resizes to 1024 → RGB→BGR → (px - mean) / std (mean/std are in BGR order) → CHW

const MEAN = [103.53, 116.28, 123.675]; // BGR
const STD  = [57.375, 57.12, 58.395];
// CHW, channel order is BGR
tensor[i]             = (b - MEAN[0]) / STD[0];
tensor[plane + i]     = (g - MEAN[1]) / STD[1];
tensor[2 * plane + i] = (r - MEAN[2]) / STD[2];

In post-processing, boxes are scaled back to original coordinates by ÷1024 × (square padding size), and each box is extended vertically by 2%. Against the Python reference, I got 16–17 lines with coordinates within about 2px — evidence that the pre-processing (BGR, mean/std, padding, scale-back) was ported correctly.

4. int8 quantization and accuracy comparison

217MB is large, so I tried int8 quantization and compared it with fp32. The bottom line: you should use quantize_dynamic + QUInt8; nothing else worked.

Method	weight_type	Result
`quantize_dynamic`	QUInt8	✅ Adopted
`quantize_dynamic`	QInt8	❌ Won't load (`ConvInteger` (opset 10) is `NOT_IMPLEMENTED` on ORT 1.23 CPU)
`quantize_static` (per-channel)	QInt8	❌ Catastrophic: 0 detections / QDQ produces an invalid graph

Comparison with QUInt8 dynamic (fp32 as ground truth, IoU≥0.5, CPU, median of 5 runs per image):

Image	fp32 boxes	int8 boxes	F1	recall	mIoU	t_fp32	t_int8
Taketari Monogatari (printed)	251	250	0.990	0.988	0.959	2.35s	0.99s
Itō Hirobumi letter (correspondence)	546	536	0.969	0.960	0.964	2.95s	1.15s
Tōji Hyakugō documents (historical document)	319	252	0.862	0.771	0.963	2.47s	1.79s

Overall: size 228→58MB (3.9× smaller), CPU inference 2.15× faster, F1 0.945, conf error 0.048.

The notable point is that only the historical document drops in recall, to 0.771 (319→252; it misses low-confidence characters). The IoU of matched boxes is 0.96, so it's "missed detections," not "positional drift." For printed books and modern correspondence, int8 is essentially fine, but the practical conclusion is: for kuzushiji historical documents where recall matters, keep fp32.

5. Deployment: running on Vercel (serverless) is hard

When I tried to deploy to Vercel, I ran into its serverless constraints.

Vercel's Serverless Function has a 250MB (unzipped) size limit, and:

best.onnx (fp32) alone is 217MB → the model nearly hits the limit by itself
onnxruntime-node ships binaries for all OSes in the package — 254MB (the linux native alone is ~36MB, but file tracing struggles to drop the rest and tends to overflow)
And execution time: SAHI takes 58s, which collides with Vercel's maxDuration (Hobby max 60s / Pro 300s)
Being stateless, every cold start loads a tens-to-200MB model

…so it isn't practical as a Vercel serverless function. Even with int8 (58MB), the size is right at the edge and fragile, and the SAHI execution-time issue remains.

On the other hand, running next build && next start on a container or a plain Node server (Fly.io / Render / Cloud Run / VPS, etc.) handles the 217MB model and onnxruntime-node with no trouble. The difficulty is specifically running it in Vercel's serverless environment; choosing a different host avoids it.

What about WASM (onnxruntime-web)?

I also considered running fully in the browser. The code is light — the model is already ONNX, the pre/post-processing TS carries over directly, canvas is native in the browser, and onnxruntime-web's API is nearly identical to onnxruntime-node.

The hard parts are speed and size. On WASM CPU, YOLO11x is heavy: a single baseline run takes several to a dozen-plus seconds, and SAHI (dozens of tiles) takes minutes — effectively unusable. The WebGPU backend brings baseline down to 1–3s and into practical range, but it depends on browser support. Multithreading also requires COOP/COEP (cross-origin isolation). And the model is a 58MB (int8) first-load download. A "baseline-only, browser-contained demo" is realistic — that was the landing point.

6. Extracting the core into a library

It turned out a full JS implementation of NDL koten OCR (RTMDet line detection + PARSeq recognition + XY-cut reading order + ALTO/NDL XML output) already existed — in a separate Tropy plugin I had built. Since I had ended up writing RTMDet twice (once on the Next.js side above), I decided to extract the core into a shared library.

Fortunately the core was already separated into a framework-agnostic shape — engine.recognize(sharpImage) → { blocks, lines, text } — so the extraction was smooth.

New package @nakamura196/ndl-koten-ocr (its own repository)
src/: engine / rtmdet / parseq / ndl-parser / xy-cut / alto / buffer-to-input
config YAMLs bundled; models (.onnx) gitignored and symlinked
The Tropy plugin depends on it via a local file:../ndl-koten-ocr link
Tests: 81 in the library / 51 in Tropy, all green

Reprise: the reliable way to install `sharp`

The library uses sharp to match Tropy. When the npm optional-dependency bug strikes, this order installed it reliably:

npm install --ignore-scripts
npm install --no-save --ignore-scripts @img/sharp-darwin-arm64@0.34.5 @img/sharp-libvips-darwin-arm64@1.2.4

--ignore-scripts prevents the rollback caused by sharp's failing build, and then you add the prebuilt (@img/*) explicitly (match the versions to your sharp).

Summary

YOLOv11 → ONNX is one line, with no accuracy loss
Porting to JS is mostly just rewriting pre/post-processing (letterbox, NMS, SAHI, reading order, RTMDet), and it matched the Python reference at the coordinate level
int8 quantization is QUInt8 dynamic only. 3.9× smaller and 2× faster, but recall drops on historical documents, so keep fp32 in production depending on the use case
Deployment: it's specifically hard to run on Vercel (serverless) (size and execution-time limits). A container or Node server runs it fine
WASM: the code is light but speed is the bottleneck. Baseline is in range with WebGPU; SAHI is impractical
To avoid duplicate implementations, the NDL OCR core was extracted into a shared library

This was a full pass at "how far can a heavy Python OCR/detection pipeline run in JS alone, and where do you host it?" I hope it helps anyone considering a JS port in the kuzushiji / classical-Japanese space.

📜Porting Kuzushiji Character Detection and Classical-Japanese OCR to JS/Next.js Without Python — ONNX Quantization and Library Extraction

What this is about

1. YOLOv11 → ONNX export

2. Server-side inference in Next.js (onnxruntime-node)

Pre-processing: letterbox

SAHI (tiling)

Gotcha: `sharp` won't install on Node 25

3. Porting NDL koten-OCR-Lite line detection (RTMDet)

4. int8 quantization and accuracy comparison

5. Deployment: running on Vercel (serverless) is hard

What about WASM (onnxruntime-web)?

6. Extracting the core into a library

Reprise: the reliable way to install `sharp`

Summary

📜KotenOCR v1.3.0: Dual OCR Modes for Classical and Modern Japanese Text

📜KotenOCR: An Offline iOS App for Recognizing Classical Japanese Cursive Script

💬The Pitfall of JavaScript Operator Precedence - Investigating a Vercel Build Error

Comments

What this is about

1. YOLOv11 → ONNX export

2. Server-side inference in Next.js (onnxruntime-node)

Pre-processing: letterbox

SAHI (tiling)

Gotcha: sharp won't install on Node 25

3. Porting NDL koten-OCR-Lite line detection (RTMDet)

4. int8 quantization and accuracy comparison

5. Deployment: running on Vercel (serverless) is hard

What about WASM (onnxruntime-web)?

6. Extracting the core into a library

Reprise: the reliable way to install sharp

Summary

Related Articles

📜KotenOCR v1.3.0: Dual OCR Modes for Classical and Modern Japanese Text

📜KotenOCR: An Offline iOS App for Recognizing Classical Japanese Cursive Script

💬The Pitfall of JavaScript Operator Precedence - Investigating a Vercel Build Error

Comments

Gotcha: `sharp` won't install on Node 25

Reprise: the reliable way to install `sharp`