This article is co-authored with a generative AI. Facts have been cross-checked against official documentation where possible, but errors may remain. Please verify against primary sources before making any important decisions.
What this is about
I had been running my kuzushiji / historical-document image pipeline in Python (a Gradio Hugging Face Space). This is a record of porting it to Python-free JS/Node (onnxruntime-node). Concretely, four pieces:
- YOLOv11x character detection exported to ONNX and served from Next.js (an API Route)
- NDL koten-OCR-Lite line detection (RTMDet) ported to JS
- int8 quantization, compared against fp32 for accuracy, size, and speed
- Extracting the NDL OCR core into a framework-agnostic library
Along the way I also hit a deployment point that is easy to misunderstand: running this on Vercel (serverless) turned out to be difficult, so I'll lay out the reasons too. It runs fine on a container or a plain Node server.
The models used are nakamura196/yolov11x-codh-char (MIT, character bounding-box detection) and NDL koten-OCR-Lite (CC BY 4.0, rtmdet-s line detection + parseq character recognition).
1. YOLOv11 → ONNX export
Ultralytics has ONNX export built in, so the conversion itself is one line.
from ultralytics import YOLO
m = YOLO("best.pt")
m.export(format="onnx", imgsz=1280, opset=12,
         dynamic=True,   # make input H/W dynamic (baseline=1280, SAHI tiles=1024)
         simplify=True,
         nms=False)      # NMS is implemented on the JS side, so don't bake it in
The spec I confirmed:
- input: images [batch, 3, H, W] float32 (RGB, /255, letterboxed)
- output: output0 [1, 5, N] = (cx, cy, w, h, score) × N anchors (single class; N=33600 at imgsz=1280)
- No accuracy loss: .pt and .onnx produce the same number of boxes (251 on a sample image)
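As a rough check on the output shape, here is a minimal sketch of where N comes from and how one anchor is read out of the flat buffer. It assumes the usual three detection strides (8, 16, 32) and a channel-major [1, 5, N] layout; the helper names are hypothetical and not from the original code.

```typescript
// Hypothetical helpers for illustration only.
const STRIDES = [8, 16, 32]; // assumed detection-head strides

/** Number of anchors N for a square input of side `imgsz`. */
function anchorCount(imgsz: number): number {
  return STRIDES.reduce((n, s) => n + Math.ceil(imgsz / s) ** 2, 0);
}

/** Read anchor `i` from a flat [1, 5, N] buffer (all cx first, then cy, w, h, score). */
function readAnchor(out: Float32Array, n: number, i: number) {
  return {
    cx: out[i],
    cy: out[n + i],
    w: out[2 * n + i],
    h: out[3 * n + i],
    score: out[4 * n + i],
  };
}
```

Under these assumptions, 1280 gives 160² + 80² + 40² = 33600 anchors, which matches the N reported above.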
One caveat: the fp32 export is about twice the size of the original .pt, 109MB → 217MB. That becomes the deployment bottleneck later.
2. Server-side inference in Next.js (onnxruntime-node)
I went with "UI in Next.js, inference in the server's Node process." Pre-processing (letterbox), post-processing (NMS), SAHI (tiling), and reading-order grouping are all reimplemented in TypeScript.
Pre-processing: letterbox
Building the ONNX input tensor. This works with canvas on both the server and the browser.
import { createCanvas } from "@napi-rs/canvas"; // server side; in the browser use a <canvas> element
// src: decoded image, W/H: its original size, imgsz: model input size (e.g. 1280)
const scale = imgsz / Math.max(W, H);
const newW = Math.round(W * scale), newH = Math.round(H * scale);
const padLeft = Math.floor((imgsz - newW) / 2);
const padTop = Math.floor((imgsz - newH) / 2);
const ctx = createCanvas(imgsz, imgsz).getContext("2d");
ctx.fillStyle = "rgb(114,114,114)"; // ultralytics' gray padding
ctx.fillRect(0, 0, imgsz, imgsz);
ctx.drawImage(src, padLeft, padTop, newW, newH);
const { data } = ctx.getImageData(0, 0, imgsz, imgsz); // RGBA, interleaved
const plane = imgsz * imgsz;
const tensor = new Float32Array(3 * plane); // CHW: all R, then all G, then all B
for (let i = 0; i < plane; i++) {
  tensor[i] = data[i * 4] / 255; // R
  tensor[plane + i] = data[i * 4 + 1] / 255; // G
  tensor[2 * plane + i] = data[i * 4 + 2] / 255; // B
}
Post-processing walks the [1,5,N] output, keeps score >= conf, converts xywh→xyxy, undoes the letterbox back to original coordinates, and applies class-agnostic NMS. It matched the Python behavior almost exactly: canvas resizing interpolates slightly differently from PIL, so the box count is off by two on the sample image (251 → 249).
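The post-processing steps above can be sketched roughly as follows. This is not the article's actual implementation: the type names, the greedy NMS variant, and the threshold parameters are assumptions for illustration, and the letterbox values are the scale, padLeft, and padTop computed during pre-processing.

```typescript
// Illustrative sketch: names and the greedy NMS variant are assumptions.
type Box = { x1: number; y1: number; x2: number; y2: number; score: number };
type Letterbox = { scale: number; padLeft: number; padTop: number };

/** Intersection over union of two axis-aligned boxes. */
function iou(a: Box, b: Box): number {
  const iw = Math.max(0, Math.min(a.x2, b.x2) - Math.max(a.x1, b.x1));
  const ih = Math.max(0, Math.min(a.y2, b.y2) - Math.max(a.y1, b.y1));
  const inter = iw * ih;
  const union = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
  return union > 0 ? inter / union : 0;
}

/** Walk a flat [1, 5, N] output: threshold, xywh to xyxy, undo the letterbox. */
function decode(out: Float32Array, n: number, lb: Letterbox, conf: number): Box[] {
  const boxes: Box[] = [];
  for (let i = 0; i < n; i++) {
    const score = out[4 * n + i];
    if (score < conf) continue;
    const cx = out[i], cy = out[n + i], w = out[2 * n + i], h = out[3 * n + i];
    boxes.push({
      // letterboxed = original * scale + pad, so invert that mapping here
      x1: (cx - w / 2 - lb.padLeft) / lb.scale,
      y1: (cy - h / 2 - lb.padTop) / lb.scale,
      x2: (cx + w / 2 - lb.padLeft) / lb.scale,
      y2: (cy + h / 2 - lb.padTop) / lb.scale,
      score,
    });
  }
  return boxes;
}

/** Greedy class-agnostic NMS: keep the best-scoring box, drop later overlaps. */
function nms(boxes: Box[], iouThreshold: number): Box[] {
  const sorted = [...boxes].sort((a, b) => b.score - a.score);
  const kept: Box[] = [];
  for (const cand of sorted) {
    if (kept.every((k) => iou(k, cand) <= iouThreshold)) kept.push(cand);
  }
  return kept;
}
```

The greedy loop is quadratic in the number of surviving boxes, which is usually acceptable once the score threshold has cut tens of thousands of anchors down to a few hundred.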
SAHI (tiling)
For high resolution (downscale = long-edge/1280 > 2.0), naively shrinking the image crushes small characters. So I split the image into 1024px tiles with 20% overlap, infer near native resolution, map back to original coordinates, and run a global NMS — a SAHI-style approach. Since tiles are at most 1024, inferring at 1024 avoids upscaling them to 1280 needlessly (84s → 58s on a 6144×4096 image).
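To make the tiling concrete, here is a minimal sketch of the tile-grid computation. The function names, the rounding of the stride, and the choice to clamp the last tile flush against the image edge are all assumptions; the article only specifies 1024px tiles, 20% overlap, and the downscale > 2.0 trigger.

```typescript
// Illustrative sketch; edge handling and rounding are assumptions.
type Tile = { x: number; y: number; w: number; h: number };

/** SAHI is used when shrinking to 1280 would downscale by more than 2x. */
function needsSahi(W: number, H: number, imgsz = 1280, maxDownscale = 2.0): boolean {
  return Math.max(W, H) / imgsz > maxDownscale;
}

/** Tile start offsets along one axis, with the last tile flush against the far edge. */
function tileStarts(length: number, tile: number, overlap: number): number[] {
  if (length <= tile) return [0];
  const stride = Math.max(1, Math.round(tile * (1 - overlap)));
  const starts: number[] = [];
  for (let s = 0; s + tile < length; s += stride) starts.push(s);
  starts.push(length - tile);
  return starts;
}

/** Full tile grid; detections from a tile are shifted by (x, y) before the global NMS. */
function makeTiles(W: number, H: number, tile = 1024, overlap = 0.2): Tile[] {
  const tiles: Tile[] = [];
  for (const y of tileStarts(H, tile, overlap)) {
    for (const x of tileStarts(W, tile, overlap)) {
      tiles.push({ x, y, w: Math.min(tile, W), h: Math.min(tile, H) });
    }
  }
  return tiles;
}
```

With these particular choices a 6144×4096 image yields an 8 × 5 grid of tiles; a different stride or edge policy would change that count.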
Gotcha: sharp won't install on Node 25
I first intended to use the de-facto standard sharp for image processing, but on Node 25.6 npm install failed to fetch the prebuilt binary, fell back to a source build, failed again, and rolled back the whole install. This is a known npm bug in which sharp's optional platform packages (@img/*) are not installed; sharp itself is not incompatible with Node 25 (in the library described below, sharp runs fine on the same Node 25).
On this Next.js side I worked around it by switching image processing to @napi-rs/canvas. Its N-API prebuilts are robust, and because canvas uses the same API as the browser, the pre-processing code carries over almost unchanged.
Measured (CPU): baseline 2.3s, SAHI 58s.
3. Porting NDL koten-OCR-Lite line detection (RTMDet)
Separately from character boxes, I also ported NDL's RTMDet line/layout detection to JS. With it you can display line boxes and remove false detections that fall outside the line regions (rulers, color charts, etc.).
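One plausible way to use the line boxes as a filter is sketched below. The criterion (keep a character box only if its center falls inside some line box) is an assumption; the article does not say which test it uses, and the names are hypothetical.

```typescript
// Illustrative sketch; the center-in-box criterion is an assumption.
type Rect = { x1: number; y1: number; x2: number; y2: number };

/** Keep character boxes whose center lies inside at least one line box. */
function keepInsideLines<T extends Rect>(chars: T[], lines: Rect[]): T[] {
  return chars.filter((c) => {
    const cx = (c.x1 + c.x2) / 2;
    const cy = (c.y1 + c.y2) / 2;
    return lines.some((l) => cx >= l.x1 && cx <= l.x2 && cy >= l.y1 && cy <= l.y2);
  });
}
```

A detection on a ruler or color chart at the page margin would then be dropped as long as no line box covers it.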
Key points from reading the original Python:
- The filename says rtmdet-s-1280x1280.onnx, but the actual input is 1024×1024 (the code reads it from the model, so it worked anyway)
- Outputs are dets [1,N,5] = (x1,y1,x2,y2,score) and labels [1,N], with NMS baked into the model (post-processing is just a score threshold)
- Unlike YOLO, pre-processing pads to a square with black (top-left aligned) → resizes to 1024 → RGB→BGR → (px - mean) / std (mean/std are in BGR order) → CHW
const MEAN = [103.53, 116.28, 123.675]; // BGR
const STD = [57.375, 57.12, 58.395];
// CHW, channel order is BGR
tensor[i] = (b - MEAN[0]) / STD[0];
tensor[plane + i] = (g - MEAN[1]) / STD[1];
tensor[2 * plane + i] = (r - MEAN[2]) / STD[2];
In post-processing, boxes are scaled back to original coordinates by ÷1024 × (square padding size), and each box is extended vertically by 2%. Against the Python reference, I got 16–17 lines with coordinates within about 2px — evidence that the pre-processing (BGR, mean/std, padding, scale-back) was ported correctly.
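The scale-back step can be sketched as below. Two things here are assumptions rather than facts from the original code: the 2% extension is interpreted as 2% of the box height added to both the top and the bottom, and the score threshold is left as a parameter because the article does not give its value. The function and type names are hypothetical.

```typescript
// Illustrative sketch; the 2% interpretation and all names are assumptions.
type LineBox = { x1: number; y1: number; x2: number; y2: number; score: number };

const RTMDET_INPUT = 1024; // actual model input side, per the notes above

/** Map dets [1, N, 5] (row-major: x1, y1, x2, y2, score) back to original pixels. */
function rtmdetToOriginal(
  dets: Float32Array, n: number, W: number, H: number,
  scoreThreshold: number, extend = 0.02,
): LineBox[] {
  const side = Math.max(W, H);    // side of the black-padded square (top-left aligned)
  const k = side / RTMDET_INPUT;  // ÷1024 × square padding size
  const lines: LineBox[] = [];
  for (let i = 0; i < n; i++) {
    const o = i * 5;
    const score = dets[o + 4];
    if (score < scoreThreshold) continue; // NMS is already done inside the model
    const x1 = dets[o] * k, y1 = dets[o + 1] * k;
    const x2 = dets[o + 2] * k, y2 = dets[o + 3] * k;
    const dy = (y2 - y1) * extend; // assumed: 2% of height on each side
    lines.push({
      x1: Math.max(0, x1),
      y1: Math.max(0, y1 - dy),
      x2: Math.min(W, x2),
      y2: Math.min(H, y2 + dy),
      score,
    });
  }
  return lines;
}
```

Because the padding is top-left aligned, no offset needs to be subtracted; only the scale factor and a clamp to the image bounds are required.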
4. int8 quantization and accuracy comparison
217MB is large, so I tried int8 quantization and compared it with fp32. The bottom line: you should use quantize_dynamic + QUInt8; nothing else worked.
| Method | weight_type | Result |
|---|---|---|
| quantize_dynamic | QUInt8 | ✅ Adopted |
| quantize_dynamic | QInt8 | ❌ Won't load (ConvInteger (opset 10) is NOT_IMPLEMENTED on ORT 1.23 CPU) |
| quantize_static (per-channel) | QInt8 | ❌ Catastrophic: 0 detections / QDQ produces an invalid graph |
Comparison with QUInt8 dynamic (fp32 as ground truth, IoU≥0.5, CPU, median of 5 runs per image):
| Image | fp32 boxes | int8 boxes | F1 | recall | mIoU | t_fp32 | t_int8 |
|---|---|---|---|---|---|---|---|
| Taketori Monogatari (printed) | 251 | 250 | 0.990 | 0.988 | 0.959 | 2.35s | 0.99s |
| Itō Hirobumi letter (correspondence) | 546 | 536 | 0.969 | 0.960 | 0.964 | 2.95s | 1.15s |
| Tōji Hyakugō documents (historical document) | 319 | 252 | 0.862 | 0.771 | 0.963 | 2.47s | 1.79s |
Overall: size 228→58MB (3.9× smaller), CPU inference 2.15× faster, F1 0.945, conf error 0.048.
The notable point is that only the historical document drops in recall, to 0.771 (319→252; it misses low-confidence characters). The IoU of matched boxes is 0.96, so it's "missed detections," not "positional drift." For printed books and modern correspondence, int8 is essentially fine, but the practical conclusion is: for kuzushiji historical documents where recall matters, keep fp32.
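For reference, the metrics in the table can be computed along these lines. This is a sketch under assumptions: it uses greedy one-to-one matching at IoU≥0.5 with fp32 boxes as ground truth, which may differ from the matching the measurements actually used, and the names are hypothetical.

```typescript
// Illustrative sketch; the greedy matching strategy is an assumption.
type Rect = { x1: number; y1: number; x2: number; y2: number };

function rectIou(a: Rect, b: Rect): number {
  const iw = Math.max(0, Math.min(a.x2, b.x2) - Math.max(a.x1, b.x1));
  const ih = Math.max(0, Math.min(a.y2, b.y2) - Math.max(a.y1, b.y1));
  const inter = iw * ih;
  const union = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
  return union > 0 ? inter / union : 0;
}

/** F1 from raw counts: 2·TP / (nRef + nPred). */
function f1FromCounts(tp: number, nRef: number, nPred: number): number {
  return nRef + nPred > 0 ? (2 * tp) / (nRef + nPred) : 0;
}

/** Match each predicted box to its best unused reference box at IoU >= iouThr. */
function evaluate(ref: Rect[], pred: Rect[], iouThr = 0.5) {
  const used = new Array<boolean>(ref.length).fill(false);
  let tp = 0;
  let iouSum = 0;
  for (const p of pred) {
    let best = -1;
    let bestIou = 0;
    for (let j = 0; j < ref.length; j++) {
      if (used[j]) continue;
      const v = rectIou(ref[j], p);
      if (v > bestIou) { bestIou = v; best = j; }
    }
    if (best >= 0 && bestIou >= iouThr) {
      used[best] = true;
      tp++;
      iouSum += bestIou;
    }
  }
  return {
    tp,
    precision: pred.length ? tp / pred.length : 0,
    recall: ref.length ? tp / ref.length : 0,
    f1: f1FromCounts(tp, ref.length, pred.length),
    mIoU: tp ? iouSum / tp : 0, // mean IoU over matched pairs only
  };
}
```

Under this definition, 246 matched boxes against 319 fp32 and 252 int8 boxes would give recall 0.771 and F1 0.862, which lines up with the Tōji Hyakugō row; the match count itself is inferred from the table, not reported in it.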
5. Deployment: running on Vercel (serverless) is hard
When I tried to deploy to Vercel, I ran into its serverless constraints.
Vercel's Serverless Function has a 250MB (unzipped) size limit, and:
- best.onnx (fp32) alone is 217MB → the model nearly hits the limit by itself
- onnxruntime-node ships binaries for all OSes in the package: 254MB (the linux native alone is ~36MB, but file tracing struggles to drop the rest and tends to overflow)
- And execution time: SAHI takes 58s, which collides with Vercel's maxDuration (Hobby max 60s / Pro 300s)
- Being stateless, every cold start loads a tens-to-200MB model
…so it isn't practical as a Vercel serverless function. Even with int8 (58MB), the size is right at the edge and fragile, and the SAHI execution-time issue remains.
On the other hand, running next build && next start on a container or a plain Node server (Fly.io / Render / Cloud Run / VPS, etc.) handles the 217MB model and onnxruntime-node with no trouble. The difficulty is specifically running it in Vercel's serverless environment; choosing a different host avoids it.
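On a long-lived Node process the model only needs to be loaded once. A minimal sketch of a module-scope session cache follows; the helper name is hypothetical, and the loader is injected so the snippet stays independent of onnxruntime-node. In the real server the loader would presumably be something like `(p) => ort.InferenceSession.create(p)`.

```typescript
// Illustrative sketch; createSessionCache is a hypothetical helper.
function createSessionCache<S>(load: (path: string) => Promise<S>) {
  const cache = new Map<string, Promise<S>>();
  return (path: string): Promise<S> => {
    let pending = cache.get(path);
    if (!pending) {
      // Cache the promise itself so concurrent requests share one load.
      pending = load(path).catch((err) => {
        cache.delete(path); // allow a retry after a failed load
        throw err;
      });
      cache.set(path, pending);
    }
    return pending;
  };
}
```

In a serverless function this cache is lost whenever the instance is recycled, which is the cold-start cost described above; on a container or VPS it survives for the lifetime of the process.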
What about WASM (onnxruntime-web)?
I also considered running fully in the browser. The code is light — the model is already ONNX, the pre/post-processing TS carries over directly, canvas is native in the browser, and onnxruntime-web's API is nearly identical to onnxruntime-node.
The hard parts are speed and size. On WASM CPU, YOLO11x is heavy: a single baseline run takes several to a dozen-plus seconds, and SAHI (dozens of tiles) takes minutes — effectively unusable. The WebGPU backend brings baseline down to 1–3s and into practical range, but it depends on browser support. Multithreading also requires COOP/COEP (cross-origin isolation). And the model is a 58MB (int8) first-load download. A "baseline-only, browser-contained demo" is realistic — that was the landing point.
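The backend constraints above can be turned into a small selection function. This is a sketch with assumptions: the thread cap of 4 is arbitrary, the type and function names are hypothetical, and the result is meant to feed onnxruntime-web's session options and its WASM thread setting rather than being part of that library.

```typescript
// Illustrative sketch; pickBackend and the thread cap are assumptions.
type BrowserEnv = {
  hasWebGPU: boolean;           // e.g. "gpu" in navigator
  crossOriginIsolated: boolean; // e.g. globalThis.crossOriginIsolated === true
  hardwareConcurrency: number;  // e.g. navigator.hardwareConcurrency
};
type BackendChoice = { executionProviders: Array<"webgpu" | "wasm">; numThreads: number };

/** Prefer WebGPU when present; use WASM threads only under COOP/COEP isolation. */
function pickBackend(env: BrowserEnv, maxThreads = 4): BackendChoice {
  const executionProviders: Array<"webgpu" | "wasm"> = env.hasWebGPU
    ? ["webgpu", "wasm"] // wasm stays as a fallback
    : ["wasm"];
  const numThreads = env.crossOriginIsolated
    ? Math.max(1, Math.min(env.hardwareConcurrency, maxThreads))
    : 1; // without cross-origin isolation, multithreaded WASM is unavailable
  return { executionProviders, numThreads };
}
```

Keeping the decision in a pure function makes it easy to show the user why a page fell back to single-threaded WASM before they wait on a slow inference.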
6. Extracting the core into a library
It turned out a full JS implementation of NDL koten OCR (RTMDet line detection + PARSeq recognition + XY-cut reading order + ALTO/NDL XML output) already existed — in a separate Tropy plugin I had built. Since I had ended up writing RTMDet twice (once on the Next.js side above), I decided to extract the core into a shared library.
Fortunately the core was already separated into a framework-agnostic shape — engine.recognize(sharpImage) → { blocks, lines, text } — so the extraction was smooth.
- New package @nakamura196/ndl-koten-ocr (its own repository)
- src/: engine / rtmdet / parseq / ndl-parser / xy-cut / alto / buffer-to-input
- config YAMLs bundled; models (.onnx) gitignored and symlinked
- The Tropy plugin depends on it via a local file:../ndl-koten-ocr link
- Tests: 81 in the library / 51 in Tropy, all green
Reprise: the reliable way to install sharp
The library uses sharp to match Tropy. When the npm optional-dependency bug strikes, this order installed it reliably:
npm install --ignore-scripts
npm install --no-save --ignore-scripts @img/sharp-darwin-arm64@0.34.5 @img/sharp-libvips-darwin-arm64@1.2.4
--ignore-scripts prevents the rollback caused by sharp's failing build, and then you add the prebuilt (@img/*) explicitly (match the versions to your sharp).
Summary
- YOLOv11 → ONNX is one line, with no accuracy loss
- Porting to JS is mostly just rewriting pre/post-processing (letterbox, NMS, SAHI, reading order, RTMDet), and it matched the Python reference at the coordinate level
- int8 quantization is QUInt8 dynamic only. 3.9× smaller and 2× faster, but recall drops on historical documents, so keep fp32 in production depending on the use case
- Deployment: it's specifically hard to run on Vercel (serverless) (size and execution-time limits). A container or Node server runs it fine
- WASM: the code is light but speed is the bottleneck. Baseline is in range with WebGPU; SAHI is impractical
- To avoid duplicate implementations, the NDL OCR core was extracted into a shared library
This was a full pass at "how far can a heavy Python OCR/detection pipeline run in JS alone, and where do you host it?" I hope it helps anyone considering a JS port in the kuzushiji / classical-Japanese space.

