#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# latex2unicode.py: Convert a simple inline TeX/LaTeX (aimed at ArXiv abstracts) into Unicode+HTML+CSS, using the OA API.
# Author: Gwern Branwen
# Date: 2023-06-28
# When: Time-stamp: "2025-05-05 17:34:00 gwern"
# License: CC-0
#
# Usage: $ xclip -o | OPENAI_API_KEY="sk-XXX" python latex2unicode.py
#
# Typesetting TeX/LaTeX for web browsers is typically a heavyweight operation; even if done server-side, display often requires a lot of CSS+fonts. And then the result looks highly unnatural and clearly 'alien', interrupting reading flow. This is worthwhile for complex equations, where browser typesetting is not up to snuff, but for many in-the-wild TeX uses, the use is often as simple as `$X$`, which would look better as `X` & take megabytes less to render. So it is desirable for simple TeX expressions to convert them to 'native' Unicode/HTML (augmented with a bit of custom CSS to handle things like superscripts-over-subscripts which pop up in integrals/summations/binomials/matrices etc).
# Unfortunately, TeX is an irregular macro language which is hard to parse and 'compile' to Unicode: it's easy to do many examples, but there's a long tail of weird variables, formatting commands etc, which means that I wind up defining lots of rewrites by hand, even though they are usually pretty 'obvious'. So, quite tedious and unrewarding.
# However, this is a perfect use-case for GPT models: it is hard to write comprehensive rules for, but is an extremely constrained problem in a domain it knows well which requires processing few tokens, where I can give it many few-shot examples, interrogate it for edge-cases to then write rules/examples for, and the harm of an error is relatively minimal (anyone seriously using an equation will need to read the original anyway, so won't be fooled by a wrong translation).
# So we write down a list of general rules, then a bunch of specific examples, then ask GPT-4 to translate from TeX to Unicode/HTML/CSS.
#
# eg.
# $ echo 'a + b = c^2' | python3 latex2unicode.py
# a + b = c<sup>2</sup>
#
# Bonus feature: LLMs are smart enough to generalize, so free-form natural language inputs may also work:
#
# $ echo 'x times 2 but raised to 1/3rds' | latex2unicode.py
# x × 2<sup>1⁄3</sup>
# $ echo 'asymptotically square root n' | latex2unicode.py
# 𝒪(√n)
#
# NOTE: this is intended only for taking clean TeX and compiling it to something usable in HTML/Markdown. For converting from an image or screenshot to TeX, see dedicated image-to-TeX OCR tools (or prompt a VLM like Claude-3 or GPT-4o-V with an image & request).
import sys
from openai import OpenAI
client = OpenAI()
if len(sys.argv) == 1:
    target = sys.stdin.read().strip()
else:
    target = sys.argv[1]
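
# A minimal sketch, not called anywhere in this script: the blackboard-bold note inside the prompt below says that ℂ, ℍ, ℕ, ℙ, ℚ, ℝ & ℤ sit in Letterlike Symbols, while the remaining double-struck letters & digits sit in Mathematical Alphanumeric Symbols.
# The names `BMP_DOUBLE_STRUCK` & `blackboard_bold` are hypothetical, introduced only to make that code-point layout concrete; the actual conversion is still left to the model.
BMP_DOUBLE_STRUCK = {
    "C": "\u2102",  # ℂ
    "H": "\u210D",  # ℍ
    "N": "\u2115",  # ℕ
    "P": "\u2119",  # ℙ
    "Q": "\u211A",  # ℚ
    "R": "\u211D",  # ℝ
    "Z": "\u2124",  # ℤ
}

def blackboard_bold(char):
    """Map one ASCII letter or digit to its double-struck form, or return None if there is none."""
    if len(char) != 1:
        return None
    if char in BMP_DOUBLE_STRUCK:  # the 7 capitals encoded in the BMP
        return BMP_DOUBLE_STRUCK[char]
    if "A" <= char <= "Z":  # remaining capitals: U+1D538 onwards
        return chr(0x1D538 + ord(char) - ord("A"))
    if "a" <= char <= "z":  # lowercase: U+1D552 to U+1D56B
        return chr(0x1D552 + ord(char) - ord("a"))
    if "0" <= char <= "9":  # digits: U+1D7D8 to U+1D7E1
        return chr(0x1D7D8 + ord(char) - ord("0"))
    return None
# eg. blackboard_bold("R") gives ℝ (U+211D), while blackboard_bold("A") gives 𝔸 (U+1D538).
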
prompt = """
Task: Convert LaTeX inline expressions from ArXiv-style TeX math to inline Unicode+HTML+CSS, for easier reading in web browsers.
Task example:
Input to convert: \\(H\\gg1\\)
Converted output: H ≫ 1
Details:
- Convert only if the result is unambiguous.
- Note that inputs may be very short, because each LaTeX fragment in an abstract is processed individually. Many inputs will be as short as a single letter (which are variables).
- Assume only default environment settings with no redefinitions or uses like `\\newcommand` or `\\begin`. Skip custom operators.
- Do not modify block-level equations, or complex structures such as diagrams or tables or arrays or matrices (eg. `\\begin{bmatrix}`), or illustrations such as those drawn by TikZ or `\\draw`, as those require special processing (eg. matrices must be converted into HTML tables). If the input is not an inline math expression, do not convert it & simply repeat it.
- If a TeX command has no reasonable Unicode equivalent, such as the `\\overrightarrow{AB}`/`\\vec{AB}` or `\\check{a}` or `\\underline`/`\\overline` commands in LaTeX, simply repeat it.
- If a TeX command merely adjusts positioning, size, or margin (such as `\\raisebox`/`\\big`/`\\Big`), always omit it from the conversion (as it is probably unnecessary & would need to be handled specially if it was).
- The TeX/LaTeX special glyphs (`\\TeX` & `\\LaTeX`) are handled elsewhere; do not convert them, but simply repeat them.
- Use Unicode entities, eg. MATHEMATICAL SCRIPT CAPITAL O `𝒪` in place of `\\mathcal{O}`, and likewise for the Fraktur ones (`\\mathfrak`) and blackboard-bold ones (`\\mathbb`). Convert to the closest Unicode entity that exists. Convert symbols, special symbols, mathematical operators, and Greek letters. Convert even if the Unicode is rare (such as `𝒪`). If there is no Unicode equivalent (such as because there is not a matching letter in that font family, or no appropriate combining character), then do not convert it.
- If there are multiple reasonable choices, such as `\\approx` which could be represented as `≈` or `~`, choose the simpler-looking one. Do not choose the complex one unless there is some good specific reason for that.
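- For superimposed subscript+superscript, use a predefined CSS class `subsup`, eg. `(\\Delta^0_n)` → `Δ<span class="subsup"><sup>0</sup><sub>n</sub></span>`; `\\Xi_{cc}^{++} = ccu` → `Ξ<span class="subsup"><sub>cc</sub><sup>++</sup></span> = ccu`; `\\,\\Lambda_c \\Lambda_c \\to \\Xi_{cc}^{++}\\,n\\,` → `Λ<sub>c</sub> Λ<sub>c</sub> → Ξ<span class="subsup"><sub>cc</sub><sup>++</sup></span>,n`. This is also useful for summations or integrals, such as `\\int_a^b f(x) dx` → `∫<span class="subsup"><sub>a</sub><sup>b</sup></span> f(x) dx`.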
- For small fractions, where both numbers are 3 integer digits or less, use FRACTION SLASH (⁄) to convert (eg. `1/2` or `\\frac{1}{2}` → `1⁄2`). Do not use the Unicode fractions like VULGAR FRACTION ONE HALF `½`.
- For symbolic or large fractions, where one argument is a letter or symbol or >3 integer digits, use U+29F8 BIG SOLIDUS (⧸) instead, like '_a_⧸_b_'.
- For complex fractions which use superscripts or subscripts, multiple arguments etc, do not convert them & simply repeat them. eg. do not convert `\\(\\frac{a^{b}}{c^{d}}\\)`, as it is too complex.
- Convert roots such as square or cube roots if that would be unambiguous. For example, `\\sqrt[3]{8}` → `∛8` is good, but not `\\sqrt[3]{ab}` because `∛ab` is ambiguous; do not convert complex roots like `\\sqrt[3]{ab}`.
- Color & styling: if necessary, you may use simple CSS inline with a `<span style="">` declaration, such as to color something blue using `<span style="color: blue">`.
- Outlines/boxes: you may use simple inline CSS to draw borders.
- Be careful about dash use: correctly use MINUS SIGN (−) vs EM DASH (—) vs EN DASH (–) vs HYPHEN-MINUS (-).
More rules/examples for edge-cases:
- ' O(1)'
𝒪(1)
- '\\(\\mathsf{TC}^0\\)'
TC<sup>0</sup>
- '\\(\\approx\\)'
~
- '\\(1-\\tilde \\Omega(n^{-1/3})\\)'
1 − Ω̃(n<sup>−1⁄3</sup>)
- '\\(\\mathbf{R}^3\\)'
𝐑<sup>3</sup>
- '\\(\\ell_p\\)'
𝓁<sub>p</sub>
- '\\textcircled{r}'
ⓡ
- '(\\nabla \\log p_t\\)'
∇ log p<sub>t</sub>
- '\\(\\partial_t u = \\Delta u + \\tilde B(u,u)\\)'
∂<sub>t</sub>u = Δu + B̃(u, u)
- '\\(1 - \\frac{1}{e}\\)'
1 − 1⧸e
- 'O(\\sqrt{T}'
𝒪(√T)
- '\\(^\\circ\\)'
°
- '\\(^\\bullet\\)'
•
- '6\\times 10^{-6}\\)'
6×10<sup>−6</sup>
- '5\\div10'
5 ÷ 10
- '\\Pr(\\text{text} | \\alpha)'
Pr(text | α)
- '\\(\\hbar\\)'
ℏ
- '\\frac{1}{2}'
1⁄2
- '\\nabla'
∇
- '\\(r \\to\\infty\\)'
r → ∞
- '\\hat{a}'
â
- '\\textit{zero-shot}'
<em>zero-shot</em>
- '\\(f(x) = x \\cdot \\text{sigmoid}(\\beta x)\\)'
f(x) = x × sigmoid(β x)
- '\\clubsuit'
♣
- '\\textcolor{red}{x}'
<span style="color: red">x</span>
- '\\textcolor{red}{X}'
<span style="color: red">X</span>
- '\\textbf{bolding}'
<strong>bolding</strong>
- '\\textit{emphasis}'
<em>emphasis</em>
- 'B'
B
- 'u'
u
- 'X + Y'
X + Y
- '\\,\\Lambda_b \\Lambda_b \\to \\Xi_{bb}\\,N\\,'
, Λ<sub>b</sub> Λ<sub>b</sub> → Ξ<sub>bb</sub> N,
- 'x \\in (-\\infty, \\infty)'
x ∈ (−∞, ∞)
- 'p\\bar{p} \\to \\mu^+\\mu^-'
pp̅ → μ<sup>+</sup>μ<sup>−</sup>
- '\\alpha\\omega\\epsilon\\S\\om\\in'
αωε§øm∈
- '^2H ^6Li ^{10}B ^{14}N'
<sup>2</sup>H <sup>6</sup>Li <sup>10</sup>B <sup>14</sup>N
- '\\mathcal{L} \\mathcal{H} \\mathbb{R} \\mathbb{C}'
ℒ ℋ ℝ ℂ
- '\\textrm{M}_\\odot'
M<sub>☉</sub>
- '200+'
200+
- 'M = M_a \\cup M_b \\subseteq \\mathbb{R}^d'
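M = M<sub>a</sub> ∪ M<sub>b</sub> ⊆ ℝ<sup>d</sup>
- 'f : \\mathbb{R}^d \\to \\mathbb{R}^p'
f : ℝ<sup>d</sup> → ℝ<sup>p</sup>
- 'M_a'
M<sub>a</sub>
- 'β_k\\bigl(f(M_i)\\bigr) = 0'
β<sub>k</sub>(f(M<sub>i</sub>)) = 0
- 'k \\ge 1'
k ≥ 1
- 'β_0\\bigl(f(M_i)\\bigr) = 1'
β<sub>0</sub>(f(M<sub>i</sub>)) = 1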
- 'i =a, b'
i = a, b
- '(n,d,\\lambda)'
(n, d, λ)
- '\\Lambda'
Λ
- '\\not\\approx'
≉
- '\\left\\langle A \\middle| B \\right\\rangle'
⟨A|B⟩
# note: : "In Unicode, a few of the more common blackboard bold characters (ℂ, ℍ, ℕ, ℙ, ℚ, ℝ, and ℤ) are encoded in the Basic Multilingual Plane (BMP) in the Letterlike Symbols (2100–214F) area, named DOUBLE-STRUCK CAPITAL C etc. The rest, however, are encoded outside the BMP, in Mathematical Alphanumeric Symbols (1D400–1D7FF), specifically from 1D538–1D550 (uppercase, excluding those encoded in the BMP), 1D552–1D56B (lowercase) and 1D7D8–1D7E1 (digits). Blackboard bold Arabic letters are encoded in Arabic Mathematical Alphabetic Symbols (1EE00–1EEFF), specifically 1EEA1–1EEBB."
- '\\mathcal{R}'
ℛ
- '\\mathbb{R}'
ℝ
- '\\mathbb{N}'
ℕ
- '\\cancel{x}'
x̸
- '\\left{\\frac{1}{2} \\right}'
\\left{\\frac{1}{2} \\right}
- '\\dot{x}'
ẋ
- '\\ddot{x}'
ẍ
- 'x^{y^{z}}'
x<sup>y<sup>z</sup></sup>
- '\\lim_{x \\to \\infty} f(x)'
lim<sub>x → ∞</sub> f(x)
- '\\boxed{A}'
A
- '\\'
- '\\:'
- '\\;'
- '\\quad'
- '\\qquad'
- '!'
- '\\!'
- En space
- Figure space
- Punctuation space
- 'O(m' \\log^2 m')'
𝒪(m′ log<sup>2</sup> m′)
- 'n''
n′
- '$%$'
%
- '%'
%q
- "\\(0.90, 0.91, 0.94\\)"
0.90, 0.91, 0.94
- '123/456'
123⁄456
- '123/4567'
123⧸4,567
- '1234/765'
1,234⧸765
- '5610/987980'
5,610⧸987,980
- '504827'
504,827
- '($(\\frac{202680742}{582771} \\cdot 0.1) \\cdot 100$)'
((202,680,742⧸582,771) × 0.1 × 100)
- '740/618'
740⁄618
- '$\\frac{1910}{209} = 9.14$'
1,910⧸209 = 9.14
- '(504827⁄1800) × 1.0 × 100'
(504,827⧸1,800) × 1.0 × 100
- '$n/({\\pi\\over 8}$ lg $n)\\sp{1/2}$'
_n_⧸(𝜋⧸8 log _n_)<sup>1⁄2</sup>
- 'O(\\log n \\operatorname{polyloglog} n)'
𝒪(⟨logn⟩ polyloglog n)
- 'r1,... rm'
r1, ..., rm
- '\\(LCSPACE[s,c,e] = CSPACE[\\Theta(s + e \\log c), \\Theta(c)]\\)'
LCSPACE[s, c, e] = CSPACE[Θ(s + e log c), Θ(c)]
- 'M_{PBH} > 1.4 \\times 10^{17} {\\rm g}'
M<sub>PBH</sub> > 1.4 × 10<sup>17</sup> g
- \\(<n\\)
<n
- '$DyT($x$) = \\tanh(α$x$)$'
DyT(x) = tanh(αx)
- '\\hat r'
r̂
- '$x = \\frac{o \\cdot e - (1 - e)}{o}$'
x = (o · e − (1 − e)) ⧸ o
- '$\\mathcal{V}$'
𝒱
- '\\(\\sim 10^6 \\mathrm{\\mu Lenat/word}\\)'
~10<sup>6</sup> μLenat⧸word
- '\\322\\
322
Task:
- '""" + target + "'\n"
completion = client.chat.completions.create(
model="gpt-4.1-mini", # we use a GPT-4-class model because the outputs are short, we want high accuracy, we provide a lot of examples & instructions which may overload dumber models, and reviewing for correctness can be difficult, so we are willing to spend a few pennies to avoid the risk of a weaker model
messages=[
{"role": "system", "content": "You are a skilled mathematician & tasteful typographer, expert in LaTeX."},
{"role": "user", "content": prompt }
]
)
output = completion.choices[0].message.content.rstrip()
print(output, end='') # avoid trailing newline because we might be cleaning inline text & want to avoid injecting newlines
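
# A minimal sketch, not called anywhere in this script: the prompt's rules for plain numeric fractions, written out deterministically so they can be checked against the prompt's own examples (eg. '123/456', '123/4567').
# Assumptions: the name `format_numeric_fraction` is hypothetical; the thousands separators are inferred from the few-shot examples rather than from a stated rule; symbolic or complex fractions are out of scope & return None.
import re

FRACTION_SLASH = "\u2044"  # ⁄ , for fractions where both numbers have 3 digits or fewer
BIG_SOLIDUS = "\u29F8"     # ⧸ , for fractions where either number has more than 3 digits

def format_numeric_fraction(text):
    """Format 'N/D' (plain integers only) following the prompt's fraction rules, else return None."""
    match = re.fullmatch(r"\s*([0-9]+)\s*/\s*([0-9]+)\s*", text)
    if match is None:
        return None
    numerator, denominator = match.groups()
    if len(numerator) <= 3 and len(denominator) <= 3:
        return numerator + FRACTION_SLASH + denominator
    # Large fraction: group digits with commas, as the few-shot examples do.
    return f"{int(numerator):,}{BIG_SOLIDUS}{int(denominator):,}"
# eg. format_numeric_fraction("123/456") keeps the small form 123⁄456, while format_numeric_fraction("123/4567") switches to 123⧸4,567.
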