Writing a TeX-to-UTF-8 converter using the TeX compiler

Question

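First, if you would rather have characters such as ĄąĆćĘę£łŃńÓóŚś-źŻż in your .tex file, you can simply type (or paste) them in. (With pdfLaTeX on older LaTeX releases you also need \usepackage[utf8]{inputenc}.) For example, the following works (when compiled with xelatex):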

\documentclass{article}
\begin{document}
ĄąĆćĘę£łŃńÓóŚś-źŻż
\end{document}

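If the problem is that you have no convenient (or memorable) keyboard layout for typing these characters, so you would rather type TeX macros (but still want the file to contain the characters above), then it is just a matter of setting up your editor or input system. For example (as suggested in a comment by user Loop Space), Emacs can do this with M-x set-input-method RET TeX: when you type the keys \=o, what is entered into the file is ō. You don't have to use Emacs; input methods such as UIM offer this feature too (example).

So if you are the one creating the .tex file, I see no reason to use TeX itself for this kind of conversion: it is better to find a way to insert the characters you prefer in the first place.

However, if you are working with a .tex file created by someone else (and you are allowed to change it), or one you created yourself before you had this preference, then the question may make sense.

The main thing that using TeX (rather than a simple search-and-replace in an editor) gives you is the ability to know when the definitions of macros such as \L and \O have changed. That is also the issue illustrated in the question.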

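So, to solve this, I use the introspection (a.k.a. reflection) features that come with LuaTeX: specifically token.get_macro, which lets us see a macro's definition, and the process_input_buffer callback, which lets us inspect every line of input (and change it if we wish). The idea is:

Before the body starts, record the "original" definitions of all known character-replacement macros (\L, \", \c, etc.). This lets us know when they have been redefined.
For every line of input, look for occurrences of those macros in the line, check that their definitions have not changed, and (if so) replace them and their arguments with the appropriate characters.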
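
As a rough illustration of these two steps, here is a minimal sketch in plain Lua that runs outside LuaTeX. It assumes a stand-in table in place of TeX's real macro storage: get_macro and set_macro are hypothetical helpers that mimic token.get_macro and \renewcommand, and rewrite_line is a name introduced only for this example.

```lua
-- Stand-in for TeX's macro table (an assumption for this sketch).
-- Inside LuaTeX you would call token.get_macro(name) instead.
local current_defs = { L = [[\TU-cmd \L \TU\L ]] }
function get_macro(name) return current_defs[name] end
function set_macro(name, def) current_defs[name] = def end  -- mimics \renewcommand

local replacements = { ["\\L"] = "Ł" }

-- Step 1: before the body starts, snapshot the "original" definitions.
local orig_defs = {}
for macro in pairs(replacements) do
   orig_defs[macro] = get_macro(macro:sub(2))  -- sub(2) drops the backslash
end

-- Step 2: rewrite a line only where the definition is still the original one.
function rewrite_line(line)
   for macro, char in pairs(replacements) do
      if get_macro(macro:sub(2)) == orig_defs[macro] then
         -- Escape punctuation for gsub; %f[^%a] requires that no letter
         -- follows, so \Large is not mistaken for \L.
         local pattern = macro:gsub("%p", "%%%0") .. "%f[^%a]"
         line = line:gsub(pattern, char)
      end
   end
   return line
end

print(rewrite_line([[\L\"{o} \zzz]]))  -- \L still has its original meaning
set_macro("L", "LLL")                  -- simulate \renewcommand\L{LLL}
print(rewrite_line([[\L\"{o} \zzz]]))  -- \L was redefined: line is left alone
```

The first call replaces \L and the second leaves the line untouched, which is the behaviour a plain search-and-replace cannot give you.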

So, following the example in the question, with a file called, say, mwe.tex:

\documentclass{article}
\directlua{dofile('rewrite.lua')}

\newcommand\zzz{hello}

\begin{document}

\L\"{o}\"{o}\c{k} \zzz

\renewcommand\L{LLL}
\renewcommand\"[1]{#1#1}
\renewcommand\c{c}

\L\"{o}\"{o}\c{k} \zzz

\end{document}

(note the added \directlua{dofile(...)} line), you can run lualatex mwe.tex (some lines snipped):

9:41:29:~/tmp% lualatex mwe.tex
This is LuaTeX, Version 1.0.4 (TeX Live 2017) 
...
The original definition of #\L# is \TU-cmd \L \TU\L 
The original definition of #\c# is \TU-cmd \c \TU\c 
The original definition of #\"# is \TU-cmd \"\TU\" 
...
Processing line: \begin{document}
 --> Rewrote line to \begin{document}
...
Processing line: \L\"{o}\"{o}\c{k} \zzz
 --> Rewrote line to Łööķ \zzz
Processing line: 
 --> Rewrote line to 
Processing line: \renewcommand\L{LLL}
 ^ This line contains a \def or \newcommand or \renewcommand. Not rewriting.
...
Processing line: \L\"{o}\"{o}\c{k} \zzz
 --> Rewrote line to \L\"{o}\"{o}\c{k} \zzz

and you will find a file mwe.rewritten.tex containing:

\newcommand\zzz{hello}

\begin{document}
\relax

Łööķ \zzz

\renewcommand\L{LLL}
\renewcommand\"[1]{#1#1}
\renewcommand\c{c}

\L\"{o}\"{o}\c{k} \zzz

\end{document}
\relax

You can see that only the replacements that should have happened did happen. The Lua file that implements this (called rewrite.lua above) is:

print('')
rewritten_file = io.open(tex.jobname .. '.rewritten.tex', 'w')

funny_noarg = {
   ["\\L"] = "Ł",
   -- Define similarly for \oe \OE \ae \AE \aa \AA \o \O \l \i \j
}
funny_nonletter = {
   ['\\"'] = function(c) return c .. "̈" end,
   -- Define similarly for \` \' \^ \~ \= \.
}
funny_letter = {
   ["\\c"] = function(c) return c .. "̧" end,
   -- Define similarly for \u \v \H \c \d \b \t
}

orig_defs = {}
function populate_orig_defs()
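   local function set_def(s)
      -- s:sub(2) drops the leading backslash: token.get_macro expects the bare name.
      local definition = token.get_macro(s:sub(2))
      orig_defs[s] = definition
      -- tostring guards against a nil result (e.g. when s is not a macro).
      print('The original definition of #' .. s .. '# is ' .. tostring(definition))
   end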
   for s, v in pairs(funny_noarg) do set_def(s) end
   for s, v in pairs(funny_letter) do set_def(s) end
   for s, v in pairs(funny_nonletter) do set_def(s) end
end
populate_orig_defs()

function literalize(s)
   -- The string s, with special characters escaped, in a format safe for using inside gsub.
   -- https://stackoverflow.com/questions/1745448/lua-plain-string-gsub#comment18401212_1746473
   -- The outer parentheses drop gsub's second return value (the match count).
   return (s:gsub("[%(%)%.%%%+%-%*%?%[%]%^%$]", "%%%0"))
end
function replace(s)
   print('Processing line: ' .. s)
   if s:find([[\def]]) ~= nil or s:find([[\newcommand]]) ~= nil or s:find([[\renewcommand]]) ~= nil then
      print(' ^ This line contains a \\def or \\newcommand or \\renewcommand. Not rewriting.')
      -- Copy the line through unchanged; returning nil leaves TeX's input as it was.
      rewritten_file:write(s .. '\n')
      return nil
   end
   for k, v in pairs(funny_noarg) do
      -- followed by a nonletter. TODO: Can use the catcode tables.
      if token.get_macro(k:sub(2)) == orig_defs[k] then
         s = s:gsub(literalize(k) .. '([^a-zA-Z])', function(capture) return v .. capture end)
      end
   end
   for k, v in pairs(funny_letter) do
      -- followed by a letter inside {}. TODO: Can use the catcode tables, also can support \c c, for example.
      if token.get_macro(k:sub(2)) == orig_defs[k] then
         s = s:gsub(literalize(k) .. '{(.)}', v)
      end
   end
   for k, v in pairs(funny_nonletter) do
      -- followed by a letter inside {}. TODO: We could also support \"o for example.
      if token.get_macro(k:sub(2)) == orig_defs[k] then
         s = s:gsub(literalize(k) .. '{(.)}', v)
      end
   end
   print(' --> Rewrote line to ' .. s)
   rewritten_file:write(s .. '\n')
   return nil
end

luatexbase.add_to_callback('process_input_buffer', replace, 'Replace some macros with UTF-8 equivalents')
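
The three tables in rewrite.lua list only a few macros. A fuller set might look like the sketch below, written as a possible drop-in for those tables. The helper names utf8_2byte and combining are introduced here for illustration, the code points are the usual Unicode combining marks for the standard LaTeX accent commands, and the whole thing has not been checked against every macro inside LuaTeX. \t is left out because it ties two letters together, which the single-character pattern {(.)} does not match.

```lua
-- Encode a code point below U+0800 as a two-byte UTF-8 string.
local function utf8_2byte(cp)
   return string.char(0xC0 + math.floor(cp / 64), 0x80 + cp % 64)
end

-- Build a function that appends the combining mark `cp` to its argument.
local function combining(cp)
   local mark = utf8_2byte(cp)
   return function(c) return c .. mark end
end

-- Macros that take no argument and stand for a single character.
funny_noarg = {
   ["\\L"]  = "Ł", ["\\l"]  = "ł",
   ["\\O"]  = "Ø", ["\\o"]  = "ø",
   ["\\OE"] = "Œ", ["\\oe"] = "œ",
   ["\\AE"] = "Æ", ["\\ae"] = "æ",
   ["\\AA"] = "Å", ["\\aa"] = "å",
   ["\\i"]  = "ı", ["\\j"]  = "ȷ",
}

-- Accent macros whose name is a non-letter (control symbols).
funny_nonletter = {
   ["\\`"] = combining(0x0300),  -- grave
   ["\\'"] = combining(0x0301),  -- acute
   ["\\^"] = combining(0x0302),  -- circumflex
   ["\\~"] = combining(0x0303),  -- tilde
   ["\\="] = combining(0x0304),  -- macron
   ["\\."] = combining(0x0307),  -- dot above
   ['\\"'] = combining(0x0308),  -- diaeresis
}

-- Accent macros whose name is a letter (control words).
funny_letter = {
   ["\\u"] = combining(0x0306),  -- breve
   ["\\v"] = combining(0x030C),  -- caron
   ["\\H"] = combining(0x030B),  -- double acute
   ["\\c"] = combining(0x0327),  -- cedilla
   ["\\d"] = combining(0x0323),  -- dot below
   ["\\b"] = combining(0x0331),  -- macron below
}
```

As in the original tables, accented letters come out as a base letter followed by a combining mark rather than as a precomposed character.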

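Since this is only a proof of concept, not a production-quality system, I took some shortcuts, which you can fill in if you are interested in adopting this approach:

Unicode equivalents are listed for only a few of TeX's accent and special-character macros.
You need to re-insert the \documentclass{article} line (in fact, anything that comes before the \directlua{dofile(…)} line). (For fun, you can try moving the \directlua line to before \documentclass and see what happens.)
You would probably want to have this line after all the \usepackage lines (perhaps right at \begin{document}). (If you tried the above, you will know why.)
You need to remove the stray \relax lines, such as the one at the end (perhaps we could keep them from appearing...).
It assumes the input file uses the LaTeX convention \={o} rather than \=o; with a few more lines we could support the latter too. The same goes for \c{k} when it is written as \c k or \c {k}, and so on.
It completely ignores (replaces nothing in) lines containing \def or \newcommand; instead, if we wanted to (if the input file is written that badly!), we could skip to the end of the \def or whatever it is, and then process the rest of the line.
It assumes (in order to know when a control sequence such as \o ends) that "letters" are a-zA-Z; you may want to add @ to this list, and in fact we could use the exact definition of "letter" under the catcode regime active at that point; LuaTeX provides that too.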
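
Two of these shortcuts (braceless arguments, and deciding where a control word ends) can be sketched in a few lines of plain Lua. This is only one possible approach, not part of rewrite.lua: the function names replace_control_word and replace_accent are hypothetical, "letters" are hard-coded as a-zA-Z plus @ instead of being read from the catcode tables, and only ASCII base letters are handled.

```lua
-- Characters that may continue a control word; '@' added as suggested above.
local LETTERS = "%a@"

-- Escape punctuation so a macro name can be used inside a Lua pattern.
local function escape(macro)
   return (macro:gsub("%p", "%%%0"))
end

-- Replace a control word such as \L only where no letter follows it
-- (this also matches at end of line). Spaces after the control word are
-- dropped, because TeX skips them too.
function replace_control_word(line, macro, char)
   local pattern = escape(macro) .. "%f[^" .. LETTERS .. "] *"
   return (line:gsub(pattern, char))
end

-- Apply an accent macro to its argument in any of the forms
-- \"{o}, \"o, \" o, \c{k}, \c k, \c {k}.
function replace_accent(line, macro, accent)
   local m = escape(macro)
   if macro:find("%a$") then
      -- Control word: it must not run on into further letters (\ck is not \c).
      m = m .. "%f[^" .. LETTERS .. "]"
   end
   line = line:gsub(m .. " *{(%a)}", accent)  -- braced argument
   line = line:gsub(m .. " *(%a)", accent)    -- bare single-letter argument
   return line
end

-- Example: a diaeresis is U+0308, i.e. the bytes 204 136 in UTF-8.
local umlaut = function(c) return c .. "\204\136" end
print(replace_accent([[\"{o}\"o\" o]], [[\"]], umlaut))
print(replace_control_word([[\L odz and \Large]], [[\L]], "Ł"))
```

The frontier pattern %f[^%a@] is what keeps \Large and \L@foo from being treated as \L. One case this sketch does not handle is a control word at the very end of a line: TeX also swallows the line break there, whereas the rewritten file would gain a space.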

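Note that even if you normally compile your document with pdfTeX or XeTeX, you can use LuaTeX just for this conversion, and then go back to using pdfTeX/XeTeX on the converted file.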

Answer 1